Provenance & Traceability
1. Scope and objectives
Some use cases require additional metadata alongside the shared data for auditing and compliance purposes. Depending on the specific scenario, it may be necessary to track transactions occurring within the data space or identify who has accessed certain data.
The need for observability, traceability, and provenance tracking is particularly common in highly regulated industries or when managing high-value data.
It is essential to differentiate between two phases: the control phase, which involves transactions related to data-sharing contracts, and the actual data-sharing phase, where data is exchanged. Observability refers to the ability to monitor and manage data-sharing contracts, while data provenance tracking focuses on monitoring the sharing and usage of the actual data.
Both aspects fall within the scope of this framework and may be subject to regulatory or contractual compliance. Regardless, ensuring observability and provenance tracking is the responsibility of each participant and requires the implementation of robust data governance processes by all Data Space participants.
The use of data in artificial intelligence is a prime example where such mechanisms may be legally required. For instance, Article 12 of the AI Act requires high-risk AI systems to support automatic record-keeping (logging) so that their operation can be traced. Such regulations emphasize the necessity of maintaining detailed records to ensure compliance and accountability in the handling and use of data.
This building block offers guidance for supporting observability, provenance, traceability, logging, audits, and related processes in a standardized manner. Additionally, it addresses the collection, storage, and processing of these types of data.
2. Capabilities
Data spaces must reuse existing provenance and traceability standards and guidelines instead of reinventing the wheel. Because the requirements for provenance and traceability in a data space can arise from various sources, the relevant capabilities are needed to address them. Depending on the requirements, these capabilities could be:
Understanding of laws and/or regulations to determine the proper requirements for provenance and traceability
Understanding of data models and vocabulary relevant in the domain and selection and/or development of the relevant standard to be used
Designing appropriate solutions for storage and/or processing of the provenance and/or traceability data
Designing appropriate authorisations and usage policies for the provenance and/or traceability of data
Designing appropriate criteria and safeguards that ensure the validity of the data
Implementing Data Governance Processes by the Data Space Participants as part of the Participation management.
Provisioning of processes and rules for observability and data provenance tracking by the Data Space Governance Authority, if appropriate.
In this sense, Observability, Data Provenance, and Data Traceability are not limited to the technical implementations covered in this building block. To achieve compliance with regulation and contracts, they also require processes and rules as part of the Data Governance of each Data Space participant, as well as Governance Rules for Participation Management set by the Data Space Governance Authority.
Observability, Provenance and Traceability are important measures for improving trust amongst the participants in a Data Space. At the same time, an independent 3rd party may be involved in such transactions as an observer. Such parties need to be trusted. Observability data might be as sensitive as the data shared under the data sharing contract, as it might divulge important information about business processes and connections between participants of a data space to third parties. Analyzing the observability data of an entire data space might yield detailed information on the business activities in the data space and thus reveal sensitive information. It is therefore of utmost importance to establish trust not just between the two parties sharing data but also with the 3rd party receiving observability data.
Therefore, it is important that such 3rd party services are not provided centrally by a single entity but are instead a decentralised function defined within the data space, together with its governance.
The purpose of this building block is to provide an approach for observability, provenance, and traceability.
3. Specifications
The following subsections explain in further detail what to keep in mind in relation to these capabilities. They can also be used as guidance to define the capabilities needed in a data space.
From a requirements perspective, the following must be kept in mind:
Understanding the types of observability, provenance, and traceability helps in determining the relevant requirements from the perspective of regulations, laws, and contracts.
Observability covers the interactions between participants regarding data sharing contracts, i.e. Control Plane activities as specified in the Dataspace Protocol. Provenance and Traceability cover the interactions of Data Planes regarding the actual sharing of data.
To select an appropriate provenance and traceability data model (such as W3C PROV), a specific understanding of the following is necessary:
Domain: What industry or domain are you working in (e.g., healthcare, AI, supply chain, finance)?
Purpose: Why do you need provenance and traceability? Common goals include:
Regulatory compliance (e.g., GDPR, AI Act, FDA regulations).
Accountability and auditability.
Data integration and interoperability.
Enhancing transparency and trust.
Granularity: How detailed does the provenance information need to be? For example:
High-level (e.g., who accessed the data).
Fine-grained (e.g., every transformation or step in a data pipeline).
From an implementation perspective, the following must be kept in mind:
Understanding of architectural patterns, and of the choice and availability of 3rd party services for observability and/or authorisations
Data storage, processing, and security models
Trust aspects regarding 3rd parties observing interactions or keeping log files of interactions, including technical measures, organizational measures, and regulatory or contractual obligations.
3.1 Types of Provenance & Traceability
The backward-looking direction of a data value chain is referred to as provenance tracking or data lineage, i.e. a Data Consumer can receive evidence on the origin of the data and the treatment of the data during its processing in the value chain. This can include aspects such as:
Ownership: Proof about ownership over a resource, for example: land ownership record from cadaster.
Custody: Proof about possessing a resource, for example: rental agreement of a tenant.
Location: Proof that a resource originates from, or is present in, a particular location, for example: cheese made in the Gouda region can be named Gouda cheese.
Data Rights: The general proof that the possession of data, its processing, and sharing respects the rights of all associated data rights holders.
The forward-looking direction of a data value chain is referred to as traceability, i.e., a data provider can receive evidence of what was done with the data. Traceability information then can be broadly categorized into:
Provenance Traceability: Any traceability data used to determine the provenance as explained above.
Non-Provenance Traceability: All other traceability data, useful for purposes other than determining provenance.
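The two directions described above can be pictured as walks over the same set of derivation records. The following is a minimal sketch, assuming a hypothetical list of `(derived_item, source_item)` pairs; the item names and the helper functions are illustrative and not part of any data space specification.

```python
from collections import defaultdict

# Hypothetical derivation records: (derived_item, source_item).
DERIVATIONS = [
    ("cleaned-readings", "raw-readings"),
    ("monthly-report", "cleaned-readings"),
    ("forecast-model", "cleaned-readings"),
]


def build_indexes(derivations):
    """Index the derivation edges in both directions."""
    sources = defaultdict(set)  # item -> items it was derived from
    derived = defaultdict(set)  # item -> items derived from it
    for item, source in derivations:
        sources[item].add(source)
        derived[source].add(item)
    return sources, derived


def walk(start, edges):
    """Return every item reachable from start by following edges."""
    seen, stack = set(), [start]
    while stack:
        current = stack.pop()
        for nxt in edges.get(current, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen


sources, derived = build_indexes(DERIVATIONS)

# Provenance (backward-looking): where did the report come from?
provenance_of_report = walk("monthly-report", sources)

# Traceability (forward-looking): what was done with the raw readings?
uses_of_raw = walk("raw-readings", derived)

print(sorted(provenance_of_report))
print(sorted(uses_of_raw))
```

In this sketch a Data Consumer asking about `monthly-report` would follow the `sources` index, while a Data Provider asking about `raw-readings` would follow the `derived` index; both views rest on the same underlying records.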
Both aspects mentioned above can be implemented utilizing:
centralized storage for provenance and traceability data
decentralized storage for provenance and traceability data
centralized storage for provenance and traceability data using 3rd party service
decentralized storage for provenance and traceability data using 3rd party service
The choice depends on the use case, the applicable regulatory or contractual obligations, and general Data Governance aspects.
3.2 Types of Observability
Given that a dataspace provides the mechanisms to negotiate trust for data sharing and to agree on data sharing contracts, while the actual data flow then happens on private channels, it is important to define which activities can be observed within a dataspace and which are outside the scope of dataspace observability but might still be important for end-to-end trusted data sharing. Such contract negotiation mechanisms are defined as Control Plane activities in the Dataspace Protocol, which data spaces may decide to use.
Also, it is important to distinguish between the observability of dataspace activities and regular IT Operations telemetry. While both are important in an end-to-end solution, only the observability of dataspaces can be defined generally through this architecture model. The observability of service telemetry is highly dependent on the specific implementation of the individual component. However, the principle of sharing observed telemetry through a data sharing contract in a dataspace applies to this data as well.
Observation of Dataspace Protocol activities & states: In this category, any states and state transitions of the Control Plane activities (Dataspace Protocol) can be observed. For this purpose, the connector/participant agent needs to keep log entries of state transition requests and successes and failures of those state transitions for any state machine of the Dataspace Protocol. This includes the following three state machines:
Publication and discovery of data, i.e. catalog functionalities
Data Sharing Contract negotiation
Transfer Process Management
Observation of service telemetry: End-to-end implementations will also require additional telemetry like service uptimes, performance data, and other measurements of the solution. However, those are implementation-specific and are not detailed in the Blueprint. They need to be agreed upon as an operational aspect of a specific Data Space.
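The logging of state transition requests, successes, and failures described above could be sketched as follows. This is a minimal illustration only: the transition table is a simplified assumption for contract negotiation, the class and field names are hypothetical, and the normative state machines are those defined in the Dataspace Protocol.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Illustrative, simplified transition table (an assumption of this sketch);
# the normative state machines are defined in the Dataspace Protocol.
ALLOWED = {
    "REQUESTED": {"OFFERED", "AGREED", "TERMINATED"},
    "OFFERED": {"REQUESTED", "ACCEPTED", "TERMINATED"},
    "ACCEPTED": {"AGREED", "TERMINATED"},
    "AGREED": {"VERIFIED", "TERMINATED"},
    "VERIFIED": {"FINALIZED", "TERMINATED"},
}


@dataclass
class TransitionLog:
    """Keeps one log entry per requested state transition."""

    process_id: str
    state: str = "REQUESTED"
    entries: list = field(default_factory=list)

    def request(self, target: str) -> bool:
        """Record the request and its outcome; move only if allowed."""
        ok = target in ALLOWED.get(self.state, set())
        self.entries.append({
            "process_id": self.process_id,
            "from": self.state,
            "to": target,
            "outcome": "success" if ok else "failure",
            "timestamp": datetime.now(timezone.utc).isoformat(),
        })
        if ok:
            self.state = target
        return ok


log = TransitionLog(process_id="negotiation-1")
log.request("AGREED")     # allowed in this sketch
log.request("FINALIZED")  # rejected: the agreement has not been verified yet
log.request("VERIFIED")   # allowed

for entry in log.entries:
    print(entry["from"], "->", entry["to"], entry["outcome"])
```

Failed requests are logged as well as successful ones, since the text above asks for both; which of these entries are later shared with an observer would be governed by the data sharing contract.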
3.3 Data Models for Provenance & Traceability
Provenance & traceability is linked to the notion of a ‘data product’. For each data product, there can be certain requirements for provenance & traceability, either in the form of metadata (who is providing the data, how is data quality assessed, etc.) or in the form of logs for the purposes of auditing, traceability, billing, etc.
This is expressed in the conceptual model below, in Figure 1.
Consequently, provenance & traceability creates new data, which requires a data model so that it is interpretable by others. The advantage of this approach is that all the rules and applications of any data model also apply to the data models for Provenance and Traceability. This includes better interoperability, access and usage conditions and rules, productisation, etc. It also means that all the data space technologies and building blocks could potentially be applied to the provenance and traceability data as well.
These data models could consider the following two aspects:
Generic aspects: which apply in many different scenarios. Common ontologies such as W3C PROV-O and PAV - Provenance, Authoring and Versioning, could be useful in such scenarios.
Data spaces specific aspects: which apply in specific scenarios stemming from data space specific requirements. These could be modelled as an extension to either PROV-O or PAV. Some data spaces choose to model ‘events’ which take place in the data space. An example of an approach for this is CloudEvents.
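As a rough illustration of the generic aspect, the provenance of a data product could be written down with PROV-O terms in JSON-LD. The sketch below builds such a document as a plain Python dictionary. Only the `prov:` terms come from the W3C vocabulary; the `urn:example:` identifiers and the function name are hypothetical, and a data space would add its own extension terms.

```python
import json

PROV_NAMESPACE = "http://www.w3.org/ns/prov#"


def describe_derivation(dataset_iri, source_iri, activity_iri, agent_iri):
    """Return a JSON-LD graph linking a shared dataset to its origin."""
    return {
        "@context": {"prov": PROV_NAMESPACE},
        "@graph": [
            {"@id": agent_iri, "@type": "prov:Agent"},
            {"@id": source_iri, "@type": "prov:Entity"},
            {
                "@id": activity_iri,
                "@type": "prov:Activity",
                "prov:used": {"@id": source_iri},
                "prov:wasAssociatedWith": {"@id": agent_iri},
            },
            {
                "@id": dataset_iri,
                "@type": "prov:Entity",
                "prov:wasDerivedFrom": {"@id": source_iri},
                "prov:wasGeneratedBy": {"@id": activity_iri},
                "prov:wasAttributedTo": {"@id": agent_iri},
            },
        ],
    }


# Hypothetical identifiers, used for illustration only.
doc = describe_derivation(
    dataset_iri="urn:example:dataset:cleaned-readings",
    source_iri="urn:example:dataset:raw-readings",
    activity_iri="urn:example:activity:cleaning-run-7",
    agent_iri="urn:example:participant:provider-a",
)

print(json.dumps(doc, indent=2))
```

Because the result is ordinary data, it can be offered, protected, and exchanged like any other data product in the data space.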
This should be documented in the rulebook of the data space. The rulebook should also contain the subsequent legal and functional requirements. For example: the requirement for provenance & traceability could stem from contractual or legal requirements.
Depending on the use case and relevant legal/contractual obligations, it might be necessary to audit data sharing within the data space. In this case, it might be required to record specific data for the purposes of provenance & traceability. Such services are called ‘observability services’.
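An audit record of this kind could, for instance, be wrapped in a CloudEvents envelope, one of the approaches named above. The sketch below is only an illustration: it builds the envelope as a plain dictionary rather than through an SDK, relies on the four attributes that CloudEvents 1.0 requires (`specversion`, `id`, `source`, `type`), and invents the event type, the URNs, and the payload fields, none of which are defined by this blueprint.

```python
import json
import uuid
from datetime import datetime, timezone

# Attributes that the CloudEvents 1.0 specification marks as required.
REQUIRED_ATTRIBUTES = ("specversion", "id", "source", "type")


def make_transfer_event(provider, consumer, agreement_id, dataset_id):
    """Wrap one completed data transfer in a CloudEvents-style envelope."""
    return {
        "specversion": "1.0",
        "id": str(uuid.uuid4()),
        "source": provider,  # the participant emitting the event
        "type": "org.example.dataspace.transfer.completed",  # hypothetical
        "subject": dataset_id,
        "time": datetime.now(timezone.utc).isoformat(),
        "datacontenttype": "application/json",
        "data": {  # hypothetical payload fields
            "consumer": consumer,
            "agreement_id": agreement_id,
        },
    }


def missing_attributes(event):
    """List the required CloudEvents attributes absent from the event."""
    return [name for name in REQUIRED_ATTRIBUTES if name not in event]


event = make_transfer_event(
    provider="urn:example:participant:provider-a",
    consumer="urn:example:participant:consumer-b",
    agreement_id="urn:example:agreement:42",
    dataset_id="urn:example:dataset:cleaned-readings",
)

print(json.dumps(event, indent=2))
print("missing:", missing_attributes(event))
```

Which fields such a record must carry, and who may read it, would follow from the legal and contractual requirements documented in the rulebook.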
There are currently no generic specifications for the Observability service. Are you providing services that offer this functionality? Please get in touch with us and help us identify common specifications for future versions of this blueprint.
4. Interlinkages
This building block has a connection with the following building blocks/pillars
Data Models: Provenance and traceability data must be understandable by other participants, and therefore, the use of common data models is necessary.
Data Exchange: provenance and traceability data is a dataset, and sharing it must follow the common data exchange protocols from value, security and privacy perspectives.
Identity and Attestation Management & Access and Usage policies enforcement: provenance and traceability data is also data, and sharing it with the right and intended participants is a must to protect data and value.
Data Value Creation Enablers: Apart from debugging and auditing purposes, provenance and traceability data usually creates added value. For example, a food product claiming to be organic and biologically grown can be substantiated by provenance and traceability data about the farming, production/processing, logistics data. More about the value creation can be learned from the Data Value Creation Services.
Business: Based on the value creation, or on agreements between organisations that apply to many or all participants, provenance and traceability requirements would need to be captured in business agreements, and governance would be necessary.
Legal: Provenance and traceability requirements could also come from regulations. Additionally, depending on the business requirements, provenance and traceability data may be required to be created and maintained in a legally sound way. Therefore, legal requirements from regulations and contractual perspectives must be considered under this building block.
Governance: Like every other building block, this one will also have elements requiring governance and maintenance, so they must also be defined in the governance building block. This specifically applies to provenance and traceability requirements across the data space.
5. Co-creation questions
Which requirements exist in your dataspace for registering provenance & traceability data?
For instance resulting from legal or contractual requirements, for the purposes of auditing or for the purposes of billing.
What data model will your data space use for specifying provenance & traceability data?
This data model can use common ontologies for provenance & traceability, but likely requires extension for domain specific purposes.
How is provenance & traceability data stored in your data space?
This can be done either locally by the Data provider, Data consumer or both and/or more centralised at a provider of an Observability service. You can decide to contract this service as a Data Space Governance Authority or you can certify a service which can be contracted by participants of your data space.
The outcomes of these questions should be recorded in the Rulebook of your data space.
6. Implementation
Observability, provenance, and traceability are subject to data sharing contracts. As explained above in the specifications section, data spaces and participants must evaluate the requirements and make the choices that best fit them. All those requirements and choices should be documented in the data space rulebook, including the choice of technical implementation or architecture.
The technical design depends on the business process and feasibility of implementation on the one hand and on requirements on the other. Overall, it must also fit within the governance rules of the data space. For instance, it may be required that a central service receives all observations, or that a decentralized service receives them to jointly ensure observability, potentially segmenting observation services by domain or jurisdiction. In such cases, the service is usually under the governance of the Data Space Governance Authority (DSGA).
However, it is also possible to have an open market of observation services providing value creation services to other participants of the dataspace, such as notary services, data accounting/payment processing, dispute resolution, proof of execution, etc., thereby leaving it up to participants to determine what works best for them while still meeting the requirements.
To illustrate technical implementations, the following provides an overview of potential implementation patterns. In the first approach, the data provider, the data consumer, or both store the relevant provenance & traceability data, as shown in Figure 2.
Technically, this can be implemented through the ‘Transfer process’ function in the Control plane of a Participant Agent. It needs to be ensured that when transactions (or, more generally, data sharing) take place at the Data plane level, provenance & traceability data is stored for future auditing purposes.
Either the Data provider or Data consumer should be able to show such records at any point in the future. Additional guarantees are possible when an auditor can compare both records.
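The comparison of both records by an auditor can be sketched as follows. This is a minimal illustration under stated assumptions: both parties keep one record per transfer, records share a hypothetical `transfer_id` field, and records are compared through a SHA-256 hash of a canonical JSON form. None of the field or function names are prescribed by this blueprint.

```python
import hashlib
import json


def fingerprint(record):
    """Hash a record in a canonical JSON form so both sides can compare."""
    canonical = json.dumps(record, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def compare_logs(provider_log, consumer_log, key="transfer_id"):
    """Classify transfers as matching, mismatching, or one-sided."""
    provider = {r[key]: fingerprint(r) for r in provider_log}
    consumer = {r[key]: fingerprint(r) for r in consumer_log}
    shared = provider.keys() & consumer.keys()
    return {
        "matching": sorted(k for k in shared if provider[k] == consumer[k]),
        "mismatching": sorted(k for k in shared if provider[k] != consumer[k]),
        "provider_only": sorted(provider.keys() - consumer.keys()),
        "consumer_only": sorted(consumer.keys() - provider.keys()),
    }


# Hypothetical sample records kept independently by each party.
provider_log = [
    {"transfer_id": "t-1", "agreement_id": "a-1", "bytes": 1024},
    {"transfer_id": "t-2", "agreement_id": "a-1", "bytes": 2048},
    {"transfer_id": "t-3", "agreement_id": "a-2", "bytes": 512},
]
consumer_log = [
    {"transfer_id": "t-1", "agreement_id": "a-1", "bytes": 1024},
    {"transfer_id": "t-2", "agreement_id": "a-1", "bytes": 4096},
]

report = compare_logs(provider_log, consumer_log)
print(report)
```

With the sample logs, `t-1` matches, `t-2` differs in the recorded byte count, and `t-3` appears only on the provider side. Comparing hashes rather than full records also means the parties could, in principle, disclose only fingerprints to the auditor until a discrepancy needs to be investigated.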
In some cases, it might be desirable to use a third party to store this data, optionally combining this with local storage by either partner. Such a third party is known as an observability service provider, and its requirements are defined in Intermediaries and Operators. As is the case for all Federation services, it can be contracted by any participant (Data provider, Data consumer, or Data Space Governance Authority). This is shown in Figure 3.
The 3rd party service is called an Observability service when it is contracted by the Data Space Governance Authority.
7. Future topics
Based on community inputs the aim is to extend this building block with more common specifications.
8. Further reading
There are several potential standards for provenance and traceability.
Provenance
W3C: PROV-O / https://www.w3.org/TR/prov-o/. Set of classes, properties, and restrictions to represent and interchange provenance information generated in different systems and under different contexts. Additionally, DCAT vocabulary specification examples include the use of PROV-O https://www.w3.org/TR/vocab-dcat-3/#examples-dataset-provenance
PAV ontology (PAV - Provenance, Authoring and Versioning) / https://pav-ontology.github.io/pav/. Specialises PROV-O to describe authorship, curation and digital creation of online resources
Traceability
CloudEvents / www.cloudevents.io / https://github.com/cloudevents/spec/blob/main/cloudevents/spec.md. CloudEvents is a specification for describing event data in common formats to provide interoperability across services, platforms and systems.
Additional standards which could be appropriate in some cases and could be considered where relevant:
Open Provenance Model https://openprovenance.org/opm/, a predecessor of the PROV model, also covers specific aspects of provenance and may still offer useful resources.
For data sharing consents, Kantara Consent Receipts (https://kantarainitiative.org/download/7902/). Requirements for the creation of a consent record and the provision of a human-readable receipt.
ISO/IEC TS 27560:2023 (https://www.iso.org/standard/80392.html). Consent record infrastructure
PIDs (Persistent Identifiers) for data are reasonably universal in the research world. PIDs are minted, which implies a Trust Anchor. However, PIDs for services are more challenging and need service instances that capture what code was executed, what machine, etc.
Further resources:
Approach to Data Spaces from GDPR Perspective: https://www.aepd.es/documento/approach-to-data-spaces-from-gdpr-perspective.pdf
CamFlow tool https://camflow.org/ CamFlow is a Linux Security Module (LSM) designed to capture data provenance for the purpose of system audit.
How-to-provenance repository https://github.com/provenance-io/how-to-provenance where you can find examples of Provenance Blockchain usage, Provenance Blockchain smart contract development, Provenance Blockchain application development and related topics.
9. Glossary
| Term | Definition |
|---|---|
| Provenance | The place of origin or earliest known history of something. Usually it is the backwards-looking direction of a data value chain, which is also referred to as provenance tracking. |
| Traceability | The quality of having an origin or course of development that may be found or followed. |
| Provenance Traceability | Any traceability data used to determine the provenance. |
| Non-Provenance Traceability | All other traceability data, useful for purposes other than determining provenance. |
| Observability Services | A service that stores the audit data of data sharing transactions within the data space and provides services related to observability. |
| Data Space Governance Authority | A governance authority refers to bodies of a data space that are composed of and by data space participants responsible for developing and maintaining as well as operating and enforcing the internal rules. |