As machine learning and AI systems become more prevalent, understanding how their decisions are made is key to maintaining trust in them. It is widely accepted that fundamental support for this problem can be provided by knowledge of how data are altered in the pre-processing phase, using data provenance to track such changes. This paper focuses on the design and development of a system for collecting, managing, and querying the data provenance of data preparation pipelines in data science. We investigate publicly available machine learning pipelines to identify the features the tool must support in order to cover a broad selection of pre-processing data manipulations. Building on this study, we present an approach for transparently collecting data provenance that uses an LLM to (i) automatically rewrite user-defined pipelines into a format suitable for this activity and (ii) store an accurate description of all the activities involved in the input pipelines, supporting the explanation of each of them. We then illustrate and test implementation choices aimed at supporting provenance capture for data preparation pipelines efficiently and transparently for data scientists.
Introduction
The question of how to make AI explainable has been receiving much attention in the past few years, especially in areas of science where AI-generated results are required to stand the scrutiny of human experts [1]. Most explainability features, such as Shapley values and their variants [2], have been developed to explain a model’s behaviour. It is becoming clear that, important as they are, these are insufficient as they fail to account for decisions made upstream in the data-to-insights pipeline, namely dataset selection, data integration, and feature engineering. Having the ability to explain how data is manipulated before model training and operations is increasingly recognized as the missing component in achieving satisfactory explainability, ultimately leading to trust in the resulting models [3, 4].
Such manipulation typically includes standard algorithms for data cleaning, record linkage, outlier detection, normalization, missing values imputation, statistical upsampling, downsampling, and other data augmentation methods. However, in a growing number of cases, some of these algorithms are being redesigned to interact with model training, for example, for incremental data cleaning [5]. This trend is embodied by what is now called Data-Centric AI, advocating incremental adjustment to data in response to limitations in the models, in an iterative fashion.
This growing landscape of data manipulation algorithms can be consequential, for instance, leading to bias correction in the data, but also, potentially, inadvertently introducing new bias when used inappropriately.
Recognizing the importance of capturing data manipulations in detail, in this paper, we advocate for the role of data provenance towards this goal. In this setting, the term data provenance, originally proposed with reference to database queries [6], refers to records of data transformations, from raw data sources to the point where the data are used to inform (train) Machine Learning and especially AI models. A formal and more general definition of provenance that is not restricted to data is given by the World Wide Web Consortium (W3C) as “a record of the entities and processes that are involved in the production and delivery of or otherwise influence a particular resource” [7].
Adding provenance to the data, we argue, contributes to the trust in AI-based models by improving their transparency, enabling explainability of the end-to-end process, but also facilitating its reproducibility. Specifically, provenance records associated with a dataset at some intermediate processing step typically include items such as references to its original source, references to the addition or removal of records and features relative to the original, and references to linkage or join operations. At a more granular level, in some cases, it may also be possible to track the derivation of individual items within the dataset through specific operations, for instance, normalization across a whole column. This traceability enables developers to understand how data influences model behavior and predictions. For example, if a fraud detection model flags a transaction, provenance tracking allows analysts to trace back which data sources and preprocessing steps contributed to the decision. Similarly, in healthcare AI, knowing the lineage of training data, such as the applied pre-processing filters, can uncover potential biases affecting diagnostics. In addition, by maintaining data provenance, organizations can reproduce results, audit processes, and facilitate debugging by highlighting which data updates or pipeline changes impacted model performance.
This paper extends [8] and our previous work on Data Provenance for Data Science [9]. The main contribution is a system, implemented at the prototype level, for automatically tracking granular provenance through Python scripts where data is in the form of dataframes, in a way that is maximally transparent and minimally intrusive to the programmer. We begin by presenting an analysis of typical usage of data operators in the wild, looking at pipelines available from ML Bazaar and Kaggle, and then frame the problem in terms of a classification of data processing operations, before illustrating the system in action on six exemplar benchmark pipelines (reproduced in Appendix A).
A survey of data processing pipelines
To establish an understanding of the importance of pre-processing operations, a survey of publicly available machine learning pipelines is conducted, with each operation present in the pre-processing stage recorded. Operations are only recorded if they act upon the master dataset before the start of model training, excluding operations conducted during the exploration of the data and alterations made to the results such as reverting scaled values to their original size. Pipelines are sourced from the ML Bazaar [10], a collection of machine learning solutions to different data type and task combinations, and the popular data science community Kaggle.1
The ML Bazaar pipelines are composed of a selection of pre-made components known as primitives, improving the pipeline’s ease of understanding but limiting the range of operations present. In addition, as the pipelines are all sourced from the same location, there is a potential for bias. However, an analysis of the results is undertaken to provide a basis for understanding general trends, although the results are given a significantly reduced level of importance compared to those obtained from Kaggle. We have considered only pipelines operating on tabular data, since they are the most used in practice, leaving 311 of the 386 pipelines present in the database to be considered. Some details of such pipelines are reported in Table 1.
Table 1. Pre-processing operations in ML Bazaar pipelines
ML Task | Pipelines | DFS | Encoder | Imputer | Scaler | DT Feat |
|---|---|---|---|---|---|---|
Forecasting | 35 | 6 | 34 | 35 | 35 | 0 |
Regression | 77 | 34 | 55 | 77 | 77 | 0 |
Classification | 199 | 86 | 199 | 199 | 199 | 4 |
Total | 311 | 126 | 288 | 311 | 311 | 4 |
Only five types of operation are identified from the ML Bazaar pipelines, with some form of imputing and scaling present in all 311 pipelines, and encoding in over 90%. Of the other two operations, a DateTime Featuriser (DT Feat) is extremely rare and is therefore discounted, though this is partially due to the contents of the datasets, whereas Deep Feature Synthesis (DFS), an automated tool used to perform feature engineering, appears in roughly 40% of pipelines.
While the near homogeneity present in the ML Bazaar limits the potential conclusions and demonstrates the bias of a single source, it does highlight four main operation types commonly used in pre-processing: scaling, encoding, imputation, and feature engineering. This evidence is compared to the Kaggle pipeline survey to identify similarities.
As Kaggle is a public platform that hosts pipelines, referred to as Kaggle notebooks, created for a variety of reasons, including competitions and tutorials, it is important that the correct information is investigated. As we focus on data preparation for machine learning, the survey is conducted on the top 200 most upvoted Python notebooks related to machine learning on the site. Of these 200 pipelines, a small percentage are deemed irrelevant to the survey, as they either do not perform pre-processing operations or they relate to non-standard data types. The principal operations involved in such pipelines and their frequency are reported in Fig. 1. These operations are, in order: Vertical Transformation, Scaler, Column Rename, Join, Instance Drop, Imputer, Feature Selection, Feature Augmentation, Encoder, Data Type transformation.
Fig. 1. Frequency of 10 most common operation types in Kaggle survey
Across the survey, 29 unique pre-processing operations are detected, of which 12 appear in less than 10 pipelines, including transposing and changing index values. Among the 10 most common operation types, imputing and encoding again lead the way, appearing in 50% and 49% of the 200 pipelines, respectively. The remaining two operations highlighted in the ML Bazaar survey also feature heavily, with 58 instances of feature augmentation and 38 scaling operations.
Due to the high frequency of these four operations in both the Kaggle and ML Bazaar surveys, they can be considered highly important for the wider machine learning community, and therefore it is vital that a tool supporting the analysis of pre-processing operations can correctly track and map their provenance. This also applies to additional operations found in the Kaggle survey, with value transformations and feature selection among the most common findings.
The information obtained through these surveys is used in conjunction with existing knowledge of pre-processing practices to assign a level of importance to each operation [11]. The pre-existing knowledge is introduced to account for the under-representation of some key operations due to the contents of the survey. For example, adding instances to a dataset appeared in only five pipelines across the two surveys, but is a common solution to reduce bias in imbalanced datasets, and so was assigned a higher level of importance. While the final aim of our tool is to fully support all operations, testing the tool’s support for essential operations is required for an initial release.
Related work
Various established methods and tools exist to produce provenance, including the use of provenance polynomials through query instrumentation. These approaches, however, are designed to function in a relational database environment and presuppose that the queries involve relational operators [12–14]. Such an assumption is not universally applicable. Therefore, we opt to avoid techniques that are strongly tied to SQL or first-order queries [15], as these methods could limit the inclusion of additional types of operators in future developments.
Provenance capture within scripts is not a novel concept, but prior efforts have largely centered on capturing provenance related to script definition, deployment, and execution [16].
The Vamsa framework [17] addresses some of these challenges by collecting provenance related to pipeline design. However, the provenance records it generates focus on details such as the invocation of specific machine learning libraries through automated script analysis, rather than capturing data derivations.
More recently, [18] utilized provenance to track shifts in data distribution within machine learning pipelines using predefined "inspections" that assess data at certain operators within the pipeline. This aligns with the motivation behind our work, which expands on this by passively capturing provenance from any operator in the pipeline. Additionally, [23] merges system-level provenance with application-level logs to reconstruct data science pipeline provenance without placing a burden on the pipeline developers.
There are also tools that record the execution of general-purpose scripts (such as Python), though they lack the capability to capture fine-grained data provenance. For instance, NoWorkflow [19, 20] has been integrated with YesWorkflow [21, 22], which offers a workflow-like representation of scripts but similarly does not focus on data derivations.
A recent proposal introduced an application-agnostic approach to capturing fine-grained provenance [23]. This method combines provenance data from low-level OS logs with high-level application logs, yielding a comprehensive provenance record while minimizing the impact on developers. However, it remains unclear how much fine-grained provenance can be gleaned using this method, whereas our approach lays a clear foundation for the type of provenance that should be captured. A valuable direction for future research would be to assess how much of the provenance we identify could be collected using the approach outlined in [23].
An approach to data provenance collection
Data provenance model
In this paper, we focus on collecting provenance data for pipelines involving typical data manipulation operators over dataframes [24] provided in libraries such as pandas and scikit-learn. A dataframe is just a set of records, each of which includes values for a collection of features (or attributes). A value in a dataframe is uniquely identified by a record index and the feature to which it belongs.
The provenance model that we have adopted, described graphically in Fig. 2, is inspired by the PROV model [7] and uses entities, activities, and relationships between them to describe how elements of a dataframe are derived, created, and possibly deleted by operators in pipeline execution.
In this model, data provenance is represented as a graph. An entity node in the graph represents an atomic element within a dataset (i.e., a single value in a record), while an activity node refers to an operation performed on those elements. An entity e can be used, generated, or invalidated (e.g. deleted) by an activity a, and these interactions are captured through, respectively, the relationships used, wasGeneratedBy, and wasInvalidatedBy between entities and activities.
The model also introduces the concept of column node, which represents a group of entities occurring in a specific attribute of a dataframe. This allows the representation of provenance information at a higher level of granularity. Since activities operate on columns, the same relationships between entities and activities also apply to columns and activities. Additionally, a belongsTo relationship denotes the occurrence of an entity in a specific column of a dataframe.
Fig. 2. The data provenance model used in this paper
These concepts are used to guarantee a complete and general representation and understanding of the provenance, independently of the specific set of operators used in a data preparation pipeline. In the following sections, we will explain how this model is used and the usefulness of the column node.
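As a concrete illustration of these concepts, the following minimal sketch shows one way the node and relationship types of the model could be represented in Python; the class and field names are our own illustration and do not reflect PROLIT's internal data structures.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Entity:
    """A single value of a dataframe (e.g. the value of feature 'age' in record 3)."""
    id: str                      # e.g. "df1:row_3:age"

@dataclass
class Column:
    """A group of entities occurring in a specific attribute of a dataframe."""
    id: str                      # e.g. "df1:age"

@dataclass
class Activity:
    """An operation performed on entities/columns during pipeline execution."""
    id: str
    function_name: str = ""
    context: str = ""            # textual description of the operation

@dataclass
class ProvenanceGraph:
    """Relationships stored as (source id, target id) pairs, one list per type."""
    used: List[Tuple[str, str]] = field(default_factory=list)               # activity -> entity/column
    was_generated_by: List[Tuple[str, str]] = field(default_factory=list)   # entity/column -> activity
    was_invalidated_by: List[Tuple[str, str]] = field(default_factory=list) # entity/column -> activity
    belongs_to: List[Tuple[str, str]] = field(default_factory=list)         # entity -> column
```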
Data operations
Based on the analysis done in the previous section, we observed that the majority of pre-processing operations, including all those most used in practice and available in, e.g., Python libraries, can be implemented by combining a rather small set of basic data manipulation operators over datasets. These operators belong to four main classes, according to the type of manipulation done on the input dataframe(s), as follows.
Data reductions: operations that take as input a dataset D and reduce its size by eliminating rows or columns from D. Examples of data reduction operators are feature and instance selection.
Data augmentations: operations that take as input a dataset D on a schema S and increase the size of D by adding rows or columns to D. Two basic operators of this kind are defined over datasets. They allow the addition of columns and rows to a dataset, respectively. They include, for example, the operations of hot-encoding and feature augmentation.
Data transformations: operations that take as input a dataset D and, by applying suitable functions, transform (some of) the elements in D. They include, e.g., the operations of imputation and normalization.
Data fusion: operations that take as input two datasets and combine them into a new dataset D. They include, e.g., the operations of join and append.
Fig. 3. An example of provenance template for data transformations
By mapping effective operators available in standard libraries to these fundamental templates, we are then able to identify the transformation type based on observation of the operator’s input and outputs alone. By abstracting to this level, we can automatically create the appropriate provenance for an operator in a data science pipeline if it follows the pre-identified input–output patterns, even if the operator itself has never been seen before. This approach strongly simplifies the capture mechanisms since it does not require a specific method for each practical operation.
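As an illustration of this idea, the sketch below guesses the class of an unknown pandas operation by comparing the shapes of its input and output dataframes; the heuristics and the function name are ours and only approximate the templates discussed above.

```python
import pandas as pd

def classify_operation(inputs, output):
    """Guess the operation class from input/output dataframes alone.

    `inputs` is a list of input dataframes, `output` the resulting dataframe.
    The rules below are a simplified approximation of the four templates.
    """
    if len(inputs) > 1:
        return "data fusion"          # e.g. merge/join, concat/append

    din = inputs[0]
    fewer_rows = len(output) < len(din)
    fewer_cols = len(output.columns) < len(din.columns)
    more_rows = len(output) > len(din)
    more_cols = len(output.columns) > len(din.columns)

    if (fewer_rows or fewer_cols) and not (more_rows or more_cols):
        return "data reduction"       # e.g. feature/instance selection
    if more_rows or more_cols:
        return "data augmentation"    # e.g. one-hot encoding, upsampling
    return "data transformation"      # same shape, values changed (e.g. imputation)

# Example: one-hot encoding adds columns, so it is recognized as an augmentation
# without any prior knowledge of pd.get_dummies itself.
df = pd.DataFrame({"colour": ["red", "blue"], "price": [3, 5]})
print(classify_operation([df], pd.get_dummies(df, columns=["colour"])))
```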
Data provenance collection
One of the main limitations of data provenance systems, particularly those that collect data at the level of individual elements [25], is the large volume of provenance data generated. This data is often redundant for most queries of interest and complicates the readability and interpretability of the provenance graph.
To address this issue, our system introduces the capability to select a granularity level for provenance tracking, providing more precise control over the desired degree of detail.
Sketch Level: The graph captures the most abstract view of the provenance. Activity nodes represent high-level activities or operators. For each activity, only a single entity node is represented for each relationship concerning that activity, including used, wasGeneratedBy, and wasInvalidatedBy. This level is useful for obtaining an overview of the entire pipeline and ensures that the primary consequences of each operator within the pipeline are immediately visible, facilitating a clearer understanding of the process dynamics.
Derivation Level: All the information from the previous level is retained, with the addition of wasDerivedFrom links between entities, based on the activities that operated on them. This expansion can be performed incrementally by clicking on the entity nodes that require further exploration or focus and adding, in this way, only the links connected with such entities.
Full Level: The graph includes all the details at the finest level of granularity, thus capturing all possible insights. All entities and activities are included at this level.
Only Columns Level: The graph includes only the activity and the column nodes, without reporting the entity nodes, capturing data transformations at a coarser level of granularity with respect to the Full Level.
A provenance system for data preparation
An LLM-based system for capturing data provenance
We have implemented the approach to provenance generation illustrated in the previous section in an open source platform called PROLIT (Provenance and Lineage Information Tracker),2 whose logical architecture is reported in Fig. 4.
Fig. 4. PROLIT architecture
The input to the system is a script s that implements a pipeline of operations. The first component ➊ performs, with the help of an LLM, a rewriting of s where the various activities are identified and documented by annotating the original script. This component also produces an internal dictionary ➋ that includes a detailed description of each operation involved in the pipeline. The new script is then executed in the running environment ➌ for which it was originally developed. The provenance generator ➍ analyzes the effect of each operator in the pipeline at execution time by (i) producing a provenance graph fragment that captures the dependencies for the data elements that have changed, (ii) injecting the operation descriptions stored in the dictionary, and (iii) writing the resulting provenance fragment into a graph database (in this implementation we used Neo4j3).
The complete provenance graph that becomes available at the end of execution can be analyzed using native, direct access to Neo4j ➎ and its associated graph visualization.
A central activity of PROLIT is the process of detecting and tracking the provenance of a user-defined pipeline of data preparation written in Python and involving common libraries for data processing in data science. This is implemented by the provenance generator that produces the provenance of each operator in the pipeline at execution time by:
observing the execution of operators that consume and generate datasets,
analyzing the value and structural changes between the input(s) and output datasets for that command,
based on the observed change pattern, selecting and instantiating one or more of the provenance templates described in the previous section to capture the dependencies between the elements of the datasets that have changed, and storing the provenance data produced by the prov-gen function in Neo4j.
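For concreteness, the sketch below shows how one such provenance fragment might be written to Neo4j with the official Python driver; the node labels, property names, and connection details are illustrative assumptions rather than PROLIT's actual schema.

```python
from neo4j import GraphDatabase  # official Neo4j Python driver

# Connection details are placeholders for a local Neo4j instance.
driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

def write_fragment(tx, activity_id, description, used_ids, generated_ids):
    # Create the activity node with the description taken from the dictionary.
    tx.run("MERGE (a:Activity {id: $aid}) SET a.context = $ctx",
           aid=activity_id, ctx=description)
    # Link the entities read by the operator.
    for eid in used_ids:
        tx.run("MERGE (e:Entity {id: $eid}) "
               "MERGE (a:Activity {id: $aid}) "
               "MERGE (a)-[:used]->(e)", eid=eid, aid=activity_id)
    # Link the entities produced by the operator.
    for eid in generated_ids:
        tx.run("MERGE (e:Entity {id: $eid}) "
               "MERGE (a:Activity {id: $aid}) "
               "MERGE (e)-[:wasGeneratedBy]->(a)", eid=eid, aid=activity_id)

with driver.session() as session:
    session.execute_write(write_fragment, "activity:3",
                          "Impute missing values in numerical columns",
                          used_ids=["df:row_7:age"],
                          generated_ids=["df:row_7:age_v2"])
```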
The role of the LLM
Figure 5 illustrates how the LLM unit of the system leverages an LLM to support provenance tracking. In our implementation, we have used Meta’s Llama 3 70B model, accessed via the Groq APIs, with LangChain4 as mediator between the LLM and the rest of the system. The model is not fine-tuned on the specific pipelines analyzed in this work but is used in a zero-shot setting.
Fig. 5. The data workflow of PROLIT
The LLM is utilized for three main objectives:
Transforming a raw pipeline into a formatted one with a standardized structure, by segmenting the pipeline into activities, thereby facilitating automatic provenance tracking, and by adding comments that describe each data manipulation operation in detail. These comments are inserted directly into the code, providing clear explanations of what each section of the pipeline does.
Identifying columns used in each activity, which can be challenging to discern by merely examining the pipeline code.
Providing a high-level description of each activity’s scope. These descriptions are stored in a dictionary and explain the intent and impact of each operation, providing a clear understanding of the semantic meaning behind the code. These descriptions are later used to provide explanations while inspecting the provenance collected during pipeline execution.
An example of a pipeline’s snippet before and after formatting is provided in Figs. 6 and 7, respectively. An example of the dictionary produced for this pipeline is illustrated in Fig. 8.
Fig. 6. An input pipeline
Fig. 7. The pipeline of Fig. 6 after formatting
Fig. 8. The dictionary produced for the pipeline in Fig. 7
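To give a flavour of how the rewriting step can be driven, the sketch below sends a raw pipeline script to a Groq-hosted Llama 3 model through the langchain-groq integration; the prompt wording, model identifier, and file name are our own assumptions and do not reproduce the prompts actually used by PROLIT.

```python
from langchain_groq import ChatGroq  # assumes the langchain-groq package and a GROQ_API_KEY

# Model identifier and temperature are illustrative, not PROLIT's actual configuration.
llm = ChatGroq(model="llama3-70b-8192", temperature=0)

raw_pipeline = open("pipeline.py").read()  # hypothetical input script

prompt = (
    "Rewrite the following data preparation script so that each data "
    "manipulation operation is wrapped in its own commented function "
    "(one activity per function), and return a JSON dictionary mapping "
    "each function name to a one-sentence description of its intent.\n\n"
    + raw_pipeline
)

response = llm.invoke(prompt)
print(response.content)  # the formatted script plus the activity dictionary
```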
LLM limitations
LLM performance is highly sensitive to the consistency, structure, and clarity of the prompts. Minimal variation in wording, formatting, or context can lead to radically different responses in both relevance and accuracy. This sensitivity creates challenges in achieving consistent behavior with heterogeneous codebases or pipeline styles, particularly at scale or in automated pipelines. Unlike classical rule-based systems or static analysis tools, LLMs offer no formal correctness guarantees. Their outputs are generated probabilistically and may be incorrect, incomplete, or based on misinterpretations. As such, manual validation or pairing with additional verification mechanisms is generally necessary, especially in use cases with regulatory compliance, data governance, or safety-critical implications.
Robustness to process failure
A possible limitation of a platform for data provenance collection is that, if a runtime exception occurs during the execution of a pipeline, the provenance of the entire process is not tracked. PROLIT addresses this issue by tracking the provenance of the pipeline even after a runtime error occurs.
This is made possible by a wrapper around the input pipeline that manages exceptions while still capturing the provenance, including all nodes and relationships, up to the point where the exception occurred. The wrapping mechanism primarily focuses on exception handling and provides a centralized approach to tracking errors without modifying the main pipeline function.
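A minimal sketch of such a wrapper, with placeholder functions standing in for the tracker's own provenance calls, could look as follows.

```python
provenance_log = []   # captured fragments, kept even after a failure

def capture_provenance(activity, before, after):
    # Placeholder for the real provenance generator.
    provenance_log.append({"activity": activity, "status": "ok"})

def record_exception(activity, error):
    # The error is stored alongside the failing activity.
    provenance_log.append({"activity": activity, "runtime_exception": error})

def run_with_provenance(pipeline_steps, df):
    """Run (name, function) steps in order; if one raises, keep everything captured so far."""
    for name, step in pipeline_steps:
        try:
            new_df = step(df)
        except Exception as exc:
            record_exception(name, str(exc))   # attach the error to the activity
            break                              # provenance captured before this point is not lost
        capture_provenance(name, df, new_df)
        df = new_df
    return df
```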
This is useful because, in the event of a runtime exception in the pipeline, the information prior to the error is not lost, making it possible to trace the potential cause of the exception. A possible example of an exception is an attempt to access fields that are not present in the dataset. A pipeline with this type of error is reported in Fig. 9.
Fig. 9. Example of pipeline that generates a runtime error
In this case, the provenance will be tracked up to the fourth step, where the exception occurred because the pipeline attempts to access a column that is not present in the dataset.
Table 2. Error detected while trying to double the values in the
Property | Value |
|---|---|
Activity id | 3 |
Code | |
Context | Doubles all the values in the |
Function name | Doubles the |
Node id | activity: |
Runtime_exception | The |
This helps the user determine if they can use the provenance of the previous operations to detect the cause of the exception.
PROLIT in action
In this section, we show the main features of PROLIT using a practical example.
The pipeline we consider, reported in Fig. 10, focuses on preprocessing a customer purchase dataset: it handles missing values, separates the features from the target variable, imputes missing numerical data, and applies one-hot encoding to categorical features.
Fig. 10. An example pipeline
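For readers without access to the figure, a pipeline along these lines could look as follows; this is an illustrative reconstruction based on the description above (file and column names are hypothetical), not the exact code shown in Fig. 10.

```python
import pandas as pd
from sklearn.impute import SimpleImputer

# Load the customer purchase dataset (file name is illustrative).
df = pd.read_csv("customer_purchases.csv")

# Drop records where the target itself is missing ("purchased" is a hypothetical column).
df = df.dropna(subset=["purchased"])

# Separate the features from the target variable.
X = df.drop(columns=["purchased"])
y = df["purchased"]

# Impute missing numerical values with the column mean.
num_cols = X.select_dtypes(include="number").columns
X[num_cols] = SimpleImputer(strategy="mean").fit_transform(X[num_cols])

# One-hot encode the categorical features.
cat_cols = X.select_dtypes(include="object").columns.tolist()
X = pd.get_dummies(X, columns=cat_cols)
```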
To demonstrate the system’s visualization interface and how the user interacts with it, Fig. 11 shows the pipeline taken as an example, where the imputation activity has been selected, displaying the detailed node information.
Fig. 11. PROLIT interface in Neo4j Browser, showing the data preprocessing steps with a focus on the imputation step
PROLIT identifies 5 activities. The details of the third one, as shown in Neo4j, are reported in Table 3. As explained earlier, activities are represented in the same way at all granularity levels.
Table 3. Details on the imputation operation
Property | Value |
|---|---|
Activity id | 2 |
Code | |
Context | Replace missing values in numerical columns with the mean of the respective column |
Function name | Impute missing values in numerical columns |
Node id | activity:06517138-3478-4ae3-902a-03ad07b37885 |
Figure 12 illustrates the different visualizations for the four levels in Neo4j. In particular, Fig. 12a shows the “Sketch Level”, where only a single cell entity is depicted for each relationship concerning that activity. Figure 12b illustrates the “Derivation Level”, which visually resembles the Sketch Level but allows users to click on an entity to access a detailed view of the derivation. Figure 12c shows the visualization of the “Full Level”, the most detailed and comprehensive representation, where all available information regarding the entities and their relationships is visualized. Finally, Fig. 12d shows the “Only Columns Level”, in which only columns and activities are represented, omitting entities for a high-level overview of transformations. For further information about the granularity levels, refer to the Data provenance collection section above.
Fig. 12. Visualization of various granularity levels for the example pipeline. The nodes in the graphs are colored as follows: the orange nodes represent entities, the blue nodes represent columns, and the purple nodes represent activities. This allows for a clear distinction between different components of the provenance across the various levels of granularity presented
Evaluation
Goals and methodology
The evaluation presented below aims to validate PROLIT’s capability to gather and manage data provenance across data preparation pipelines. It focuses primarily on performance, but we also demonstrate qualitatively the value of providing multi-granular provenance records, as these offer the analyst a choice of perspectives on pipeline execution. Such multi-level traceability supports users in explaining how data values were computed, identifying where bugs originated (e.g., in case of runtime exceptions), and facilitating reproducibility through transparency. This demonstrates the capability of the system to be effective not just as a tool for provenance capture, but also as a foundation for establishing trust in AI pipelines by enabling full historical awareness of the data lifecycle, for example through queries like Item History and Record Operation.
Pipelines and experiment setup
Seven pipelines are used to test the tool, as described in more detail in the Appendix. Three of them operate on the commonly used benchmarks German [26], Census [27], and Compas [28]. The rest – Orders (which was generated by us), Car [29], Mushrooms [30], and Titanic [31] – have been chosen to test the tool on different operations and datasets. The datasets were selected due to the difference in the number of instances and features, as well as the combination of numerical and categorical data present. Additionally, the operations in the pipelines come from different libraries and vary in complexity to verify that the tool correctly tracks the provenance in various cases. The sizes of the datasets are reported in Table 4.
Table 4. Sizes of the datasets
Pipelines | Records | Features |
|---|---|---|
Car | 42090 | 12 |
Census | 32562 | 15 |
Compas | 7215 | 53 |
German | 1000 | 21 |
Mushrooms | 61070 | 21 |
Orders | 100 | 4 |
Titanic | 892 | 12 |
All experiments illustrated in this section were performed on an HP laptop with a 1.60 GHz Intel Core i5 (4 cores) and 16 GB RAM (2 x 8 GiB SODIMM DDR4 Synchronous 2667 MHz).
Performance analysis
The analyses focus on the size of the generated graph (in terms of nodes and relationships) and the operation runtimes, which are recorded for nine queries executed across the four granularity levels for each dataset and pipeline. The details of the nine queries are presented in Table 5. As the table shows, all queries are compatible with the four granularity levels, except for the “Record Operation” and “Item History” queries. The “Record Operation” query cannot be executed at the Only Columns Level due to the absence of entity nodes, while the “Item History” query is incompatible with both the Only Columns Level and the Sketch Level, because the former lacks entity nodes and the latter lacks derivations.
Table 5. Cypher queries used in the Experiments
Name | Cypher Query |
|---|---|
All Activities | |
All Distinct Activities | |
Used Relations | |
Invalidate Relations | |
Dataset Spread | |
Feature Spread | |
Activity Metrics | |
Record Operation | |
Item History |
These experiments are not only meant to validate the performance and scalability of the system but also to demonstrate how provenance contributes to transparency, explainability, and reproducibility in data preparation pipelines. For example, queries such as Item History and Record Operation allow users to trace the origin and transformation of specific values within the dataset, offering a clear audit trail of how data has been manipulated. This is crucial for understanding how the final model inputs were constructed, enabling better explanations of model behavior. Moreover, the ability to explore the data lineage across different granularity levels helps practitioners to reproduce experiments more effectively by identifying the exact sequence of operations and the impact of each step.
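As an example of the kind of lineage query involved, the snippet below traces the derivation chain of a single value by issuing a Cypher query through the Neo4j Python driver; since Table 5 only lists the query names, the query text and identifiers are our own illustration of an Item History-style lookup.

```python
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

# Follow wasDerivedFrom links backwards from one entity and report the
# activity that generated each ancestor (labels and properties are assumed).
ITEM_HISTORY = """
MATCH (e:Entity {id: $entity_id})-[:wasDerivedFrom*]->(ancestor:Entity)
OPTIONAL MATCH (ancestor)-[:wasGeneratedBy]->(a:Activity)
RETURN ancestor.id AS value, a.function_name AS produced_by
"""

with driver.session() as session:
    for record in session.run(ITEM_HISTORY, entity_id="df:row_7:age"):
        print(record["value"], "<-", record["produced_by"])
```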
To justify the presence of multiple levels of granularity for provenance capture, Table 6 reports the space savings obtained at the various levels.
Table 6. Percentage of space savings (in terms of Nodes and Relations) and query time reduction (Query Execution Time) for the Sketch Level and Only Columns Level, compared to the Full Level
| | Sketch/Full levels | Only-Columns/Full levels |
|---|---|---|
| Nodes | 99.62% | 99.69% |
| Relations | 99.68% | 99.76% |
| Query Execution Time | 34.34% | 46.02% |
The node and relationship values in the table represent the average percentage savings, computed over each test pipeline, compared to the most detailed level of provenance tracking, the Full Level. Execution times are reported as the average percentage savings across all analyzed queries on all test pipelines.
The reduction in graph size, measured by nodes and relationships, is significantly greater at the lower granularity levels, Sketch and Only Columns levels, compared to the complete provenance level, Full Level. A similar trend is observed in query execution times, which also benefit from this reduction. These improvements are driven by the decreasing number of nodes and relationships considered at the lower granularity levels.
Figures 13 and 14 illustrate, respectively, the normalized number of nodes and relationships, calculated as the ratio between these quantities and the total size of the dataset (i.e., the number of entries in the dataset) for each level of granularity considered, on a logarithmic scale. The patterns observed in these figures provide strong evidence of significant space savings in terms of both nodes and relationships when compared to the full granular level. This reduction is achieved while still retaining the capacity to respond to the vast majority of test queries effectively. Such results demonstrate the efficiency of the proposed approach in optimizing storage while preserving the ability to answer the queries of greatest interest.
Fig. 13. Number of nodes per entry across granularity levels for each pipeline (logarithmic scale). A single bar reports both the derivation and full levels
Fig. 14. Number of relations per entry across granularity levels for each pipeline (logarithmic scale)
From the tests conducted, a significant difference in the number of nodes and relationships within the provenance graph emerged for the reduced granularity levels, leading to a notable reduction in query execution time as well as improved interpretability of the provenance graph. These observations support and underscore the value of introducing granularity levels, offering users greater flexibility in selecting the desired level of detail for provenance reading.
The type of information being collected, such as data pertaining exclusively to columns or involving entities as well, plays a crucial role in determining the appropriate granularity level.
Figure 15 presents the execution time of the most time-consuming query for each pipeline, evaluated at different levels of granularity. The y-axis represents the execution time in milliseconds, while the x-axis shows the corresponding pipelines.
Fig. 15. Execution time (in milliseconds) of the most time-consuming query for each pipeline across different levels of granularity. The y-axis indicates the time in milliseconds, while the x-axis lists the pipelines involved in the evaluation
Behavior across different libraries and pipelines
The system works well with well-known libraries or those recognized by the model. However, when the model encounters unfamiliar libraries or if a user-defined function is used, the system struggles to understand the processes involved. As a result, it fails to describe and report the code of the activity properly, but still tracks the provenance.
The system is designed to capture detailed provenance across the entire workflow. However, when a process splits into parallel or alternative paths, it records provenance linearly, treating all actions as if they occurred sequentially. As a result, even complex branching workflows are simplified into a linear trace, while preserving provenance for each step.
Provenance for explainability and trust
Beyond performance, the captured provenance supports transparency, explainability, and reproducibility. At the Full Level, queries like Item History and Record Operation (Table 5) enable detailed inspection of how individual data values are derived and transformed throughout the pipeline. This empowers analysts to understand the exact lineage of model inputs and to provide concrete explanations for model behavior. Meanwhile, Sketch and Only Columns levels offer higher-level summaries of transformations, useful for auditing and trust-building without overwhelming complexity. Furthermore, the ability to re-execute pipelines with different granularity levels ensures that both detailed debugging and high-level process review can be conducted reproducibly.
Conclusions
In this paper, we have illustrated the development of a system, called PROLIT, for the collection and management of data provenance in pre-processing pipelines for the subsequent activity of machine learning. PROLIT aims to address some of the critical needs for transparency and interpretability in machine learning and AI systems, as these become key requirements for MLOps and AI deployment. The system is designed to initially support the most common data manipulation operators, as found in an extensive analysis of Kaggle and ML Bazaar pipelines. However, it is designed to be extensible and flexible, with support from a pre-trained LLM to identify logical units of computation within a script. PROLIT offers a way to track data manipulations at four different levels of granularity. These are recorded into a Neo4j graph database, which can then be queried to generate suitable data-centric explanations.
Extensions to the work include a focus on scalability, providing concise representations of provenance that scale in the size of the data as well as the number of operators. Most importantly, we aim to support the iterative entanglements between data manipulation and model training, which result from the emerging Data-Centric view on Machine Learning and AI.
Acknowledgements
This work was partially supported by the Progetto Integrato 2.1 “Cyber Security dei sistemi energetici” PTR22–24 funded by Ministero dell’Ambiente e della Sicurezza Energetica (MASE).
Author contributions
All authors contributed to the manuscript.
Availability of data and materials
No datasets were generated or analysed during the current study.
Declarations
Competing interests
The authors declare no competing interests.
1https://www.kaggle.com/
2https://github.com/lid-uniroma3/PROLIT
3https://neo4j.com/
4https://www.langchain.com/langchain
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Adadi, A; Berrada, M. Peeking inside the black-box: a survey on explainable artificial intelligence (xai). IEEE Access; 2018; 6, pp. 52138-52160. [DOI: https://dx.doi.org/10.1109/ACCESS.2018.2870052]
2. Sundararajan M, Najmi A. The many shapley values for model explanation. In: Proceedings of the 37th International Conference on Machine Learning. Proceedings of Machine Learning Research, 2020;119:9269–9278. https://proceedings.mlr.press/v119/sundararajan20b.html
3. Jacovi A, Marasović A, Miller T, Goldberg Y. Formalizing trust in artificial intelligence: Prerequisites, causes and goals of human trust in ai. In: Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency. FAccT ’21, pp. 624–635. Association for Computing Machinery, New York, NY, USA 2021. https://doi.org/10.1145/3442188.3445923
4. Alshdaifat, E; Alshdaifat, D; Alsarhan, A; Hussein, F; El-Salhi, SMFS. The effect of preprocessing techniques, applied to numeric features, on classification algorithms’ performance. Data; 2021; [DOI: https://dx.doi.org/10.3390/data6020011]
5. Neutatz F, Chen B, Abedjan Z, Wu E. From Cleaning before ML to Cleaning for ML.
6. Glavic, B. Data provenance. Found Trends Databases; 2021; 9,
7. Moreau L, Missier P, Belhajjame K, B’Far R, Cheney J, Coppens S, et al. PROV-DM: The PROV data model. W3C Recommendation, 2013.
8. Gregori L, Missier P, Stidolph M, Torlone R, Wood A. Design and Development of a Provenance Capture Platform for Data Science. In: Procs. 3rd DATAPLAT Workshop, Co-located with ICDE 2024. IEEE, Utrecht, NL 2024.
9. Chapman, A; Missier, P; Lauro, L; Torlone, R. DPDS: assisting data science with data provenance. PVLDB; 2022; 15,
10. Smith MJ, Sala C, Kanter JM, Veeramachaneni K. The machine learning bazaar: Harnessing the ml ecosystem for effective system development. In: Proceedings of the 2020 ACM SIGMOD International Conference on Management of Data. SIGMOD ’20. ACM, New York, NY, USA 2020
11. Kamiran, F; Calders, T. Data preprocessing techniques for classification without discrimination. Knowl Inform Syst; 2012; 33,
12. Niu X, Kapoor R, Glavic B, Gawlick D, Liu ZH, Radhakrishnan V. Provenance-aware query optimization. In: 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017, 2017;473–484
13. Arab, BS; Feng, S; Glavic, B; Lee, S; Niu, X; Zeng, Q. Gprom-a swiss army knife for your provenance needs. IEEE Data Eng Bull; 2018; 41,
14. Glavic B, Alonso G. Perm: Processing provenance and data on the same data model through query rewriting. In: Ioannidis, Y.E., Lee, D.L., Ng, R.T. (eds.) Proceedings of the 25th International Conference on Data Engineering, ICDE 2009, March 29 2009 - April 2 2009, Shanghai, China, 2009;174–185
15. Lee S, Köhler S, Ludäscher B, Glavic B. A SQL-Middleware Unifying Why and Why-Not Provenance for First-Order Queries. In: 33rd IEEE International Conference on Data Engineering, ICDE 2017, San Diego, CA, USA, April 19-22, 2017, 2017;485–496
16. Pimentel, JF; Freire, J; Murta, L; Braganholo, V. A survey on collecting, managing, and analyzing provenance from scripts. ACM Comput Surv (CSUR); 2019; 52,
17. Namaki MH, Floratou A, Psallidas F, Krishnan S, Agrawal A, Wu Y, et al. Vamsa: Automated provenance tracking in data science scripts. In: Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. KDD ’20, pp. 1542–1551. Association for Computing Machinery, New York, NY, USA 2020. https://doi.org/10.1145/3394486.3403205
18. Grafberger, S; Groth, P; Stoyanovich, J; Schelter, S. Data distribution debugging in machine learning pipelines. VLDB J; 2022; 31, pp. 1-24.
19. Pimentel JF, Freire J, Murta L, Braganholo V. Fine-grained provenance collection over scripts through program slicing. In: International Provenance and Annotation Workshop, 2016;199–203.
20. Pimentel, JF; Murta, L; Braganholo, V; Freire, J. noworkflow: a tool for collecting, analyzing, and managing provenance from python scripts. Proc VLDB Endow; 2017; 10,
21. McPhillips TM, Song T, Kolisnik T, Aulenbach S, Belhajjame K, Bocinsky K, et al. A user-oriented, language-independent tool for recovering workflow information from scripts. CoRR. 2015. abs/1502.02403.
22. Zhang Q, Cao Y, Wang Q, Vu D, Thavasimani P, McPhillips T. et al. Revealing the Detailed Lineage of Script Outputs using Hybrid Provenance. In: Procs. 11th Intl. Digital Curation Conference (IDCC) 2017.
23. Rupprecht, L; Davis, JC; Arnold, C; Gur, Y; Bhagwat, D. Improving reproducibility of data science pipelines through transparent provenance capture. Proc VLDB Endow; 2020; 13,
24. Petersohn, D; Ma, WW; Lee, DJL; Macke, S; Xin, D; Mo, X; Gonzalez, J; Hellerstein, JM; Joseph, AD; Parameswaran, AG. Towards scalable dataframe systems. Proc VLDB Endow; 2020; 13,
25. Chapman, A; Lauro, L; Missier, P; Torlone, R. Dpds: assisting data science with data provenance. Proc VLDB Endow; 2022; 15,
26. FAIRsharing Community: FAIRsharing: C5QG88 (2023). https://doi.org/10.24432/C5QG88.
27. Kohavi R. Census Income 1996. https://doi.org/10.24432/C5GP7S.
28. Pfisterer F, Siyi W, Lang M. COMPAS Dataset in mlr3fairness 2023. https://mlr3fairness.mlr-org.com/reference/compas.html.
29. Volk A. Dataset of USED CARS 2023. https://www.kaggle.com/datasets/volkanastasia/dataset-of-used-cars.
30. Dua D, Graff C. UCI Machine Learning Repository: Mushroom Data Set 2019. https://archive.ics.uci.edu/dataset/73/mushroom.
31. Kaggle: Titanic - Machine Learning from Disaster 2025. https://www.kaggle.com/competitions/titanic/data.
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”).