
Abstract

As machine learning and AI systems become more prevalent, understanding how they reach their decisions is key to maintaining trust in them. It is widely accepted that knowing how data are altered in the pre-processing phase provides fundamental support for this understanding, and that data provenance can be used to track such changes. This paper focuses on the design and development of a system for collecting, managing, and querying the data provenance of data preparation pipelines in data science. We investigate publicly available machine learning pipelines to identify the features the tool most needs in order to cover a broad selection of pre-processing data manipulations. Building on this study, we present an approach for transparently collecting data provenance that uses an LLM to: (i) automatically rewrite user-defined pipelines in a format suitable for provenance capture and (ii) store an accurate description of all the activities in the input pipelines, to support the explanation of each of them. We then illustrate and test implementation choices aimed at capturing provenance for data preparation pipelines efficiently and transparently for data scientists.
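
To make the idea of recording "a description of all the activities" in a data preparation pipeline concrete, the following is a minimal sketch and not the authors' system. It involves no LLM and no pipeline rewriting; it only shows, under simplifying assumptions, what a provenance record for one pre-processing step might contain at two granularities (row counts for the dataset, added and removed names for the columns). The names `ProvenanceRecord`, `ProvenanceLog`, `run`, and `columns`, and the toy data, are hypothetical and do not come from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable

# Assumption: a table is a plain list of dict rows, to keep the sketch stdlib-only.
Row = dict
Table = list[Row]


def columns(table: Table) -> set[str]:
    """Collect every column name that appears in at least one row."""
    names: set[str] = set()
    for row in table:
        names.update(row)
    return names


@dataclass
class ProvenanceRecord:
    """Hypothetical record of what one pipeline activity did to the data."""
    activity: str
    rows_in: int
    rows_out: int
    columns_added: list[str]
    columns_removed: list[str]


@dataclass
class ProvenanceLog:
    """Hypothetical in-memory store of provenance records, in execution order."""
    records: list[ProvenanceRecord] = field(default_factory=list)

    def run(self, activity: str, step: Callable[[Table], Table], table: Table) -> Table:
        """Apply one step and record dataset-level and column-level changes."""
        before = columns(table)
        result = step(table)
        after = columns(result)
        self.records.append(
            ProvenanceRecord(
                activity=activity,
                rows_in=len(table),
                rows_out=len(result),
                columns_added=sorted(after - before),
                columns_removed=sorted(before - after),
            )
        )
        return result


# Toy two-step pipeline: drop rows with a missing value, then encode a column.
raw = [{"age": 31, "city": "Roma"}, {"age": None, "city": "Birmingham"}]
log = ProvenanceLog()
clean = log.run(
    "drop_missing_age",
    lambda t: [r for r in t if r["age"] is not None],
    raw,
)
encoded = log.run(
    "encode_city",
    lambda t: [{"age": r["age"], "city_is_roma": int(r["city"] == "Roma")} for r in t],
    clean,
)

for record in log.records:
    print(record)
```

The user's step functions stay unchanged and the recording happens in a wrapper, which is one simple way to keep capture transparent to the data scientist; a real system would also need to track provenance for individual values and persist the records for querying.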

Details

1009240
Title
An LLM-guided platform for multi-granular collection and management of data provenance
Author
Gregori, Luca 1 ; Lazzaro, Pasquale Leonardo 1 ; Lazzaro, Marialaura 1 ; Missier, Paolo 2 ; Torlone, Riccardo 1 

1 Università Roma Tre, DICITA, Roma, Italy (GRID:grid.8509.4) (ISNI:0000 0001 2162 2106)
2 University of Birmingham, School of Computer Science, Birmingham, UK (GRID:grid.6572.6) (ISNI:0000 0004 1936 7486)
Publication title
Volume
12
Issue
1
Pages
187
Publication year
2025
Publication date
Jul 2025
Publisher
Springer Nature B.V.
Place of publication
Heidelberg
Country of publication
Netherlands
e-ISSN
21961115
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-07-26
Milestone dates
2025-05-22 (Registration); 2024-11-01 (Received); 2025-05-22 (Accepted)
First posting date
26 Jul 2025
ProQuest document ID
3233582374
Document URL
https://www.proquest.com/scholarly-journals/llm-guided-platform-multi-granular-collection/docview/3233582374/se-2?accountid=208611
Copyright
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-11-14
Database
ProQuest One Academic