Abstract

Most work on evaluating bias in data science workflows focuses on the model. However, the training data fed into the model, and the data preprocessing step that produces it, can also have a significant impact on model results. While there has been work on editing data during preprocessing to mitigate bias, the impact of conventional data preprocessing operations has been understudied. My dissertation examines how the data preprocessing step can be improved to help analysts better understand its impact and make better-informed data science decisions. I first studied the needs of data scientists when conducting data preprocessing through a small-scale interview study and compared the results with a literature survey of current preprocessing tools. The comparison identified several key gaps between practice and theory. I used the results of this analysis to develop the Preprocess Analyzer (PPA) tool, which is designed to address some of these gaps by integrating into existing data science work environments and providing users with deeper insight into their data. I then conducted a user study to evaluate the ability of PPA to aid data preprocessing. The study found that, compared to existing popular tools, data scientists gained a better understanding of their data preprocessing workflow when using PPA. Participants generally agreed that PPA included many helpful features, such as the ability to quickly display useful statistics and highlight areas of concern, as well as integration into familiar work environments. I believe the results of this dissertation can guide the design of future data preprocessing tools to better meet the needs of end users.

Details

1010268
Title
Understanding the Effects of Increased Transparency on Data Preprocessing Through In-Process Visualizations
Number of pages
114
Publication year
2025
Degree date
2025
School code
0153
Source
DAI-A 87/2(E), Dissertation Abstracts International
ISBN
9798291555507
Committee member
Arguello, Jaime; Szafir, Danielle; Borland, David
University/institution
The University of North Carolina at Chapel Hill
Department
Information and Library Science
University location
United States -- North Carolina
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32122278
ProQuest document ID
3243228331
Document URL
https://www.proquest.com/dissertations-theses/understanding-effects-increased-transparency-on/docview/3243228331/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic