Abstract

Most work on evaluating bias in data science workflows focuses on the model. However, the training data fed into the model, and the data preprocessing step that produces it, can also have a significant impact on model results. While there has been work on editing data during preprocessing to mitigate bias, the impact of conventional data preprocessing operations has been understudied. My dissertation examines how the data preprocessing step can be improved to help analysts better understand its impact and make better-informed data science decisions. I first studied the needs of data scientists when conducting data preprocessing through a small-scale interview study and compared the results with a literature survey of current preprocessing tools. The comparison identified several key gaps between practice and theory. I used the results of this analysis to develop the Preprocess Analyzer (PPA) tool, which is designed to address some of these gaps by integrating into existing data science work environments and providing users with deeper insight into their data. I then conducted a user study to evaluate the ability of PPA to aid with data preprocessing. The study found that, compared to existing popular tools, data scientists gained a better understanding of their data preprocessing workflow when using PPA. Participants generally agreed that PPA included many helpful features, such as quick display of useful statistics, highlighting of areas of concern, and integration into familiar work environments. I believe the results of this dissertation can guide the design of future data preprocessing tools to better meet the needs of the end user.

Details

Title
Understanding the Effects of Increased Transparency on Data Preprocessing Through In-Process Visualizations
Author
Su, William
Publication year
2025
Publisher
ProQuest Dissertations & Theses
ISBN
9798291555507
Source type
Dissertation or Thesis
Language of publication
English
ProQuest document ID
3243228331
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.