Abstract

Data science tools, spanning from data collection to analysis and visualization, and leveraging advanced techniques such as artificial intelligence (AI), machine learning (ML), and large language models (LLMs), are now indispensable across a wide range of fields. Addressing today’s complex problems demands interdisciplinary collaboration among domain experts, data engineers, computer scientists, and statisticians, as no single field holds all the necessary expertise. There is an increasing demand for systems that let teams bring their own code across languages, collaborate modularly, inspect and interact with running computations at fine granularity, and manage heterogeneous resources in a resource-aware way. For the past few years, we have been building Texera, an open-source system to support collaborative data science using GUI-based workflows. This dissertation extends Texera with first-class support for user-defined functions (UDFs) and builds UDF-centric systems to meet these needs.

We first present UDFlow, a framework for supporting UDFs in dataflow systems. It provides a unified API that supports tuple-, batch-, and table-oriented execution, enabling collaborators to express UDF logic at whatever granularity their task requires. The API is also expressive enough to handle UDFs with multiple input and output ports, and it allows collaborators to use Python, R, Scala, and Java UDFs together in a single workflow. We discuss execution support for host-language UDFs as well as foreign-language UDFs (e.g., Python, R) that run in sidecar processes. We showcase the UDF UI and supporting services, which provide an IDE-like experience that eases UDF development.
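To make the three granularities concrete, the following is a minimal sketch of what a tuple-, batch-, and table-oriented UDF interface could look like. It is not UDFlow's actual API: the class names (`TupleUDF`, `BatchUDF`, `TableUDF`), the method names, the `port` parameter, and the `Row` type are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, Iterator, List

Row = Dict[str, int]  # hypothetical tuple representation


class TupleUDF:
    """Hypothetical base class: user code is called once per input tuple."""

    def process_tuple(self, row: Row, port: int) -> Iterator[Row]:
        raise NotImplementedError

    def run(self, rows: Iterable[Row], port: int = 0) -> List[Row]:
        out: List[Row] = []
        for row in rows:
            out.extend(self.process_tuple(row, port))
        return out


class BatchUDF:
    """Hypothetical base class: user code is called once per fixed-size batch."""

    batch_size = 2

    def process_batch(self, batch: List[Row], port: int) -> Iterator[Row]:
        raise NotImplementedError

    def run(self, rows: Iterable[Row], port: int = 0) -> List[Row]:
        out: List[Row] = []
        batch: List[Row] = []
        for row in rows:
            batch.append(row)
            if len(batch) == self.batch_size:
                out.extend(self.process_batch(batch, port))
                batch = []
        if batch:  # flush the final, possibly smaller, batch
            out.extend(self.process_batch(batch, port))
        return out


class TableUDF:
    """Hypothetical base class: user code is called once with the whole input."""

    def process_table(self, table: List[Row], port: int) -> Iterator[Row]:
        raise NotImplementedError

    def run(self, rows: Iterable[Row], port: int = 0) -> List[Row]:
        return list(self.process_table(list(rows), port))


class DoubleX(TupleUDF):
    def process_tuple(self, row: Row, port: int) -> Iterator[Row]:
        yield {"x": row["x"] * 2}


class BatchSum(BatchUDF):
    def process_batch(self, batch: List[Row], port: int) -> Iterator[Row]:
        yield {"sum": sum(r["x"] for r in batch)}


class SortByX(TableUDF):
    def process_table(self, table: List[Row], port: int) -> Iterator[Row]:
        yield from sorted(table, key=lambda r: r["x"])
```

In this sketch the user overrides only the `process_*` method that matches the granularity of the task, and the base class decides how input is grouped before the user code runs.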

We then propose Udon, a novel UDF debugger that supports line-by-line debugging in dataflow systems. Udon allows users to set breakpoints, inspect code, and modify code while a UDF is executing, even on a single tuple. It includes a novel debug-aware UDF execution model that keeps the operator responsive during debugging. It uses advanced state-transfer techniques to satisfy breakpoint conditions that span multiple UDFs, and it incorporates various optimization techniques to reduce runtime overhead. We conduct experiments with multiple UDF workloads on various datasets and show Udon’s high efficiency and scalability.

We then present Peanut, a port-based framework for compiling and scheduling UDFs in dataflow systems. Peanut converts a multi-port UDF into a DAG of mini-operators, called a U-plan. Each input and output port of such a UDF can be treated as a mini-operator, and the internal state is transferred via state edges between those mini-operators. Decoupling a monolithic UDF into a U-plan unlocks finer-grained parallelism and more pipelined execution, resulting in higher resource utilization; these capabilities are critical for resource-intensive data-science workloads. At the core of Peanut is a UDF compiler that automatically rewrites standard, multi-port Python UDFs into U-plans. We demonstrate that Peanut can effectively optimize a wide range of real-world UDFs, from machine learning training and inference to custom join implementations.
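As a rough illustration of the U-plan idea, the sketch below hand-decomposes a two-input-port hash join into two mini-operators linked by a state edge. It is not Peanut's compiler or its output: the decomposition is written by hand, and the names `build`, `probe`, `U_PLAN`, and `run_u_plan` are hypothetical. The only library call is `graphlib.TopologicalSorter` from the Python standard library (Python 3.9 or later), used to order the mini-operators.

```python
from graphlib import TopologicalSorter
from typing import Dict, List

Row = Dict[str, object]


def build(build_rows: List[Row], key: str) -> Dict[object, List[Row]]:
    """Mini-operator for the first input port: builds the hash table (the state)."""
    table: Dict[object, List[Row]] = {}
    for row in build_rows:
        table.setdefault(row[key], []).append(row)
    return table


def probe(probe_rows: List[Row], table: Dict[object, List[Row]], key: str) -> List[Row]:
    """Mini-operator for the second input port: reads the state and emits matches."""
    out: List[Row] = []
    for row in probe_rows:
        for match in table.get(row[key], []):
            out.append({**match, **row})
    return out


# Hypothetical U-plan: each node maps to the set of nodes it depends on.
# The single dependency stands in for the state edge from build to probe.
U_PLAN = {"build": set(), "probe": {"build"}}


def run_u_plan(build_rows: List[Row], probe_rows: List[Row], key: str) -> List[Row]:
    state: Dict[str, Dict[object, List[Row]]] = {}
    output: List[Row] = []
    # static_order() yields each node only after all of its dependencies.
    for node in TopologicalSorter(U_PLAN).static_order():
        if node == "build":
            state["table"] = build(build_rows, key)
        elif node == "probe":
            output = probe(probe_rows, state["table"], key)
    return output
```

Once the two ports are separate nodes, a scheduler is free to place, parallelize, or pipeline them independently, constrained only by the state edge; this sketch runs them sequentially for simplicity.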

Taken together, these contributions show that making UDFs first class requires an integrated stack that spans the interface, debugger, compiler, and execution runtime. They advance the state of distributed dataflow systems toward accessible, efficient, and collaborative data science, AI, and ML.

Details

Business indexing term
Title
UDF-Centric Dataflow Systems for Supporting User-Defined Functions in Collaborative Data Science, AI, and ML
Number of pages
194
Publication year
2025
Degree date
2025
School code
0030
Source
DAI-B 87/2(E), Dissertation Abstracts International
ISBN
9798290963273
Advisor
Committee member
Mehrotra, Sharad; Carey, Michael
University/institution
University of California, Irvine
Department
Computer Science
University location
United States -- California
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32236959
ProQuest document ID
3240377738
Document URL
https://www.proquest.com/dissertations-theses/udf-centric-dataflow-systems-supporting-user/docview/3240377738/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic