Abstract

In the era of big data, domain experts commonly begin their analysis by exploring diverse datasets to gain meaningful insights. The data lake has emerged in recent years as a modern solution for storing and managing data from heterogeneous sources. It has quickly become a mainstream storage paradigm in industry, with widely adopted platforms such as AWS Lake Formation, Azure Data Lake, and Google BigLake.

In this thesis, we present a distributed data processing system named SynopsisDB, designed to support large-scale data exploration over data lakes. SynopsisDB consists of three layers: the storage layer, the query processing layer, and the user interface layer.

The storage layer manages thousands of data files, combining data lake storage engines with a local Log-Structured Merge (LSM) tree–based engine. The data lake files are stored in the Hadoop Distributed File System (HDFS), while the local engine runs on a NewSQL database system that extends the leveled LSM-tree architecture into Bi-LSM.

The query processing layer features a component called SynopsisLake, which extends the Data Lakehouse architecture to manage and query thousands of data synopses. SynopsisLake bridges the gap between two worlds: the traditional query optimization techniques of Database Management Systems (DBMSs) and data warehouses, and the heterogeneous, multi-resolution data synopses found in modern data lakes.

The user interface layer supports three key operations: approximate query processing, progressive query processing, and progressive query visualization. These capabilities let domain experts explore their data efficiently, gain early insights, and interactively refine their queries within a short time.

Together, these contributions make SynopsisDB a comprehensive and practical system for scalable, synopsis-driven data exploration in the age of big data.

Details

Title
SynopsisDB: A Distributed Data System Supports In-System Data Exploration
Author
Number of pages
153
Publication year
2025
Degree date
2025
School code
0032
Source
DAI-A 87/5(E), Dissertation Abstracts International
ISBN
9798263308971
Committee member
Christidis, Evangelos; Tsotras, Vassilis; Sun, Yihan
University/institution
University of California, Riverside
Department
Computer Science
University location
United States -- California
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32239520
ProQuest document ID
3271743352
Document URL
https://www.proquest.com/dissertations-theses/synopsisdb-distributed-data-system-supports/docview/3271743352/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic