
Abstract

In this dissertation we study entity matching, a fundamental problem that lies at the heart of data integration for data science and AI. Specifically, we consider the following common entity matching problem: given two tables A and B with the same schema, find all pairs of records (a, b) ∈ A × B that "match", i.e., refer to the same real-world entity.

Typically, entity matching is done in two steps: blocking and matching. The goal of the blocking step is to quickly reduce the number of pairs to be processed later in the matching step, while retaining as many true matches as possible. The goal of the matching step is to accurately predict which record pairs match. In this dissertation we focus on the blocking step and make three major contributions.
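
As a rough illustration of the blocking step and of the two quantities it trades off (recall and candidate set size), here is a minimal sketch. It is not taken from the dissertation: the token-overlap rule, the function names `token_blocker` and `recall`, and the toy tables are all assumptions chosen for illustration.

```python
from collections import defaultdict


def tokens(text):
    """Lowercase whitespace tokens of a record (a deliberately crude tokenizer)."""
    return set(text.lower().split())


def token_blocker(table_a, table_b):
    """Hypothetical blocker: keep (a_id, b_id) pairs that share at least one token.

    An inverted index over table B avoids enumerating all of A x B.
    """
    index = defaultdict(set)
    for b_id, b_text in table_b.items():
        for tok in tokens(b_text):
            index[tok].add(b_id)

    candidates = set()
    for a_id, a_text in table_a.items():
        for tok in tokens(a_text):
            for b_id in index.get(tok, ()):
                candidates.add((a_id, b_id))
    return candidates


def recall(candidates, true_matches):
    """Fraction of true matches that survive blocking."""
    return len(candidates & true_matches) / len(true_matches)


# Toy data, invented for this example.
table_a = {"a1": "Apple iPhone 13", "a2": "Dell XPS 15 laptop", "a3": "Sony WH-1000XM4"}
table_b = {"b1": "iphone 13 by apple", "b2": "XPS 15 Dell", "b3": "Bose QC45 headphones"}
true_matches = {("a1", "b1"), ("a2", "b2")}

candidates = token_blocker(table_a, table_b)
all_pairs = len(table_a) * len(table_b)
print(f"kept {len(candidates)} of {all_pairs} pairs, recall = {recall(candidates, true_matches)}")
```

A good blocker keeps the candidate set small relative to |A| · |B| while keeping recall close to 1; the matching step then only has to score the surviving pairs.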

The first contribution is Sparkly, a novel TF-IDF based blocker built on top of Spark and Lucene, using a distributed shared-nothing architecture. The TF-IDF similarity measure is well known in the information retrieval literature but has received very little attention in entity matching research. In developing Sparkly, we explore TF-IDF based blocking for entity matching and demonstrate its effectiveness in a wide range of scenarios. Extensive experiments show that Sparkly outperforms eight state-of-the-art blockers, producing both higher recall and smaller candidate sets. Additionally, we run Sparkly on over 100M tuples and demonstrate near-linear scale-out behavior.
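
The general idea of TF-IDF based blocking can be sketched as follows: index one table, treat each record of the other table as a query, and keep the top-k highest-scoring records as candidates. This is a single-machine toy under stated assumptions, not Sparkly's implementation. The unnormalized TF-IDF dot product, the IDF formula, the names `build_index` and `top_k_block`, and the sample records are all hypothetical, and no Spark or Lucene API is used.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    return text.lower().split()


def build_index(table_b):
    """Inverted index token -> {b_id: term frequency}, plus an IDF weight per token."""
    postings = defaultdict(dict)
    for b_id, text in table_b.items():
        for tok, tf in Counter(tokenize(text)).items():
            postings[tok][b_id] = tf
    n = len(table_b)
    # Assumed IDF variant: log(1 + N / df). Other variants would work as well.
    idf = {tok: math.log(1 + n / len(docs)) for tok, docs in postings.items()}
    return postings, idf


def top_k_block(table_a, table_b, k=1):
    """For each record in A, keep the k records in B with the highest TF-IDF score."""
    postings, idf = build_index(table_b)
    candidates = set()
    for a_id, text in table_a.items():
        scores = defaultdict(float)
        for tok, q_tf in Counter(tokenize(text)).items():
            for b_id, tf in postings.get(tok, {}).items():
                # Unnormalized dot product of the two TF-IDF vectors.
                scores[b_id] += (q_tf * idf[tok]) * (tf * idf[tok])
        # Sort by descending score, breaking ties by id for determinism.
        best = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))[:k]
        candidates.update((a_id, b_id) for b_id, _ in best)
    return candidates


# Toy data, invented for this example.
table_a = {"a1": "apple iphone 13 black", "a2": "dell xps 15"}
table_b = {
    "b1": "iphone 13 apple",
    "b2": "dell xps 15",
    "b3": "dell inspiron 15 laptop",
    "b4": "apple watch",
}

print(sorted(top_k_block(table_a, table_b, k=1)))
```

Because rare tokens such as "xps" receive a higher IDF weight than common ones such as "dell", the record sharing the rare token ranks first. The candidate set size is bounded by k · |A|, which is one reason a top-k formulation is attractive for blocking.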

The second contribution is Delex, a system for combining blocking methods using a powerful declarative language. Delex is built on top of Spark. In real-world applications, users frequently want to combine multiple blocking methods to take advantage of the strengths of each method. Currently, combining multiple blocking methods is done in an ad-hoc way, which leads to both costly development and suboptimal performance. Delex is designed from the ground up for combining blocking methods using a scalable architecture. Experiments show that Delex can effectively optimize blocking plans, reducing runtime by up to threefold, and can scale to large datasets. In addition, we demonstrate the extensibility of Delex by implementing a new blocking method with only 150 lines of code.

The third and final contribution is BigGoat, a benchmark for blocking for entity matching that mirrors how blockers are created for real-world applications. There is a wide variety of entity matching benchmark datasets. However, few, if any, focus on scaling, with the majority of benchmark datasets containing fewer than 1M records. Due in part to this gap, many research blocking solutions cannot scale to large datasets, and hence are not practical for use in real-world applications. To address this problem we created the BigGoat benchmark. BigGoat consists of five realistic datasets with tables having up to 60M records. In creating BigGoat, we also develop a novel downsampling algorithm specifically designed for estimating the recall of blockers.

Collectively, the contributions presented in this dissertation represent a significant advancement in the state of the art in blocking for entity matching. In addition, this work lays the foundation for future research aimed at further improving blocking algorithms and developing more robust evaluation methodologies.

Details

1010268
Business indexing term
Title
Toward Effective Blocking for Entity Matching
Number of pages
193
Publication year
2025
Degree date
2025
School code
0262
Source
DAI-A 87/6(E), Dissertation Abstracts International
ISBN
9798270220532
Advisor
Committee member
Craven, Mark; Graefe, Goetz; Koutris, Paraschos
University/institution
The University of Wisconsin - Madison
Department
Computer Sciences
University location
United States -- Wisconsin
Degree
Ph.D.
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32399748
ProQuest document ID
3282833148
Document URL
https://www.proquest.com/dissertations-theses/toward-effective-blocking-entity-matching/docview/3282833148/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic