Toward Effective Blocking for Entity Matching

Abstract

In this dissertation we study entity matching, a fundamental problem that lies at the heart of data integration for data science and AI. Specifically, we consider the following common entity matching problem: given two tables A and B with the same schema, find all pairs of records (a, b) ∈ A × B that "match", i.e., refer to the same real-world entity.

Typically, entity matching is done in two steps: blocking and matching. The goal of the blocking step is to quickly reduce the number of pairs to be processed later in the matching step, while retaining as many true matches as possible. The goal of the matching step is to accurately predict which record pairs match. In this dissertation we focus on the blocking step and make three major contributions.
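
The trade-off described above, fewer candidate pairs versus retained true matches, can be made concrete with two simple quantities. The following is a minimal sketch, not code from the dissertation: the function name `evaluate_blocker` and the term "reduction ratio" are assumptions introduced here for illustration, and record pairs are assumed to be represented as `(a_id, b_id)` tuples.

```python
def evaluate_blocker(candidates, true_matches, size_a, size_b):
    """Return (recall, reduction_ratio) for a blocker's output.

    candidates   -- iterable of (a_id, b_id) pairs kept by the blocker
    true_matches -- iterable of (a_id, b_id) pairs known to match
    size_a/b     -- number of records in tables A and B
    All names here are hypothetical and chosen for illustration only.
    """
    candidates = set(candidates)
    true_matches = set(true_matches)

    # Recall: fraction of true matches that survive blocking.
    if true_matches:
        recall = len(candidates & true_matches) / len(true_matches)
    else:
        recall = 1.0

    # Reduction ratio: fraction of the full cross product A x B
    # that the matching step no longer has to examine.
    reduction_ratio = 1.0 - len(candidates) / (size_a * size_b)
    return recall, reduction_ratio


if __name__ == "__main__":
    cands = {(1, 1), (1, 2), (2, 2)}
    truth = {(1, 1), (3, 3)}
    print(evaluate_blocker(cands, truth, size_a=3, size_b=3))
```

A good blocker pushes both numbers toward 1: it keeps nearly all true matches while discarding nearly all of A × B.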

The first contribution is Sparkly, a novel TF-IDF based blocker built on top of Spark and Lucene, using a distributed shared-nothing architecture. The TF-IDF similarity measure is well known in the information retrieval literature but has received very little attention in entity matching research. In developing Sparkly, we explore TF-IDF based blocking for entity matching and demonstrate its effectiveness in a wide range of scenarios. Extensive experiments show that Sparkly outperforms eight state-of-the-art blockers, producing both higher recall and smaller candidate sets. Additionally, we run Sparkly on more than 100M tuples and demonstrate near-linear scale-out behavior.
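
To illustrate the general idea of TF-IDF based blocking, the sketch below indexes table A and, for each record in table B, keeps the k highest-scoring records from A as candidates. This is a single-machine, pure-Python illustration under stated assumptions, not Sparkly's implementation (which the abstract describes as built on Spark and Lucene). The function names, the whitespace tokenizer, the smoothed IDF formula, and the cosine scoring are all choices made here for the example.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization is enough for a sketch.
    return text.lower().split()


def tfidf_block(table_a, table_b, k=2):
    """Return a set of (a_id, b_id) candidate pairs.

    table_a, table_b -- dicts mapping record id -> text (hypothetical format).
    For each record in table_b, the k records of table_a with the highest
    TF-IDF cosine score are kept.
    """
    n = len(table_a)
    docs = {a_id: Counter(tokenize(t)) for a_id, t in table_a.items()}

    # Document frequency and a smoothed IDF, computed over table A only.
    df = Counter(tok for tf in docs.values() for tok in tf)
    idf = {tok: math.log((1 + n) / (1 + d)) + 1.0 for tok, d in df.items()}

    # Inverted index: token -> list of (a_id, L2-normalized weight).
    index = defaultdict(list)
    for a_id, tf in docs.items():
        weights = {tok: c * idf[tok] for tok, c in tf.items()}
        norm = math.sqrt(sum(v * v for v in weights.values())) or 1.0
        for tok, v in weights.items():
            index[tok].append((a_id, v / norm))

    candidates = set()
    for b_id, text in table_b.items():
        tf = Counter(tokenize(text))
        # Tokens unseen in table A cannot contribute to any score.
        query = {tok: c * idf[tok] for tok, c in tf.items() if tok in idf}
        qnorm = math.sqrt(sum(v * v for v in query.values())) or 1.0

        # Accumulate cosine scores only for records sharing a token.
        scores = defaultdict(float)
        for tok, v in query.items():
            for a_id, w in index[tok]:
                scores[a_id] += (v / qnorm) * w

        # Highest score first; ties broken by id for determinism.
        top = sorted(scores.items(), key=lambda kv: (-kv[1], str(kv[0])))[:k]
        candidates.update((a_id, b_id) for a_id, _ in top)
    return candidates


if __name__ == "__main__":
    a = {"a1": "apple iphone 13 pro", "a2": "samsung galaxy s21"}
    b = {"b1": "iphone 13 pro max apple"}
    print(tfidf_block(a, b, k=1))
```

Because only records sharing at least one token are ever scored, the candidate set contains at most k pairs per record in B, which is what keeps it far smaller than the full cross product.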

The second contribution is Delex, a system for combining blocking methods using a powerful declarative language. Delex is built on top of Spark. In real-world applications, users frequently want to combine multiple blocking methods to take advantage of the strengths of each method. Currently, combining multiple blocking methods is done in an ad-hoc way, which leads to both costly development and suboptimal performance. Delex is designed from the ground up for combining blocking methods using a scalable architecture. Experiments show that Delex can effectively optimize blocking plans, reducing runtime by up to threefold, and can scale to large datasets. In addition, we demonstrate the extensibility of Delex by implementing a new blocking method with only 150 lines of code.

The third, and final, contribution is BigGoat, a benchmark for blocking for entity matching which mirrors how blockers are created for real-world applications. There is a wide variety of entity matching benchmark datasets. However, few, if any, focus on scaling, with the majority of benchmark datasets containing fewer than 1M records. Due in part to this gap, many research blocking solutions cannot scale to large datasets, and hence are not practical for use in real-world applications. To address this problem, we created the BigGoat benchmark. BigGoat consists of five realistic datasets with tables having up to 60M records. In creating BigGoat, we also develop a novel downsampling algorithm specifically designed for estimating the recall of blockers.

Collectively, the contributions presented in this dissertation represent a significant advancement in the state of the art in blocking for entity matching. In addition, this work lays the foundation for future research aimed at further improving blocking algorithms and developing more robust evaluation methodologies.

Details

Business indexing term

Subject:

Artificial intelligence

Subject

Computer science;
Computer engineering;
Artificial intelligence;
Information science

Classification

0984: Computer science
0464: Computer Engineering
0723: Information science
0800: Artificial intelligence

Identifier / keyword

Data science; Databases; Distributed systems; Entity matching; Indexing; Information retrieval

Title

Toward Effective Blocking for Entity Matching

Author

Paulsen, Derek

Number of pages

193

Publication year

2025

Degree date

2025

School code

0262

Source

DAI-A 87/6(E), Dissertation Abstracts International

ISBN

9798270220532

Advisor

Doan, AnHai

Committee member

Craven, Mark; Graefe, Goetz; Koutris, Paraschos

University/institution

The University of Wisconsin - Madison

Department

Computer Sciences

University location

United States -- Wisconsin

Degree

Ph.D.

Source type

Dissertation or Thesis

Language

English

Document type

Dissertation/Thesis

Dissertation/thesis number

32399748

ProQuest document ID

3282833148

Document URL

https://www.proquest.com/dissertations-theses/toward-effective-blocking-entity-matching/docview/3282833148/se-2?accountid=208611

Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.

Database

ProQuest One Academic
