Abstract

Entity resolution is essential for data integration and enables analytics and insights over complex systems. Multi-source and incremental entity resolution address the challenge of integrating diverse and dynamic data, which is common in real-world scenarios. A critical question is how to classify record pairs from new and existing data sources as matches or non-matches. Traditional threshold-based methods often yield lower quality than machine learning (ML) approaches, while the results of incremental methods can be unstable because they depend on the order in which new data is integrated. In addition, how to reuse training data and existing models for new data sources remains unresolved in multi-source entity resolution. Even transfer learning does not address the question of which source domain should supply the model and training data for a given target domain. Naive strategies that train a new model for each new linkage problem are inefficient. This work addresses these challenges. It focuses on creating and managing models with a small labeling effort, and on selecting suitable models for new data sources based on feature distributions. Our evaluation shows that our method, StoRe, achieves comparable result quality. In terms of efficiency, StoRe outperforms both a multi-source active learning approach and a transfer learning approach, with speedups of up to 48 times over the active learning approach and a factor of 163 over the transfer learning method.
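
The abstract says that models are selected for new data sources based on feature distributions but does not say how the distributions are compared. The following is only a minimal sketch of that general idea, not StoRe's actual procedure. It assumes that each stored model keeps the similarity-feature values it was trained on, that distributions are compared with a two-sample Kolmogorov-Smirnov statistic, and that a fixed cutoff decides whether a stored model is reused. The names `ks_statistic`, `select_model`, `max_distance`, the repository layout, and the sample values are all hypothetical.

```python
from bisect import bisect_right


def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a_sorted, b_sorted = sorted(a), sorted(b)
    gap = 0.0
    for x in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, x) / len(a_sorted)
        cdf_b = bisect_right(b_sorted, x) / len(b_sorted)
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


def select_model(repository, new_features, max_distance=0.3):
    """Pick the stored model whose feature distributions are closest to the new problem.

    repository:   {model_name: {feature_name: [similarity values]}} (assumed layout)
    new_features: {feature_name: [similarity values]} of the new linkage problem
    Returns None if no stored model is within max_distance, which would signal
    that a new model has to be trained.
    """
    best_name, best_dist = None, float("inf")
    for name, stored in repository.items():
        shared = [f for f in new_features if f in stored]
        if not shared:
            continue  # no common features, so the model cannot be compared
        # Average the per-feature distances over the shared features.
        dist = sum(ks_statistic(stored[f], new_features[f]) for f in shared) / len(shared)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name if best_dist <= max_distance else None


# Illustrative, made-up similarity values for two stored models and one new problem.
repository = {
    "model_a": {"name_sim": [0.1, 0.2, 0.9, 0.95], "price_sim": [0.0, 0.1, 0.8, 0.9]},
    "model_b": {"name_sim": [0.4, 0.5, 0.5, 0.6], "price_sim": [0.4, 0.5, 0.6, 0.6]},
}
new_problem = {"name_sim": [0.15, 0.2, 0.85, 0.9], "price_sim": [0.05, 0.1, 0.85, 0.9]}
chosen = select_model(repository, new_problem)
print(chosen)
```

In this sketch, a return value of `None` stands for the case where no existing model fits and labeling effort would have to be spent on a new one.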

Details

Title
Stop Relearning: Model Reuse via Feature Distribution Analysis for Incremental Entity Resolution
Publication title
arXiv.org; Ithaca
Publication year
2024
Publication date
Dec 12, 2024
Section
Computer Science
Publisher
Cornell University Library, arXiv.org
Source
arXiv.org
Place of publication
Ithaca
Country of publication
United States
University/institution
Cornell University Library, arXiv.org
e-ISSN
2331-8422
Source type
Working Paper
Language of publication
English
Document type
Working Paper
Publication history
Online publication date
2024-12-13
Milestone dates
2024-12-12 (Submission v1)
First posting date
13 Dec 2024
ProQuest document ID
3144199241
Document URL
https://www.proquest.com/working-papers/stop-relearning-model-reuse-via-feature/docview/3144199241/se-2?accountid=208611
Copyright
© 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2024-12-14
Database
ProQuest One Academic