Abstract

With technological innovations, enterprises manage every piece of data they collect, since it can be mined to derive business intelligence (BI). However, when data comes from multiple sources, duplicate records may result. Because data is of paramount importance, eliminating duplicate entities is essential for data integration, performance, and resource optimization. Of late, deep learning has offered promising means of building reliable record deduplication systems through a learning-based approach. Deep ER is one of the deep learning-based methods recently used to eliminate duplicates in structured data. Using it as a reference model, in this paper we propose a framework known as Enhanced Deep Learning-based Record Deduplication (EDL-RD) to improve performance further. To this end, we exploit a variant of Long Short-Term Memory (LSTM) along with various attribute compositions, similarity metrics, and numerical and null value resolution. We propose an algorithm known as Efficient Learning based Record Deduplication (ELbRD), which extends the reference model with the aforementioned enhancements. An empirical study reveals that the proposed framework with these extensions outperforms existing methods.

Details

Title
An efficient learning based approach for automatic record deduplication with benchmark datasets
Author
Ravikanth, M 1 ; Korra, Sampath 2 ; Mamidisetti, Gowtham 1 ; Goutham, Maganti 1 ; Bhaskar, T. 3 

1 Malla Reddy University, Department of CSE, Maisammaguda, Kompally, Hyderabad, India
2 Sri Indu College of Engineering and Technology (A), Department of CSE, Sheriguda, Ibrahimpatnam, Hyderabad, India
3 Department of CSE, CMR College of Engineering and Technology, Kandlakoya, Medchal, Hyderabad, India (ISNI:0000 0004 6822 5265)
Pages
16254
Publication year
2024
Publication date
2024
Publisher
Nature Publishing Group
e-ISSN
20452322
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3080883010
Copyright
© The Author(s) 2024. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.