
Abstract

The study aimed to propose methods to improve the data integrity of relational databases such as MS SQL, MySQL, and PostgreSQL through duplicate record detection. The FODORS and ZAGAT restaurant benchmark datasets were used to facilitate the processes involved in preparing and delivering high-quality data. The Levenshtein distance algorithm was used to propose three (3) methods, namely default, eliminating equal strings, and knowledge-based libraries, to reduce duplicate records in the database. At the selected 70% threshold, an average of 88 out of 112 records in the restaurant dataset were detected as duplicates. Finally, how efficiently duplicate records are detected in the database depends on the data being analyzed and the threshold selected.
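The idea of threshold-based duplicate detection with Levenshtein distance can be illustrated with a minimal sketch. It is not the author's implementation: the abstract does not say how the edit distance is converted to a percentage, so the normalization by the longer string's length, the function names, and the sample restaurant names are all assumptions made for illustration. Only a plain pairwise comparison is shown; the equal-string elimination and knowledge-based library variants are not described in enough detail here to sketch.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def find_duplicates(records, threshold=0.70):
    """Return index pairs whose similarity meets the threshold (70% assumed)."""
    pairs = []
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if similarity(records[i].lower(), records[j].lower()) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical sample values, not taken from the benchmark datasets.
names = ["hotel bel-air", "hotel bel air", "spago"]
duplicate_pairs = find_duplicates(names)
```

Comparing every pair is quadratic in the number of records, which is acceptable for a dataset of a few hundred rows but would need blocking or indexing for larger tables. Raising or lowering `threshold` trades missed duplicates against false matches, which is consistent with the abstract's remark that results depend on the data and the threshold selected.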

Details

1009240
Business indexing term
Title
Deduplication Methods Using Levenshtein Distance Algorithm
Author
Valeriano, Eugene S 1

1 Tarlac Agricultural University, Santa Ignacia, Philippines
Publication title
Volume
20
Issue
7s
Pages
997-1006
Publication year
2024
Publication date
2024
Publisher
Engineering and Scientific Research Groups
Place of publication
Paris
Country of publication
France
e-ISSN
1112-5209
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
ProQuest document ID
3081859542
Document URL
https://www.proquest.com/scholarly-journals/deduplication-methods-using-levenshtein-distance/docview/3081859542/se-2?accountid=208611
Copyright
© 2024. This work is published under https://creativecommons.org/licenses/by/4.0/legalcode (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2024-08-05
Database
2 databases
  • Coronavirus Research Database
  • ProQuest One Academic