Abstract

In this paper, we introduce an advanced Fuzzy Record Similarity Metric (FRMS) that improves approximate record matching and models human perception of record similarity. The FRMS combines a newly developed similarity space with favorable properties and a metric space, and employs a bag-of-words model with general applications in text mining and cluster analysis. To optimize the FRMS, we propose a two-stage method for approximate string matching and search that outperforms baseline methods in terms of average time complexity and F-measure on various datasets. In the first stage, we construct an optimal Q-gram count filter that serves as an optimal lower bound for fuzzy token similarities such as FRMS. The approximated Q-gram count filter achieves a high accuracy rate, filtering out over 99% of dissimilar records, with a constant time complexity of O(1). In the second stage, FRMS runs in polynomial time of approximately O(n^4) and models human perception of record similarity by maximum weight matching in a bipartite graph. The FRMS architecture has widespread applications in structured document storage such as databases and has already been commercialized by one of the largest IT companies. As a side result, we explain the behavior of the singularity of the Q-gram filter and the advantages of a padding extension. Overall, our method provides a more accurate and efficient approach to approximate string matching and search that runs in real time.
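
To illustrate the first-stage idea of a Q-gram count filter, the following is a minimal sketch of the classic (textbook) count filter for edit distance, not the optimal filter constructed in the paper. It assumes unpadded q-grams and relies on the well-known bound that two strings within edit distance k share at least max(|a|, |b|) - q + 1 - k*q q-grams. The function names `qgrams`, `shared_qgrams`, and `passes_count_filter` and the default q = 2 are illustrative assumptions.

```python
from collections import Counter


def qgrams(s: str, q: int = 2) -> Counter:
    """Multiset of overlapping substrings of length q (no padding)."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(a: str, b: str, q: int = 2) -> int:
    """Size of the multiset intersection of the two q-gram profiles."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())


def passes_count_filter(a: str, b: str, k: int, q: int = 2) -> bool:
    """Classic count filter (an assumption here, not the paper's filter).

    If edit_distance(a, b) <= k, the strings share at least
    max(len(a), len(b)) - q + 1 - k * q q-grams. A pair below that
    bound cannot match and can be discarded before the expensive
    second-stage comparison.
    """
    bound = max(len(a), len(b)) - q + 1 - k * q
    return shared_qgrams(a, b, q) >= bound


if __name__ == "__main__":
    print(passes_count_filter("kitten", "sitten", k=1))
    print(passes_count_filter("kitten", "abcdef", k=1))
```

A pair that passes the filter is only a candidate; it still has to be scored by the full similarity measure. A pair that fails can be skipped safely under the stated edit-distance assumption.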

Details

1009240
Company / organization
Title
Real-Time Fuzzy Record-Matching Similarity Metric and Optimal Q-Gram Filter
Author
Rozinek, Ondřej 1; Marek, Jaroslav 2; Panuš, Jan 3; Mareš, Jan 4

1 Rozinet s.r.o., U Josefa 110, 532 10 Pardubice, Czech Republic; [email protected]; Department of Information Technology, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic; [email protected]; Department of Software Engineering, Faculty of Information Technology, Czech Technical University in Prague, Thákurova 9, 166 34 Prague, Czech Republic
2 Department of Automation and Mathematics, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic; [email protected]
3 Department of Information Technology, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic; [email protected]
4 Department of Automation and Mathematics, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic; [email protected]; Department of Mathematics, Informatics and Cybernetics, University of Chemistry and Technology Prague, Technicka 5, 166 28 Prague, Czech Republic
Publication title
Algorithms; Basel
Volume
18
Issue
3
First page
150
Publication year
2025
Publication date
2025
Publisher
MDPI AG
Place of publication
Basel
Country of publication
Switzerland
Publication subject
e-ISSN
19994893
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-03-06
Milestone dates
2024-09-16 (Received); 2024-12-02 (Accepted)
First posting date
2025-03-06
ProQuest document ID
3181337594
Document URL
https://www.proquest.com/scholarly-journals/real-time-fuzzy-record-matching-similarity-metric/docview/3181337594/se-2?accountid=208611
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-11-07
Database
ProQuest One Academic