In this paper, we introduce an advanced Fuzzy Record Similarity Metric (FRMS) that improves approximate record matching and models human perception of record similarity. The FRMS combines a newly developed similarity space, which has favorable properties, with a metric space, and employs a bag-of-words model with general applications in text mining and cluster analysis. To optimize the FRMS, we propose a two-stage method for approximate string matching and search that outperforms baseline methods in average time complexity and F-measure on various datasets. In the first stage, we construct an optimal Q-gram count filter as an optimal lower bound for fuzzy token similarities such as the FRMS. The approximated Q-gram count filter achieves high accuracy, filtering out over 99% of dissimilar records, with a constant time complexity of
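For orientation, the general idea behind a Q-gram count filter can be sketched as follows. This is the classic textbook count filter for edit distance, not the optimal or approximated filter constructed in the paper, and it assumes unpadded q-grams compared as multisets. The function names (`qgrams`, `shared_qgrams`, `passes_count_filter`) and the parameters `q` and `k` are hypothetical, chosen only for illustration. The sketch relies on the known bound that two strings within edit distance k share at least max(|a|, |b|) - q + 1 - k*q q-grams, so any pair below that threshold can be discarded without computing an exact similarity.

```python
from collections import Counter


def qgrams(s, q=2):
    """Return the multiset (bag) of unpadded q-grams of s."""
    return Counter(s[i:i + q] for i in range(len(s) - q + 1))


def shared_qgrams(a, b, q=2):
    """Count q-grams common to a and b, respecting multiplicity."""
    return sum((qgrams(a, q) & qgrams(b, q)).values())


def passes_count_filter(a, b, k, q=2):
    """Classic count filter (illustrative, not the paper's filter).

    If the edit distance between a and b is at most k, they share at
    least max(len(a), len(b)) - q + 1 - k * q q-grams. A pair that
    fails this test cannot be within distance k and can be skipped.
    A pair that passes is only a candidate and still needs verification.
    """
    threshold = max(len(a), len(b)) - q + 1 - k * q
    return shared_qgrams(a, b, q) >= threshold


if __name__ == "__main__":
    # One substitution apart: kept as a candidate.
    print(passes_count_filter("pardubice", "pardubise", k=1))
    # Unrelated strings: rejected without an exact comparison.
    print(passes_count_filter("pardubice", "studentska", k=1))
```

The filter is one-sided: it never rejects a true match within distance k, but pairs that pass must still be checked by the exact similarity in a second stage.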
Subject terms: Commercialization; Software; Similarity; Search engines; Datasets; Deep learning; Approximate string matching; Ontology; Assignment problem; Polynomials; Metric space; Cluster analysis; Natural language; Time; Document storage; Graph theory; Neural networks; Optimization; Perception; Linear programming; Methods; Phonetics; Algorithms; Complexity; Real time; Semantics; Databases; Property; Storage; Matching; Data mining
; Marek, Jaroslav 2
; Panuš, Jan 3
; Mareš, Jan 4
1 Rozinet s.r.o., U Josefa 110, 532 10 Pardubice, Czech Republic;
2 Department of Automation and Mathematics, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic;
3 Department of Information Technology, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic;
4 Department of Automation and Mathematics, Faculty of Electrical Engineering and Informatics, University of Pardubice, Studentská 95, 532 10 Pardubice, Czech Republic;