1. Introduction
Digitalization now touches every part of our lives. Data analytics systems consequently provide enormous advantages in the healthcare industry [1,2] and in other sectors, such as security [3], text processing [4], and finance [5]. Modern medical systems produce enormous amounts of data every day, and mining and analysis are required to extract valuable information and identify hidden patterns in these data [6]. The benefits of data analytics in healthcare extend from disease investigation to treatment, including the analysis of the pathology report, which interprets the results from a patient's body sample; these reports are undoubtedly among the most critical documents in medicine [7]. Typically, pathology reports include various pieces of essential information about the patient's symptomatology and are written in free-text semi-structured or unstructured formats. Data analysts therefore manually examine the pathology reports, extract useful information, and interpret it in compliance with the diagnostic features [8]. Finally, the results are finalized and entered into the database through several computer processes [9]. Text mining has developed into a practical computational tool that accurately translates pathology reports into a usable structured representation by extracting only the information relevant to hematological disease; a string matching algorithm, for example, can provide helpful tools for this analysis [10]. Any new technology that can automatically handle pathology report data is therefore of interest, and this paper accordingly consolidates and characterizes the use of CBC-driven parameters with a hybrid algorithm. Applying data analytics technology in healthcare must, however, contend with several challenges [11]:
Data Quality;
Data Variety;
Data Validity, meaning the suitability of the data for their intended application;
Data Security;
Data Storage, given the large volumes of data supplied;
Data Visualization, which may be needed in some cases;
Healthcare data should be updated frequently to remain current and valuable.
Unlike previous studies, this paper proposes an efficient, dedicated string matching algorithm for the pathological analysis process that overcomes the limitations of the best algorithms identified in the comparative study step. The main contributions are as follows:
This paper examines the challenges likely to be encountered when applying data analytics technology to healthcare.
It surveys several studies that present the possibilities of using data analytics in healthcare application sectors.
It also identifies the most powerful string matching methods for healthcare use by performing a comparative study of these methods.
Moreover, an enhanced Rabin–Karp algorithm is presented that, after several modifications, finds exact matches; when no exact match exists, the fuzzy ratio method is used to find the approximate match.
The paper is organized as follows: Section 2 provides a brief review of the literature on healthcare test analyses and the string matching algorithms that support analysis systems. Section 3 presents the proposed system methodology, while Section 4 interprets and describes the significance of the results of the proposed method. Finally, Sections 5 and 6 provide a discussion and conclusions that summarize what researchers can learn from these results.
2. Literature Review
Healthcare is a complicated industry with diverse stakeholders, including patients, doctors, hospitals, and health organizations. It is also worth noting that this industry has evolved such that treating patients is no longer the only emphasis of healthcare; instead, promoting healthy lifestyles and preventing diseases can be placed at the top of decision-makers' priorities, depending on the patient's history [12,13]. The patient's history can be created from many pathological reports of different types together with other patient information, which can be accomplished using data analytics tools and techniques. One such critical tool is a string matching algorithm explicitly designed to deal with the terminology in medical reports. Accordingly, many studies in the literature have reviewed string matching algorithms for DNA pattern sequence matching, health patient records, and pathological test analyses.
DNA Pattern Sequence Matching: DNA sequencing is crucial in applications such as the life and agricultural sciences. Unfortunately, the sheer volume of DNA sequence data now poses a significant computational difficulty [14], so practical DNA sequence analysis is an important problem to solve. Matching patterns across various species by precisely matching patterns in DNA sequences has been a significant problem in biomedical engineering, biotechnology, and health informatics [15]. Many papers have addressed the exact matching problem. For example, the authors of [16] suggested a new approach that utilizes a barcode to achieve compression against an already-described reference sequence; this is performed with Bloom filters, which hash all reads into the Bloom filter and then decode them for querying. In [14], the suggested solution is based on multiple Bloom filters, which locate every instance of a desired sequence inside a DNA sequence and count its occurrences, because these elements play a crucial role in determining the nature and severity of a disease. Moreover, the authors of [15] presented a packed-string model for optimally matching multiple patterns with wildcards in DNA sequences; because the operations resemble assembly code, the results proved more effective than competing approaches.
Health Patient Record: Practical patient matching algorithms are essential given the enormous amount of medical data generated digitally, and unique patient information must be effectively connected across numerous sources [17]. Many papers have addressed these issues. In [18], the authors propose a naturalistic method that allows manual case evaluation when necessary but is predicated on a reasonable person's reasoning when comparing medical records; this technique is a hybrid of deterministic and probabilistic systems, and the derived algorithm was validated on medical patient records. In the same year, ref. [19] described a prescription model that captures the variability and flexibility within drug prescriptions, along with a text analysis system that extracts structured data from free-text dosing instructions; the proposed method was tested on an anonymized electronic records database. Further, ref. [20] presented a similar hybrid technique that efficiently runs combined string and language-dependent phonological pattern matching on a collection of Portuguese-language medical records.
Pathological Test Analysis: Interest has now extended from structured data to unstructured textual data, owing to the increased availability and quality of text mining tools, and it is becoming increasingly apparent that these methods offer significant advantages for statistical research [21]. Several papers have therefore designed strategies to achieve these advantages. In [22], the authors discuss an easily hand-tweaked pattern matching technique that automatically de-identifies medical content written in Dutch. The procedure has two phases: first, a list of protected health information categories is chosen with the help of the medical team; second, an approach using fuzzy string matching, lookup tables, and decision rules de-identifies every piece of information in one of these categories. Furthermore, to ensure that medical (pathological) tests are considered valid for only a certain period, the authors of [23] suggested an automated technique for discovering typical laboratory test order patterns in electronic health records, allowing users to learn the shelf-life of certain laboratory tests.
3. The Proposed System Methodology
This methodology covers the creation of the general patient report (comprehensive prescription information). The method entails several steps, as shown in Figure 1, which depicts the workflow of the proposed system.
3.1. Research Setting and Data Collection
In this paper, the CBC reports used as research samples were collected from two main sites: the laboratory of Al-Zahra Hospital and the Hematology Center of the Medical City in Baghdad, Iraq. First, the data are aggregated from each patient as an initial step toward building a unified health information system to computerize patient records and create a general report for each patient. Then, 150 of the most common terms used in the reports (e.g., WBC, RBC, …), together with their numerical values, were selected as testing data. Next, the raw data were aggregated at the individual patient level and extracted for addition to the patient records. When a patient is added for the first time, a new record is created and a unique code is generated as the patient identifier; this patient ID consists of 12 digits (for example, XXXX-XXXX-XXXX), whose content is as follows (a minimal sketch of generating such an ID follows the list):
The first pair indicates the computer number (device) that was used to add the patient ID;
Two digits refer to the year (its value is the time of the ID’s addition);
Two digits refer to the month (its value is the time of the ID’s addition);
Two digits refer to the day (its value is the time of the ID’s addition);
Two digits refer to the hour (its value is the time of the ID’s addition);
Two digits refer to the seconds (its value is the time of the ID’s addition).
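To make the ID layout concrete, the following is a minimal sketch of how such an identifier could be generated; the function name and the use of the system clock are illustrative assumptions rather than the paper's actual code:

```python
from datetime import datetime

def generate_patient_id(device_no: int) -> str:
    """Build the 12-digit patient ID described above: a two-digit device
    number followed by the YY, MM, DD, HH, and SS pairs, grouped as
    XXXX-XXXX-XXXX. (The scheme above lists hours and then seconds.)"""
    digits = f"{device_no:02d}" + datetime.now().strftime("%y%m%d%H%S")
    return f"{digits[0:4]}-{digits[4:8]}-{digits[8:12]}"

print(generate_patient_id(7))  # e.g., '0725-0114-0933'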
3.2. Automated Text Processing
- Reading Pathological Reports: The initial step extracts the current values (test names and test values) from the file to facilitate data collection and update the patient's history under the proposed method. The reports may be images or PDF files, and because each file type requires a different reading process, which can affect the extracted text and increase the chance of errors, the reading and analysis of text from pathological reports is unified. This step therefore has two parts: first, all PDF files are converted to image files (so that every file is an image); second, the text is extracted from each image using the Tesseract-OCR engine (originally developed by Hewlett-Packard [24]). After the data are extracted, the subsequent stages of the proposed system update the patient's information based on the extracted values.
- Data Cleaning: When text is extracted from a file, irrelevant symbols such as "*", ";", and so on can appear. The text is therefore processed to remove this irrelevant information, retain only the essential words and their values, and export the result to a CSV file for further processing.
- Error Handling: Pathological reports may contain handwritten terms or values, so the system should be trained to read such files and verify that these terms are read correctly. Moreover, the system should inspect all of the words in the text to separate familiar (standard) words from unfamiliar (possibly misspelled) ones. A similarity metric (such as the Jaro–Winkler, Dice coefficient, matching coefficient, or overlap coefficient) [25] or a distance metric (such as the Levenshtein, Damerau–Levenshtein, longest common subsequence, or N-Gram) [26] should be used to handle unusual words in pathological reports. When an error occurs in an analysis term, for example, in the white blood cell count (e.g., WBD instead of WBC), there are two types of solutions, as shown below and sketched in code after this list:
– Adopt the modified string matching algorithm together with the word dictionary method (which contains the common pathological words) to compute the similarity percentage and accept the word with the smallest difference ratio, subject to a pre-specified threshold. The laboratory specialist should set this threshold in advance and assume the risk of adopting a wrong analysis term, given the importance of this information.
– If the specified threshold is not met, the system should display a message alerting the user to an ambiguous analysis term in the pathological report that the standard methods could not identify, and present the closest term.
If values are missing from the pathological report, the system can adopt the values found in previous reports within a time limit (e.g., not exceeding three months) to respect the shelf-life of the information. If no such values are available within this limit, a message is displayed describing the situation (i.e., the findings acquired and their date of acquisition).
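The following is a minimal sketch of the reading, cleaning, and error-handling steps described above; the pdf2image and pytesseract packages, the excerpt dictionary, and difflib as the similarity metric are illustrative assumptions, not the paper's code:

```python
import re
from difflib import get_close_matches
from pdf2image import convert_from_path  # assumed: pip install pdf2image
from PIL import Image
import pytesseract                       # assumed: pip install pytesseract

STANDARD_TERMS = ["WBC", "RBC", "HGB", "PLT"]  # excerpt of the standard dictionary

def read_report(path: str) -> str:
    # Unify the input: PDFs are first converted to images (Section 3.2),
    # so every file then follows the same OCR path.
    if path.lower().endswith(".pdf"):
        images = convert_from_path(path)
    else:
        images = [Image.open(path)]
    text = "\n".join(pytesseract.image_to_string(img) for img in images)
    # Data cleaning: drop irrelevant symbols such as "*" and ";".
    return re.sub(r"[*;]+", " ", text)

def correct_term(term: str, threshold: float = 0.6) -> str | None:
    # Error handling: accept the closest standard term above the
    # laboratory-specified threshold, otherwise flag the term as ambiguous.
    hits = get_close_matches(term.upper(), STANDARD_TERMS, n=1, cutoff=threshold)
    return hits[0] if hits else None

print(correct_term("WBD"))  # 'WBC': one OCR character error corrected
```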
3.3. Best Algorithms Selection
In this research, various string matching methods were used to determine which is the most effective for creating a string similarity measure for a pathological report. Twelve methods, classified by their working principle, were implemented and are listed as follows:
Edit-based methods: These algorithms determine how many operations are required to change one word into another; the less similar two input words are, the more operations are needed [27]. In this paper, three distances were used (the Hamming distance, Damerau–Levenshtein, and Levenshtein) to test the performance of this type on pathological reports.
Token-based methods: These algorithms expect a collection of tokens rather than whole strings as input; tokens can be single characters, N-Grams, or whole words. Quantification is based on calculating the size of the overlap and normalizing it by a measure of string length [28]. Four measures were used in this paper, the Jaccard distance, N-Gram, bag distance, and Sørensen–Dice coefficient, to test the performance of this type on pathological reports.
Sequence-based methods: These algorithms seek the longest sequence common to both strings; detecting longer shared sequences yields a higher similarity score [28]. In this paper, only the longest common substring algorithm is implemented and compared with the other types of algorithms.
Naïve String Matching Algorithm: This compares the provided pattern against every position in the supplied text; each comparison takes time proportional to the pattern length, and the number of positions is determined by the text length. In contrast, the modified version of the Naïve algorithm (Rabin–Karp) relies on hash values for the pattern search to reduce this time [29]. This paper implemented both the Naïve and Rabin–Karp algorithms to find the one more suitable for pathological report analysis.
Fuzzy methods: Standard matching algorithms find matches only within a low difference threshold or only exact matches. Fuzzy techniques address this problem by computing a similarity ratio based on the Levenshtein distance [30]. In this paper, two such methods were implemented in Python to determine the most effective one for pathological analysis operations; an illustrative implementation of two of the measures listed above follows.
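For illustration, one edit-based and one token-based measure can be written in a few lines of plain Python; this is an illustrative sketch, not the paper's benchmark code:

```python
def levenshtein(a: str, b: str) -> int:
    """Edit-based: minimum number of insert/delete/substitute operations."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def jaccard_bigrams(a: str, b: str) -> float:
    """Token-based: overlap of the two character-bigram sets."""
    A = {a[i:i + 2] for i in range(len(a) - 1)}
    B = {b[i:i + 2] for i in range(len(b) - 1)}
    return len(A & B) / len(A | B) if A | B else 1.0

print(levenshtein("WBD", "WBC"))               # 1: a single substitution
print(round(jaccard_bigrams("WBD", "WBC"), 2)) # 0.33
```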
3.4. Methodology
The proposed system for analyzing pathological reports utilizes a string matching strategy, which is simple and practical given the limited vocabulary of these reports. Accordingly, an enhanced algorithm that provides both accuracy and a fast response was designed and implemented for the proposed system. The algorithm design rests on two essential observations:
Most of the words found in the CBC tests are no more than five letters long;
In addition, the words usually do not contain special symbols.
A coding system was therefore designed around these cases, and the hashing principle was applied using the five-letter maximum length. This saves considerable time while maintaining accuracy as much as possible. The steps of the proposed system are now discussed, step by step, as follows.
First, the reports were obtained from the laboratory as image or PDF files and automatically converted into the CSV file format. The reports were then pre-processed by removing the irrelevant text artifacts created by the OCR software and by handling the handwriting and misspelling errors that the OCR software or other sources may introduce. The string matching algorithm enables the system to manage such errors by matching words against the standard data items (stored in the standard dictionary). A data item is a string or number used to denote a piece of information within a general patient record (e.g., the patient's unique ID, age, sex, and other essential information in the CBC report, such as the WBC). After implementing 12 different string matching algorithms, the best two algorithms, according to the implementation results, were hybridized with a modified version of the Rabin–Karp algorithm to improve the processing time. First, however, the basic steps of the Rabin–Karp algorithm must be explained, as shown in Algorithm 1, where q is a prime number, and n and m are the numbers of characters in the text and the pattern, respectively. The main problems with this algorithm are that its ability to find exact-match patterns depends on the hash values and that it contains steps that increase the processing time, such as recalculating the hash value each time and repeating division and multiplication operations. Moreover, this algorithm requires a large prime integer to prevent identical hash values for different words [31]. On the other hand, the fuzzy method can find the closest words, but it takes much more time than the Rabin algorithm. Therefore, in this paper, a new hybrid algorithm is proposed that combines the Rabin algorithm with the fuzzy method to exploit the advantages and overcome the limitations of these two techniques.
| Algorithm 1 Rabin–Karp Matching Algorithm |
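For reference, a standard textbook implementation of the Rabin–Karp idea (rolling hash plus verification on hash hits) is shown below; the prime q and alphabet size d are illustrative defaults, and this is not the paper's modified variant:

```python
def rabin_karp(text: str, pattern: str, q: int = 101, d: int = 256) -> list[int]:
    """Classic Rabin–Karp: rolling hash modulo a prime q over an alphabet of size d."""
    n, m = len(text), len(pattern)
    if m == 0 or m > n:
        return []
    h = pow(d, m - 1, q)                  # weight of the leading character
    p_hash = t_hash = 0
    for i in range(m):                    # initial hash values of pattern and window
        p_hash = (d * p_hash + ord(pattern[i])) % q
        t_hash = (d * t_hash + ord(text[i])) % q
    hits = []
    for s in range(n - m + 1):
        if p_hash == t_hash and text[s:s + m] == pattern:  # verify on hash hit
            hits.append(s)
        if s < n - m:                     # roll the window one character forward
            t_hash = (d * (t_hash - ord(text[s]) * h) + ord(text[s + m])) % q
    return hits

print(rabin_karp("the cell count", "cell"))  # [4]
```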
The main idea is to exploit the speed of the Rabin algorithm when processing words with an exact match and to fix the maximum term length, because the longest word in the standard report is only five letters. A new equation for generating the hash value, shown in Equation (1), was therefore designed along with a dedicated coding system for report handling that deals only with the alphabet (A to Z), which reduces the required calculations. Suppose that we want to find the hash value for the term abcde:
hash(abcde) = code(a)·26^4 + code(b)·26^3 + code(c)·26^2 + code(d)·26^1 + code(e)·26^0 (1)
This equation eliminates the remainder (modulo) operations and reduces the required arithmetic: with at most five letters over a 26-symbol alphabet, the hash value fits comfortably in a machine integer, so no large prime modulus is needed. First, the hash value of the input term (T) is calculated only once. Then, the length of each stored term is compared with the length of T; if the lengths match, the hash value of the candidate word is calculated; otherwise, the next word is considered. If no exact match is found, the fuzzy method is used to find the closest term, selecting the one with the smallest difference ratio. The steps of the proposed Razy algorithm are listed in Algorithm 2.
| Algorithm 2 Razy Algorithm |
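To make the flow of Algorithm 2 concrete, the following is a minimal sketch of the steps described above; it assumes an A = 0 to Z = 25 coding (the paper does not list the exact code values) and uses difflib's similarity ratio as a stand-in for the fuzzy ratio:

```python
from difflib import SequenceMatcher

CODE = {c: i for i, c in enumerate("ABCDEFGHIJKLMNOPQRSTUVWXYZ")}  # assumed coding

def term_hash(term: str) -> int:
    """Equation (1): positional base-26 value over the (at most five) letters."""
    h = 0
    for ch in term.upper():
        h = h * 26 + CODE[ch]
    return h

def razy_match(term: str, dictionary: list[str], threshold: float = 0.8):
    """Exact pass first (length filter, then hash), then a fuzzy fallback."""
    t_hash = term_hash(term)  # computed only once for the input term
    for word in dictionary:
        # With this coding the hash is collision-free for same-length
        # alphabetic terms, so a hash hit here is already an exact match.
        if len(word) == len(term) and term_hash(word) == t_hash:
            return word, 1.0
    # No exact match: fall back to the closest term by similarity ratio.
    def ratio(w: str) -> float:
        return SequenceMatcher(None, term.upper(), w).ratio()
    best = max(dictionary, key=ratio)
    return (best, ratio(best)) if ratio(best) >= threshold else (None, ratio(best))

print(razy_match("WBC", ["WBC", "RBC", "HGB"]))                 # ('WBC', 1.0)
print(razy_match("WBD", ["WBC", "RBC", "HGB"], threshold=0.6))  # closest term wins
```

The one-pass hash computation and the length filter are what remove the repeated division and multiplication steps criticized above.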
4. Results
The results of the implemented algorithms are listed in three tables (Table 1, Table 2 and Table 3). The average waiting time (i.e., the time that passes between the moment a match is requested and the moment it completes) of each algorithm in these tables is calculated and presented in Table 4, which reports the average execution (implementation) time; a sketch of how such timings are commonly gathered follows. Then, based on the minimum average waiting time values, a new hybrid algorithm is proposed that merges the best-performing algorithms to exploit their benefits and overcome their individual limitations.
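For context, timings of this kind are commonly collected with a small harness such as the following sketch; this illustrates the measurement idea only and is not the paper's instrumentation:

```python
import time

def avg_waiting_time(match_fn, term, dictionary, repeats: int = 1000) -> float:
    """Average time between requesting a match and its completion, in microseconds."""
    start = time.perf_counter()
    for _ in range(repeats):
        match_fn(term, dictionary)
    return (time.perf_counter() - start) / repeats * 1e6

# Example with the razy_match sketch from Section 3.4:
# print(avg_waiting_time(razy_match, "WBC", ["WBC", "RBC", "HGB"]))
```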
A comparison was conducted to measure the execution time efficiency of the proposed algorithm against the optimized Damerau–Levenshtein and Dice coefficients using enumeration operations (ODADNEN) [26], as shown in Table 5. In addition, to evaluate the proposed algorithm's efficiency, this paper considers a standard S1 dataset (English data only) for comparison with [32]. The results of this comparison are shown in Table 6, which demonstrates the efficiency and flexibility of the proposed method.
On the other hand, the performance of the proposed approach is assessed in terms of the metrics listed in Table 7. These metrics include the F1-score, recall, specificity, accuracy, precision, sensitivity, N-value, and P-value [33,34,35,36,37]. The true positive, true negative, false positive, and false negative counts used in these metrics are denoted by TP, TN, FP, and FN, respectively. In this case, TN represents the number of words in the dataset that do not match the input term, whereas TP represents the number of words that were compared to the input term and are the same.
The assessment results are presented in Table 8, which compares the proposed Razy approach with the Fuzzy approach and confirms the effectiveness and superiority of the proposed approach. The accuracy achieved by the proposed Razy method is 0.99973894 for the term "The" and 0.999152054 for the term "Good", whereas the corresponding values achieved by the Fuzzy approach are 0.988385718 and 0.9821751. Similarly, the achieved values of the other metrics, such as the sensitivity, specificity, PPV, NPV, and F1-score, are superior for the proposed approach compared to the existing fuzzy method.
To evaluate the statistical significance of the proposed Razy approach, two sets of experiments were conducted. The first set targeted the statistical analysis of the achieved results, presented in Table 9. The second set was conducted using the Wilcoxon signed rank test, and the results are illustrated in Table 10; an illustrative SciPy call of this kind follows. As presented in these tables, the proposed approach is stable and statistically significant based on the recorded results.
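For illustration, a one-sample test of this kind can be reproduced with SciPy; the series below is a placeholder, since Table 9 reports only summary statistics rather than the raw measurements:

```python
from scipy.stats import wilcoxon  # assumed dependency: SciPy

# Placeholder accuracy series (ten distinct positive values), standing in
# for the ten recorded accuracies; not the paper's raw data.
razy_acc = [0.9990 + 0.0001 * i for i in range(10)]

stat, p = wilcoxon(razy_acc)  # one-sample test against a theoretical median of 0
print(p)                      # two-tailed p ≈ 0.002 when all ten values are positive
```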
On the other hand, to clearly show the superiority of the proposed approach, the accuracy plots for both the proposed Razy and the Fuzzy approaches are shown in Figure 2. As shown in this figure, the accuracy of the proposed approach is stable at a significantly high value, whereas the accuracy of the Fuzzy approach varies from low to high and remains below that of the proposed approach.
The behavior of the accuracy of the proposed approach in comparison to the Fuzzy approach is evident in the histogram shown in Figure 3. In this figure, the accuracy histogram of the proposed approach is shown in blue and is almost perfect, without variation, whereas the accuracy histogram of the Fuzzy approach varies and is lower in value. These results emphasize the superiority and effectiveness of the proposed approach in text mining tasks.
5. Discussion
Text mining uses string matching algorithms in many contexts, such as document classification (i.e., content analysis and plagiarism detection). However, healthcare analytics systems must optimize large-scale resource administration and availability given the massive increases in resources, users, services, and system contents, and identifying effective methods for solving these problems can be challenging in light of the applications and needs. One of the main goals of this paper is therefore to provide a critical analysis of the baseline or benchmark methodologies in terms of their advantages and disadvantages. The proposed system can analyze pathological report files (images or PDFs) to extract important information, find misspelled words, and avoid both the information loss caused by approximate string matching and the limitations of exact matching approaches for detecting misspelled variations, all within an acceptable processing time on a large dataset. This paper combines the modified Rabin method with a fast fuzzy similarity search technique capable of using a small difference-ratio threshold (<1) over the created dictionary of standard terms. The proposed algorithm was implemented in Python, and the findings show that the proposed method can be applied effectively to finding lexically related words in the health sector. The performance of the proposed algorithm was tested using two datasets, and the results are presented in Table 5 and Table 6. These tables show how the proposed method's efficiency compares with that of its competitors, since it also uses the hash concept but with less complex operations to speed up computation and processing. The theoretical analysis and experimental results therefore show that the performance of the proposed method is better than the state-of-the-art techniques and is particularly useful for pathological analysis.
The proposed algorithm was designed around the basic properties of the words found in pathological reports: the absence of special characters, which affects the encoding system, and the word length, which affects the hashing process in the Rabin–Karp algorithm (i.e., most words are five letters long or less). Moreover, to handle the other cases (words longer than five letters; words for which no exact match has been found; files that are unclear, noisy, or contain handwritten words), the fuzzy ratio is used. The proposed algorithm can be summarized as follows:
Razy is a hybrid algorithm that combines a modified Rabin–Karp algorithm with the fuzzy ratio and a special coding system:
The coding system encodes symbols according to the symbols that exist in the pathological reports.
A modified Rabin–Karp algorithm: the computation of the hash values for input strings is changed to depend on the number of characters in the coding system and on the maximum length of most words in the report (five characters).
The ability to find the most similar word using the fuzzy ratio, for words exceeding the five-character maximum length or for words that have no match via the modified Rabin–Karp algorithm.
Although this algorithm was designed and tuned for a specific purpose, the analysis of pathological reports, it was also tested on ordinary English data (the Bible dataset), and its results surpass those of the previously published paper, as shown in Table 6; it can therefore be used for other purposes as well.
6. Conclusions
Doctors track significant changes by paying attention to blood parameter readings that exceed the normal range; however, slight variations and interactions among numerous blood parameters are also crucial for identifying abnormal patterns. String similarity can therefore be used to find words (test names) that are similar or identical to the words of a standard pathological report in order to collect current patient test values and establish a baseline for the general report. This paper proposed a new algorithm to measure the degree of similarity. This measure, known as Razy string matching, identifies two types of matches: exact matches, if any, or approximate matches. One of the most important contributions of this research is thus the development of a matching algorithm tailored to the words found in CBC tests, providing the speed and accuracy appropriate for diagnostic systems for blood diseases. Two additional contributions are the construction of a general report containing all of the patient's information (the patient's medical history) and the generation of a unique number for each patient, included in the report and relied on when handling patient data to avoid duplicate patient names. The experiments show that the enhanced string similarity measure can be used for both exact and approximate matching, retrieving the best matching words with a high accuracy of 99.94%.
Author Contributions: Conceptualization, S.S.A.-J. and A.K.F.; methodology, S.S.A.-J. and A.A.A.; software, S.S.A.-J. and M.E.G.; validation, A.K.F., M.E.G. and A.A.A.; formal analysis, A.A.A. and S.S.A.-J.; investigation, A.A.A. and A.K.F.; resources, S.S.A.-J. and A.K.F.; data curation, S.S.A.-J. and M.E.G.; writing—original draft preparation, S.S.A.-J. and A.A.A.; writing—review and editing, S.S.A.-J. and A.A.A.; visualization, A.K.F. and M.E.G. All authors have read and agreed to the published version of the manuscript.
Acknowledgments: The author Mohamed E. Ghoneim would like to thank the Deanship of Scientific Research at Umm Al-Qura University for supporting this work through Grant Code: 22UQU4331317DSR002.
Conflicts of Interest: The authors declare no conflict of interest.
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Time measurements for the token-based methods (Jaccard distance, N-Gram, bag distance, and Sørensen–Dice), grouped by term length.
| Input Type | # Chars | Jaccard Dis. | N-Gram | Bag Dis. | Sørensen–Dice |
|---|---|---|---|---|---|
| Complete Words | 3 | 0 | 3719.32 | 0 | 0 |
| Delete One Char. | 3 | 0 | 779.39 | 538.82 | 0 |
| Change One Char. | 3 | 1106.76 | 0 | 1966.19 | 0 |
| Complete Words | 4 | 0 | 83333.33 | 0 | 1998.66 |
| Delete One Char. | 4 | 0 | 90909.09 | 0 | 799.06 |
| Change One Char. | 4 | 1157.76 | 0 | 1351.83 | 460.86 |
| Complete Words | 5 | 0 | 76923.07 | 0 | 0 |
| Delete One Char. | 5 | 0 | 76923.07 | 0 | 235.67 |
| Change One Char. | 5 | 1152.51 | 76923.07 | 2069.47 | 0 |
Time measurements for the Naïve matching, Rabin–Karp, fuzzy ratio, and fuzzy extract process methods, grouped by term length.
| Input Type | # Chars | Naïve Matching | Rabin–Karp | Fuzzy Ratio | Fuzzy Extract Process |
|---|---|---|---|---|---|
| Complete Words | 3 | 1270.05 | 0 | 0 | 0 |
| Delete One Char. | 3 | 0 | 0 | 0 | 1590.92 |
| Change One Char. | 3 | 0 | 0 | 0 | 1079.08 |
| Complete Words | 4 | 321.62 | 0 | 170.23 | 0 |
| Delete One Char. | 4 | 0 | 0 | 2632.85 | 0 |
| Change One Char. | 4 | 0 | 0 | 0 | 0 |
| Complete Words | 5 | 676.15 | 0 | 0 | 0 |
| Delete One Char. | 5 | 1450.3 | 0 | 0 | 1010.17 |
| Change One Char. | 5 | 748.15 | 0 | 0 | 0 |
Time measurements for the edit-based and sequence-based methods (Damerau–Levenshtein, Levenshtein, Hamming distance, and longest common substring).
| Input Type | # Chars | Damerau–Lev. | Lev. | Hamming Dist. | Longest-Common-Substring |
|---|---|---|---|---|---|
| Complete Words | 3 | 853.53 | 1001.35 | 0 | 1097.67 |
| Delete One Char. | 3 | 0 | 1007.79 | 1235.48 | 1148.46 |
| Change One Char. | 3 | 0 | 1054.76 | 0 | 0 |
| Complete Words | 4 | 0 | 0 | 0 | 0 |
| Delete One Char. | 4 | 0 | 2741.81 | 2879.38 | 0 |
| Change One Char. | 4 | 998.49 | 0 | 0 | 3016.94 |
| Complete Words | 5 | 2428.77 | 2254.96 | 1077.41 | 4007.81 |
| Delete One Char. | 5 | 1045.94 | 1049.04 | 1039.26 | 1714.46 |
| Change One Char. | 5 | 1221.17 | 0 | 0 | 0 |
The average waiting time for each implemented algorithm.
| Method Name | Average Waiting Time |
|---|---|
| Damerau–Levenshtein | 591.85 |
| Levenshtein | 1012.19 |
| Hamming Distance | 692.39 |
| Longest-Common-Substring | 1220.59 |
| Jaccard Distance | 379.67 |
| N-Gram | 36954.14 |
| Bag Distance | 428.53 |
| Sørensen–Dice | 388.25 |
| Naïve Matching | 413.12 |
| Rabin–Karp | 0 |
| Fuzzy Ratio | 311.45 |
| Fuzzy Extract Process | 408.90 |
The waiting time for the proposed algorithm and ODADNEN.
| Input Type | No. of Char. | ODADNEN | The Proposed Method |
|---|---|---|---|
| Complete Words | 3 | 4707.8 | 0 |
| Delete One Char. | 3 | 1000 | 0 |
| Change One Char. | 3 | 2028.7 | 0 |
| Complete Words | 4 | 2024.9 | 0 |
| Delete One Char. | 4 | 1002 | 0 |
| Change One Char. | 4 | 1999.7 | 0 |
| Complete Words | 5 | 1990.7 | 0 |
| Delete One Char. | 5 | 2113.6 | 0 |
| Change One Char. | 5 | 2047.7 | 0 |
The waiting time (milliseconds) for the Razy algorithm compared to [32].
| Input | Value |
|---|---|
| Corpus | English |
| Text | Bible |
| Size (MB) | 3.83 |
| Sample patterns | Good, the |
| Existing method [32] | 4205 ms |
| The proposed Razy string matching | 754 ms |
Evaluation metrics used in assessing the proposed approach.
| Metric | Formula |
|---|---|
| F1-Score | 2TP / (2TP + FP + FN) |
| Recall | TP / (TP + FN) |
| Specificity (TNR) | TN / (TN + FP) |
| Accuracy | (TP + TN) / (TP + TN + FP + FN) |
| Precision | TP / (TP + FP) |
| Sensitivity (TPR) | TP / (TP + FN) |
| N-value (NPV) | TN / (TN + FN) |
| P-value (PPV) | TP / (TP + FP) |
Evaluating the performance of the proposed approach in comparison to the Fuzzy approach.
| Metric | Proposed Razy | Fuzzy | Proposed Razy | Fuzzy |
|---|---|---|---|---|
| Input Term | The | The | Good | Good |
| Accuracy | 0.99973894 | 0.988385718 | 0.999152054 | 0.9821751 |
| Sensitivity (TPR) | 0.998378729 | 0.939005794 | 0.993546305 | 0.905854663 |
| Specificity (TNR) | 0.999858041 | 0.992951075 | 0.999645178 | 0.989463747 |
| P-value (PPV) | 0.998378729 | 0.924902373 | 0.995956655 | 0.891430226 |
| N-value (NPV) | 0.999858041 | 0.994352899 | 0.999432405 | 0.990995149 |
| F1-score | 0.998378729 | 0.931900726 | 0.99475002 | 0.898584562 |
Statistical analysis of the achieved results using the proposed Razy approach.
| Metric | Proposed Razy | Fuzzy |
|---|---|---|
| Number of Values | 10 | 10 |
| Minimum | 0.9992 | 0.9722 |
| 25% Percentile | 0.9992 | 0.9822 |
| Median | 0.9992 | 0.9822 |
| 75% Percentile | 0.9992 | 0.9837 |
| Maximum | 0.9992 | 0.9922 |
| Range | 0 | 0.02 |
| Mean | 0.9992 | 0.9828 |
| Std. Deviation | 0 | 0.005087 |
| Std. Error of Mean | 0 | 0.001609 |
| Sum | 9.992 | 9.828 |
Wilcoxon signed rank test of the results achieved by the proposed Razy approach.
| Metric | Proposed Razy | Fuzzy |
|---|---|---|
| Theoretical median | 0 | 0 |
| Actual median | 0.9992 | 0.9822 |
| Number of values | 10 | 10 |
| Wilcoxon signed rank test | | |
| Sum of signed ranks (W) | 55 | 55 |
| Sum of positive ranks | 55 | 55 |
| Sum of negative ranks | 0 | 0 |
| p value (two-tailed) | 0.002 | 0.002 |
| Exact or estimate? | Exact | Exact |
| p value summary | | |
| Significant (alpha = 0.05)? | Yes | Yes |
| How big is the discrepancy? | | |
| Discrepancy | 0.9992 | 0.9822 |
References
1. Dhindsa, K.; Bhandari, M.; Sonnadara, R.R. What’s holding up the big data revolution in healthcare?. BMJ; 2018; 363, k5357. [DOI: https://dx.doi.org/10.1136/bmj.k5357] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30593447]
2. Al Mayahi, S.; Al-Badi, A.; Tarhini, A. Exploring the Potential Benefits of Big Data Analytics in Providing Smart Healthcare. Emerging Technologies in Computing; Miraz, M.H.; Excell, P.; Ware, A.; Soomro, S.; Ali, M. Springer International Publishing: Cham, Switzerland, 2018; Volume 200, pp. 247-258. [DOI: https://dx.doi.org/10.1007/978-3-319-95450-9_21]
3. de Boer, M.H.T.; Bakker, B.J.; Boertjes, E.; Wilmer, M.; Raaijmakers, S.; van der Kleij, R. Text Mining in Cybersecurity: Exploring Threats and Opportunities. Multimodal Technol. Interact.; 2019; 3, 62. [DOI: https://dx.doi.org/10.3390/mti3030062]
4. Ranjan, N.M.; Prasad, R.S. Text Analytics: An Application of Text Mining. J. Data Min. Manag.; 2021; 6, pp. 1-6. [DOI: https://dx.doi.org/10.46610/JoDMM.2021.v06i03.001]
5. Zhong, S.; Sun, D. Logic-Driven Traffic Big Data Analytics: Methodology and Applications for Planning; Springer Nature: Singapore, 2022; pp. 97-118.
6. Jaiswal, M.; Srivastava, A.; Siddiqui, T.J. Machine Learning Algorithms for Anemia Disease Prediction. Recent Trends in Communication, Computing, and Electronics; Khare, A.; Tiwary, U.S.; Sethi, I.K.; Singh, N. Springer: Singapore, 2019; Volume 524, pp. 463-469. [DOI: https://dx.doi.org/10.1007/978-981-13-2685-1_44]
7. Kalra, S.; Li, L.; Tizhoosh, H.R. Automatic Classification of Pathology Reports using TF-IDF Features. arXiv; 2019; [DOI: https://dx.doi.org/10.48550/arXiv.1903.07406] arXiv: 1903.07406
8. Dube, N.; Girdler-Brown, B.; Tint, K.; Kellett, P. Repeatability of manual coding of cancer reports in the South African National Cancer Registry, 2010. S. Afr. J. Epidemiol. Infect.; 2013; 28, pp. 157-165. [DOI: https://dx.doi.org/10.1080/10158782.2013.11441539]
9. Achilonu, O.J.; Olago, V.; Singh, E.; Eijkemans, R.M.J.C.; Nimako, G.; Musenge, E. A Text Mining Approach in the Classification of Free-Text Cancer Pathology Reports from the South African National Health Laboratory Services. Information; 2021; 12, 451. [DOI: https://dx.doi.org/10.3390/info12110451]
10. Goh, Y.M.; Ubeynarayana, C. Construction accident narrative classification: An evaluation of text mining techniques. Accid. Anal. Prev.; 2017; 108, pp. 122-130. [DOI: https://dx.doi.org/10.1016/j.aap.2017.08.026]
11. Wang, L.; Alexander, C.A. Big Data Analytics in Healthcare Systems. Int. J. Math. Eng. Manag. Sci.; 2019; 4, pp. 17-26. [DOI: https://dx.doi.org/10.33889/IJMEMS.2019.4.1-002]
12. Castro, E.M.; Van Regenmortel, T.; Vanhaecht, K.; Sermeus, W.; Van Hecke, A. Patient empowerment, patient participation and patient-centeredness in hospital care: A concept analysis based on a literature review. Patient Educ. Couns.; 2016; 99, pp. 1923-1939. [DOI: https://dx.doi.org/10.1016/j.pec.2016.07.026] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/27450481]
13. Yang, J.J.; Li, J.; Mulder, J.; Wang, Y.; Chen, S.; Wu, H.; Wang, Q.; Pan, H. Emerging information technologies for enhanced healthcare. Comput. Ind.; 2015; 69, pp. 3-11. [DOI: https://dx.doi.org/10.1016/j.compind.2015.01.012]
14. Najam, M.; Rasool, R.U.; Ahmad, H.F.; Ashraf, U.; Malik, A.W. Pattern Matching for DNA Sequencing Data Using Multiple Bloom Filters. BioMed Res. Int.; 2019; 2019, 7074387. [DOI: https://dx.doi.org/10.1155/2019/7074387] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31111064]
15. Wang, X.; Saif, A.A.; Liu, D.; Zhu, Y.; Benediktsson, J.A. A novel optimal multi-pattern matching method with wildcards for DNA sequence. Technol. Health Care; 2021; 29, pp. 115-124. [DOI: https://dx.doi.org/10.3233/THC-218012]
16. Rozov, R.; Shamir, R.; Halperin, E. Fast lossless compression via cascading Bloom filters. BMC Bioinform.; 2014; 15, S7. [DOI: https://dx.doi.org/10.1186/1471-2105-15-S9-S7] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/25252952]
17. Williams, T.; van Staa, T.; Puri, S.; Eaton, S. Recent advances in the utility and use of the General Practice Research Database as an example of a UK Primary Care Data resource. Ther. Adv. Drug Saf.; 2012; 3, pp. 89-99. [DOI: https://dx.doi.org/10.1177/2042098611435911]
18. Lee, M.L.; Clymer, R.; Peters, K. A naturalistic patient matching algorithm: Derivation and validation. Health Inform. J.; 2016; 22, pp. 1030-1044. [DOI: https://dx.doi.org/10.1177/1460458215607080] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/26460103]
19. Karystianis, G.; Sheppard, T.; Dixon, W.G.; Nenadic, G. Modelling and extraction of variability in free-text medication prescriptions from an anonymised primary care electronic medical record research database. BMC Med. Inform. Decis. Mak.; 2016; 16, 18. [DOI: https://dx.doi.org/10.1186/s12911-016-0255-x]
20. Tissot, H.; Dobson, R. Combining string and phonetic similarity matching to identify misspelt names of drugs in medical records written in Portuguese. J. Biomed. Semant.; 2019; 10, 17. [DOI: https://dx.doi.org/10.1186/s13326-019-0216-2] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/31711534]
21. Patel, R.; Jayatilleke, N.; Jackson, R.; Stewart, R.; McGuire, P. Investigation of negative symptoms in schizophrenia with a machine learning text-mining approach. Lancet; 2014; 383, S16. [DOI: https://dx.doi.org/10.1016/S0140-6736(14)60279-8]
22. Menger, V.; Scheepers, F.; van Wijk, L.M.; Spruit, M. DEDUCE: A pattern matching method for automatic de-identification of Dutch medical text. Telemat. Inform.; 2018; 35, pp. 727-736. [DOI: https://dx.doi.org/10.1016/j.tele.2017.08.002]
23. Levy-Fix, G.; Gorman, S.L.; Sepulveda, J.L.; Elhadad, N. When to re-order laboratory tests? Learning laboratory test shelf-life. J. Biomed. Inform.; 2018; 85, pp. 21-29. [DOI: https://dx.doi.org/10.1016/j.jbi.2018.07.019]
24. Biggs, J. Comparison of Visual and Logical Character Segmentation in Tesseract OCR Language Data for Indic Writing Scripts. Proceedings of the Australasian Language Technology Association Workshop 2015; Parramatta, Australia, 8–9 December 2015; pp. 11-20.
25. Chernyak, E. Comparison of String Similarity Measures for Obscenity Filtering. Proceedings of the 6th Workshop on Balto-Slavic Natural Language Processing; Association for Computational Linguistics; Valencia, Spain, 4 April 2017; pp. 97-101. [DOI: https://dx.doi.org/10.18653/v1/W17-1415]
26. Abdul-Jabbar, S.; George, L. A Comparative Study for String Metrics and the Feasibility of Joining them as Combined Text Similarity Measures. ARO-Sci. J. Koya Univ.; 2017; 5, pp. 6-18. [DOI: https://dx.doi.org/10.14500/aro.10180]
27. Yu, M.; Wang, J.; Li, G.; Zhang, Y.; Deng, D.; Feng, J. A unified framework for string similarity search with edit-distance constraint. VLDB J.; 2017; 26, pp. 249-274. [DOI: https://dx.doi.org/10.1007/s00778-016-0449-y]
28. Kosa, V.; Chaves-Fraga, D.; Keberle, N.; Birukou, A. Similar Terms Grouping Yields Faster Terminological Saturation. Information and Communication Technologies in Education, Research, and Industrial Applications; Ermolayev, V.; Suárez-Figueroa, M.C.; Yakovyna, V.; Mayr, H.C.; Nikitchenko, M.; Spivakovsky, A. Springer International Publishing: Cham, Switzerland, 2019; Volume 1007, pp. 43-70. [DOI: https://dx.doi.org/10.1007/978-3-030-13929-2_3]
29. Yaqin, A.; Dahlan, A.; Hermawan, R.D. Implementation of Algorithm Rabin-Karp for Thematic Determination of Thesis. Proceedings of the 2019 4th International Conference on Information Technology, Information Systems and Electrical Engineering (ICITISEE); Yogyakarta, Indonesia, 20–21 November 2019; pp. 395-400. [DOI: https://dx.doi.org/10.1109/ICITISEE48480.2019.9003867]
30. Bosker, H.R. Using fuzzy string matching for automated assessment of listener transcripts in speech intelligibility studies. Behav. Res. Methods; 2021; 53, pp. 1945-1953. [DOI: https://dx.doi.org/10.3758/s13428-021-01542-4] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33694079]
31. Putri, R.E. Examination of Document Similarity Using Rabin-Karp Algorithm. Int. J. Recent Trends Eng. Res.; 2017; 3, pp. 196-201. [DOI: https://dx.doi.org/10.23883/IJRTER.2017.3404.4SNDK]
32. Hakak, S.; Kamsin, A.; Shivakumara, P.; Idna Idris, M.Y.; Gilkar, G.A. A new split based searching for exact pattern matching for natural texts. PLoS ONE; 2018; 13, e0200912. [DOI: https://dx.doi.org/10.1371/journal.pone.0200912]
33. El-Kenawy, E.S.M.; Mirjalili, S.; Alassery, F.; Zhang, Y.D.; Eid, M.M.; El-Mashad, S.Y.; Aloyaydi, B.A.; Ibrahim, A.; Abdelhamid, A.A. Novel Meta-Heuristic Algorithm for Feature Selection, Unconstrained Functions and Engineering Problems. IEEE Access; 2022; 10, pp. 40536-40555. [DOI: https://dx.doi.org/10.1109/ACCESS.2022.3166901]
34. Khafaga, D.S.; Alhussan, A.A.; El-Kenawy, E.S.M.; Takieldeen, A.E.; Hassan, T.M.; Hegazy, E.A.; Eid, E.A.F.; Ibrahim, A.; Abdelhamid, A.A. Meta-heuristics for Feature Selection and Classification in Diagnostic Breast-Cancer. Comput. Mater. Contin.; 2022; 73, pp. 749-765. [DOI: https://dx.doi.org/10.32604/cmc.2022.029605]
35. Samee, N.A.; El-Kenawy, E.S.M.; Atteia, G.; Jamjoom, M.M.; Ibrahim, A.; Abdelhamid, A.A.; El-Attar, N.E.; Gaber, T.; Slowik, A.; Shams, M.Y. Metaheuristic Optimization Through Deep Learning Classification of COVID-19 in Chest X-Ray Images. Comput. Mater. Contin.; 2022; 73, pp. 4193-4210. [DOI: https://dx.doi.org/10.32604/cmc.2022.031147]
36. Khafaga, D.S.; El-kenawy, E.S.M.; Karim, F.K.; Alshetewi, S.; Ibrahim, A.; Abdelhamid, A.A. Optimized Weighted Ensemble Using Dipper Throated Optimization Algorithm in Metamaterial Antenna. Comput. Mater. Contin.; 2022; 73, pp. 5771-5788. [DOI: https://dx.doi.org/10.32604/cmc.2022.032229]
37. El-Kenawy, E.S.M.; Mirjalili, S.; Abdelhamid, A.A.; Ibrahim, A.; Khodadadi, N.; Eid, M.M. Meta-Heuristic Optimization and Keystroke Dynamics for Authentication of Smartphone Users. Mathematics; 2022; 10, 2912. [DOI: https://dx.doi.org/10.3390/math10162912]
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Pathology reports are necessary for specialists to make an appropriate diagnosis of diseases in general and blood diseases in particular. Therefore, specialists check blood cells and other blood details. Thus, to diagnose a disease, specialists must analyze the factors of the patient's blood and medical history. Generally, doctors have tended to use intelligent agents to help them with CBC analysis. However, these agents need analytical tools to extract the parameters (CBC parameters) employed in predicting the development of life-threatening bacteremia and to offer prognostic data. Therefore, this paper proposes an enhancement to the Rabin–Karp algorithm and combines it with the fuzzy ratio to make the algorithm suitable for working with CBC test data. These algorithms were selected after evaluating the utility of various string matching algorithms in order to choose the best ones for establishing an accurate text collection tool as a baseline for building a general report on patient information. The proposed method includes several basic steps: first, the CBC-driven parameters are extracted using an efficient method for retrieving data from PDF files or images of the CBC tests. This is performed by implementing 12 traditional string matching algorithms, identifying the most effective ones based on the implementation results, and subsequently introducing a hybrid approach that addresses their shortcomings to obtain a more effective and faster algorithm for analyzing the pathological tests. The proposed algorithm (Razy) was implemented using the Rabin algorithm and the fuzzy ratio method. The results show that the proposed algorithm is fast and efficient, with an average accuracy of 99.94% when retrieving results. Moreover, we conclude that the string matching algorithm is a crucial tool in the report analysis process that directly affects the efficiency of the analytical system.
Details
Farhan, Alaa K 2; Abdelhamid, Abdelaziz A 3; Ghoneim, Mohamed E 4
1 Computer Science Department, College of Science for Women, University of Baghdad, Baghdad 10011, Iraq
2 Computer Science Department, University of Technology, Baghdad 10030, Iraq
3 Department of Computer Science, Faculty of Computer and Information Sciences, Ain Shams University, Cairo 11566, Egypt
4 Department of Mathematical Sciences, Faculty of Applied Science, Umm Al-Qura University, Makkah 21955, Saudi Arabia; Faculty of Computers and Artificial Intelligence Damietta University, Damietta 34517, Egypt