Abstract
This paper discusses technologies for software performance optimization. Optimization methods are divided into high-level methods, low-level methods, and parallelization. The described optimization methods are applied to programs and software systems that process large volumes of information and contain hot spots. An algorithm for classifying and linking fields in a recognized image of an administrative document is described. The implementation features of the classification and linking tasks, which consist in using constellations of text key points and a modified Levenshtein distance, are considered. For optical character recognition (OCR), Smart Document Engine and Tesseract are employed. Several methods used to optimize the performance of functions for document classification and linking are described. The performance optimization of the system for sorting administrative document image streams is considered. The proposed methods for software performance optimization are suitable not only for image processing algorithms but also for computational algorithms with cyclic information processing. The approach can also be used in modern CAD systems to analyze the content of recognized text files.
INTRODUCTION
Software performance optimization is performed for various reasons. Performance requirements can be formulated by the end user of an information system in its technical specifications. These requirements are specified for particular computers and operating systems. Similarly, there are performance requirements for mobile applications. Runtime constraints imposed on mobile applications are due to not only response requirements but also power consumption and operating temperature requirements of mobile devices. There is a correlation between application performance and power consumption.
Software optimization is aimed at
─ reducing the average execution time of an application on a certain test dataset;
─ reducing the average execution time of an implemented function.
PRINCIPLES OF SOFTWARE PERFORMANCE OPTIMIZATION
In the process of software development, several types of optimization can be used:
─ high-level optimization (algorithm optimization);
─ low-level optimization using hardware features;
─ parallel programming.
High-level optimization is based on a certain method for solving a problem included in the functionality of a program being developed. The algorithm can significantly depend on the domain of definition, e.g., a decrease in the amount of data obviously reduces search time. The selection of algorithm parameters also significantly affects complexity. High-level optimization is carried out on simplified computer architectures, e.g., von Neumann architecture or Harvard architecture. The architecture is understood as a set of user-visible characteristics, including the basic devices and blocks of a simplified computer, as well as the structure of their interconnections. From a software development perspective, computer architecture is a set of descriptions of data, operations (instructions), and their characteristics.
High-level optimization can use the following methods:
─ the analysis of an algorithm (implementation, selection, domain of definition, parameterization, and termination criterion);
─ the use of intermediate data (memoization [1]);
─ the representation of source data;
─ the reduction of algorithm complexity (lookup table [2] and interpolation);
─ loop optimization (removing computations from the body of a loop, loop fusion, and loop unrolling).
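As a hedged illustration of the last two methods, the following C++ sketch combines a lookup table with the removal of an invariant computation from a loop body; the function and its use of gamma correction are illustrative examples, not taken from any particular system.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Illustrative only: a lookup table for a byte-valued transform combined with
// loop-invariant code motion. Names and the transform itself are hypothetical.
std::vector<float> gamma_correct(const std::vector<unsigned char>& pixels,
                                 float gamma) {
    // Lookup table: the expensive std::pow call is evaluated 256 times here
    // instead of once per pixel inside the hot loop below.
    float lut[256];
    for (int v = 0; v < 256; ++v)
        lut[v] = std::pow(v / 255.0f, gamma);   // invariant work moved out of the main loop

    std::vector<float> out(pixels.size());
    for (std::size_t i = 0; i < pixels.size(); ++i)
        out[i] = lut[pixels[i]];                // the hot loop does only a table lookup
    return out;
}
```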
Low-level optimization is focused on the hardware platform, which determines the operating environment of specific programs. The basis of a hardware platform is formed by a motherboard, central processing unit (CPU), and storage devices. A program running on a computer consists of instructions of a particular processor.
Low-level optimization is carried out for particular CPU microarchitectures [3], e.g.,
─ CISC (variable instruction length);
─ RISC (uniform instruction length);
─ VLIW (parallel execution of several operations in one instruction);
─ superscalar architecture, in which the decision to execute two or more instructions in parallel across several devices is made by the CPU at runtime.
Generally speaking, low-level optimization techniques differ for different CPU microarchitectures. For instance, SIMD instruction sets for ARM and Intel CPUs are different. Programmer directives aimed at improving parallel execution of instructions for VLIW CPUs are not suitable for Intel CPUs.
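To make this platform dependence concrete, the hedged sketch below performs the same element-wise addition of 32-bit integers with SSE2 intrinsics on Intel CPUs and with NEON intrinsics on ARM CPUs; the headers and intrinsic names differ, so such code must be written or generated separately for each instruction set.

```cpp
#include <cstddef>
#include <cstdint>
#if defined(__SSE2__) || defined(_M_X64)
  #include <emmintrin.h>   // Intel SSE2 intrinsics
#elif defined(__ARM_NEON)
  #include <arm_neon.h>    // ARM NEON intrinsics
#endif

// Element-wise addition of two int32 arrays; n is assumed to be a multiple of 4.
void add_vectors(const std::int32_t* a, const std::int32_t* b,
                 std::int32_t* c, std::size_t n) {
#if defined(__SSE2__) || defined(_M_X64)
    for (std::size_t i = 0; i < n; i += 4) {
        __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a + i));
        __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b + i));
        _mm_storeu_si128(reinterpret_cast<__m128i*>(c + i), _mm_add_epi32(va, vb));
    }
#elif defined(__ARM_NEON)
    for (std::size_t i = 0; i < n; i += 4) {
        int32x4_t va = vld1q_s32(a + i);
        int32x4_t vb = vld1q_s32(b + i);
        vst1q_s32(c + i, vaddq_s32(va, vb));
    }
#else
    for (std::size_t i = 0; i < n; ++i) c[i] = a[i] + b[i];  // scalar fallback
#endif
}
```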
Parallel execution (parallelization) is an effective way to improve software performance. Parallel execution is facilitated when several programs that share one or more processors and other computer resources are alternately executed on one processor. This method is called multitasking. It is aimed at improving the efficiency of a computing system. In this case, various efficiency criteria can be used:
─ throughput: the number of tasks executed per unit of time in a batch operating system (OS);
─ user-friendliness: the possibility of simultaneous interaction with several applications on one machine in a time-sharing OS;
─ system reactivity: the ability of the system to maintain predetermined (possibly, very short) periods of time between launching a program and receiving its output in a real-time OS.
It is reasonable to carry out parallelization upon completion of high-level and low-level optimizations of a sequential program or algorithm. Parallelization implies the execution of certain parts of a program on several devices, in our case, several CPUs. Formally, a distinction is made between parallel execution of application tasks on one computer and distributed execution of several applications on several computers in a local network. The main approaches to parallel software development are
─ sequential programming with subsequent automatic parallelization;
─ direct generation of parallel control flows, taking into account specific architectures of parallel computing systems or operating systems;
─ description of parallelism without explicit control.
METHODS FOR SOFTWARE PERFORMANCE OPTIMIZATION
The optimization methods described below are applied primarily to programs and software systems that implement various information processing functions. In other words, we do not consider programs that implement one or more mathematical algorithms, e.g., functions from the Eigen or MKL libraries. More specifically, we consider programs in which the inefficient use of hardware resources can take place in a large number of hot spots.
Specifics of High-level Performance Optimization for Project Implementation and Product Development
The software optimization principles described in Section 2 are used differently in different scenarios.
Specification of performance requirements differs significantly for product development (APIs, libraries, and applications) and project implementation (systems, subsystems, and functional modules). The developments of both types have certain similarities, including
─ the use of ready-made software modules;
─ high-level optimization;
─ the expert estimation of speedup.
However, there are also significant differences. For instance, a product release plan allows for an increase in development time, e.g., associated with the implementation of new algorithms that a priori should have lower complexity than old ones. During product development, it is necessary to enable constant profiling and carry out performance analyses of other types. Product development requires the performance analysis of competing products, which can provide useful ideas. In project implementation, performance requirements can be specified by the customer; however, a project deadline and availability of resources may reduce the time required for optimization. In both cases, the optimization of commercial and proprietary modules is carried out. The optimization of commercial modules by replacing or modifying implemented algorithms is generally impossible; however, it is possible to control performance by data representation and parameter selection for callee methods.
Software Profiling
Profiling is necessary in all cases. It implies measuring the performance of a user program or its parts. The purpose of profiling is to evaluate the performance of an application as a whole and its components (spots): functions, loops, lines of code, and instructions.
Performance evaluation allows one to analyze
─ the total execution time of an application;
─ the set of hot spots;
─ the relative time of each spot;
─ the number of spot calls;
─ the degree of code coverage (the degree to which the source code of a program is executed when running a particular test);
─ CPU workload and memory locks;
─ the low-level statistics on CPU workload, memory bus load, cache misses, and cache hits;
─ the degree of parallelism.
Profiling can be carried out for a debug version of an application, an optimized debug version, or a release version with debug information. An essential condition for profiling is to conduct the measurements on the same test dataset, which can be either public or proprietary. Profiling must be carried out under the same conditions on chosen platforms, including their hardware and system software.
There are several methods of instrumental profiling:
─ manual: time measurements that use calls of system functions (e.g., std::clock) added to application code or methods from the std::chrono library;
─ time-based profiling: collection of statistics about an application in the process of profiling;
─ instrumentation: collection of detailed runtime information about each callee function in the process of profiling.
An advantage of the manual approach is the ease of measurements for a small number of known spots. Its disadvantages are
─ the need to insert additional code into a program, which can lead to errors and induced effects;
─ the analysis of the result in text form;
─ the absence of a function call tree and automated analysis tools.
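A minimal sketch of the manual approach based on std::chrono is shown below; process_page is a hypothetical hot spot, and the code illustrates the text-only output and the absence of a call tree mentioned above.

```cpp
#include <chrono>
#include <cstdio>

void process_page();  // hypothetical hot spot to be measured

void measure_process_page(int iterations) {
    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iterations; ++i)
        process_page();
    const auto stop = clock::now();
    const auto us =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start).count();
    // The result is available only in text form; no call tree is produced.
    std::printf("process_page: %.1f us per call\n",
                static_cast<double>(us) / iterations);
}
```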
The time-based profiler periodically gathers information about the state of a program, e.g., values of CPU performance counters and program counters. Based on these measurements, the performance is evaluated. Time-based profiling has the minimum effect on the operation of analyzed software, and the obtained information is highly accurate. The main disadvantages of this approach are
─ incomplete information about program code;
─ time-consuming collection of real statistics.
The instrumentation method modifies the executable file of a program to collect information about execution of each function, except for the time spent on function calls and system calls. The instrumentation method provides more information than time-based profiling; however, it can significantly slow down the profiled program.
There are several universal profilers:
─ CodeAnalyst;
─ Valgrind;
─ Visual Studio Performance Profiler;
─ Intel VTune.
All profilers mentioned above allow one to detect hot spots, localize sections of code that inefficiently use hardware and CPU resources, as well as identify synchronization objects that negatively affect software performance. Intel VTune can evaluate many hardware counters and metrics to detect hot spots. Intel VTune can emulate particular CPUs, including various levels of cache memory with replacement mechanisms, instruction decoders, instruction buffers, instruction pipelines, and other CPU components, as well as emulate interaction between the CPU and memory.
Methods for Low-level Optimization of Software Performance
The main low-level optimization method in software development is the selection of compiler parameters. For instance, in Microsoft Visual Studio and other compilers, the most important parameter is the optimization option: no optimization, performance optimization, or program size optimization. Another performance optimization is to explicitly specify the architecture, which allows the compiler to use instructions from the selected instruction set during translation.
The use of mathematical libraries is also an effective way to speed up computations. For instance, the Intel compiler has the Enable Matrix Multiply Library Call option, which enables or disables the use of the matrix multiplication library. The Intel MKL library is well-known; it contains numerous computational methods implemented and optimized for modern CPUs.
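As a hedged example of delegating computations to an optimized mathematical library, the sketch below multiplies two row-major matrices through the CBLAS interface shipped with Intel MKL; the wrapper function and matrix sizes are illustrative.

```cpp
#include <vector>
#include <mkl.h>   // Intel MKL CBLAS interface

// C = A * B for row-major double matrices: A is m x k, B is k x n, C is m x n.
// C must be preallocated to m * n elements.
void multiply_with_mkl(int m, int n, int k,
                       const std::vector<double>& A,
                       const std::vector<double>& B,
                       std::vector<double>& C) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                m, n, k,
                1.0, A.data(), k,      // alpha, A, leading dimension of A
                B.data(), n,           // B, leading dimension of B
                0.0, C.data(), n);     // beta, C, leading dimension of C
}
```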
Of course, programming in assembly language and use of intrinsics also remain relevant, which allows one to create libraries for fast computations.
Parallelization Approaches
As noted in Section 2, the main approaches to parallel software development are
─ sequential programming with subsequent automatic parallelization;
─ direct generation of parallel control flows taking into account specific architectures of parallel computing systems or operating systems;
─ description of parallelism without explicit control.
Sequential programming with subsequent automatic parallelization facilitates development, making it possible to abstract completely from parallelization details and delegate parallelization to automated tools. The obvious disadvantage of this approach is that the maximum speedup cannot be achieved.
Below are some cross-platform multithreaded C++ libraries that can be used as automated parallelization tools:
─ Qt4 Threads;
─ Intel Threading Building Blocks (TBB).
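A minimal sketch of this approach with Intel TBB is given below; the page type and the per-page processing function are hypothetical placeholders. The loop body is written sequentially, and the library decides how to split the iteration range among worker threads.

```cpp
#include <cstddef>
#include <vector>
#include <tbb/blocked_range.h>
#include <tbb/parallel_for.h>

struct Page { int id; /* recognized page data (placeholder) */ };
void process_page(Page&) { /* hypothetical sequential per-page work */ }

// The loop body is written sequentially; TBB splits the index range into
// chunks and schedules them on the available worker threads.
void process_all_pages(std::vector<Page>& pages) {
    tbb::parallel_for(tbb::blocked_range<std::size_t>(0, pages.size()),
                      [&](const tbb::blocked_range<std::size_t>& r) {
                          for (std::size_t i = r.begin(); i != r.end(); ++i)
                              process_page(pages[i]);
                      });
}
```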
The advantage of the direct generation of parallel control flows is the possibility of a significantly greater speedup for particular computing systems as compared to the previous approach. The disadvantages of the direct generation are as follows:
─ the complication of the development process due to the manual control of processes and threads, including the analysis of possible conflicts;
─ the dependence on a particular architecture of a multiprocessor system, which complicates portability.
Development is facilitated by directives that mark a fragment of source code for which a parallel implementation is expected and instruct the compiler to parallelize it.
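The text does not name a particular directive system; as one widely used example, the hedged sketch below marks a loop with an OpenMP directive (the image type and filter are hypothetical).

```cpp
#include <cstddef>
#include <vector>

struct Image { int width = 0, height = 0; /* pixel data (placeholder) */ };
void filter_image(Image&) { /* hypothetical independent per-image work */ }

// Built with an OpenMP-aware compiler (e.g., with -fopenmp or /openmp), the
// directive asks the compiler to distribute the loop iterations across threads;
// without OpenMP support the pragma is ignored and the loop remains sequential.
void filter_all(std::vector<Image>& images) {
    #pragma omp parallel for
    for (std::ptrdiff_t i = 0; i < static_cast<std::ptrdiff_t>(images.size()); ++i)
        filter_image(images[static_cast<std::size_t>(i)]);
}
```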
When parallelizing, it is also necessary to carry out profiling to estimate time losses due to thread conflicts and suboptimal use of data by multiple threads.
RECOGNITION AND CLASSIFICATION OF ADMINISTRATIVE DOCUMENTS
Recognition of document images with known descriptions includes several tasks:
─ search for document boundaries;
─ normalization of document size and boundaries;
─ extraction of graphic primitives;
─ character and word recognition, document structure analysis;
─ search for boundaries and recognition of document fields;
─ postprocessing of recognition results.
The most important tasks are classification of documents (document fragments) and field linking (search for document regions to extract content). When analyzing recognized images, it is necessary to take into account OCR errors, which occur in noisy, brightened, or distorted document images. In this study, we consider administrative documents used for information exchange among organizations and individuals [4]. Administrative documents are characterized by a relatively simple structure and a limited vocabulary.
A document is defined as a set of fields and static information. The text structure of an administrative document can be described using three objects: a word, a text string, and a text fragment. To classify a recognized document and link fields, we can use text key points and constellations of text points defined in [5, 6]. Text key points are similar to geometric singular points [7, 8]. A word in the model is represented as a sequence of characters. A recognized word is represented by a matrix of alternative matches between the character locations and the recognition-alphabet characters in the bounding frame of the word.
For a pair of text key points (ω1, ω2), the following relations can be set:
─ ω1 ∈ S, ω2 ∈ S (ω1 ∈ F, ω2 ∈ F): both text points belong to one string S or one text fragment F;
─ ω1 ∈ S1, ω2 ∈ S2 (ω1 ∈ F1, ω2 ∈ F2): points ω1 and ω2 belong to different strings S1 and S2 or different text fragments F1 and F2;
─ ω1 < ω2: point ω1 is located “before” point ω2;
─ ω1 ∨ ω2: point ω1 is located “above” point ω2.
A text string is a constellation of several adjacent text points. Text strings can be found using algorithms for clustering frame boxes of recognized words. A text string is described by a set of ordered text points. For two strings (S1, S2), the following relations can be set:
─ S1 ∈ F, S2 ∈ F: both text strings belong to the same text fragment F;
─ S1 ∈ F1, S2 ∈ F2: both text strings belong to different text fragments F1 and F2;
─ S1 ∨ S2: string S1 is “above” string S2.
By string linking, we mean matching between the words of a recognized string and one of the described template strings. Some points are mandatory for linking. In this case, each point must be associated with some recognized word. In addition, the description of a string can contain prohibited text points. When linking a string, none of the prohibited points can be associated with a recognized word. The remaining text points may or may not be associated with recognized words in the process of string linking.
The string description can impose certain constraints on pairs of text points by using the following metrics:
─ the number of text points in the interval between ω1 and ω2;
─ the sum of widths of the text points located between ω1 and ω2;
─ the number of strings in the interval between the strings that contain ω1 and ω2;
─ the Euclidean distance between two projections of the bounding boxes of ω1 and ω2.
A text fragment is a set of several text strings. In our model, it is assumed that, within one fragment, strings are grouped only into one column. For two fragments (F1, F2), the following relations can be set:
─ F1 ∈ F, F2 ∈ F: two fragments belong to text fragment F;
─ F1 < F2: fragment F1 is “before” fragment F2;
─ F1 ∨ F2: fragment F1 is “above” fragment F2.
Text fragments can be created using text structure analysis algorithms. A document is divided into parts based on its graphic structure (separator lines, columns, paragraphs, etc.) by using a certain description (template) of the document [9, 10]. For document fragmentation, both fragment-separating segments and spaces between fragments can be used. By fragment linking, we mean matching between the words and strings of a fragment and one of the described template fragments. As with string descriptions, the fragment description contains mandatory, prohibited, and ordinary strings and words.
A constellation is represented as a sequence of ordered points ω1, …, ωn. A simple case of a constellation is a sequence of points that belong to one text string; in the simplest case, it is a shingle (a sequence of words in a document header). Another case of a constellation is a chain: a sequence of points the pairs of which are ordered based on relations ω1 < ω2 (simple chain) or ω1 ∨ ω2 (vertical chain). The use of chains and constellations allows one not only to recognize the type of a document but also to extract text fragments and strings. The latter makes it possible to reduce the amount of text used in the analysis of a text object, e.g., in field linking.
In [5, 6], a field linking method for flexible documents was described. Strings, paragraphs, and document fragments are linked using the following classification algorithm. Models of admissible strings M1, M2, …, Mq are specified, with each model M being defined by a set of text points
M = W– ∪ W+ ∪ W    (1)
and a threshold dLINK(M) on the number of linked points for reliable linking. Set (1) uses three bags of words:
─ prohibited words W– = {W1–, W2–, …};
─ mandatory words W+ = {W1+, W2+, …};
─ optional words W = {W1, W2, …, Wk}.
Estimates Δ(S, Mi) of string–model matches are calculated. Estimate Δ(S, Mi) is zero if
─ at least one point from set W–(S) is linked;
─ not a single point from set W+(S) is linked.
Estimate Δ(S, Mi) is equal to one if no points from W–(S) are linked and the number of linked points from W(S) and W+(S) exceeds dLINK(Mi). If Δ(S, Mi) is one, then string S is considered linked to model Mi. The accuracy of string linking depends on the preliminary linking of the neighborhood of an admissible string location. After string linking, a search (prediction) of field boundaries is carried out for subsequent information extraction. For a region of each field, basic elements that define a rectangle or polygon are specified. Field linking is carried out using linked basic elements.
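The linking rule described above can be summarized by the following schematic C++ sketch; the data structures and function names are illustrative placeholders rather than the actual implementation from [5, 6].

```cpp
#include <cstddef>
#include <vector>

// Illustrative representation of a candidate string S matched against model Mi:
// for every text point of each bag, a flag tells whether the point was linked
// to some recognized word.
struct LinkingResult {
    std::vector<bool> prohibited_linked;  // points from W- of the model
    std::vector<bool> mandatory_linked;   // points from W+ of the model
    std::vector<bool> optional_linked;    // points from W of the model
};

// Delta(S, Mi) as described above: 0 if any prohibited point is linked or no
// mandatory point is linked; 1 if, in addition, the number of linked mandatory
// and optional points reaches the threshold d_LINK(Mi); otherwise 0.
int link_estimate(const LinkingResult& s, std::size_t d_link) {
    for (bool linked : s.prohibited_linked)
        if (linked) return 0;                      // a prohibited word was linked
    std::size_t count = 0;
    bool any_mandatory = false;
    for (bool linked : s.mandatory_linked) {
        if (linked) { any_mandatory = true; ++count; }
    }
    if (!any_mandatory) return 0;                  // not a single mandatory word linked
    for (bool linked : s.optional_linked)
        if (linked) ++count;
    return count >= d_link ? 1 : 0;                // reliable linking threshold
}
```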
The described string classification algorithm is used for document classification. A document page image is classified by linking constellation points while taking into account the relations between certain points.
A text point and a recognized word are matched using the modified Levenshtein distance proposed in [6]. The word matching mechanism is used in many tasks that involve comparison of words with an alphabet in a database [11, 12]. The original Levenshtein distance [13] between two text strings V and W is defined as the minimum number of editing operations used to transform V into W:
dLEV(V, W) = D(|V|, |W|), D(i, 0) = i, D(0, j) = j,
D(i, j) = min{D(i – 1, j) + 1, D(i, j – 1) + 1, D(i – 1, j – 1) + substCost(vi, wj)},    (2)
where substCost(vi, wj) is the cost of replacing character vi with character wj, while |V| and |W| are the lengths of words V and W. By default, the cost of any editing operation is equal to one. In this study, the algorithm for calculating the Levenshtein distance between two text strings is implemented in full accordance with recurrent formula (2). The implementation does not use any memory saving techniques, which reduce performance. Words V and W match if dLEV(V, W) < d(V), where d(V) is the known threshold for a model word. When using OCR, recognition errors often occur. Hence, threshold d(V) cannot be zero. Obviously, d(V) should be different for words of different lengths. Thus, we can use the normalized Levenshtein distance ρLEV(V, W) defined in [14].
When recognizing noisy and distorted document images, numerous recognition errors can occur. Some OCR errors are not random. Incorrect recognition of character “X” as character “O” is unlikely. Meanwhile, character “Ъ” can be misrecognized as “b” due to the similarity of their patterns. Examples of similar patterns for the Latin alphabet are pairs of characters “B8,” “DO,” and “1I.” In other words, some OCR errors occur more often than others. To take this fact into account, function substCost(vi, wj) needs to be constructed in such a way that, when calculating the Levenshtein distance, the penalty for similar characters is lower than that for dissimilar characters:
─ for identical characters, substCost(vi, vi) = 0;
─ for dissimilar characters, substCost(vi, wj) = 1;
─ for similar characters, substCost(vi, wj) = 0 or 0 < substCost(vi, wj) < 1.
The described modification allows us to reduce the distance calculated for words with errors of the similar-characters type.
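A hedged sketch of recurrent formula (2) with a pluggable substitution cost is given below; the function names are illustrative, and subst_cost stands for any implementation of the rules listed above.

```cpp
#include <algorithm>
#include <functional>
#include <string>
#include <vector>

// Modified Levenshtein distance: full dynamic-programming table, as in (2).
// subst_cost(v, w) returns 0 for identical or fully equivalent characters,
// 1 for dissimilar characters, and a value in (0, 1) for similar characters.
double modified_levenshtein(const std::string& V, const std::string& W,
                            const std::function<double(char, char)>& subst_cost) {
    const std::size_t n = V.size(), m = W.size();
    std::vector<std::vector<double>> D(n + 1, std::vector<double>(m + 1, 0.0));
    for (std::size_t i = 0; i <= n; ++i) D[i][0] = static_cast<double>(i);  // deletions
    for (std::size_t j = 0; j <= m; ++j) D[0][j] = static_cast<double>(j);  // insertions
    for (std::size_t i = 1; i <= n; ++i)
        for (std::size_t j = 1; j <= m; ++j)
            D[i][j] = std::min({D[i - 1][j] + 1.0,                          // delete v_i
                                D[i][j - 1] + 1.0,                          // insert w_j
                                D[i - 1][j - 1] + subst_cost(V[i - 1], W[j - 1])});
    return D[n][m];
}
```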
For some words distant in their meaning, e.g., identifiers, the Levenshtein distance is small. To avoid such false matches, we use word templates G(V).
These templates specify mandatory characters at the beginning, middle, or end of a word. If the characters of a recognized word do not match the template, then the Levenshtein distance is increased by a predetermined penalty. This modification allows us to increase the distance between the words that differ in a small number of characters, which are features of different words.
A penalty can also be imposed for a mismatch between the lengths of words V and W; this penalty is denoted by f2(G2(V, W)). The similarity between the model word V and the recognized word W is calculated as
Sim(V, W) = dLEV(V, W) + f1(G(V), W) + f2(G2(V, W)),
where
─ f1(G(V), W) is the penalty for the mismatch between the model word V and the recognized word W, calculated using template G(V);
─ f2(G2(V, W)) is the penalty for the mismatch between the lengths of the model word V and recognized word W.
In this case, function substCost(vi, wj) can be used, which takes into account the recognition errors of the similar-characters type. If there is no penalty (f1(G(V), W) = 0 and f2(G2(V, W)) = 0) and there are no similar characters (substCost(vi, wj) is equal to either 0 or 1), then Sim(V, W) coincides with Levenshtein distance dLEV(V, W). Similarity Sim(V, W) can also be normalized by analogy with ρLEV(V, W).
The proposed modifications of the Levenshtein distance make it possible to reduce the number of matches for the words that cannot be considered identical, as well as increase the number of matches for the words with insignificant recognition errors.
OPTIMIZATION OF THE IMPLEMENTED ALGORITHMS FOR CLASSIFICATION AND LINKING OF ADMINISTRATIVE DOCUMENTS
As objects of performance optimization, we consider the proposed classification and linking algorithms based on word matching.
The high-level performance optimization of the classification and linking algorithms is based on creating a description of a document page structure and constellations that correspond to document fragments. When profiling the algorithms in C++, the calculation of the distance between words is a hot spot, which takes 20–50% of the total execution time of the algorithm (5–15 milliseconds per document for various document types). In other words, most of the time is spent on word matching. The optimization of the linking and classification functions is motivated by the need to apply these functions to a document image not just once but as many times as there are descriptions of document types.
Obviously, when analyzing a document fragment, the number of candidates for matching with text points in the model can be significantly smaller than when analyzing all recognized words on a page. This is ensured by penalty functions f1 and f2, relations between points in a constellation, and thresholds used for the metric-based calculation of the distance between V and W. When training classification models (1) on administrative documents with more than 10 types, the volume of the model exceeds 100 points. In contrast, when representing constellations as simple or vertical chains, no more than 10 points are required. Thus, the performance is significantly improved.
An effective optimization for calculating the distance between words V and W is to calculate penalties Pen(V, W) = f1(G(V), W) + f2(G2(V, W)) at the first stage: if Pen(V, W) exceeds threshold d(V), the words cannot match, and recurrent formula (2), which has quadratic complexity, is used only when Pen(V, W) < d(V).
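Schematically, this two-stage check can be written as follows; the names are illustrative, pen stands for the precomputed penalty Pen(V, W), and the final comparison assumes that the similarity combines the distance and the penalty as described above.

```cpp
#include <functional>
#include <string>

double modified_levenshtein(const std::string& V, const std::string& W,
                            const std::function<double(char, char)>& subst_cost);

// Two-stage check: the cheap penalty Pen(V, W) = f1 + f2 is examined first,
// and the quadratic recurrence (2) is evaluated only when the penalty alone
// does not already exceed the threshold d(V).
bool words_match(const std::string& V, const std::string& W,
                 double pen,            // Pen(V, W), assumed already computed
                 double d_V,            // threshold d(V) for the model word V
                 const std::function<double(char, char)>& subst_cost) {
    if (pen >= d_V)
        return false;                   // reject without running the DP at all
    return modified_levenshtein(V, W, subst_cost) + pen < d_V;   // Sim(V, W) < d(V)
}
```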
The proposed optimization is high-level and effective regardless of the architectural platform on which the algorithm is executed. Low-level optimization is also carried out. The time cost of classification and linking on a set of recognized words is insignificant when the classification and linking of document fields are carried out once for each document image. If the classification and linking are carried out repeatedly for documents of several types and are applied multiple times to one set of recognized words, then the time cost grows with the number of document types. Let us describe two optimizations aimed at speeding up the calculations for CPUs of particular manufacturers.
The first optimization concerns the evaluation of function substCost(s, c). In the implementation of the algorithm, the global (for a document as a whole) and local (for a word) tables of character equivalence (s ≠ c) are created, for which substCost(s, c) = 0. When comparing two characters, a search is performed in the equivalence table m_nEquChars, which has a 32-bit integer type, taking permutations into account. When profiling on Intel(R) Core(TM) i7-4790 3.60 GHz with 16 GB RAM and Windows 7 Pro 64 bit by using the Microsoft Visual Studio analyzer and Intel VTune [15], the substCost function was identified as a hot spot. The source and assembly codes are shown in Fig. 1.
Fig. 1. Implementation of the original version of the substCost function.
The function can be accelerated by using a 64-bit integer type (see Fig. 2). When using __int64 with the MSVC compiler in x64 mode, the number of instructions in the loop body is reduced from 13 to 8. The speedup on some document types reaches 10%.
Fig. 2. Implementation of the optimized version of the substCost function.
Another optimization was designed for the ARM-based platform used in iPhones. It turns out that, in complex scenarios of repeated classification and linking, access to objects (text points, strings, and interpoint relations), e.g., via get_object(int id, void *pObject), becomes a hot spot on this RISC architecture. This is due to the high latency of copying data from RAM. The speedup is achieved by accessing an object directly, without creating a new copy of it.
The described algorithms were integrated into Smart Document Engine SDK [16], which is designed for recognizing flexible documents. Smart Document Engine SDK runs mainly in parallel mode. Parallelization is carried out automatically using the Intel TBB library [17].
Let us consider another example of performance optimization for the task of sorting a large stream of administrative documents (300 000 pages in 8 hours). Using Tesseract OCR [18], the classification was carried out for 45 types of known documents. Note that the number of classes, equal to 45, significantly exceeds the number of classes in public datasets [19, 20]. The high-level optimization of the classification algorithms was carried out by parameter selection and data representation for the Tesseract OCR component. A significant effect was achieved by limiting the region of recognition for each page. For this purpose, the region that included the text key points necessary for classification of all documents was selected on the training set. The binarization of page images before recognition proved to be effective. The initial goal of the binarization was to improve the recognition accuracy by removing a complex background and applying morphological operations.
The low-level optimization was carried out by selecting compilation parameters for Intel C++ Compiler XE 15.0. For the compiler, the AVX2 optimization option was enabled. The Intel compiler made it possible to optimize the performance of both Tesseract and all other components of the system, primarily the bilateral filter. The speedup due to the high-level and low-level optimizations exceeded 50%.
The parallelization was implemented using independent components, which made it possible to run several applications that processed pages in several threads on several multicore nodes, with the input page stream being assigned to these applications by a certain balancing algorithm (see Fig. 3). Thus, the system implements parallelism without explicit control.
Fig. 3. Parallel implementation of the sorting system.
CONCLUSIONS
Sections 4 and 5 have presented some examples of high-level and low-level optimizations for document recognition programs. The described methods for software performance optimization are suitable for a broad class of image processing applications.
For development of image processing applications, the choice of ready-made or newly developed software components is important. This choice depends on the type of development (project or in-house development). In all cases, it is reasonable to estimate the complexity of selected algorithms, as well as to prototype image processing applications. A necessary optimization step is to determine the performance requirements for the application.
From the very beginning of development, the most important tool for high-level and low-level optimization is the profiler. It is used to analyze the profile, select hot spots, and refine runtime constraints on hot spots. At the initial stages of development, it is useful to analyze source code on emulators of a chosen computing platform, including memory latency simulation [21]. The choice of image representation can make a significant contribution to speeding up the algorithm. For instance, an integral representation can be used to extract Haar features [22]. When implementing artificial neural networks, choosing the size of network layers and data representation is efficient for speedup [23, 24].
Extended SIMD instruction sets (MMX, SSE (XMM), and NEON) efficiently accelerate image processing and recognition algorithms. To take advantage of the capabilities of a particular platform, both compilers and intrinsics can be used. However, for different computing platforms, not only compilers but also sets of intrinsics are different. This should be taken into account at the earlier stage of choosing the data representation, e.g., for neural networks [23].
Assuming that image processing takes time that significantly exceeds the OS time quantum, e.g., exceeds 100 ms, two parallelization methods are possible. The first method is to use automatic parallelization tools [17], while the second method is the direct generation of parallel control flows. It is reasonable to carry out parallelization upon completion of high-level and low-level optimization.
The proposed methods for software performance optimization are suitable not only for image processing applications, but also for computational algorithms with cyclic processing of large volumes of information.
FUNDING
This work was supported by ongoing institutional funding. No additional grants to carry out or direct this particular research were obtained.
CONFLICT OF INTEREST
The author of this work declares that he has no conflicts of interest.
Translated by Yu. Kornienko
Publisher’s Note.
Pleiades Publishing remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
AI tools may have been used in the translation or editing of this article.
REFERENCES
1 Acar, U.A., Blelloch, G.E., and Harper, R., Selective memoization, ACM SIGPLAN Not., 2003, vol. 38, pp. 14–25. https://doi.org/10.1145/640128.604133
2 Tatarowicz, A.L., Curino, C., Jones, E.P.C., and Madden, S., Lookup tables: Fine-grained partitioning for distributed databases, Proc. IEEE 28th Int. Conf. Data Engineering, 2012, pp. 102–113. https://doi.org/10.1109/ICDE.2012.26
3 Harris, D.M. and Harris, S.L., Digital Design and Computer Architecture, 2013.
4 Rusiñol, M., Frinken, V., Karatzas, D., Bagdanov, A.D., and Lladós, J., Multimodal page classification in administrative document image streams, IJDAR, 2014, vol. 17, pp. 331–341. https://doi.org/10.1007/s10032-014-0225-8
5 Slavin, O.A. and Pliskin, E.L., Method for analyzing the structure of noisy images of administrative documents, Program. Comput. Software, 2023, pp. 47–61.
6 Slavin, O.A., Farsobina, V., and Myshev, A.V., Analyzing the content of business documents recognized with a large number of errors using modified Levenshtein distance, in Cyber-Physical Systems: Intelligent Models and Algorithms, 2022. https://doi.org/10.1007/978-3-030-95116-0
7 Bellavia, F., SIFT matching by context exposed, IEEE Trans. Pattern Anal. Mach. Intell., 2022. https://doi.org/10.1109/TPAMI.2022.3161853
8 Bay, H., Tuytelaars, T., and Van Gool, L., SURF: Speeded up robust features, Comput. Vision Image Understanding, 2006, vol. 110, pp. 404–417.
9 Du, X., Wumo, P., and Bui, T.D., Text line segmentation in handwritten documents using Mumford–Shah model, Pattern Recognit., 2009, vol. 42, pp. 3136–3145. https://doi.org/10.1016/j.patcog.2008.12.021
10 Maraj, A., Martin, M.V., and Makrehchi, M., A more effective sentence-wise text segmentation approach using BERT, 2021. https://doi.org/10.1007/978-3-030-86337-1_16
11 Kravets, A.G., Salnikova, N.A., and Shestopalova, E.L., Development of a module for predictive modeling of technological development trends, 2021. https://doi.org/10.1007/978-3-030-67892-0_11
12 Sabitov, A., Minnikhanov, R., Dagaeva, M., Katasev, A., and Asliamov, T., Text classification in emergency calls management systems, 2021. https://doi.org/10.1007/978-3-030-67892-0_17
13 Deza, M.M. and Deza, E., Encyclopedia of Distances, 2009. https://doi.org/10.1007/978-3-642-00234-2
14 Yujian, L. and Bo, L., A normalized Levenshtein distance metric, IEEE Trans. Pattern Anal. Mach. Intell., 2007, vol. 29, pp. 1091–1095. https://doi.org/10.1109/TPAMI.2007.1078
15 Intel® VTune™ Profiler Performance Analysis Cookbook. https://www.intel.com/content/www/us/en/docs/vtune-profiler/cookbook/2023-2/overview.html. Accessed September 23, 2023.
16 Smart Document Engine – Automatic analysis and data extraction from business documents for desktop, server and mobile platforms. https://smartengines.com/ocr-engines/document-scanner. Accessed September 23, 2023.
17 Intel® oneAPI Threading Building Blocks (oneTBB) developer guide and API reference. https://www.intel.com/content/www/us/en/docs/onetbb/developer-guide-api-reference/2021-10/overview.html. Accessed September 23, 2023.
18 OCR Tesseract. https://github.com/tesseract-ocr/tesseract. Accessed September 23, 2023.
19 NIST Special Database. https://www.nist.gov/srd/nist-special-database-2. Accessed September 23, 2023.
20 Tobacco-3482. https://www.kaggle.com/patrickaudriaz/tobacco3482jpg. Accessed September 23, 2023.
21 Kravets, A.G. and Egunov, V., The software cache optimization-based method for decreasing energy consumption of computational clusters, Energies, 2022, vol. 15, 7509. https://doi.org/10.3390/en15207509
22 Crow, F.C., Summed-area tables for texture mapping, ACM SIGGRAPH Comput. Graphics, 1984, vol. 18, pp. 207–212. https://doi.org/10.1145/964965.808600
23 Trusov, A., Limonova, E., Nikolaev, D., and Arlazarov, V.V., 4.6-bit quantization for fast and accurate neural network inference on CPUs, Mathematics, 2024, vol. 12, 651. https://doi.org/10.3390/math12050651
24 Rybakova, E.O., Limonova, E.E., and Nikolaev, D.P., Fast Gaussian filter approximations comparison on SIMD computing platforms, Appl. Sci., 2024, vol. 14, 4664. https://doi.org/10.3390/app14114664