Abstract
Skew angle estimation is essential for improving the accuracy of optical character recognition (OCR) systems. In this paper we present a new method based on boundary growing (BG) and nearest neighbor clustering (NNC) to accurately estimate the skew angle of scanned documents. BG extracts the boundary characters present in each text line of the document, together with the uppermost, lowermost and centroid coordinates of the character components of the scanned document image. NNC clusters the character components that arise from the additional modifier characters usually present in South Indian scripts. The extracted coordinates are then subjected to moments to estimate the skew angle of the document image. Experiments were conducted on various types of documents, including South Indian script documents, English documents, journals, textbooks, text with pictures, text with tables, text with graphs, documents in different languages, noisy images, and documents with different fonts and resolutions, to assess the robustness of the proposed method. The experimental results show that the proposed method is accurate in comparison with well-known existing methods.
Key Words
Optical character recognition, boundary growing, nearest neighbor clustering, moments, South Indian scripts, skew estimation
(ProQuest: ... denotes formulae omitted.)
1. Introduction
Optical character recognition (OCR) systems transform a text document into its equivalent digital form, which is suitable for computerized data processing. This transformation typically consists of three major stages: preprocessing; document layout understanding and segmentation; and feature extraction and classification. An OCR system works well if the document is not skewed during scanning and its text lines are strictly horizontal. This is very unlikely in practice, so accurate skew angle estimation is invariably part of any OCR-based character recognition system that aims for a good recognition rate [1, 2]. The literature survey shows that certain OCR systems do work on skewed documents containing English script but are less accurate on South Indian script documents. This is due to additional modifier characters, which get plugged in as bottom fixes or top fixes, or as extensions, and remain as disconnected protrusions of a main character. Hence designing a skew estimation algorithm for South Indian scripts remains a challenging and interesting problem.
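To make the notion of a skew angle concrete, the following is a minimal sketch of how an orientation can be obtained from a set of character coordinates using second-order central moments. It is an illustration under stated assumptions, not the formulation used in this paper: the function name `skew_angle_from_points` is hypothetical, the input is assumed to be a list of (x, y) points such as the centroids of the characters in one text line, and the standard principal-axis formula is used in place of whatever moment expression the proposed method applies.

```python
import math


def skew_angle_from_points(points):
    """Estimate the orientation (in degrees) of a set of (x, y) points.

    Hypothetical helper: uses the standard principal-axis formula
    theta = 0.5 * atan2(2 * mu11, mu20 - mu02) on central moments.
    The sign of the angle follows the coordinate convention of the input
    (in image coordinates, y usually grows downward).
    """
    n = len(points)
    if n < 2:
        raise ValueError("need at least two points to estimate an angle")

    # Centroid of the point set.
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n

    # Second-order central moments.
    mu20 = sum((x - mean_x) ** 2 for x, _ in points)
    mu02 = sum((y - mean_y) ** 2 for _, y in points)
    mu11 = sum((x - mean_x) * (y - mean_y) for x, y in points)

    # Orientation of the principal axis, in the range (-90, 90] degrees.
    theta = 0.5 * math.atan2(2.0 * mu11, mu20 - mu02)
    return math.degrees(theta)


if __name__ == "__main__":
    # Synthetic centroids lying on a line tilted by 5 degrees (assumed data).
    slope = math.tan(math.radians(5.0))
    centroids = [(x, x * slope) for x in range(0, 200, 20)]
    print(round(skew_angle_from_points(centroids), 3))
```

In this sketch, disconnected modifier components would pull the estimate away from the true text-line direction if their centroids were included as separate points, which is one way to see why such components need to be grouped with their main character before the angle is computed.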
Existing skew estimation techniques are...