Abstract

OCR (Optical Character Recognition) for scanned paper invoices is very challenging due to the variability of 19 invoice layouts, different information fields, large data tables, and low scanning quality. In this setting, table structure recognition is a critical task in which all rows, columns, and cells must be accurately located and extracted. Existing methods such as DeepDeSRT have dealt only with high-quality born-digital images (e.g., PDF) that have low noise and a clearly visible table structure. This paper proposes an efficient method called CluSTi (Clustering method for recognition of the Structure of Tables in invoice scanned Images). The contributions of CluSTi are three-fold. Firstly, it removes heavy noise from the table images using a clustering algorithm. Secondly, it extracts all text boxes using state-of-the-art text recognition. Thirdly, using horizontal and vertical clustering algorithms with optimized parameters, CluSTi groups the text boxes into their correct rows and columns, respectively. The method was evaluated on three datasets: i) 397 public scanned images; ii) 193 PDF document images from the ICDAR 2013 competition dataset; and iii) 281 PDF document images from ICDAR 2019’s numeric tables. The evaluation results showed that CluSTi achieved F1-scores of 87.5%, 98.5%, and 94.5%, respectively. Our method also outperformed DeepDeSRT, which reported an F1-score of 91.44% on only 34 images from the ICDAR 2013 competition dataset. To the best of our knowledge, CluSTi is the first method to tackle the table structure recognition problem on scanned images.
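
The third step of the abstract, grouping text boxes into rows and columns by clustering, can be illustrated with a minimal sketch. This is not the authors' algorithm: the abstract does not specify the clustering method or its parameters, so the sketch substitutes a simple gap-based one-dimensional clustering of box centers. The box format `(x_min, y_min, x_max, y_max)`, the function names `cluster_1d` and `group_boxes`, and the thresholds `row_gap` and `col_gap` are all assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical box format (an assumption): (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def cluster_1d(values: List[float], max_gap: float) -> List[int]:
    """Label each value so that sorted neighbours no more than max_gap apart
    share a label. Labels start at 0 and increase with the value."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    labels = [0] * len(values)
    current = 0
    for prev, idx in zip(order, order[1:]):
        # A gap larger than the threshold starts a new cluster.
        if values[idx] - values[prev] > max_gap:
            current += 1
        labels[idx] = current
    return labels


def group_boxes(boxes: List[Box], row_gap: float, col_gap: float) -> List[Tuple[int, int]]:
    """Return a (row, column) index for each text box.

    Rows come from clustering vertical centers, columns from clustering
    horizontal centers. Both thresholds are illustrative parameters.
    """
    y_centers = [(b[1] + b[3]) / 2 for b in boxes]
    x_centers = [(b[0] + b[2]) / 2 for b in boxes]
    rows = cluster_1d(y_centers, row_gap)
    cols = cluster_1d(x_centers, col_gap)
    return list(zip(rows, cols))


if __name__ == "__main__":
    # Four boxes laid out as a 2 x 2 table with slight misalignment.
    boxes = [
        (10, 10, 50, 20),
        (100, 11, 140, 21),
        (12, 40, 48, 50),
        (102, 41, 150, 51),
    ]
    print(group_boxes(boxes, row_gap=10, col_gap=20))
```

Clustering on box centers is a deliberate simplification: it tolerates small misalignments from scanning but would likely mis-assign left-aligned cells of very different widths, which is one reason a real system would need tuned parameters.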

Details

Title
CluSTi: Clustering Method for Table Structure Recognition in Scanned Images
Author
Zucker, Arthur 1; Younes, Belkada 1; Vu Hanh 2; Nguyen Van Nam 3

1 Polytech Sorbonne, Sorbonne University, Paris, France (GRID:grid.462844.8) (ISNI:0000 0001 2308 1657)
2 Viettel CyberSpace Center, Hanoi, Vietnam (GRID:grid.462844.8)
3 Thuyloi University, Computer Science and Engineering Department, Hanoi, Vietnam (GRID:grid.440808.0) (ISNI:0000 0004 0385 0086)
Pages
1765-1776
Publication year
2021
Publication date
Aug 2021
Publisher
Springer Nature B.V.
ISSN
1383-469X
e-ISSN
1572-8153
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2586187323
Copyright
© The Author(s), under exclusive licence to Springer Science+Business Media, LLC, part of Springer Nature 2021.