Full Text

1. Introduction

1.1 Motivation and related work

The rapid development of digital libraries provides researchers with faster and easier access to digital documents, mostly in Portable Document Format (PDF). When a researcher wants to build a collection of digital documents (e.g. a reference list or a personal academic bibliography), he/she usually has to read the documents and annotate the metadata (e.g. title, authors, journal, and publication year) manually. This is time-consuming work, in particular when the documents are collected from various sources. Many automatic metadata extraction methods have been proposed, and they are usually divided into two types: rule-based methods and machine learning methods.

Rule-based methods follow a set of predefined rules or regular expressions to extract metadata. For example, Gao et al. (2011) focussed on metadata extraction from the title page of Chinese books. Their system checked each text block against the predefined rule set and assigned a probability for each metadata type (a total of 30 metadata types for Chinese books). Finally, to select the best-fit metadata type for each block, their system used optimal bipartite graph matching to obtain a globally optimal assignment. Yang et al. (1999) used an image processing approach (without OCR) to segment the color space of a book cover image. The text regions that contain header bibliography information could then be extracted. Giuffrida et al. (2000) proposed a spatial knowledge-based approach to extract metadata from PostScript files. Their system exploited the visual/spatial knowledge humans make use of when reading a document. The spatial properties used in their approach include appearance order and font size. Zhang et al. (2004) designed a platform, called PKUSpace, that used heuristic rules and regular expression matching to mine different patterns and extract relevant metadata from PDF documents. Several other studies (Day et al., 2007; Wei et al., 2007) proposed rule-based methods to extract bibliographic data from the reference section of a scientific paper. Day et al. (2007) used a hierarchical template-based approach, called INFOMAP, to extract and integrate metadata from a paper's references section. Wei et al. (2007) analyzed the appearance and punctuation of bibliographic attributes to perform format and semantic tagging on two defined parsing layers: an underlayer for format tagging and an upperlayer for semantic tagging. Each reference is tagged using seven attributes (author, title,...
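To make the rule-based style surveyed above more concrete, the following minimal sketch tags a reference string with a single regular expression. It is not the method of any of the cited systems. The pattern, the function name `tag_reference`, the field names, and the sample reference are all assumptions made for illustration, and the rule covers only one hypothetical citation layout ("Authors (Year), Title. Rest.").

```python
import re

# Hypothetical rule for ONE reference layout only:
#   "Authors (YYYY), Title. Remaining text."
# Real rule-based systems combine many such rules; this is a single illustrative one.
REFERENCE_PATTERN = re.compile(
    r"^(?P<authors>.+?)\s*"      # shortest run of text before the year
    r"\((?P<year>\d{4})\)"       # four-digit year in parentheses
    r"[,.]?\s*"                  # optional separator after the year
    r"(?P<title>[^.]+)\."        # title: everything up to the next period
    r"\s*(?P<rest>.*)$"          # whatever follows (venue, pages, ...)
)


def tag_reference(reference: str) -> dict:
    """Return the fields matched by the rule, or an empty dict if it does not apply."""
    match = REFERENCE_PATTERN.match(reference.strip())
    if match is None:
        return {}
    return {name: value.strip() for name, value in match.groupdict().items()}


# Made-up reference string, used only to exercise the rule.
example = "Doe, J. and Roe, R. (2005), A made-up title for illustration. Journal of Examples."
fields = tag_reference(example)
print(fields)
```

A rule like this fails as soon as a reference departs from the assumed layout, for instance when the year appears at the end or the title itself contains a period. That brittleness is why rule-based systems typically need many layout-specific rules.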


Extracting bibliographical data for PDF documents with HMM and external resources
