1. Introduction
"Metadata" as a term is used to denote all of the information that describes the characteristics of objects that are stored in digital repositories. The need for quality metadata has become greater with the development of the internet, which is becoming a powerful resource for the dissemination and exchange of information. Metadata are a fundamental component of digital libraries as well as initiatives such as the Semantic Web (see www.w3.org/2001/sw/) and Open Archives Initiative (see www.openarchives.org/). Institutional repositories such as DSpace (see www.dspace.org/), Fedora (see http://fedora-commons.org/) and EPrints (see www.eprints.org/) use metadata to enable efficient access and retrieval of digital content. An enormous amount of digital resources requiring metadata represent a major challenge for these initiatives, libraries and repositories For example, a project which would aim at forming a digital repository of all publicly available scientific papers in the UK would have to face a collection with the growing pace of 100,000 papers per year that require manual metadata entry ([2] Adams, 2009). Ideally the metadata should be entered by the authors, but authors rarely do that even when they are provided with the appropriate tools ([13] Crystal and Land, 2003). According to [13] Crystal and Land (2003), it would take about 60 employee-years to create metadata for one million documents. Based on the overwhelming cost of entering metadata manually, it is obvious that there is a need for tools that enable their automatic extraction. The Directorate for Cataloguing of the US Library of Congress has recognised this problem ([2] Adams, 2009) and sponsored the Automatic Metadata Generation Applications (AMEGA) project ([23] Greenberg et al. , 2006).
Given the development of automatic indexing of digital content and the fact that it is less costly than manual indexing ([3] Anderson and Pérez-Carballo, 2001), it can be assumed that automated metadata extraction will, over time, become more efficient, cheaper and more consistent. Although previous studies show that automatic generation of metadata provides acceptable performance ([38] Liddy et al., 2002; [25] Han et al., 2003; [48] Peng and McCallum, 2004; [56] Takasu, 2003), researchers generally conclude that the best results are achieved by integrating automated and manual methods ([54] Schwartz, 2001).
This paper presents a method for the automatic extraction of metadata...