Introduction
The Portable Document Format (PDF) ([2] Adobe Systems Incorporated, 2001) is an electronic file format developed by Adobe Systems. Its platform independence makes it well suited to releasing and disseminating electronic documents and digital information across the internet. However, despite the format's popularity, research into extracting semantic information from PDF files lags far behind.
The Extensible Markup Language (XML) ([7] W3C, 2000) is the data exchange standard recommended by the W3C. Its introduction advanced the way aggregated information is expressed on the network. Because it is a cross-platform technology that depends only on content, XML has become a tool for processing distributed information in the internet environment. Being content-oriented, XML can compensate for the shortcomings of PDF in semantic description; converting PDF files to XML files has therefore become an urgent task.
This paper introduces a methodology and describes a system for this conversion. The system first converts the PDF file to an intermediate XML file, following the basic principles of information extraction, and then extracts information from the intermediate file according to Extensible Stylesheet Language Transformations (XSLT) extraction rules to complete the final conversion.
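The second step, applying extraction rules to the intermediate file, can be illustrated with a minimal sketch. It is not the system described here: the paper uses XSLT rules, whereas this sketch substitutes path expressions from Python's standard `xml.etree.ElementTree` module so that it runs without extra libraries. The intermediate XML layout, the `size` attribute, the output tag names and the `extract` function are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical intermediate XML: element and attribute names are assumptions,
# standing in for the layout-level description produced from a PDF file.
INTERMEDIATE = """
<document>
  <page number="1">
    <text size="18">Converting PDF to XML</text>
    <text size="12">A. Author</text>
    <text size="10">The portable document format is widely used.</text>
  </page>
</document>
"""

# Hypothetical extraction rules: semantic tag -> path expression selecting
# the layout elements that carry that meaning. A real system would express
# these as XSLT templates instead.
RULES = {
    "title": ".//text[@size='18']",
    "author": ".//text[@size='12']",
    "paragraph": ".//text[@size='10']",
}


def extract(intermediate_xml: str, rules: dict[str, str]) -> ET.Element:
    """Build a semantic XML tree by applying each rule to the intermediate file."""
    source = ET.fromstring(intermediate_xml)
    result = ET.Element("article")
    for tag, path in rules.items():
        for node in source.findall(path):
            child = ET.SubElement(result, tag)
            child.text = (node.text or "").strip()
    return result


if __name__ == "__main__":
    print(ET.tostring(extract(INTERMEDIATE, RULES), encoding="unicode"))
```

The point of the sketch is the separation of concerns: the intermediate file records only layout, and all semantic labels come from the rule set, so a different document type would need new rules but no change to the extraction code.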
General framework of the file conversion system
The general framework of the system for converting PDF files to XML files is shown in Figure 1 [figure omitted]. The system consisted of three major modules: the intermediate file generation module, the rule generation module and the automatic extraction module. The input of the intermediate file generation module was the PDF file (either the PDF source file or the sample PDF file). Its output was the intermediate XML file describing the content and structure of the PDF file. The rule generation module analyzed and processed the sample PDF file, from which the system automatically generated the extraction rules. The automatic extraction module received the intermediate XML file and the relevant rules and output the final XML file with semantic information.
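The data flow between the three modules can be sketched as follows. This is an illustrative wiring only, assuming that the sample PDF file is first turned into an intermediate file before rules are derived from it; the function name `convert`, the type aliases and the stand-in callables in the demonstration are hypothetical and do not come from the paper.

```python
from typing import Callable

# Hypothetical type aliases: each artifact is modelled as plain text.
IntermediateXml = str
ExtractionRules = str  # e.g. the text of an XSLT stylesheet
SemanticXml = str


def convert(
    source_pdf: bytes,
    sample_pdf: bytes,
    generate_intermediate: Callable[[bytes], IntermediateXml],
    generate_rules: Callable[[IntermediateXml], ExtractionRules],
    extract: Callable[[IntermediateXml, ExtractionRules], SemanticXml],
) -> SemanticXml:
    """Wire the three modules together in the order the framework describes."""
    # Rule generation works from the sample file (assumed to pass through
    # intermediate file generation first).
    rules = generate_rules(generate_intermediate(sample_pdf))
    # The source file is converted to its own intermediate file.
    source_xml = generate_intermediate(source_pdf)
    # Automatic extraction combines the intermediate file with the rules.
    return extract(source_xml, rules)


if __name__ == "__main__":
    # Stand-in modules that only label their inputs, to make the flow visible.
    output = convert(
        b"source",
        b"sample",
        generate_intermediate=lambda pdf: f"<intermediate>{pdf.decode()}</intermediate>",
        generate_rules=lambda xml: f"rules-from({xml})",
        extract=lambda xml, rules: f"extract({xml}, {rules})",
    )
    print(output)
```

Passing the modules in as callables keeps the wiring independent of how each module is implemented, which mirrors the modular separation in the framework.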
Semantic model of the system
The display model of the PDF file cannot reflect the semantic information in the contents of the file. For carrying out the semantic processing from the PDF file to the XML...





