Abstract
This paper introduces definitions of Thai elementary discourse units (T-EDUs), grammar rules for T-EDU segmentation, and two versions of chart parsing for T-EDU detection: one with longest matching and the other with maximum matching. Four environments are investigated: a closed test with pre-chunked text, a closed test with running text, an open test with pre-chunked text, and an open test with running text. As our test bed, 1,530 T-EDUs extracted from the NE- and POS-tagged Thai-NEST corpus are used to construct a set of context-free grammar (CFG) rules, and the rules are then applied to segment T-EDUs from a text. Because the chart parser generates all possible edges (candidate T-EDUs) without syntactic constraints, longest matching (LM) or maximum matching (MM) is applied to select the plausible T-EDUs. The results show that, on the pre-chunked text, precision improves from 28.64% to 92.53% and 92.70% for LM and MM, respectively, at the cost of a 7-8% drop in recall, to 93.01% and 92.09%, respectively.
Keywords: Thai elementary discourse unit; discourse unit segmentation; chart parser.
1 Introduction
Most work on text processing defines elementary discourse units (EDUs) as the minimal building blocks of a discourse tree. Discourse segmentation has been recognized as an important process in several text processing tasks [1]. In many applications, paragraphs or sentences are too large to serve as processing units, since they may contain too many pieces of information and need to be broken down into smaller units. By segmenting a text into a set of tractable discourse units and discovering their relationships, one could construct an abstraction-based summary, which is usually more comprehensive than an extraction-based summary. In this paper, we present definitions of Thai elementary discourse units (T-EDUs), grammar rules for T-EDU segmentation, and an EDU chart parser. With our definition of T-EDUs, our process includes NE and POS tagging and T-EDU segmentation. All possible parse trees permitted by the grammar rules are constructed using chart parsing, and the implausible ones are then filtered out by longest matching (LM) [2] and maximum matching (MM) [3]. A set of context-free grammar rules for T-EDUs is constructed from the Thai-NEST news corpus developed by Theeramunkong et al. [4]. In this paper, Section 2 describes related works on discourse segmentation. Characteristics...
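As an illustration of how the two selection strategies can differ, the following minimal sketch picks a segmentation from a set of candidate edges over token positions. It is not the parser described in this paper: the edge set is invented, the function names are hypothetical, and maximum matching is assumed here to mean the cover with the fewest units, as the term is commonly used in Thai word segmentation. The exact criteria used for T-EDUs are not given in this excerpt.

```python
from typing import List, Optional, Set, Tuple

Span = Tuple[int, int]  # half-open token span [start, end), with start < end


def longest_matching(edges: Set[Span], n: int) -> Optional[List[Span]]:
    """Greedy left-to-right: at each position take the longest edge starting there."""
    pos, chosen = 0, []
    while pos < n:
        ends = [e for (s, e) in edges if s == pos]
        if not ends:
            return None  # the greedy choice reached a position with no edge
        end = max(ends)
        chosen.append((pos, end))
        pos = end
    return chosen


def maximum_matching(edges: Set[Span], n: int) -> Optional[List[Span]]:
    """Assumed sense of MM: cover [0, n) with the fewest edges (dynamic programming)."""
    unreachable = n + 1  # larger than any possible number of segments
    best = [unreachable] * (n + 1)  # best[i] = fewest edges covering [0, i)
    back: List[Optional[int]] = [None] * (n + 1)
    best[0] = 0
    for end in range(1, n + 1):
        for (s, e) in edges:
            if e == end and best[s] + 1 < best[end]:
                best[end] = best[s] + 1
                back[end] = s
    if best[n] == unreachable:
        return None  # no sequence of edges covers the whole text
    chosen, pos = [], n
    while pos > 0:
        start = back[pos]
        chosen.append((start, pos))
        pos = start
    return chosen[::-1]


# Hypothetical candidate edges over a six-token text.
EDGES = {(0, 4), (0, 3), (3, 6), (4, 5), (5, 6)}
print(longest_matching(EDGES, 6))  # [(0, 4), (4, 5), (5, 6)]
print(maximum_matching(EDGES, 6))  # [(0, 3), (3, 6)]
```

On this toy input the greedy strategy commits to the longest first unit and ends with three units, while the fewest-units strategy accepts a shorter first unit and ends with two.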





