[Received September 2015 and accepted November 2015]
Abstract. This paper surveys the available statistical methods for the induction of morphological grammars. The chosen word and character representation plays a crucial role in the efficiency of the generated grammar rule system. The analysis covers a wide range of methods, with the main focus on decision-tree-based rule trees, which provide an efficient approach for large problems as well. Regarding word representation formats, the paper shows the main benefits of the attribute-based approach.
Keywords: morphology analysis, grammar induction, word representation formats
1. Morphology
Morphology is the linguistic discipline that analyzes the internal components of words and how they express syntactic roles. In synthetic languages, morphemes are the primary tools for determining the meaning of words. The most widely used realizations of morpheme units are affixes. In concatenative morphology, the different components (morphemes) are composed into a sequence, and there is a clear boundary between the components. Some languages, on the other hand, also use non-concatenative morphology rules. The past tense of irregular verbs in English is based on this kind of transformation (for example, sing, sang). In the Ngiti [4] language, the plural of a noun is generated by replacing the last two syllables with high-tone syllables: singular: kama, plural: kámá.
Current morphological analyzers, such as the Humor analyzer for Hungarian [1], use a lexicon of the possible surface forms. The program searches the input word form for possible analyses. It looks up morphs in the lexicon whose surface form matches the beginning of the input word (and later the beginning of the not yet analyzed part of it). The lexicon may contain not only single morphs but also morph sequences. These are ready-made analyses for irregular forms of stems or for suffix sequences, which the analyzer can thus identify in a single step, making its operation more efficient [7, 8].
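The lookup procedure described above can be illustrated with a minimal sketch. It is not the Humor implementation: the toy lexicon, its labels, and the function name analyze are assumptions made only for illustration. The sketch matches lexicon entries against the beginning of the word, then recurses on the part not yet analyzed, and it includes one ready-made morph sequence that is matched in a single step.

```python
from typing import Dict, List, Tuple

# Hypothetical toy lexicon (surface form -> label); not taken from Humor.
LEXICON: Dict[str, str] = {
    "ház": "house[N]",
    "ak": "PL",
    "ban": "INE",
    "akban": "PL+INE",  # ready-made morph sequence, found in one lookup
}

Analysis = List[Tuple[str, str]]


def analyze(word: str) -> List[Analysis]:
    """Return every segmentation of `word` into lexicon entries."""
    if not word:
        # Nothing left to analyze: one valid (empty) continuation.
        return [[]]
    analyses: List[Analysis] = []
    for end in range(1, len(word) + 1):
        prefix = word[:end]
        if prefix in LEXICON:
            # Continue with the part of the word not yet analyzed.
            for rest in analyze(word[end:]):
                analyses.append([(prefix, LEXICON[prefix])] + rest)
    return analyses


if __name__ == "__main__":
    for analysis in analyze("házakban"):
        print(" + ".join(f"{morph}/{label}" for morph, label in analysis))
```

With this toy lexicon, the word házakban receives two analyses: one built from three single morphs and one that uses the stored sequence akban. The sketch enumerates all segmentations and does not check whether adjacent morphs are compatible.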
A transformed word usually has a stem part, which relates to the base lexeme. The inflectional part denotes the modified parts of the word. The inflectional part can be subcategorized by the position of the modification: prefix,...