Abstract
Natural language processing (NLP), which began as a branch of Artificial Intelligence, is a field of computer science and linguistics concerned with the interaction between human language and computers. Major NLP tasks such as Machine Translation (MT), Information Retrieval (IR) and Summarization require extensive knowledge of the language in order to identify semantic information in text effectively. The meaning, or semantics, of a text is determined mainly by its named entities, which are the role-carrying agents in the text. The system presented here is a Named Entity (NE) identifier built using statistical methods based on linguistic grammar principles. Malayalam Named Entity Recognition (NER) is a difficult task because the words of a named entity carry no distinguishing feature such as the capitalization found in English. NER systems built for other languages are not suitable for Malayalam, since its morphology, syntax and lexical semantics differ from theirs. To test the system, documents from well-known Malayalam newspapers and magazines, containing passages from five different fields, were selected. Experimental results show that the average precision, recall and F-measure values are 85.52%, 86.32% and 85.61% respectively.
Keywords: Malayalam compound word, Finite State Transducer, Extended Conditional Random Field, Feature vector.
1. Introduction
NER is an important tool in almost all natural language processing applications, such as Information Extraction (IE), IR and Question Answering (QA) systems. Proper identification and classification of NEs is crucial and poses a major challenge to NLP researchers. The level of ambiguity in NER makes it difficult to attain human-level performance [1].
NER is the process of identifying and categorizing names in text. The NE task was first introduced as part of the MUC 6 (MUC 1995) evaluation exercise and was continued in MUC 7 (MUC 1998). This formulation of the NE task defines 7 types of NE: PERSON, ORGANISATION, LOCATION, DATE, TIME, MONEY and PERCENTAGE. NER, also known as entity identification and extraction, is a subtask of IE that seeks to locate and classify atomic elements in text into predefined categories. In the expression "named entity", the word "named" restricts the task to those entities for which one or more rigid designators, as defined by Kripke, stand for the referent [2].
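As an illustration of what "locate and classify" means in practice, the following Python sketch represents the seven MUC categories as an enumeration and each entity as a typed token span. It is not the system described in this paper: the names NEType, Entity and entity_text, the example sentence and its hand-assigned labels are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from enum import Enum


class NEType(Enum):
    """The seven NE categories listed in the MUC formulation."""
    PERSON = "PERSON"
    ORGANISATION = "ORGANISATION"
    LOCATION = "LOCATION"
    DATE = "DATE"
    TIME = "TIME"
    MONEY = "MONEY"
    PERCENTAGE = "PERCENTAGE"


@dataclass(frozen=True)
class Entity:
    """A located and classified span of tokens (hypothetical structure)."""
    start: int        # index of the first token, inclusive
    end: int          # index one past the last token, exclusive
    ne_type: NEType   # the predefined category assigned to the span


def entity_text(tokens, entity):
    """Return the surface text covered by an entity span."""
    return " ".join(tokens[entity.start:entity.end])


# Invented example sentence with hand-assigned labels (not taken from the paper's data).
tokens = ["Anita", "joined", "the", "bank", "in", "Kochi", "in", "1998"]
entities = [
    Entity(0, 1, NEType.PERSON),
    Entity(5, 6, NEType.LOCATION),
    Entity(7, 8, NEType.DATE),
]

# Pair each located span with its category label.
labelled = [(entity_text(tokens, e), e.ne_type.value) for e in entities]
print(labelled)
```

Locating an entity corresponds to choosing the start and end indices, and classifying it corresponds to choosing the NEType value. A recognizer for Malayalam would have to produce such spans without relying on capitalization cues.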
The term named entity is not strict but has to be explained in the context where it...