Abstract: Named Entity Recognition (NER) is the task of identifying all proper nouns in a document and classifying them as person names, organization names, location names, date and time expressions, or miscellaneous entities. Interest in the field has grown since the early 1990s. Most earlier work on NER targeted English, although some researchers have also worked on Hindi and on regional languages such as Telugu. The objective of our project is to identify names of Indian origin in English documents, covering the cross-linguistic aspects of text while performing NER on Indian names. The proposed system mainly distinguishes persons, organizations, locations and contact numbers in a document. The approach adopted is mainly unsupervised learning based on the feature space, and gazetteers are used to improve the results. The application is developed in C#.NET using the Visual Studio 2008 IDE.
Keywords: NER, Indian names, Text mining, Hindi names in English documents.
1. INTRODUCTION
The objective of NER is to classify all tokens in a text document into predefined classes such as person, organization, location, and miscellaneous. Evaluations at the Message Understanding Conferences (MUC) of the 1990s made it clear that, in order to reasonably extract information from documents, it is useful to first identify certain classes of information referred to in the text. The MUC organizers therefore established the Named Entity Task, in which systems attempted to identify dates, times, numerical information and names [1]. At the time, MUC was focusing on information extraction (IE) tasks, in which structured information on company and defense-related activities is extracted from unstructured text such as newspaper articles. In defining IE tasks, people noticed that it is essential to recognize information units such as names (person, organization, and location names) and numeric expressions (time, date, money, and percentages). Identifying references to these entities in text was acknowledged as one of IE's important sub-tasks and was called "Named Entity Recognition (NER)." Before the NER field was recognized in 1996, significant research had already been conducted on extracting proper names from texts. A paper published in 1991 by Lisa F. Rau [2] is often cited as the root of the field. Named Entity Recognition has remained an...




