Abstract
This paper proposes a set expansion system that automatically extracts named entities (NEs) from the Web to construct NE domain dictionaries. The purpose of a set expansion system is to expand a given partial set of objects into a more complete set; Google Sets is a representative Web-based example. The proposed system uses several seed words as initial information to collect Web pages that are likely to contain many NEs and then extracts NE candidates from the collected pages. A mutual-importance measurement technique is developed to estimate an importance score for each NE candidate, and the candidates are ranked by these scores. Real NEs can then be extracted easily from the ranked list of NE candidates. In experiments, the proposed method achieved 95.60% mean average precision (MAP) on 7 Korean NE domains and 99.98% MAP on 8 English NE domains. In particular, on the English domains the proposed system is more accurate than Google Sets.
Key Words: Set Expansion, Named Entity, Named Entity Recognition, Mutual Importance Measurement (MIM)
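For readers unfamiliar with the evaluation metric, the following is a minimal sketch of how mean average precision (MAP) can be computed over ranked NE candidate lists. It is an illustration under stated assumptions, not the authors' evaluation code: the function names and the toy data are hypothetical, and average precision is assumed to be normalized by the size of the gold NE set, which is one common convention in set expansion work.

```python
from typing import List, Set, Tuple


def average_precision(ranked: List[str], relevant: Set[str]) -> float:
    """Average precision of one ranked candidate list against a gold NE set.

    Assumption: the sum of precision values at each correct candidate is
    divided by the size of the gold set (not by the number of hits).
    """
    if not relevant:
        return 0.0
    seen: Set[str] = set()
    hits = 0
    precision_sum = 0.0
    for rank, candidate in enumerate(ranked, start=1):
        # Count each correct entity only once, even if it is listed twice.
        if candidate in relevant and candidate not in seen:
            seen.add(candidate)
            hits += 1
            precision_sum += hits / rank
    return precision_sum / len(relevant)


def mean_average_precision(domains: List[Tuple[List[str], Set[str]]]) -> float:
    """Mean of the per-domain average precision values."""
    if not domains:
        return 0.0
    return sum(average_precision(r, g) for r, g in domains) / len(domains)


# Hypothetical toy data: two domains, each with a ranked list and a gold set.
toy_domains = [
    (["seoul", "foo", "busan"], {"seoul", "busan"}),
    (["bar", "paris"], {"paris"}),
]
toy_map = mean_average_precision(toy_domains)
```

Under this convention, a ranked list that places every true NE ahead of every false candidate scores 1.0 for its domain, so a MAP close to 100% indicates that true NEs dominate the top of the ranking in nearly every domain.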
1. Introduction
Because the Web now contains a vast amount of information, it has become an important resource for information extraction. In particular, the names of persons, organizations, locations, and products provide crucial information for discriminating Web pages [1,2]; these proper nouns are known as Named Entities (NEs). However, because NEs form an open word class, new entities are constantly created, which poses a serious unknown-word problem [3,4]. In addition, many researchers have recently explored opinion detection and analysis. Opinions are expressed on the Web in various forms such as product reviews, personal blogs, and newsgroup message boards [5]. This trend has raised many interesting and challenging research topics such as subjectivity detection, semantic orientation classification, and review classification [6]. Product names, movie titles, politician names, and tourist resorts are all proper nouns and are objects of opinion mining. Although NE recognition is a key component of opinion mining, it has suffered from the difficulty of constructing an NE dictionary for each domain because there are a number of application domains of opinion...