Abstract-Extracting useful information from heterogeneous Web data is of great importance. In this paper, we propose a novel heterogeneous Web data extraction algorithm based on a modified hidden conditional random fields model. Because traditional linear-chain conditional random fields cannot effectively handle complex and heterogeneous Web data extraction, we modify the standard hidden conditional random fields in three aspects: 1) The hidden Markov model is used to calculate the hidden variables. 2) The standard hidden conditional random fields are modified in two stages. In the first stage, each training data sequence is learned using the hidden Markov model, so that the implicit variables become visible. In the second stage, the parameters are learned for a given sequence. 3) The objective functions of the hidden conditional random fields are revised, and the heterogeneous Web data are extracted by maximizing the posterior probability of the modified hidden conditional random fields. Finally, experiments are conducted to evaluate performance on two standard datasets, the "EData" dataset and the "Research Papers" dataset. Compared with existing Web data extraction methods, the proposed algorithm extracts useful information from heterogeneous Web data effectively and efficiently.
Index Terms-Hidden Conditional Random Fields; Web Data Extraction; Undirected Graph; Potential Energy Function; Hidden Variable
(ProQuest: ... denotes formulae omitted.)
I. Introduction
With the popularization of the World Wide Web, a great amount of data from different domains has become available, giving users the opportunity to benefit from useful Web data. Traditionally, users search Web data by browsing Web pages and by keyword search, which are intuitive ways of seeking data on the Web. Unfortunately, both methods have limitations. Browsing is not suitable for locating particular content, because following links is often unproductive and it is easy to get lost. Keyword search, although much more efficient than browsing, usually returns a massive amount of data, which is beyond users' processing ability. Therefore, although Web data are publicly and readily available, it is hard to extract useful information from them [1].
To extract useful information...





