Abstract

This paper proposes a novel heterogeneous Web data extraction algorithm based on a modified hidden conditional random fields (HCRF) model. Because traditional linear-chain conditional random fields cannot effectively handle complex, heterogeneous Web data extraction, the author modifies the standard HCRF model in three respects: a hidden Markov model (HMM) is used to calculate the hidden variables, and training is split into two stages. In the first stage, each training sequence is learned with the HMM, which makes the hidden variables visible. In the second stage, the model parameters are learned for a given sequence. Finally, performance is evaluated experimentally on two standard datasets, the "EData dataset" and the "Research Papers dataset". Compared with existing Web data extraction methods, the proposed algorithm extracts useful information from heterogeneous Web data effectively and efficiently.
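
The abstract does not say how the hidden Markov model makes the hidden variables visible in the first stage. One common way to do this, assumed here purely for illustration, is Viterbi decoding, which assigns the most likely hidden state to each token of a sequence. The sketch below is not the author's implementation; the state names, token types, and all probabilities are hypothetical toy values.

```python
import math

def viterbi(obs, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for an observation sequence."""
    # best[t][s]: log-probability of the best path that ends in state s at step t
    best = [{s: math.log(start_p[s]) + math.log(emit_p[s][obs[0]]) for s in states}]
    # back[t][s]: predecessor of s on that best path
    back = [{}]
    for t in range(1, len(obs)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((p, best[t - 1][p] + math.log(trans_p[p][s])) for p in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + math.log(emit_p[s][obs[t]])
            back[t][s] = prev
    # Trace back from the best final state to recover the full path.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(obs) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Hypothetical toy model: two hidden field labels, two observable token types.
states = ("title", "price")
start_p = {"title": 0.8, "price": 0.2}
trans_p = {
    "title": {"title": 0.6, "price": 0.4},
    "price": {"title": 0.3, "price": 0.7},
}
emit_p = {
    "title": {"word": 0.9, "number": 0.1},
    "price": {"word": 0.2, "number": 0.8},
}

path = viterbi(["word", "word", "number"], states, start_p, trans_p, emit_p)
print(path)  # ['title', 'title', 'price']
```

Once every token carries an explicit state, a second stage could treat those states as observed when fitting the conditional model's parameters. That step is not sketched here, since the abstract gives no detail on it.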

Copyright Academy Publisher Apr 2014