Content area
In this paper, the author proposes a novel heterogeneous Web data extraction algorithm using a modified hidden conditional random fields model. Considering the traditional linear chain based conditional random fields can not effectively solve the problem of complex and heterogeneous Web data extraction, the author modifies the standard hidden conditional random fields in three aspects, which are using the hidden Markov model to calculate the hidden variables and modifying the standard hidden conditional random fields through two stages. In the first stage, each training data sequence is learned using hidden Markov model, and then implicit variables can be visible. In the second stage, parameters can be learned for a given sequence. Finally, experiments are conducted to make performance evaluation on two standard datasets -- "EData dataset" and "Research Papers dataset". Compared with the existing Web data extraction methods, it can be seen that the proposed algorithm can extract useful information from heterogeneous Web data effectively and efficiently.
Abstract-As it is of great importance to extract useful information from heterogeneous Web data, in this paper, we propose a novel heterogeneous Web data extraction algorithm using a modified hidden conditional random fields model. Considering the traditional linear chain based conditional random fields can not effectively solve the problem of complex and heterogeneous Web data extraction, we modify the standard hidden conditional random fields in three aspects, which are 1) Using the hidden Markov model to calculate the hidden variables, 2) Modifying the standard hidden conditional random fields through two stages. In the first stage, each training data sequence is learned using hidden Markov model, and then implicit variables can be visible. In the second stage, parameters can be learned for a given sequence. (3) The objective functions of hidden conditional random fields are revised, and the heterogeneous Web data are extracted by maximizing the posterior probability of the modified hidden conditional random fields. Finally, experiments are conducted to make performance evaluation on two standard datasets-"EData dataset and "Research Papers dataset". Compared with the existing Web data extraction methods, it can be seen that the proposed algorithm can extract useful information from heterogeneous Web data effectively and efficiently.
Index Terms-Hidden Conditional Random Fields; Web Data Extraction; Undirected Graph; Potential Energy Function; Hidden Variable
(ProQuest: ... denotes formulae omitted.)
I. Introduction
With the popularization of the World Wide Web, a great number of data from different domains have become available. Hence, the popularity of the World Wide Web provides opportunities for users to benefit from the useful Web data. In the traditional modes, users search Web data by browsing Web page and searching by keyword, which are intuitive forms of seeking data from the Web. Unfortunately, the above searching methods have several limitations. The browsing behavior is not suitable to locate particular contents of the Web data, the reasons lie in that following links does not make sense and it is easy to get lost. On the other hand, although searching by keyword is much efficient than browsing behavior, it usually returns massive data, which is beyond users' processing ability. Therefore, in spite of being publicly and readily available, it is hard to extract useful information from the Web data [1].
To extract useful information from Web data more effectively, many existing methods resorted to the ideas from the research domain of database area. As is well known that the data in database are belonged to structured data, hence, traditional database techniques can not used in Web data extraction. There it is very important to extract useful information from the unstructured and semi-structured Web data, and then to populate databases for further data process [6-10].
Structured data in Web pages usually include important information. Such web data are often obtained from underlying databases and displayed in Web pages using fixed templates. Particularly, the structured data objects are denoted as data records. Extracting data records enables one to integrate data from multiple Web sites and pages to provide value-added services [2]. As is shown in Fig. 1 (a), an E-commerce website is used as an example, and information of three kinds of iPhone products are illustrated. In Fig.l (b), a Web page segment containing a data table is given, of which each table row is a data record. However, seeking and organizing Web data is not easy, because traditional database techniques can not be utilized directly in Web data extraction [3].
In the research field of heterogeneous Web data extraction, hidden conditional random fields are powerful technologies, which could model the state sequence as being conditionally Markov given the observation sequence. Hidden conditional random fields are designed based on Conditional random fields which can be used in many complex computing tasks, and have been widely exploited in the field of intelligent information processing [4] .
However, there are some limitations in Conditional random fields, for example, Conditional random fields cannot capture intermediate structures using hidden-state variables. Therefore, we use Hidden conditional random fields in heterogeneous Web data extraction. Hidden conditional random fields utilize the intermediate hidden variables to construct the latent structure of the input data [5] .
The main innovations of this paper lie in the following aspects:
(1) The proposed algorithm uses the hidden Markov model to calculate the hidden variables more accurately and modified the standard hidden conditional random fields through two stages. In the first stage, each training data sequence is learned using hidden Markov model, and then implicit variables can be visible. In the second stage, for a given sequence, parameters can be learned.
(2) The objective function of hidden conditional random fields is modified in our algorithm, and the heterogeneous Web data are extracted by maximizing the posterior probability of the modified hidden conditional random fields.
The rest of the paper is organized as the following sections. Section 2 introduces the related works. Section 3 illustrates the proposed scheme for heterogeneous Web data extracting by modified hidden conditional random fields. In section 4, a series of experiments are designed and conducted to make performance evaluation. Finally, we conclude the whole paper in section 5.
II. Related Works
In this section, we will survey related works of this paper in two aspects, which include 1) Web data extraction related works and 2) applications of Hidden conditional random fields.
Su et al. presented a data extraction system based on the browser and the Web services resource framework. This framework is designed based on the interaction of DE system, which regulates the access and the operation state of the Web service specification. The coupling interaction DE system can provide users with packaging status, and the status of the Web service (named Web service resources). The interactive DE system can accept parameters from the application in the extraction process by notification WSRF design pattern as well [11].
Kayed et al. regarded the data extraction problem as the page generated decoding process based on structured data and tree templates. Furthermore, the authors propose an unsupervised, page-level data extraction method. In this method, each individual site is derived from deep architecture and a template is proposed to contain a single or multiple data records in a page. The FiVaTech system applied tree matching, tree alignment, and data mining to implement an effective computing process [12].
Lin et al. proposed a novel method to acquire collocations from the Web. Three classical lexical association measures (co-occurrence frequency, mutual information and t-test) are used to automatically extract collocation. Based on the experimental results, the benchmarks indicate that superior performance of this new Web-based approach in both high precision and recall [13],
Cafarella et al. proposed three typical extraction systems which can be implement on most of the websites. Particularly, the TEXTRUNNER system is designed based on natural language texts, and the WEBTABLES system focuses on HTML-embedded tables. Moreover, in this paper, several unique data applications are discussed which are enabled by aggregating extracted Web data [14].
Zhu et al. illustrated a paradigm of hierarchical model which is used to achieve an integrated Web data extraction. Furthermore, this model model is named dynamic hierarchical Markov random field (DHMRFs). Considering the structural uncertainty, in DHMRFs, the joint distribution, model structure and class label are defined. Particularly, joint distribution refers to exponential family distribution. As a powerful model, DHMRFs relax the indicator model through the independence assumption. Hence the exact inference is intractable, a variational approach is proposed to construct a parameter learning model [15].
Afterwards, the applications of hidden conditional random fields are given as follows.
Indio et al. developed a system named TPpred, which is a novel predictor of organelle-targeting peptides based on Grammatical-Restrained Hidden Conditional Random Fields. The proposed system is trained on a nonredundant dataset of proteins where the presence of a target peptide was experimentally validated [16].
Wang et al. proposed a hidden conditional random field model with independent component analysis mixture feature functions for event classification for videos. Hidden conditional random field model refers to a discriminative model without conditional independence assumption of observations, and can be used in the application of video analysis as well. [17].
Qian et al. utilized hidden conditional random field to model the observations of mid-level semantics of an event clip for event detection. Comparisons are made with the dynamic Bayesian networks, hidden Markov model, enhanced HMM and conditional random fieldbased event detection approaches [18]. Other research works about hidden conditional random fields' application please refer to [19] and [20]
III. The Proposed Scheme
Conditional random fields belong to statistical modeling methods, which are widely utilized in pattern recognition and information retrieval. On the other hand, conditional random fields is a discriminative undirected probabilistic graphical model. Particularly, Conditional random fields can be used to find applications in shallow parsing, named entity recognition and gene finding and so on. As is well known that Conditional random fields is an alternative model to hidden Markov models. The framework of conditional random fields is shown in Fig.2.
For the undirected graph G = (V,E) , V and E represent the nodes and edges of the graph G respectively, z refers to the data which are required to be observed, y1 denotes a random variable, and the value of ? is chosen from a labeled set, Moreover, y = [yl,y2,...,yk]T is satisfied. If y'(i e[l,k]) meets the Markov properties, (y,z) is called conditional random fields. In the theory of conditional random fields, for a given z which is observed, the distribution of label sequence y meets the following equation.
... (1)
where the parameter Z should be normalized, and Nj refers to the neighbor node of G . Next, ^s(yj,z) denotes the probability of the node j which is labeled by yj according to the observing information, and (y', y1 |z) refers to the dependency degree between node j and its neighbors.
However, the traditional linear chain based conditional random fields can not effectively model complex and heterogeneous Web data. Hence, in this paper, we proposed modified hidden conditional random fields and efficient Web data extraction algorithm to implement a heterogeneous Web data extraction process with high accuracy. The framework of the standard hidden conditional random fields is shown in Fig. 3
Hidden conditional random fields make a judgement according to the input sequence y belonged to the category of z. y refers to local observation vector, and y = {yl,y2,...,yt} is satisfied. Particularly, y' denotes a feature vector. For a given sequence y , there may existing a related vector h = {h',h2,-,ht} , h'eH , where H means all the set which includes all possible hidden states. The modeling process for observed data using hidden conditional random fields is represented by the following equation.
... (2)
where the potential energy function (/)() with parameter 0 and co is used to represent the influence degree between hidden state structure, observing data and data category, and parameter co refers to the range of the window. 0Q is calculated by the following equation.
... (3)
In Eq. 3, y refers to a chain structure based graph, of which each node corresponds to a hidden state of time t, and (p(y, j,co) represents the arbitrary feature of the observing window. The key step of the hidden conditional random fields' modeling process is the model training process and parameter learning. The parameter 0 is equal to [0e,0z,0h] . 0e is used to measuring the compatibility degree between the continuous state j , k, and the variable z. 0z refers to the compatibility degree between state j and the variable z and 0h is the parameter of the state hj ( hj g H ). The training process of the above model utilizes the following objective function.
... (4)
where parameter n is the total number of sequences in the training set. Parameters of hidden conditional random fields can be estimated by the following formula.
... (5)
... (6)
where L represents the label of the observing data, and it can be deduced by Eq.7 as follows.
... (7)
However, the difficulties of extracting the heterogeneous Web data directly using hidden conditional random field lie in that the proposed model depends on the initial choice of parameters severely. Therefore, in order to improve the accuracy of Web data extraction, we propose a modified hidden conditional random fields model. The main idea is to use the hidden Markov model to calculate the hidden variables more accurately.
The main modifications of the hidden conditional random fields are implemented through two stages. In the first stage, each training data sequence is learned using hidden Markov model, and then implicit variables can be visible. A particular hidden Markov class can be solved by the following equation.
... (8)
After the implicit variables becoming visible, in the second stage, for a given sequence y, the parameter 0' can be learned from the above process. Next, the category of which the test sequence y is belonged to can be obtained by the following equation.
... (9)
Afterwards, the objective function of hidden conditional random fields is modified as Eq.10.
... (10)
Based on the above analysis, the heterogeneous Web data (denoted as z ) can be extracted from the websites by maximizing the posterior probability of the modified hidden conditional random fields.
... (11)
IV. Experiments
A Dataset and Performance Evaluation Metric
In this paper, two standard datasets are used, which are 1) EData dataset [6] and 2) Research Papers dataset [23].
We use the EData dataset [6] to testify the performance of the Web data extraction. The EData dataset is created to test the precision of Web data extraction on online advertising. As is illustrated in paper [6], EData is made up of 18,000 online advertisements, which were extracted from Craigslist, Ebay, and KSL. The advertisements in EData are uniformly distributed among the eight chosen domains, which are Cars-for-Sale, Computer Science Jobs, Food Coupons, Furniture, Houses-for-Sale, Jewelry, Motorcycles-for-Sale, and Musical Instruments, and there are 750 advertisements in each of the eight advertisements domains extracted from each of the three advertisements websites. The advertisements domains in EData vary in terms of their 1) diversity, which include advertisements in jobs, transportation, food, housing, and entertainment that are essential to our daily lives, 2) advertisements sizes, from arbitrary long advertisements to relatively short ones, and 3) word distribution, different word usage in closely related advertisements which are similar in contents and nature. Moreover, advertisements in EData were extracted from various online data with different data structures, which can be utilize to test the performance of Web data extraction approach.
Research Papers dataset [23] is made up of the headers of research papers. The header of a research paper is defined to be all of the words from the beginning of the paper, usually the introduction, or to the end of the first page, whichever occurs first. We randomly select 500 for training and the other ones for testing. This sub-dataset is denoted as RP.
To evaluate the performance of Web data extraction, we use Recall, Precision and FI. For Web data extraction dataset i (denoted as di ), the number of information we extract in dl is Nf(di) and the number of ground truth information in di is Ng (di ). The precision and recall of di is defined as follows.
... (12)
... (13)
To obtain the weighted harmonic mean of precision and recall, the FI metric is used as well.
... (14)
Recall measures the accuracy of returning ground-truth information which are extracted from Web data, while Precision assesses the ability of excluding false positives. FI metric can calculate the fitness of ground-truth and detect information extracted from Web data by jointly considering recall and precision.
B. Experimental Results and Analyses
Firstly, we test the performance of the proposed algorithm on EData dataset, and eight kinds of online advertisements are utilized, including: "Cars", "Food", "Furniture", "Houses", "Jewelry", "CS Jobs", "Motorcycles", and "Musics". Particularly, the average of all the above kinds of online advertisements is given as well. As is shown in Fig.5, we can see that the average precision, recall and FI values are all larger than 0.8. Therefore, the proposed algorithm can effectively extract useful online advertisements information from EData dataset.
Afterwards, to make performance evaluation more objective, some existing information extraction methods are compared with the proposed method, which are: 1) ADEx [6], 2) WDE-SS [21], 3) DHMRFs [22], 4) CRFs [23], 5) 2D-CRFs [24]. Then, the precision, recall and FI are utilized to make performance evaluation, and the experimental results are shown in Fig.6, from which we find that the proposed algorithm performs better than other methods.
In the following parts, we will compare the training time and testing time for the above methods under EData dataset utilizing the same PC. All the experiments are conducted on the PC with Intel Corel i7 processer, the main frequency of this CPU we used is 2.9GHz. The capacity of the internal memory we chosen is the 4GB, and the hard disk we utilized is 1.0TB. Based on the above hardware settings, the algorithm running time are shown in Fig 7.
As is shown in Table. 1, it can be seen that under the dataset RP, the proposed algorithm performs significantly better than other methods.
Based on the above experimental results, in can be seen that the proposed heterogeneous Web data extraction algorithm is effective both in the system performance and in the system efficiency. The reasons lie in the following aspects:
(1) The proposed algorithm utilizes the hidden Markov model to calculate the hidden variables more accurately and modified the standard hidden conditional random fields through two stages. Particularly, the objective function of hidden conditional random fields is modified, and the heterogeneous Web data are extracted by maximizing the posterior probability of the modified hidden conditional random fields. Therefore, the proposed algorithm can effectively extract useful information from the heterogeneous Web data with high accuracy.
(2) ADEx is currently designed to extract data from advertisements that include a single product/service in an ad. Particularly, ADEx should be enhanced to solve any online advertisements that include multiple products, such as in video games advertisements. Furthermore, even though ADEx currently handles advertisements from various domains, it is less accurate in distinguishing ads in closely related domains, such as advertisements on different means of transportation, or advertisements that advertise professional jobs that cross multiple disciplines, such as Bioinformatics. Therefore, ADEx should be extended to discern advertisements with closely related domains.
(3) The approach of WDE-SS only considers the structure of Web documents. How to combine semantic information in documents to further improve the proposed framework is to be deeply studied. As WDE-SS requires to parse documents into trees in memory, it is not efficient on very large data sets.
(4) CRFs denotes investigates the issues of regularization, feature spaces, and efficient use of unsupported features in CRFs, with an application to information extraction from research papers. Conditional random fields (CRFs) are undirected graphical models trained to maximize a conditional probability. However, fundamental advances in regularization for CRFs should be studied in future research works.
(5) 2D-CRFs refers to a two-dimensional Conditional Random Field model, which provides a new way of incorporating two-dimensional neighborhood depend encies to improve the performance of Web information extraction. By marginalizing variables progressively along the diagonals, efficient parameter learning and labeling can be performed. When the proposed model is applied to product information extraction, significant improvements are achieved compared with linear-chain CRF models.
V. Conclusions
In this paper, we present a heterogeneous Web data extraction algorithm based on a modified hidden conditional random fields model. The standard hidden conditional random fields are modified by three aspects. Firstly, we utilize the hidden Markov model to calculate the hidden variables. Secondly we modify the standard hidden conditional random fields through two stages. In the first stage, each training data sequence is learned using hidden Markov model, and then implicit variables can be visible. In the second stage, parameters can be learned for a given sequence. Thirdly, the objective function of hidden conditional random fields is revised, and the heterogeneous Web data can be obtained by maximizing the posterior probability of the modified hidden conditional random fields.
References
[1] Laender AHF, Ribeiro-Neto BA, da Silva AS, "A brief survey of Web data extraction tools", SIGMOD Record, 2002, 31(2) pp. 84-93.
[2] Zhai Yanhong, Liu Bing, "Structured data extraction from the web based on partial tree alignment", IEEE Transactions on Knowledge and Data Engineering, 2006, 18(12) pp. 1614-1628.
[3] Lage JP, da Silva AS, Golgher PB, "Automatic generation of agents for collecting hidden Web pages for data extraction", Data & Knowledge Engineering, 2004, 49(2) pp. 177-196.
[4] Chen Shi-Wen, Wu Jiang-Xing, Ye Xiao-Long, Guo Tong, "Distributed denial of service attacks detection method based on conditional random fields", Journal of Networks, 2013, 8(4) pp. 858-865.
[5] Quattoni Ariadna, Wang Sybor, Morency Louis-Philippe, "Hidden conditional random fields", IEEE Transactions on Pattern Analysis and Machine Intelligence, 2007, 29(10) pp. 1848-1853.
[6] Pera Maria S., Qumsiyeh Rani, Ng Yiu-Kai, "Web-based closed-domain data extraction on online advertisements", Information Systems, 2013, 38(2) pp. 183-197.
[7] Furche Tim, Gottlob Georg, Grasso Giovanni, "OXPATH: A language for scalable data extraction, automation, and crawling on the deep web", VLDB Journal, 2013, 22(1) pp. 47-72.
[8] Hong Jer Lang, "Data Extraction for Deep Web Using WordNet", IEEE Transactions on Systems Man and Cybernetics Part C-applications and Reviews, 2011, 41(6) pp.854-868.
[9] Chen Kerui, Zuo Wanli, He Fengling, "Data Extraction and Annotation Based on Domain-specific Ontology Evolution for Deep Web", Computer Science and Information Systems, 2011, 8(3) pp. 673-692.
[10] Liu Wei, Meng Xiaofeng, Meng Weiyi, "ViDE: A Vision-Based Approach for Deep Web Data Extraction", IEEE Transactions on Knowledge and Data Engineering, 2010, 22(3) pp. 447-460.
[11] Su Jui-Yuan, Chen Lung-Pin, Wu I-Chen, "A Loosely Coupled Interactive Web Data Extraction System", Journal of Internet Technology, 2010, 11(2) pp. 237-249.
[12] Kayed Mohammed, Chang Chia-Hui, "FiVaTech: Page-Level Web Data Extraction from Template Pages", IEEE Transactions on Knowledge and Data Engineering, 2010, 22(2) pp. 249-263.
[13] Lin Jianfang, Li Sheng, Cai Yuhan, "Collocation Extraction Using Web Feedback Data", Chinese Journal Of Electronics, 2009,18(2) pp. 312-316.
[14] Cafarella Michael J., Madhavan Jayant, Halevy Alon, "Web-Scale Extraction of Structured Data", Sgmod Record, 2008, 37(4) pp. 55-61.
[15] Zhu Jun, Nie Zaiqing, Zhang Bo, "Dynamic Hierarchical Markov Random Fields for integrated web data extraction", Journal of Machine Learning Research, 2008, 9 pp. 1583-1614.
[16] Indio Valentina, Martelli Pier Luigi, Savojardo Castrense, "The prediction of organelle-targeting peptides in eukaryotic proteins with Grammatical-Restrained Hidden Conditional Random Fields", Bioinformatics, 2013, 29(8) pp. 981-988.
[17] Wang Xiaofeng, Zhang Xiao-Ping, "An ICA Mixture Hidden Conditional Random Field Model for Video Event Classification", IEEE Transactions on Circuits and Systems for Video Technology, 2013, 23(1) pp. 46-59.
[18] QianX., Hou X., Tang, Y. Y., "Hidden conditional random field-based soccer video events detection", IET Image Processing, 2012, 6(9) pp. 1338-1347.
[19] Bousmalis Konstantinos, Zafeiriou Stefanos, Morency Louis-Philippe, "Infinite Hidden Conditional Random Fields for Human Behavior Analysis", IEEE Transactions on Neural Networks and Learning Systems, 2013, 24(1) pp. 170-177.
[20] Zhang Jianguo, Gong Shaogang, "Action categorization with modified hidden conditional random field", Pattern Recognition, 2010, 43(1) pp. 197-203.
[21] Li Zhao, Ng Wee Keong,Sun Aixin, "Web data extraction based on structural similarity, Knowledge and Information Systems, 2005, 8(4) pp. 438-461.
[22] Zhu Jun, Nie Zaiqing, Zhang Bo, "Dynamic Hierarchical Markov Random Fields for integrated web data extraction", Journal of Machine Learning Research, 2008, 9 pp. 1583-1614.
[23] Peng Fuchun, McCallum Andrew, "Information extraction from research papers using conditional random fields", Information Processing & Management, 2006, 42(4) pp. 963-979.
[24] Zhu Jun, Nie Zaiqing, Wen Ji-Rong, Zhang Bo, Ma Wei-Ying, "2D conditional random fields for web information extraction", Proceedings of the 22nd international conference on Machine learning, 2005, pp. 1044-1051.
Cheng Cui
Department of Information Engineering Jilin Police College, Changchun, China
Email: [email protected]
Copyright Academy Publisher Apr 2014