Abstract

Deep Web crawling refers to the problem of traversing the collection of pages in a deep Web site, which are dynamically generated in response to a particular query submitted through a search form. To achieve this, crawlers need features that go beyond merely following links, such as the ability to automatically discover search forms that are entry points to the deep Web, fill in such forms, and follow certain paths to reach the deep Web pages with relevant information. Current surveys of the state of the art in deep Web crawling do not provide a framework for comparing the most up-to-date proposals across all the aspects involved in the deep Web crawling process. In this article, we propose a framework that analyses the main features of existing deep Web crawling-related techniques, including the most recent proposals, and provides an overall picture of deep Web crawling, including novel features that previous surveys had not analysed. Our main conclusion is that crawler evaluation is an immature research area due to the lack of a standard set of performance measures and of a benchmark or publicly available dataset for evaluating crawlers. In addition, we conclude that future work in this area should focus on devising crawlers that deal with ever-evolving Web technologies and on improving crawling efficiency and scalability, in order to create effective crawlers that can operate in real-world contexts.

Details

Title
Deep Web crawling: a survey
Author
Hernández, Inma 1; Rivero, Carlos R 2; Ruiz, David 1

1 Department of Languages and Computer Systems, University of Seville, Seville, Spain
2 Department of Computer Science, Rochester Institute of Technology, Rochester, NY, USA
Pages
1577-1610
Publication year
2019
Publication date
Jul 2019
Publisher
Springer Nature B.V.
ISSN
1386-145X
e-ISSN
1573-1413
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
2050178797
Copyright
World Wide Web is a copyright of Springer, (2018). All Rights Reserved.