Deep Web crawling: a survey

Abstract

Deep Web crawling refers to the problem of traversing the collection of pages in a deep Web site, which are dynamically generated in response to a particular query that is submitted using a search form. To achieve this, crawlers need to be endowed with some features that go beyond merely following links, such as the ability to automatically discover search forms that are entry points to the deep Web, fill in such forms, and follow certain paths to reach the deep Web pages with relevant information. Current surveys that analyse the state of the art in deep Web crawling do not provide a framework that allows comparing the most up-to-date proposals regarding all the different aspects involved in the deep Web crawling process. In this article, we propose a framework that analyses the main features of existing deep Web crawling-related techniques, including the most recent proposals, and provides an overall picture regarding deep Web crawling, including novel features that to the present day had not been analysed by previous surveys. Our main conclusion is that crawler evaluation is an immature research area due to the lack of a standard set of performance measures, or a benchmark or publicly available dataset to evaluate the crawlers. In addition, we conclude that the future work in this area should be focused on devising crawlers to deal with ever-evolving Web technologies and improving the crawling efficiency and scalability, in order to create effective crawlers that can operate in real-world contexts.
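The first two capabilities named in the abstract, discovering search forms and filling them in, can be illustrated with a minimal sketch. It is not taken from the surveyed proposals. It assumes static HTML, GET forms only, and a deliberately naive heuristic (any form with at least one named, user-fillable field counts as a candidate). The names SearchFormFinder, find_search_forms and build_query_url are hypothetical, introduced only for this illustration.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode, urljoin

# Field types a user does not type into; skipped by this naive heuristic (assumption).
NON_FILLABLE = {"submit", "button", "hidden", "reset", "image"}


class SearchFormFinder(HTMLParser):
    """Collects <form> elements and the names of their fillable fields."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per completed form
        self._current = None  # form currently being parsed, if any

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {
                "action": attrs.get("action") or "",
                "method": (attrs.get("method") or "get").lower(),
                "fields": [],
            }
        elif self._current is not None and tag in ("input", "select", "textarea"):
            field_type = (attrs.get("type") or "text").lower()
            name = attrs.get("name")
            if name and field_type not in NON_FILLABLE:
                self._current["fields"].append(name)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def find_search_forms(html):
    """Return candidate entry points: forms with at least one fillable field."""
    parser = SearchFormFinder()
    parser.feed(html)
    parser.close()
    return [form for form in parser.forms if form["fields"]]


def build_query_url(base_url, form, values):
    """Fill a GET form with the given values and return the resulting query URL."""
    target = urljoin(base_url, form["action"])
    query = urlencode({name: values.get(name, "") for name in form["fields"]})
    return f"{target}?{query}"


if __name__ == "__main__":
    page = (
        '<form action="/search" method="get">'
        '<input type="text" name="q">'
        '<input type="submit" value="Go">'
        "</form>"
    )
    forms = find_search_forms(page)
    print(build_query_url("http://example.com/index.html", forms[0], {"q": "deep web"}))
```

A real deep Web crawler would also need to handle POST forms, script-generated forms, and the choice of meaningful values for each field, which this sketch leaves out.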

Details

Title

Deep Web crawling: a survey

Author

Hernández, Inma¹; Rivero, Carlos R²; Ruiz, David¹

¹ Department of Languages and Computer Systems, University of Seville, Seville, Spain
² Department of Computer Science, Rochester Institute of Technology, Rochester, NY, USA

Pages

1577-1610

Publication year

2019

Publication date

Jul 2019

Publisher

Springer Nature B.V.

ISSN

1386-145X

e-ISSN

1573-1413

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1007/s11280-018-0602-1

ProQuest document ID

2050178797
