SmartCrawler: A Three-Stage Ranking Based Web Crawler for Harvesting Hidden Web Sources

Abstract

Web crawlers have evolved from the modest task of collecting statistics to security testing, web indexing, and numerous other applications. The size and dynamism of the web make crawling an interesting and challenging task. Researchers have tackled various issues and challenges related to web crawling; one such issue is efficiently discovering hidden web data. The inability of web crawlers to work with form-based data, together with the lack of benchmarks and standards for both performance measures and evaluation datasets, keeps this an immature research domain. Applications such as vertical portals and data integration require hidden web crawling. Most existing methods are based on returning the top k matches, which makes exhaustive crawling difficult: highly ranked documents are returned multiple times, while low-ranked documents have slim chances of being retrieved. Discovering hidden web sources and ranking them by relevance is a core component of hidden web crawlers. Ranking bias, heuristic approaches, and saturation of the ranking algorithm have led to low coverage. This research presents an enhanced ranking algorithm based on the triplet formula for prioritizing hidden websites to increase the coverage of the hidden web crawler.

Details

Title

SmartCrawler: A Three-Stage Ranking Based Web Crawler for Harvesting Hidden Web Sources

Author

Kaur, Sawroop; Singh, Aman; Geetha, G; Mehedi Masud; Alzain, Mohammed A

Pages

2933-2948

Section

ARTICLE

Publication year

2021

Publication date

2021

Publisher

Tech Science Press

ISSN

1546-2218

e-ISSN

1546-2226

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.32604/cmc.2021.019030

ProQuest document ID

2568299500

© 2021. This work is licensed under https://creativecommons.org/licenses/by/4.0/ (the “License”). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
