© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Efficient data extraction from the ever-expanding web, including structured and unstructured sources such as newspaper databases, is critical for fields such as media, research, and journalism. Traditional web crawlers, which are primarily rule-based or keyword-driven, struggle with adaptability, semantic understanding, and real-time responsiveness when faced with the diverse data formats and layouts found in newspaper archives. This research proposes WISE (Web-Intelligent Semantic Extractor), a deep learning-based framework designed to overcome these challenges. By integrating Natural Language Processing (NLP) and neural networks, WISE extracts contextually relevant information from dynamic newspaper databases, improving both the accuracy and the efficiency of data retrieval. The system dynamically adjusts its crawling strategy based on content semantics, learning patterns from diverse data sources to enhance relevance and reduce noise. In experimental evaluations on benchmark datasets and live online environments, WISE outperformed conventional rule-based, keyword-driven, and non-semantic crawlers by 35% in extraction accuracy and 40% in processing efficiency. Evaluated on the News Articles Classification Dataset (Kaggle) and real-time newspaper sources, WISE showed strong scalability, contextual accuracy, semantic understanding, and real-time flexibility across a variety of online scenarios. The framework outperforms these baselines in extracting structured data from heterogeneous sources while maintaining scalability and security. This work presents a novel, intelligent solution designed to meet the challenges of modern web environments.

Details

Title
AI driven web crawling for semantic extraction of news content from newspapers
Author
S, Saravanan 1; A K, Ashfauk Ahamed 2

1 Department of Computer Science and Engineering, B.S.Abdur Rahman Crescent Institute of Science & Technology, 600 048, Chennai, India (ROR: https://ror.org/01fqhas03) (GRID: grid.449273.f) (ISNI: 0000 0004 7593 9565)
2 Department of Computer Applications, B.S.Abdur Rahman Crescent Institute of Science & Technology, 600 048, Chennai, India (ROR: https://ror.org/01fqhas03) (GRID: grid.449273.f) (ISNI: 0000 0004 7593 9565)
Pages
41673
Section
Article
Publication year
2025
Publication date
2025
Publisher
Nature Publishing Group
e-ISSN
20452322
Source type
Scholarly Journal
Language of publication
English
ProQuest document ID
3275323990