
Abstract

Efficient data extraction from the ever-expanding web, including structured and unstructured sources such as newspaper databases, is critical for industries like media, research, and journalism. Traditional web crawlers, which are primarily rule-based or keyword-driven, struggle with adaptability, semantic understanding, and real-time responsiveness when faced with the diverse data formats and layouts found in newspaper archives. This research proposes WISE (Web-Intelligent Semantic Extractor), a deep learning-based framework designed to overcome these challenges. By integrating Natural Language Processing (NLP) and neural networks, WISE extracts contextually relevant information from dynamic newspaper databases, improving both the accuracy and the efficiency of data retrieval. The system dynamically adjusts its crawling strategy based on content semantics, learning patterns from diverse data sources to enhance relevance and reduce noise. In experimental evaluations conducted on benchmark datasets and live online environments, WISE outperforms conventional rule-based, keyword-driven, and non-semantic crawlers by 35% in extraction accuracy and 40% in processing efficiency. Evaluated on the News Articles Classification Dataset (Kaggle) and real-time newspaper sources, WISE showed exceptional scalability, contextual accuracy, semantic understanding, and real-time flexibility across a variety of online scenarios. The framework demonstrates superior performance in extracting structured data from heterogeneous sources while maintaining scalability and security. This work presents a novel, intelligent solution designed to meet the challenges of modern web environments.

Details

Title
AI driven web crawling for semantic extraction of news content from newspapers
Author
S, Saravanan 1; A K, Ashfauk Ahamed 2

1 Department of Computer Science and Engineering, B.S.Abdur Rahman Crescent Institute of Science & Technology, 600 048, Chennai, India (ROR: https://ror.org/01fqhas03) (GRID: grid.449273.f) (ISNI: 0000 0004 7593 9565)
2 Department of Computer Applications, B.S.Abdur Rahman Crescent Institute of Science & Technology, 600 048, Chennai, India (ROR: https://ror.org/01fqhas03) (GRID: grid.449273.f) (ISNI: 0000 0004 7593 9565)
Volume
15
Issue
1
Pages
41673
Number of pages
20
Publication year
2025
Publication date
2025
Section
Article
Publisher
Nature Publishing Group
Place of publication
London
Country of publication
United States
Publication subject
e-ISSN
20452322
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
 
 
Online publication date
2025-11-24
Milestone dates
2025-10-22 (Registration); 2025-08-04 (Received); 2025-10-22 (Accepted)
First posting date
2025-11-24
ProQuest document ID
3275323990
Document URL
https://www.proquest.com/scholarly-journals/ai-driven-web-crawling-semantic-extraction-news/docview/3275323990/se-2?accountid=208611
Copyright
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-11-26
Database
ProQuest One Academic