AI driven web crawling for semantic extraction of

Introduction

Background

Traditional web crawlers have difficulty adapting to the diverse layouts and formats found in online newspaper databases, which often include news articles, images, tables, and archived content¹. These sources are more complex than typical e-commerce or research sites because they present a blend of structured and unstructured data in formats such as HTML, PDF, and plain text². Existing crawlers, which are primarily rule-based, therefore struggle to retrieve and interpret this multifaceted data effectively³. They function well in predictable, structured settings but often have trouble with the complexity and variety of modern web content, especially dynamic pages, unstructured data, and semantic ambiguity⁴. As a result, crawling is inefficient: unnecessary data is collected, and the system cannot respond to changes in content⁵.

Newspaper databases present a range of challenges that standard crawlers cannot accommodate, including unstructured content, complicated layouts, and many structural variations⁶. Newspaper archives also require crawlers to deal with irrelevant content, such as advertisements and navigation menus⁷. A further consideration is timeliness: a news item that is present when the crawler is deployed may be altered or removed later⁸. For instance, the same URL on a newspaper website may later carry newly published articles, or its archived content may be removed or altered, which places a further burden on the crawler⁹. The proposed framework aims to make structured data retrieval more relevant, timely, and accurate across a wide range of constantly changing online settings¹⁰.

The WISE framework, an intelligent web crawling system that uses natural language processing and deep learning to navigate around the limitations of conventional crawlers, is presented in the manuscript. Semantic understanding and adaptive learning are used in WISE to extract structured and contextually relevant information from various and constantly changing online newspaper sources.

Problem statement

Standard web crawlers perform well in ideal environments but are ineffective at extracting data from newspaper databases. They rely on a static, keyword-based approach that does not capture the semantic meaning of the content being crawled. Because they cannot evaluate the meaning of content in real time, extraction is inefficient and may be inaccurate. WISE uses deep learning and NLP techniques to provide an intelligent, context-aware solution that addresses both problems, crawling and extraction, for newspaper databases.

Contributions of this paper

The three key contributions of this paper are:

The article presents the proposed WISE framework, a smart web crawler developed specifically to extract data from newspaper databases, using deep learning and natural language processing to recognize contextual relevance within dynamic and unstructured news content.
The WISE framework improves precision in data extraction, shortens data processing time, and adapts in real time to changes in newspaper content so that the most relevant content is retrieved.
WISE provides a smarter, more scalable solution than traditional crawlers by extracting semantic data from heterogeneous data sources, addressing the problems that complicated formatting and varied content types pose in newspaper archives.

The remainder of this paper is organized as follows. Section “Related works” reviews past studies on existing methods, particularly those dealing with dynamic or unstructured web content. Section “Proposed methodology” describes the proposed WISE process. A subsequent section discusses and compares our suggested approach with other conventional approaches, and the final section concludes with a discussion of possible future studies.

Related works

This literature review looks at the latest methods in web crawling, semantic extraction, deep learning, and natural language processing to uncover limitations and gaps in existing research. These studies provide important background that informs the design of the proposed WISE framework. They support the premise that smart, flexible systems are needed to achieve more accurate, faster, real-time extraction of contextually relevant web data.

Traditional web crawling techniques

The first study examines web scraping and crawling strategies in detail. It groups methods by their strengths and weaknesses and compares rule-based, DOM-tree, and machine learning methods (R-DOM-ML)¹¹. It concludes that traditional crawlers lack the flexibility and semantic depth needed to extract data well, and that intelligent, context-aware algorithms are needed instead.

The authors examine the coverage, accuracy, and value of web-crawled corpora compared with traditional linguistic corpora. Their strategy includes both large-scale content analysis and the study of linguistic patterns¹². The results suggest that web-crawled corpora are fresh and scalable but do not necessarily have clear semantics or a consistent structure; therefore, better crawling methods are needed.

Challenges in semantic web data extraction

The authors of this study examined the connections between information extraction and the semantic web, and proposed a structure for grouping methods such as semantic annotation and ontology-based extraction¹³. The results suggest that combining semantic web technologies and machine learning (SWT-ML) makes extracted data more relevant. However, current extraction frameworks still have trouble with real-time adaptability and integration complexity.

This paper discusses semantic web technologies, their current state, and what they might be able to do in the future¹⁴. It looks at the RDF, SPARQL, and OWL frameworks and explains how they improve data interoperability. The authors argue that intelligent automation is needed to handle dynamic and unstructured web information well, and that semantic integration can improve context-aware extraction.

Deep learning in web mining

A deep neural network (DNN) trained on web-mined firm-level data is used in this study by Kinne and Lenz¹⁵ to forecast the innovation potential of businesses in a variety of industries. To carry out classification and trend prediction tasks, the model incorporates characteristics that are taken from business websites, including textual descriptors, product details, and publishing frequency. The usefulness of semantic representation learning for business-related web data is demonstrated by the authors’ findings that adding deep learning to web mining increased prediction accuracy by about 20% when compared to conventional regression-based models.

Similarly, Patnaik et al.¹⁶ created a hybrid CNN–LSTM system that performs intelligent web data extraction by dynamically learning both spatial and sequential patterns from unstructured online information. In comparison to traditional DOM-tree-based crawlers, their tests on news and e-commerce datasets showed that integrating convolutional and recurrent layers improves context identification and lowers extraction error rates by more than 15%.

Role of NLP in data extraction

An introduction to the basics of NLP, which covers both syntactic and semantic analysis, is given in this research¹⁷. It shows how important it is for intelligent data processing to understand language and gives examples of how artificial intelligence (AI) systems use natural language processing. According to the study, adding NLP to web crawling algorithms (NLP-WCA) can make them better at understanding context and extracting meaning.

This article covers the feature extraction and text preparation methods that form part of NLP, such as tokenization, stemming, and lemmatization¹⁸. The effectiveness of these methods is judged by how well the downstream tasks perform. The results support the hypothesis that web crawlers could benefit from natural language processing capabilities, since preprocessing is very important for building deep learning models that extract data better.

Real-time and adaptive crawling systems

The context-aware recommender system proposed in this study is based on web crawling and an optimized deep recurrent neural network. The model adapts its suggestions based on what it learns about user preferences from web content¹⁹. The results show that contextual awareness and deep learning are important for web data extraction systems because they improve personalization and prediction accuracy.

The survey looks at modern crawling tactics such as focused crawling, AI-driven models, and semantic indexing. Its taxonomy of approaches is based on three criteria: adaptability, scalability, and the capacity to work in real time²⁰. The study supports the view that intelligent, learning-based crawling frameworks are the best option because they are more efficient and accurate than standard crawlers.

Recent developments in semantic extraction and neural crawling further highlight the rapid advances that motivate our methodology²¹. To achieve comparable model outcomes while crawling significantly fewer URLs, the authors propose crawling tactics optimized for LLM pretraining that rank pages by their “influence” on downstream performance. By integrating semantic quality estimators directly into the crawling frontier, Neural Prioritization for Web Crawling (2025)²² increases the relevance and harvest rate of early-stage content. On the extraction side, neural web scraping is used to find clean main-text content, outperforming baseline scrapers by about 20% in corpus quality. To enable more reliable trigger and argument recognition in complex situations, Event Extraction recasts event extraction within a machine reading comprehension (MRC) framework²³. Finally, domain-specific solutions such as WIEM-DL (2025) enable the discovery of semantic knowledge from diverse, multi-source web data by combining deep learning, ontologies, and knowledge graphs.

The reviewed literature traces the shift from rule-following crawlers to AI-based extraction systems that understand meaning. The main problems identified are a lack of flexibility, an inability to capture broader context, and poor performance in frequently changing environments. These findings motivate WISE, a system that employs deep learning and natural language processing to fill these gaps and provide a smart, scalable, and semantically accurate way to obtain data from the modern web. Table 1 summarizes the related works.

Table 1. Related works summary.

Reference	Environment	Methods	Power scheduling	Security	Energy efficiency
11	Web Crawling & Scraping	Rule-based, DOM-based, and ML techniques	✗	✗	Low
12	Web-Crawled vs. Traditional Corpora	Linguistic pattern analysis, Statistical comparison	✗	✗	Low
13	Semantic Web	Ontology-based IE, RDF/SPARQL integration	✗	✓	Medium
14	Semantic Web Technologies	Semantic annotation, Web Ontology (OWL)	✗	✓	Medium
15	Web Mining	Deep Neural Networks for Business Data Prediction	✗	✗	Medium
16	Intelligent Web Extraction	CNN + LSTM Deep Learning	✗	✗	High
17	NLP Systems	Syntactic and Semantic NLP, Text Processing	✗	✗	Medium
18	NLP Preprocessing	Tokenization, Stemming, Lemmatization, Feature Extraction	✗	✗	Low
19	Context-Aware Crawling	Optimized Deep Recurrent Neural Network (RNN)	✓	✓	High
20	Modern Crawlers Survey	AI-driven Crawlers, Semantic Indexing, Focused Crawling	✓	✓	High

Proposed methodology

The WISE framework automatically extracts structured data from newspaper databases in a time-efficient manner using deep learning and NLP. It can process dynamically generated layouts, discover relevant news content including headlines and dates, filter out irrelevant noise, and organize useful information into a consistent structured format. This approach allows accurate, real-time extraction of news data from unstructured sources and overcomes the real-time extraction limitations of conventional rule-based news crawlers.

Proposed system overview

The proposed system, called WISE, is an intelligent, adaptive web crawler that exploits deep learning and NLP specifically to crawl newspaper databases. The WISE system consists of a smart crawl engine which has a URL scheduler, content fetcher and DOM analyzer that allows it to gather content from static and dynamic newspapers²⁴. WISE employs deep learning models integrated with NLP pipelines to enhance contextual understanding, semantic extraction, and adaptive content classification during web crawling and information retrieval.

Fig. 1. Proposed system overview.

Traditional crawlers rely on static rules or keyword matches to crawl web content. By contrast, WISE semantically analyzes newspaper content and its context, as shown in Fig. 1. The deep learning and NLP layer preprocesses and interprets content as it is fetched, including the headline, the body of the article, authorship data, and publication date, which allows for subsequent classification and relevance scoring. Unlike traditional crawlers, WISE is intended to work with the complex structures found in newspapers. The WISE crawler navigates newspapers directly from their web pages, which makes it less rigid in how it handles different kinds of content (breaking news, editorials, archive content, multimedia pieces, etc.). Distracting content such as ads, sidebars, and navigational clutter is filtered out. The structured data extraction layer then structures and filters content in real time according to the layout of each publisher. The resulting structured data is likely to be accurate and scales to the needs of future digital journalism.

Algorithm 1 takes raw text as input. It first executes a preprocessing subroutine that cleans the text through tokenization, stop-word removal, lemmatization, and similar steps. It then initializes a variable that records the best similarity level found so far, which is used to judge the relevance of terms. Next, the algorithm loops over a set of predefined OntologyTerms and calculates the similarity of each term to the cleaned newspaper text. If a term has a higher similarity than the best level recorded so far, the term is added to the relevant terms and the best similarity level is updated. Where several terms share the same best similarity level, all of them are added to the set. The output of the algorithm is the set of relevant ontology terms and the best similarity level, which helps ensure that the extracted data is semantically relevant.

Step 1: web content acquisition through intelligent crawling

The main goal of the first stage of the WISE architecture is to use an intelligent crawler engine to get web content. This engine can handle both static and dynamic/unstructured material. A URL Scheduler decides which links are most important by looking at things like domain authority, keyword relevance, and crawl feedback from the past²⁵.

Fig. 2. Web content acquisition through intelligent crawling.

In the first stage of the WISE architecture, the intelligent crawler engine extracts content only from newspaper-style databases, as shown in Fig. 2. The engine can crawl both static and dynamically generated news pages. The URL Scheduler ranks links according to relevance, freshness, and domain authority. The Fetcher retrieves content in various formats (HTML, JSON, AJAX-based). The DOM Analyzer traverses the complex layout of news pages, allowing WISE to find the parts of a page that contain articles while avoiding ads and sidebars. Unlike most crawlers, WISE does not simply crawl and parse the content on a page; it also considers the page structure and semantic information to improve crawling and parsing²⁶.

Eq. 1 assigns a priority score to each URL so that the framework knows the order in which to crawl them. Table 2 defines the variables.

Table 2. Variable definitions.

Symbol/Variable	Definition
	Priority score assigned to URL i for crawling.
	Historical relevance of a URL based on prior successful extractions.
	Keyword or semantic relevance score measuring contextual similarity.
	Domain authority score normalized between 0 and 1.
	Semantic relevance computed using cosine similarity between vectors.
( )	Target and candidate content vectors used for semantic comparison.
(	Dynamic crawl rate adjusted according to request and update frequencies.
( , )	Request rate and content update rate of the target web source.
( )	Preprocessed text obtained after cleaning, tokenization, and lemmatization.
(	Embedding vector representing the contextual meaning of word w.
(	Semantic similarity between two text segments or documents.
( )	Hidden state and input sequence at time t in recurrent neural models.
(	Total relevance score assigned to a news article or web document.
( )	Weighting factors for category and entity recognition scores.
( )	Threshold value applied for semantic filtering of irrelevant data.
( )	Extracted named entities such as names, locations, or dates from text t.
( )	HTML Document Object Model structure of a news webpage.
( )	Structured data elements extracted from the DOM hierarchy.
( )	Output format used for storing structured data (JSON, XML, or CSV).
( )	Semantic understanding or contextual fit score between extracted and expected categories.
( )	Real-time adaptability ratio indicating responsiveness to changing web content.
( )	Extraction accuracy achieved by WISE and baseline crawlers, respectively.
( )	Processing efficiency expressed as a percentage improvement over baselines.
( )	Proportion of unstructured data effectively processed by the system.
( )	Intelligent link prioritization score based on semantic quality.
( )	Scalability metric measuring performance under increasing data volume.
( )	Percentage of noise reduction achieved through semantic filtering.

The priority score is computed from the history of crawled content, keyword relevance, and domain authority. The purpose of this formula is to steer the system toward the most relevant news article pages by ordering links so that those with the most value come first. In addition, Eq. 1 reduces duplicate crawls and increases the relevance of the extracted data.
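The ranking step can be illustrated with a minimal sketch. Because the exact form of Eq. 1 is not reproduced in this text, the sketch assumes a weighted linear combination of the three signals named above; the weights, the `UrlSignals` container, and the function names are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class UrlSignals:
    """Signals for one candidate URL, each assumed to lie in [0, 1]."""
    url: str
    historical_relevance: float
    keyword_relevance: float
    domain_authority: float


# Hypothetical weights (assumption, not taken from the paper).
WEIGHTS = (0.4, 0.4, 0.2)


def priority_score(s: UrlSignals, weights=WEIGHTS) -> float:
    """Weighted sum of the three signals; higher means crawl sooner."""
    w_hist, w_kw, w_auth = weights
    return (w_hist * s.historical_relevance
            + w_kw * s.keyword_relevance
            + w_auth * s.domain_authority)


def rank_urls(signals):
    """Return candidates ordered from highest to lowest priority."""
    return sorted(signals, key=priority_score, reverse=True)
```

A scheduler built this way would pop URLs from the front of the ranked list; in practice the weights would need to be tuned against crawl feedback.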

The crawler ranks links by their semantic relevance using Eq. 2.

The semantic relevance is determined by how well a link fits into a hierarchical structure of predetermined content categories. When Eq. 2 is applied, the crawling strategy is adjusted dynamically by giving priority to links that are more semantically related to the desired content, which improves the collection of news articles. As a result, crawl effort is redirected so that the crawler is more likely to retrieve meaningful information while minimizing background noise.
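As a sketch of how links could be ordered by semantic relevance, the example below uses cosine similarity between a target vector and candidate vectors, which is how Table 2 describes the semantic relevance score. How the vectors are produced is not shown; the function names and the dictionary layout are assumptions made for illustration.

```python
import math


def cosine_similarity(u, v) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_links_by_relevance(target, candidates):
    """candidates maps URL -> vector; returns (url, score) pairs, best first."""
    scored = [(url, cosine_similarity(target, vec)) for url, vec in candidates.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

Links whose vectors point in nearly the same direction as the target receive scores close to 1 and are crawled first.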

Eq. 3 shows that the crawl rate for each URL is continuously calibrated against the content refresh rate.

Eq. 3 determines the crawl rate by comparing the request rate with the content refresh rate. The more often a website's pages change, the more frequently it is crawled; static news pages are typically crawled less frequently. In this way, the web crawler can react in real time, which improves overall speed and helps the crawler avoid collecting stale data.
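One simple way to realize this behaviour is sketched below. The actual form of Eq. 3 is not reproduced in this text, so the rule used here (crawl as often as the content changes, capped by the permitted request rate, with a small floor so static pages are still revisited) is an assumption, as are the function and parameter names.

```python
def crawl_rate(request_rate_limit: float, update_rate: float, floor: float = 0.01) -> float:
    """Crawls per hour for one source.

    request_rate_limit: maximum requests per hour the crawler may send (assumed input).
    update_rate: observed content updates per hour (assumed input).
    floor: minimum revisit rate for pages that rarely change (hypothetical default).
    """
    if request_rate_limit <= 0:
        raise ValueError("request_rate_limit must be positive")
    return min(request_rate_limit, max(floor, update_rate))


def crawl_interval_seconds(rate_per_hour: float) -> float:
    """Convert a crawl rate into the delay between successive visits."""
    return 3600.0 / rate_per_hour
```

With these assumptions, a front page that changes twelve times an hour is revisited every five minutes, while an archive page falls back to the floor rate.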

Step 2: deep learning-based text preprocessing and context understanding

Once raw newspaper content has been extracted, WISE preprocesses and semantically analyzes it using a deep learning framework. The Text Preprocessing Layer cleans the input through tokenization, stop-word removal, lemmatization, and noise filtering. The cleaned data is then passed to a deep neural network that represents it as embeddings using models like BERT or Word2Vec, which help capture meaning beyond single keywords.

Fig. 3. Deep learning-based text preprocessing and context understanding.

Next, through sequence modelling with RNNs or CNNs, WISE can recognize deeper patterns in the high-level structure of the content at the headline, article body, and paragraph level, as shown in Fig. 3. This helps WISE distinguish useful content from contextual filler, for example when the content concerns a journalism program at a university. Here again, WISE uses artificial intelligence to do what conventional crawlers do not: it forwards only contextually meaningful content to the next stage, structured data extraction.

Eq. 4 represents the key processes that are carried out to prepare web content for deeper analysis.

In addition to the tokenization, lemmatization, and stop-word removal in Eq. 4, text cleaning maintains the quality of the data that the deep learning model uses to understand context, so that only relevant and meaningful data is forwarded to later steps.
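The preprocessing chain can be sketched with the standard library alone. This is a deliberately crude stand-in for a full NLP pipeline such as spaCy: the stop-word list is a tiny illustrative sample, and `naive_lemma` only strips simple plural endings, so it will mishandle many words. All names here are hypothetical.

```python
import re

# Tiny illustrative stop-word list (assumption; real pipelines use far larger lists).
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "in", "on", "to",
                        "is", "are", "was", "were"})
_TAG_RE = re.compile(r"<[^>]+>")       # strips HTML tags
_TOKEN_RE = re.compile(r"[a-z0-9]+")   # lowercase alphanumeric tokens


def naive_lemma(token: str) -> str:
    """Very rough lemmatizer: handles only simple English plurals."""
    if len(token) > 3 and token.endswith("ies"):
        return token[:-3] + "y"
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token


def preprocess(raw: str) -> list:
    """Remove markup, lowercase, tokenize, drop stop words, then lemmatize."""
    text = _TAG_RE.sub(" ", raw).lower()
    tokens = _TOKEN_RE.findall(text)
    return [naive_lemma(t) for t in tokens if t not in STOP_WORDS]
```

The order matters: markup is removed before tokenization so that tag names never appear as tokens, and stop words are dropped before lemmatization so the lemmatizer only sees content words.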

To produce contextual embeddings for each word, the preprocessed text is passed into a BERT-based model using Eq. 5.

The result of Eq. 5 is a high-dimensional vector for each word in the phrase that represents its meaning in the context of its co-occurring neighbours. With this representation, the system can extract more accurate news information for an item and can score relevance more accurately, since it has a better understanding of the semantic context of the web content.

Eq. 6 takes the embeddings of multiple web pages and determines how semantically similar they are.

It measures the cosine similarity between two bodies of text to obtain a similarity score. By identifying what is most relevant with Eq. 6, the system can discard or categorize data that is not relevant to the information needed. In this way, crawling becomes more efficient and yields more accurate, higher-quality news data.

An RNN model is used to capture word dependencies over time since it processes sequences of text by going one word at a time as shown in Eq. 7.

This step is important when analyzing structured or dynamic content, because it allows patterns to be recognized and information from previous segments of text to be retained, as in Eq. 7. By modelling the relationships between data points over time, the system can understand content structures that change and extract information more precisely²⁷.
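For reference, the exact form of Eq. 7 is not reproduced in this text. A standard recurrent update consistent with the description, using the hidden state and input at time t listed in Table 2, is shown below; the weight matrices, the bias, and the choice of tanh as the activation are assumptions.

```latex
h_t = \tanh\left( W_h \, h_{t-1} + W_x \, x_t + b \right)
```

Here h_{t-1} carries information from earlier words in the sequence, which is what lets the model retain context from previous segments of text.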

Algorithm 2 receives raw text t as input and runs three preprocessing steps (tokenization, stop-word removal, and lemmatization) to prepare the raw newspaper text for the deep learning models. After preprocessing, the algorithm initializes one variable to negative infinity to keep track of the best similarity score found so far. The algorithm then loops over a set of predetermined terms in the ontology and calculates the similarity between each term and the text. If the calculated similarity is higher than the best similarity level found so far, the term is added to the relevant terms and the best similarity level is updated. If several terms share the same best similarity, the algorithm adds all of them to the set of relevant terms for the newspaper data. The output of the algorithm is the set of relevant terms and the best similarity level, meaning that the system has extracted the most semantically relevant information from the input text.
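The matching loop described above can be sketched as follows. The similarity measure is not specified in this text, so a simple word-overlap score is used as a stand-in, and the input is assumed to be already preprocessed into tokens. Resetting the result list whenever a strictly better score is found, so that only terms tied at the best score are returned, is one interpretation of the description. All names are hypothetical.

```python
import math


def token_overlap(term: str, tokens) -> float:
    """Stand-in similarity: fraction of the term's words found among the text tokens."""
    words = term.lower().split()
    if not words:
        return 0.0
    token_set = set(tokens)
    return sum(w in token_set for w in words) / len(words)


def best_ontology_terms(tokens, ontology_terms, similarity=token_overlap):
    """Return (terms tied at the highest similarity, that similarity)."""
    best = -math.inf          # no score seen yet
    relevant = []
    for term in ontology_terms:
        score = similarity(term, tokens)
        if score > best:      # strictly better: start a new result set
            best, relevant = score, [term]
        elif score == best:   # tie with the current best: keep both
            relevant.append(term)
    return relevant, best
```

Passing a different `similarity` callable, for example one based on embedding cosine similarity, leaves the loop unchanged.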

Step 3: semantic-based relevance scoring and data extraction

In this step, WISE operates on semantically enriched newspaper text and generates information from it. During content processing, the Relevance Scoring Module scores each article against the established goals, such as headline accuracy, publication date, and the journalistic level of the content. Segments that contain low-value content, like ads or repeated text, are ignored²⁸.

Fig. 4. Semantic-based relevance scoring and data extraction.

The relevance scoring process uses tags and DOM mapping to extract structured content, such as bylines, quotations, and named entities, as shown in Fig. 4. Each article is broken down into standard components (key background, key points, and statistics) for prioritized extraction. Whereas rigid, manually labelled classification systems depend on a coder specifying every value in order to deliver organized output, WISE uses learned patterns to provide flexible, highly accurate content extraction. WISE is therefore well suited to converting unstructured news content into structured knowledge artifacts for use in various downstream applications.

The relevance scoring for text data is defined in Eq. 8.

At this point, the framework uses a relevance scoring system to rank the data by its relevance to a given category. In Eq. 8, relevance scores are assigned to content by weighting both category matches and entity recognition. This allows the crawler to identify the most helpful and semantically relevant news information for the user and to extract the relevant data from the web.

The semantic data filtering Eq. 9 applies a relevance threshold based on a relevance score that has already been determined²⁹.

Under Eq. 9, data extraction focuses on records that meet at least this relevance threshold. The web crawler therefore collects less noise and extraneous information while still collecting the most meaningful and useful news articles. This is important for ensuring data quality and relevance.
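Scoring and threshold filtering can be sketched together. The exact forms of Eq. 8 and Eq. 9 are not reproduced in this text; the sketch assumes a weighted sum of a category-match score and an entity-recognition score, followed by a fixed cut-off. The weights, the threshold, and the dictionary keys are hypothetical.

```python
def relevance_score(category_match: float, entity_score: float,
                    alpha: float = 0.6, beta: float = 0.4) -> float:
    """Weighted combination of category match and entity recognition scores."""
    return alpha * category_match + beta * entity_score


def filter_relevant(articles, threshold: float = 0.5):
    """Keep articles whose relevance meets the threshold, annotating each with its score."""
    kept = []
    for art in articles:
        score = relevance_score(art["category_match"], art["entity_score"])
        if score >= threshold:
            kept.append({**art, "relevance": score})
    return kept
```

Raising the threshold trades recall for precision: fewer articles pass, but a larger share of them are on-topic.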

In Eq. 10, named entity recognition (NER) extracts specific pieces of information from text, including but not limited to names, dates, and locations.

Leveraging deep learning techniques, the model can discriminate between and categorize content items. Eq. 10 improves the accuracy and precision of structured data extraction from unstructured web content by discovering and extracting structured information such as product names or event dates. This helps ensure that the data is specific and relevant.

Step 4: output structuring and repository management

The last stage in the WISE system is to put the data in a structured, easy-to-find format. The Data Structuring Module makes sure that formats are consistent by turning them into industry-standard outputs like JSON, CSV, or XML. It marks fields that have been extracted, such as title, date, and content type. The act of cleansing the data gets rid of any mistakes or duplicate entries that are still there.

Equation 11 then uses the DOM’s structural tags to identify and extract the relevant data blocks on the page³⁰.

Using Eq. 11, the algorithm identifies structured parts of the content, such as lists, tables, or product specifics, and extracts the content associated with those tags. Because the retrieved content is semantically accurate, it can be formatted correctly, which simplifies analysis and supports future use cases such as reporting or analytics.
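Tag-driven extraction of this kind can be illustrated with Python's built-in `html.parser`. The mapping from tags to fields below is a hypothetical example for a simple article page, not the mapping used by WISE, and real publisher layouts would need per-site rules or learned selectors.

```python
from html.parser import HTMLParser


class ArticleFieldParser(HTMLParser):
    """Collects text found inside a small set of structural tags."""

    # Hypothetical tag-to-field mapping (assumption for illustration).
    FIELD_TAGS = {"h1": "headline", "time": "date", "p": "body"}

    def __init__(self):
        super().__init__()
        self.fields = {"headline": [], "date": [], "body": []}
        self._current = None  # field currently being filled, if any

    def handle_starttag(self, tag, attrs):
        if tag in self.FIELD_TAGS:
            self._current = self.FIELD_TAGS[tag]

    def handle_endtag(self, tag):
        if self.FIELD_TAGS.get(tag) == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if self._current and text:
            self.fields[self._current].append(text)


def extract_fields(html: str) -> dict:
    """Parse an HTML string and return the collected fields."""
    parser = ArticleFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields
```

Text outside the mapped tags, such as navigation or sidebar content, is never assigned to a field and is therefore dropped.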

Fig. 5. Output structuring and repository management.

After the extraction phase is complete, WISE structures the newspaper data into formats such as JSON, CSV, or XML and uploads the output to a centralized repository for easy indexing, searching, and integration with third-party platforms such as business intelligence (BI) tools and custom dashboards, as shown in Fig. 5. An API-based export interface provides direct real-time access for immediate analysis or for operational automation workflows built on the newspaper conversion process. WISE thus takes raw newspaper content that has not been filtered, cleansed, or structured and adds semantic organization so that it becomes usable data for applications such as media monitoring, trend analysis, and content aggregation.

Once the content has been extracted, the data is converted into a structured format such as JSON, XML, or CSV, as expressed in Eq. 12.

Now that the data exists in a consistent standardized format, the next step is to ensure the data is available for storage, analysis, or combination with other systems³¹. The solution ensures that the data can be effectively used in multiple applications such as databases or business intelligence tools by structuring the data.
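A minimal sketch of serializing extracted records into the three formats mentioned is given below, using only the Python standard library. The field names follow the examples given in the text (title, date, content type); the function names and the XML element names are assumptions.

```python
import csv
import io
import json
import xml.etree.ElementTree as ET

# Field names taken from the examples in the text; the exact schema is an assumption.
FIELDS = ("title", "date", "content_type")


def to_json(records) -> str:
    """Serialize records as a JSON array."""
    return json.dumps(records, ensure_ascii=False)


def to_csv(records) -> str:
    """Serialize records as CSV with a header row."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(records)
    return buf.getvalue()


def to_xml(records) -> str:
    """Serialize records as <articles><article>...</article></articles>."""
    root = ET.Element("articles")
    for rec in records:
        node = ET.SubElement(root, "article")
        for name in FIELDS:
            ET.SubElement(node, name).text = str(rec[name])
    return ET.tostring(root, encoding="unicode")
```

Keeping a single field list shared by all three writers is one way to keep the formats consistent with each other.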

Flow diagram

Fig. 6. Flow diagram.

As shown in Fig. 6, the WISE framework starts with its intelligent crawling and acquisition engine, which comprises a URL scheduler and content fetcher that retrieve content from a mixture of static and dynamic newspapers. The crawled web content is then parsed by a DOM analyzer, which navigates the page structure and extracts relevant elements, such as article bodies, headlines, and metadata, and passes them to a deep learning-based natural language processing layer for text cleaning and contextual analysis of the news articles. A relevance scoring module ranks and filters the content based on semantic importance. The end result is structured data held in centralized storage in formats such as CSV or JSON. Overall, it is an end-to-end process that provides efficient, context-rich extraction of high-value data from a newspaper database.

Convolutional neural networks (CNNs) and recurrent neural networks (RNNs) are used in the WISE framework due to their complementing abilities to handle structural and sequential dependencies in text-based online data. The crawler can maintain long-term dependencies across sentences and paragraphs thanks to RNNs’ ability to effectively model temporal and contextual interactions between words. It is essential for comprehending changing storylines in news stories. CNNs, on the other hand, record localized patterns like phrase boundaries, keyword clusters, and syntactic cues that frequently show topical focus or article relevance. In contrast with rule-based or purely transformer-driven approaches, which might necessitate larger datasets and more computational resources, WISE’s hybrid sequence modeling approach enables it to achieve a balance between semantic coherence (via RNNs) and spatial pattern recognition (via CNNs), creating better contextual accuracy and adaptability. Therefore, the RNN–CNN architecture provides a foundation for real-time newspaper content extraction that is both computationally efficient and semantically strong.

The WISE framework integrates the technology of deep learning and NLP to enhance web crawling to be semantically aware, to extract based on relevance, and to adapt to changing circumstances. The WISE system is modular, which enables accurate extraction of structured, context-aware data in real-time, and provides timeliness and relevance that is often not found in traditional crawlers when exploring modern, dynamic environments such as newspaper databases.

Implementation details

Dataset Description:

The News Articles Classification Dataset from Kaggle contains a collection of labeled news articles across multiple domains such as Business, Technology, Sports, Education, and Entertainment. It includes article content, headlines, and publication dates, making it suitable for natural language processing tasks like text classification, semantic extraction, and content analysis, enabling efficient data extraction from diverse news sources³².

The WISE framework was trained and evaluated using the Kaggle News Articles Classification Dataset. The dataset comprises 222,937 labeled news articles from various internet sources, divided into five primary categories: business, technology, sports, education, and entertainment. Because each record contains the headline, body content, publication date, and category name, it can be used for tasks involving contextual extraction as well as semantic classification. For model development, the dataset was divided at random into 70% training, 20% validation, and 10% testing subsets, with class balance maintained across all categories. A preprocessing workflow comprising HTML element removal, lowercasing, tokenization, stop-word elimination, and lemmatization was applied to raw articles before training, followed by vectorization using contextual embeddings produced by BERT-base. To preserve model efficiency, articles with more than 512 tokens were truncated.
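As an illustration only, the preprocessing and splitting steps described above could be sketched as follows. This is a minimal standard-library stand-in, not the authors' implementation: the function names, the tiny stop-word list, and the regular-expression tokenizer are assumptions, and lemmatization and BERT embedding generation (handled by spaCy and Hugging Face Transformers in the paper) are deliberately left out.

```python
import random
import re
from html.parser import HTMLParser

# Tiny illustrative stop-word list (assumption); the paper relies on spaCy's list.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on"}
MAX_TOKENS = 512  # truncation limit stated in the text


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards all HTML tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(raw):
    """Return the visible text of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(raw)
    parser.close()
    return " ".join(parser.parts)


def preprocess(raw):
    """HTML removal, lowercasing, tokenization, stop-word removal, truncation."""
    text = strip_html(raw).lower()
    tokens = re.findall(r"[a-z0-9]+", text)  # simple tokenizer (assumption)
    tokens = [t for t in tokens if t not in STOP_WORDS]
    return tokens[:MAX_TOKENS]


def split_dataset(records, seed=0):
    """Shuffle and split records into 70% train, 20% validation, 10% test."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n = len(shuffled)
    n_train = n * 7 // 10
    n_val = n * 2 // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test
```

Note that a purely random split like this one does not by itself guarantee class balance; a stratified split per category would be needed for that, and the source does not say which was used.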

To ensure effectiveness, modularity, and scalability, the WISE framework’s implementation combined Python-based deep learning and natural language processing ecosystems. The deep learning models were built with TensorFlow 2.15 and PyTorch 2.2, which provide GPU acceleration and flexibility for neural sequence modeling. For natural language processing, the system used spaCy 3.7 for preprocessing tasks such as tokenization, stop-word removal, and lemmatization, and Hugging Face Transformers (v4.41) for BERT-based contextual embedding generation. Pandas and NumPy supported data manipulation and preprocessing, while BeautifulSoup4 and Scrapy were used for structured HTML parsing and large-scale web crawls. The structured output was stored in MongoDB and Elasticsearch for scalable indexing and search, and the real-time crawling engine was developed using Flask RESTful APIs with Celery for task scheduling. All trials ran on an NVIDIA RTX A6000 GPU workstation with 128 GB of RAM, with Ubuntu 22.04 as the deployment environment.

Tech stack

Deep Learning Frameworks – Used to implement neural networks for analyzing and learning from web data.
Natural Language Processing (NLP) Libraries – Enables contextual understanding and semantic analysis of web content.
Web Crawling Engine – Core component responsible for fetching and traversing web pages.
Content Prioritization Module – Applies learned patterns to determine the relevance of web links dynamically.
Real-Time Processing Infrastructure – Supports adaptive crawling strategies with low latency and high responsiveness.
Structured Data Output System – Formats and stores extracted data efficiently from heterogeneous sources.

Baseline models for comparison

Traditional Rule-Based Web Crawlers – Static systems relying on predefined rules for link traversal and content extraction.
Keyword-Driven Crawlers – Crawlers that operate based on keyword matching rather than contextual understanding.
Non-Adaptive Crawlers – Models lacking real-time responsiveness and dynamic strategy adjustment.
Non-Semantic Extraction Systems – Crawlers that do not utilize NLP or semantic context for data interpretation.
Conventional Crawlers with High Redundancy – Systems that retrieve excess or irrelevant information due to a lack of relevance filtering.
Low Scalability Crawlers – Baseline systems that are unable to efficiently manage large-scale, heterogeneous web environments.

Experimental analysis & parameters

The WISE framework enhances web crawling of newspaper databases through deep learning and natural language processing. It was assessed against conventional web crawlers on eight key performance metrics using the selected newspaper database: semantic understanding, real-time adaptability, accuracy, processing efficiency, unstructured data handling, link prioritization, scalability, and noise mitigation. Collectively, the results demonstrate that WISE outperforms conventional crawlers in handling diverse layouts, evolving news formats, and semantically rich content.

The experiments utilize both benchmark datasets and live web extractions, ensuring that WISE’s performance is validated under controlled conditions and real-world scenarios, thereby strengthening the framework’s external validity and practical applicability.

Analysis of semantic understanding

Fig. 7. Semantic understanding analysis.

WISE uses natural language processing and neural networks when crawling newspaper databases, which helps it comprehend the meaning of web content in context. With this semantic understanding, WISE extracts newspaper data that is contextually relevant. Fig. 7 reports the analysis, computed with Eq. 13, in which WISE reaches 96.48%. WISE was 35% more effective in contextual relevance than conventional approaches, which often fail to capture implicit or dynamic newspaper information.

Eq. 13 evaluates the degree of semantic fit between the collected content and the expected target.

For traditional crawlers, the main weakness captured by Eq. 13 is that user-defined categories are matched without any semantic understanding. In contrast, the WISE framework uses a deep learning model that relies on contextual embeddings from a transformer-based (BERT-like) model to establish semantic importance. The equation thus expresses the semantic context that web crawlers face in newspaper information retrieval and supports a more semantic approach to information extraction, which traditional crawlers, lacking any semantic knowledge, cannot offer.

Analysis of real-time adaptability

Table 3. Real-Time adaptability Comparison.

Number of samples	R-DOM-ML	SWT-ML	DL-NN-WB	WISE
10	30.72	35.62	45.85	46.09
20	48.33	41.27	38.37	54.18
30	69.63	54.13	64.75	57.51
40	44.18	41.84	71.59	80.57
50	62.77	56.93	76.87	48.77
60	54.77	79.35	40.36	72.38
70	76.81	39.70	54.00	69.64
80	47.02	60.01	77.94	71.22
90	56.66	67.22	66.29	87.34
100	80.92	80.02	74.71	97.36

For websites that continuously update newspaper content, traditional crawlers are of limited use because they cannot refresh information in real time. WISE instead modifies its crawling technique on the fly in response to semantic input and behaviours learned from these newspaper databases. Table 3 compares the approaches, with WISE reaching 97.36% at 100 samples. According to the evaluation based on Eq. 14, WISE responds 40% faster to content changes and collects relevant data from real-time online sources more accurately.

The comparison of real-time adaptability with traditional web crawlers for the newspaper database is given in Eq. 14.

Adaptability is a serious problem for traditional crawlers, which crawl on a strict schedule regardless of content changes. In the WISE framework, Eq. 14 evaluates the rate of content change so that crawling adapts to more dynamic newspaper content. Variables such as page freshness or user engagement allow the crawl schedule to recalibrate automatically, which provides real-time adaptability. The equation helps keep crawled data current, something traditional models cannot do, and this can improve accuracy and reduce irrelevant data.
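To make the idea of a crawl schedule that recalibrates with content change concrete, the following is a minimal sketch under stated assumptions. It is not Eq. 14 and not the WISE scheduler: it uses a common revisit heuristic (shorten the interval when a page changed, lengthen it when it did not), and the function names, the interval bounds, and the factors 2.0 and 1.5 are all hypothetical choices made for illustration.

```python
import hashlib

MIN_INTERVAL = 60.0       # seconds; hypothetical lower bound
MAX_INTERVAL = 86_400.0   # seconds; hypothetical upper bound (one day)


def content_fingerprint(html):
    """Hash page content so that changes can be detected cheaply."""
    return hashlib.sha256(html.encode("utf-8")).hexdigest()


def next_interval(current, previous_hash, new_html):
    """Return (new_interval, new_hash) after one visit to a page.

    If the content changed since the last visit, revisit sooner;
    otherwise back off. The result is clamped to the allowed range.
    """
    new_hash = content_fingerprint(new_html)
    if new_hash != previous_hash:
        interval = current / 2.0   # page is changing: crawl more often
    else:
        interval = current * 1.5   # page is stable: crawl less often
    interval = max(MIN_INTERVAL, min(MAX_INTERVAL, interval))
    return interval, new_hash
```

A fixed-schedule crawler corresponds to returning `current` unchanged on every visit, which is the behaviour the text attributes to traditional systems.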

Analysis of the accuracy of data extraction

Fig. 8. Accuracy of data extraction analysis.

WISE combines deep learning and semantic understanding to increase data extraction fidelity, measured using Eq. 15. Other crawlers lack such reasoning and, as a result, often acquire newspaper information that is redundant or irrelevant to the user. In benchmark tests, WISE decreased errors by 35% and obtained 93.4% accuracy (Fig. 8), demonstrating its improved ability to identify and extract structured information from static and dynamic web pages in these databases.

Eq. 15 accounts for the difference in the accuracy of newspaper data extraction between the WISE framework and standard frameworks.

Since traditional crawlers rely on static, keyword-based approaches, they often yield lower accuracy in newspaper data extraction; Eq. 15 calculates the improvement in accuracy achieved by the WISE framework. The WISE framework employs deep learning models and contextual semantic relevance to ensure that contextually relevant data from the newspaper is extracted. The results from Eq. 15 show that WISE is a more effective tool for accurate data extraction than traditional data extraction systems, with an approximate accuracy improvement of 35%.

Analysis of processing efficiency

Table 4. Processing efficiency Comparison.

Number of samples	R-DOM-ML	SWT-ML	DL-NN-WB	WISE
10	37.5	32.2	28.4	32.1
20	56.8	46.3	38.2	41.6
30	42.7	46.1	48.2	51.2
40	67.9	38.9	54.6	67.2
50	74.4	54.6	65.8	48.8
60	25.6	67.2	72.5	61.8
70	55.2	72.8	75.0	72.4
80	75.3	77.2	81.3	81.6
90	68.2	72.8	73.4	84.2
100	71.1	80.7	86.1	94.9

WISE processes newspaper information much faster than traditional crawlers through intelligent link prioritization and reduced redundancy. In rule-based systems, crawling irrelevant pages incurs unnecessary processing cost, whereas WISE’s learning-based engine improves processing speed and resource consumption (Table 4). Measured with Eq. 16, the framework demonstrates 40% greater processing efficiency, reaching 94.9% at 100 samples, which decreases crawl time and increases the value of the newspaper data collected.

Conventional crawlers scrape useless data inefficiently and at a modest processing rate, which Eq. 16 captures.

To process content quickly, WISE uses deep learning techniques to filter and rank content automatically. Eq. 16 compares the processing time of conventional systems against that of the WISE framework to give overall processing efficiency. With WISE, newspaper data extraction is fast, resources are saved, and computational effort is used effectively, as the WISE model processes data 40% faster than conventional systems.

Analysis of unstructured data handling

Fig. 9. Unstructured data handling.

It is challenging to use rule-based crawlers to extract actionable intelligence from dynamic layouts, blogs, and forums. WISE was designed with natural language processing and deep learning capabilities to quickly analyze and extract newspaper information from these types of unstructured sources, reaching 91.9%. The results in Fig. 9, computed with Eq. 17, show that WISE outperformed the baseline crawlers, which could not cope with non-standard or chaotic formats, and maintained high accuracy across all unstructured datasets.

The evaluation metrics used in Figs. 7 and 9 are clearly described to enhance interpretability. The model’s ability to extract information while capturing contextual meaning is evaluated by Semantic Understanding Analysis (Fig. 7). A Semantic Fit Score (Sₛ), which is calculated as the cosine similarity between the reference and predicted content class embeddings, is used to quantify it. Stronger contextual understanding and precise semantic connection are indicated by higher scores. Unstructured Data Handling Analysis (Fig. 9) assesses how well the system handles untagged, noisy, or irregular online data. The Unstructured Data Handling Rate (Uₕ), which is the proportion of successfully parsed and categorized unstructured data to all extracted content, is used to quantify it. These measurements provide a more thorough comprehension of WISE’s resilience in handling diverse, real-world online data.
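The two metrics defined in this paragraph can be written out in a few lines. The sketch below assumes that embeddings are plain lists of floats and that counts of parsed and extracted items are already available; the function names are hypothetical and the zero-vector and zero-count conventions are choices made here, not taken from the paper.

```python
import math


def semantic_fit_score(reference, predicted):
    """Cosine similarity between reference and predicted class embeddings."""
    dot = sum(r * p for r, p in zip(reference, predicted))
    norm_r = math.sqrt(sum(r * r for r in reference))
    norm_p = math.sqrt(sum(p * p for p in predicted))
    if norm_r == 0.0 or norm_p == 0.0:
        return 0.0  # convention chosen here for zero vectors (assumption)
    return dot / (norm_r * norm_p)


def unstructured_handling_rate(parsed_and_categorized, total_extracted):
    """Share of extracted content that was successfully parsed and categorized."""
    if total_extracted == 0:
        return 0.0  # convention chosen here for an empty crawl (assumption)
    return parsed_and_categorized / total_extracted
```

For example, two identical embeddings give a score of 1.0 and two orthogonal embeddings give 0.0, which matches the reading that higher scores indicate a closer semantic connection.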

Eq. 17 identifies cases in which traditional crawlers and data foraging approaches simply ingest unprocessed news article data without handling it.

To refine and represent unstructured data, the WISE framework relies on a rich text preprocessing pipeline that performs tasks such as tokenization, semantic analysis, and named entity recognition (NER). Eq. 17 provides the basis for comparing how efficiently the two kinds of system work with unstructured article content. In this regard, WISE is the superior performer, as it has a greater capability to extract structured data from noisy, chaotic web pages.

Analysis of intelligent link prioritization

Table 5. Intelligent link prioritization Comparison.

Number of samples	R-DOM-ML	SWT-ML	DL-NN-WB	WISE
10	11.5	13.2	25.9	31.5
20	21.3	38.8	39.2	46.1
30	37.2	49.1	40.7	53.7
40	42.4	23.4	52.6	69.7
50	58.4	49.1	60.3	53.3
60	52.1	51.2	71.0	77.3
70	68.2	63.3	63.0	76.4
80	78.3	86.2	80.8	85.6
90	80.12	72.8	83.4	86.9
100	79.6	81.2	86.6	93.9

Unlike standard crawlers that employ basic link-following algorithms, WISE uses content-aware link prioritization to give preference to pages with high semantic value in newspaper data. This minimizes crawl depth and maximizes the relevance of the extracted newspaper data (Table 5), reaching 93.9% at 100 samples. Experimental results, evaluated using Eq. 18, indicate that WISE increases extraction density from relevant links, reduces the time wasted loading irrelevant pages, and increases the speed with which users can converge on critical content.

The intelligent link prioritization process of the proposed system is compared using Eq. 18.

When prioritizing links in breadth-first or depth-first crawling, traditional crawlers often use a simple, deterministic rule-based algorithm that may follow irrelevant links or links that duplicate newspaper data already collected; Eq. 18 accounts for this. WISE instead applies semantic analysis and classification of web content to assess the importance of each URL via a smart link prioritization algorithm. Through this equation, WISE can dynamically select links based on their contextual importance. This allows the crawler to retrieve the most important information first, so the quality of extracted newspaper data is substantially higher than with traditional systems.
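The contrast with breadth-first or depth-first traversal can be illustrated with a priority-queue frontier, in which the next URL to fetch is always the one with the highest relevance score. The sketch below is an assumption-laden illustration, not the WISE algorithm: `LinkFrontier` and `keyword_relevance` are hypothetical names, and the keyword-overlap scorer is only a placeholder for the semantic classifier the paper describes.

```python
import heapq
import itertools


class LinkFrontier:
    """Crawl frontier that pops the most relevant unseen URL first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker keeps insertion order

    def push(self, url, relevance):
        """Queue a URL once; return False if it was already seen."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so store the negated score.
        heapq.heappush(self._heap, (-relevance, next(self._counter), url))
        return True

    def pop(self):
        """Return (url, relevance) for the highest-scoring queued URL."""
        neg_relevance, _, url = heapq.heappop(self._heap)
        return url, -neg_relevance

    def __len__(self):
        return len(self._heap)


def keyword_relevance(anchor_text, topic_terms):
    """Placeholder scorer: fraction of topic terms found in the anchor text."""
    if not topic_terms:
        return 0.0
    words = set(anchor_text.lower().split())
    return len(words & topic_terms) / len(topic_terms)
```

Replacing the heap with a FIFO queue turns this into plain breadth-first crawling, which makes the design difference easy to see: only the ordering policy changes, while duplicate suppression via the seen set stays the same.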

Analysis of scalability

Fig. 10. Scalability analysis.

The architecture of WISE allows for large-scale deployment in many web environments. Because WISE learns and generalizes across domains, its advantage over traditional crawlers grows as the size of the input data increases (Fig. 10). Benchmarking results, evaluated using Eq. 19, show that WISE’s performance and accuracy remain consistent across data sizes: the system continues to extract high-quality newspaper data, reaching 96.9%, from complex, large-volume online sources.

Eq. 19 measures the performance of the crawler while working with significant amounts of news data.

Traditional crawlers are limited in scalability because they are rigid, bolt-on designs. WISE, by contrast, scales well across multiple data sources because it uses deep learning and adjusts dynamically to newspaper content. Eq. 19 compares the crawlers’ ability to process a growing number of web pages with diverse data. In general, WISE scales better than traditional systems while maintaining efficiency and accuracy.

Analysis of noise reduction in data

Fig. 11. Noise reduction analysis.

Standard algorithms usually fail to exclude unwanted newspaper content such as advertisements or headers. WISE’s deep learning models, trained on semantic relevance, eliminate this type of noise, as illustrated in Fig. 11. This yields more useful, higher-quality extracted data, reaching 95.9%. With noise filtered out, measured by Eq. 20, WISE enables faster and smoother processing and analysis.

Traditional crawlers often struggle with the noise present in newspapers because they lack the capability to prioritize links or understand semantic context; Eq. 20 measures how far this noise is reduced.

Using Eq. 20, the framework quantifies how much redundant data each system collects, that is, how much noisy or redundant data is harvested. Using deep learning models, the WISE framework lowers the amount of noise considerably and provides cleaner and more accurate news data collection, whereas the filtering ability of traditional crawlers is little to none.
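One simple way to quantify harvested noise, offered here as an assumption rather than as the form of Eq. 20, is the share of collected text blocks flagged as noise, together with the relative reduction between two systems. The function names and the toy noise detector are hypothetical; WISE is described as using learned semantic relevance instead of fixed rules like these.

```python
def noise_ratio(blocks, is_noise):
    """Fraction of harvested text blocks that are flagged as noise."""
    if not blocks:
        return 0.0
    noisy = sum(1 for block in blocks if is_noise(block))
    return noisy / len(blocks)


def relative_noise_reduction(baseline_ratio, filtered_ratio):
    """Relative drop in noise ratio from a baseline to a filtered crawl."""
    if baseline_ratio == 0:
        return 0.0
    return (baseline_ratio - filtered_ratio) / baseline_ratio


def looks_like_boilerplate(block):
    """Toy rule-based detector (assumption): ads and menu bars."""
    text = block.lower()
    return text.startswith("advertisement") or "|" in text
```

With four harvested blocks of which two are an advertisement and a navigation bar, `noise_ratio` returns 0.5; if filtering brings a crawl from a ratio of 0.5 down to 0.25, `relative_noise_reduction` returns 0.5.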

WISE is superior to legacy crawlers in terms of speed (improving processing speed by 40%), noise (reducing noise by 45%), and accuracy (improving data extraction accuracy by 35%). WISE does this by choosing high-value links intelligently, processing unstructured newspaper content in various formats, and using real-time dynamic adaptation. WISE offers a streamlined, efficient, and scalable solution for extracting structured as well as contextual information from multifaceted news. Integrating deep learning and NLP provides superior semantic understanding and reliable performance across numerous newspaper databases and changing digital content formats.

Instead of using general methodological categories, WISE was benchmarked against particular state-of-the-art baseline systems. The Intelligent Hidden Web Crawler (IHWC), put forth by Kaur et al. (2023), combines rule-based DOM parsing with machine learning classifiers and corresponds to the R-DOM-ML group. The Ontology-Driven Semantic Extractor (ODSE), created by Martinez-Rodriguez et al. (2020), combines supervised semantic labeling with ontology mapping to represent the SWT-ML category. For deep learning comparisons, WISE was compared to the CNN–LSTM Adaptive Web Extractor by Patnaik et al. (2021) and the Context-Aware RNN Web Crawler by Boppana and Sandhya (2021). As a non-semantic baseline, a conventional keyword-based crawler in the vein of Shamrat et al. (2020) was also used. These particular benchmark models support an impartial and transparent evaluation of WISE’s progress in semantic knowledge, contextual accuracy, and adaptability.

The performance measurements obtained from several trial runs were subjected to significance testing in order to confirm the statistical reliability of the stated gains. Using randomly shuffled train-test splits, each experiment was conducted five times under the same conditions. WISE’s accuracy, efficiency, and semantic fit measures were compared to each baseline system using the paired t-test. The findings show that WISE’s gains are consistent and not the result of chance, with improvements statistically significant at the 95% confidence level (p < 0.05) across all parameters. Furthermore, the standard deviation (σ) values across multiple runs were less than 2.5%, indicating the stability and repeatability of the experiment. These results support the validity of WISE’s performance gains over neural, rule-based, and ontology-driven baselines.
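The paired t-test over five runs can be sketched with the standard library alone. The code below computes the paired t statistic and compares it with 2.776, the two-tailed critical value for 4 degrees of freedom at the 0.05 level. The per-run scores are made-up placeholder values used only to exercise the function; they are not results from the paper, and the function name is hypothetical.

```python
import math
import statistics

# Two-tailed critical value of Student's t for df = 4 at alpha = 0.05.
T_CRIT_DF4_95 = 2.776


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples: mean difference over its standard error."""
    if len(scores_a) != len(scores_b):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = statistics.stdev(diffs)  # sample standard deviation (n - 1)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Placeholder per-run accuracies (NOT from the paper), five runs each.
wise_runs = [0.93, 0.94, 0.92, 0.95, 0.93]
baseline_runs = [0.70, 0.72, 0.69, 0.71, 0.70]

t_value = paired_t_statistic(wise_runs, baseline_runs)
significant = abs(t_value) > T_CRIT_DF4_95
```

With only five runs the test has 4 degrees of freedom, so the threshold is noticeably higher than the large-sample value of 1.96; an exact p-value would require the t distribution, for example from a statistics library.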

WISE improves accuracy and adaptability substantially, but certain drawbacks remain. The current architecture is limited to English-language text, and the use of pre-trained language models raises the computing cost during large-scale crawling. Additional evaluation may be necessary for domain-specific adaptability and long-term stability in dynamic online contexts. Future research will focus on increasing multilingual support, enhancing adaptation across various domains, and maximizing model efficiency.

Conclusion and future works

WISE is an intelligent web crawler that leverages deep learning and NLP to overcome the structural limitations of typical crawlers. Through semantic comprehension, WISE dynamically changes its crawling techniques, making the data it collects and extracts more accurate and relevant than that of rule-based processes. WISE shows a clear advantage over conventional crawlers, with 40% greater processing efficiency and 35% higher data extraction accuracy. WISE provides a mechanism for automating the extraction of newspaper data; it handles unstructured, dynamic, and heterogeneous data as it changes and processes content in a context-aware manner, which distinguishes it from the limited crawlers available for traditional media and journalistic applications. Its processing speed and reliability come from intelligent link prioritization, real-time adaptability, and noise management. The framework can also be used to extract structured data quickly and efficiently from multiple unstructured sources in fields such as e-commerce, research, and cybersecurity.

Future works

Future enhancements to WISE will implement reinforcement learning for autonomous crawling decisions, so that WISE can learn the best strategies from user feedback and from patterns in how users interact with it. WISE will also be extended to extract more complex information from newspaper databases, to support multimodal extraction (such as images and audio), and to use deep learning models for news sentiment analysis, article categorization, and headline summarization. Finally, real-world industrial applications with large datasets can be used to refine WISE for scale, performance, and adaptability across different web ecosystems, allowing further validation, benchmarking, and testing.

Future work will focus on integrating reinforcement learning to enable autonomous decision-making during crawling and expanding the framework to support multilingual and multimodal news content. Additionally, we plan to evaluate WISE on large-scale, real-world deployments to further enhance scalability, adaptability, and domain generalization. These directions will ensure continuous improvement of the framework’s intelligence and practical applicability.

Author contributions

The authors confirm their contributions to the paper as follows:

Conceptualization, Methodology: SS; Formal analysis and investigation: SS & AKA; Writing - original draft preparation: SS; Writing - review and editing: SS & AKA; Supervision: AKA. All authors reviewed the results and approved the final version of the manuscript.

Funding

The authors received no specific funding for this study.

Data availability

The data used in this research are available in the following links: https://www.kaggle.com/datasets/banuprakashv/news-articles-classification-dataset-for-nlp-and-ml?utm_source=perplexity.

Declarations

Competing interests

The authors declare no competing interests.

Conflict of interest

The authors declare that there are no conflicts of interest regarding the publication of this paper. The authors have no financial or personal relationships that could influence the research outcomes, or the interpretation of the data presented in this manuscript.

Publisher’s note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

References

1. Khandoker, S; Kabir, MH. Digital news story extractor (DNSE): A system to extract data from online news archives. SAGE Open.; 2023; 13, 3 pp. 1-15.

2. Peng, H; Li, Q. Research on the automatic extraction method of web data objects based on deep learning. Intell. Autom. Soft Comput.; 2020; 26, 3 pp. 609-616. [DOI: https://dx.doi.org/10.32604/iasc.2020.013939]

3. Shrivastava, GK; Pateriya, RK; Kaushik, P. An efficient focused crawler using LSTM-CNN based deep learning. Int. J. Syst. Assur. Eng. Manage.; 2023; 14, 1 pp. 391-407. [DOI: https://dx.doi.org/10.1007/s13198-022-01808-w]

4. Anwar, M; Shoaib, I. Nano-optimized deep learning for early cancer detection using nanosensor data. J. Nano Mol. Intell. Virtual Health Syst.; 2025; 1, 1 pp. 27-40. [DOI: https://dx.doi.org/10.70023/nmivhs.251103]

5. Kaur, S; Singh, A; Geetha, G; Cheng, X. IHWC: intelligent hidden web crawler for harvesting data in urban domains. Complex. Intell. Syst.; 2023; 9, 4 pp. 3635-3653. [DOI: https://dx.doi.org/10.1007/s40747-021-00471-1]

6. Bifulco, I; Cirillo, S; Esposito, C; Guadagni, R; Polese, G. An intelligent system for focused crawling from big data sources. Expert Syst. Appl.; 2021; 184, 115560. [DOI: https://dx.doi.org/10.1016/j.eswa.2021.115560]

7. Jarrah, A; Abu Asfar, G. Swarm robotics optimization using deep Q-learning for cooperative search and rescue missions. PatternIQ Min.; 2025; 2, 1 pp. 47-58. [DOI: https://dx.doi.org/10.70023/sahd/250205]

8. Koloveas, P; Chantzios, T; Alevizopoulou, S; Skiadopoulos, S; Tryfonopoulos, C. Intime: A machine learning-based framework for gathering and leveraging web data to cyber-threat intelligence. Electronics; 2021; 10, 7 818. [DOI: https://dx.doi.org/10.3390/electronics10070818]

9. Hwang, J; Kim, J; Chi, S; Seo, J. Development of training image database using web crawling for vision-based site monitoring. Autom. Constr.; 2022; 135, 104141. [DOI: https://dx.doi.org/10.1016/j.autcon.2022.104141]

10. Shamrat, FJM; Tasnim, Z; Rahman, AS; Nobel, NI; Hossain, SA. An effective implementation of web crawling technology to retrieve data from the world wide web (www). Int. J. Sci. Technol. Res.; 2020; 9, 01 pp. 1252-1256.

11. Khder, MA. Web scraping or web crawling: state of art, techniques, approaches and application. International J. Adv. Soft Comput. & its Applications.; 2021; 13, 3 pp. 145-168. [DOI: https://dx.doi.org/10.15849/IJASCA.211128.11]

12. Cvrček, V et al. Comparing web-crawled and traditional corpora. Lang. Resour. Evaluation; 2020; 54, 3 pp. 713-745. [DOI: https://dx.doi.org/10.1007/s10579-020-09487-4]

13. Martinez-Rodriguez, JL; Hogan, A; Lopez-Arevalo, I. Information extraction Meets the semantic web: a survey. Semantic Web; 2020; 11, 2 pp. 255-335.

14. Patel, A; Jain, S. Present and future of semantic web technologies: a research statement. Int. J. Comput. Appl.; 2021; 43, 5 pp. 413-422.

15. Kinne, J; Lenz, D. Predicting innovative firms using web mining and deep learning. PloS One; 2021; 16, 4 e0249071. [DOI: https://dx.doi.org/10.1371/journal.pone.0249071] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/33793626] [PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8016297]

16. Patnaik, SK; Babu, CN; Bhave, M. Intelligent and adaptive web data extraction system using convolutional and long short-term memory deep learning networks. Big Data Min. Analytics; 2021; 4, 4 pp. 279-297. [DOI: https://dx.doi.org/10.26599/BDMA.2021.9020012]

17. Fanni, SC; Febi, M; Aghakhanyan, G; Neri, E. Natural Language processing. Introduction To Artificial Intelligence; 2023; Springer International Publishing: pp. 87-99. [DOI: https://dx.doi.org/10.1007/978-3-031-25928-9_5]

18. Tabassum, A; Patil, RR. A survey on text preprocessing & feature extraction techniques in natural Language processing. Int. Res. J. Eng. Technol. (IRJET); 2020; 7, 06 pp. 4864-4867.

19. Boppana, V; Sandhya, P. Web crawling based context aware recommender system using optimized deep recurrent neural network. J. Big Data; 2021; 8, 1 144. [DOI: https://dx.doi.org/10.1186/s40537-021-00534-7]

20. Chang, Z. A survey of modern crawler methods. In Proceedings of the 6th International Conference on Control Engineering and Artificial Intelligence (pp. 21–28). (2022), March.

21. Xu, Z. et al. Cleaner pretraining corpus curation with neural web scraping. arXiv preprint arXiv:2402.14652. (2024).

22. Liu, L; Liu, M; Liu, S; Ding, K. Event extraction as machine reading comprehension with question-context bridging. Knowl. Based Syst.; 2024; 299, 112041. [DOI: https://dx.doi.org/10.1016/j.knosys.2024.112041]

23. Feng, Y; Zhang, F; Zhang, Y; Dong, J; Wang, P. A hybrid extraction model for semantic knowledge discovery of water conservancy big data. PeerJ Comput. Sci.; 2025; 11, e2960. [DOI: https://dx.doi.org/10.7717/peerj-cs.2960] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/40989349][PubMedCentral: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453640]

24. Dhanith, P. R., Surendiran, B. & Raja, S. P. A word embedding based approach for focused web crawling using the recurrent neural network. (2021).

25. Zhouyi, X; Weijun, H; Yanrong, H. Intelligent acquisition method of herbaceous flowers image based on theme crawler, deep learning and game theory. Кронос; 2022; 7, 4 (66) pp. 44-52.

26. Neelakandan, S et al. An automated word embedding with parameter tuned model for web crawling. Intell. Autom. Soft Comput.; 2022; 32, 3 pp. 1617-1632. [DOI: https://dx.doi.org/10.32604/iasc.2022.022209]

27. Campos Macias, N; Düggelin, W; Ruf, Y; Hanne, T. Building a technology recommender system using web crawling and natural Language processing technology. Algorithms; 2022; 15, 8 272. [DOI: https://dx.doi.org/10.3390/a15080272]

28. Chaitanya, A; Shetty, J; Chiplunkar, P. Food image classification and data extraction using convolutional neural network and web crawlers. Procedia Comput. Sci.; 2023; 218, pp. 143-152. [DOI: https://dx.doi.org/10.1016/j.procs.2022.12.410]

29. Sharma, AK; Shrivastava, V; Singh, H. Experimental performance analysis of web crawlers using single and Multi-Threaded web crawling and indexing algorithm for the application of smart web contents. Mater. Today: Proc.; 2021; 37, pp. 1403-1408.

30. Namoun, A., Alshanqiti, A., Chamudi, E. & Rahmon, M. A. Web design scraping: Enabling factors, opportunities and research directions. In 2020 12th International Conference on Information Technology and Electrical Engineering (ICITEE) (pp. 104–109). IEEE. (2020), October.

31. Akhtar, MJ et al. An efficient mechanism for product data extraction from E-Commerce website. Computers Mater. & Continua.; 2020; 65, 3 pp. 2639-2663. [DOI: https://dx.doi.org/10.32604/cmc.2020.011485]

32. https://www.kaggle.com/datasets/banuprakashv/news-articles-classification-dataset-for-nlp-and-ml?utm_source=perplexity


© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Efficient data extraction from the ever-expanding web, including structured and unstructured sources such as newspaper databases, is critical for industries like media, research, and journalism. Traditional web crawlers, which are primarily rule-based or keyword-driven, struggle with adaptability, semantic understanding, and real-time responsiveness when working with diverse data formats and layouts found in newspaper archives. This research proposes WISE (Web-Intelligent Semantic Extractor), an intelligent, deep learning-based framework designed to overcome these challenges. By integrating Natural Language Processing (NLP) and neural networks, WISE can extract contextually relevant information from dynamic newspaper databases, improving both accuracy and efficiency in data retrieval. The system dynamically adjusts crawling strategies based on content semantics, learning patterns from diverse data sources to enhance relevance and reduce noise. WISE outperforms conventional rule-based, keyword-driven, and non-semantic crawlers by 35% in terms of extraction accuracy and 40% in terms of processing efficiency, according to experimental evaluations conducted on benchmark datasets and actual online environments. WISE showed exceptional scalability, contextual accuracy, semantic understanding, and real-time flexibility in a variety of online scenarios using the News Articles Classification Dataset (Kaggle) and real-time newspaper sources. The framework demonstrates superior performance in extracting structured data from heterogeneous sources while maintaining scalability and security. This work presents a novel, intelligent solution designed to meet the challenges of modern web environments.

Details

Title

AI driven web crawling for semantic extraction of news content from newspapers

Author

S, Saravanan¹; A K, Ashfauk Ahamed²

¹ Department of Computer Science and Engineering, B.S.Abdur Rahman Crescent Institute of Science & Technology, 600 048, Chennai, India (ROR: https://ror.org/01fqhas03) (GRID: grid.449273.f) (ISNI: 0000 0004 7593 9565)
² Department of Computer Applications, B.S.Abdur Rahman Crescent Institute of Science & Technology, 600 048, Chennai, India (ROR: https://ror.org/01fqhas03) (GRID: grid.449273.f) (ISNI: 0000 0004 7593 9565)

Pages

41673

Section

Article

Publication year

2025

Publication date

2025

Publisher

Nature Publishing Group

e-ISSN

20452322

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1038/s41598-025-25616-x

ProQuest document ID

3275323990
