Abstract

Recent advances in natural language processing (NLP), applied linguistics, indexing, data mining, information retrieval, and machine translation have emphasized the need for robust datasets and corpora. Although many Arabic corpora exist, most are derived from social media platforms such as X or from news sources, leaving a significant gap in datasets tailored to academic research. To address this gap, the Arabic Research Papers Dataset (ARPD) was developed as a specialized resource of Arabic academic research papers. This paper explains the methodology used to construct the dataset, which consists of seven classes and is publicly available in several formats to benefit Arabic research. Experiments on the ARPD evaluate its use in classification and clustering tasks. The results show that most classical clustering algorithms perform worse than bio-inspired algorithms such as Particle Swarm Optimization (PSO) and Gray Wolf Optimization (GWO), as measured by the Davies–Bouldin index. For classification, the Support Vector Machine (SVM) achieved the highest accuracy, while the other classifiers ranged from 89% to 99%. These findings highlight the ARPD's potential to enhance Arabic academic research and support advanced NLP applications.
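
The abstract compares clustering algorithms by the Davies–Bouldin index, where a lower value indicates more compact and better separated clusters. The following is a minimal sketch of that measure in plain Python, not the authors' evaluation code; the function name `davies_bouldin` and the toy 2-D points are assumptions made only for illustration, and Euclidean distance to cluster centroids is assumed.

```python
import math
from collections import defaultdict


def davies_bouldin(points, labels):
    """Davies-Bouldin index of a hard clustering (lower is better).

    Hypothetical helper for illustration. Assumes Euclidean distance
    and that no two cluster centroids coincide (otherwise the ratio
    below would divide by zero).
    """
    clusters = defaultdict(list)
    for point, label in zip(points, labels):
        clusters[label].append(point)
    if len(clusters) < 2:
        raise ValueError("the index needs at least two clusters")

    # Centroid of each cluster: coordinate-wise mean of its points.
    centroids = {
        label: [sum(coord) / len(members) for coord in zip(*members)]
        for label, members in clusters.items()
    }
    # Scatter of each cluster: mean distance of its points to the centroid.
    scatter = {
        label: sum(math.dist(p, centroids[label]) for p in members) / len(members)
        for label, members in clusters.items()
    }

    # For each cluster take its worst (largest) similarity ratio to any
    # other cluster, then average those worst cases over all clusters.
    total = 0.0
    for i in clusters:
        total += max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
    return total / len(clusters)


# Toy data (assumed): two tight groups that are far apart.
points = [(0, 0), (0, 2), (10, 0), (10, 2)]
good = davies_bouldin(points, [0, 0, 1, 1])  # groups match the labels
bad = davies_bouldin(points, [0, 1, 0, 1])   # labels cut across the groups
print(good, bad)
```

On this toy data the labeling that follows the two groups scores lower than the one that cuts across them, which is the sense in which the abstract ranks one clustering algorithm above another.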

Details

1009240
Title
Open source Arabic research paper dataset for natural language processing
Author
Almutairi, Tahani M. 1 ; Saifuddin, Shireen R. 1 ; Alotaibi, Reem M. 1 ; Sarhan, Shahendah 2 

1 Department of Information Technology, King Abdulaziz University, Jeddah, Saudi Arabia (ROR: https://ror.org/02ma4wv74) (GRID: grid.412125.1) (ISNI: 0000 0001 0619 1117)
2 Computers Science Department, Faculty of Computers and Information, Mansoura University, 35516, Mansoura, Egypt (ROR: https://ror.org/01k8vtd75) (GRID: grid.10251.37) (ISNI: 0000 0001 0342 6662); School of Computer Science and Technologies, VIZJA University, Warsaw, Poland (ROR: https://ror.org/00523a319) (GRID: grid.17165.34) (ISNI: 0000 0001 0682 421X)
Volume
15
Issue
1
Pages
31631
Number of pages
21
Publication year
2025
Publication date
2025
Section
Article
Publisher
Nature Publishing Group
Place of publication
London
Country of publication
United States
Publication subject
e-ISSN
20452322
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-08-27
Milestone dates
2025-08-18 (Registration); 2025-06-30 (Received); 2025-08-18 (Accepted)
First posting date
27 Aug 2025
ProQuest document ID
3244166900
Document URL
https://www.proquest.com/scholarly-journals/open-source-arabic-research-paper-dataset-natural/docview/3244166900/se-2?accountid=208611
Copyright
© The Author(s) 2025. This work is published under http://creativecommons.org/licenses/by-nc-nd/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-09-22
Database
ProQuest One Academic