
Abstract

Data preprocessing is usually necessary before running most machine learning classifiers. This work compares three preprocessing techniques: minimal preprocessing, Principal Components Analysis (PCA), and Linear Discriminant Analysis (LDA). The efficiency of each technique is measured using the Support Vector Machine (SVM) classifier, in terms of statistical metrics such as accuracy, precision, recall, the F-1 measure, and AUROC. The preprocessing times and the classifier run times are also compared across the three differently preprocessed datasets. Finally, performance timings on CPUs vs. GPUs, with and without the MapReduce environment, are compared. Two newly created Zeek Connection Log datasets, UWF-ZeekData22 and UWF-ZeekDataFall22, collected using the Security Onion 2 network security monitor and labeled using the MITRE ATT&CK framework, are used for this work. Results show that binomial LDA, on average, performs best in terms of statistical measures as well as timings using GPUs or MapReduce GPUs.
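The statistical metrics named in the abstract can be illustrated with a minimal sketch for a binary classifier. The function names `binary_metrics` and `auroc`, the 0.5 decision threshold, and the toy labels and scores are assumptions made for illustration only; they are not taken from the article or from the UWF-ZeekData22 datasets, and the article's own implementation may differ.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for 0/1 labels (hypothetical helper)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}


def auroc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """AUROC as the fraction of (positive, negative) pairs ranked correctly.

    Ties count as half a correct pair. This pairwise form is quadratic in the
    number of samples, so it is only suitable for small illustrations.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data, invented for illustration: 1 = attack, 0 = benign.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

print(binary_metrics(y_true, y_pred))
print(auroc(y_true, scores))
```

Accuracy, precision, recall, and F1 depend on the chosen threshold, whereas AUROC is computed from the raw scores and is threshold-independent, which is one reason the two kinds of metric are often reported together.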

Details

Title
Analyzing Performance of Data Preprocessing Techniques on CPUs vs. GPUs with and Without the MapReduce Environment
Author
Bagui, Sikha S 1; Eller, Colin 1; Armour, Rianna 1; Singh, Shivani 1; Bagui, Subhash C 2; Mink, Dustin 3

1 Department of Computer Science, The University of West Florida, Pensacola, FL 32514, USA; [email protected] (C.E.); [email protected] (R.A.); [email protected] (S.S.)
2 Department of Mathematics and Statistics, The University of West Florida, Pensacola, FL 32514, USA; [email protected]
3 Department of Cybersecurity, The University of West Florida, Pensacola, FL 32514, USA; [email protected]
Publication title
Volume
14
Issue
18
First page
3597
Number of pages
27
Publication year
2025
Publication date
2025
Publisher
MDPI AG
Place of publication
Basel
Country of publication
Switzerland
Publication subject
e-ISSN
2079-9292
Source type
Scholarly Journal
Language of publication
English
Document type
Journal Article
Publication history
Online publication date
2025-09-10
Milestone dates
2025-08-11 (Received); 2025-09-08 (Accepted)
First posting date
2025-09-10
ProQuest document ID
3254508438
Document URL
https://www.proquest.com/scholarly-journals/analyzing-performance-data-preprocessing/docview/3254508438/se-2?accountid=208611
Copyright
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
Last updated
2025-09-26
Database
ProQuest One Academic