Abstract: Bioinformatics pipelines, which process vast amounts of sensitive biological data, are increasingly targeted by cyberattacks. Traditional security measures often fail to provide adequate protection because of the unique computational and network characteristics of these pipelines. This study proposes a machine learning-based Intrusion Detection System (IDS) tailored specifically to bioinformatics workflows. While the CICIDS2017 dataset serves as the primary benchmark, we augment it with bioinformatics-specific network traffic to ensure relevance. We compare the performance of four machine learning algorithms, namely Random Forest (RF), Support Vector Machine (SVM), Convolutional Neural Network (CNN), and Gradient Boosting Machine (GBM), and explore hybrid models for enhanced detection. Our findings highlight GBM's superior accuracy (98.3%) while also addressing its computational overhead and susceptibility to adversarial attacks. The study contributes novel insights by integrating real-world bioinformatics traffic data and proposing adaptive security strategies for genomic research environments.
Keywords: Machine learning, Intrusion detection, Algorithms, Cyber-Biosecurity
1. Introduction
Bioinformatics pipelines play a crucial role in genomic research and healthcare, processing petabytes of sensitive biological data. However, their high computational complexity and interconnected architectures expose them to sophisticated cyber threats. Previous studies have explored machine learning approaches for IDS using generic datasets, but few have tailored these systems to the unique network behavior of bioinformatics workflows. This research aims to bridge that gap by introducing a dataset augmentation strategy that incorporates bioinformatics-specific traffic patterns, thereby improving model relevance and accuracy.
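To make the idea of dataset augmentation concrete, the following is a minimal sketch of one way generic benchmark flows could be mixed with domain-specific flows at a fixed ratio. It is not the procedure used in this study: the function name `augment`, the `source` tag, the `domain_fraction` parameter, and the record layout (one dictionary per flow) are all assumptions made for illustration.

```python
import random


def augment(benchmark, domain, domain_fraction, seed=0):
    """Mix benchmark and domain flow records (hypothetical helper).

    Returns a shuffled list in which about `domain_fraction` of the
    records come from `domain`. Each record is copied and tagged with
    a `source` key so the origin stays traceable after mixing.
    """
    if not 0 < domain_fraction < 1:
        raise ValueError("domain_fraction must be strictly between 0 and 1")
    rng = random.Random(seed)

    # Number of domain records needed to reach the requested share.
    n_domain = round(len(benchmark) * domain_fraction / (1 - domain_fraction))

    if n_domain <= len(domain):
        # Enough domain traffic: sample without replacement.
        picked = rng.sample(domain, n_domain)
    else:
        # Too little domain traffic: oversample with replacement.
        picked = [rng.choice(domain) for _ in range(n_domain)]

    merged = [dict(r, source="benchmark") for r in benchmark]
    merged += [dict(r, source="domain") for r in picked]
    rng.shuffle(merged)
    return merged


# Toy records standing in for real flow features (illustrative values only).
benchmark_flows = [{"bytes": 100 + i, "label": "benign"} for i in range(80)]
domain_flows = [{"bytes": 10_000 + i, "label": "benign"} for i in range(10)]

mixed = augment(benchmark_flows, domain_flows, domain_fraction=0.2)
```

Oversampling with replacement is only one option when domain traffic is scarce; it repeats records and can encourage overfitting, so the choice should be validated against held-out domain data.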
1.1 Literature Review
The development of intrusion detection systems for bioinformatics environments has evolved significantly over the past decade, with various approaches and methodologies being explored. This section provides a comprehensive review of existing work, organized by key themes and methodological approaches.
1.1.1 Traditional IDS Approaches in Scientific Computing
Early attempts to secure bioinformatics pipelines relied heavily on traditional signature-based detection methods. Wurmus et al. (2018) implemented basic pattern matching techniques in genomic workflows, achieving moderate success but struggling with novel attack vectors. Building on this foundation, Islam et al. (2019) introduced heuristic-based detection methods specifically designed for high-throughput sequencing environments, though their approach suffered from high false positive rates in production settings.
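The limitation of signature-based detection noted above can be illustrated with a minimal sketch. It is not a reconstruction of any cited system: the signature set, the function `match_signatures`, and the sample payloads are hypothetical and chosen only to show why exact pattern matching misses variants of a known attack.

```python
# Hypothetical signature set: byte patterns associated with known attacks.
SIGNATURES = {
    "sql_injection": b"' OR 1=1",
    "path_traversal": b"../../",
}


def match_signatures(payload):
    """Return the sorted names of all signatures found in the payload."""
    return sorted(
        name for name, pattern in SIGNATURES.items() if pattern in payload
    )


# A request containing a known pattern is flagged.
known_attack = b"GET /results?id=1' OR 1=1 --"

# A trivially modified variant of the same attack matches no signature.
novel_variant = b"GET /results?id=1' OR 2=2 --"

known_hits = match_signatures(known_attack)
variant_hits = match_signatures(novel_variant)
```

Here `known_hits` contains `sql_injection`, while `variant_hits` is empty even though both requests carry the same kind of attack. This gap is what motivates the learned, behavior-based detectors discussed in the sections that follow.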
1.1.2 Machine Learning Applications in Cybersecurity
The integration of machine learning in cybersecurity has seen remarkable progress. Al-Qatf et al. (2018) proposed a deep learning framework combining sparse autoencoders with...




