
Abstract

Feature selection is the data analysis process of selecting a smaller, curated subset of a dataset by filtering out irrelevant or redundant features. The most important features can be ranked and selected using statistical measures such as mutual information. Feature selection not only reduces the size of the dataset, and thus the time required to train machine learning models, but also improves model accuracy by increasing the quality of the data. However, two major challenges arise: mutual information calculations are computationally expensive, and data transfer overhead becomes a bottleneck when selecting features from large datasets.
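As an illustration of this ranking step, the short sketch below scores features by mutual information using scikit-learn; the synthetic dataset and the choice of k = 10 are placeholders for illustration, not details taken from the thesis.

```python
# Illustrative only: ranking features by mutual information with scikit-learn.
# The dataset and parameter choices here are placeholders, not from the thesis.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import mutual_info_classif, SelectKBest

X, y = make_classification(n_samples=10_000, n_features=50,
                           n_informative=10, random_state=0)

# Score every feature against the label, then keep the top k.
scores = mutual_info_classif(X, y, random_state=0)
selector = SelectKBest(mutual_info_classif, k=10).fit(X, y)
X_reduced = selector.transform(X)

print(np.argsort(scores)[::-1][:10])   # indices of the ten highest-MI features
print(X_reduced.shape)                 # (10000, 10)
```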

This thesis introduces the first near-storage acceleration of feature selection to address these challenges, filtering the data where it resides. The Mutual Information Maximisation (MIM) algorithm is chosen for acceleration because it requires only a single pass over the dataset, making it well suited to near-storage deployment. Targeting FPGA-based computational storage devices (CSDs) with inherent hardware resource constraints, this work develops a methodology for approximating MIM. The approximation reduces the computational load and hardware requirements of mutual information, and a design space exploration (DSE) process identifies configurations that balance accuracy against FPGA resources. The DSE results guide the hardware implementation, whose design focusses on high-speed histogram counting, identified as the primary bottleneck of mutual information computation.
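To make the single-pass structure concrete, the following NumPy sketch computes MIM scores from joint histograms of pre-quantised features and labels. The binning scheme and bin counts are assumptions for illustration, not the thesis's FPGA design, but the sketch shows why histogram counting dominates the work.

```python
# A minimal NumPy sketch of histogram-based MIM, assuming features are already
# quantised into `n_bins` integer levels and labels into `n_classes`; this is a
# simplification for illustration, not the thesis's FPGA implementation.
import numpy as np

def mim_scores(X_binned, y, n_bins, n_classes):
    """Score each feature by I(feature; label) using joint-histogram counts."""
    n_samples, n_features = X_binned.shape
    scores = np.empty(n_features)
    for f in range(n_features):
        # Histogram counting: the dominant cost, one pass over the column.
        joint = np.zeros((n_bins, n_classes))
        np.add.at(joint, (X_binned[:, f], y), 1)
        pxy = joint / n_samples
        px = pxy.sum(axis=1, keepdims=True)
        py = pxy.sum(axis=0, keepdims=True)
        nz = pxy > 0
        scores[f] = np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))
    return scores

rng = np.random.default_rng(0)
Xb = rng.integers(0, 16, size=(100_000, 8))        # 8 features, 16 bins each
y = (Xb[:, 0] > 7).astype(int)                     # label correlated with feature 0
print(np.argsort(mim_scores(Xb, y, 16, 2))[::-1])  # feature 0 should rank first
```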

The novel FPGA-based accelerator is implemented on a Samsung/AMD SmartSSD CSD. It fully utilises the SSD's NVMe bandwidth to perform feature selection at SSD read speed while minimising data transfers to the main processor. The evaluation shows that the near-storage MIM accelerator achieves a speedup of up to 37× over mainstream multiprocessing Python ML libraries and up to 21× over an optimised C library running on an Intel i9. Additionally, it is more than 70× more energy efficient for large out-of-core datasets, while occupying only up to 15% of the available FPGA resources.

Finally, this thesis extends the aforementioned work to support Minimum Redundancy Maximum Relevance (mRMR), a more advanced mutual-information-based feature selection algorithm that requires multiple passes over the data. The higher data reuse of mRMR poses a challenge for near-storage processing, as its benefits are expected to diminish once the dataset fits within host or GPU memory. However, this thesis argues and demonstrates that, for larger-than-memory datasets, the same principles that favour near-storage MIM also apply to mRMR, making the proposed near-storage processing approach the most efficient solution.
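For contrast with the single-pass MIM sketch above, the hedged sketch below follows a common textbook formulation of greedy mRMR (relevance minus mean redundancy); the helper names and scoring are assumptions for illustration, not the thesis's accelerated implementation. The inner loop makes explicit the repeated passes over already-selected feature columns that drive mRMR's data reuse.

```python
# A hedged sketch of greedy mRMR using the same joint-histogram approach as the
# MIM sketch; the scoring rule (relevance minus mean redundancy) is a common
# textbook variant, assumed here rather than taken from the thesis.
import numpy as np

def mutual_info(a, b, n_a, n_b):
    """I(a; b) in bits from a joint histogram of two integer-binned columns."""
    joint = np.zeros((n_a, n_b))
    np.add.at(joint, (a, b), 1)
    pxy = joint / len(a)
    px = pxy.sum(axis=1, keepdims=True)
    py = pxy.sum(axis=0, keepdims=True)
    nz = pxy > 0
    return np.sum(pxy[nz] * np.log2(pxy[nz] / (px @ py)[nz]))

def mrmr(X_binned, y, k, n_bins, n_classes):
    """Greedily pick k features maximising relevance to y minus mean redundancy."""
    n_features = X_binned.shape[1]
    relevance = np.array([mutual_info(X_binned[:, f], y, n_bins, n_classes)
                          for f in range(n_features)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for f in range(n_features):
            if f in selected:
                continue
            # Each round re-reads the candidate and all selected columns: the
            # data reuse that makes mRMR harder to push near storage than MIM.
            redundancy = np.mean([mutual_info(X_binned[:, f], X_binned[:, s],
                                              n_bins, n_bins) for s in selected])
            score = relevance[f] - redundancy
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
    return selected

rng = np.random.default_rng(0)
Xb = rng.integers(0, 16, size=(50_000, 8))
y = (Xb[:, 0] > 7).astype(int)
print(mrmr(Xb, y, 3, 16, 2))
```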

Details

Title: Near-Storage Acceleration of Feature Selection
Number of pages: 184
Publication year: 2025
Degree date: 2025
School code: 1543
Source: DAI-A 87/4(E), Dissertation Abstracts International
ISBN: 9798297674790
Advisor:
University/institution: The University of Manchester (United Kingdom)
University location: England
Degree: Ph.D.
Source type: Dissertation or Thesis
Language: English
Document type: Dissertation/Thesis
Dissertation/thesis number: 32293324
ProQuest document ID: 3267146034
Document URL: https://www.proquest.com/dissertations-theses/near-storage-acceleration-feature-selection/docview/3267146034/se-2?accountid=208611
Copyright: Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database: ProQuest One Academic