
Abstract

Large Language Models (LLMs) have emerged as transformative tools across numerous domains, yet their deployment is predominantly confined to powerful, energy-intensive cloud-based servers. This reliance on centralized infrastructure introduces significant challenges related to latency, data privacy, and energy consumption. The complexity, sequential nature, and computational demands of the inference process pose a significant challenge: how to execute these models efficiently on resource-constrained edge devices for use in real-time applications like autonomous systems and advanced 5G/6G networks.

It is therefore desirable to run Small Language Model (SLM) inference locally on edge devices in an energy-efficient and performant way. This would eliminate network latency, ensure data privacy, and enable a new class of real-time Artificial Intelligence (AI) applications without relying on constant cloud connectivity, bringing valuable capabilities to a new generation of intelligent devices.

The main goal of this work is to study techniques for accelerating SLM inference on Field Programmable Gate Arrays (FPGAs). This work tests the hypothesis that offloading the computationally intensive parts of the inference process to a custom hardware accelerator is a well-suited approach, and that this hardware-software co-design can result in a more energy-efficient and high-performance solution compared to purely software-based execution on an embedded processor. To this end, a hardware accelerator covering the entire transformer forward pass is proposed, designed using High-Level Synthesis (HLS).
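
(The dissertation's HLS sources are not part of this record; the following is a minimal, hypothetical C++/HLS sketch of the kind of pipelined matrix-vector kernel that dominates a transformer forward pass. The dimension DIM and the pragma choices are assumptions for illustration, not the thesis's actual design.)

constexpr int DIM = 64; // assumed hidden dimension, roughly TinyStories-260k scale

// y = W * x: the core operation behind attention projections and MLP layers.
void matvec(const float W[DIM][DIM], const float x[DIM], float y[DIM]) {
#pragma HLS ARRAY_PARTITION variable=x complete
row_loop:
    for (int i = 0; i < DIM; ++i) {
        float acc = 0.0f;
    col_loop:
        for (int j = 0; j < DIM; ++j) {
#pragma HLS PIPELINE II=1
            // Note: the loop-carried floating-point accumulation limits the
            // initiation interval actually achieved, which is one reason an
            // integer datapath (see the conclusions below) is attractive.
            acc += W[i][j] * x[j];
        }
        y[i] = acc;
    }
}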

For validation, a functional prototype is developed for the PYNQ-Z2 board, a low-cost System-on-Chip (SoC) platform. Tests with the 260k-parameter TinyStories model assess the proposed solution's performance against a software-only baseline running on the board's integrated ARM processor. The evaluation covers iterative hardware designs, including baseline, memory-optimized, and quantized versions, to measure throughput and identify implementation bottlenecks.
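
(As a rough illustration of the "quantized version" design iteration mentioned above, the sketch below shows an int8 variant of the same kernel. The symmetric per-tensor quantization scheme with scales s_w and s_x is an assumption on our part; the record does not specify the scheme actually used.)

#include <cstdint>

constexpr int DIM = 64; // assumed hidden dimension, as before

void matvec_int8(const int8_t W[DIM][DIM], const int8_t x[DIM],
                 float s_w, float s_x, float y[DIM]) {
row_loop:
    for (int i = 0; i < DIM; ++i) {
        int32_t acc = 0; // 32-bit accumulator: no overflow risk at this size
    col_loop:
        for (int j = 0; j < DIM; ++j) {
#pragma HLS PIPELINE II=1
            acc += static_cast<int32_t>(W[i][j]) * x[j];
        }
        // Dequantize once per output element; a fully integer datapath, as
        // proposed in the future-work direction below, would fold this away.
        y[i] = acc * (s_w * s_x);
    }
}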

The results show that the hardware accelerator could not outperform software-only execution on the same device, invalidating the proposed hypothesis for this specific hardware platform. This negative performance result is nonetheless a valuable finding, underscoring the significant impact of hardware resource limitations on achievable parallelism. The work further contributes to the field by identifying key implementation pitfalls and outlining a clear path for future research, including porting the design to more capable FPGAs and redesigning the datapath to be fully integer-based.

Details

Business indexing term
1010268
Title
Hardware Accelerated Edge Inference of Small Language Models on FPGA
Number of pages
71
Publication year
2025
Degree date
2025
School code
5896
Source
MAI 87/6(E), Masters Abstracts International
ISBN
9798265495532
Committee member
Figueiredo, Mónica
University/institution
Universidade do Porto (Portugal)
University location
Portugal
Degree
Master's
Source type
Dissertation or Thesis
Language
English
Document type
Dissertation/Thesis
Dissertation/thesis number
32426826
ProQuest document ID
3288188715
Document URL
https://www.proquest.com/dissertations-theses/hardware-accelerated-edge-inference-small/docview/3288188715/se-2?accountid=208611
Copyright
Database copyright ProQuest LLC; ProQuest does not claim copyright in the individual underlying works.
Database
ProQuest One Academic