Large Language Models (LLMs) have emerged as transformative tools across numerous domains, yet their deployment remains predominantly confined to powerful, energy-intensive cloud servers. This reliance on centralized infrastructure introduces significant challenges related to latency, data privacy, and energy consumption. The complexity, sequential nature, and computational demands of the inference process raise a central question: how can these models be executed efficiently on resource-constrained edge devices for real-time applications such as autonomous systems and advanced 5G/6G networks?
Running Small Language Model (SLM) inference locally on edge devices, in an energy-efficient and performant way, would eliminate network round-trip latency, preserve data privacy, and enable a new class of real-time Artificial Intelligence (AI) applications that do not depend on constant cloud connectivity, bringing these capabilities to a new generation of intelligent devices.
The main goal of this work is to study techniques for accelerating SLM inference on Field Programmable Gate Arrays (FPGAs). It tests the hypothesis that offloading the computationally intensive parts of the inference process to a custom hardware accelerator is a well-suited approach, and that this hardware-software co-design can yield a more energy-efficient and higher-performance solution than purely software-based execution on an embedded processor. To this end, a hardware accelerator covering the entire transformer forward pass is proposed, designed using High-Level Synthesis (HLS).
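To make the approach concrete, the following is a minimal, illustrative HLS C++ sketch of the kind of kernel such an accelerator is built from: a pipelined matrix-vector product, the operation that dominates the transformer forward pass. All names here (matvec, DIM, the interface bundles) are assumptions for illustration, not the actual interface of the proposed design.

    // Illustrative only: y = W * x, the core operation of each transformer layer.
    // DIM = 64 is a guess consistent with a very small (~260k-parameter) model.
    #define DIM 64

    void matvec(const float W[DIM][DIM], const float x[DIM], float y[DIM]) {
    #pragma HLS INTERFACE m_axi port=W offset=slave bundle=gmem
    #pragma HLS INTERFACE m_axi port=x offset=slave bundle=gmem
    #pragma HLS INTERFACE m_axi port=y offset=slave bundle=gmem
    #pragma HLS INTERFACE s_axilite port=return

        // Buffer the input vector on-chip so the inner loop reads local memory.
        float x_local[DIM];
    copy_x:
        for (int j = 0; j < DIM; ++j) {
    #pragma HLS PIPELINE II=1
            x_local[j] = x[j];
        }

    rows:
        for (int i = 0; i < DIM; ++i) {
            float acc = 0.0f;
    cols:
            for (int j = 0; j < DIM; ++j) {
    #pragma HLS PIPELINE II=1
                // One multiply-accumulate per clock cycle once the pipeline fills.
                acc += W[i][j] * x_local[j];
            }
            y[i] = acc;
        }
    }

With II=1 pipelining, HLS schedules roughly one multiply-accumulate per cycle in the inner loop; going wider requires partitioning the weight array across on-chip memories, which is precisely where the resource limits of a small device constrain achievable parallelism.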
For validation, a functional prototype is developed for the PYNQ-Z2 board, a low-cost System-on-Chip (SoC) platform. Tests with the 260k-parameter TinyStories model assess the proposed solution's performance against a software-only baseline running on the board's integrated ARM processor. The evaluation covers iterative hardware designs, including baseline, memory-optimized, and quantized versions, to measure throughput and identify implementation bottlenecks.
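As an illustration of what the quantized design variant entails, the sketch below shows a hedged int8 version of the same kernel: integer multiply-accumulate with a floating-point dequantization step at the output. The per-tensor scaling scheme and all names are assumptions for illustration; a fully integer-based datapath, as proposed for future work, would remove the floating-point rescale as well.

    // Illustrative int8-quantized variant: integer MACs map to FPGA DSP
    // slices far more cheaply than single-precision floating point.
    #include <cstdint>
    #define DIM 64

    void matvec_q8(const int8_t W_q[DIM][DIM], const int8_t x_q[DIM],
                   float scale_w, float scale_x, float y[DIM]) {
    #pragma HLS INTERFACE m_axi port=W_q offset=slave bundle=gmem
    #pragma HLS INTERFACE m_axi port=x_q offset=slave bundle=gmem
    #pragma HLS INTERFACE m_axi port=y offset=slave bundle=gmem
    #pragma HLS INTERFACE s_axilite port=return

    rows:
        for (int i = 0; i < DIM; ++i) {
            int32_t acc = 0;  // 32-bit accumulator avoids int8 overflow
    cols:
            for (int j = 0; j < DIM; ++j) {
    #pragma HLS PIPELINE II=1
                acc += (int32_t)W_q[i][j] * (int32_t)x_q[j];
            }
            // Dequantize once per output element (symmetric per-tensor scales).
            y[i] = (float)acc * scale_w * scale_x;
        }
    }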
The results show that the hardware accelerator could not outperform software-only execution on the same device, thereby invalidating the proposed hypothesis for this specific hardware platform. This negative performance result is itself a valuable finding, underscoring the significant impact of hardware resource limitations on achievable parallelism. The work nonetheless contributes to the field by identifying key implementation pitfalls and outlining a clear path for future research, including porting the design to more capable FPGAs and redesigning the datapath to be fully integer-based.