Networks-on-Chip (NoCs) have become an integral part of modern systems-on-chip (SoCs), connecting several components (mainly processing elements) with high area and energy efficiency and strong quality-of-service (QoS) guarantees. Inspired by this, an emerging trend is the Network-on-Memory (NoM) architecture, where a NoC serves as the main interconnect for memory modules (e.g., DRAM banks) to build a high-performance memory subsystem. We analyze the last-level cache (LLC) misses of a wide range of data-intensive workloads from the SPEC, APACHE, PARSEC, and in-memory computing benchmark suites in a baseline DRAM-based NoM. We show that these workloads easily overwhelm the limited NoC resources (e.g., buffers and link bandwidth) inside the NoM, increasing memory service latency and data communication energy.
In this thesis, as our first contribution, we propose a static data compression scheme implemented within a DRAM-based NoM architecture. In essence, our approach enables data compression on the NoC inside a NoM architecture to reduce NoC traffic and, therefore, improve memory service latency and energy consumption. Our approach uses a static lookup table (LUT) to store compressed codes of common data patterns and exploits this LUT during LLC misses to transmit these codes over the NoC instead of the original uncompressed data. We store this LUT in the DRAM subarrays within DRAM banks. We formulate the compression and decompression mechanisms as a combination of LUT-based pattern matching and prefix concatenation, and implement them using low-latency DRAM internal circuitry and analog properties.
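To make the mechanism concrete, the following minimal Python sketch models LUT-based pattern matching with prefix concatenation. The patterns, the 3-bit code width, and the 32-bit word size are illustrative assumptions, not our actual design parameters; in the thesis, this logic is realized with low-latency DRAM internal circuitry rather than software.

```python
# Software sketch of LUT-based compression with prefix concatenation.
# All patterns, code widths, and word sizes here are hypothetical.

# Hypothetical static LUT: common 32-bit data patterns -> 3-bit codes.
LUT = {
    0x00000000: 0b000,  # all zeros
    0xFFFFFFFF: 0b001,  # all ones
    0x00000001: 0b010,  # small integer
    0xDEADBEEF: 0b011,  # placeholder "hot" pattern
}
CODE_BITS = 3
WORD_BITS = 32

def compress(words):
    """Emit '1' + short code for a LUT hit, '0' + raw bits for a
    miss; concatenating the per-word prefixes yields the stream."""
    out = []
    for w in words:
        if w in LUT:
            out.append('1' + format(LUT[w], f'0{CODE_BITS}b'))
        else:
            out.append('0' + format(w, f'0{WORD_BITS}b'))
    return ''.join(out)

def decompress(bits):
    """Parse the stream back into words using the inverse LUT."""
    inv = {code: w for w, code in LUT.items()}
    words, i = [], 0
    while i < len(bits):
        if bits[i] == '1':  # LUT hit: a short code follows the prefix
            words.append(inv[int(bits[i + 1:i + 1 + CODE_BITS], 2)])
            i += 1 + CODE_BITS
        else:               # miss: the raw word follows the prefix
            words.append(int(bits[i + 1:i + 1 + WORD_BITS], 2))
            i += 1 + WORD_BITS
    return words

line = [0x00000000, 0xDEADBEEF, 0x12345678, 0xFFFFFFFF]
packed = compress(line)
assert decompress(packed) == line
print(len(packed), 'bits vs', len(line) * WORD_BITS)  # 45 vs 128
```

The per-word prefix bit is what keeps the stream self-describing: a hit costs 1 + 3 bits instead of 32, while a miss costs only one extra bit.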
Our proposed static data compression scheme reduces compression and decompression latency by exploiting subarray-level parallelism to compress and decompress several CPU data misses simultaneously. We evaluate our approach using data-intensive workloads from the SPEC, APACHE, PARSEC, and in-memory computing benchmark suites. Our results show that, compared to a baseline NoM architecture, our approach significantly improves the performance and energy consumption of the NoM.
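As a software analogy for this subarray-level parallelism, the sketch below fans several missed cache lines out to a pool of workers, one per subarray. The four-subarray count is an illustrative assumption, and compress() refers to the earlier sketch.

```python
# Software analogy of subarray-level parallelism: each pending LLC
# miss is compressed by an independent "subarray", modeled here as a
# worker in a thread pool. The subarray count is hypothetical.
from concurrent.futures import ThreadPoolExecutor

def serve_misses(miss_blocks, num_subarrays=4):
    """Compress several missed cache lines concurrently, one block
    per subarray, using compress() from the previous sketch."""
    with ThreadPoolExecutor(max_workers=num_subarrays) as pool:
        return list(pool.map(compress, miss_blocks))
```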
The substantial performance and energy improvements on the NoC from our proposed static data compression scheme motivate us to explore the application of static data compression in emerging architectures, such as neuromorphic hardware. Scalable neuromorphic hardware is designed as a many-core architecture in which a shared interconnect, such as a NoC, communicates spikes between the neuromorphic cores. Digital neuromorphic systems use the address-event representation (AER) protocol, which converts each spike into a digital NoC packet carrying the address of its originating neuron. Using four realistic workloads, we show that the AER protocol generates a large volume of packets on the NoC of a state-of-the-art many-core neuromorphic architecture, increasing NoC congestion, communication latency, and energy consumption. Intuitively, static compression would reduce NoC traffic and, therefore, lower communication latency and energy consumption, leading to overall application-level performance and energy improvements.
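The sketch below illustrates AER-style packetization with hypothetical packet fields: every asserted bit in a core's output spike vector becomes one NoC packet carrying the address of its source neuron, so the packet count, and hence NoC traffic, grows linearly with spike activity.

```python
# Minimal sketch of AER-style packetization. The packet layout and
# the 256-neuron core size are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class AERPacket:
    src_core: int   # originating neuromorphic core
    neuron_id: int  # address of the spiking neuron
    timestep: int   # algorithmic timestep of the spike

def aer_encode(core_id, timestep, spike_vector):
    """One packet per asserted bit in the core's output spikes."""
    return [AERPacket(core_id, n, timestep)
            for n, fired in enumerate(spike_vector) if fired]

# A 256-neuron core firing 3 spikes injects 3 packets into the NoC.
spikes = [0] * 256
spikes[7] = spikes[42] = spikes[200] = 1
print(len(aer_encode(core_id=0, timestep=5, spike_vector=spikes)))  # 3
```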
Many-core neuromorphic systems present unique architectural challenges that necessitate specialized static compression hardware different from our approach for NoM architectures. As our second contribution in this thesis, we address these challenges by introducing a novel protocol for communicating spikes on the NoC of many-core neuromorphic hardware. Our proposed protocol organizes the output spikes of a neuromorphic core (in space and time) as a single binary spike string and compresses it into fewer bits using specialized static compression hardware. The proposed static compression technique collects a static set of common data patterns within the binary spike strings into an LUT and uses a compressed code for each such pattern.
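A minimal sketch of this protocol, with an illustrative 8-bit chunk width and LUT contents: the core's spikes in one timestep form a single binary string, which is chopped into fixed-width chunks and matched against the LUT of common spike patterns, reusing the prefix-concatenation idea from the NoM scheme.

```python
# Hypothetical static LUT: common 8-bit spike patterns -> 2-bit codes.
SPIKE_LUT = {'00000000': '00', '11111111': '01', '10000000': '10'}

def compress_spike_string(spike_string, chunk=8):
    """Emit '1' + code on a LUT hit, '0' + raw chunk on a miss."""
    out = []
    for i in range(0, len(spike_string), chunk):
        c = spike_string[i:i + chunk]
        out.append('1' + SPIKE_LUT[c] if c in SPIKE_LUT else '0' + c)
    return ''.join(out)

# Sparse spike activity compresses well: mostly all-zero chunks.
s = '00000000' * 30 + '10000000' + '00100100'
print(len(compress_spike_string(s)), 'bits vs', len(s))  # 102 vs 256
```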
We evaluate our approach on a state-of-the-art NoC-based many-core neuromorphic hardware using four realistic workloads. We show that, compared to the baseline AER protocol, our approach significantly reduces the spike traffic on the NoC, lowering NoC congestion and the latency and energy of communicating spikes, which leads to higher application-level performance and energy efficiency. Moreover, we show that the compression and decompression hardware introduces minimal area and energy overhead to a neuromorphic core.
As our third contribution in this thesis, we propose a novel approximation technique that leverages the error resilience of spiking neural networks (SNNs) and trades off accuracy for additional gains in the performance and energy efficiency of spike communication in NoC-based many-core neuromorphic hardware. Our approach uses lightweight error and pattern compute units to approximate data patterns within the output spikes of a neuromorphic core to the nearest compressible patterns defined by the underlying compression hardware. In this way, our approach maximizes the compression hit rate, which reduces spike traffic and improves the latency and energy of spike communication on the NoC.
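The sketch below, with a hypothetical chunk width and error budget, shows the nearest-pattern rule: an error unit measures the Hamming distance from each spike chunk to every LUT pattern, and a pattern unit substitutes the nearest pattern whenever at most max_flips spikes would change (SPIKE_LUT refers to the earlier sketch).

```python
def hamming(a, b):
    """Number of spike positions at which two chunks differ."""
    return sum(x != y for x, y in zip(a, b))

def approximate_chunk(chunk, lut_patterns, max_flips=1):
    """Snap a chunk to its nearest compressible pattern if that
    costs at most max_flips flipped spikes; else keep it exact."""
    best = min(lut_patterns, key=lambda p: hamming(chunk, p))
    return best if hamming(chunk, best) <= max_flips else chunk

# One flipped spike buys a guaranteed compression hit.
print(approximate_chunk('00000100', SPIKE_LUT.keys()))  # '00000000'
```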
We use Torch dialects (e.g., snnTorch) to build approximated SNN models by configuring our proposed technique at different fully-connected (FC) layers of a baseline SNN model in a plug-and-play manner. We train baseline SNN models for three realistic workloads and show that directly applying our approximation technique to these baseline models reduces their test accuracy.
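A minimal snnTorch sketch of this plug-and-play construction is shown below. The ApproxSpikes module is a hypothetical stand-in for our technique that snaps near-empty spike chunks to the all-zero LUT pattern; the layer sizes, neuron parameters, and chunk width are illustrative assumptions.

```python
import torch
import torch.nn as nn
import snntorch as snn

class ApproxSpikes(nn.Module):
    """Hypothetical plug-in stage: zero out 8-spike chunks holding at
    most max_flips spikes so they hit the all-zero LUT pattern."""
    def __init__(self, chunk=8, max_flips=1):
        super().__init__()
        self.chunk, self.max_flips = chunk, max_flips

    def forward(self, spk):
        b, n = spk.shape
        chunks = spk.view(b, n // self.chunk, self.chunk)
        sparse = chunks.sum(dim=-1, keepdim=True) <= self.max_flips
        return torch.where(sparse, torch.zeros_like(chunks), chunks).view(b, n)

class ApproxSNN(nn.Module):
    """Two-FC-layer SNN with approximation inserted after FC layer 1."""
    def __init__(self):
        super().__init__()
        self.fc1, self.lif1 = nn.Linear(784, 256), snn.Leaky(beta=0.9)
        self.approx1 = ApproxSpikes()  # plug-and-play insertion point
        self.fc2, self.lif2 = nn.Linear(256, 10), snn.Leaky(beta=0.9)

    def forward(self, x, num_steps=25):
        mem1, mem2, out = self.lif1.init_leaky(), self.lif2.init_leaky(), []
        for _ in range(num_steps):
            spk1, mem1 = self.lif1(self.fc1(x), mem1)
            spk1 = self.approx1(spk1)  # approximate the spikes here
            spk2, mem2 = self.lif2(self.fc2(spk1), mem2)
            out.append(spk2)
        return torch.stack(out)  # (num_steps, batch, classes)
```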
We further introduce approximation-aware retraining, where we retrain these approximated models for a varying number of epochs, and provide sensitivity studies of model accuracy across epochs, SNN models, and approximation configurations. Our sensitivity studies show a trade-off between the improvement in model accuracy and the resulting latency overhead of approximation-aware retraining. Moreover, we evaluate our approach on a state-of-the-art many-core neuromorphic hardware. Our results with three realistic workloads show that our proposed technique significantly increases the compression hit rate, improving the spike communication latency and energy consumption on the NoC.
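A sketch of the approximation-aware retraining loop is given below, under assumed names: model is the approximated network from the previous sketch, and a rate-coded spike-count readout feeds a standard cross-entropy loss. Each additional epoch improves accuracy at the cost of retraining latency, which is exactly the trade-off our sensitivity studies quantify.

```python
import torch

def retrain(model, loader, epochs=5, lr=1e-3):
    """Fine-tune an approximated SNN so that its weights learn to
    compensate for the spikes flipped by the approximation stage."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):  # more epochs: better accuracy, more latency
        for x, y in loader:
            spk_out = model(x.flatten(1))          # (steps, batch, classes)
            loss = loss_fn(spk_out.sum(dim=0), y)  # rate-coded readout
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```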
