Reconfigurable computing (RC) aims to combine the flexibility of general-purpose processors (GPPs) with the performance of application-specific integrated circuits (ASICs). Numerous RC architectures have been proposed since the 1960s, but none has become mainstream. The main factor preventing RC from being used in general-purpose CPUs, GPUs, and mobile devices is that it requires extensive knowledge of digital circuit design, which most software programmers lack. In an RC system, a processor cooperates with a reconfigurable hardware accelerator (HA), usually implemented on a field-programmable gate array (FPGA) chip that can be reconfigured dynamically. The HA implements crucial portions of software (kernels) in hardware to increase overall performance, and its design requires substantial knowledge of digital circuit design. In this paper, a novel RC architecture is proposed that provides exactly the same instruction set as a standard general-purpose RISC microprocessor (e.g., ARM Cortex-M0) while automating the generation of a tightly coupled RC component to improve system performance. This approach keeps decades-old assemblers, compilers, debuggers, library components, and programming practices intact while exploiting the advantages of RC. The proposed architecture employs the LLVM compiler infrastructure to translate an algorithm written in a high-level language (e.g., C/C++) into machine code. It then finds the most frequent instruction pairs and generates an equivalent RC circuit called a miniature accelerator (MA). The MA executes the instruction pairs in parallel with consecutive instructions. Several kernel algorithms alongside EEMBC CoreMark are used to assess the performance of the proposed architecture. A performance improvement of 4.09% to 14.17% is recorded when the HA is turned on.
There is a trade-off between core performance and the combination of compilation time, die area, and program startup load time, which includes the time required to partially reconfigure an FPGA chip.
This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.
1. Introduction
Programs written in imperative programming languages such as C/C++ and Java are Turing equivalent and can be used to solve any computable problem [1]. The focus of this research is to explore ways of improving the performance of Turing equivalent machines.
The Von Neumann (VN) architecture is the cornerstone of general-purpose computing and is based on adapting the algorithm to a fixed hardware. Algorithms that perform the same set of inherently parallel operations on a huge set of data are usually not good candidates for implementation on a VN machine [2]. In application-specific instruction-set processors (ASIPs), the hardware parts that perform the pipeline instruction cycles (IR, ID, OR, EX, and W) are replaced by several functional units (FUs). Each unit has one or more specific functions and usually operates in parallel to implement the algorithm that the ASIP is designed for.
We can define two main criteria to characterize processors: flexibility and performance. VN processors are very flexible, whereas ASIPs deliver the highest performance. Ideally, the goal is to have a machine that possesses the flexibility of general-purpose processors (GPPs) and the performance of ASIPs. This can be achieved with digital hardware that is able to adapt to the algorithm on the fly. Such a system is called reconfigurable hardware or a reconfigurable processing unit (RPU). This paper focuses on reconfigurable computing (RC) architectures and proposes a new methodology that takes advantage of both GPPs and ASIPs.
Shrinking transistor sizes (e.g., 2 nm transistors [3]) pave the way for fabricating devices that can host a huge number of reprogrammable cells. RC cells can easily be constructed by combining lookup tables (LUTs) whose inputs are driven by static memory elements (SRAMs). Currently, state-of-the-art reconfigurable systems use hardware description languages such as Verilog and VHDL to define a hardware architecture and use C or scripting languages such as Python to govern these hardware designs and reprogram FPGA devices (either fully or partially) on demand or per scenario. The proposed architecture uses a combination of C/C++ (as static compiler passes) and VHDL to generate digital circuits that perform as a hardware accelerator (HA). They can be implemented on the fly either by the FPGA partial programming technique or by writing into the SRAMs that drive the LUTs.
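The SRAM-driven LUT mechanism can be illustrated with a minimal C behavioral model: the truth table of a k-input Boolean function lives in 2^k SRAM bits, and the inputs merely select one bit of it. The type and function names here are illustrative, not from the paper's implementation.

```c
#include <stdint.h>

/* Behavioral model of a 4-input LUT: the 16-bit truth table is held in
   SRAM configuration cells, and the 4-bit input selects one bit of it.
   Rewriting the SRAM word reconfigures the implemented function. */
typedef struct { uint16_t sram; } lut4_t;   /* 16 configuration bits */

static inline int lut4_eval(const lut4_t *lut, int in /* 4-bit input */) {
    return (lut->sram >> (in & 0xF)) & 1;
}
```

For example, loading 0x8000 implements a 4-input AND (only input 1111 selects a 1 bit), while loading 0x6996 implements 4-input parity; the hardware is unchanged, only the SRAM contents differ.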
This paper has the following two major contributions: (1) An integrated reconfigurable software and hardware architecture based on the ARM Cortex-M0 and the LLVM compiler infrastructure that produces reconfigurable miniature accelerators (MAs). The term “MA” is defined as a dynamically generated tiny hardware module that runs in parallel with the main microprocessor. An MA produces the same result as the main microprocessor when activated at the right moment but finishes its task in fewer clock cycles. (2) An acceleration architecture based on MAs that can be turned on/off and does not break backward compatibility. It needs no user hardware expertise and requires no modification of the programming paradigm (e.g., mandatory C pragma statements). The proposed accelerator improves performance by 4.09% to 14.17%. The two minor contributions are as follows: (1) exploitation of dual-port memories (common in most FPGA platforms) to eliminate inherent processor stalls due to branch and data dependencies; (2) verification of the ARM Cortex-M0 implementation against the IAR Embedded Workbench simulator using a tracing technique.
2. Related Work
The concept of RC originated in the early 1960s, when Gerald Estrin proposed a computer made of a standard processor and an array of reconfigurable hardware [4, 5]. Programmable active memories (PAM) [6], Garp [7] (which integrates RC with a standard MIPS processor), NGEN [8], POLYP [9], and MereGen [10] (massively parallel reconfigurable computers based on hundreds of FPGAs coupled with SRAMs) are a few examples of early RC projects.
Dynamic instruction sets also date back to the 1960s [11]. For example, if the code calls a Hamming function frequently, the compiler would reconfigure part of the FPGA to perform the function faster and produce a specific instruction to call the associated hardware routine. The PRISM [12] project is an example of such a system.
There are published ideas that offer a road map to an evolving microprocessor, such as the work by Huebner et al. [13], but no tangible working system has been developed. XiRisc [14] is a Very Long Instruction Word (VLIW) processor with a reconfigurable instruction set that couples a processor with an RC module. The WASMII [15] project focuses on data-driven computation and tries to implement large circuits on FPGAs. The dynamic instruction set computer (DISC) [16] processor implements special instructions in its instruction set as independent circuit modules.
Hartenstein [17] argues against VN machines and proposes data-stream-driven computing instead of the instruction-stream-driven VN paradigm. Takano [18] proposes the Cache Architecture for Configurable Hardware Engine (CACHE), whose high design complexity introduces considerable configuration overhead, a major drawback; among the three benchmarked applications, only one shows a slight performance improvement.
CHIMAERA [19] introduces a reconfigurable FU (RFU) on a superscalar out-of-order processor with a GCC-based C compiler that maps groups of instructions to RFUs. The MOLEN [20] polymorphic processor places an RP next to a GPP (a PowerPC core), with claimed performance improvements of up to 300-fold. The ρ-VEX [21] is an extension of the MOLEN architecture and allows the instruction stream to be fed to either the GPP or the RP with eight pipelines (each with a pipeline of its own); a 15% improvement is reported.
There is a myriad of projects that aim to accelerate only one specific algorithm. Sudarshan et al. [22] use an FPGA for convolutional neural network (CNN) inference acceleration, and Ng, Liu, and So [23] couple an FPGA fabric with a RISC-V soft-core to improve matrix multiplication and FIR filtering. Microsoft's Catapult [24] uses FPGAs for accelerating large-scale data center services. Another cloud-scale acceleration architecture proposed by Microsoft [25] places FPGAs between network switches and servers to accelerate services. The CORDIC hardware acceleration [26] uses a DMA-based ISA extension that allows efficient implementation of new trigonometric ALU-based custom instructions.
Yu, Lemieux, and Eagleston [27] attempt to solve the manual design of accelerators by replacing the FPGA with a vector processing core that is coupled with a soft-core. Shaoyi’s PhD thesis [28] lays out the details of accelerator synthesis and integration process for CPU + FPGA systems. Manor and Greenberg [29] use synchronous data flow (SDF) to construct an LLVM graph and partition a program into hardware and software implementation where the hardware part is implemented using custom instruction wrappers alongside a scheduler.
Finally, modern heterogeneous computing combines CPUs, GPUs, and neural processing units (NPUs), either tied together on a motherboard or inside a system-on-chip (SoC). NPUs accelerate AI tasks and workloads by taking advantage of parallel processing, low-precision arithmetic, high-bandwidth memory, and HAs. A modern CPU/NPU coupling example is Erez and Shlomo's work [30], where a loosely coupled NPU connected to an MCU is proposed to accelerate machine learning (ML) inference on edge devices.
The covered related works reveal several shortcomings: (1) To implement an algorithm or optimize its performance, at least one part of the system must be converted to digital circuit hardware. This conversion requires hardware design expertise, which programmers usually lack. (2) Either custom ISAs (e.g., DISC or πISA) or academic-level ISAs (e.g., MIPS or RISC-V) are used. A few works use notable architectures such as PowerPC, but utilization of industry-level architectures such as Intel and ARM is missing. (3) Most works modify the microprocessor ISA, the compiler, the processor architecture, or the executable binary format, which breaks backward compatibility. (4) A high level of complexity increases the required development time and debugging effort and prevents adoption in nonexperimental computational systems where reliability and testability are essential.
To mitigate the mentioned disadvantages, an architecture based on MAs is proposed that can optimize all sequential algorithms implementable on a Turing equivalent machine. The work adopts the ARMv6-M architecture [31] used in the Cortex-M0 [32].
No ISA modification is applied, and no special instructions are added; therefore, the machine code (MC) that is produced by the proposed system can be executed on any traditional Cortex-M0 core in the absence of the HA, allowing backward compatibility.
3. Methodology
In this section, a general overview of the proposed architecture is laid out, and a brief description of all components involved in the implementation process is provided.
The proposed system consists of three major components:
1. A microprocessor: ARMv6-M Cortex-M0.
2. A compiler: LLVM Infrastructure [33].
3. Adaptive reconfigurable digital circuits: A partially reconfigurable FPGA chip is used as the hardware platform, and automatically generated VHDL code is used to reconfigure the FPGA hardware.
Figure 1 depicts an overview of the proposed system. The input to the system is an algorithm (e.g., FFT) implemented in the C programming language; the algorithm can be implemented in any language supported by an LLVM frontend. The next step is to convert the program to the LLVM intermediate representation (IR) language. The IR code undergoes several optimization stages that exist in the standard LLVM compilation process and then reaches the backend stage.
[figure(s) omitted; refer to PDF]
The LLVM MC framework is a submodule of the backend. It is modified to add new adaptive passes to the end of the MC framework pass pipeline. These new passes are responsible for generating both the machine code and the adaptive hardware components. Initially, the universal LLVM IR code is converted (lowered) into ARM machine code, and the instruction pairs of interest are identified. Based on the selected pairs' memory addresses, a series of VHDL modules is generated to produce the exact result of the paired instructions. In general, the LLVM backend produces two sets of files: (1) a standard ARMv6 ELF file; (2) three automatically generated VHDL modules that contain information about the relative memory addresses of the identified instruction pairs and the control signals required to change the standard behavior of the ARMv6 Cortex-M0 microprocessor.
The ELF file is uploaded onto an FPGA block RAM as the main program memory. The VHDL modules are converted to a bitstream file, sent to the FPGA chip, and tightly coupled with a VHDL ARMv6 Cortex-M0 soft-core. The coupling can be turned on or off via VHDL generic statements. If the adaptive VHDL modules are decoupled, the system performance is identical to a standard microprocessor; otherwise, the adaptive submodules and their control signals to the Cortex-M0 core are active and remove instruction pairs from the pipeline dynamically.
There is a performance improvement if the eliminated pairs are in kernel functions. The trade-off is the cost of executing a dynamic profiler and the extra time required to partially reconfigure the FPGA chip before program execution. The FPGA programming overhead can be eliminated if the whole system is implemented on an SoC. Kernel function selection is done via dynamic profiling that measures instruction execution counts. The IAR Embedded Workbench for ARM is temporarily used for dynamic profiling; the LLVM infrastructure offers the same result using basic block instrumentation [34] and Clang profile-guided optimization [35]. Currently, linking the dynamic profiling result to the HA generation is done manually, but selecting LLVM basic blocks and passing them to the accelerator generator code can easily be scripted.
First, a Cortex-M0 compatible core is developed from scratch. This step is essential, as any serious attempt to modify a microprocessor architecture requires extensive knowledge of the core. Next, the LLVM compiler infrastructure is chosen for analysis and machine code modification. After having a verified microprocessor and a reliable compiler, the adaptive parts are developed.
Initially, the C/C++ program is cross-compiled to ARMv6 machine code on an x86_64 host machine. The details of this crucial step are illustrated in Supporting A. The choice of LLVM allows us to reuse the frontend and common optimizer and to focus solely on the backend. Passes (C++ code that calls LLVM APIs) are added to the LLVM backend to generate the ARM ELF file alongside the adaptive hardware (in VHDL).
3.1. ARMv6-M Cortex-M0
Figure 2 shows the implemented Cortex-M0 3-stage pipeline [36]. The detailed VHDL implementation of the ARM Cortex-M0 is available in [37]. Only the ARM Cortex-M0 core and the AHB-Lite interface [38] are implemented to allow communication between the core and external memory. A detailed system block diagram of the core in the Xilinx Vivado design suite is included in Supporting B1, and its simplified version is in Supporting B2. The proposed RTL-level Cortex-M0 core is written in VHDL. It is cycle accurate, which means it precisely follows instruction cycle counts according to the specification stated in the Cortex-M0 technical reference manual [38]. The Cortex-M0 has a nonuniform number of clock cycles required per category of instructions, ranging from 1 up to
[figure(s) omitted; refer to PDF]
3.2. LLVM Compiler Infrastructure
3.2.1. Overview
The most popular design for a traditional static compiler is the three-phase design [40]. If a compiler uses a common code representation in its optimizer (such as LLVM IR representation), then a frontend can be written for any language that can compile to it, and a backend can be written for any target that can compile from it. To master the LLVM infrastructure, a backend is developed for a 16-bit processor [41] from scratch. It demonstrates how one can rapidly develop a compiler (only an LLVM backend needs to be developed) for a new processor.
3.2.2. LLVM Compilation to Target ARM Cortex-M0 Bare-Metal Architecture
Software that runs without an operating system is called bare-metal, and consequently, bare-metal programming is programming directly for a processor that lacks an operating system. To have full control over all aspects of software and hardware, the bare-metal programming approach is taken.
The LLVM infrastructure supports cross-compiling natively, but to target the Cortex-M0 on an x86_64 host machine, the LLVM source code must first be compiled for x86_64 and then used to cross-compile C/C++ programs to the ARM architecture. Although this step is important, to conserve space, an extensive and detailed guide is published in a public GitHub repository as well as in Supporting A.
After the steps listed in Supporting A, compiling C/C++ programs to generate Cortex-M0 machine code is possible. Initially, the goal is set to improve one of the most important numerical algorithms of our lifetime: the fast Fourier transform (FFT). Later, more algorithms are used for benchmarking, and finally, the CoreMark benchmark is used for performance evaluation. The C source code for the FFT algorithm is listed in Supporting C. It is passed to the LLVM frontend (the clang compiler); the generated IR code is passed to the optimization layer, and the optimized IR code is then passed to the backend to emit ARM machine code. The frontend and optimization components remain intact, but the backend is modified to add new passes that are responsible for generating the HA VHDL modules.
3.2.3. LLVM MC Framework
In the LLVM backend pipeline, instructions travel through several phases: LLVM IR ⟶ SelectionDAG ⟶ MachineDAG ⟶ MachineInstr ⟶ MCInst [42], where each phase is a C++ class. The LLVM superpasses are composed of several smaller passes internally; the superpasses are critical to the success of the backend, while the smaller ones are not. The IR is converted into a SelectionDAG (DAG stands for directed acyclic graph). Then, SelectionDAG legalization occurs, where illegal instructions are mapped onto the legal operations permitted by the target machine. After this stage, the SelectionDAG is converted to a MachineDAG, which is basically the instruction selection supported by the backend [42]. However, CPUs do not execute DAGs; they execute a linear sequence of instructions. LLVM's code generator schedules and then converts the MachineDAG to MachineBasicBlocks. If the execution path reaches a basic block, all the instructions inside the block will be executed. We can add a simple counter to the end (or beginning) of these basic blocks to support instrumentation-based dynamic profiling of programs.
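The per-basic-block counter idea can be sketched in a few lines of C. This is an illustrative model of instrumentation-based profiling, not the paper's actual pass; the names `bb_enter` and `hottest_block` are hypothetical.

```c
#include <stdint.h>

/* Minimal sketch of basic-block instrumentation: each basic block is
   assigned an index, and a counter is bumped every time control enters
   it. After a profiling run, the hottest block marks the kernel. */
enum { NUM_BLOCKS = 4 };
static uint64_t bb_count[NUM_BLOCKS];

static inline void bb_enter(int id) { bb_count[id]++; }

/* Return the index of the most frequently executed basic block. */
static int hottest_block(void) {
    int best = 0;
    for (int i = 1; i < NUM_BLOCKS; i++)
        if (bb_count[i] > bb_count[best]) best = i;
    return best;
}
```

A loop body would call `bb_enter` once per iteration, so after execution its counter dominates and identifies the acceleration candidate.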
The instructions in a MachineBasicBlock take the MachineInstr (MI) form, and the DAG can then be destroyed. The MI instructions are converted to MCInsts in the MC framework. The MCInst class defines a lightweight representation of instructions; compared with MIs, MCInsts carry less information about the program [43].
The main job of the MC framework is to receive MachineInstrs and convert them to MCInsts.
There are two options: emit assembly or binary instructions. Note that any modification here results in the generation of an incompatible ELF file. To retain compatibility, a compiler pass is added after the ARMELFStreamer class that generates the ELF file. The pass, named OpcodeAnalysis, analyzes the ELF file and produces the VHDL reconfigurable components without modifying the original ELF file structure. The source code of this pass is listed in Supporting D.
3.3. Adaptive ARM Cortex-M0 Soft-Core
3.3.1. Adaptive LLVM Backend Pass
3.3.1.1. Finding Most Frequent Instruction Pairs
The opcode analysis pass receives a sequence of machine code as input and analyzes it according to the following rules:
1. The code sequence portion that needs to be accelerated must be identified by dynamic profiling. The function name that wraps the target code sequence is also identified.
2. The mapping of the function in the ELF file must be determined.
3. The pair depth p must be set; it determines the number of instructions in an instruction pair, where p ≥ 2.
4. An instruction pair can be any sequence of consecutive instructions inside a basic block that has no data dependency based on either register contents or machine state.
5. An instruction pair must discontinue when it reaches a branch instruction (end of a basic block), regardless of whether the branch is taken.
6. Two types of instruction pairs are defined: (1) Type A: pairs that produce only a result and contain no branch instruction; (2) Type B: pairs that produce a result and end with a branch instruction (a branch happens whenever the program counter (PC) register is modified, so POP and PUSH instructions can be considered branch instructions if a new value is popped/pushed into the PC).
7. The pair must produce only one result (a 32-bit data value) and alter only one register.
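The independence and single-result rules above can be condensed into a small C predicate. This is a toy model under simplifying assumptions (each instruction writes at most one register and reads at most two), not the paper's LLVM pass; `insn_t` and `pair_ok` are hypothetical names.

```c
#include <stdbool.h>

/* Toy instruction model: the one register written (-1 if none), the
   registers read (-1 if unused), and whether it ends the basic block. */
typedef struct {
    int  dest;        /* register written, -1 if none */
    int  src[2];      /* registers read,   -1 if unused */
    bool is_branch;   /* branch: ends the basic block (rule 5) */
} insn_t;

/* A pair (a, b) is eligible when:
   - a is not a branch (rule 5: a branch may only end a pair),
   - b does not read the register a wrote (rule 4: no data dependency),
   - at most one register is altered overall (rule 7: single result). */
static bool pair_ok(const insn_t *a, const insn_t *b) {
    if (a->is_branch) return false;
    if (a->dest >= 0 && (b->src[0] == a->dest || b->src[1] == a->dest))
        return false;
    int writes = (a->dest >= 0) + (b->dest >= 0);
    return writes <= 1;
}
```

Under this model, a (MOV, BL)-style pair, where the first instruction writes a register and the second only branches, satisfies all three checks and classifies as Type B.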
3.3.1.2. MA Generation
The IAR Embedded Workbench IDE for ARM [44] is used to perform dynamic instruction count profiling. The detailed steps are included in Supporting E, which shows how the most frequently executed code blocks can be identified by a simulation run with the dynamic profiling option turned on. The most frequent pair is defined as the pair of instructions with the highest number of occurrences in a basic code block. Using a heuristic algorithm (depth-first search) with pair depth p = 2, the adaptive pass finds the (MOV, BL) pair as the most frequent pair (Figure 1.8 in Supporting E shows the pairs visually), which is a Type B instruction pair. The pass then extracts the memory address of each (MOV, BL) occurrence and generates the first reconfigurable VHDL module, called RC_PC_sensitivity, as shown in Figure 3. This is the first step in the MA generation process. This digital module is combinational logic (C.L.); it receives the PC as its input and generates the invoke_accelerator signal as output. The RC_PC_sensitivity module simply asserts the invoke_accelerator signal whenever the PC points to a (MOV, BL) pair.
[figure(s) omitted; refer to PDF]
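The behavior of the combinational RC_PC_sensitivity module can be modeled in C: the PC is compared against the compiler-extracted start addresses of the identified pairs, and a match asserts invoke_accelerator. The address list below is illustrative, not taken from the real FFT build.

```c
#include <stdbool.h>
#include <stdint.h>

/* C model of RC_PC_sensitivity: purely combinational, it asserts
   invoke_accelerator whenever the PC points at the first instruction
   of an identified (MOV, BL) pair. Addresses here are hypothetical. */
static const uint32_t pair_addr[] = { 0x88, 0xA4, 0xC0 };

static bool rc_pc_sensitivity(uint32_t pc) {
    for (unsigned i = 0; i < sizeof pair_addr / sizeof pair_addr[0]; i++)
        if (pc == pair_addr[i])
            return true;      /* invoke_accelerator = '1' */
    return false;             /* invoke_accelerator = '0' */
}
```

In the generated VHDL, this comparison becomes a parallel address decoder, so the lookup costs no clock cycles.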
The second step in the MA generation process is to create a reconfigurable VHDL module. It is generated by another LLVM pass and is named m0_PC_OP.
This digital module is also combinational logic; it receives the PC as its input and outputs the operands of both the MOV and BL instructions. With these two RC modules in place, when invoke_accelerator is high, the instruction located at the current value of the PC should not be fetched. Instead, for Type A instruction pairs, the instruction at PC + size_of(instruction pair), and for Type B instruction pairs, the instruction at the pair's branch target address must be fetched. This makes the microprocessor's fetch unit skip over the entire instruction pair in memory, as shown in Figure 4.
[figure(s) omitted; refer to PDF]
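The next-fetch-address selection can be sketched as a one-line multiplexer in C, assuming the Type A/Type B definitions above (a linear skip for pairs without a branch, a redirect to the branch target for pairs ending in one). The function name `next_fetch` is illustrative.

```c
#include <stdint.h>

/* Sketch of the fetch-address mux used when a pair is bypassed:
   Type A (no branch)        -> skip linearly past the pair,
   Type B (ends in a branch) -> redirect fetch to the branch target. */
typedef enum { TYPE_A, TYPE_B } pair_type_t;

static uint32_t next_fetch(uint32_t pc, uint32_t pair_size,
                           uint32_t branch_target, pair_type_t t) {
    return (t == TYPE_B) ? branch_target : pc + pair_size;
}
```

For a two-instruction Thumb pair at 0x88 (4 bytes), a Type A pair resumes fetching at 0x8C, while a Type B pair targeting 0x40 fetches 0x40 directly.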
Removing instruction pairs and skipping over them on the fly is the cornerstone of the proposed MA architecture. However, removing instructions without considering their effect on the processor/memory is illegal. Recall that for both Type A and Type B instruction pairs, a result is assumed. This result is the effect that executing the instructions normally would have on the machine, which can be a new value that must be saved in a register (instructions that do not modify registers but alter the machine state also fall in this category, e.g., compare instructions that may set/unset the zero-flag bit in the status register).
The number of instructions in a pair can be increased from 2 to n as long as the pair's result remains one 32-bit data value. The limitation is due to the infeasibility of performing two simultaneous writes on the register bank. Therefore, if an instruction pair is identified, the final effect of all the instructions in the pair must update only one register. The result generated by the instruction pair is stored in the accelerator_reg_data signal, and the number of the register that must be updated with this data is stored in the accelerator_reg_target signal, as shown in Figure 3.
3.4. Coupling of ARM Cortex-M0 Core With Generated MA Modules
A compiler generates sequences of instructions based on constructs that appear in the high-level programming language. These constructs are fixed and repeated across the whole program (e.g., if-else or loop constructs), and they generate fixed machine instruction sequences. For example, an if statement is always converted to the following sequence of instructions: (1) compare registers, (2) set status flags accordingly, and (3) branch (or not) to a target address based on the status flags. If a program is not handwritten in assembly language, it is guaranteed to contain repetitive instruction pair patterns generated by the compiler.
The efficient conversion of core kernel functions (e.g., Hamming, FFT, or matrix multiplication functions) to hardware requires the manual work of hardware designers. This fact prevents HAs from being used easily in general-purpose programming environments, partly due to the lack of undergraduate courses that teach hardware design with the rapid development paradigm in mind (an important factor in industry). In contrast, simple instruction pairs can be converted to hardware automatically, without the need for a digital circuit designer.
3.4.1. Retaining Backward Compatibility
One of the major obstacles to the adoption of academic research architectures in industry is their incompatibility with existing systems. The programs and libraries in use and their dependency on hardware architecture introduce a strong inertia against any modification, however positive, to either software or hardware. Therefore, if an architecture hopes to become mainstream, it must consider the backward compatibility factor.
To achieve backward compatibility, any modification to the Cortex-M0 core, which is implemented in VHDL, is prohibited (self-imposed). The module is a soft-core and could be changed at will, but doing so would break software backward compatibility.
The RC components are tightly coupled with the Cortex-M0 core, and a Boolean generic, USE_ACCELERATOR, is defined to allow the adaptive accelerator to be turned on or off. All changes that host the adaptive components are within the processor core and are hidden from the application layer; therefore, the operating system or bare-metal programs cannot detect whether the processor is running in normal or accelerated mode.
3.4.2. Acceleration by Bypassing Instruction Pair
When the invoke_accelerator signal is activated, the PC_value signal (the next value of the PC) deviates from the normal operation PC = PC + 2. The goal is to execute an instruction pair in parallel with either the next instruction after it or the instruction located at the branch target.
In [45], a technique is proposed that uses a dual-port block RAM (DP-BRAM) program memory to fetch two memory locations instead of one per clock cycle to eliminate pipeline stalls. Figure 5 shows the program memory BRAM that adopts this technique with its microprocessor interfacing. The second port is used to fetch either the branch target address or (PC + instruction pair size), as shown in Figure 3. The DP-BRAM allows fetching the correct next instruction from memory simultaneously with the current instruction when the invoke_accelerator signal is asserted. The extra fetched instruction is stored and used in the next clock cycle to avoid the pipeline flush penalty.
[figure(s) omitted; refer to PDF]
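The dual-port fetch can be modeled behaviorally in C: both ports read the same memory array in the same cycle, one for the normal fetch and one for the bypass fetch. Halfword addressing is simplified, and `bram_dual_read` is an illustrative name, not the actual interface.

```c
#include <stdint.h>

/* Behavioral model of the dual-port program BRAM: two independent read
   ports access the same memory in one clock cycle, so the pipeline can
   fetch the normal next instruction (port A) and the post-pair or
   branch-target instruction (port B) simultaneously. Addresses are
   byte addresses of 16-bit Thumb halfwords. */
enum { MEM_WORDS = 256 };
static uint16_t bram[MEM_WORDS];

static void bram_dual_read(uint32_t addr_a, uint32_t addr_b,
                           uint16_t *data_a, uint16_t *data_b) {
    *data_a = bram[(addr_a / 2) % MEM_WORDS];  /* port A: normal fetch */
    *data_b = bram[(addr_b / 2) % MEM_WORDS];  /* port B: bypass fetch */
}
```

Because both reads complete in the same cycle, asserting invoke_accelerator costs no extra fetch latency; the port B word is simply latched for use on the next clock edge.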
Figure 4 has three sets of waveforms. The top one (Typical) shows typical branch execution on a standard Cortex-M0 core. A branch instruction is located at memory address 0x8A, and the branch target is located at memory address 0x40. The PC_value signal updates the PC register on every rising edge of the clock. The AHB-Lite interface (HADDR signal) is updated by the PC. The memory read occurs on the rising edge of the clock. The value read from memory is placed on the hrdata_program_value signal and then placed on the hrdata_program signal with one clock cycle of delay. The reason is the registers placed between the memory read (fetch stage) and the decoding circuitry (decode stage) to form a 3-stage pipeline.
The Cortex-M0 prefetches two 16-bit instructions (placed on the 32-bit hrdata_program signal). The current_instruction signal holds the value of the current instruction, which is either the first 16 bits or the second (controlled by the PC[1] bit). Instead of placing a specific instruction, an iXX pattern is used; for example, i86 means an arbitrary instruction at memory location 0x86. The FETCH, DECODE, and EXECUTE signals in Figure 4 are virtual signals depicted to assist in tracking the pipeline stages. For example, f86 means the fetch of the instruction at location 0x86, d86 means the decode of the instruction at memory location 0x86, and e86 refers to the execution of the instruction at location 0x86.
In Figure 4, we start tracing the signals by looking at the EXECUTE signal. After the execution of e8A (a branch instruction), the PC_value and PC signals deviate from the normal operation PC = PC_value + 2 (0x8A, 0x8C, 0x8E, 0x90, 0x92, …) and are set to the branch target values (0x8A, 0x8C, 0x8E, 0x90, 0x40, …). The PC is set to the branch target address (e.g., 0x40), and PC_value is set to the target address plus two (e.g., 0x40 + 0x2 = 0x42). This deviation from PC increments of 2 is marked in blue, and red marks indicate the execution cycle where the deviation starts. If a branch from 0x8A to 0x40 must occur, the already fetched and decoded instructions in the pipeline must be discarded (instructions at locations 0x8C, 0x8E, …). This is done by invoking the disable_executor signal. Figure 4 shows that a pipeline flush due to a branch takes three cycles to complete, and it takes two cycles to execute the branch instruction, for a total of five cycles.
The invoke_accelerator signal is set high when the PC reaches the first instruction in an instruction pair. In Figure 4, the second set of waveforms (Acceleration Invocation Problem) shows an example of an instruction pair sitting at memory locations 0x88 and 0x8A. When the PC is 0x88, the invoke_accelerator signal initiates a deviation from the normal PC = PC_value + 2 operation and sets the PC to the branch target address (e.g., 0x40) and PC_value to the branch target address plus two (e.g., 0x40 + 0x2 = 0x42). The problem is that this deviation occurs one clock cycle too late; the Cortex-M0 has already placed the value 0x88 on the address bus, and the first instruction of the pair is already fetched. To accelerate, both instructions at 0x88 and 0x8A must be removed from the pipeline. Here, the instruction at 0x8A is successfully removed, but because 0x88 is already fetched, it goes through the pipeline and its removal fails (marked in red).
One solution is to assert the acceleration signal one clock cycle earlier by checking PC_value instead of PC for the value 0x88 and blocking the fetch into the pipeline. The problem with this approach is that the condition PC_value = 0x88 is not guaranteed to hold under all circumstances. For example, a branch instruction to 0x88 bypasses the PC_value signal and writes directly into the PC register. In the case of a branch (or a POP instruction into the PC), the PC_value signal is not incremented by 2 but is set directly to the branch or POP target value, resulting in an acceleration invocation failure. Meanwhile, a one-clock-cycle delay defeats the purpose of the proposed miniature acceleration, which relies on removing one or two instructions merely to save one or two clock cycles; therefore, a workaround must be found.
The third set of waveforms (acceleration invocation solution) in Figure 4 shows how the instruction pair at locations (0x88, 0x8A) can be removed from the pipeline without any penalty using a dual-port program memory. When the PC register is set to 0x88, the invoke_accelerator signal is asserted, which places the branch target address (e.g., 0x40) on the second port of the block RAM. It also sets the next PC to the target address plus two (e.g., 0x42) and PC_value to the target address plus four (0x44).
The instruction fetched from location 0x40 appears on the second data bus (HRDATAB). A multiplexer controlled by the invoke_accelerator signal then places this fetched instruction on hrdata_program_value instead of its standard source, the first port of the block RAM. The instruction pair (0x88, 0x8A) is thus removed from the pipeline with zero clock cycles of penalty (see the EXECUTE virtual signal, which shows the instruction execution sequence: e86, e40, e42, e44, …).
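The fetch redirection described above can be sketched behaviorally. The following Python model is illustrative only (it is not the paper's VHDL; the function name next_fetch and the port callables are hypothetical), showing how the second block-RAM port lets the pipeline skip the pair with no dead cycle:

```python
# Behavioral sketch of the dual-port fetch bypass: when the PC hits the
# first address of an accelerated pair, the fetch is redirected to the
# pair's branch target via the second block-RAM port, so the pair never
# enters the pipeline.
def next_fetch(pc, pair_start, pair_target, mem_port_a, mem_port_b):
    """Return (instruction, next_pc, next_pc_value) for one cycle.

    mem_port_a / mem_port_b are callables mapping address -> instruction,
    standing in for the two ports of the dual-port program memory.
    """
    if pc == pair_start:                  # invoke_accelerator asserted
        instr = mem_port_b(pair_target)   # fetch branch target on port B
        return instr, pair_target + 2, pair_target + 4
    instr = mem_port_a(pc)                # normal fetch on port A
    return instr, pc + 2, pc + 4
```

With the example addresses from Figure 4, next_fetch(0x88, 0x88, 0x40, …) returns the instruction at 0x40 together with next PC 0x42 and PC_value 0x44, matching the waveform sequence above.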
Successful removal of a type A pair saves 2 clock cycles, and a type B pair saves 5 (the 3-clock-cycle penalty to flush the pipeline is also eliminated). However, removing instructions without applying their effects is illegal, unless they are executed in parallel with the consecutive instructions.
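Using the per-pair savings stated above, the total cycles saved over a run is a weighted sum over how often each pair executes (an illustrative Python check; the counts in the example are hypothetical):

```python
# Clock-cycle accounting implied by the text: a type A pair saves 2
# cycles per execution, a type B pair saves 5 (2 for the branch plus
# the eliminated 3-cycle pipeline flush).
SAVINGS = {"A": 2, "B": 5}

def cycles_saved(pair_executions):
    """pair_executions: iterable of (pair_type, times_executed)."""
    return sum(SAVINGS[ptype] * count for ptype, count in pair_executions)

# e.g., a type A pair hit 10 times and a type B pair hit 4 times:
# 2*10 + 5*4 = 40 cycles saved
assert cycles_saved([("A", 10), ("B", 4)]) == 40
```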
3.4.3. Parallel Execution of Eliminated Instruction Pairs
The RC component in Figure 3 outputs two signals: (1) accelerator_reg_data and (2) accelerator_reg_target. Note that one of the constraints on mining the pairs is that a pair must produce only one 32-bit result and update only one register. The accelerator_reg_data signal stores the result, and accelerator_reg_target stores the target register number. Two simultaneous writes to the register bank are illegal; therefore, if two operations are performed in parallel, their results cannot be written back to the processor register bank simultaneously. To solve this problem, a delayed-write approach is proposed. Suppose two values are generated that must be written to registers 1 and 2. The processor first writes to register 1 and then advances one clock cycle; it then checks the WE signal of the register bank to see whether any operation requires a write to the register bank. If not, the register bank is free for writing, and register 2 (pointed to by the accelerator_reg_target value) is updated; otherwise, the processor holds the data back until a free time slot is available for writing.
Meanwhile, during the delayed-write period, as long as no time slot has become available, reads from register 2 are served from the accelerator_reg_data signal instead of the processor register bank.
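The delayed-write mechanism can be modeled as follows. This is a behavioral Python sketch, not the paper's implementation: the class name DelayedWrite and method names are illustrative, while the held-value semantics follow the signals described above (accelerator_reg_data waiting for a free WE slot, with reads forwarded from the pending value):

```python
# Behavioral model of the delayed-write scheme: the accelerator result
# waits until the register bank's write port is free, and reads of the
# target register are forwarded from the pending value in the meantime.
class DelayedWrite:
    def __init__(self):
        self.pending = None  # (target_reg, data) awaiting a free slot

    def produce(self, target_reg, data):
        # Accelerator result becomes available (accelerator_reg_data).
        self.pending = (target_reg, data)

    def cycle(self, core_we_busy, regbank):
        # Write back only when the core is not using the write port.
        if self.pending and not core_we_busy:
            reg, data = self.pending
            regbank[reg] = data
            self.pending = None

    def read(self, reg, regbank):
        # Forward the pending value so reads always see the newest data.
        if self.pending and self.pending[0] == reg:
            return self.pending[1]
        return regbank[reg]
```

For example, after produce(2, value), a read of register 2 returns the new value immediately, even though the register bank itself is updated only on the first cycle whose write port is free.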
Figure 6 shows the actual instruction sequence at memory locations 0x82 to 0x8A. The instruction pair (0x88, 0x8A) is removed from the instruction stream, and the result of the removed pair is available in parallel with the next instruction after the pair (on the accelerator_reg_data signal).
[figure(s) omitted; refer to PDF]
Assertion of the accelerator_reg_needs_update signal indicates that the accelerator result is available but has not yet been written to the register bank; the processor must wait for a free cycle to write it. When the accelerator_reg_invoke_WE_possible signal is high, a write to the register bank is permitted and the result of the MA (a 32-bit value) is written to the register bank. Based on the pair type, the accelerator hardware follows different states. If it is a type A pair, the core follows these states:
1. STATE_DEACTIVE: No data are waiting to be written to the register bank.
2. STATE_WRITE_DATA: Data are being written to the register bank.
3. STATE_WRITE_DONE: Write to the register bank is done.
If it is a type B pair (containing a branch instruction), the core follows these states:
1. STATE_DEACTIVE: No data are waiting to be written to the register bank.
2. STATE_WRITE_DATA: Data are being written to the register bank.
3. STATE_WRITE_LR: Data are being written to the LR register.
4. STATE_WRITE_DONE: Write to the register bank is done.
When a branch instruction is reached, the return address must be stored in the LR register, which is another write operation to the register bank. Therefore, besides the instruction pair result, LR must also be written to the register bank. Similarly, if the processor cannot find a free slot to update the LR register, it holds the updated value (in the accelerator_reg_LR signal) and uses it in place of the old value until a free slot is found.
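The type B sequence above can be captured as a small state machine. The sketch below is a simplified Python model (state names are taken from the text; the inputs pair_done and we_free are illustrative stand-ins for the pair-completion event and a free register-bank write slot):

```python
# Simplified model of the type B write-back state machine: the pair
# result is written first, then LR, each waiting for a free slot on
# the register bank's write port.
def next_state(state, pair_done, we_free):
    """Advance the write-back FSM by one clock cycle."""
    if state == "STATE_DEACTIVE":
        # A completed pair produces a result that must be written back.
        return "STATE_WRITE_DATA" if pair_done else state
    if state == "STATE_WRITE_DATA":
        # Wait for a free write slot, then move on to the LR write.
        return "STATE_WRITE_LR" if we_free else state
    if state == "STATE_WRITE_LR":
        # LR (return address) also needs its own free slot.
        return "STATE_WRITE_DONE" if we_free else state
    return "STATE_DEACTIVE"  # STATE_WRITE_DONE: back to idle
```

A type A pair would follow the same structure with the STATE_WRITE_LR step removed.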
The final issue is handling the situation where two consecutive instruction pairs occur with no free clock cycle between them. Two mechanisms deal with this scenario: (1) In the pair-selection LLVM pass, a simple check statically verifies the distance between pairs; there must be at least one instruction between pairs that does not write to the register bank. (2) Because the dynamic (run-time) behavior of a program cannot be predicted, a pair may still branch directly to the beginning of another pair. Therefore, a simple extra state is added to the processor: if two consecutive invoke_accelerator assertions are detected, a one-clock-cycle delay is inserted to let the previous pair's result be written to the register bank.
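The static check in mechanism (1) amounts to scanning the instructions between two candidate pairs for one that leaves the register-bank write port free. A minimal sketch, assuming 2-byte Thumb instructions and a hypothetical writes_regbank lookup (the function name and arguments are illustrative, not the pass's actual API):

```python
# Static spacing check from the pair-selection pass: accept two pairs
# only if at least one instruction between them does not write to the
# register bank, leaving a free write slot for the delayed write.
def pairs_far_enough(pair_a_end, pair_b_start, writes_regbank):
    """writes_regbank maps address -> True if the (2-byte) instruction
    at that address writes the register bank."""
    between = range(pair_a_end + 2, pair_b_start, 2)
    return any(not writes_regbank[addr] for addr in between)
```

Back-to-back pairs (an empty range between them) are rejected by this check, which is exactly the case that mechanism (2) must still handle at run time.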
3.5. Verification and Benchmarking
3.5.1. Verification
To verify the VHDL code of the Cortex-M0 core and the accelerator module, a C program is compiled to machine code and passed, in ELF format, to the IAR Embedded Workbench for ARM simulator. The output of IAR Workbench is compared against that of the Vivado ISIM simulator. The details of the verification process are included in Supporting F.
3.5.2. FFT Benchmark
Compiling the fft() function yields 207 instructions. The LLVM compiler pass identifies 13 pairs of (MOV, BL) instructions when the pair size (p) is set to 2. Turning on the accelerator reduces the 207 instructions to 207 − (13 × 2) = 181 instructions, a 12.56% decrease in instruction count. Dynamic profiling of the function results in total execution of
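The instruction-count arithmetic above can be checked directly:

```python
# Static instruction-count reduction for fft() with pair size p = 2.
total_instructions = 207
pairs_found = 13
pair_size = 2

removed = pairs_found * pair_size            # 26 instructions removed
remaining = total_instructions - removed     # 181 instructions remain
reduction_pct = 100 * removed / total_instructions

assert remaining == 181
assert round(reduction_pct, 2) == 12.56
```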
3.5.3. Additional Kernels Benchmark
Six algorithms besides FFT are used in the performance evaluation: MD5, SHA-1, FIR, IIR, matrix multiplication, and quicksort. Figure 7 shows the improvement percentage achieved when the pair size is set to two instructions.
[figure(s) omitted; refer to PDF]
Figure 8 compares the proposed accelerator, implemented on a Xilinx ZCU104 development board, against the mainstream STM32F098VC microcontroller, which has an ARM Cortex-M0 with a maximum clock frequency of 48 MHz. To prevent compiler performance variations, the same code compiled by LLVM clang is executed on the STM32F098VC by creating an "externally built executable" project in IAR Embedded Workbench for ARM v8.40.1.21539.
[figure(s) omitted; refer to PDF]
3.5.4. Power and Resource Utilization
Table 1 shows the FPGA resource utilization of the proposed Cortex-M0 on the Xilinx ZCU104 FPGA development board versus the same core with the MA on. Table 2 lists various power figures for both systems. The dynamic and I/O power increase more than threefold due to the added logic and dual-port memory.
Table 1
System resource utilization of Cortex-M0 versus Cortex-M0 with miniature accelerator on.
| FPGA resource | Cortex-M0 | Cortex-M0 with MA | Percentage increase after adding MA (%) |
| CLB LUTs | 3385 | 3612 | 6.71 |
| CLB | 804 | 848 | 6.47 |
| Carry 8 | 82 | 94 | 14.63 |
| F7 muxes | 43 | 43 | 0 |
| F8 muxes | 0 | 1 | 100 |
| CLB | 655 | 860 | 31.23 |
| LUT as logic | 3385 | 3556 | 5.05 |
Table 2
Power consumption of Cortex-M0 versus Cortex-M0 with miniature accelerator on.
| Powera | Cortex-M0 (W) | Cortex-M0 with MA (W) | Percentage increase after adding MA (%) |
| Total on-chip | 0.639 | 0.762 | 19.25 |
| Dynamic | 0.047 | 0.169 | 259.57 |
| Static | 0.592 | 0.593 | 0.17 |
| Clocks | 0.006 | 0.008 | 33.33 |
| Signals | 0.009 | 0.038 | 322.22 |
| Logic | 0.008 | 0.025 | 212.5 |
| DSP | 0.001 | 0.002 | 100 |
| I/O | 0.023 | 0.095 | 313.04 |
aJunction temperature is 25.7°C.
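The percentage-increase column of Table 2 can be recomputed from the reported wattages:

```python
# Percentage increase of power consumption after adding the MA,
# using the values reported in Table 2.
def pct_increase(before, after):
    return round(100 * (after - before) / before, 2)

assert pct_increase(0.639, 0.762) == 19.25    # total on-chip
assert pct_increase(0.047, 0.169) == 259.57   # dynamic
assert pct_increase(0.023, 0.095) == 313.04   # I/O
```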
3.6. MA Performance Comparison With the State-of-the-Art
The EEMBC CoreMark [47] benchmark is used with compiler parameters set to the LLVM equivalent of the following IAR compiler flags: "--endian=little --cpu=Cortex-M0 --fpu=none -Ohs --use_c++_inline --no_size_constraints", and the result is shown in Table 3. The benchmark suite covers over 160 kernels used in their top 19 benchmarks [48].
Table 3
Standard ARM Cortex-M0 versus Cortex-M0 equipped with miniature accelerator (MA) CoreMark.
| Processor | Cortex-M0 (48 MHz) | Cortex-M0 with MA (48 MHz) | Percentage improvement (%) |
| CoreMark | 112.094 | 125.579 | 12.03 |
| CoreMark/MHz | 4.6706 | 5.233 | 12.03 |
Table 4 presents a quantitative comparison of acceleration-based architectures versus standard ones on specific algorithms such as FFT, FIR, MM, KM, and SE. Table 5 provides a qualitative comparison of recently published architectures. The backward compatibility column indicates whether an architecture remains backward compatible with its standard core after acceleration hardware is added. Standard cores such as the Cortex-M0 or ORCA RISC-V are always backward compatible because they do not modify their standard architecture, whereas cores coupled with custom acceleration hardware may lose compatibility. The HA design column in Table 5 indicates whether the HA is designed manually or via a semi- or fully automated process.
Table 4
Acceleration-based architectures vs. standard cores—quantitative comparison.
| Processor | FFT1 (s) | FIR2 (s) | MM3 (s) | KM4 (s) | SE5 (s) | LUT6 |
| Cortex-M0 (48 MHz) | 0.336 | 0.277 | 0.216 | N/A | N/A | 3448 |
| Cortex-M0 with MA (48 MHz) | 0.296 | 0.225 | 0.181 | N/A | N/A | 4031 |
| Ho-Cheung et al.’s soft processor (148 MHz) [23] | N/A | 2.37 | 13.29 | 0.22 | 2.62 | 1279 |
| ORCA RISC-V core (200 MHz) [23] | N/A | 2.53 | 14.39 | 0.23 | 2.84 | 1438 |
| Shaoyi PhD thesis [28] | N/A | N/A | 0.33–0.65 | N/A | 0.28–0.82 | 9441–43,500 |
1Fast Fourier transform.
2Finite impulse response.
3Matrix multiplication.
4K-mean clustering algorithm.
5Sobel edge detector.
6Lookup table count.
Table 5
Acceleration-based architectures vs. standard cores—qualitative comparison.
| Processor | Backward compatibility | HA design |
| Cortex-M0 | Yes | No HA |
| Cortex-M0 with MA | Yes | Automated |
| Ho-Cheung et al.’s soft processor [23] | No | Manual |
| ORCA RISC-V core [23] | Yes | No HA |
| Microsoft catapult [24] | No | Manual |
| Nios II [27] | Yes | No HA |
| Nios II with vector processor [27] | Yes | Semiautomated |
| Shaoyi PhD thesis [28] | No | Semiautomated |
4. Conclusions
In this paper, a reconfigurable MA architecture based on the Cortex-M0 processor is proposed. The design uses the LLVM backend to analyze machine code, detect the most frequent instruction pairs, and replace each with a VHDL digital-circuit module that operates in parallel with subsequent instructions. A core performance improvement from 4.09% to 14.17% is achieved. This work postpones power-consumption and maximum-clock-frequency optimization, as it aims solely to demonstrate the feasibility of the proposed architecture rather than to build its most optimized version. The proposed architecture retains software backward compatibility and demonstrates processor performance improvement by replacing a series of instructions with reconfigurable logic. The performance improvement is traded off against increased die area and a roughly threefold increase in dynamic power. Future work could optimize the proposed system by shrinking the logic circuit area.
Funding
This research was supported financially by Assumption University of Thailand.
[1] I. Hodkinson, Computability, Algorithms, and Complexity-Course 240, 1991. https://www.doc.ic.ac.uk/%7Eimh/teaching/Turing_machines/240.pdf
[2] C. Bobda, Introduction to Reconfigurable Computing Architectures, Algorithms, and Applications, 2007.
[3] "Introducing the World's First 2 Nm Node Chip," . https://research.ibm.com/blog/2-nm-chip
[4] G. Estrin, "Reconfigurable Computer Origins: The UCLA Fixed-Plus-Variable (F+V) Structure Computer," IEEE Annals of the History of Computing, vol. 24 no. 4,DOI: 10.1109/MAHC.2002.1114865, 2002.
[5] G. Estrin, "Organization of Computer Systems: The Fixed Plus Variable Structure Computer," Western Joint IRE-AIEE-ACM Computer Conference, pp. 33-40, 1960.
[6] P. Bertin, D. Roncin, J. Vuillemin, Introduction to Programmable Active Memories, 1990.
[7] J. R. Hauser, J. Wawrzynek, "Garp: A MIPS Processor With a Reconfigurable Coprocessor," Proceedings. The 5th Annual IEEE Symposium on Field-Programmable Custom Computing Machines Cat, pp. 12-21, DOI: 10.1109/FPGA.1997.624600, 1997.
[8] J. S. McCaskill, T. Maeke, U. Gemm, L. Schulte, U. Tangen, "NGEN: A Massively Parallel Reconfigurable Computer for Biological Simulation: Towards a Self-Organizing Computer," ICES 1996: Evolvable Systems: From Biology to Hardware, vol. 1259, pp. 260-276, 1996.
[9] U. Tangen, J. S. McCaskill, "Hardware Evolution With a Massively Parallel Dynamically Reconfigurable Computer: POLYP," ICES 1998: Evolvable Systems: From Biology to Hardware, vol. 1259, pp. 364-371, 1998.
[10] U. Tangen, Th. Maeke, J. S. McCaskill, "Advanced Simulation in the Configurable Massively Parallel Hardware MereGen," Coupling of Biological and Electronic Systems: Proceedings of the 2nd Caesarium, vol. 1259, pp. 107-118, 2000.
[11] Rauscher and Agrawala, "Dynamic Problem-Oriented Redefinition of Computer Architecture via Microprogramming," IEEE Transactions on Computers, vol. 27 no. 11, pp. 1006-1014, DOI: 10.1109/TC.1978.1674990, 1978.
[12] P. M. Athanas, H. F. Silverman, "Processor Reconfiguration Through Instruction-Set Metamorphosis," Computer, vol. 26 no. 3, pp. 11-18, DOI: 10.1109/2.204677, 1993.
[13] M. Huebner, D. Goehringer, C. Tradowsky, J. Henkel, J. Becker, "Adaptive Processor Architecture—Invited Paper," 2012 International Conference on Embedded Computer Systems (SAMOS), pp. 244-251, DOI: 10.1109/SAMOS.2012.6404181, 2012.
[14] A. Lodi, M. Toma, F. Campi, A. Cappelli, R. Canegallo, R. Guerrieri, "A VLIW Processor With Reconfigurable Instruction Set for Embedded Applications," IEEE Journal of Solid-State Circuits, vol. 38 no. 11, pp. 1876-1886, 2003.
[15] X. Ling, H. Amano, "WASMII: A Data Driven Computer on a Virtual Hardware," Proceedings IEEE Workshop on FPGAs for Custom Computing Machines, pp. 33-42, DOI: 10.1109/FPGA.1993.279481, 1993.
[16] M. J. Wirthlin, B. L. Hutchings, "A Dynamic Instruction Set Computer," Proceedings IEEE Symposium on FPGAs for Custom Computing Machines, pp. 99-107, DOI: 10.1109/FPGA.1995.477415, 1995.
[17] R. Hartenstein, "Morphware and Configware," Handbook of Nature-Inspired and Innovative Computing, pp. 343-386, DOI: 10.1007/0-387-27705-6_11, 2006.
[18] S. Takano, "Adaptive Processor: A Dynamically Reconfiguration Technology for Stream Processing," FPL 2003: Field Programmable Logic and Application, pp. 952-955, DOI: 10.1007/978-3-540-45234-8_93, 2003.
[19] Z. A. Ye, A. Moshovos, S. Hauck, P. Banerjee, "CHIMAERA: A High-Performance Architecture With a Tightly-Coupled Reconfigurable Functional Unit," Proceedings of 27th International Symposium on Computer Architecture (IEEE Cat. No.RS00201), pp. 225-235, DOI: 10.1109/ISCA.2000.854393, 2000.
[20] S. Vassiliadis, S. Wong, G. Gaydadjiev, K. Bertels, G. Kuzmanov, E. M. Panainte, "The MOLEN Polymorphic Processor," IEEE Transactions on Computers, vol. 53 no. 11, pp. 1363-1375, DOI: 10.1109/TC.2004.104, 2004.
[21] J. Hoozemans, J. van Straten, S. Wong, "Using a Polymorphic VLIW Processor to Improve Schedulability and Performance for Mixed-Criticality Systems," 2017 IEEE 23rd International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA),DOI: 10.1109/RTCSA.2017.8046315, 2017.
[22] S. Srinivasan, P. Janedula, S. Dhoble, "High Performance Scalable FPGA Accelerator for Deep Neural Networks," arXiv, 2019.
[23] H. Ng, C. Liu, H. K. So, "A Soft Processor Overlay With Tightly-Coupled FPGA Accelerator," arXiv, 2016.
[24] A. Putnam, "A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services," IEEE Micro, vol. 35 no. 3, pp. 10-22, 2015.
[25] A. M. Caulfield, "A Cloud-Scale Acceleration Architecture," 2016 49th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO),DOI: 10.1109/MICRO.2016.7783710, 2016.
[26] E. Manor, A. Ben-David, S. Greenberg, "CORDIC Hardware Acceleration Using DMA-Based ISA Extension," Journal of Low Power Electronics and Applications, vol. 12, DOI: 10.3390/jlpea12010004, 2022.
[27] J. Yu, G. Lemieux, C. Eagleston, "Vector Processing as a Soft-Core CPU Accelerator," Proceedings of the 16th International ACM/SIGDA Symposium on Field Programmable Gate Arrays (FPGA '08), pp. 222-232, DOI: 10.1145/1344671.1344704, 2008.
[28] S. Cheng, Accelerator Synthesis and Integration for CPU+FPGA Systems, 2016. PhD Thesis
[29] E. Manor, S. Greenberg, "Using HW/SW Codesign for Deep Neural Network Hardware Accelerator Targeting Low-Resources Embedded Processors," IEEE Access, vol. 10, pp. 22274-22287, DOI: 10.1109/ACCESS.2022.3153119, 2022.
[30] E. Manor, S. Greenberg, "Custom Hardware Inference Accelerator for TensorFlow Lite for Microcontrollers," IEEE Access, vol. 10, pp. 73484-73493, DOI: 10.1109/ACCESS.2022.3189776, 2022.
[31] Arm, ARMv6-M Architecture Reference Manual-ARM DDI 0419E (ID070218), 2007. https://developer.arm.com/documentation/ddi0419/latest/
[32] Arm, "Cortex-M0 Devices Generic User Guide-ARM DUI 0497A (ID112109)," 2009. https://developer.arm.com/documentation/dui0497/latest/
[33] "The LLVM Compiler Infrastructure Project," . https://llvm.org/
[34] A. Ramachandran, "Profiling Code With LLVM-The Woes of Non-Determinism," . https://medium.com/delta-force/profiling-code-with-llvm-f32c5292750a
[35] "Clang Compiler User’s Manual," . https://clang.llvm.org/docs/UsersManual.html#profile-guided-optimization
[36] Arm, "ARM Cortex-M Programming Guide to Memory Barrier Instructions Application Note 321," 2012. https://developer.arm.com/documentation/dai0321/latest/
[37] E. Ali, W. Pora, "VHDL Implementation of ARM Cortex-M0 Laboratory for Graduate Engineering Students," 2020 5th International STEM Education Conference (iSTEM-Ed), pp. 69-72, DOI: 10.1109/iSTEM-Ed50324.2020.9332721, 2020.
[38] Arm, Cortex-M0 Revision: R0p0 Technical Reference Manual-ARM DDI 0432C (ID113009), 2009. https://developer.arm.com/documentation/ddi0432/latest/
[39] Arm, "Cortex-M0+ Technical Reference Manual R0p1-Instruction Set Summary," https://developer.arm.com/documentation/ddi0484/c/CHDCICDF
[40] C. Lattner, "LLVM," in A. Brown, G. Wilson (Eds.), The Architecture of Open Source Applications: Elegance, Evolution, and a Few Fearless Hacks, ch. 11, pp. 151-157, 2011.
[41] E. Ali, W. Pora, "A Guideline for Rapid Development of Assembler to Target Tailor-Made Microprocessors," 15th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology (ECTI-CON), pp. 596-599, DOI: 10.1109/ECTICon.2018.8619960, 2018.
[42] M. Pandey, S. Sarda, "Writing an LLVM Backend," LLVM Cookbook, pp. 207-208, 2014.
[43] B. C. Lopes, R. Auler, The Backend, Getting Started with LLVM Core Libraries, 2014.
[44] "IAR Embedded Workbench for ARM," . https://www.iar.com/products/architectures/arm/iar-embedded-workbench-for-arm/
[45] E. Ali, W. Pora, "A Deterministic Branch Prediction Technique for a Real-Time Embedded Processor Based on PicoBlaze Architecture," Electronics, vol. 11, 2022.
[46] J. L. Hennessy, D. A. Patterson, Computer Architecture: A Quantitative Approach, 2019.
[47] "CoreMark® an EEMBC Benchmark," . https://www.eembc.org/coremark/
[48] Kernels, "Workloads and Benchmarks," , . https://www.eembc.org/products/kernels.php
Copyright © 2025 Ehsan Ali. Journal of Electrical and Computer Engineering published by John Wiley & Sons Ltd. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/