Computer vision algorithms, specifically convolutional neural networks (CNNs) and feature extraction algorithms, have become increasingly pervasive in many vision tasks. As algorithm complexity grows, so do computational and memory requirements, which poses a challenge to embedded vision systems with limited resources. Heterogeneous architectures have recently gained momentum as a new path towards energy efficiency and faster computation, as they allow the effective utilisation of various processing units, such as the Central Processing Unit (CPU), Graphics Processing Unit (GPU), and Field Programmable Gate Array (FPGA), tightly integrated into a single platform to enhance system performance. However, partitioning algorithms across the accelerators requires careful consideration of hardware limitations and scheduling. We propose two heterogeneous systems, one low-power and one high-power, and a method for partitioning CNNs and a feature extraction algorithm (SIFT) onto the hardware. We benchmark feature detection and image classification algorithms on the heterogeneous systems and their discrete accelerator counterparts, and demonstrate that both systems outperform FPGA/GPU-only accelerators. Experimental results show that for the SIFT algorithm there is an 18% runtime improvement over the GPU. In the case of the MobileNetV2 and ResNet18 networks, the high-power system achieves 17.75%/5.55% runtime and 6.25%/2.08% energy improvements, respectively, against their discrete counterparts. The low-power system achieves 6.32%/16.21% runtime and 7.32%/3.27% energy savings. The results show that effective partitioning and scheduling of imaging algorithms on heterogeneous systems is a step towards better efficiency than traditional FPGA/GPU-only accelerators.
Introduction
Integration of edge-based deep learning (DL) processing within vision systems has become prevalent throughout many resource constrained application areas. Traditionally, computer vision algorithms were implemented on homogeneous architectures, typically CPUs or GPUs. These architectures offer a uniform processing landscape where all cores/processing units share the same instruction set and capabilities. While effective, the inherent limitations of homogeneous designs become bottlenecks for complex tasks. These tasks often have diverse compute requirements, with some operations well-suited to one architecture over the other.
To address the hardware limitation, heterogeneous architectures are emerging as a new paradigm of computation. These systems combine different types of processors (CPUs, GPUs, FPGAs and specialised AI accelerators), allowing algorithms to offload certain computational stages to the most suitable processing element. GPUs are highly efficient in handling pixel processing streams due to their architecture’s high throughput and parallel processing capabilities. With thousands of cores working simultaneously, GPUs can rapidly execute pixel-level computations across a variety of tasks. FPGAs feature a reconfigurable architecture composed of configurable logic blocks interconnected by programmable routing resources. This design allows for tailored designs optimised for specific imaging algorithms. However, current developments in targeting heterogeneous environments are still primitive and require careful consideration of algorithmic partitioning [1], communication latency [2], and scheduling [3, 4].
Convolutional Neural Networks (CNN) and feature extraction algorithms are widely used in various problem domains, such as object detection [5], image classification [6], and segmentation [7]. Typically, image processing algorithms are designed and implemented on GPUs, which provide thousands of compute cores coupled with high-bandwidth memory. GPUs enable the efficient execution of single instruction, multiple data operations, making them ideal for processing vast amounts of data in parallel. However, executing algorithms on GPUs comes at the cost of power, size and latency [8].
As CNNs continue to grow in model size, which in turn requires significant memory resources to store their weights [9, 10], implementing them on low-resource and energy-constrained platforms is limited. Despite this, leveraging the advantages of true heterogeneous computing [11] allows run-time and power efficient designs to be realised by exploiting architectures with sufficient resources and processing capacity.
In this paper, we develop heterogeneous hardware and adapt imaging algorithms on each platform. We start with an extensive analysis of popular feature extraction algorithms such as SIFT [12] and two CNN architectures, ResNet18 [13] and MobilenetV2 [14], and subsequently evaluate their performance and suitability on the proposed heterogeneous hardware platform. The feasibility of implementing these algorithms and networks onto heterogeneous systems is investigated by identifying the optimal stage in each network/algorithm to be mapped onto a specific accelerator. A comprehensive benchmarking analysis of the CNNs and SIFT is conducted by performing image classification and feature extraction on various platforms to partition the layers or stages that exhibit the highest energy consumption, inference, and total runtime. Two new heterogeneous platforms are constructed, one comprising high-performance accelerators and the other an embedded system with power-optimised processors. The algorithms and networks are implemented and evaluated on both platforms using a fine-grained partitioning strategy. Heterogeneous results are compared to the homogeneous accelerator counterparts to determine the best-performing architecture.
The main contributions of this paper are as follows:
Development of a heterogeneous scheduling algorithm that maps CNN layers to the most suitable accelerator.
The development of heterogeneous platforms in two configurations: a high-performance system and a power-optimised embedded system.
An efficient deployment strategy is proposed for CNN and SIFT algorithms, enabling improved computational runtime while minimising energy consumption.
Partitioning methods on heterogeneous architectures are introduced by studying the features of CNNs and stages of SIFT to identify characteristics used to determine a suitable accelerator.
Benchmarking and evaluating runtime, energy, and inference metrics of popular convolutional neural networks and SIFT on various processing architectures and heterogeneous systems.
Related work
Heterogeneous computing. Heterogeneous architectures have drawn increasing attention in recent years [15–25], as they have become an alternative path to overcoming the performance wall of homogeneous processors. This growing demand has been driven by new commercial designs that increasingly integrate multiple accelerators onto a single chip [26]. Previous studies [27–31] show that no single accelerator suits all algorithms by comparing the energy efficiency of architectures for image processing tasks: on the one hand, GPUs consume less energy per frame than CPUs and FPGAs; however, for more complex kernels and complete vision pipelines, FPGAs outperform the other accelerators. In the case of CNNs, FPGAs compute inference more efficiently than GPUs with respect to energy and time [32]. The work in [20] studied the challenges and trade-offs in heterogeneous (FPGA-GPU) systems. Significant improvements were observed from executing matrix-vector multiplications, which are typical building-block operations within the imaging domain.
Embedded vision. Deep-learning algorithms deployed on embedded hardware have been driving the adoption of real-time processing at the edge [33–36]. Embedded computing systems are designed to process data in real time directly on the device, improving efficiency and user experience [37]. Keeping compute local also enhances security and privacy, since data is not transmitted to remote servers. However, despite these advantages, running deep neural networks (DNNs) on embedded systems presents several challenges. One such challenge is the growing complexity of modern DNNs, which now have hundreds of layers to achieve high accuracy. This leads to significant computational demands (e.g., ResNet50 requires over 4 billion floating-point operations (FLOPs) and contains 25 million parameters [38]). Various techniques have been explored, such as model compression [39, 40], which makes deep learning models more efficient without significantly reducing accuracy. Strategies like pruning [41, 42], quantization [43], and knowledge distillation [44] help reduce the size and processing requirements of neural networks, making it possible to run powerful models on resource-constrained embedded devices.
Research into implementing CNNs on heterogeneous embedded processors has focused on layer-based partitioning strategies to enhance performance and energy efficiency. The work in [45] proposed an approach that partitions the fully connected layer onto the FPGA and the convolutional layers onto a GPU, with the two devices communicating through a Universal Asynchronous Receiver/Transmitter (UART) serial connection. Although the network was relatively small with only a few layers, the majority of the computation was offloaded to the GPU, which still resulted in performance improvements. The work in [46] builds upon the previous study by exploring the Direct Hardware Mapping (DHM) approach for three CNNs. The modules of each network were evaluated, and it was shown that CNNs partitioned at layer level benefit from FPGA-GPU platforms. In all of these heterogeneous implementations, CNNs were either implemented partially (e.g., only the convolutional layers) or did not pass data efficiently between accelerators, limiting true heterogeneity.
Scheduling. Scheduling algorithms determine how work is ordered and distributed across available resources; mapping operations to different accelerators in a heterogeneous environment introduces additional complexities such as device-specific constraints and data transfers. The works [47–53] derived insights into scheduling that exploits collaborative execution between accelerators, resulting in reduced execution time. However, effective scheduling requires careful data or dynamic partitioning to minimise idle time. Task partitioning allows higher kernel duplication on FPGAs, but the benefits vary by algorithm and can be limited by memory bandwidth. Deploying First-Come First-Served (FCFS) scheduling in edge environments offers minimal computational overhead and efficient task management, making it well suited to resource-constrained devices. In addition, by processing tasks in arrival order, FCFS ensures predictable execution with low complexity, benefiting real-time applications. However, its lack of prioritisation can cause job starvation under uneven workloads or urgent task requirements [54, 55].
Data latency. Previous works have explored two types of heterogeneous architectures: those featuring processors integrated into a single chip die with a shared memory subsystem and those composed of multiple discrete processing units communicating over high-speed interconnects. In the former, tightly coupled architectures such as system-on-chip (SoC) designs leverage shared memory and cache-coherent interconnects to enable efficient low-latency communication between processing elements, often employing protocols like ARM’s Advanced Microcontroller Bus Architecture (AMBA) [56]. The AMBA protocol, particularly AXI (Advanced eXtensible Interface), provides high-throughput, low-latency memory-mapped communication between CPU cores, GPUs, DSPs, and dedicated accelerators within an SoC.
In contrast, heterogeneous systems composed of discrete processing units rely on high-speed interconnects such as PCIe [57] for data transfer. These architectures often incorporate memory coherence protocols and data-sharing frameworks. However, latency poses a challenge since the system has to use host memory as an intermediary to transfer data from one accelerator to another, resulting in bandwidth and latency inefficiency. Early work in FPGA-GPU communication by [58, 59] leverages existing direct memory access (DMA) engines on GPUs to execute DMA operations to FPGAs, resulting in better throughput depending on the transfer size. To address existing gaps, we propose to develop heterogeneous computer vision systems leveraging the power of all three architectures: CPU, GPU and FPGA.
Despite advancements in heterogeneous computing for vision tasks, existing works exhibit several limitations. Notably, prior studies often use simple partitioning methods, which do not intelligently allocate CNN subgraphs to the most suitable hardware based on their computational characteristics. In addition, many works do not fully leverage all available hardware components, with some accelerators underutilised or omitted entirely from the computation pipeline. Therefore, simple scheduling approaches fail to exploit the potential performance and efficiency gains achievable by carefully considering subgraph-specific metrics like computational complexity, data locality, and hardware affinity.
Design and development
Hardware design consideration
In realising the full potential of heterogeneous architectures, two systems are developed, targeting two power domains: high-power and low-power, as shown in Fig. 1.
[See PDF for image]
Fig. 1
a Heterogeneous high-power system, b low-power embedded system
Low-power system: The constructed system consists of a custom carrier board equipped with several key components, including an Artix-7 (XC7A200T) FPGA, a Jetson Xavier NX, and an ARM CPU. To provide additional storage space, the Linux image is flashed onto an SD card rather than the 16 GB eMMC. Communication between the FPGA and the Xavier NX is achieved through a PCIe Gen2 x4 interface, connected via an M.2 Key-M connector.
High-power system: The system consists of a CPU (AMD 5900X), GPU (RTX 3070) and FPGA (Xilinx ZCU106), integrated into a desktop with 32 GB of 3200 MHz DDR4 memory. Both accelerators are interfaced via PCIe Gen3, and communication with the host CPU uses direct memory access (DMA), allowing the movement of data between host memory and the subsystems. The GPU and FPGA drivers are used to program the DMA engine and the DMA/Bridge Subsystem IP. Idle CPU/GPU clocks are frequency-scaled down to reduce power consumption.
Heterogeneous scheduler
Benchmarking popular scheduling strategies on a simulated 10-layer CNN (Table 1) identified the most efficient approach. Each layer was assigned to a target architecture, with CPU execution times uniformly sampled between 10 and 30 ms and the corresponding GPU times scaled to 30-60% of the CPU cost, reflecting typical performance characteristics. The results indicated that advanced strategies such as Min-Min, Max-Min, and Earliest-Finish-Time (EFT) yielded competitive performance. However, these approaches incur scheduling overhead and depend on global task knowledge or speculative mapping. By contrast, First-Come, First-Served (FCFS) achieved the lowest runtime (0.0539 s) and matched the best energy efficiency (4.31 J), without requiring detailed layer profiling, affinity estimation, or task reordering. Therefore, FCFS is selected as the base scheduler and adapted to handle multiple accelerators (an illustrative simulation sketch follows Table 1).
Table 1. Runtime and energy of scheduling algorithms on a 10-layer CNN
Scheduler | Runtime (s) | Energy (J) | Description |
|---|---|---|---|
FCFS | 0.0539 | 4.31 | First-come, first-served |
Static pipeline | 0.1093 | 8.74 | Fixed alternation |
Dataflow | 0.0752 | 6.02 | Per-layer greedy |
Round-Robin | 0.1093 | 8.74 | Cyclic device assignment |
Min–Min | 0.0549 | 4.39 | Shortest-task first |
Max–Min | 0.0570 | 4.56 | Longest-task first |
EFT | 0.0560 | 4.48 | Earliest finish first |
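As an illustration of the comparison in Table 1, the following Python sketch contrasts FCFS with a Round-Robin policy on synthetic layer costs. The cost ranges follow the description above, while the per-device power figures and the treatment of layers as independent queued tasks are simplifying assumptions for illustration only.

import random

random.seed(0)

# Simulated per-layer costs, following the ranges described above:
# CPU 10-30 ms per layer, GPU 30-60% of the corresponding CPU cost.
cpu_ms = [random.uniform(10, 30) for _ in range(10)]
gpu_ms = [c * random.uniform(0.3, 0.6) for c in cpu_ms]

# Assumed average power draw per device in watts (illustrative values only).
POWER_W = {"cpu": 65.0, "gpu": 120.0}


def simulate(assign):
    """Return (makespan_s, energy_J) for a given device-assignment policy.

    Layers are treated as independent queued tasks; the real scheduler
    additionally honours inter-layer data dependencies.
    """
    free_at = {"cpu": 0.0, "gpu": 0.0}
    energy = 0.0
    for i, (c, g) in enumerate(zip(cpu_ms, gpu_ms)):
        dev = assign(i, free_at)
        cost = c if dev == "cpu" else g
        free_at[dev] += cost                    # device is busy until this layer finishes
        energy += POWER_W[dev] * cost / 1000.0  # ms -> s
    return max(free_at.values()) / 1000.0, energy


def fcfs(i, free_at):
    # each arriving layer goes to whichever device frees up first
    return min(free_at, key=free_at.get)


def round_robin(i, free_at):
    # strict device alternation regardless of load
    return "cpu" if i % 2 == 0 else "gpu"


for name, policy in (("FCFS", fcfs), ("Round-Robin", round_robin)):
    runtime, energy = simulate(policy)
    print(f"{name:12s} runtime={runtime:.4f} s  energy={energy:.2f} J")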
Consequently, the selected scheduling algorithm was improved to handle a heterogeneous environment, shown in Fig. 2. The scheduler distributes and manages workloads across processing units by applying a partitioning strategy that breaks the model into subgraphs. These subgraphs are staged in an FCFS queue, ensuring tasks execute in linear order while adhering to data dependencies required by image processing algorithms.
[See PDF for image]
Fig. 2
The scheduler partitions the neural network into subgraphs, enqueuing them in an FCFS task queue. The CPU manages execution, weight loading, and memory transfers, dispatching tasks to the GPU or FPGA. Intermediate feature maps are stored for reuse, and final results are merged and post-processed by the CPU
Subgraph partition and device assignment
Prior to runtime, the network is parsed and segmented into subgraphs. Each subgraph may comprise one or more layers, typically grouped according to hardware specialisation and defined algorithmic heuristics. For example:
GPU subgraphs: Contain layers that benefit from the GPU’s high parallelism (e.g., large convolutions, activations, and batch-normalisation).
FPGA subgraphs: Incorporate layers (or series of layers) that exploit the FPGA’s dataflow architecture (e.g., small kernel convolutions or fused operations amenable to custom logic and streaming I/O).
Each subgraph is annotated with the following metadata (a representative structure is sketched after this list):
Device allocation: Indicates whether a subgraph runs on the GPU or FPGA.
Input/output tensors: Identifies the feature-map tensors consumed and produced at subgraph boundaries, essential for cross-device data transfers.
Boundaries and dependencies: Specifies how outputs from one subgraph become inputs to the next, preserving correct ordering in multi-stage pipelines.
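A minimal sketch of how this per-subgraph metadata could be represented on the host side; the field names below are illustrative, not the exact structures used in our implementation.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Subgraph:
    """One schedulable unit of the partitioned network (illustrative)."""
    name: str                           # e.g. "resnet18_block1"
    device: str                         # device allocation: "gpu" or "fpga"
    layers: List[str]                   # layers grouped into this subgraph
    input_tensors: List[str]            # feature maps consumed at the boundary
    output_tensors: List[str]           # feature maps produced for the next subgraph
    depends_on: List[str] = field(default_factory=list)  # upstream subgraphs

# Example: a GPU-friendly stem followed by an FPGA-friendly residual block.
stem   = Subgraph("stem",   "gpu",  ["conv1", "bn1", "relu", "maxpool"], ["image"], ["fmap0"])
block1 = Subgraph("block1", "fpga", ["layer1.0", "layer1.1"], ["fmap0"], ["fmap1"], ["stem"])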
In future work, the collected profiling data provides a rich feature set for training a machine-learning model that can automatically predict optimal partition boundaries. For example, unsupervised clustering (such as k-means) or reinforcement-learning algorithms can learn the mapping from layer metrics to hardware assignments, replacing manual partitioning while decreasing the overall computational pipeline runtime (Fig. 3).
[See PDF for image]
Fig. 3
Subgraph computation unit implemented on the FPGA. Input feature-maps and weights are streamed through a systolic Multiply-Accumulate (MAC) array backed by dual-level on-chip buffers
FPGA micro-architecture: All CNN subgraphs mapped to the FPGA are executed on a Xilinx DPU-style accelerator implemented in the programmable logic, comprising:
Grid of identical convolution engines (“compute units”) built from DSP-based MAC arrays;
Two-level on-chip buffer hierarchy comprising per-CU input/weight caches and a shared SRAM; and
Micro-coded controller that streams instructions from an on-chip instruction RAM.
Subgraph FPGA execution: During synthesis, the tool-chain automatically tiles each layer based on the following criteria (a simplified tile-size calculation is sketched after these criteria):
The input feature map tile and its corresponding weights must fit within the available on-chip buffer memory.
The tiling must maximise the utilisation of the DPU’s MAC array to improve compute efficiency.
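A simplified sketch of the first criterion; the buffer capacity, bytes-per-element assumption and the search itself are placeholders for the vendor tool-chain's actual tiling heuristics.

def choose_tile_height(c_in, c_out, width, k, elem_bytes=1, buffer_bytes=512 * 1024):
    """Pick the tallest input tile whose feature-map data and weights fit on chip."""
    weight_bytes = c_out * c_in * k * k * elem_bytes
    for tile_h in range(width, 0, -1):                               # try tall tiles first (square map assumed)
        ifm_bytes = c_in * (tile_h + k - 1) * width * elem_bytes     # tile rows plus halo
        if ifm_bytes + weight_bytes <= buffer_bytes:
            return tile_h
    raise ValueError("layer cannot be tiled into the available on-chip buffer")

# e.g. a 3x3 convolution, 64 -> 128 channels, on a 56-pixel-wide feature map
print(choose_tile_height(c_in=64, c_out=128, width=56, k=3))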
Scheduling algorithm (FCFS)
All subgraphs are enqueued for execution in a FCFS manner. This means the first subgraph to arrive (i.e. the earliest in the model’s layer order) is executed before subsequent subgraphs:
Fetch and decode: The scheduler reads the next subgraph from the queue, checking the device assignment and any required data-transfer information.
Execution initiation: The scheduler signals the appropriate driver (GPU or FPGA) to initialise memory allocations and transfer input tensors. This includes cudaMalloc and cudaMemcpy for the GPU, or DMA buffer setup for the FPGA.
Task completion: Upon subgraph completion, the device updates a status register which the scheduler uses to confirm the output is ready.
Output handling: The output tensors are either relayed directly to the next subgraph’s device or temporarily stored in CPU memory if cross-device streaming is unavailable.
GPU subgraph execution
When a subgraph is assigned to the GPU:
Memory allocation: Host code calls CUDA routines, such as cudaMalloc, to reserve device memory for the subgraph's weights and intermediate tensors.
Data transfer: If the subgraph inputs originate from the CPU or were produced by the FPGA, data is copied into GPU memory using cudaMemcpy. In some architectures, direct transfers (e.g., GPUDirect) may bypass the CPU.
Kernel invocation: Each layer in the subgraph is executed via CUDA kernels. Synchronisation points ensure the subgraph completes before the scheduler enqueues the next task.
Result readout: The outputs remain in GPU memory for subsequent GPU subgraphs, or are sent back to the FPGA or CPU, depending on the next subgraph’s device assignment.
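A PyTorch-level sketch of this GPU path (PyTorch issues the underlying cudaMalloc/cudaMemcpy calls through .to()); the subgraph module and tensor shapes are illustrative, and a CUDA device is assumed to be present.

import torch
import torch.nn as nn

def run_gpu_subgraph(subgraph: nn.Module, input_fmap: torch.Tensor) -> torch.Tensor:
    """Execute one GPU-assigned subgraph and leave its output resident on the device."""
    device = torch.device("cuda")
    subgraph = subgraph.to(device)                  # weights -> GPU memory
    x = input_fmap.to(device, non_blocking=True)    # input tensor -> GPU memory
    with torch.no_grad():
        y = subgraph(x)                             # CUDA kernels for each layer
    torch.cuda.synchronize()                        # complete before the next task is enqueued
    return y                                        # stays on the GPU for a following GPU subgraph

# Example: a small convolutional subgraph applied to a dummy feature map.
sub = nn.Sequential(nn.Conv2d(3, 64, 7, stride=2, padding=3), nn.BatchNorm2d(64), nn.ReLU())
out = run_gpu_subgraph(sub, torch.randn(1, 3, 224, 224))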
FPGA subgraph execution
For FPGA-assigned subgraphs:
Driver and DMA setup: The scheduler interacts with the FPGA driver to allocate buffers and set up a DMA descriptor. The descriptor indicates the source address of the incoming tensors (CPU or GPU memory address) and the destination within FPGA-accessible memory.
Computation: The FPGA executes the custom hardware logic (e.g., large convolution blocks or dataflow pipelines). Partial results are streamed locally or stored in on-chip memory.
Output transfer: Once completed, final feature maps are either returned to CPU space or transferred directly to the GPU if the next subgraph resides there. This transfer may employ GPUDirect RDMA to reduce host overheads and latency.
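The corresponding FPGA path depends entirely on the board's driver; the sketch below assumes a hypothetical driver object exposing alloc_dma_buffer, output_size, start and wait, purely to illustrate the sequence of steps rather than any real API.

import numpy as np

def run_fpga_subgraph(driver, subgraph_id: int, input_fmap: np.ndarray) -> np.ndarray:
    """Host-side sequence for an FPGA-assigned subgraph (illustrative only)."""
    # 1. Driver and DMA setup: stage the input tensor in a DMA-capable buffer.
    in_buf = driver.alloc_dma_buffer(input_fmap.nbytes)
    in_buf.write(input_fmap.tobytes())
    out_buf = driver.alloc_dma_buffer(driver.output_size(subgraph_id))

    # 2. Computation: start the subgraph and block until its status register flips.
    driver.start(subgraph_id, src=in_buf, dst=out_buf)
    driver.wait(subgraph_id)

    # 3. Output transfer: return the feature map to host memory (or hand the buffer
    #    to the GPU directly when GPUDirect RDMA is available).
    return np.frombuffer(out_buf.read(), dtype=input_fmap.dtype)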
Synchronisation and finalisation
The scheduler maintains a global status register for each subgraph, waiting, executing, or completed states. Each transition is triggered by hardware events or drivers. After the final subgraph completes, the outputs (e.g., classification logits) are copied to CPU memory for post-processing. This sequential approach avoids data hazards, preserves ordering, and effectively balances GPU and FPGA resources.
[See PDF for image]
Algorithm 1
Heterogeneous scheduler
Algorithm description. Algorithm 1 demonstrates the heterogeneous scheduler for a subgraph-partitioned model, in which the entire network is split into multiple subgraphs assigned either to the GPU or to the FPGA. Each task corresponds to one subgraph and is enqueued in a First-Come, First-Served (FCFS) manner.
Subgraph partitioning and metadata. Prior to runtime, the model graph is analysed to identify suitable partition boundaries (e.g., after major convolutions or pooling layers). Each partition or subgraph is annotated with its target device (FPGA or GPU), as well as input/output tensor names that define the subgraph boundary. These annotated subgraphs form the list of tasks.
FCFS queue. The subgraphs (tasks) are inserted into an FCFS queue, which preserves linear ordering. This is vital in imaging algorithms and neural networks, where operations must proceed in sequence to maintain correct data dependencies (e.g., one subgraph’s output is the next subgraph’s input).
ExecuteOnFPGA. Subgraphs assigned to the FPGA require the CPU host to initialise the FPGA driver, allocate DMA buffers, and transfer the relevant input feature maps or parameters to the FPGA’s memory. The FPGA processes the subgraph (e.g., small convolution kernels) and signals completion via its driver interface. The output is either stored locally or transferred back for downstream subgraphs.
ExecuteOnGPU. If the subgraph is mapped to the GPU, the CPU code allocates GPU memory and uploads the input tensors via cudaMemcpy (unless GPUDirect RDMA is employed from an FPGA output). The subgraph’s layers then run on the GPU in sequence, often using CUDA kernels or libraries such as cuDNN. Synchronisation ensures that the output is only retrieved when the GPU has finished execution.
Data transfers and final output. After each subgraph completes, its output either remains on the same device (if subsequent subgraphs also target that device) or is transferred to the next device. Once the entire queue of subgraphs is exhausted, the final output is transferred to the CPU for post-processing. Tasks that finish on the GPU or FPGA are consolidated via TransferData calls, which handle the correct source and destination addresses.
Post-processing and display. Finally, the scheduler calls PostProcess to apply any additional logic on the final output tensor (e.g., a softmax or thresholding step), then displays the results as needed.
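A compact Python rendering of Algorithm 1 as described above; ExecuteOnFPGA, ExecuteOnGPU, TransferData and PostProcess are passed in as opaque callables (their behaviour follows the preceding subsections), and the simple (tensor, location) bookkeeping is an assumption consistent with the text.

from collections import deque

def run_pipeline(tasks, execute_on_fpga, execute_on_gpu, transfer_data, post_process, first_input):
    """FCFS heterogeneous scheduler sketch mirroring Algorithm 1.

    `tasks` is the ordered list of annotated subgraphs (see the Subgraph
    structure above); the four callables wrap the device-specific steps.
    """
    queue = deque(tasks)                        # FCFS: model order is arrival order
    current, location = first_input, "cpu"      # boundary tensor and where it currently resides

    while queue:
        task = queue.popleft()                  # fetch and decode the next subgraph
        if location != task.device:             # move the boundary tensor if required
            current = transfer_data(current, src=location, dst=task.device)
        if task.device == "fpga":
            current = execute_on_fpga(task, current)
        else:
            current = execute_on_gpu(task, current)
        location = task.device                  # output stays on the producing device

    if location != "cpu":                       # final output back to host memory
        current = transfer_data(current, src=location, dst="cpu")
    return post_process(current)                # e.g. softmax / thresholding, then display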
Algorithm deployment
The following section describes the architecture of two widely used CNNs, ResNet18 and MobileNetV2 and a feature extraction algorithm SIFT. Each algorithm is profiled layer by layer and a partitioning strategy is developed to execute on the heterogeneous platform.
Scale-invariant feature transform
The Scale-Invariant Feature Transform (SIFT) algorithm is a method in computer vision used to detect and describe local features in images. The SIFT algorithm is designed to be robust to changes in scale, rotation, and partial occlusion. It works in several stages: First, it identifies key points in the image through scale-space extrema detection. These keypoints are then localised more accurately and assigned an orientation. Finally, a descriptor is computed for each keypoint, capturing its local image gradient patterns. These descriptors are used for matching keypoints across different images.
Profiling of the SIFT algorithm shows that several key stages exhibit varying computational complexities and hardware implications:
Gaussian pyramid construction
The Gaussian stage in the SIFT algorithm imposes significant computational and memory demands on hardware. Intensive convolution operations and the generation of multiple intermediate images at different scales lead to substantial memory consumption, especially with high-resolution images. For an $N \times N$ image, the convolution complexity grows quadratically with $N$, and this process is repeated across scales, amplifying the computational load. High memory bandwidth is also required for accessing large filter kernels and image matrices. Architectures with enhanced parallelism and memory caches are more suitable for this stage due to the considerable memory footprint.
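For illustration, a minimal Gaussian pyramid built with SciPy; the octave/scale counts and the geometric sigma schedule are typical SIFT-style defaults rather than values taken from our implementation.

import numpy as np
from scipy.ndimage import gaussian_filter

def gaussian_pyramid(image, octaves=4, scales=5, sigma0=1.6):
    """Build the stack of progressively blurred images that dominates SIFT's memory footprint."""
    k = 2.0 ** (1.0 / (scales - 1))               # geometric blur increment per scale
    pyramid, base = [], image.astype(np.float32)
    for _ in range(octaves):
        octave = [gaussian_filter(base, sigma0 * k ** s) for s in range(scales)]
        pyramid.append(octave)                    # `scales` full-size images per octave
        base = octave[-1][::2, ::2]               # halve the resolution for the next octave
    return pyramid

# e.g. pyr = gaussian_pyramid(np.random.rand(1080, 1920))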
Extrema detection
This stage focuses on identifying local extrema within the Difference-of-Gaussians (DoG) images. A pixel is considered an extremum if its value is greater than (or less than) that of all 26 neighbours in the current and adjacent DoG scales. Determining these extrema involves a straightforward pixel-by-pixel comparison, a constant-time operation per pixel. While this is performed for every pixel, it requires relatively low computational resources.
Orientation and magnitude assignment
To achieve rotation invariance, the SIFT algorithm computes the gradient magnitude and orientation around each keypoint. The magnitude quantifies the strength of intensity changes, calculated using horizontal and vertical intensity differences squared, summed, and square-rooted. The gradient orientation is derived from the arctangent of the ratio of vertical to horizontal intensity differences, indicating the direction of the most pronounced intensity change.
These values populate a weighted orientation histogram for the keypoint, where the peak signifies the dominant orientation assigned to the keypoint. While square root and arctangent operations are less intensive than convolutions, they still pose computational challenges when applied to a large number of pixels. Hardware support for fixed-point or floating-point arithmetic can significantly accelerate these calculations.
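For reference, the standard expressions implied by this description, where $L$ denotes the Gaussian-smoothed image:

$m(x, y) = \sqrt{\left(L(x+1, y) - L(x-1, y)\right)^{2} + \left(L(x, y+1) - L(x, y-1)\right)^{2}}$

$\theta(x, y) = \tan^{-1}\!\left(\frac{L(x, y+1) - L(x, y-1)}{L(x+1, y) - L(x-1, y)}\right)$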
Descriptor generation
This stage involves weighted histogram binning, expressed as $H(b) \leftarrow H(b) + w \cdot m$, where $H$ is the orientation histogram, $b$ is the orientation bin, $w$ is the Gaussian weight, and $m$ is the gradient magnitude. With $K$ keypoints, the complexity scales linearly with $K$. This stage is well-suited for processors that efficiently handle integer or fixed-point arithmetic.
[See PDF for image]
Fig. 4
Operation stage run-time profiling: SIFT
SIFT profiling and partitioning strategy
The profiling times for each stage on each hardware platform are shown in Fig. 4. In addition, the partitioning strategy shown in Fig. 5 focuses on striking a balance between the energy consumption and the execution time of the heterogeneous platform.
[See PDF for image]
Fig. 5
SIFT algorithm and partitioning strategy (divided by dashed lines)
The profiling results reveal that the CPU is substantially slower in execution time than the GPU and FPGA on average. Even though the CPU has the highest clock speed, its small number of processing cores limits loop-unrolling optimisations for parallelisation. Comparing only the GPU and FPGA, the overall total runtime shows the GPU to be faster. In the Gaussian Pyramid and Orientation & Magnitude stages, the GPU outperforms the FPGA, whereas the FPGA outperforms the GPU in the Extrema Detection stage. The GPU and FPGA architectures are comparable in performance when generating keypoint descriptors due to the lower number of operations. Overall, the lower GPU runtime in most stages is attributed to its significantly higher clock speed (e.g., 1725 MHz vs 300 MHz) and larger number of processing cores, which maximise throughput.
The RGB to greyscale colour space conversion is a computationally lightweight operation. It involves a linear combination of the red, green, and blue colour channels with specific weightings to produce a greyscale image. The decision to execute this stage on the CPU, the accelerator selected for RGB2Gray in Table 4, avoids paying a transfer overhead for an operation with negligible compute cost.
The
CNN architecture descriptions
ResNet-18: a deep convolutional neural network architecture, shown in Fig. 6, that employs several key techniques to achieve high performance in classification tasks. Its design is driven by the need to train very deep neural networks while mitigating issues such as vanishing gradients. The key components of ResNet-18 are as follows:
Convolutional layers: ResNet-18 consists of 18 layers organised into stages. It starts with a 7 × 7 convolutional layer with stride 2, followed by a 3 × 3 max-pooling layer with stride 2. The large initial kernel captures larger spatial features.
Batch normalization and ReLU activation: After each convolutional layer, batch normalization stabilizes and accelerates training, followed by the Rectified Linear Unit (ReLU) activation function to introduce non-linearity. This combination aids in faster convergence and regularization.
Residual connections: A distinctive feature of ResNet-18 is its use of residual connections, introducing identity mappings that allow the network to learn residual information between layers. The output of a residual block is
$y = F(x) + x$ (1)
where $x$ is the input, $F(x)$ represents the transformation applied, and $y$ is the output. These connections add the input directly to the block’s output, facilitating gradient flow through skip connections and easing the training of deep networks (a minimal PyTorch block is sketched after this list).
Downsampling blocks: Each stage contains multiple residual blocks. After a series of convolutions, a downsampling block is applied using a convolutional layer with stride 2. This reduces the spatial dimensions of the feature maps while increasing the number of channels, reducing computational overhead while preserving important information.
Classification layer: The final layer is a fully connected layer followed by a softmax function, computing a probability distribution over output classes based on the features from preceding layers.
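To make the residual connection concrete, a minimal PyTorch block in the spirit of a ResNet-18 stage; the channel count is illustrative.

import torch
import torch.nn as nn

class BasicBlock(nn.Module):
    """Two 3x3 convolutions with an identity skip connection: y = F(x) + x."""
    def __init__(self, channels: int):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn1 = nn.BatchNorm2d(channels)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1, bias=False)
        self.bn2 = nn.BatchNorm2d(channels)

    def forward(self, x):
        out = torch.relu(self.bn1(self.conv1(x)))
        out = self.bn2(self.conv2(out))
        return torch.relu(out + x)        # residual addition eases gradient flow

y = BasicBlock(64)(torch.randn(1, 64, 56, 56))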
[See PDF for image]
Fig. 6
ResNet-18 architecture and highlighted hardware partitioning strategy (divided by dashed lines)
[See PDF for image]
Fig. 7
MobileNetV2 architecture and highlighted hardware partitioning strategy (divided by dashed lines)
[See PDF for image]
Fig. 8
Layer run-time profiling for ResNet18: first convolution (Conv1) layer (L), fully connected (FC)/bottleneck layer (BN)
[See PDF for image]
Fig. 9
Layer run-time profiling for MobileNetV2: bottleneck layer (BN)
MobileNetV2: shown in Fig. 7, an embedded-optimised convolutional neural network architecture that uses a range of techniques to achieve high accuracy at low computational cost. Key details of MobileNetV2 include:
Depthwise separable convolution: MobileNetV2 uses depthwise separable convolution, dividing standard convolution into depthwise and pointwise steps. Depthwise convolution performs a separate convolution for each input channel using a k × k kernel, significantly reducing the computational load (a brief PyTorch sketch follows this list).
Pointwise convolution: Following the depthwise convolution, pointwise convolution with 1 × 1 kernels combines the results by performing a linear combination of the channels. This captures complex relationships between channels and maintains model accuracy.
Reduced computational cost and parameters: The combination of depthwise and pointwise convolutions significantly reduces computational cost, making MobileNetV2 suitable for resource-constrained embedded devices. Additionally, the ReLU6 activation function clamps outputs at 6 to improve robustness for low-precision computations.
Linear bottlenecks: MobileNetV2 introduces linear bottlenecks, 1 × 1 convolutions placed between a ReLU activation and a convolution. These bottlenecks keep the computational cost low while ensuring the network maintains high accuracy. The ReLU activation introduces non-linearity, while the subsequent convolution captures more complex features.
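A brief PyTorch sketch of the depthwise-separable pattern described above; the channel sizes are illustrative and the block omits MobileNetV2's expansion and residual details.

import torch
import torch.nn as nn

def depthwise_separable(c_in: int, c_out: int, k: int = 3) -> nn.Sequential:
    """Depthwise k x k convolution per channel, then a 1 x 1 pointwise channel mix."""
    return nn.Sequential(
        nn.Conv2d(c_in, c_in, k, padding=k // 2, groups=c_in, bias=False),  # depthwise
        nn.BatchNorm2d(c_in),
        nn.ReLU6(inplace=True),        # clamps activations at 6 for low-precision robustness
        nn.Conv2d(c_in, c_out, 1, bias=False),                              # pointwise (1 x 1)
        nn.BatchNorm2d(c_out),
    )

y = depthwise_separable(32, 64)(torch.randn(1, 32, 112, 112))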
CNN profiling and partitioning strategy
Both CNN architectures are analysed and partitioned onto the appropriate accelerator based on their runtime profiles. The CNN hardware comparison results are displayed in Figs. 8 and 9.
Resnet18
The ResNet18 results in Fig. 8 show that the fastest hardware for executing the model is the GPU, with a total execution time of 0.18 s, while the slowest is the CPU at 0.29 s. The FPGA's total execution time lies between the two at 0.19 s.
The
The and Fully Connected (FC) layers take relatively less time to execute. The size of feature maps decreases as they progress through the layers due to downsampling operations like pooling and strides. The convolutional and average pool layers can be executed on the FPGA since fewer MAC operations are occurring for the GPU to be fully utilised while taking advantage of power efficient architecture.
MobilenetV2
The results in Fig. 9 show that the total execution time for the CNN on the CPU was 0.241 s, while on the GPU and the FPGA it was 0.23 s and 0.20 s, respectively. The bottleneck layer with the longest execution time on all three devices was BN1, taking 0.02202 ms on the CPU, 0.015 ms on the 3070 GPU, and 0.0034 ms on the FPGA.
The runtime for each bottleneck layer decreases as it moves from the early to the later bottleneck stages, as the feature maps shrink through the network. The FPGA's efficiency on these layers can be attributed to the following:
Direct, custom and optimised routing between logic allows efficient data-flow transfer and locality.
Separable filters and feature maps have a reduced memory footprint, which can be efficiently managed.
Efficient use of pipelining for convolutional operations and reduced data dependency (e.g., ResNet18 skip connections).
Experimental setup
The proposed partitioning is tested using the two developed heterogeneous platforms, one built from high-power and the other from low-power components, as shown in Table 2.
Table 2. Hardware/software environment
Accelerator | High-power | Low-power | Software |
|---|---|---|---|
CPU | AMD 5900x (4.8 GHz) | ARMv8.2 (1400 MHz) | Python/Pytorch 2.0 [60] |
GPU | Nvidia 3070 (1730 MHz) | Xavier NX (1100 MHz) | Python/Pytorch 2.0 |
FPGA | ZCU106 (300 MHz) | Artix-7 (100 MHz) | Verilog/Vivado/Vitis [61] |
Dataset. The test images used in the experiments are from the LIU4K-v2 dataset [62], a high-resolution dataset that includes 2000 images. The images contain a variety of backgrounds and objects.
Measurement metrics
Execution time
The evaluation of the overall system performance considers both latency and compute factors, reporting performance metrics for total time, inference, and other significant layers while using 16-bit floating-point (FP16) precision. Other devices, such as the i9-11900KF (5.30 GHz), are also benchmarked for additional insight. The run-time is measured using the host platform’s built-in time libraries. The network performance is estimated by executing and averaging the results over 100 images. The frames-per-second (FPS) metric is computed using Eq. (2):
$\mathrm{FPS} = N_{\mathrm{frames}} / t_{\mathrm{total}}$ (2)
where $N_{\mathrm{frames}}$ is the number of processed images and $t_{\mathrm{total}}$ is the total processing time.
Power consumption
Two common methods used for measuring power are software- and hardware-based. Accurate power estimation is challenging for software tools because they have to assume various factors in their models. Additionally, taking the instantaneous power or TDP of a device is not accurate, since power consumption varies with the specific workload. Therefore, measuring power over the time it takes for the algorithm to execute improves accuracy compared with using a fixed wattage. The hardware measurement approach uses a current clamp meter, shown in Fig. 10, which outputs a voltage for every amp measured. The Otii Arc Pro [63] data-logger captures the time-series data from the current clamp and generates a graph showing the current consumption over time. A script is developed to start and stop the measurements during the algorithm’s execution. The mean current is multiplied by the input voltage to obtain the mean power, which is then used to determine the energy consumed in Joules (J) using Eq. (3), where E is energy, P represents power and t time.
$E = P \times t$ (3)
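A sketch of the measurement post-processing; the two-column CSV layout (timestamp in seconds, current in amps) is an assumption about the data-logger export, not the Otii tool's actual schema.

import csv

def energy_from_log(csv_path: str, supply_voltage: float) -> float:
    """Turn a logged current trace into energy in Joules, following Eq. (3)."""
    times, currents = [], []
    with open(csv_path) as f:
        for t, i in csv.reader(f):
            times.append(float(t))
            currents.append(float(i))
    duration = times[-1] - times[0]                               # algorithm execution time t
    mean_power = supply_voltage * sum(currents) / len(currents)   # P = V * I_mean
    return mean_power * duration                                  # E = P * t

# e.g. energy_from_log("resnet18_run.csv", supply_voltage=12.0)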
[See PDF for image]
Fig. 10
a Power measurement using current clamp, b connected to a data logger at 4 kilo samples per second
Results and discussion
SIFT results
To exploit the proposed partitioning, a custom pipeline was created by targeting different algorithm components at different hardware based on the suitability obtained from the benchmarking framework. This includes the latency of transferring image data between memory and the accelerators. The heterogeneous architecture makes it possible to pick and execute operations within the image processing algorithms on each architecture to meet speed and power targets. However, within the scope of this work, only an initial configuration of the SIFT algorithm is reported, which establishes preliminary steps toward future work on finding the most optimal configuration for the algorithms (Table 3).
SIFT runtime
Table 4 shows the execution time of the SIFT algorithm on a heterogeneous platform. The table includes memory transfer latency between the host and devices, an aspect frequently overlooked in similar analyses.
The results reveal that the heterogeneous platform (excluding memory transfer) outperforms all of the discrete architectures (CPU, GPU and FPGA; Table 4). However, when data transfer is taken into account, the heterogeneous architecture's execution time increases due to host task scheduling.
[See PDF for image]
Fig. 11
SIFT power consumption, baseline homogeneous and heterogeneous implementation comparison (CPU: 5900X, GPU: 3070, FPGA: ZCU106, HP: Heterogeneous)
[See PDF for image]
Fig. 12
Energy consumption and total runtime comparison for ResNet18: CPU:(I9, 5900X), GPU:(3070, A200, Xavier NX), FPGA:(Artix-7, U50, ZCU106), High-power (HP)
[See PDF for image]
Fig. 13
Energy consumption and total runtime comparison for MobileNetV2: CPU:(I9, 5900X), GPU:(3070, A200, Xavier NX), FPGA:(Artix-7, U50, ZCU106), Low-power (LP)
SIFT energy consumption
The energy consumption results for the single accelerators and the heterogeneous system (high-power) are shown in Fig. 11 and Table 4. The results indicate that the CPU is the most energy-consuming hardware for all stages of the SIFT algorithm. Among these stages, the Gaussian Pyramid consumes the most energy, while RGB2Gray consumes the least. In contrast, the HP system, GPU and FPGA all consumed less total energy than the CPU. However, the CPU has a competitive energy profile for RGB2Gray.
SIFT discussion
FPGAs are suitable in areas where GPUs and CPUs are not well optimised, such as memory-intensive streaming computations with known spatial and temporal locality patterns, pipelined computations lacking massive fine-grained data parallelism, and operations on short integer and user-defined data types. However, the FPGA's logic-configurable architecture is quite sensitive to the implementation quality of the underlying design; therefore, performance can vary significantly based on how well the design exploits parallelism, data flow, and memory access patterns. The power measured on the single FPGA includes the programmable chip, the DRAM, and all the I/O on the board. Energy consumption would therefore be lower if unused peripherals were removed.
CPUs, being general-purpose processors, are designed to handle a wide variety of tasks. They excel in scenarios requiring high single-threaded performance and complex control logic. However, their generality comes at the cost of energy efficiency, particularly for highly parallel tasks like those found in the SIFT algorithm. CPUs tend to consume the most energy across all stages of the SIFT algorithm due to their relatively lower parallelism compared to specialised accelerators.
GPUs are designed for high-throughput parallel processing, making them well-suited for the computationally intensive stages, such as the Gaussian Pyramid. They achieve significant energy savings compared to CPUs by leveraging thousands of smaller, simpler cores that can perform parallel computations efficiently. The GPU’s architecture is optimised for tasks that can be broken down into smaller, independent operations.
[See PDF for image]
Fig. 14
Frames per second (FPS) for inference on CPU:(I9-9900 K, 5900X) GPU:(GTX 3070, Xavier NX) FPGA:(Artix-7, ZCU106), High-power (HP), Low-power (LP). (+) Denotes components in HP platform and (–) denotes LP platform
Heterogeneous CNN results
The results in Figs. 12 and 13 show the total run-time and energy consumption of ResNet18 and MobileNetV2 on each architecture and heterogeneous platform, while Fig. 14 shows the inference rate in frames per second. Tables 5 and 6 summarise the results for total execution time, inference, convolution, fully connected layer, total energy, CPU energy, device energy and data-logger energy.
CNN inference
According to Fig. 14, the ResNet18 architecture achieved the highest FPS, peaking at 270, compared with 243 FPS for MobileNetV2 on the high- and low-power heterogeneous architectures. This difference in FPS can be attributed to the network depths and parameters, with ResNet-18 having 18 layers and MobileNetV2 having 53 layers, leading to differences in computational complexity. Considering individual hardware only, the '3070' GPU achieved the highest FPS on ResNet18 and the 'ZCU106' FPGA on MobileNetV2. On the other hand, the Artix-7 has the lowest FPS for both architectures due to limited compute resources (e.g., DSP slices) and a lower clock speed (100 MHz). For both heterogeneous systems, the HP and LP architectures achieve higher FPS than their individual GPU counterparts.
Resnet18 execution time and energy consumption
Total execution time speedups of the high-power (HP) and low-power (LP) systems are compared against the fastest discrete component within each system. The 'HP' system demonstrated a speedup of 1.05x over the 'GPU: 3070' for ResNet18, while the 'LP' system exhibited a speedup of 1.21x over the 'GPU: Xavier NX' (Table 3). The most time is spent performing convolution operations in the Conv1 layer (Fig. 8).
MobileNetV2 execution time and energy consumption
As for MobileNetV2 in Fig. 13, the 'FPGA: Artix-7' platform again has the longest total execution time at 1.4 s, while the ZCU106 classifies the image in 0.20 s, faster than the GPU at 0.23 s. The better execution on the FPGA can be attributed to the separable filters being executed more efficiently and utilising the DSPs. Including both heterogeneous platforms, HP and LP slightly outperform the discrete GPUs and FPGAs in both execution time and energy.
The energy consumption shows a 6.25% and 7.32% reduction for the 'HP' and 'LP' systems, respectively, in comparison to their most efficient discrete FPGA counterparts. The difference between the data-logger measurement and the software-based CPU + device estimate was 13.27%, indicating that the software estimation model is close to the hardware-based measurement.
Discussion
The results in Table 3 illustrate that the heterogeneous approach achieves competitive and balanced performance in energy and runtime for CNN-based algorithms (MobileNetV2 and ResNet-18) compared to other heterogeneous CPU-GPU-FPGA implementations, highlighting effective partitioning strategies suited for embedded CNN deployments. However, it is important to consider the full data/memory transfer latency for total execution time, as both heterogeneous architecture implementations are impacted by interconnect (PCIe) and distance bottlenecks. These bottlenecks reduce the FPS of the HP and LP systems by 10–25%, allowing their counterpart GPUs to outperform them in runtime and potentially in energy consumption. The compounding latency of memory transfers between accelerators for CNNs makes it vital to optimise data transfer and task partitioning to maintain performance. Future hardware designs with on-chip integration of accelerators would reduce distance latency and simplify data transfer protocols for efficient processing. While homogeneous architectures like GPUs have higher memory transfer and initialisation times, heterogeneous systems can mask this time: during the FPGA's initial preprocessing of algorithms, the GPU can be set up for CNN inference. Additionally, the patch-based method exploits global pipelining by sending each processed patch from the FPGA to the GPU for convolution within CNNs. The runtime disparity between the
Conclusion
In this paper, partitioning strategies are introduced to map the layers of two widely used convolutional neural networks, namely ResNet18 and MobileNetV2, together with the stages of the SIFT feature extraction algorithm, onto heterogeneous CPU-GPU-FPGA platforms. On both the high-power and low-power systems, the partitioned implementations outperform their discrete GPU/FPGA counterparts in runtime and energy (Tables 5 and 6).
Table 3. Energy and runtime of heterogeneous architecture implementations compared to their best discrete counterpart (i.e., GPU/FPGA)
Work | Heterogeneous platform | Partitioning strategy | Algorithms | Energy gain | Runtime speedup | |
|---|---|---|---|---|---|---|
Hyungmin et al. [19] | GPU+CPU | ARM+P100 | Element-wise | Long Short-term Memory | 0.34x | 4.2x |
FPGA | Zync Ultrascale | |||||
Hosseinabady et al. [20] | GPU+CPU | ARM+Jetson TX1 | Element-wise | Histogram | 2.29x | 1.79x |
FPGA | Virtex-7 Zync Ultrascale | Dense Matrix-Vector Multiplication | 1.19x | 1.48x | ||
Sparse Matrix-Vector Multiplication | 1.23x | 1.25x | ||||
Yuexuan et al. [64] | GPU+CPU | ARM+Jetson TX2 | Hybrid | LeNet-5 N=16 | 2.11x | 1.3x |
FPGA | Nexys Artix-7 | |||||
Carballo-Hernandez et al. [65] | GPU+CPU | ARM+Jetson TX2 | Layer-Wise | SqueezeNet Fire | 1.34x | 1.01x |
FPGA | Cyclone-10 GX | MobileNet V2 Bottleneck | 1.55x | 1.26x | ||
Shufflenet V2 Stage | 1.39x | 1.35x | ||||
Sumeet et al. [21] | GPU+CPU | ARM+A100 | Grouped Layer-Wise | ResNet-18 | 1.14x | - |
FPGA | Xilinx Alveo U280 | ResNet-50 | 1.08x | |||
VGG16-bn | 1.12x | |||||
Ours, Low-Power | GPU+CPU | ARM + Xavier NX | Layer-Wise | ResNet-18 | 1.03x | 1.21x |
FPGA | Xilinx Artix-7 | MobileNetV2 | 1.25x | 1.07x | |
Ours, High-Power | GPU+CPU | AMD 5900x + RTX 3070 | Layer-Wise | ResNet-18 | 1.07x | 1.05x |
FPGA | Xilinx ZCU106 | MobileNetV2 | 1.17x | 1.22x | |
The table only includes works where algorithms are partitioned and processed on all accelerators
Table 4. Execution time on individual hardware and heterogeneous platform (CPU: 5900X, GPU: RTX 3070, FPGA: ZCU106)
Algorithm | Singular accelerator baseline (ms) | Heterogeneous architecture (ms) | ||||
|---|---|---|---|---|---|---|
CPU | GPU | FPGA | Selected accelerator for partitioned algorithm | Runtime (ms) | Energy consumption (Joules) | |
RGB2Gray | 0.80 | 0.54 | 0.40 | CPU | 0.64 | 51.2 |
Gaussian pyramid | 684 | 3 | 6 | GPU | 3 | 108 |
Extrema detection | 112 | 2 | 3 | GPU | 2 | 68 |
Orientation magnitude | 97 | 4 | 2 | FPGA | 2 | 48 |
Descriptor generation | 21 | 1 | 1 | FPGA | 1 | 25 |
Total | 896.8 | 10.54 | 12.4 | CPU+GPU+ FPGA | 8.64 | 234 |
Runtimes exclude memory latency
Table 5. ResNet-18: result summary of energy consumption and execution time on each architecture
Accelerator | Execution time (s) | Energy consumption (J) | ||||||
|---|---|---|---|---|---|---|---|---|
Total execution time | Inference | Convolution | Fully Connected | Sum CPU + Device | Total CPU | Total device | Data logger | |
CPU: (i9-11900 KF) | 0.25 | 0.021 | 0.019 | 0.0009 | 20.035 | 20.04 | 18.23 | |
CPU: (5900X) | 0.29 | 0.022 | 0.018 | 0.0009 | 24.43 | 24.4267 | 22.48 | |
GPU: (GTX 3070) | 0.18 | 0.004 | 0.0028 | 0.008 | 19.08 | 8.7 | 10.38 | 9.11 |
Jetson (Xavier NX) | 0.74 | 0.0128 | 0.0109 | 0.001 | 13.58 | 13.5716 | 10.56 | |
FPGA: (Artix-7) | 1.1 | 0.070 | 0.062 | 0.0009 | 6.10 | 6.10 | 5.8 |
FPGA: (ZCU106) | 0.19 | 0.0042 | 0.0033 | 0.0009 | 9.12 | 9.12 | 8.07 | |
5900X + 3070 + ZCU106 | 0.17 | 0.0037 | 0.0027 | 0.0009 | 8.93 | 8.5 | ||
ARM + Xavier + Artix-7 | 0.62 | 0.012 | 0.011 | 0.0009 | 5.90 | 5.44 | ||
Bold: Best Runtime Performance, excludes accelerator transfer latency. Symbol "~" measurement not available (component absent or total power measured only)
Table 6. Mobilenet-V2: result summary of energy consumption and execution time on each architecture
Accelerator | Execution time (s) | Energy consumption (J) | |||||
|---|---|---|---|---|---|---|---|
Total execution time | Inference | Convolution | Sum CPU + Device | Total CPU | Total device | Data logger | |
CPU: (i9-11900 KF) | 0.28 | 0.023 | 0.02 | 24.4 | 24.4 | 20.23 | |
CPU: (5900X) | 0.31 | 0.025 | 0.022 | 25.3 | 25.3 | 21.48 | |
GPU: (GTX 3070) | 0.231 | 0.0048 | 0.0045 | 21.945 | 9.24 | 12.70 | 19.43 |
Jetson (Xavier NX) | 0.79 | 0.018 | 0.0125 | 15.28 | 15.28 | 14.92 | |
FPGA: (Artix-7) | 1.4 | 0.098 | 0.088 | 7.32 | 7.32 | 6.24 |
FPGA: (ZCU106) | 0.20 | 0.0046 | 0.0036 | 10.55 | 10.55 | 9.65 | |
5900X + 3070 + ZCU106 | 0.19 | 0.0041 | 0.0029 | 9.89 | 9.01 | ||
ARM + Xavier + Artix-7 | 0.74 | 0.0145 | 0.015 | 6.80 | 5.86 | ||
Bold: Best Runtime Performance, Runtimes exclude accelerator transfer latency. Symbol "~" measurement not available (component absent or total power measured only)
Future work
Future work will explore advanced scheduling methods capable of dynamically assigning CNN subgraphs based on metrics such as runtime, resource utilisation, power consumption, and memory latency. Further enhancements will include automatically scaling down the operating frequency of idle accelerators to optimise energy efficiency, and integrating additional computing units such as Neural Processing Units (NPUs) to enrich system heterogeneity. Another direction is applying unsupervised learning methods to intelligently partition CNN subgraphs by analysing algorithmic and hardware-specific characteristics. Additionally, a comprehensive investigation will be carried out into novel approaches for minimising data transfer overhead between accelerators by leveraging emerging interconnect technologies and high-speed shared memory interfaces.
Author contributions
All authors have made substantial contributions to the conception and design of the work.
Funding
Not Applicable.
Data availability
Not Applicable.
Materials availability
Not Applicable.
Code availability
Code provided upon request to the authors.
Declarations
Ethics approval and consent to participate
Not Applicable.
Consent for publication
Not Applicable.
Competing interests
The authors declare that they have no conflict of interest to report.
Publisher’s Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
References
1. Khaleghzadeh, H; Manumachu, RR; Lastovetsky, A. A novel data-partitioning algorithm for performance optimization of data-parallel applications on heterogeneous HPC platforms. IEEE Trans Parallel Distrib Syst; 2018; 29,
2. Kim Y, Kim J, Chae D, Kim D, Kim J. μLayer: low latency on-device inference using cooperative single-layer acceleration and processor-friendly quantization. In: Proceedings of the 14th European Conference on Computer Systems, EuroSys 2019. Association for Computing Machinery, Inc, https://doi.org/10.1145/3302424.3303950. Conference date: 25-03-2019 through 28-03-2019.
3. Kang W, Lee K, Lee J, Shin I, Chwa HS. Lalarand: flexible layer-by-layer cpu/gpu scheduling for real-time DNN tasks. In: 2021 IEEE Real-Time systems symposium (RTSS), 2021. p. 329–341. https://doi.org/10.1109/RTSS52674.2021.00038.
4. Lane ND, Bhattacharya S, Georgiev P, Forlivesi C, Jiao L, Qendro L, Kawsar F. Deepx: A software accelerator for low-power deep learning inference on mobile devices. In: 2016 15th ACM/IEEE international conference on information processing in sensor networks (IPSN), 2016. p. 1–12. https://doi.org/10.1109/IPSN.2016.7460664.
5. Zhao, W-L; Ngo, C-W. Flip-invariant sift for copy and object detection. IEEE Trans Image Process; 2012; 22,
6. Rawat, W; Wang, Z. Deep convolutional neural networks for image classification: a comprehensive review. Neural Comput; 2017; 29,
7. Minaee, S; Boykov, Y; Porikli, F; Plaza, A; Kehtarnavaz, N; Terzopoulos, D. Image segmentation using deep learning: a survey. IEEE Trans Pattern Anal Mach Intell; 2022; 44,
8. Pouyanfar, S; Sadiq, S; Yan, Y; Tian, H; Tao, Y; Reyes, MP; Shyu, M-L; Chen, S-C; Iyengar, SS. A survey on deep learning: algorithms, techniques, and applications. ACM Comput Surv; 2018; 51,
9. Motamedi, M; Fong, D; Ghiasi, S. Machine intelligence on resource-constrained IoT devices: the case of thread granularity optimization for CNN inference. ACM Trans Embed Comput Syst; 2017; 16,
10. Hailesellasie, MT; Hasan, SR. Mulnet: a flexible CNN processor with higher resource utilization efficiency for constrained devices. IEEE Access; 2019; 7, pp. 47509-47524. [DOI: https://dx.doi.org/10.1109/ACCESS.2019.2907865]
11. Chung ES, Milder PA, Hoe JC, Mai K. Single-chip heterogeneous computing: Does the future include custom logic, FPGAs, and GPGPUs? In: IEEE/ACM international symposium on microarchitecture. 2010. https://doi.org/10.1109/MICRO.2010.36.
12. Lowe DG. Object recognition from local scale-invariant features. In: International conference on computer vision, 20–25 September, 1999, Kerkyra, Corfu, Greece, Proceedings, 1999. vol. 2, p. 1150–1157. https://doi.org/10.1109/ICCV.1999.790410.
13. Sandler M, Howard A, Zhu M, Zhmoginov A, Chen L-C. Mobilenetv2: inverted residuals and linear bottlenecks. In: IEEE/CVF conference on computer vision and pattern recognition. 2018. https://doi.org/10.1109/CVPR.2018.00474.
14. He K, Zhang X, Ren S, Sun J. Deep residual learning for image recognition. In: 2016 IEEE conference on computer vision and pattern recognition (CVPR). 2016. p. 770–778. https://doi.org/10.1109/CVPR.2016.90.
15. Ali T, Paul G, Nicol R, Bhowmik D. Scheduling algorithms on heterogeneous architecture for efficient vision systems. In: Tescher AG, Ebrahimi T, editors. Applications of digital image processing XLVII. 2024. vol. 13137, p. 131370. https://doi.org/10.1117/12.3031278. SPIE International Society for Optics and Photonics.
16. Roozmeh M, Lavagno L. Implementation of a performance optimized database join operation on fpga-gpu platforms using opencl. In: 2017 IEEE nordic circuits and systems conference (NORCAS): NORCHIP and international symposium of System-on-Chip (SoC). 2017. p. 1–6. https://doi.org/10.1109/NORCHIP.2017.8124981.
17. Kobayashi R, Fujita N, Yamaguchi Y, Boku T, Yoshikawa K, Abe M, Umemura M. Accelerating radiative transfer simulation with GPU-FPGA cooperative computation. In: IEEE international conference on Application-specific systems, architectures and processors. 2020. https://doi.org/10.1109/ASAP49362.2020.00011.
18. Wang X, Liu L, Huang K, Knoll A. Exploring FPGA-GPU heterogeneous architecture for adas: Towards performance and energy. In: International conference on algorithms and architectures for parallel processing. 2017. https://doi.org/10.1007/978-3-319-65482-9_3.
19. Cho, H; Lee, J; Lee, J. Farnn: FPGA-GPU hybrid acceleration platform for recurrent neural networks. IEEE Trans Parallel Distrib Syst; 2022; 33,
20. Hosseinabady M, Zainol MAB, Núñez-Yáñez JL. Heterogeneous FPGA+GPU embedded systems: Challenges and opportunities. ArXiv 2019. https://doi.org/10.48550/arXiv.1901.06331.
21. Sumeet N, Rawat K, Nambiar M, Singhal R. Hetero-vis: a framework for latency optimized heterogeneous deployment of convolutional neural networks. In: Euro-Par 2022: Parallel Processing Workshops: Euro-Par 2022 International Workshops, Glasgow, UK, August 22–26, 2022, Revised Selected Papers. Springer, Berlin, Heidelberg; 2022. p. 171–183. https://doi.org/10.1007/978-3-031-31209-0_13.
22. Liu X, Ounifi HA, Gherbi A, Lemieux Y, Li W. A hybrid GPU-FPGA-based computing platform for machine learning. Proc Comput Sci. 2018;141:104–11. https://doi.org/10.1016/j.procs.2018.10.155. The 9th International Conference on Emerging Ubiquitous Systems and Pervasive Networks (EUSPN-2018).
23. Liang W, Fujita N, Kobayashi R, Boku T. Using intel oneapi for multi-hybrid acceleration programming with GPU and FPGA coupling. HPCAsia ’24 Workshops. Association for Computing Machinery, New York, NY, USA; 2024. p. 69–76. https://doi.org/10.1145/3636480.3637220.
24. Qiao T, Xie Y, Chen H, Xie Y. An FPGA-GPU heterogeneous system and implementation for on-board remote sensing data processing. In: 2023 international conference on field programmable technology (ICFPT). 2023. p. 254–257. https://doi.org/10.1109/ICFPT59805.2023.00035.
25. Karatzas A, Anagnostopoulos I. Balancing throughput and fair execution of multi-DNN workloads on heterogeneous embedded devices. IEEE Trans Emerg Topics Comput. 2024;1–14. https://doi.org/10.1109/TETC.2024.3407055.
26. Xilinx. Zynq UltraScale+ MPSoC: Software Developers Guide, UG1137 (v2020.1). https://www.xilinx.com/support/documentation/sw_manuals/xilinx2020_1/ug1137-zynq-ultrascale-mpsoc-swdev.pdf.
27. Qasaimeh M, Denolf K, Lo J, Vissers K, Zambreno J, Jones PH. Comparing energy efficiency of CPU, GPU and FPGA implementations for vision kernels. In: 2019 IEEE international conference on embedded software and systems (ICESS). 2019. p. 1–8. https://doi.org/10.1109/ICESS.2019.8782524.
28. Asano S, Maruyama T, Yamaguchi Y. Performance comparison of FPGA, GPU and CPU in image processing. In: 2009 international conference on field programmable logic and applications. 2009. p. 126–131. https://doi.org/10.1109/FPL.2009.5272532.
29. Georgis G, Lentaris G, Reisis D. Acceleration techniques and evaluation on multi-core CPU, GPU and FPGA for image processing and super-resolution. J Real-Time Image Process. 2019;16.
30. Nurvitadhi E, Sheffield D, Sim J, Mishra A, Venkatesh G, Marr D. Accelerating binarized neural networks: comparison of FPGA, CPU, GPU, and ASIC. In: 2016 international conference on field-programmable technology (FPT). 2016. p. 77–84. https://doi.org/10.1109/FPT.2016.7929192.
31. Asano S, Maruyama T, Yamaguchi Y. Performance comparison of FPGA, GPU and CPU in image processing. In: 2009 international conference on field programmable logic and applications. 2009. p. 126–131. https://doi.org/10.1109/FPL.2009.5272532.
32. Blott M, Halder L, Leeser M, Doyle L. QuTiBench: benchmarking neural networks on heterogeneous hardware. J Emerg Technol Comput Syst. 2019;15.
33. Lin J, Zhu L, Chen W-M, Wang W-C, Han S. Tiny machine learning: progress and futures [feature]. IEEE Circuits Syst Mag. 2023;23.
34. Banbury C, Zhou C, Fedorov I, Matas R, Thakker U, Gope D, Janapa Reddi V, Mattina M, Whatmough P. MicroNets: neural network architectures for deploying TinyML applications on commodity microcontrollers. In: Smola A, Dimakis A, Stoica I, editors. Proceedings of Machine Learning and Systems. 2021. p. 517–32. https://doi.org/10.48550/arXiv.2010.11267.
35. Moosmann J, Giordano M, Vogt C, Magno M. TinyissimoYOLO: a quantized, low-memory footprint, TinyML object detection network for low power microcontrollers. In: 2023 IEEE 5th international conference on artificial intelligence circuits and systems (AICAS). 2023. p. 1–5. https://doi.org/10.1109/AICAS57966.2023.10168657.
36. Whatmough PN, Zhou C, Hansen P, Venkataramanaiah SK, Seo J-S, Mattina M. FixyNN: efficient hardware for mobile computer vision via transfer learning. 2019. arXiv preprint arXiv:1902.11128. https://doi.org/10.48550/arXiv.1902.11128.
37. Singh R, Gill SS. Edge AI: a survey. Internet Things Cyber-Phys Syst. 2023;3:71–92. https://doi.org/10.1016/j.iotcps.2023.02.004.
38. Luo X, Liu D, Kong H, Huai S, Chen H, Xiong G, Liu W. Efficient deep learning infrastructures for embedded computing systems: a comprehensive survey and future envision. ACM Trans Embed Comput Syst. 2024;24.
39. Cheng Y, Wang D, Zhou P, Zhang T. Model compression and acceleration for deep neural networks: the principles, progress, and challenges. IEEE Signal Process Mag. 2018;35.
40. Deng L, Li G, Han S, Shi L, Xie Y. Model compression and hardware acceleration for neural networks: a comprehensive survey. Proc IEEE. 2020;108.
41. Zhu M, Gupta S. To prune, or not to prune: exploring the efficacy of pruning for model compression. 2017. arXiv:1710.01878.
42. Shkolnik M, Chmiel B, Banner R, Shomron G, Nahshan Y, Bronstein A, Weiser U. Robust quantization: one model to rule them all. In: Larochelle H, Ranzato M, Hadsell R, Balcan MF, Lin H, editors. Advances in neural information processing systems, vol. 33. 2020. https://doi.org/10.48550/arXiv.2002.07686.
43. Gholami A, Kim S, Dong Z, Yao Z, Mahoney MW, Keutzer K. A survey of quantization methods for efficient neural network inference. 2021. arXiv:2103.13630.
44. Gou J, Yu B, Maybank SJ, Tao D. Knowledge distillation: a survey. Int J Comput Vision. 2021;129.
45. Tu Y, Sadiq S, Tao Y, Shyu M-L, Chen S-C. A power efficient neural network implementation on heterogeneous FPGA and GPU devices. In: 2019 IEEE 20th international conference on information reuse and integration for data science (IRI). 2019. p. 193–199. https://doi.org/10.1109/IRI.2019.00040.
46. Carballo-Hernández W, Pelcat M, Berry F. Why is FPGA-GPU heterogeneity the best option for embedded deep neural networks? arXiv preprint. 2021. https://doi.org/10.48550/arXiv.2102.01343.
47. Huang S, Chen D, Hwu W-M, Chang L-W, Hajj I, Garcia De Gonzalo S, Gómez-Luna J, Chalamalasetti SR, El-Hadedy M, Milojicic D, Mutlu O. Analysis and modeling of collaborative execution strategies for heterogeneous CPU-FPGA architectures. 2019. p. 79–90. https://doi.org/10.1145/3297663.3310305.
48. Rodríguez A, Navarro A, Asenjo R, Corbera F, Gran R, Suárez D, Nunez-Yanez J. Exploring heterogeneous scheduling for edge computing with CPU and FPGA MPSoCs. J Syst Architect. 2019;98:27–40. https://doi.org/10.1016/j.sysarc.2019.06.006.
49. Zhang C, Yu H, Zhou Y, Jiang H. High-performance and energy-efficient FPGA-GPU-CPU heterogeneous system implementation. In: Arabnia HR, Deligiannidis L, Grimaila MR, Hodson DD, Joe K, Sekijima M, Tinetti FG, editors. Advances in parallel & distributed processing, and applications. Cham: Springer; 2021. p. 477–492.
50. Khetawat H, Mueller F. Workload scheduling on heterogeneous devices. In: ISC high performance 2024 research paper proceedings (39th International Conference). 2024. p. 1–11. https://doi.org/10.23919/ISC.2024.10528933.
51. Belviranli ME, Bhuyan LN, Gupta R. A dynamic self-scheduling scheme for heterogeneous multiprocessor architectures. ACM Trans Archit Code Optim. 2013;9.
52. Topcuoglu H, Hariri S, Wu M-Y. Performance-effective and low-complexity task scheduling for heterogeneous computing. IEEE Trans Parallel Distrib Syst. 2002;13.
53. Kang D, Oh J, Choi J, Yi Y, Ha S. Scheduling of deep learning applications onto heterogeneous processors in an embedded device. IEEE Access. 2020;8:43980–43991. https://doi.org/10.1109/ACCESS.2020.2977496.
54. Zhao W, Stankovic JA. Performance analysis of FCFS and improved FCFS scheduling algorithms for dynamic real-time computer systems. In: Proceedings of the Real-Time Systems Symposium. 1989. p. 156–165. https://doi.org/10.1109/REAL.1989.63566.
55. Madej A, Wang N, Athanasopoulos N, Ranjan R, Varghese B. Priority-based fair scheduling in edge computing. In: 2020 IEEE 4th international conference on fog and edge computing (ICFEC). 2020. p. 39–48. https://doi.org/10.1109/ICFEC50348.2020.00012.
56. ARM Ltd. AMBA Advanced Microcontroller Bus Architecture Specification. 2023.
57. PCI-SIG. PCI Express Base Specification, Revision 2.1. 2023.
58. Bittner R, Ruf E. Direct GPU/FPGA communication via PCI Express. In: 2012 41st international conference on parallel processing workshops. 2012. p. 135–139. https://doi.org/10.1109/ICPPW.2012.20.
59. Thoma Y, Dassatti A, Molla D. FPGA2: an open source framework for FPGA-GPU PCIe communication. In: 2013 international conference on reconfigurable computing and FPGAs (ReConFig). 2013. p. 1–6. https://doi.org/10.1109/ReConFig.2013.6732296.
60. Paszke A, Gross S, Massa F, Lerer A, Bradbury J, Chanan G, Killeen T, Lin Z, Gimelshein N, Antiga L, Desmaison A, Köpf A, Yang E, DeVito Z, Raison M, Tejani A, Chilamkurthy S, Steiner B, Fang L, Bai J, Chintala S. PyTorch: an imperative style, high-performance deep learning library. 2019. https://doi.org/10.48550/arXiv.1912.01703.
61. Xilinx. Vitis Unified Software Platform Documentation, UG1400 (v2019.2). San Jose, CA: Xilinx; 2020.
62. Liu J, Liu D, Yang W, Xia S, Zhang X, Dai Y. A comprehensive benchmark for single image compression artifacts reduction. arXiv preprint. 2019. https://doi.org/10.48550/arXiv.1909.03647.
63. Otii Arc Pro. 2023. https://www.qoitech.com/otii-arc-pro/.
64. Tu Y, Sadiq S, Tao Y, Shyu M-L, Chen S-C. A power efficient neural network implementation on heterogeneous FPGA and GPU devices. In: 2019 IEEE 20th international conference on information reuse and integration for data science (IRI). 2019. p. 193–199. https://doi.org/10.1109/IRI.2019.00040.
65. Carballo-Hernández W, Pelcat M, Berry F. Why is FPGA-GPU heterogeneity the best option for embedded deep neural networks? arXiv preprint. 2021. https://doi.org/10.48550/arXiv.2102.01343.
© The Author(s) 2025. This work is published under the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/).