1. Introduction
Computer vision applications like image recognition, object detection, feature extraction, and gesture recognition are key operations in industrial automation. These applications are only viable when their accuracy and efficiency match or exceed human performance. With Convolutional Neural Network (CNN) models that surpass human accuracy, such applications are now a reality. A CNN is a type of neural network that has been highly instrumental in processing image data for feature extraction [1]. As shown in Figure 1, each CNN layer performs a combination of operations, such as convolution, normalization, activation, and pooling, on the input image or Input Feature Map (IFM) to generate an Output Feature Map (OFM). Several such layers are cascaded to form a complete network that performs the image analysis.
Several CNN models have been developed over the years to improve accuracy and reduce error rates by increasing the number of network layers. Notable milestones include LeNet-5 with seven layers (1998), AlexNet with eight layers (2012), ZFNet (2013), VGGNet-16 with 16 layers (2014), GoogLeNet with 22 layers (2014), and finally ResNet (2015) with 152 layers. With skip connections, ResNet achieved an error rate of 3.57%, exceeding human-level accuracy, as shown in Figure 2.
TC = Σ_{i=1}^{n} CL_i × WC_i (1)
TMAC = Σ_{i=1}^{n} CL_i × WC_i × W_i² × O_i² (2)
TGOP = (2 × TMAC) / 10⁹ (3)
where TC denotes the total number of convolutions for all layers; i denotes the layer index; n denotes the total number of layers; CL denotes the number of input channels in a layer; WC denotes the number of convolution weight kernels (filters) in a layer; W denotes the convolution kernel width of a layer; O denotes the output feature map width of a layer; TMAC denotes the total number of MAC operations for all layers; and TGOP denotes the total giga floating-point operations required to complete the convolutions across all layers.
While increasing the number of CNN layers has improved the error rate, it has also increased the network’s computation requirements, which can be analyzed using Equations (1)–(3). Applying these equations to the AlexNet, VGG16, and ResNet18 models, the number of convolutions and the resulting number of floating-point (FP) operations can be derived, as shown in Table 1.
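As a worked example of Equations (1)–(3), the short Python sketch below recomputes the AlexNet row of Table 1; the per-layer dimensions used here are assumptions consistent with Table 8.

```python
# Illustrative application of Equations (1)-(3) to AlexNet.
# Layer tuples are (input channels CL, filters WC, kernel width W, output width O);
# the dimensions are assumptions consistent with Table 8.
alexnet_layers = [
    (3,   96, 11, 54),   # Layer 1
    (96, 256,  5, 26),   # Layer 2
    (256, 384, 3, 12),   # Layer 3
    (384, 384, 3, 12),   # Layer 4
    (384, 256, 3, 12),   # Layer 5
]

TC   = sum(cl * wc               for cl, wc, w, o in alexnet_layers)  # Eq. (1)
TMAC = sum(cl * wc * w**2 * o**2 for cl, wc, w, o in alexnet_layers)  # Eq. (2)
TGOP = 2 * TMAC / 1e9                                                 # Eq. (3): 2 FLOPs per MAC

print(TC, TMAC, round(TGOP, 2))  # 368928 962858112 1.93, matching Table 1
```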
To improve the performance of CNN models, several novel features have been proposed to reduce memory usage and computations while improving accuracy. Activation functions like the Rectified Linear Unit (ReLU) have added non-linearity to enhance learning capacity, normalization techniques like batch normalization (BNORM) have addressed issues with unbounded activation outputs, and pooling layers have reduced the size of feature maps while introducing translation invariance. Opting for lower-precision data types, like FP16 and INT8, through quantization has mitigated the increased computational demands and memory footprint, thus reducing computation time, power consumption, and memory bandwidth requirements and enabling deeper network designs [2,3,4,5,6].
1.1. FPGA for CNN Acceleration
Hardware platforms for processing CNN models include CPUs, GPUs, and FPGAs, with their respective advantages listed in Table 2 [1,7,8]. GPUs offer the highest raw processing capability. However, unlike CPUs, GPUs are not a standalone solution: they require a host CPU to deploy workloads through compute libraries such as the Open Computing Language (OpenCL) and the Compute Unified Device Architecture (CUDA), and these libraries require frequent intervention from the CPU throughout CNN operation.
Unlike CPUs and GPUs, FPGAs are available with a wide range of configurable resources and are cost-effective and energy-efficient. In addition to these advantages, FPGAs can be used to offload complete or partial computations, making them a good choice for both standalone hardware processing and accelerators for CPUs [9].
FPGAs provide LUTs, DSP blocks, BRAM, and hard macros supported by various IO interfaces, but for CNN acceleration, the resources responsible for computation are the primary selection criteria. CNN accelerators require sufficient on-chip memory and high-speed off-chip memory interfaces to move data efficiently between memory and the design. Once the data are loaded, performance is defined by the DSP blocks responsible for computation. The key to a successful solution is to choose an FPGA with enough computing resources for the network and an efficient accelerator architecture that derives maximum performance from the hardware.
1.1.1. DSP Blocks
The number of computations feasible in hardware is determined by the number of DSP blocks in the FPGA and the data type used by the network. For example, realizing one MAC unit in an AMD-Xilinx FPGA requires four DSP blocks for FP32 and three DSP blocks for FP16. A theoretical performance of 13.3 GOPS can be achieved using FPGAs with 2500 DSP blocks and an operating frequency of 200 MHz. Thus, an accelerator design that uses the DSP blocks efficiently and achieves the maximum operating frequency plays a key role in realizing performance close to the theoretical value.
1.1.2. Internal Memory
Internal memory built from BRAM is the fastest option, delivering one sample per clock, unlike external memory, which requires multiple clock cycles per access. Despite the capacity advantage of external memory, internal (on-chip) memory that stores the input and weights locally is mandatory to ensure cycle-to-cycle operation and keep DDR bandwidth under control [9]. For example, in the first layer of ResNet18, the input data width is 224, the weight width is 7, and the numbers of channels and filters are 3 and 64, respectively, requiring a bandwidth of 4.88 Mbps for a one-time fetch of the input and weights at one frame per second. However, to generate every output, each input is read multiple times according to the weight size, which imposes a massive overhead on the DDR: based on Equation (7), the input bandwidth without internal memory (IBWO) needed for the first layer is 1200.5 Mbps (1619.3 Mbps for all layers). This overhead can only be avoided by providing internal (on-chip) memory to store the input and weights locally.
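These figures can be reproduced with the expressions given in Equations (4) and (7) below; the short sketch assumes FP32 (32-bit) data and the first-layer parameters quoted above, and treats 1 Mb as 2^20 bits.

```python
MB = float(2 ** 20)  # the Mb figures in the text correspond to 2^20 bits

# ResNet18, layer 1 (assumed parameters): 3 channels, 224x224 input,
# 64 filters, 7x7 kernel, stride 2, padding 3, FP32 (32-bit) data.
C, L, N, W, S, P, DT = 3, 224, 64, 7, 2, 3, 32
O = (L + 2 * P - W) // S + 1                      # output feature map width = 112

one_time_fetch  = (C * L**2 + C * N * W**2) * DT  # input + weights fetched once (Eq. (4)/(5))
reread_no_cache = N * W**2 * O**2 * DT            # repeated input reads without on-chip memory (Eq. (7))

print(round(one_time_fetch / MB, 2))   # 4.88   Mb per frame
print(round(reread_no_cache / MB, 1))  # 1200.5 Mb per frame without internal memory
```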
Equation (4) shows the expression for determining the internal (on-chip) memory size, i.e., the input size, required to store the entire input image data and the respective weights to perform a convolution operation. Based on the FPGA selected and the network layer, the on-chip memory may not be able to store the complete input image, so the design must efficiently manage memory by holding small chunks sufficient to store the data required for current processing while retrieving additional data from external memory when necessary.
Internal Memory Size = (C × L² + C × N × W²) × DT (4)
1.1.3. External Memory
External memory is the default choice for storing complete data (including inputs, weights, and intermediate data). FPGA external memory can be one of two types:
High-Bandwidth Memory (HBM);
Double Data Rate (DDR) memory.
HBM-based FPGAs are not the first choice due to their higher power consumption and cost. DDR memory, on the other hand, is a cheaper and more widely available option. Thus, even though the latency of DDR memory is higher than that of HBM, if an accelerator is designed to utilize the bandwidth efficiently, DDR memory is the ideal choice for storage. Equation (5) shows the expression for determining the input bandwidth (IBW) required per layer, while Equation (6) shows the expression for determining the output bandwidth (OBW). Without internal memory, all fetches are routed to external memory, and the input bandwidth without internal memory (IBWO) can be calculated using Equation (7). Using these equations, Table 3 lists the DDR bandwidth requirements for various CNN models.
IBW = F × (C × L² + C × N × W²) × DT (5)
OBW = F × N × ((L + 2P − W)/S + 1)² × DT (6)
IBWO = F × N × W² × ((L + 2P − W)/S + 1)² × DT (7)
where F denotes the number of frames per second, C denotes the number of channels, L denotes the input width, W denotes the weight width, DT denotes the data type width (in bits), N denotes the number of output channels, P denotes the padding for convolution, and S denotes the stride for convolution.
1.2. FPGA Classification
FPGAs are classified as follows, offering a choice in terms of resources and technology used, depending on the application requirements:
Conventional FPGAs are used as accelerator add-on cards and do not include a hard-macro CPU capable of operating at speeds around 1 GHz. Soft CPUs (such as MicroBlaze) can be added to build a standalone application, but at the cost of logic resources and a lower operating frequency.
SoC FPGAs have a high-speed processing system (PS) with a high-end CPU and memory, which interfaces with the Programmable Logic (PL) section via AXI, making them suitable for standalone embedded applications. SoC FPGAs have PL memory that is dedicated to the fabric and is not shared with any other peripheral, while the PS memory is shared with peripherals and the CPU core to execute the code.
1.3. FPGA Selection
As shown in Figure 3, the market offers a wide range of FPGAs, differing in resources, performance, and price. FPGAs with more resources (DSPs, LUTs, etc.) are the natural choice for applications that demand the highest performance and where cost is not a constraint. However, selecting such FPGAs for high-volume, cost-sensitive applications defeats the purpose of using FPGAs, as they can then be outperformed by GPUs. Moreover, while the focus is on compute-specific resources, such FPGAs offer other resources, like BRAM and LUTs, that may remain underutilized. Not opting for high-resource FPGAs does not imply compromising performance: although resources and price are directly proportional, the performance required by the application should be the criterion for selecting an FPGA. For a given application, similar performance can be achieved using low- or moderate-resource FPGAs, as the resources the application requires remain constant irrespective of how many the FPGA offers. Thus, the performance achieved using a specific high-resource FPGA can also be achieved using lower-cost FPGAs.
Therefore, FPGA selection is a critical aspect of CNN accelerator design implementation. It is important to choose an FPGA that balances cost and performance to meet application needs.
The scalability of the design provides flexibility for the accelerator to choose among a wide range of FPGAs. By using multiple small FPGAs, the system can be expanded incrementally to meet the growing computational demands without requiring a complete redesign of the accelerator. Most importantly, this approach is cost-effective, as smaller FPGAs are less expensive than equivalent single, large FPGAs [10].
1.4. Accelerator Configuration Types
Looking over the various implementations of FPGA-based accelerators, the primary variation lies in the processing unit used for computations. CNN accelerators can be configured in two different ways:
Fixed configuration;
Configurable.
Accelerators with a fixed configuration are designed for a specific network to achieve target performance and fine-tuned for a dedicated application. They are also more efficient in utilizing FPGA resources. On the other hand, configurable accelerators have a group of Processing Elements (PEs) that can process multiple networks. The PE receives input signals, processes them into a single output signal, and sends that generated output signal to a PE in the next layer in the network. This is a more generic implementation and is often inefficient in terms of performance.
1.5. Performance Evaluation
As the accelerator configuration changes depending on the CNN network, it is important to ensure that the configuration implemented for the particular network is ideal. Performance monitoring helps optimize the efficiency and effectiveness of the accelerator configuration deployed by tracking resource utilization, latency, and throughput. It helps identify bottlenecks, enabling targeted optimizations and ensuring that resources, such as DSP slices and memory, are utilized efficiently. By providing real-time data on performance and power consumption, performance monitoring facilitates adjustments to maximize the accelerator’s performance.
1.6. Organization of This Paper
The rest of this paper is organized as follows. Section 2 provides some background research on the previous architectures of FPGA-based CNN accelerators. Section 3 explains the operation and architecture of the proposed VCONV IP. Section 4 discusses the scenarios, networks, hardware, and software used to test the proposed Intellectual Property (IP). Section 5 presents the experimental results for VCONV on the AlexNet, VGG16, and ResNet18 networks. Finally, Section 6 concludes this paper and discusses future prospects.
2. Background
The previous section discussed the concepts of convolution and the basis of FPGA selection for CNN accelerators. This section provides an overview of the proposed designs by various authors for FPGA-based CNN accelerators.
With the various CNN models in the industry today, numerous CNN accelerators have been proposed, demonstrating efficient accelerator implementations using FPGAs instead of GPUs. Accommodating the diverse requirements of all CNN models with maximum performance and efficiency is crucial when developing an FPGA-based CNN accelerator. Several parameters are considered for meeting such a complex requirement and its constraints, as discussed below.
CNN models are available for different data types like integer, floating-point, and fixed-point representations. ResNet50, for example, comes with FP32, while MobileNetV2 comes with FP16. FP64 (double-precision floating-point) is typically used during phases where high numerical precision is crucial, while FP32 (single-precision floating-point)/INT8/INT16 are commonly used for general CNN training and inference. Most FPGA accelerators are designed to accommodate up to FP32 [11], as FP64 requires high computational cost and memory requirements.
The data interface between the external memory and the accelerator needs to be low on complexity without protocol overhead and provide maximum performance. AXI, PCIe, and DMA are commonly used for streaming data directly to CNN accelerators [12,13]. However, efficient data handling is essential for real-time CNN processing, and this can be achieved better through the AXI stream interface.
Proper utilization of the external memory and the limited on-chip storage is important to achieve the high performance required for CNN accelerators. One way is for the computing engine to rely mainly on off-chip memory (DDR) to read data row by row from the input matrix [14]. Latency is comparatively higher in such implementations even if various optimization techniques are employed, as higher bandwidth is required for continuous data transmission from the communication interface. Alternatively, the input matrix is rearranged into a matrix format as required by the computing engine and loaded row-wise into the on-chip buffer [15]. Row-stationary or row2column data flow is one such rearrangement to organize data access for minimal memory latency and maximum throughput [12].
Some CNN models require BNORM, ReLU, or pooling operations before the convolution. Therefore, it is ideal to have the option to perform these operations before or after convolution and a monitoring unit to enable/disable them, depending on the CNN model’s requirements. The impact of having inline operations before or after convolution is not well documented.
CNN models execute either 1D, 2D, or 3D convolution depending on whether they are processing one-dimensional, two-dimensional, or three-dimensional data, respectively. Many CNN accelerator designs are tailored to execute only 1D, 2D, or 3D convolution, while only a few can implement multiple convolution types [16]. FPGA-based CNN accelerators are also distinguished based on the type of computing unit used for the convolution operations. A few accelerator designs have PE-based units, while others have MAC-based units [17,18].
CNN models usually involve numerous convolutions that can be performed in parallel to improve design efficiency. Hence, having multiple processing units to distribute the compute load of different convolution layers helps reduce processing time [19,20]. The FPGA configured in each of the multiple processing units can be homogeneous or heterogeneous. Heterogeneous FPGAs allow for different FPGA sizes, depending on the computation requirements of the convolution layer being processed. Heterogeneous FPGAs can efficiently handle CNN parallel processing demands, provide better resource utilization, and reduce power consumption compared to homogeneous systems [21,22].
For deeper CNN models, the number of convolution layers increases, requiring more processing power. Having scalability in the accelerator at the multiplier/MAC unit level allows for control over balancing utilization and performance [12]. To improve the outcome of each CNN layer, operations like BNORM, ReLU, and pooling are applied after convolution [23,24]. The arrangement and execution of these operations alongside convolution affect the respective layer’s efficiency and performance.
Considering the different parameters discussed above, the following factors are needed for an efficient FPGA-based CNN accelerator:
The accelerator should be configurable for multiple data types like floating-point (from FP32 to FP8) and integer formats.
The accelerator should have low latency, high throughput with burst support, and a simple interface to transfer data between memory and the accelerator.
The accelerator should be operable on FPGAs with limited on-chip memory to remain portable across FPGA devices.
The accelerator should support inline operations like normalization, activation, convolution, and pooling in any sequence to be able to cater to the needs of multiple networks.
The accelerator should be scalable at multiple levels to have better control of resource usage (BRAM, DSP) for a selected network.
The accelerator should have an inbuilt performance evaluation system to monitor and evaluate the accelerator’s performance and configuration.
The architecture of the accelerator should support the distribution of solutions across heterogeneous FPGAs.
This paper is an attempt to address the above features in order to realize an efficient CNN accelerator through the VCONV IP.
3. VCONV: A CNN Accelerator
Products are designed based on market needs and use cases, from which technical specifications are derived. Parameters like performance (the number of frames to be processed per second), power, and price (FPGA selection) add constraints on top of existing FPGA limitations such as resources and routing. To cater to a wide range of applications and FPGA choices, the accelerator design must expose enough parameters to control resources such as MACs (DSP utilization), CEs (BRAM utilization), and multiple instances (LUT utilization). The VCONV IP provides knobs to control these parameters while extracting maximum performance from the configured resources.
3.1. Overview
The VCONV IP is interfaced with a host CPU over Advanced Peripheral Bus (APB) and Advanced eXtensible Interface (AXI) Master and Slave stream interfaces, as shown in Figure 4. The Register Manager is the host interface for configuring the CNN parameters shown in Table 4 and the control parameters shown in Table 5 for the various modules. The control parameters configure the data path of the data arriving on the AXI Slave stream interface.
After the host configures the registers, the IP receives the weights and input image on the AXI Slave stream interface and stores them in the On-Chip Memory (OCM) Manager. To avoid frequent fetches from DDR, the on-chip memory acts as a cache large enough to complete one row of convolutions. The OCM Manager uses two buffers (a weight buffer and a line buffer) to store the weights and the input image, respectively. No additional configuration is required for this data transfer, but the sequence (weights followed by input data) must be maintained. The OCM Manager notifies the Slide Manager (SM) to initiate operations once the line and weight buffers are filled, and it pauses the transfer on the AXI Slave stream interface until the SM signals that the compute operations are complete, after which the next lines of the image are fetched. The SM schedules the data reads (Im2col operation) from the OCM, storing them in the CE FIFOs, ready for the compute operations.
The SM schedules the transactions considering the number of Convolution Engines (CEs) and operates independently of the number of Convolution Units (CUs) and Multiply Accumulate (MAC) units per CU.
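For clarity, the following NumPy sketch shows what the Im2col rearrangement does functionally for a single channel; the hardware SM performs the equivalent address sequencing when reading the line buffer, and the function and sizes here are purely illustrative.

```python
import numpy as np

def im2col(x, k, stride):
    """Rearrange a single-channel 2D input into columns so that each column holds
    one k x k window; convolution then reduces to a matrix product. Functional
    illustration of the Im2col step only, not the hardware implementation."""
    H, W = x.shape
    out_h = (H - k) // stride + 1
    out_w = (W - k) // stride + 1
    cols = np.empty((k * k, out_h * out_w), dtype=x.dtype)
    idx = 0
    for i in range(0, H - k + 1, stride):
        for j in range(0, W - k + 1, stride):
            cols[:, idx] = x[i:i + k, j:j + k].ravel()
            idx += 1
    return cols, (out_h, out_w)

x = np.arange(36, dtype=np.float32).reshape(6, 6)     # toy 6x6 input
w = np.ones((3, 3), dtype=np.float32)                 # toy 3x3 kernel
cols, (oh, ow) = im2col(x, 3, stride=1)
y = (w.ravel() @ cols).reshape(oh, ow)                # equivalent to a 3x3 convolution, stride 1
```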
Each CE is responsible for executing the convolution operations on the data in the FIFO. On completion of the operation, the output is subjected to additional operations (i.e., BNORM, ReLU, and MaxPooling). The sequence of operations is managed by the Control Manager (CM) based on the configuration of the registers. As the other operations are performed inline, on completion, the data are streamed out on the AXI Master stream. The Performance Monitor (PM) configured inside the CU keeps track of the idle and active cycles of MAC units, which are discussed in Section 3.5. In summary, the VCONV IP has the following features:
Configurable registers on the APB interface.
Receives the input and weight data on the AXI Slave stream interface.
Has OCM to reduce memory bandwidth by avoiding repeated fetching of the same data.
Supports floating-point (FP32, FP16, and FP8) and integer data types, thereby supporting quantized models.
Configurable MACs per CU to handle multiple operations at the same time.
Configurable CEs, consisting of FIFOs and CUs, to enable parallel convolutions for the same layer.
Supports compute operations like BNORM, ReLU, and MaxPooling apart from convolution. Operations like normalization and activation can be performed before or after convolution to meet the requirements for layer processing.
Has a Control Manager to define the data flow path of the IP.
Streams out data on the AXI Master stream interface after completing the computations.
Implements a performance monitoring system to measure the efficiency of MACs.
Can be scaled and instantiated multiple times with different configurations to support the pipelining of multiple layers of CNNs.
With its scalability feature, the IP can be ported for networks regardless of their complexity and FPGA device capacity.
3.2. Compute Operations
One of the advantages of the VCONV IP is its support for inline compute operations before or after convolution. These operations can be enabled in the hardware configuration and disabled at runtime by software as needed.
The supported operations are BNORM, ReLU, and MaxPooling and can be expanded to support other operations, such as average pooling, using the same architecture.
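For reference, the sketch below shows the arithmetic of these inline operations on one output channel in NumPy; it is a functional illustration only, not the hardware data path, and the statistics and scale/shift values are placeholders.

```python
import numpy as np

def bnorm(x, mean, var, alpha, beta, eps=1e-5):
    # per-channel batch normalization: scale and shift the normalized activations
    return alpha * (x - mean) / np.sqrt(var + eps) + beta

def relu(x):
    return np.maximum(x, 0.0)

def maxpool2x2(x):
    # non-overlapping 2x2 max pooling
    h, w = x.shape
    return x[:h - h % 2, :w - w % 2].reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

ofm = np.random.randn(8, 8).astype(np.float32)   # stand-in for one convolution output channel
out = maxpool2x2(relu(bnorm(ofm, ofm.mean(), ofm.var(), 1.0, 0.0)))
```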
3.3. Multi-Row Line Buffer
FPGA BRAM outputs one data element per clock cycle. Feeding input (from BRAM) to the parallel convolutions of a multi-CE architecture is therefore usually done in one of two ways:
Feeding the CEs one after the other: This approach is inefficient because it cannot feed multiple CEs at the same time. The second CE is starved of data until the input data required for the convolution in the first CE have been loaded.
Feeding the same data (on the same clock cycle) to multiple CEs: This approach is also inefficient, as the design becomes increasingly complex as the number of CEs grows.
In VCONV, for better efficiency, the line buffer is configured as a group of lines (as shown in Figure 5), sufficient for convolutions of a set of rows of the input image, plus a few additional lines to avoid delays in fetching the next lines.
The size of the line buffer can be calculated using Equation (9), where L is the line width, W is the weight width, S is the convolution stride, and DT is the data type width. The size of the weight buffer is W², which holds the entire weight kernel.
Line Buffer Size = L × (W + S) × DT (9)
For example, if Equation (9) is applied to the first layer of ResNet18, where the input data width is 224, the weight width is 7, and the stride is 2, the BRAM size required will be 64,512 bits, which is feasible in an FPGA.
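A small sketch of Equation (9), reproducing the 64,512-bit figure; the weight-buffer term assumes a single W × W kernel held at the configured data width.

```python
def line_buffer_bits(L, W, S, DT):
    # Equation (9): (W + S) input lines of width L, each element DT bits wide
    return L * (W + S) * DT

def weight_buffer_bits(W, DT):
    # assumed: one W x W kernel held at the configured data width
    return W * W * DT

print(line_buffer_bits(224, 7, 2, 32))  # 64512 bits for ResNet18 layer 1 (FP32)
print(weight_buffer_bits(7, 32))        # 1568 bits
```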
This approach, which involves having a dedicated line buffer per row, is highly efficient as it allows access to the next row while the first row is still in use. In Figure 6, it can be seen that this implementation allows CEs to have data ready for access from any line without idle cycles.
A multiplexer is used to manage the selection of row elements, as shown in Figure 5. For example, in one clock cycle, CE1 is provided with data A3, and CE0 is provided with data B0, which is possible only by managing the line selection through a multiplexer.
The VCONV IP is configurable for different data types, so the data width of the OCM and CE modules is adjusted accordingly to support quantized models.
3.4. Multi-CE Convolution Engine Module with Scalable MAC Units
A CE comprises a CU with dedicated input and output FIFOs and a configurable number of MAC units. In the multi-CE architecture, the number of CEs is configured based on the required degree of parallel convolution and the target computation time.
The CU has MAC units that are responsible for performing multiply and accumulate operations. The number of MACs in a CU can vary from 1 to the weight size. In the VCONV IP, the MAC operation is performed by the AMD floating-point IP, which also supports multiplication, addition/subtraction, accumulation, fused multiply-add, division, square root, and comparison operations and supports data widths of FP32/16/8.
While the number of MACs, CUs, and CEs is scalable, it results in wire connections carrying the same data to multiple endpoints, which becomes critical in low-end FPGAs where the fabric does not have enough routing resources. The solution to this problem is either to reduce the target frequency or enable resource duplication during the synthesis operation, which increases LUT usage.
3.5. Performance Monitoring
The efficiency of the design can be measured by calculating its idle time caused by dependencies on neighboring modules.
As shown in Figure 7, each MAC has a dedicated signal to indicate activity and idle time. These signals culminate at the CE level for active and idle times, incrementing two different counters that can be read after the process is completed to check performance. This way, the MACs’ effective utilization can be measured.
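The counters translate directly into a utilization figure; a minimal post-processing sketch is shown below (reading the counter values is platform-specific and not shown).

```python
def mac_utilization(active_cycles, idle_cycles):
    """Fraction of cycles in which the MACs were busy, from the CE-level
    active/idle counters described above."""
    total = active_cycles + idle_cycles
    return active_cycles / total if total else 0.0

# Example using one row of Table 9 (ResNet18, 56 CEs): zero inactive cycles -> 100% utilization.
print(mac_utilization(104_272, 0))  # 1.0
```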
4. Implementation Methodology
Open Neural Network Exchange (ONNX) is an open-source framework that provides a single file for a selected network consisting of information related to inter-layer connectivity, weights, and biases. Table 6 shows the list of parameters extracted per layer from the ONNX file. This information can be used as a reference to configure the VCONV IP and also provide input to the SW to program the IP.
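To illustrate the kind of per-layer information listed in Table 6, the sketch below walks a model graph with the onnx Python package; the file name is illustrative, and the full parser additionally writes weight binaries and header files as described in Section 4.2.1.

```python
import onnx

model = onnx.load("resnet18.onnx")  # path is illustrative

# tensor name -> shape of every stored weight/bias
weights = {init.name: list(init.dims) for init in model.graph.initializer}

for node in model.graph.node:
    if node.op_type in ("Conv", "MaxPool", "BatchNormalization"):
        attrs = {a.name: list(a.ints) for a in node.attribute if a.ints}
        print(node.op_type,
              attrs.get("kernel_shape"),   # weight/pool kernel size
              attrs.get("pads"),           # padding
              attrs.get("strides"),        # stride
              [weights.get(i) for i in node.input if i in weights])  # weight tensor shapes
```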
The following tools were used for the architecture, design, development, verification, and validation of the VCONV IP.
- AMD-Xilinx Vivado 2020.2 was used for the VCONV IP’s design, simulation, synthesis, and implementation.
- Python was used to extract the parameters from the ONNX model.
- Google Sheets was used for all calculations and graph generation.
- Google Slides was used to draw block diagrams.
4.1. FPGA Design
From the information extracted from ONNX, the VCONV IP can be efficiently configured to achieve maximum performance and fully utilize the FPGA resources. For example, different networks have different sequences to be processed in Layer 1, as shown in Table 7. Based on this information, the IP is configured as follows:
Decide on the number of VCONV IPs to be instantiated.
Enable or disable the compute steps within a layer, tuned according to the network configuration.
Based on the input image and kernel size, the BRAM size is configured in the IP.
Based on the number of available DSP macros and instances, the number of CEs and MACs per CE is configured.
For example, a CNN accelerator based on AlexNet’s five layers requires 0.962 giga-MAC operations in total (Table 8), with each layer processed within 500 ms, to achieve 2 FPS. The VCONV IP can be configured using CEs, CUs, and MACs to process each layer within 500 ms, as shown in Table 8. Each of these VCONV instances can be pipelined to achieve a throughput of 2 FPS. Based on the number of VCONV instances, a minimum of 179,650 LUTs, 436 DSPs, and 413K BRAM will be required, either in a single FPGA or spread across multiple FPGAs. With the advantage of a scalable architecture, realizing an IP instance for each layer is possible.
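As a sanity check on this budget, the sketch below derives the MAC throughput each pipelined stage must sustain for 2 FPS, using the per-layer MAC counts from Table 8; how that throughput is met by a particular CE/CU/MAC configuration is what Table 8 enumerates.

```python
# Pipeline budget for the AlexNet configuration in Table 8 (illustrative).
fps_target = 2
stage_deadline_s = 1.0 / fps_target   # 500 ms: each pipelined VCONV stage must finish its layer in this time

layer_macs = [101_616_768, 415_334_400, 127_401_984, 191_102_976, 127_401_984]  # from Table 8

for i, macs in enumerate(layer_macs, start=1):
    rate = macs / stage_deadline_s
    print(f"Layer {i}: {rate / 1e9:.2f} GMAC/s sustained")  # required per-stage throughput
```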
4.1.1. Integration
The configured VCONV IPs are connected to the DMA controller using the AXIS write (VCONV input) and read (VCONV output) interfaces, as shown in Figure 8. The VCONV IP register interface on the APB interface is connected directly to the CPU. This implementation allows the CPU to first configure the VCONV IP, followed by the DMA engine, which initiates processing.
The FP operator (as discussed in Section 3.4) differs in utilization depending on the FPGA architecture, particularly for LUTs, flip-flops, and DSPs. For example, the multiply-and-add operation requires 700 LUTs, 1085 FFs, and 2 DSP blocks in the Zynq-7000 series FPGA.
4.1.2. Design, Verification, and Implementation
The VCONV IP is developed using Verilog HDL for better implementation control and is simulated through a test bench developed using AXI VIP to interface with the APB and the AXI Master and AXI Slave stream interfaces of the IP. The design is synthesized for a 200 MHz operating frequency and implemented using the same tool.
4.2. Software
The software is divided into two components: the ONNX parser, which runs on a host computer to extract the parameters listed in Table 6, and the VCONV software application, which runs on the PS of the FPGA platform to configure the VCONV IP and enable operation.
4.2.1. ONNX Parser
The ONNX parser is a Python program that extracts parameters from the ONNX file and writes them to different files in a specific format:
The weights per layer are stored in separate binary files.
BNORM/convolution/maxpool parameters per layer are stored in header files.
Connections between layers are stored as descriptors with pointers to weights and inputs in header files.
4.2.2. VCONV SW Application
The VCONV SW application is a C program running on PetaLinux on the PS and performs the following tasks (a minimal configuration-flow sketch follows this list):
It downloads the weights to the DDR memory.
It configures the parameters for the VCONV IP through register programming.
It configures the descriptors to the DMA engine to initiate data fetching from DDR memory and provide it to the VCONV IP.
It waits for DMA read/write completion and verifies the output contents for correctness.
After checking for correctness, the performance monitor registers are read back to analyze performance metrics and identify the MACs’ busy and idle cycles.
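The following sketch illustrates the register-level flow described above (configure parameters, then read back the performance counters). It is written in Python only for brevity; the actual application is the C program described above, and the base address, register offsets, and control bits shown here are placeholders, not the real VCONV register map.

```python
# Hypothetical sketch of the VCONV configuration flow on the PS (PetaLinux).
import mmap, os, struct

VCONV_BASE = 0x43C0_0000                       # placeholder APB base address
REG_INPUT_SIZE, REG_WEIGHT_SIZE, REG_CTRL = 0x00, 0x04, 0x08   # placeholder offsets
REG_ACTIVE_CNT, REG_IDLE_CNT = 0x10, 0x14                      # placeholder counter offsets

fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
regs = mmap.mmap(fd, 0x1000, offset=VCONV_BASE)

def wr(off, val): struct.pack_into("<I", regs, off, val)
def rd(off):      return struct.unpack_from("<I", regs, off)[0]

wr(REG_INPUT_SIZE, 224)       # configure CNN parameters (Table 4)
wr(REG_WEIGHT_SIZE, 11)
wr(REG_CTRL, 0b1011)          # enable the desired operations (Table 5); bit mapping is illustrative
# ... set up DMA descriptors, wait for completion, verify the output ...
print(rd(REG_ACTIVE_CNT), rd(REG_IDLE_CNT))   # read the performance monitor counters
```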
4.3. FPGA Selection
The cost of an FPGA is directly proportional to its resources, so choosing an FPGA with more resources than needed may not be cost-effective. A single FPGA with abundant resources directly impacts power and price, and complex features like high-speed transceivers and hard macros may go unused. Hence, heterogeneous FPGAs with variable resource configurations are desirable.
VCONV’s configurability and scalability allow the end user to select a single FPGA or multiple FPGAs based on performance or cost requirements.
4.4. FPGA Platform Selection
To validate and demonstrate the scalability of the VCONV IP, the Avnet Zedboard is selected. It features an AMD-Xilinx Zynq-7000 AP SoC XC7Z020-CLG484 FPGA with a dual-core ARM Cortex-A9 processor in the PS and 106 K flip-flops, 53 K LUTs, 4.9 Mb BRAM, and 220 DSP slices. The 512 MB DDR3 onboard memory stores the input image, weights, biases, and VCONV output. The design is also synthesized on other FPGAs, such as the XC7A50T and XC7A200T, for utilization analysis.
Based on FPGA utilization for the Zynq-7000 series, the Zedboard can accommodate 100 MAC units given the available DSPs, but due to limited LUT/FF resources, only 50 MACs can be realized. Using the available resources, a CE configuration is chosen to implement all CNN layers.
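A rough budget check, using the per-MAC operator costs quoted in Section 4.1.1 and the XC7Z020 resource counts above, shows why the LUT/FF budget rather than the DSP count bounds the number of realizable MACs; the figures are estimates, since control logic, buffering, and interfaces consume additional LUTs and FFs.

```python
# Rough MAC budget for the Zedboard (XC7Z020), assuming the per-MAC costs of the
# FP32 multiply-and-add operator quoted above (700 LUTs, 1085 FFs, 2 DSPs).
LUTS, FFS, DSPS = 53_200, 106_400, 220
PER_MAC = {"lut": 700, "ff": 1085, "dsp": 2}

budget = {
    "by DSPs": DSPS // PER_MAC["dsp"],   # ~110 MACs
    "by LUTs": LUTS // PER_MAC["lut"],   # ~76 MACs before any control/interface overhead
    "by FFs":  FFS  // PER_MAC["ff"],    # ~98 MACs
}
print(budget)  # the LUT budget, with design overhead, limits the practical count to about 50 MACs
```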
5. Results and Discussion
5.1. Results
To analyze the performance of the VCONV IP, the first layers of AlexNet, VGG16, and ResNet18 were configured according to the sequence of operations listed in Table 7 and implemented for different CE configurations to generate 18 FPGA bitstreams, as listed in Table 9 for the Zedboard. For the first layer in these three CNN models, the input size was 224 × 224, the largest of all layers. The combination of the input size with different weight sizes (11 × 11 (AlexNet), 3 × 3 (VGG16), and 7 × 7 (ResNet18)) and varying padding/stride values stressed the CE, SM, OCM, and CM modules. This setup functionally validated the convolution-related equations and state machines in the IP.
In addition to the above CE configuration validation, the VCONV IP was configured for the requirements of all layers in AlexNet with varied CE, CU, and MAC configurations. Table 10 shows the simulation results for convolutions of different input and weight sizes to confirm the consistent performance of the IP in deeper layers.
Using the VCONV software application developed on PetaLinux, the VCONV IP in the bitstream was enabled to perform the first-layer convolution operation and to capture performance monitor readings at the end of the operation to analyze the active and idle cycles.
The following observations can be made from the simulation waveform for AlexNet Layer 1, as shown in Figure 9:
The IP can be configured and programmed with the convolution parameters for input size, weight size, padding, and stride.
The OCM for the line buffer is divided into rows to implement multiple BRAMs, enabling parallel access.
The OCM is filled before the start of the convolution operation.
The input rows and columns are swept to perform MAC operations as soon as the OCM is filled. MAC ACTIVE and DONE events are monitored to increment active and idle counters for performance analysis.
The convolution output is streamed from the VCONV IP after performing the 121 MAC operations between the input and weight.
Repeating the analysis for the 18 binaries generated across the networks, the active and inactive cycles were monitored and are tabulated in Table 9.
5.2. Observations
With convolution being heavily dependent on DSP blocks, many implementations choose FPGAs with more resources to achieve high performance. In Figure 10, it can be seen that smaller FPGAs can be chosen from the Artix and Kintex family, keeping costs under control without the need to select an FPGA with extensive resources.
Table 11 shows a summary of recent implementations, along with their selected FPGAs, utilization, and costs for AlexNet and ResNet-18. In addition to cost, the VCONV IP has a smaller footprint in BRAM utilization, making it more portable on smaller FPGAs.
The VCONV IP contributes to the realization of a cost-efficient CNN accelerator by enabling the distribution of layer-wise processing across FPGAs and ensuring the efficient utilization of limited resources. As explained in Section 3.3, the VCONV IP has optimal BRAM requirements, and as explained in Section 3.4, the IP is scalable with configurable CE/CU/MAC units to be portable to any FPGA.
As explained in Section 3.3, with the multi-row line buffer architecture, compute resources are utilized for maximum performance, and when idle, the design is reset to save power.
As explained in Section 3.4, the design is capable of handling multiple CE instances to enable simultaneous operations within the layer in a pipelined manner by utilizing the FIFOs, with multiple instances of VCONV enabling parallelism between layers.
From the above results, this paper demonstrated the efficient architecture of the VCONV IP for CNN acceleration, highlighting its capability to parameterize, scale CEs, control execution through software, and measure performance using the monitors implemented and validated on the Zedboard with 100% utilization (zero inactive cycles) of the compute resources.
6. Conclusions and Future Prospects
In this paper, we presented the VCONV IP, a configurable and scalable CNN accelerator that can be implemented across multiple FPGAs to achieve similar or better performance than a single large FPGA while keeping cost and power under control. The VCONV IP was ported to the Avnet Zedboard platform and evaluated for different configurations based on the number of interleaved MACs and VCONVs. The performance of the IP for FP32 was evaluated, and the results showed maximum performance with 100% utilization of compute resources and no idle cycles during operation.
The IP configuration is currently limited to a specific network and data type. This limitation can be addressed with a partial reconfiguration of the FPGA. The IP’s scalable architecture has a high dependency on routing resources, which can be mitigated by limiting the configuration or selecting an appropriate FPGA. Furthermore, an IP core configuration tool can be developed for the VCONV IP to customize the IP while importing the IP for FPGA design.
Conceptualization, S.N.; Methodology, S.N.; Software, S.N.; Validation, S.N. and A.A.P.; Formal analysis, S.N. and A.A.P.; Investigation, S.N. and A.A.P.; Data curation, S.N.; Writing—original draft, S.N.; writing—review and editing, S.N. and A.A.P.; Supervision, A.A.P. All authors have read and agreed to the published version of the manuscript.
The ONNX parser used to extract the information for various CNN models can be downloaded from
Author Srikanth Neelam was employed by the company VConnecTech Systems Pvt Ltd. The remaining authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Computational resources required for various CNN networks.
Year | Network | Layers | TC | TMAC | TGOP |
---|---|---|---|---|---|
2012 | AlexNet | 5 | 368,928 | 962,858,112 | 1.93 |
2014 | VGG16 | 13 | 1,634,496 | 15,346,630,656 | 30.7 |
2015 | ResNet18 | 18 | 1,392,832 | 1,813,561,344 | 3.63 |
Comparison of CPUs, GPUs, and FPGAs.
Aspect | CPU | GPU | FPGA |
---|---|---|---|
Computing Speed | Low | High | Better than CPU and |
Memory | High | High | Sufficient |
Price | Average | High | Low |
Power | Average | High | Low |
Application Domain | Non-Real Time | Real Time | Embedded, Non-Real Time |
DDR read/write and data bandwidth for 1 frame per second.
Network | DDR Read (Mbps) | DDR Write (Mbps) | IBW (Mbps) | IBWO (Mbps) |
---|---|---|---|---|
AlexNet | 4,108,704 | 600,448 | 125.39 | 1206.24 |
VGG16 | 23,792,320 | 13,547,520 | 726.09 | 3720.94 |
ResNet18 | 13,349,568 | 2,483,712 | 407.40 | 1619.30 |
CNN parameters.
Parameters | Description |
---|---|
BNORM Parameters | Mean (μ), Variance (σ²), Alpha (α), Beta (β) |
Convolution Parameters | Input image size, weight size, padding, stride |
MaxPool Parameters | Pool kernel size, padding, stride, output size |
Control parameters.
Controls | Description |
---|---|
Convolution Enable | Enables/disables the convolution operation in the data path |
BNORM Enable | Enables/disables batch normalization before or after convolution |
ReLU Enable | Enables/disables the ReLU activation |
MaxPool Enable | Enables/disables the MaxPooling operation |
Parameters extracted from ONNX per layer.
Convolution | BNORM | MaxPool |
---|---|---|
Input Image Size | Mean (μ) | Output Size |
Input Weight Size | Variance (σ²) | Pool Kernel Size |
Padding | Alpha (α) | Padding |
Stride | Beta (β) | Stride |
Compute operation sequences for different networks.
Network | Step 1 | Step 2 | Step 3 | Step 4 |
---|---|---|---|---|
AlexNet | Convolution | ReLU | LRN | MaxPool |
VGG16 | Convolution | ReLU | ||
ResNet18 | Convolution | BNORM | ReLU | MaxPool |
VCONV configuration for 5 layers in AlexNet.
Parameter | Layer 1 | Layer 2 | Layer 3 | Layer 4 | Layer 5 | Total |
---|---|---|---|---|---|---|
Channels | 3 | 96 | 256 | 384 | 384 | |
Input Size | 224 × 224 | 26 × 26 | 12 × 12 | 12 × 12 | 12 × 12 | |
Weight Count | 96 | 256 | 384 | 384 | 256 | |
Weight Size | 11 × 11 | 5 × 5 | 3 × 3 | 3 × 3 | 3 × 3 | |
No. of MAC Operations | 101,616,768 | 415,334,400 | 127,401,984 | 191,102,976 | 127,401,984 | 962,858,112 |
No. of VCONVs | 2 | 3 | 3 | 3 | 3 | 14 |
No. of CEs | 12 | 3 | 2 | 5 | 7 | 29 |
No. of CUs | 2 | 6 | 6 | 2 | 2 | 18 |
No. of MACs per CU | 2 | 2 | 4 | 6 | 4 | 18 |
Processing Time (ms) | 492.34 | 492.20 | 467.32 | 438.22 | 413.01 | <500 |
Performance monitoring of active and inactive cycle counter results for various networks with different CE configurations.
Network | No. of CEs | MAC Ops/CE | Active Cycles | Inactive Cycles | GOP/s |
---|---|---|---|---|---|
AlexNet | 1 | 3,528,356 | 6,708,384 | 0 | 0.022 |
6 | 58,806 | 1,117,314 | 0 | 0.128 | |
9 | 39,204 | 744,876 | 0 | 0.19 | |
18 | 19,602 | 372,438 | 0 | 0.38 | |
27 | 13,068 | 248,292 | 0 | 0.569 | |
54 | 6534 | 124,146 | 0 | 1.128 | |
VGG16 | 7 | 451,584 | 8,580,096 | 0 | 0.006 |
14 | 64,512 | 1,225,728 | 0 | 0.047 | |
28 | 32,256 | 612,864 | 0 | 0.074 | |
56 | 16,128 | 306,432 | 0 | 0.147 | |
112 | 8064 | 153,216 | 0 | 0.295 | |
ResNet18 | 1 | 614,656 | 11,678,464 | 0 | 0.023 |
7 | 87,808 | 668,352 | 0 | 0.15 | |
14 | 43,904 | 334,176 | 0 | 0.319 | |
28 | 21,952 | 208,544 | 0 | 0.591 | |
56 | 10,976 | 104,272 | 0 | 1.181 | |
112 | 5488 | 104,272 | 0 | 2.35 |
Inactive counter results for different layers and different CE, CU, and MAC configurations in VCONV.
Parameter | Layer 1 | Layer 2 | Layer 3 | Layer 4 | Layer 5 |
---|---|---|---|---|---|
Input size | 224 × 224 | 26 × 26 | 12 × 12 | 3 × 3 | 3 × 3 |
Weight size | 11 × 11 | 5 × 5 | 3 × 3 | 3 × 3 | 3 × 3 |
No. of CEs | 9 | 1 | 1 | 2 | 3 |
No. of CUs | 1 | 1 | 3 | 1 | 1 |
No. of MACs | 1 | 1 | 1 | 3 | 3 |
Active Cycles | 461,524 | 202,796 | 5180 | 2588 | 1724 |
Inactive Cycles | 0 | 0 | 0 | 0 | 0 |
Comparison of CNN accelerators implemented in FPGAs.
Paper | [ | [ | [ | [ | [ | This Paper | |||
---|---|---|---|---|---|---|---|---|---|
Year | 2022 | 2022 | 2022 | 2023 | 2023 | ||||
CNN Model | ResNet101 | VGG16 | ResNet18 | VGG16 | VGG16 | AlexNet | AlexNet | ResNet18 | ResNet18 |
FPGA | VX980T | XC7Z045 | XC7Z045 | XC7Z020 | KU060 | 1-XC7A200T | 5-XC7A200T | 3-XC7A200T | 23-XC7A200T |
LUTs | 480 K | 154 K | 39.1 K | 12.675 K | 141.362 K | 147.65 K | 490.2 K | 542.57 K | 1689.27 K |
DSPs | 3121 | 787 | 214 | 80 | 2338 | 1430 | 11,820 | 2588 | 17,268 |
BRAM | 1456.5 Kb | 18.9 Mb | 4.4 Mb | 52.5 Kb | 22.57 Mb | 1.61 Mb | 1.64 Mb | 5.91 Mb | 7.012 Mb |
FPS | 12.9 | 2 | 15 | 2 | 15 | ||||
Frequency (MHz) | 100 | 150 | 100 | 100 | 66.2 | 100 | 100 | 100 | 100 |
Bit width | 16-bit | 8-bit | 16-bit | ≤12-bit | 32-bit | 32-bit | 32-bit | 32-bit | |
Data Type | Floating-point | Fixed-point | Mixed-precision | Fixed-point | Fixed-point | Floating-point | Floating-point | Floating-point | Floating-point |
GOP/s | 600 | 206 | 46.8 | 8 | 29.87 | 6 | 56 | 12 | 79 |
Price (USD) | 23706 | 1816 | 1816 | 149 | 5391 | 604 | 5817 | 1342 | 8079 |
References
1. Mohaidat, T.; Khalil, K. A Survey on Neural Network Hardware Accelerators. IEEE Trans. Artif. Intell.; 2024; 5, pp. 3801-3822. [DOI: https://dx.doi.org/10.1109/TAI.2024.3377147]
2. Ting, Y.S.; Teng, Y.F.; Chiueh, T.D. Batch Normalization Processor Design for Convolution Neural Network Training and Inference. Proceedings of the IEEE International Symposium on Circuits and Systems (ISCAS); Daegu, Republic of Korea, 22–28 May 2021.
3. Yang, Z.; Wang, L.; Luo, L.; Li, S.; Guo, S.; Wang, S. Bactran: A Hardware Batch Normalization Implementation for CNN Training Engine. IEEE Embed. Syst. Lett.; 2021; 13, pp. 29-32. [DOI: https://dx.doi.org/10.1109/LES.2020.2975055]
4. Lee, J.; Mukhanov, L.; Molahosseini, A.S.; Minhas, U.; Hua, Y.; Del Rincon, J.M.; Dichev, K.; Hong, C.H.; Vandierendonck, H. Resource-Efficient Convolutional Networks: A survey on model-, Arithmetic-, and Implementation-Level techniques. ACM Comput. Surv.; 2023; 55, pp. 1-36. [DOI: https://dx.doi.org/10.1145/3587095]
5. Syed, R.T.; Andjelkovic, M.; Ulbricht, M.; Krstic, M. Towards reconfigurable CNN accelerator for FPGA implementation. IEEE Trans. Circuits Syst. II Express Briefs; 2023; 70, pp. 1249-1253. [DOI: https://dx.doi.org/10.1109/TCSII.2023.3241154]
6. Pacini, T.; Rapuano, E.; Fanucci, L. FPG-AI: A Technology-Independent framework for the automation of CNN deployment on FPGAs. IEEE Access; 2023; 11, pp. 32759-32775. [DOI: https://dx.doi.org/10.1109/ACCESS.2023.3263392]
7. Zeng, K.; Ma, Q.; Wu, J.; Chen, Z.; Shen, T.; Yan, C. FPGA-based accelerator for object detection: A comprehensive survey. J. Supercomput.; 2022; 78, pp. 14096-14136. [DOI: https://dx.doi.org/10.1007/s11227-022-04415-5]
8. Hu, Y.; Liu, Y.; Liu, Z. A Survey on Convolutional Neural Network Accelerators: GPU, FPGA and ASIC. Proceedings of the International Conference on Computer Research and Development (ICCRD); Shenzhen, China, 7–9 January 2022.
9. Hong, H.; Choi, D.; Kim, N.; Lee, H.; Kang, B.; Kang, H.; Kim, H. Survey of convolutional neural network accelerators on field-programmable gate array platforms: Architectures and optimization techniques. J. Real-Time Image Process.; 2024; 21, 64. [DOI: https://dx.doi.org/10.1007/s11554-024-01442-8]
10. Haijoub, A.; Hatim, A.; Arioua, M.; Hammia, S.; Eloualkadi, A.; Guerrero-González, A. Implementing Convolutional Neural Networks on FPGA: A Survey and research. ITM Web of Conferences; EDP Sciences: Les Ulis, France, 2023.
11. Basalama, S.; Sohrabizadeh, A.; Wang, J.; Guo, L.; Cong, J. FlexCNN: An End-to-end Framework for Composing CNN Accelerators on FPGA. ACM Trans. Reconfigur. Technol. Syst.; 2023; 16, pp. 1-32. [DOI: https://dx.doi.org/10.1145/3570928]
12. Kim, D.; Jeong, S.; Kim, J.Y. Agamotto: A Performance Optimization Framework for CNN Accelerator With Row Stationary Dataflow. IEEE Trans. Circuits Syst. I Regul. Pap.; 2023; 70, pp. 2487-2496. [DOI: https://dx.doi.org/10.1109/TCSI.2023.3258411]
13. Kim, H.; Choi, K. Low Power FPGA-SoC Design Techniques for CNN-based Object Detection Accelerator. Proceedings of the 2019 IEEE 10th Annual Ubiquitous Computing, Electronics & Mobile Communication Conference (UEMCON); New York, NY, USA, 10–12 October 2019.
14. Wang, Y.; Liao, Y.; Yang, J.; Wang, H.; Zhao, Y.; Zhang, C.; Xiao, B.; Xu, F.; Gao, Y.; Xu, M. et al. An FPGA-based online reconfigurable CNN edge computing device for object detection. Microelectron. J.; 2023; 137, 105805. [DOI: https://dx.doi.org/10.1016/j.mejo.2023.105805]
15. Kim, V.H.; Choi, K.K. A Reconfigurable CNN-Based Accelerator Design for Fast and Energy-Efficient Object Detection System on Mobile FPGA. IEEE Access; 2023; 11, pp. 59438-59445. [DOI: https://dx.doi.org/10.1109/ACCESS.2023.3285279]
16. Qiu, C.; Wang, X.; Zhao, T.; Li, Q.; Wang, B.; Wang, H.; Wu, W. An FPGA-Based Convolutional Neural Network Coprocessor. Wirel. Commun. Mob. Comput.; 2021; 2021, 3768724. [DOI: https://dx.doi.org/10.1155/2021/3768724]
17. Archana, V.S. An FPGA-Based Computation-Efficient Convolutional Neural Network Accelerator. Proceedings of the 2022 IEEE International Power and Renewable Energy Conference (IPRECON); Kollam, India, 16–18 December 2022.
18. Bai, H.R. A Flexible and Low-Resource CNN Accelerator on FPGA for Edge Computing. Proceedings of the 2023 3rd International Conference on Neural Networks, Information and Communication Engineering (NNICE); Guangzhou, China, 24–26 February 2023.
19. Huang, W.; Wu, H.; Chen, Q.; Luo, C.; Zeng, S.; Li, T.; Huang, Y. FPGA-Based High-Throughput CNN Hardware Accelerator With High Computing Resource Utilization Ratio. IEEE Trans. Neural Netw. Learn. Syst.; 2022; 33, pp. 4069-4083. [DOI: https://dx.doi.org/10.1109/TNNLS.2021.3055814]
20. Tang, S.N. Area-Efficient Parallel Multiplication Units for CNN Accelerators with Output Channel Parallelization. IEEE Trans. Very Large Scale Integr. VLSI Syst.; 2023; 31, pp. 406-410. [DOI: https://dx.doi.org/10.1109/TVLSI.2023.3235776]
21. Jameil, A.K.; Al-Raweshidy, H. Efficient CNN Architecture on FPGA Using High Level Module for Healthcare Devices. IEEE Access; 2022; 10, pp. 60486-60495. [DOI: https://dx.doi.org/10.1109/ACCESS.2022.3180829]
22. Gowda, K.; Madhavan, S.; Rinaldi, S.; Divakarachari, P.B.; Atmakur, A. FPGA-Based Reconfigurable Convolutional Neural Network Accelerator Using Sparse and Convolutional Optimization. Electronics; 2022; 11, 1653. [DOI: https://dx.doi.org/10.3390/electronics11101653]
23. Jiang, W.; Sha, E.H.M.; Zhuge, Q.; Yang, L.; Chen, X.; Hu, J. Heterogeneous FPGA-Based Cost-Optimal Design for Timing-Constrained CNNs. IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst.; 2018; 37, pp. 2542-2554. [DOI: https://dx.doi.org/10.1109/TCAD.2018.2857098]
24. Hall, M.; Betz, V. HPIPE: Heterogeneous Layer-Pipelined and Sparse-Aware CNN Inference for FPGAs. Proceedings of the 2020 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays; Seaside, CA, USA, 23–25 February 2020.
25. Liu, W.; Li, Y.; Yang, Y.; Zhu, J.; Liu, L. Design an Efficient DNN Inference Framework with PS-PL Synergies in FPGA for Edge Computing. Proceedings of the 2022 China Automation Congress (CAC); Xiamen, China, 25–27 November 2022.
26. Sun, M.; Li, Z.; Lu, A.; Li, Y.; Chang, S.E.; Ma, X.; Lin, X.; Fang, Z. FILM-QNN: Efficient FPGA Acceleration of Deep Neural Networks with Intra-Layer, Mixed-Precision Quantization. Proceedings of the 2022 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays; Virtual, 27 February–1 March 2022.
© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Field Programmable Gate Arrays (FPGAs), with their wide portfolio of configurable resources such as Look-Up Tables (LUTs), Block Random Access Memory (BRAM), and Digital Signal Processing (DSP) blocks, are the best option for custom hardware designs. Their low power consumption and cost-effectiveness give them an advantage over Graphics Processing Units (GPUs) and Central Processing Units (CPUs) in providing efficient accelerator solutions for compute-intensive Convolutional Neural Network (CNN) models. CNN accelerators are dedicated hardware modules capable of performing compute operations such as convolution, activation, normalization, and pooling with minimal intervention from a host. Designing accelerators for deeper CNN models requires FPGAs with vast resources, which impacts their advantages in terms of power and price. In this paper, we propose the VCONV Intellectual Property (IP), an efficient and scalable CNN accelerator architecture for applications where power and cost are constraints. VCONV, with its configurable design, can be deployed across multiple smaller FPGAs instead of a single large FPGA to provide better control over cost and parallel processing. VCONV can be deployed across heterogeneous FPGAs, depending on the performance requirements of each layer. The IP’s performance can be evaluated using embedded monitors to ensure that the accelerator is configured to achieve the best performance. VCONV can be configured for data type format, convolution engine (CE) and convolution unit (CU) configurations, as well as the sequence of operations based on the CNN model and layer. VCONV can be interfaced through the Advanced Peripheral Bus (APB) for configuration and the Advanced eXtensible Interface (AXI) stream for data transfers. The IP was implemented and validated on the Avnet Zedboard and tested on the first layer of AlexNet, VGG16, and ResNet18 with multiple CE configurations, demonstrating 100% performance from MAC units with no idle time. We also synthesized multiple VCONV instances required for AlexNet, achieving the lowest BRAM utilization of just 1.64 Mb and deriving a performance of 56 GOP/s.
1 VConnecTech Systems Pvt Ltd., Hyderabad 500011, India;
2 Department of Electrical and Electronics Engineering, BITS Pilani, K K Birla Goa Campus, Sancoale 403726, India