
Abstract

As artificial intelligence (AI) advances, deep learning models are shifting from convolutional architectures to transformer-based structures, highlighting the importance of accurate floating-point (FP) calculations. Compute-in-memory (CIM) enhances matrix multiplication performance by overcoming the von Neumann bottleneck between memory and processing. However, many FPCIM designs struggle to maintain high precision while achieving efficiency. This work proposes a high-precision hybrid floating-point compute-in-memory (Hy-FPCIM) architecture for the Vision Transformer (ViT) based on post-alignment across two different CIM macros: the Bit-wise Exponent Macro (BEM) and the Booth Mantissa Macro (BMM). The high-parallelism BEM efficiently implements exponent calculations in-memory with the Bit-Separated Exponent Summation Unit (BSESU) and the routing-efficient Bit-wise Max Finder (BMF). The high-precision BMM achieves nearly lossless mantissa computation in-memory with efficient Booth 4 encoding and the sense-amplifier-free Flying Mantissa Lookup Table based on 12T triple-port SRAM. The proposed Hy-FPCIM architecture achieves 23.7 TFLOPS/W energy efficiency and 0.754 TFLOPS/mm2 area efficiency, with 617 Kb/mm2 memory density in 28 nm technology. With its almost lossless architecture, the proposed Hy-FPCIM achieves an accuracy of 81.04% in recognition tasks on the ImageNet dataset using ViT, a 0.03% decrease compared to the software baseline. This research presents significant advantages in both accuracy and energy efficiency, providing critical technology for complex deep learning applications.


1. Introduction

The rapid advancement of artificial intelligence (AI) has triggered explosive growth in computing power demands, driving the migration of computing workloads from the cloud to the edge. These workloads typically exhibit high data dependency, exposing traditional von Neumann architectures like GPUs [1] to significant “memory wall” bottlenecks. Compute-in-memory (CIM), based on analog calculation [2,3,4,5], in-memory logic [6], digital calculation [7,8,9,10], and lookup tables [11,12], has proven highly effective in addressing this bottleneck by distributing computational loads within memory or at the memory boundary, significantly enhancing the efficiency of large-scale matrix computations. Recent state-of-the-art analog CIM designs, including PCM-based [13] and RRAM-based [14] macros, have improved analog CIM accuracy and bit-width through drift compensation and multi-bit design, respectively. However, these methods introduce additional overhead and limit scaling capabilities. Meanwhile, with the emergence of complex tasks like large language models (LLMs) [15,16], which are also increasingly applied in security and privacy domains such as phishing webpage detection [17] and program analysis [18], and low signal-to-noise ratio image recognition applications such as synthetic aperture radar (SAR) imagery [19,20], traditional integer-based CIM computations can no longer meet precision requirements. With the increasing complexity of deep learning workloads, especially transformer-based models and high-resolution vision tasks, floating-point CIM (FPCIM) [21,22,23,24,25,26] has become a crucial approach to alleviating today’s computational bottlenecks.

Transformer-based architectures are now the leading deep learning algorithms, significantly advancing natural language processing (NLP) and computer vision. They rely on multi-head self-attention and feed-forward modules that require many floating-point matrix multiplications, demanding higher row parallelism and computational precision than traditional methods. Additionally, these algorithms require more weight storage, increasing the memory density needs. Consequently, FPCIM architectures with high precision, row-level parallelism, and memory density are crucial for energy-efficient and accurate transformer-based inference.

Illustrated in Figure 1, FP multiply–accumulate (MAC) operations involve two main processes, multiplication and addition, each with complex steps. Multiplication requires exponent addition with bias correction and mantissa multiplication, followed by normalization and rounding per IEEE 754. Addition demands exponent comparison and dynamic mantissa alignment, where the mantissa of the operand with the smaller exponent is right-shifted to align with the larger exponent before arithmetic. This alignment introduces significant control complexity and data dependencies, impeding straightforward parallel accumulation in memory arrays. In large-scale parallel CIM, the simultaneous accumulation of products with different exponent sums faces mantissa alignment issues, making direct in-memory FP-MAC difficult to implement. Thus, innovative architectural and circuit solutions are necessary for efficient, full-precision FP-MAC operations in CIM systems.
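
To make the alignment requirement concrete, the short sketch below emulates the addition path of Figure 1 for two positive BF16-style operands: the exponents are compared and the mantissa of the smaller-exponent operand is right-shifted before the integer mantissas are added. It is a simplified illustration (positive values only, no rounding), and the helper names are ours rather than the paper's.

```python
import math

MAN_BITS = 7          # BF16 stored mantissa width (the hidden 1 is restored below)

def split(x):
    """Decompose a positive value into (exponent, 8-bit integer mantissa with hidden 1)."""
    e = math.floor(math.log2(x))
    return e, round(x / 2**e * (1 << MAN_BITS))

def fp_add(a, b):
    """Floating-point addition with explicit exponent comparison and mantissa alignment."""
    ea, ma = split(a)
    eb, mb = split(b)
    e = max(ea, eb)                      # exponent comparison
    ma >>= e - ea                        # dynamic mantissa alignment: right-shift the
    mb >>= e - eb                        # operand with the smaller exponent
    return (ma + mb) * 2.0**(e - MAN_BITS)

print(fp_add(3.25, 0.375))               # 3.625 (0.375 is right-shifted by 3 positions)
```

Every additional addend repeats this data-dependent shift, which is exactly the dependency that hampers parallel accumulation inside a memory array and motivates the post-alignment strategy adopted in this work.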

Logic-in-memory designs embed digital logic within memory arrays to preserve full precision. For example, KAIST’s work [21] first implemented in-memory exponent computation, while Seoul National University’s design [22] employed CAM-based maximum detection. These methods maintain numerical accuracy but suffer from low throughput due to multi-cycle scheduling and complex control.

Digital compute-in-memory architectures dominate recent developments, leveraging SRAM arrays with embedded arithmetic units. Tu proposed a segmented MAC array with pre-aligned mantissas [27], though fixed alignment limits dynamic range. Saikia [24] reduced mantissa width to improve efficiency at the cost of precision. Fudan University’s SLAM-CIM [25] reversed alignment order to mitigate truncation loss but increased exponent logic overhead. Southeast University’s design [26] replaced exponent subtractors with equivalence comparators to reduce area, yet still relied on pre-aligned mantissas.

Hybrid-domain designs balance precision and efficiency by combining analog and digital computation. Chang proposed a mixed-domain FPCIM [28] with bit-wise exponent processing and analog-domain mantissa multiplication. While this improves energy efficiency, it introduces quantization loss due to pre-alignment. Their recent work used gain cell memory [29] for dual-mode integer–floating-point computation but still faced precision constraints.

In addition, a few recent studies have explored FPCIM architectures based on non-SRAM memory technologies. Hu proposed RRAM-based FPCIM with analog-domain multiplication and non-uniformly grouped sense amplifiers [30], but the design suffers from conductance nonlinearity and drift, leading to unstable precision. Liu developed MDCIM [31], a SOT-MRAM-based FP64 CIM macro using mantissa segmentation, but faced throughput limits due to MRAM’s peripheral complexity. These limitations in precision, latency, and integration complexity have posed challenges to FPCIM designs based on high-density memory technologies; as a result, SRAM-based architectures remain the predominant choice in current implementations.

To address the limitations of prior FPCIM designs and meet the demands for high precision, high row-level parallelism, and high memory density, this work proposes a high-precision hybrid floating-point compute-in-memory (Hy-FPCIM) architecture for complex deep learning with the following contributions:

A dual-macro high-precision Hy-FPCIM architecture for the Vision Transformer including Bit-wise Exponent Macro (BEM) and Booth Mantissa Macro (BMM) with post-alignment, enabling parallel exponent–mantissa processing with pipeline dataflow, reduced inter-macro data movement, and improved routing efficiency.

A high-row-parallelism in-memory BEM incorporating a low-cost Bit-Separated Exponent Summation Unit (BSESU) and a routing-efficient Bit-wise Max Finder (BMF) for scalable in-memory exponent computation.

A high-precision in-memory BMM combining Booth 4 encoding and a 12T triple-port SRAM-based Flying Mantissa Lookup Table (Fly-MLUT) to achieve efficient, almost lossless mantissa multiplication without sense amplifiers.

The remainder of this paper is organized as follows: Section 2 introduces the Hy-FPCIM architecture, including the design of the BEM and BMM. Section 3 presents implementation details and evaluation results. Section 4 discusses comparisons with prior FPCIM architectures. Section 5 concludes the work.

2. Proposed Hybrid Floating-Point Compute-in-Memory

2.1. ViT Structure with High Floating-Point Mult-Adds Demand

Figure 2 depicts a typical transformer-based Vision Transformer (ViT) [32] pipeline used in this work. The input image, sized 224 × 224 × 3, is segmented into 16 × 16 patches using a convolutional kernel with a stride of 16, and these patches are subsequently embedded into a 768-dimensional space. The architecture comprises 12 stacked transformer encoder blocks, each incorporating multi-head self-attention (MHA) and a multi-layer perceptron (MLP). The classification is finally executed through an MLP head, which includes a pre-logits layer followed by a linear layer. In each self-attention layer, queries (Q), keys (K), and values (V) are computed by three linear projections and partitioned across heads. The scaled dot product produces attention weights,

(1) Attn(Q, K, V) = softmax(QKᵀ / √d) V

which aggregates information globally over tokens. These stages are dominated by a large number of floating-point matrix multiplications and accumulations from the Q/K/V projections, the QKᵀ score computation, the attention-weighted V aggregation, and the two linear layers of the MLP.

Approximately 99.8% of the floating-point operations (FLOPs) are concentrated within the 12 Transformer encoders. With a hidden layer dimension of 768, a sequence length of 197, and 12 attention heads, each MHA layer executes around 531.7 million MAC operations, while each MLP layer performs approximately 930.7 million MACs, resulting in a total of about 17.55 billion MACs across all encoder blocks. It is important to note that the attention computation necessitates a higher level of accuracy than the linear layers. Additionally, Figure 2 includes a computational graph of the MHA, highlighting parallelizable linear and matrix multiplication nodes. As tokens are processed concurrently across multiple heads and layers, the ViT represents a precision-sensitive, row-parallel workload. Thus, the ViT pipeline serves as an effective benchmark for evaluating floating-point acceleration and memory-centric data flows in edge-oriented systems.
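
These per-layer figures can be sanity-checked from the model dimensions alone. The following back-of-the-envelope script counts only the dominant matrix multiplications (Q/K/V and output projections, the two attention matrix products, and the two MLP linear layers), so it lands slightly below the reported values, which also include smaller terms:

```python
# Back-of-the-envelope MAC count for one ViT-Base encoder block
# (hidden size 768, 197 tokens, 12 heads, MLP ratio 4).
D, N, H, R = 768, 197, 12, 4
d_h = D // H                                   # 64, the per-head dimension

qkv      = 3 * N * D * D                       # Q/K/V projections
scores   = H * N * N * d_h                     # Q @ K^T per head
context  = H * N * N * d_h                     # attention-weighted V aggregation
out_proj = N * D * D                           # output projection
mha = qkv + scores + context + out_proj        # ~524.4 M (reported: ~531.7 M)

mlp = 2 * N * D * (R * D)                      # two linear layers, 768 <-> 3072
                                               # ~929.6 M (reported: ~930.7 M)
print(f"per-block total ~{(mha + mlp) / 1e6:.0f} M MACs")
print(f"12 encoders     ~{12 * (mha + mlp) / 1e9:.2f} B MACs")   # ~17.45 B (reported: 17.55 B)
```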

Figure 3 shows the mapping of major ViT operators onto the proposed Hy-FPCIM architecture. In ViT, the computationally dominant workloads arise from floating-point matrix multiplications in the Q/K/V projections, the attention (QKᵀ) computation, and the linear layers of the feed-forward MLP. These massive MAC workloads are executed in the proposed Hy-FPCIM, where input operands are split into exponents and mantissas and dispatched to the corresponding macros for processing. To support efficient integration into full AI accelerator systems, we adopt a unified 2D tiling strategy of 64 × 8 for all matrix multiplications, where each Hy-FPCIM macro independently executes a 64 × 8 submatrix multiplication. The exponent computation is performed in parallel by the BEM, while the mantissa multiplication–accumulation is handled by the BMM, and the final result is assembled by the algorithm mapping controller.
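
A functional view of this 64 × 8 tiling is sketched below: the full matrix product is partitioned into 64-row, 8-column weight tiles, and each tile-level step stands in for one macro invocation. The function name and the NumPy formulation are illustrative only; the dimensions here divide evenly, so no padding is modeled.

```python
import numpy as np

def tiled_matmul(A, B, tile_k=64, tile_n=8):
    """Emulate the 64 x 8 tiling: each inner step stands in for one macro call that
    multiplies an input slice by a (64 x 8) weight tile and accumulates the result."""
    M, K = A.shape
    _, N = B.shape
    C = np.zeros((M, N))
    for k0 in range(0, K, tile_k):             # 64-row weight slices
        for n0 in range(0, N, tile_n):         # 8-column weight slices
            C[:, n0:n0 + tile_n] += A[:, k0:k0 + tile_k] @ B[k0:k0 + tile_k, n0:n0 + tile_n]
    return C

A = np.random.randn(197, 768)                  # 197 tokens, hidden size 768
B = np.random.randn(768, 768)                  # e.g., one Q/K/V projection matrix
assert np.allclose(tiled_matmul(A, B), A @ B)
```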

2.2. Hy-FPCIM Architecture

Figure 4 illustrates the overall architecture of the proposed Hybrid Floating-Point Compute-in-Memory (Hy-FPCIM) design, which targets the BF16 format widely adopted in AI tasks. To mitigate mantissa rounding and truncation loss during accumulation, the architecture implements floating-point matrix multiplication with a post-alignment strategy. The Hy-FPCIM architecture consists of one Bit-wise Exponent Macro (BEM) and two Booth Mantissa Macros (BMMs):

BEM: The Bit-wise Exponent Macro stores 65.5 Kb of exponent weights and is partitioned into eight banks. It consists of Exponent Calculator Arrays (ECAs) and a parallel Bit-Wise Max Finder (BMF). Each ECA integrates a Bit-Separated Exponent Summation Unit (BSESU) and a Delta Exponent Subtractor Unit (DESU).

BMM: Each BMM is divided into four banks, and the two BMMs together store 98.3 Kb of pre-calculated mantissa weights. Each is composed of Flying Mantissa LUTs (Fly-MLUTs), LUT decoder drivers, Booth adders, local shift accumulators, global mantissa accumulators, and auxiliary circuits for floating-point normalization.

Figure 4 also shows the data flow of BF16 matrix multiplication in this architecture, which is pipelined across two cycles:

Cycle 1: BEM performs in-memory exponent summation and maximum comparison on 64 rows of input exponents and eight columns of 64-row weight exponents, then DESU generates eight sets of exponent differences ΔE. Simultaneously, two BMMs perform Radix-16 Booth coding on 64 rows of mantissa inputs and compute 64 rows of eight-column mantissa multiplication results based on partial products in the Fly-MLUT according to the Booth algorithm.

Cycle 2: Local shift accumulators and global mantissa accumulators in the BMM execute shift-and-accumulate operations on the mantissa multiplication results based on ΔE. Finally, floating-point normalization is performed using EMAX to obtain the BF16 floating-point matrix multiplication result.

The proposed two-cycle pipeline design achieves a balanced trade-off between latency and throughput. The latency at each stage is tolerable due to the 8-bit exponent and mantissa width of the BF16 format; otherwise, excessive pipeline registers might incur greater energy overhead than the computing operations.

This approach preserves the full mantissa precision throughout the computation and effectively prevents precision degradation caused by early right-shift operations.
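
The two-cycle, post-aligned dataflow can be emulated end to end in a few lines. The sketch below mirrors the BEM/BMM split for a single dot product with positive BF16-style operands: exponent sums, the EMAX search, and ΔE correspond to "cycle 1", while ΔE-based shift-accumulation and scaling by EMAX correspond to "cycle 2". It is a numerical illustration of the strategy, not a cycle-accurate model.

```python
import math

MAN_BITS = 7                                   # BF16 stored mantissa width

def split_bf16(x):
    """Return (exponent, 8-bit integer mantissa with hidden 1) of a positive value."""
    e = math.floor(math.log2(x))
    return e, round(x / 2**e * (1 << MAN_BITS))

def post_aligned_dot(inputs, weights):
    # "Cycle 1": BEM-side exponent sums and maximum; BMM-side mantissa products.
    e_sum, m_prod = [], []
    for x, w in zip(inputs, weights):
        ex, mx = split_bf16(x)
        ew, mw = split_bf16(w)
        e_sum.append(ex + ew)                  # BSESU: in-memory exponent summation
        m_prod.append(mx * mw)                 # Fly-MLUT + Booth adders
    e_max = max(e_sum)                         # BMF: bit-wise maximum search
    delta_e = [e_max - e for e in e_sum]       # DESU: dE = EMAX - ESUM

    # "Cycle 2": shift-accumulate the full-width products, then scale by EMAX.
    acc = sum(m >> d for m, d in zip(m_prod, delta_e))
    return acc * 2.0**(e_max - 2 * MAN_BITS)

x, w = [1.5, 0.75, 2.0, 1.25], [2.0, 4.0, 0.5, 1.0]
print(post_aligned_dot(x, w), sum(a * b for a, b in zip(x, w)))   # 8.25 vs 8.25
```

Because the products keep their full width until ΔE is known, no information is discarded by an early right shift, which is the precision argument behind the post-alignment strategy.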

2.3. Bit-Wise Exponent Macro

As shown in Figure 4, the BEM consists of three core modules, the BSESU, BMF, and DESU, plus peripheral control circuits (e.g., decoders, drivers, controllers), which, respectively, realize the in-memory summation of the input exponent (INE) and weight exponent (WE) to yield ESUM, the maximum selection of ESUM to obtain EMAX, and the calculation of ΔE = EMAX − ESUM. The BEM is designed with standard 8T SRAM arrays for process compatibility, mitigating the data movement overhead between the exponent memory and the calculator in conventional FPCIM.

2.3.1. Bit-Separated Exponent Summation Unit

The BSESU adopts a “bit-separated” design, where 8-bit exponents are split into the most significant 4 bits (M4SB) and the least significant 4 bits (L4SB). This enhances row parallelism by reducing word line (WL) capacitance, enabling simultaneous computation on more rows of weights while avoiding the sequential bit-processing latency of traditional bit-serial exponent summation.

Figure 5 shows the circuit structure of the BSESU, which consists of the 8T SRAM array, a compact 5T local readout unit, and a pass-transistor logic (PTL) full adder. The 5T local readout unit is embedded in the 8T SRAM array and operates in two modes: when REN = 0, WE is latched by the feedback structure; when REN = 1, WE is refreshed following the calculate bit line (CBL). It refreshes the signals for the PTL full adder and avoids glitches at the FA output caused by precharging during the readout process, thus preventing unnecessary switching power consumption in subsequent operations.

The compact 16T PTL full adder [33] in the BSESU consists of two PTL XOR gates and a 2-to-1 MUX. For 8-bit summation, a ripple-carry structure is employed to reduce the area of the embedded adder. Compared to traditional 28-transistor CMOS full adders, this design achieves a significant area reduction, trading off some power and throughput to meet the requirements of in-memory exponent calculation.

During operation, the target RWL is asserted to sense WE onto the CBL. REN is subsequently triggered to feed WE into the PTL full adder. Through ripple-carry across eight cascaded full adders, the 8-bit exponent sum (ESUM) is generated for BMF processing.
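
For reference, the following behavioral sketch models the PTL full adder as two XOR stages plus a carry multiplexer and chains eight of them in ripple-carry fashion, matching the 8-bit exponent summation described above. It captures the logic function only, not the transistor-level PTL implementation.

```python
def ptl_full_adder(a, b, cin):
    """Behavioral model of the PTL full adder: two XOR stages plus a carry MUX."""
    p = a ^ b                     # first XOR stage (propagate)
    s = p ^ cin                   # second XOR stage (sum)
    cout = b if p == 0 else cin   # 2-to-1 MUX selects the carry-out
    return s, cout

def ripple_add8(x, y):
    """8-bit ripple-carry exponent summation as in the BSESU (sum mod 256, carry)."""
    carry, total = 0, 0
    for i in range(8):
        s, carry = ptl_full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry

assert ripple_add8(0x7F, 0x01) == (0x80, 0)    # 127 + 1 = 128
assert ripple_add8(0xC3, 0x5A) == (0x1D, 1)    # 195 + 90 = 285 -> 0x11D
```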

2.3.2. Parallel Bit-Wise Max Finder

Figure 6 illustrates the proposed bit-wise maximum finder (BMF) structure, consisting of a two-level in-memory array topology (AT) structure optimized from the AT maximum finders in [34]. Each AT array comprises 8 × 8 sub-cells, each consisting of two 2-input AND gates, one 2-input OR gate, and 1/8 of an 8-input NOR gate. Different from the original AT unit, the proposed in-memory AT unit decomposes the external 64-input OR gate into internal 8-input NOR gates and an external 8-input NAND gate. Furthermore, the transistors of the internal 8-input NOR gate are distributed into each cell, thus reducing the number of row-direction routings. After layout optimization, there are only 11 routing channels per bit in the row direction, significantly reduced from the 65 channels in the basic AT array. This effectively eliminates routing congestion in high-row-parallel floating-point matrix multiplication.

Figure 7 shows the BMF operation flow. The in-memory AT cells compare the 64 ESUM values in parallel, following the bit-wise logic sequence from the most significant bit (MSB) to the least significant bit (LSB). When Enable #0–63 are 0, EMAX [7:0] is driven to 0. When all the enable signals Enable #0–63 become high, the AND gate a (in Figure 6) is activated first. If any of the 64 AT cells has ESUM [7] = 1, EMAX [7] is driven high by the first-stage NOR gates and the second-stage NAND gate b. Simultaneously, the match signal Match [6] in that cell is set by the OR gate e. If all ESUM [7] are 0, EMAX [7] is driven low. The inverter c turns on, activating the internal AND gate d. This passes the current row’s Match signal through the OR gate e to the next row. The proposed BMF with parallel matching from MSB to LSB significantly reduces the routing overhead, logic delay, and area of the maximum finder under high row parallelism.
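
Algorithmically, the MSB-to-LSB search can be summarized as follows: at each bit position, if any still-matching candidate holds a 1, that bit of EMAX becomes 1 and candidates holding a 0 are eliminated; otherwise the bit stays 0 and all candidates remain. The sketch below is a software model of this behavior, not of the AT gate structure.

```python
import random

def bitwise_max(values, width=8):
    """Algorithmic model of the bit-wise max finder (MSB -> LSB elimination)."""
    alive = list(values)                      # candidates still matching EMAX so far
    e_max = 0
    for bit in range(width - 1, -1, -1):      # scan from MSB to LSB
        ones = [v for v in alive if (v >> bit) & 1]
        if ones:                              # some candidate has a 1 at this bit
            e_max |= 1 << bit                 # EMAX takes a 1 here
            alive = ones                      # candidates with a 0 drop out
        # if no candidate has a 1, the EMAX bit stays 0 and everyone stays alive
    return e_max

vals = [random.randrange(256) for _ in range(64)]     # 64 eight-bit exponent sums
assert bitwise_max(vals) == max(vals)
```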

2.3.3. Delta Exponent Subtractor Unit

Figure 8 shows the DESU with a standard 8-bit ripple borrow subtractor to compute the exponent difference ΔE = EMAX − ESUM, which is required for mantissa alignment in the subsequent BMM stage. During the exponent computation process, EMAX is selected as the maximum among all partial exponent sums (ESUM), ensuring that EMAX is always greater than or equal to ESUM. This property allows the use of an unsigned subtractor without overflow handling.

Compared to the conventional approach of using a full adder with two’s complement conversion for subtraction, directly adopting a ripple borrow subtractor reduces both area and dynamic power consumption, as it eliminates the need for complement conversion logic.

2.4. Booth Mantissa Macro

As shown in Figure 4, the BMM architecture consists of Booth 4 Encoders, 12T SRAM-based SA-free Fly-MLUTs, and Floating-Point Mantissa Accumulators. These modules collectively perform in-memory multiplication and accumulation of BF16 mantissas, completing the computation in two cycles. In the first cycle, input mantissas are Booth 4-encoded to index precomputed partial products from Fly-MLUTs. Booth adders then generate full multiplication results. In the second cycle, all products are shift-accumulated based on the exponent difference ΔE, followed by normalization.

2.4.1. Booth 4 Encoder

As shown in Figure 9, a parallelized Booth 4 encoding scheme is proposed to reduce floating-point mantissa multiplication latency. Each 8-bit BF16 mantissa and 1-bit sign of the input are converted to 9-bit two’s complement, which can be divided into two overlapping 5-bit segments. Compared to Booth 2, the proposed Booth 4 encoder requires only two partial products for each BF16 mantissa operation, allowing for parallel implementation and significantly reducing latency.

Table 1 presents the truth table for the proposed Booth 4 encoder, including selected mantissa LUT weight (MLUT), negative indicator (NEG), shift control (SHIFT), and zero indicator (ZERO). The table also shows the required partial products (PPs) after encoding. The proposed Fly-MLUT stores only four odd partial products (1W, 3W, 5W, 7W) to reduce memory overhead, and the even partial products and negative partial products are obtained by shifting and two’s complement conversion, respectively. When ZERO = 1, the corresponding path is disabled to suppress switching activity.

The encoding logic executes in parallel outside the CIM array and is broadcast to four banks, reducing area overhead. For each row, two sets of encoded signals are generated and routed to the LUT decoder and Booth adder. This segmented encoding is implemented within the same cycle as Fly-MLUT lookup and Booth addition, forming the first stage of the BMM data path.
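
The mapping of Table 1 can be written compactly: each 5-bit window is interpreted as a radix-16 Booth digit in the range −8 to +8, which is then factored into an odd MLUT multiple, a shift, a sign, and a zero flag. The following sketch is a behavioral model of that per-window mapping; the signal names mirror Table 1, but the code itself is ours.

```python
def booth4_encode(window):
    """Map one 5-bit Booth window (Table 1 'Input') to its control signals.
    Returns (mlut_odd_multiple, NEG, SHIFT, ZERO); digit = (-1)**NEG * odd << SHIFT."""
    w4, w3, w2, w1, w0 = [(window >> i) & 1 for i in (4, 3, 2, 1, 0)]
    digit = -8 * w4 + 4 * w3 + 2 * w2 + w1 + w0        # radix-16 Booth digit, -8..8
    if digit == 0:
        return 0, 0, 0, 1                              # ZERO = 1 disables the path
    neg, mag = int(digit < 0), abs(digit)
    shift = 0
    while mag % 2 == 0:                                # factor out powers of two
        mag //= 2
        shift += 1
    return mag, neg, shift, 0                          # mag is 1, 3, 5 or 7

# Spot-check a few rows of Table 1.
assert booth4_encode(0b00110) == (3, 0, 0, 0)          # +3W
assert booth4_encode(0b01000) == (1, 0, 2, 0)          # +4W = 1W << 2
assert booth4_encode(0b10000) == (1, 1, 3, 0)          # -8W = -(1W << 3)
assert booth4_encode(0b11111) == (0, 0, 0, 1)          # 0W
```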

2.4.2. 12T SRAM-Based Fly-MLUT

Figure 10a illustrates the proposed Fly-MLUT structure, which stores four precomputed partial products for Booth 4 mantissa multiplication. The proposed in-memory mantissa multiplier consists of a 32 × 48 Fly-MLUT array built from custom 12T tri-port bit cells (12T TP cells), LUT row decoders and drivers, and Booth adders. Figure 10b presents the structure of the 12T TP cell, which is based on a 6T bit cell augmented with an internal full-swing driver and two dedicated read ports, RBL1 and RBL2, for CIM. This eliminates the pre-charge process on the read ports during computation and enables partial-product readout without a sense amplifier. The dual readout channels support simultaneous lookup of two partial products, allowing parallel decoding of the high-order and low-order Booth indices to complete a BF16 mantissa multiplication within one cycle.

The 32-row Fly-MLUT array consists of two sub-segments, each storing precomputed partial products of two weights. The RBLs of the upper sub-segment overlap with those of the lower sub-segment, realizing parallel lookup of partial products for two weights and improving LUT efficiency. As described in Section 2.4.1, the Fly-MLUT stores only four odd partial products (1W, 3W, 5W, 7W) after Booth 4 encoding. Therefore, each 16-row sub-segment caches four precomputed weights’ partial products. The 48-column storage units hold 12-bit partial products from four columns of weights and can be read simultaneously. This enables the Fly-MLUT to read 2 rows × 4 columns × 2 segments × 12-bit partial products at once, facilitating 2-row × 4-column floating-point mantissa multiplication.
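
Functionally, each lookup therefore returns a stored odd multiple of the weight mantissa, and the Booth adder recombines the two windows' partial products, with the high-order window carrying a 16× radix weight as in standard radix-16 Booth recoding. A minimal software model of this recombination is sketched below; it assumes the digit weighting just described and is not a gate-level description.

```python
def booth_digit(window):
    """Radix-16 Booth digit (-8..8) of a 5-bit window, as in Table 1."""
    w = [(window >> i) & 1 for i in range(5)]
    return -8 * w[4] + 4 * w[3] + 2 * w[2] + w[1] + w[0]

def fly_mlut(weight_mantissa):
    """The four odd partial products stored per weight: 1W, 3W, 5W, 7W."""
    return {m: m * weight_mantissa for m in (1, 3, 5, 7)}

def mantissa_multiply(win_hi, win_lo, weight_mantissa):
    """Recombine the two windows' partial products; the high window carries a
    16x radix weight, matching radix-16 Booth recoding."""
    lut = fly_mlut(weight_mantissa)
    product = 0
    for window, radix_weight in ((win_hi, 16), (win_lo, 1)):
        d = booth_digit(window)
        if d == 0:
            continue                            # ZERO gates the path
        odd, shift = abs(d), 0
        while odd % 2 == 0:                     # even multiples come from shifting
            odd //= 2
            shift += 1
        pp = lut[odd] << shift
        product += radix_weight * (pp if d > 0 else -pp)   # NEG -> two's complement
    return product

W = 0xB5
assert mantissa_multiply(0b00110, 0b10111, W) == (16 * 3 - 4) * W   # digits +3 and -4
```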

2.4.3. Mantissa Accumulator and Floating-Point Normalization

Mantissa accumulation is performed in two stages. In the first stage, multiplication results from Booth adders are aligned using barrel shifters based on exponent difference ΔE, then summed through a four-level Local Shift Accumulator. The second stage compresses four locally accumulated results using a two-level Global Mantissa Accumulator.

Figure 11 illustrates the normalization process. The final result is converted from two’s complement to sign-magnitude format, and a leading-zero detection algorithm determines the required shift. The exponent is adjusted accordingly, and the normalized mantissa is combined with the updated exponent and sign to produce the BF16 output.
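
A behavioral version of this normalization step is sketched below: the two's-complement accumulator value is converted to sign and magnitude, a leading-zero count determines the shift, and the exponent is adjusted by the same amount. The accumulator width and scaling convention are illustrative assumptions.

```python
def normalize(acc, e_max, acc_width=32, man_bits=7):
    """Convert a two's-complement accumulator value into a normalized
    (sign, exponent, mantissa) triple; widths are illustrative."""
    sign = int(acc < 0)
    mag = -acc if sign else acc                 # two's complement -> sign-magnitude
    if mag == 0:
        return 0, 0, 0
    lz = acc_width - mag.bit_length()           # leading-zero detection
    shift = acc_width - 1 - man_bits - lz       # drop bits so the mantissa fits 1.xxxxxxx
    mantissa = mag >> shift if shift >= 0 else mag << -shift
    exponent = e_max + shift                    # exponent adjusted by the same amount
    return sign, exponent, mantissa

# A magnitude of 0x2D00 normalizes to mantissa 0xB4 with the exponent raised by 6.
print(normalize(0x2D00, e_max=10))             # (0, 16, 180)
```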

3. Experiments and Results

3.1. Design Methodology

The Hy-FPCIM macro was implemented using the TSMC 28 nm HKMG process technology. The fully customized CIM array was designed at both schematic and layout levels using Cadence Virtuoso Studio IC25.1, and verified through Siemens Calibre 2025 for DRC and LVS sign-off. Outside the CIM array, the top-level instruction controller and I/O interface circuits were designed using a digital semi-custom flow. The front-end development was carried out with Synopsys VCS and Verdi X-2025.06. To shorten the turnaround time (TAT) and improve the correlation between front-end and back-end design stages, Synopsys Design Compiler NXT X-2025.06 was adopted for physical-aware synthesis, followed by Cadence DDI (Digital Design Implementation) v25.1 for place-and-route optimization to achieve the best PPA (Performance, Power, and Area) results. During synthesis and layout implementation, SBOCV design considerations were included to ensure full-chip timing convergence and sign-off reliability. To enable fast and accurate co-design between the top-level digital control/communication modules and the fully custom analog macros, the custom blocks were characterized and standardized using Cadence Liberate v25.1, providing comprehensive IP behavioral and timing models for seamless integration.

3.2. Implementation Results

The proposed Hy-FPCIM covers 0.265 mm2 and performs a 64 × 8 floating-point matrix multiplication per cycle. The high-row-parallel BEM offers 65.5 Kb of storage capacity in just 0.045 mm2. Each bank can perform the exponent calculation for 64 rows of floating-point multiplications; hence, the entire BEM implements the exponent operations for the 64 × 8 floating-point matrix multiplication. Within each ECA, the proposed embedded exponent calculator features a compact layout area of 28.7 μm2, including the BSESU (excluding bit cells), bit-wise comparator, and DESU. The memory array employs TSMC’s customized push-rule TP-8T-HC-MUXN compact SRAM bit cell, with a cell area of 0.24 μm2, achieving high integration density while maintaining robust read stability.

Each high-precision BMM features 49 Kb of mantissa weight storage capacity within 0.11 mm2. The Hy-FPCIM consists of two BMMs and one BEM, with a total storage capacity of 163.5 Kb and a storage density of 617 Kb/mm2.

3.3. Evaluation Configuration

To evaluate the algorithm-level effectiveness of the proposed high-precision Hy-FPCIM architecture, a system-level, application-oriented evaluation methodology was adopted. At this stage, the precision characteristics of Hy-FPCIM were modeled in software to emulate hardware behavior. Specifically, the software model reflects the precision behavior described in Section 2.4.1, wherein the input mantissa’s least significant bit (LSB) is truncated because of the Booth encoding.

The evaluation was conducted using PyTorch 2.6 and CUDA 11.8 frameworks. Data preprocessing was performed using OpenCV 4.5 and NumPy 1.21, while the benchmark models included ResNet-50 and ViT. The test dataset comprised the ImageNet-1k dataset for large-scale image classification. During evaluation, pretrained model weights from PyTorch were uniformly quantized to BF16. All matrix multiplications followed the precision-loss behavior specified by the Hy-FPCIM evaluation model. The convolutional layers in ResNet-50 and the attention and linear layers in ViT were mapped onto the on-chip compute array model. When the computation exceeded the hardware array capacity, a partition-based evaluation method was adopted to complete the full inference process. Other lightweight operators such as activation and normalization were executed in software to focus measurement on floating-point matrix multiplication, and the simulated run time was recorded to estimate execution time.
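
A minimal PyTorch-style emulation of this precision model is sketched below: operands are cast to BF16 and the input-mantissa LSB truncation attributed to the Booth encoding is mimicked by masking one mantissa bit before the matrix multiplication. The helper functions are our own illustration under these assumptions, not the authors' released evaluation code.

```python
import torch

def truncate_mantissa_lsb(x: torch.Tensor, bits: int = 1) -> torch.Tensor:
    """Zero the lowest `bits` mantissa bits of a BF16 tensor -- an illustrative
    model of the Booth-encoding LSB truncation described in Section 2.4.1."""
    u = x.to(torch.bfloat16).view(torch.int16)   # BF16: 1 sign, 8 exponent, 7 mantissa bits
    u = u & -(1 << bits)                         # clear the low mantissa bit(s)
    return u.view(torch.bfloat16)

def hy_fpcim_matmul(a: torch.Tensor, w: torch.Tensor) -> torch.Tensor:
    """BF16 matmul with input-mantissa truncation, standing in for the CIM array model."""
    a_q = truncate_mantissa_lsb(a)
    w_q = w.to(torch.bfloat16)
    return (a_q.float() @ w_q.float()).to(torch.bfloat16)

a, w = torch.randn(197, 768), torch.randn(768, 768)
err = (hy_fpcim_matmul(a, w).float() - a @ w).abs().max()
print(f"max abs deviation vs FP32: {err:.4f}")
```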

During timing simulation, the evaluation is performed serially with a batch size of one, and pipelined execution is adopted to improve array utilization. After patch embedding, an input image of 224 × 224 is divided into 16 × 16 patches, resulting in 196 patches, and one class token is added, forming 197 tokens in total. Each patch is linearly projected to a 768-dimensional token vector. In the attention module, the Q×Kᵀ operation is decomposed into a series of 197 × 64 and 64 × 197 matrix multiplications. The computation is tiled into 64 × 8 blocks and executed sequentially. Figure 12 shows the operation flow of attention in ViT. Due to the decoupled computing and writing paths, weight refresh does not block the computation. Also, each multiplication takes 197 cycles, which is much longer than the 64-cycle refresh time, so the refresh latency is fully hidden.

For circuit-level evaluation, Synopsys StarRC X-2025.06 was used to extract parasitic parameters of the full array. Cadence Spectre was employed for post-simulation of the BEM and the Fly-MLUT submodule within the BMM. For the digital circuits in BMM, latency was characterized by Synopsys PrimeTime based on extracted parasitic parameters. It is then back-annotated to the gate-level netlist via the SDF from PrimeTime and verified for functionality through post-simulation by Synopsys VCS. Power estimation of the BMM was performed using Synopsys PTPX, based on waveform data obtained from post-layout simulation. To ensure functional robustness and sign-off compatibility, comprehensive parasitic extraction and verification were performed under all major RC corners—including Cworst/Cworst_T, Cbest/Cbest_T, RCworst/RCworst_T, RCbest/RCbest_T, and typical conditions—as well as across TT, SSG, and FFG process corners. For an accurate and objective characterization of the peak performance of the proposed Hy-FPCIM macro, post-layout simulations under the worst-case parasitic conditions were conducted. All simulations were executed on a high-performance computing platform equipped with dual Intel Platinum 8375C processors, a 440BX desktop reference motherboard, 471 GB of system memory, and the RHEL operating system. This configuration ensured efficient execution of full-chip simulations and reliable correlation between circuit-level analysis and system-level performance evaluation.

3.4. Evaluation Results

At the bit cell level, extensive PPAY (Performance, Power, Area, and Yield) evaluations were conducted to validate the design rationality and reliability of the fully custom bit cell. Under the TT corner, typical RC condition, and 25 °C operating temperature, the proposed 12T TP cell achieved a read delay of only 12.2 ps (averaged over read-‘1’ and read-‘0’ operations) when accessed through the WWL and BL path. The read energy was measured at 1.03 pJ. In terms of reliability, the 12T SRAM cell features decoupled read and write ports, which effectively eliminates read-disturb issues; thus, only the write static noise margin (WSNM) was evaluated. Monte Carlo simulations with 20,000 samples were performed at 0.9 V supply. Figure 13 shows the simulation results of our proposed 12T TP SRAM cell under the 3σ process variation, with the WSNM of 269 mV, confirming strong write robustness.

At the macro level, we evaluated trade-offs among different SRAM cell designs. Although 6T/8T SRAM bit cells offer smaller cell area and lower static power consumption, memory-sensitive deep learning workloads demand higher readout widths, which require significant additional overhead. Therefore, considering the memory peripheral circuits, the proposed SA-free 12T cell provides higher readout bit width with smaller area overhead to cover the high parallelism demands of matrix multiplication. Furthermore, the proposed 12T bit cell achieves faster readout latency compared to 6T/8T arrays that rely on external readout circuits.

At the system level, the Hy-FPCIM architecture was evaluated under representative operating scenarios to assess timing and power performance. Table 2 summarizes the latency and power characteristics of the BEM and BMMs, as well as the Hy-FPCIM system. The BEM exhibits a critical-path delay of 5.13 ns and a power consumption of 1.031 mW, while the BMM achieves a delay of 5.00 ns with 7.585 mW power consumption. After heterogeneously integrating the two types of macros, the latency of Hy-FPCIM is determined by the BEM, resulting in a system latency of 5.13 ns and an operating frequency of 195 MHz. The average power of Hy-FPCIM was 8.424 mW.

Figure 14 illustrates the frequency and power of the proposed Hy-FPCIM architecture at different operating voltages. When the voltage drops to 0.7 V, the critical path of the BEM increases to 12.5 ns. At this point, the Hy-FPCIM architecture operates at 80 MHz with 3.3 mW power consumption. At 0.8 V, the operating frequency reaches 140 MHz with 5.93 mW power consumption. Also, the minimum operating voltage that still meets timing at 195 MHz is 0.85 V.

Table 3 summarizes the overall performance of the proposed Hy-FPCIM. The Hy-FPCIM performs 64 × 8 floating-point matrix multiplication per cycle. Based on the measurement results in Table 3, along with the area and storage capacity metrics presented in Section 3.2, the Hy-FPCIM achieves a throughput of 200 GFLOPS, energy efficiency of 23.7 TFLOPS/W, area efficiency of 0.755 TFLOPS/mm2, and memory density of 617 Kb/mm2. The storage-to-compute ratio is 16:1 for the BEM and 4:1 for the BMM. Table 3 also shows the individual performance metrics of the BEMs and BMMs. These results indicate that the proposed Hy-FPCIM is structurally optimized for complex AI workloads and large language model inference, offering a balanced trade-off between compute density, energy efficiency, and memory integration.
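
These headline figures follow directly from the per-cycle tile size, clock frequency, power, area, and capacity quoted above, counting one multiply and one add as two FLOPs per MAC; the short check below reproduces them.

```python
# Reproduce the headline metrics from the figures reported above
# (64 x 8 tile per cycle, 195 MHz, 8.424 mW, 0.265 mm2, 163.5 Kb).
macs_per_cycle = 64 * 8
flops_per_cycle = 2 * macs_per_cycle                 # multiply + add
throughput = flops_per_cycle * 195e6                 # FLOPs per second
print(f"throughput      ~{throughput / 1e9:.0f} GFLOPS")                 # ~200
print(f"energy eff.     ~{throughput / 8.424e-3 / 1e12:.1f} TFLOPS/W")   # ~23.7
print(f"area eff.       ~{throughput / 0.265 / 1e12:.3f} TFLOPS/mm2")    # ~0.754
print(f"memory density  ~{163.5 / 0.265:.0f} Kb/mm2")                    # ~617
```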

In the numerical analysis, Figure 15 presents the ViT inference accuracy under different strategies in the attention module. We use the FP32 inference accuracy as the baseline and the 0.14% accuracy loss of BF16 as the error budget. The results show that although the LSB truncation in Booth 4 encoding causes a 0.05% accuracy loss, the Fly-MLUT and post-alignment strategies compensate for this error, leading to only a 0.03% decrease in ImageNet recognition accuracy. These experiments demonstrate that the proposed BMM achieves almost lossless in-memory computation.

In addition, Figure 16 shows the activation histogram distributions across the 12 attention layers. The quantized activations closely follow the baseline distributions, further confirming that the proposed mantissa path maintains near-lossless precision.

Table 4 summarizes the overall evaluation results of the proposed Hy-FPCIM across AI algorithms of varying complexity. Experiments evaluated ResNet-50 and ViT at BF16 precision using the ImageNet-1k dataset. Both models utilized pretrained weights without any structural or parameter modifications. During inference, all computations were executed in BF16 precision to systematically evaluate the impact of the Hy-FPCIM precision-loss model on inference accuracy, numerical stability, and execution efficiency. Table 4 reports the ImageNet-1k inference accuracy of the proposed Hy-FPCIM compared with the software baseline. For ResNet-50, the model achieves 80.76%/95.42% (Top-1/Top-5) accuracy, closely matching the full-precision baseline (80.858%/95.434%). For ViT, the model reaches 81.04%/95.29%, with only a negligible deviation (about 0.03%) from the baseline accuracy (81.072%/95.318%) in both Top-1 and Top-5 metrics. These results confirm that the proposed high-precision architecture and post-alignment strategy effectively prevent mantissa truncation loss and maintain full floating-point precision throughout accumulation. In terms of execution efficiency, under a 0.9 V supply voltage, a 200 MHz clock frequency, and normal temperature operating conditions, the proposed Hy-FPCIM architecture achieved a simulated inference latency of approximately 40.29 ms per image for ResNet-50 and 217.87 ms per image for ViT. Overall, these results confirm that the proposed Hy-FPCIM architecture provides accurate and energy-efficient performance across representative AI workloads, demonstrating its strong potential for high-precision, high-performance, and low-power intelligent computing systems.

4. Discussion

Compared with prior works, the proposed Hy-FPCIM architecture exhibits outstanding advantages in both energy efficiency and area efficiency. It achieves an energy efficiency of 23.7 TFLOPS/W and an area efficiency of 0.754 TFLOPS/mm2, representing improvements of 16.5× and 9.3×, respectively, compared to the logic-based high-precision floating-point ECIM baseline [21]. As illustrated in Figure 17, the proposed architecture surpasses representative state-of-the-art works such as ECIM [21], CAM [22], P3ViT [35], and EMCIM [36], achieving significantly higher area efficiency and energy efficiency. These improvements primarily stem from the cooperative design of the BEM and BMM units, which enable fully embedded floating-point MAC operations within the memory array.

Compared to state-of-the-art analog CIMs such as the high-precision PCM CIM with drift compensation [13], the proposed digital CIM architecture ensures computational accuracy without excessive compensation circuits. Compared to multi-bit integer RRAM CIM [14], the proposed floating-point architecture avoids the risk of quantization errors and exhibits lower power consumption in sparse scenarios. Furthermore, the modularly designed Hy-FPCIM exhibits exceptional flexibility and scaling capabilities, providing adaptable solutions to meet the high-precision demands of complex deep learning workloads.

Figure 18 compares the recognition accuracy with prior CIM accelerators for high-complexity ImageNet image recognition. The results demonstrate that the proposed hybrid floating-point CIM architecture achieves higher inference accuracy, almost matching software precision, and can be widely applied to complex deep learning tasks.

Considering that complex deep learning algorithms require higher weight storage, this work proposes an energy efficiency density (EED) metric that comprehensively evaluates energy efficiency and storage density. Its expression is as follows:

(2) EED = Energy Efficiency × Memory Density
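
For instance, using the values reported for this work, 23.7 TFLOPS/W and 617 Kb/mm2 (about 0.617 Mb/mm2), EED = 23.7 × 0.617 ≈ 14.6, which is the figure listed for this work in Table 5.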

Figure 19 illustrates the comparison of EED with prior work. The proposed hybrid floating-point CIM architecture achieves higher EED, effectively improving the performance of complex deep learning tasks and showing better applicability to transformer-based large models.

To evaluate the advantages of the proposed Hy-FPCIM architecture, we compare its performance with representative FPCIM designs from the recent literature. Table 5 summarizes key metrics, including energy efficiency, area efficiency, EED, and memory density.

As shown in Table 5, the proposed Hy-FPCIM demonstrates significant advantages in energy efficiency, area efficiency, image recognition accuracy, energy efficiency density, and storage density. It achieves superior performance in complex deep learning applications such as ViT.

5. Conclusions

This work presents Hy-FPCIM, a hybrid floating-point compute-in-memory architecture integrating BEM and BMM in-memory macros with high-precision BF16 matrix multiplication for complex deep learning such as ViT. The proposed architecture introduces a low-cost BSESU and a routing-efficient BMF for exponent processing, and a Booth 4-encoded Fly-MLUT for almost lossless mantissa multiplication. Post-alignment accumulation ensures precision without rounding loss. The proposed Hy-FPCIM also demonstrates strong robustness against PVT variations, benefiting from stationary local readout in BEM and highly robust LUT bit cells in BMM.

Moreover, the proposed Hy-FPCIM can be extended to FP32 or mixed-precision workloads by reconfiguring the pipeline schedule. For example, the BMM cycle can be extended to three cycles to sequentially process the mantissa segments [7:0], [15:8], and [23:16], thereby covering the FP32 input scenario.

Designed on TSMC 28 nm technology, Hy-FPCIM achieves 23.7 TFLOPS/W energy efficiency, 0.754 TFLOPS/mm2 area efficiency, 617 Kb/mm2 memory density, and 14.6 TFLOPS/W/mm2 EED. Compared to previous floating-point CIMs and ViT accelerators, Hy-FPCIM demonstrates higher recognition accuracy, energy efficiency, and memory density, providing an effective solution for complex deep learning hardware accelerators.

Author Contributions

Conceptualization, Z.M. and Y.W.; methodology, Z.M., Q.C. and C.W.; software, Z.M. and Y.W.; validation, Q.C.; formal analysis, Z.M. and Y.W.; investigation, Z.M. and C.W.; resources, Z.M.; data curation, Z.M. and Q.C.; writing—original draft, Z.M.; writing—review and editing, Z.M., C.W., Y.W. and Y.X.; visualization, Z.M.; supervision, Y.X.; project administration, Y.X.; funding acquisition, Y.X. All authors have read and agreed to the published version of the manuscript.

Data Availability Statement

The ImageNet dataset is publicly available at https://image-net.org (accessed on 10 October 2025). The evaluation data presented in this study are available on request from the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

AI: Artificial Intelligence
FP: Floating-Point
CIM: Compute-in-Memory
Hy-FPCIM: Hybrid Floating-Point Compute-in-Memory
BW: Bit-Wise
BEM: Bit-Wise Exponent Macro
BMM: Booth Mantissa Macro
BSESU: Bit-Separated Exponent Summation Unit
BMF: Bit-Wise Max Finder
MAC: Multiply–Accumulate
Fly-MLUT: Flying Mantissa Lookup Table
ViT: Vision Transformer
MHA: Multi-Head Self-Attention
EED: Energy Efficiency Density


Figures and Tables

Figure 1 Floating-point multiply–accumulate operation.

Figure 2 Vision Transformer (ViT) structure with MHA computational graph.

Figure 3 Multi-head attention for the proposed Hy-FPCIM architecture.

Figure 4 Hy-FPCIM overall architecture.

Figure 5 Bit-Separated Exponent Summation Unit structure.

Figure 6 In-memory Bit-Wise Max Finder structure.

Figure 7 BMF operation flow.

Figure 8 DESU structure.

Figure 9 Booth 4 of BF16 mantissa.

Figure 10 (a) Fly-MLUT structure. Orange: Upper LUT, Green: Lower LUT. (b) 12T triple port cell (12T TP cell) circuit and flying structure.

Figure 11 BF16 normalization flow.

Figure 12 Attention workflow in ViT.

Figure 13 Monte Carlo simulation results for 12T TP SRAM bit cell.

Figure 14 Timing/power guard bands of Hy-FPCIM.

Figure 15 Result of ablation experiments based on ViT@ImageNet inferencing.

Figure 16 Layer-wise activation histograms under quantized and baseline conditions.

Figure 17 Area and energy efficiency comparison of state-of-the-art floating-point CIM architecture.

Figure 18 Comparison of accuracy.

Figure 19 Comparison of energy efficiency density.

Booth 4 truth table.

Input MLUT NEG SHIFT ZERO PP Input MLUT NEG SHIFT ZERO PP
00001; 00010 W 0 0 0 1W 11101; 11110 W 1 0 0 −W
00011; 00100 W 0 1 0 2W 11011; 11100 W 1 1 0 −2W
00101; 00110 3W 0 0 0 3W 11001; 11010 3W 1 0 0 −3W
00111; 01000 W 0 2 0 4W 10111; 11000 W 1 2 0 −4W
01001; 01010 5W 0 0 0 5W 10101; 10110 5W 1 0 0 −5W
01011; 01100 3W 0 1 0 6W 10011; 10100 3W 1 1 0 −6W
01101; 01110 7W 0 0 0 7W 10001; 10010 7W 1 0 0 −7W
01111 W 0 3 0 8W 10000 W 1 3 0 −8W
00000; 11111 0 0 0 1 0W

The measurement result of the proposed Hy-FPCIM.

Macro BEM BMM Hy-FPCIM
Latency (ns) 5.13 5 5.13
Power (mW) 1.031 7.585 8.424

Overall performance of Hy-FPCIM.

Macro BEM BMM Hy-FPCIM
Area (mm2) 0.045 0.22 0.265
Frequency (MHz) 195 200 195
Memory Capacity (Kb) 65.5 98.3 163.5
Storage-to-Compute Ratio 16 4 16/4
Memory Density (Kb/mm2) 1456 445 617
Throughput (GFLOPS) 99.8 204.8 200
Energy Efficiency (TFLOPS/W) 96.8 27 23.7
Area Efficiency (TFLOPS/mm2) 2.22 0.931 0.755

Overall evaluation in AI tasks.

Model ResNet-50 ViT-B
Datasets ImageNet-1k ImageNet-1k
Precision BF16 BF16
Top-1 Accuracy 80.76% 81.04%
Top-1 Baseline 80.858% 81.072%
Top-5 Accuracy 95.42% 95.29%
Top-5 Baseline 95.434% 95.318%
Accuracy Loss @ Top-1 0.1% 0.03%
Execution Time 40.29 ms 217.87 ms

Summary of recent high-precision floating-point CIM works.

Architecture ECIM [21] CAM [22] EMCIM [36] FPCIM [37] 4T1T [38] P3ViT [35] This Work
Technology 28 nm 28 nm 28 nm 28 nm 28 nm 28 nm 28 nm
Memory Capacity (Kb) 396 16 64 64 36 131 163.5
Bit Cell 6T 8T/10T SW6T 6T+MUX 4T1T 6T 8T/12T
Macro Area (mm2) ~1.46 0.044 0.212 0.269 1 ~0.28 0.265
Memory Density (Kb/mm2) 271 364 603.8 238 36 468 617
Voltage (V) 0.76–1.1 0.5–0.9 0.6–0.9 0.397–0.9 0.7–0.9 0.9 0.63–0.9
Format BF16 FP8-32 BF16 FP16 INT1-8 INT16 BF16
Compute Type Serial Serial Serial Serial Parallel Parallel Parallel
Frequency (MHz) 250 53–403 190 10–400 25 50–200 195
Row Parallelism per Macro 16 256 16 64 16 64 64
Area Efficiency (TFLOPS/mm2) 0.081 0.005–0.058 0.286 6.1@INT4 0.039 ** 0.2 0.754
Energy Efficiency (TFLOPS/W) 1.43 1.644–0.099 19.8 17.2 4.11 ** 11.62 23.7
Accuracy - - 75.07%@Cifar100 80.86%@ImageNet 67.11%@ImageNet 77.42%@ImageNet 81.04%@ImageNet
EED * 0.39 0.04–0.6 11.95 4.09 0.148 5.44 14.6

* EED = energy efficiency × memory density. ** Scale to INT 8.

References

1. Choquette, J.; Giroux, O.; Foley, D. Volta: Performance and Programmability. IEEE Micro; 2018; 38, pp. 42-52. [DOI: https://dx.doi.org/10.1109/MM.2018.022071134]

2. Si, X.; Tu, Y.-N.; Huang, W.-H.; Su, J.-W.; Lu, P.-J.; Wang, J.-H.; Liu, T.-W.; Wu, S.-Y.; Liu, R.; Chou, Y.-C. . A Local Computing Cell and 6T SRAM-Based Computing-in-Memory Macro With 8-b MAC Operation for Edge AI Chips. IEEE J. Solid-State Circuits; 2021; 56, pp. 2817-2831. [DOI: https://dx.doi.org/10.1109/JSSC.2021.3073254]

3. Yuan, Y.; Yang, Y.; Wang, X.; Li, X.; Ma, C.; Chen, Q.; Tang, M.; Wei, X.; Hou, Z.; Zhu, J. . 34.6 A 28nm 72.12TFLOPS/W Hybrid-Domain Outer-Product Based Floating-Point SRAM Computing-in-Memory Macro with Logarithm Bit-Width Residual ADC. Proceedings of the 2024 IEEE International Solid-State Circuits Conference (ISSCC); San Francisco, CA, USA, 18–22 February 2024; Volume 67, pp. 576-578.

4. Su, J.-W.; Chou, Y.-C.; Liu, R.; Liu, T.-W.; Lu, P.-J.; Wu, P.-C.; Chung, Y.-L.; Hong, L.-Y.; Ren, J.-S.; Pan, T. . A 8-b-Precision 6T SRAM Computing-in-Memory Macro Using Segmented-Bitline Charge-Sharing Scheme for AI Edge Chips. IEEE J. Solid-State Circuits; 2023; 58, pp. 877-892. [DOI: https://dx.doi.org/10.1109/JSSC.2022.3199077]

5. Chen, Z.; Wen, Z.; Wan, W.; Reddy Pakala, A.; Zou, Y.; Wei, W.-C.; Li, Z.; Chen, Y.; Yang, K. PICO-RAM: A PVT-Insensitive Analog Compute-In-Memory SRAM Macro with In Situ Multi-Bit Charge Computing and 6T Thin-Cell-Compatible Layout. IEEE J. Solid-State Circuits; 2025; 60, pp. 308-320. [DOI: https://dx.doi.org/10.1109/JSSC.2024.3422826]

6. Kim, H.; Yoo, T.; Kim, T.T.-H.; Kim, B. Colonnade: A Reconfigurable SRAM-Based Digital Bit-Serial Compute-In-Memory Macro for Processing Neural Networks. IEEE J. Solid-State Circuits; 2021; 56, pp. 2221-2233. [DOI: https://dx.doi.org/10.1109/JSSC.2021.3061508]

7. Fujiwara, H.; Mori, H.; Zhao, W.-C.; Chuang, M.-C.; Naous, R.; Chuang, C.-K.; Hashizume, T.; Sun, D.; Lee, C.-F.; Akarvardar, K. . A 5-Nm 254-TOPS/W 221-TOPS/Mm2 Fully-Digital Computing-in-Memory Macro Supporting Wide-Range Dynamic-Voltage-Frequency Scaling and Simultaneous MAC and Write Operations. Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC); San Francisco, CA, USA, 20–26 February 2022; Volume 65, pp. 1-3.

8. Agrawal, A.; Jaiswal, A.; Lee, C.; Roy, K. X-SRAM: Enabling In-Memory Boolean Computations in CMOS Static Random Access Memories. IEEE Trans. Circuits Syst. I Regul. Pap.; 2018; 65, pp. 4219-4232. [DOI: https://dx.doi.org/10.1109/TCSI.2018.2848999]

9. Chih, Y.-D.; Lee, P.-H.; Fujiwara, H.; Shih, Y.-C.; Lee, C.-F.; Naous, R.; Chen, Y.-L.; Lo, C.-P.; Lu, C.-H.; Mori, H. . 16.4 An 89TOPS/W and 16.3TOPS/Mm2 All-Digital SRAM-Based Full-Precision Compute-In Memory Macro in 22nm for Machine-Learning Edge Applications. Proceedings of the 2021 IEEE International Solid-State Circuits Conference (ISSCC); San Francisco, CA, USA, 13–22 February 2021; Volume 64, pp. 252-254.

10. Wang, D.; Lin, C.-T.; Chen, G.K.; Knag, P.; Krishnamurthy, R.K.; Seok, M. DIMC: 2219TOPS/W 2569F2/b Digital In-Memory Computing Macro in 28nm Based on Approximate Arithmetic Hardware. Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC); San Francisco, CA, USA, 20–26 February 2022; Volume 65, pp. 266-268.

11. Lee, C.-F.; Lu, C.-H.; Lee, C.-E.; Mori, H.; Fujiwara, H.; Shih, Y.-C.; Chou, T.-L.; Chih, Y.-D.; Chang, T.-Y.J. A 12nm 121-TOPS/W 41.6-TOPS/Mm2 All Digital Full Precision SRAM-Based Compute-in-Memory with Configurable Bit-Width for AI Edge Applications. Proceedings of the 2022 IEEE Symposium on VLSI Technology and Circuits (VLSI Technology and Circuits); Honolulu, HI, USA, 12–17 June 2022; pp. 24-25.

12. He, Y.; Fan, S.; Li, X.; Lei, L.; Jia, W.; Tang, C.; Li, Y.; Huang, Z.; Du, Z.; Yue, J. . 34.7 A 28nm 2.4Mb/Mm2 6.9–16.3TOPS/Mm2 eDRAM-LUT-Based Digital-Computing-in-Memory Macro with In-Memory Encoding and Refreshing. Proceedings of the 2024 IEEE International Solid-State Circuits Conference (ISSCC); San Francisco, CA, USA, 18–22 February 2024; Volume 67, pp. 578-580.

13. Antolini, A.; Lico, A.; Zavalloni, F.; Scarselli, E.F.; Gnudi, A.; Torres, M.L.; Canegallo, R.; Pasotti, M. A Readout Scheme for PCM-Based Analog in-Memory Computing with Drift Compensation Through Reference Conductance Tracking. IEEE Open J. Solid-State Circuits Soc.; 2024; 4, pp. 69-82. [DOI: https://dx.doi.org/10.1109/OJSSCS.2024.3432468]

14. Moorthii, J.C.; Mourya, M.V.; Bansal, H.; Verma, D.; Suri, M. RRAM IMC Based Efficient Analog Carry Propagation and Multi-Bit MVM. Proceedings of the 2024 8th IEEE Electron Devices Technology & Manufacturing Conference (EDTM); Bangalore, India, 3–6 March 2024; pp. 1-3.

15. Devlin, J.; Chang, M.-W.; Lee, K.; Toutanova, K. BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019.

16. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention Is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems; Curran Associates Inc.: Red Hook, NY, USA, 2017; pp. 6000-6010.

17. Kulkarni, A.; Balachandran, V.; Divakaran, D.M.; Das, T. From ML to LLM: Evaluating the Robustness of Phishing Web Page Detection Models against Adversarial Attacks. Digit. Threat.; 2025; 6, pp. 1-25. [DOI: https://dx.doi.org/10.1145/3737295]

18. Wang, J.; Ni, T.; Lee, W.-B.; Zhao, Q. A Contemporary Survey of Large Language Model Assisted Program Analysis. Trans. Artif. Intell.; 2025; 1, pp. 105-129. [DOI: https://dx.doi.org/10.53941/tai.2025.100006]

19. Yang, G.; Lei, J.; Xie, W.; Fang, Z.; Li, Y.; Wang, J.; Zhang, X. Algorithm/Hardware Codesign for Real-Time On-Satellite CNN-Based Ship Detection in SAR Imagery. IEEE Trans. Geosci. Remote Sens.; 2022; 60, 5226018. [DOI: https://dx.doi.org/10.1109/TGRS.2022.3161499]

20. Automotive Radar Processing with Spiking Neural Networks: Concepts and Challenges. Front. Neurosci.; 2022; Available online: https://www.frontiersin.org/journals/neuroscience/articles/10.3389/fnins.2022.851774/full (accessed on 10 October 2025).

21. Lee, J.; Kim, J.; Jo, W.; Kim, S.; Kim, S.; Lee, J.; Yoo, H.-J. A 13.7 TFLOPS/W Floating-Point DNN Processor Using Heterogeneous Computing Architecture with Exponent-Computing-in-Memory. Proceedings of the 2021 Symposium on VLSI Circuits; Kyoto, Japan, 13–19 June 2021; pp. 1-2.

22. Jeong, S.; Park, J.; Jeon, D. A 28nm 1.644TFLOPS/W Floating-Point Computation SRAM Macro with Variable Precision for Deep Neural Network Inference and Training. Proceedings of the ESSCIRC 2022-IEEE 48th European Solid State Circuits Conference (ESSCIRC); Milan, Italy, 19–22 September 2022; pp. 145-148.

23. Tu, F.; Wang, Y.; Wu, Z.; Liang, L.; Ding, Y.; Kim, B.; Liu, L.; Wei, S.; Xie, Y.; Yin, S. A 28nm 29.2TFLOPS/W BF16 and 36.5TOPS/W INT8 Reconfigurable Digital CIM Processor with Unified FP/INT Pipeline and Bitwise In-Memory Booth Multiplication for Cloud Deep Learning Acceleration. Proceedings of the 2022 IEEE International Solid-State Circuits Conference (ISSCC); San Francisco, CA, USA, 20–26 February 2022; Volume 65, pp. 1-3.

24. Saikia, J.; Sridharan, A.; Yeo, I.; Venkataramanaiah, S.; Fan, D.; Seo, J.-S. FP-IMC: A 28nm All-Digital Configurable Floating-Point In-Memory Computing Macro. Proceedings of the ESSCIRC 2023-IEEE 49th European Solid State Circuits Conference (ESSCIRC); Lisbon, Portugal, 11–14 September 2023; pp. 405-408.

25. Li, M.; Zhu, H.; He, S.; Zhang, H.; Liao, J.; Zhai, D.; Chen, C.; Liu, Q.; Zeng, X.; Sun, N. . SLAM-CIM: A Visual SLAM Backend Processor with Dynamic-Range-Driven-Skipping Linear-Solving FP-CIM Macros. IEEE J. Solid-State Circuits; 2024; 59, pp. 3853-3865. [DOI: https://dx.doi.org/10.1109/JSSC.2024.3402808]

26. Wang, X.; Jiao, T.; Yang, Y.; Li, S.; Li, D.; Guo, A.; Shi, Y.; Tang, Y.; Chen, J.; Zhang, Z. . 14.3 A 28nm 17.83-to-62.84TFLOPS/W Broadcast-Alignment Floating-Point CIM Macro with Non-Two’s-Complement MAC for CNNs and Transformers. Proceedings of the 2025 IEEE International Solid-State Circuits Conference (ISSCC); San Francisco, CA, USA, 16–20 February 2025; Volume 68, pp. 254-256.

27. Tu, F.; Wang, Y.; Wu, Z.; Liang, L.; Ding, Y.; Kim, B.; Liu, L.; Wei, S.; Xie, Y.; Yin, S. ReDCIM: Reconfigurable Digital Computing- In -Memory Processor with Unified FP/INT Pipeline for Cloud AI Acceleration. IEEE J. Solid-State Circuits; 2023; 58, pp. 243-255. [DOI: https://dx.doi.org/10.1109/JSSC.2022.3222059]

28. Wu, P.-C. 8:30 AM 7.1 A 22nm 832Kb Hybrid-Domain Floating-Point SRAM In-Memory-Compute Macro with 16.2-70.2TFLOPS/W for High-Accuracy AI-Edge Devices. Proceedings of the 2023 IEEE International Solid-State Circuits Conference (ISSCC); San Francisco, CA, USA, 19–23 February 2023.

29. Wu, P.-C.; Khwa, W.-S.; Wu, J.-J.; Su, J.-W.; Jhang, C.-J.; Chen, H.-Y.; Ke, Z.-E.; Chiu, T.-C.; Hsu, J.-M.; Cheng, C.-Y. . An Integer-Floating-Point Dual-Mode Gain-Cell Computing-in-Memory Macro for Advanced AI Edge Chips. IEEE J. Solid-State Circuits; 2025; 60, pp. 158-170. [DOI: https://dx.doi.org/10.1109/JSSC.2024.3470215]

30. Hu, X.; Wang, Y.; Ma, Z.; Wen, G.; Wang, Z.; Lu, Z.; Liu, Y.; Li, Y.; Liang, X.; Zeng, X. . An 8.8 TFLOPS/W Floating-Point RRAM-Based Compute-in-Memory Macro Using Low Latency Triangle-Style Mantissa Multiplication. IEEE Trans. Circuits Syst. II Express Briefs; 2023; 70, pp. 4216-4220. [DOI: https://dx.doi.org/10.1109/TCSII.2023.3283418]

31. Liu, L.; Tan, L.; Gan, J.; Pan, B.; Zhou, J.; Li, Z. MDCIM: MRAM-Based Digital Computing-in-Memory Macro for Floating-Point Computation with High Energy Efficiency and Low Area Overhead. Appl. Sci.; 2023; 13, 11914. [DOI: https://dx.doi.org/10.3390/app132111914]

32. Dosovitskiy, A.; Beyer, L.; Kolesnikov, A.; Weissenborn, D.; Zhai, X.; Unterthiner, T.; Dehghani, M.; Minderer, M.; Heigold, G.; Gelly, S. . An Image Is Worth 16x16 Words: Transformers for Image Recognition at Scale. arXiv; 2021; [DOI: https://dx.doi.org/10.48550/arXiv.2010.11929] arXiv: 2010.11929

33. Goto, G. High Speed Digital Parallel Multiplier. U.S. Patent No. 5,465,226, 7 November 1995.

34. Yuce, B.; Ugurdag, H.F.; Gören, S.; Dündar, G. Fast and Efficient Circuit Topologies for Finding the Maximum of n K-Bit Numbers. IEEE Trans. Comput.; 2014; 63, pp. 1868-1881. [DOI: https://dx.doi.org/10.1109/TC.2014.2315634]

35. Fu, X.; Ren, Q.; Wu, H.; Xiang, F.; Luo, Q.; Yue, J.; Chen, Y.; Zhang, F. P3 ViT: A CIM-Based High-Utilization Architecture With Dynamic Pruning and Two-Way Ping-Pong Macro for Vision Transformer. IEEE Trans. Circuits Syst. I Regul. Pap.; 2023; 70, pp. 4938-4948. [DOI: https://dx.doi.org/10.1109/TCSI.2023.3315060]

36. Guo, A.; Dong, X.; Dong, F.; Li, D.; Zhang, Y.; Zhang, J.; Yang, J.; Si, X. A 28 Nm 128-Kb Exponent-and Mantissa-Computation-In-Memory Dual-Macro for Floating-Point and INT CNNs. IEEE J. Solid-State Circuits; 2025; 60, pp. 3639-3654. [DOI: https://dx.doi.org/10.1109/JSSC.2025.3558322]

37. Yan, S.; Yue, J.; He, C.; Wang, Z.; Cong, Z.; He, Y.; Zhou, M.; Sun, W.; Li, X.; Dou, C. . A 28-Nm Floating-Point Computing-in-Memory Processor Using Intensive-CIM Sparse-Digital Architecture. IEEE J. Solid-State Circuits; 2024; 59, pp. 2630-2643. [DOI: https://dx.doi.org/10.1109/JSSC.2024.3363871]

38. Zhao, C.; Fang, J.; Huang, X.; Chen, D.; Guo, Z.; Jiang, J.; Wang, J.; Yang, J.; Han, J.; Zhou, P. . A 28-Nm 36 Kb SRAM CIM Engine With 0.173 Μm2 4T1T Cell and Self-Load-0 Weight Update for AI Inference and Training Applications. IEEE J. Solid-State Circuits; 2024; 59, pp. 3277-3289. [DOI: https://dx.doi.org/10.1109/JSSC.2024.3399615]

© 2025 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).