Q2SV: A High‐Level Synthesis Approach for State Vector Quantum Simulation

Abstract

This study presents Q2SV, an FPGA‐based quantum state vector simulator implemented using high‐level synthesis (HLS), capable of simulating quantum circuits with up to 29 qubits. Using FPGA parallelism, efficient memory allocation, and hardware‐optimized execution pipelines, Q2SV achieves scalable quantum simulation without requiring iterative resynthesis. The system’s workflow processes OpenQASM circuits by providing a flexible and general‐purpose quantum state vector simulation approach. Experimental evaluations demonstrate significant reductions in execution time and storage requirements, positioning FPGAs as a possible alternative to GPUs and CPUs for large‐scale quantum circuit emulation. While not yet matching the raw computational power of high‐end CPUs and GPUs in all cases, this work establishes a foundational framework for future optimizations in hardware‐accelerated quantum simulation. This work provides an adaptable HLS‐based code that serves as a template that paves the way for enhanced memory management, parallel processing, and architecture‐specific optimizations, enabling more efficient FPGA‐based quantum simulations in the future.


1. Introduction

Quantum state vector (SV) simulators are critical tools for the verification and validation of quantum algorithms using classical computing systems. These simulators compute the full wavefunction of a quantum system, representing its state with a vector of 2ⁿ complex amplitudes, where n is the number of qubits. Although SV simulators are foundational for small-scale quantum systems, their memory and computational requirements grow exponentially with the number of qubits.

Modern SV simulators primarily rely on high-performance computing (HPC) architectures such as CPUs and GPUs to push the limits of qubit scalability. Platforms such as the Qiskit Aer simulator [1], NVIDIA’s cuQuantum [2], and Google’s SV frameworks [3] demonstrate the capabilities of modern HPC platforms to achieve efficient quantum simulations. These frameworks leverage the parallel processing power of GPUs and the computational strengths of multicore CPUs to handle the exponential growth of quantum SV size. GPUs, in particular, have been a preferred platform due to their inherent parallelism and high memory bandwidth. While simulating large-scale quantum systems on medium to low-end processors or GPUs remains challenging, high-performance CPUs and GPUs have demonstrated significant capabilities for SV simulations. However, studies [4, 5] show that even high-end computing platforms face memory exhaustion and bandwidth constraints as the storage requirement grows exponentially with the number of qubits. GPU implementations, despite their parallelism, are limited by memory bandwidth and the overhead of distributed computation, as demonstrated in research by NVIDIA’s cuQuantum SV simulation frameworks [6]. These observations suggest that while CPUs and GPUs are dominant contenders, alternative platforms such as field-programmable gate arrays (FPGAs), if given focused attention, could potentially achieve greater computational efficiency and scalability, addressing current bottlenecks and unlocking new possibilities for large-scale quantum simulations.

FPGAs and application-specific integrated circuits (ASICs) are emerging as alternative hardware solutions for quantum SV simulations. For example, Mowad et al. [7] and Wei et al. [8] have explored FPGA-based simulators, demonstrating their potential for energy efficiency and hardware-level optimizations, respectively. However, these platforms have not been widely adopted or scaled to achieve the flexibility and general-purpose simulation required for large-scale SV operations. Most studies leveraging FPGAs focus on specific quantum algorithms, circuit optimizations, or fixed-purpose tasks rather than the full general-purpose simulation of quantum circuits. Existing FPGA implementations are often limited in flexibility, constrained by their inability to adapt to arbitrary circuits or scale to a large number of qubits. For example, [9] highlights how FPGA designs can be optimized for a quantum kernel estimation algorithm, but often at the cost of adaptability to general-purpose simulation. Furthermore, the study by Zhang et al. [5] identifies resource constraints and memory bandwidth limitations as primary bottlenecks that prevent FPGAs from scaling to larger qubit systems. These limitations emphasize the need for more versatile and scalable FPGA-based implementations. Consequently, FPGAs and ASICs have yet to be considered serious alternatives to conventional computing architectures despite offering multiple advantages, including high power efficiency, low latency, and customizable parallelism.

This study presents the first-ever implementation of a quantum SV simulator on an FPGA capable of simulating circuits with up to 29 qubits. This work also demonstrates the feasibility of achieving large-scale SV simulations purely through efficient FPGA hardware design using high-level synthesis (HLS). HLS is a process used in electronic design automation (EDA) that takes a high-level description of a digital system, typically written in a programming language like C or C++, and converts it into a register-transfer level (RTL) implementation suitable for hardware synthesis on devices such as FPGAs or ASICs. Previous FPGA-based simulators [10, 11] have shown potential but remain constrained by scalability or circuit flexibility. This work builds on these studies, surpassing existing qubit limits and addressing design limitations by achieving a 29-qubit SV simulation. This work highlights the novelty of using hardware efficiency to unlock scalability through HLS-driven optimizations, setting a new benchmark for FPGA-based quantum simulators. The implementation itself is highly flexible and capable of simulating any quantum circuit in QASM form as long as it does not include mid-circuit measurements.

Open Quantum Assembly Language (OpenQASM) is a low-level, hardware-agnostic programming language that describes quantum circuits (Cross et al., 2017). It functions similarly to an assembly language in classical computing, providing a clear, machine-readable way to specify quantum operations, such as qubit initialization, quantum gate applications, and measurements. Initially developed by IBM for its quantum computing platforms, OpenQASM allows researchers and developers to create, simulate, and execute quantum algorithms on various quantum devices or simulators.

The significance of this study lies in its ability to push the boundaries of FPGA-based quantum SV simulators to unprecedented qubit scales while maintaining the ability to simulate any transpiled quantum circuit in OpenQASM format. By achieving this milestone, this work establishes FPGAs as a competitive and scalable platform for SV simulations, offering a practical alternative to GPUs and CPUs. Moreover, the study identifies and analyzes key bottlenecks in the FPGA design, providing a foundation for future improvements that could further enhance its performance and scalability. This work not only showcases the untapped potential of FPGAs but also lays the groundwork for more energy-efficient and cost-effective quantum SV simulators by publishing the HLS code that could be reused as a template for later optimizations.

This study fits within the broader landscape of quantum SV simulators as a pioneering effort in utilizing FPGA hardware for large-scale simulations. While existing studies predominantly focus on algorithmic optimizations and HPC-based implementations, this work bridges the gap by demonstrating that HLS advancements can unlock new possibilities for quantum simulation scalability.

The contributions of this study are summarized as follows:

•
Development of an Open-Source Template: This study presents an open-source template designed to serve as a foundational framework for researchers developing a general-purpose SV quantum computing simulator using readable HLS code.
•
Integration of OpenQASM with FPGA Workflows: The study demonstrates a streamlined method for converting the widely adopted OpenQASM file format into a processable representation, enabling its direct utilization in FPGA-based quantum circuit simulation.
•
Advancement in FPGA Qubit Capacity: This work pushes the boundaries of FPGA capability by significantly increasing the number of qubits that can be handled in a general-purpose quantum SV simulator. Notably, the FPGA synthesis is performed once, allowing the hardware to support a wide variety of quantum algorithms without requiring resynthesis.

2. Background

2.1. Quantum Circuit Simulation

Several practical challenges arise when trying to simulate quantum circuits on classical machines:

2.1.1. Error Propagation

Floating-point rounding errors are a fundamental issue in classical simulations of quantum systems due to the limitations of IEEE 754 arithmetic. Since quantum states are represented as complex vectors with continuous values, truncation and rounding errors occur when performing arithmetic operations, leading to deviations from the ideal quantum state evolution [12]. These errors accumulate with each quantum gate application, particularly in deep circuits, gradually affecting simulation accuracy. Moreover, precision loss in complex arithmetic exacerbates the problem, as real and imaginary parts are stored separately, causing minor inconsistencies in unitary operations [13]. Such small inaccuracies can build up with more operational gates applied, making it crucial to use higher precision arithmetic or error-mitigation techniques in quantum simulators.
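The accumulation of rounding error can be illustrated with a small experiment that is independent of Q2SV. The sketch below repeatedly applies a Hadamard gate to a single-qubit state and returns how far the squared norm has drifted from 1, which would be exactly zero in exact arithmetic. The function name `norm_drift_after_hadamards` is hypothetical and chosen only for this illustration.

```cpp
#include <cassert>
#include <cmath>
#include <complex>

// Illustrative only: applies `count` Hadamard gates to |0> using real type T
// and returns | <psi|psi> - 1 |, which is zero in exact arithmetic.
template <typename T>
T norm_drift_after_hadamards(int count) {
    assert(count >= 0);
    const T s = T(1) / std::sqrt(T(2));  // 1/sqrt(2), rounded to type T
    std::complex<T> a0(1, 0), a1(0, 0);  // amplitudes of |0> and |1>
    for (int k = 0; k < count; ++k) {
        const std::complex<T> b0 = s * (a0 + a1);
        const std::complex<T> b1 = s * (a0 - a1);
        a0 = b0;
        a1 = b1;
    }
    return std::abs(std::norm(a0) + std::norm(a1) - T(1));
}
```

Instantiating the template with `float` and with `double` for the same gate count gives a direct comparison of how the two precisions drift as circuit depth grows.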

2.1.2. Hardware Constraints, Size, and Computational Complexity

The complexity of the resulting unitary matrix increases exponentially with the number of qubits. For n qubits, the matrix to be computed and stored has 2ⁿ × 2ⁿ entries, so each additional qubit quadruples its size. This exponential growth poses significant challenges for both classical simulation and quantum hardware implementation [14].

Simulating quantum computing on classical hardware faces significant constraints due to the exponential growth of the number of quantum states. An n-qubit system requires storing 2ⁿ complex amplitudes, making memory a critical bottleneck. For example, a 30-qubit SV already requires over 8 GB of RAM using single precision, while a 41-qubit system exceeds a terabyte. Additionally, computational complexity scales exponentially, as applying a single gate involves matrix-vector multiplication with O(2ⁿ) operations. This rapid resource growth limits simulations to around 30–40 qubits on high-performance classical supercomputers, with further limitations imposed by floating-point precision and parallelization inefficiencies in distributed systems [15].

As the number of qubits in a quantum system increases, the memory (DRAM) requirement scales exponentially. This is because the SV of an n-qubit quantum system requires 2ⁿ complex amplitudes, and each amplitude is stored as a pair of floating-point numbers (real and imaginary parts, in single or half precision). Table 1 illustrates the rapid growth in memory demand, emphasizing the challenge of simulating quantum systems beyond a few dozen qubits. For instance, at 35 qubits, even in half precision, the required memory is around 137.44 GB, making classical simulation of larger quantum systems increasingly impractical.

Table 1 Memory requirements for quantum state representation with increasing qubits.

Qubits	Single precision (8 B per amplitude)	Half precision (4 B per amplitude)
15	262.14 kB	131.07 kB
16	524.29 kB	262.14 kB
17	1.05 MB	524.29 kB
18	2.10 MB	1.05 MB
…	…	…
27	1.07 GB	536.87 MB
28	2.15 GB	1.07 GB
29	4.29 GB	2.15 GB
30	8.59 GB	4.29 GB
31	17.18 GB	8.59 GB
32	34.36 GB	17.18 GB
33	68.72 GB	34.36 GB
34	137.44 GB	68.72 GB
35	274.88 GB	137.44 GB
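The entries of Table 1 follow from a one-line formula: 2ⁿ amplitudes, each stored as two real numbers. The helper below is a minimal sketch of that arithmetic; the name `state_vector_bytes` is hypothetical and is not part of the published HLS code.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to hold a full n-qubit state vector when each complex
// amplitude is stored as two real numbers of `bytes_per_real` bytes
// (4 for single precision, 2 for half precision).
std::uint64_t state_vector_bytes(unsigned qubits, unsigned bytes_per_real) {
    assert(qubits < 59 && bytes_per_real <= 8);  // keeps the product within 64 bits
    return (std::uint64_t{1} << qubits) * 2u * bytes_per_real;
}
```

For example, `state_vector_bytes(29, 4)` evaluates to 4,294,967,296 bytes, the 4.29 GB listed for 29 qubits in single precision.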

Despite these challenges, simulating quantum circuits on classical machines provides several advantages. Classical simulations allow for thorough verification and debugging of quantum algorithms before deploying them on quantum hardware, where noise and decoherence can introduce significant errors. They also enable researchers to test how well their error correction and qubit mapping schemes are performing by comparing their results to classical quantum computations under near-ideal conditions. Furthermore, classical simulations provide a crucial benchmark for evaluating quantum supremacy, helping to determine when quantum processors outperform their classical counterparts in practical tasks [16]. High-performance classical simulations remain an essential tool for developing quantum computing, enabling advancements in both hardware and algorithmic design.

2.2. FPGA Platforms and Performance

An FPGA-based datacenter accelerator card is a specialized hardware device designed to offload and accelerate compute-intensive tasks in data centers by leveraging the reprogrammable and highly parallel nature of FPGAs. Unlike fixed-function ASICs or general-purpose CPUs, FPGAs provide a flexible architecture that can be dynamically reconfigured to optimize performance for diverse workloads, such as data analytics, real-time financial modeling, and HPC applications [17]. Their energy-efficient execution model and customizable computational pipelines make them particularly attractive for large-scale cloud deployments, where workload heterogeneity demands adaptable processing resources. Modern FPGA-based accelerator cards incorporate high-bandwidth memory (HBM) to mitigate data movement bottlenecks, with capacities ranging from 8 GB in compact designs to 32 GB or more in high-end configurations. For example, devices such as the AMD Alveo U55C and the BittWare IA-860m leverage HBM2e with bandwidths exceeding 400 GB/s, significantly enhancing throughput for memory-intensive applications [18, 19]. The reconfigurability of FPGAs provides a crucial advantage for adapting to evolving workloads, allowing new computational kernels to be implemented without hardware replacement. This adaptability, coupled with their high throughput and energy efficiency, has driven their adoption in cloud computing, edge AI, and scientific computing, where application-specific optimizations yield substantial performance gains over general-purpose architectures [20].

3. Related Works

3.1. FPGA in Quantum Computing

Even before the more recent quantum computing boom, the FPGA approach to quantum computing had already seen some early publications [21, 22]. HLS has been involved in quantum computing emulation in FPGAs since earlier studies [23]. A breakthrough in the number of qubits (20 qubits for QFT and 30 qubits for quantum Haar transform [QHT]) had come to light just as quantum computing had become a topic of high interest globally [11]. The referenced study has been widely cited as an example of why FPGAs should not be ruled out as one of the main mediums for simulating quantum computation. The two principal authors of the study mentioned above [11] published a subsequent study that expanded the number of qubits to 32 while running their C2Q algorithm [24].

A study published as a report in Nature presented FPGAs as a supplementary computing resource that can perform clusters of small quantum computations to help the main computing node [9]. Such a study from a well-known publisher invigorated the academic community’s interest in involving FPGAs in quantum computing and inspired a plethora of new studies [25, 26], one of which also happens to be a hybrid approach [10]. Our study uses HLS, as Silva and Zabaleta did earlier [23], to construct a scalable SV simulator (up to 29 qubits) that can run circuits in OpenQASM format that have been transpiled into single-qubit and controlled-NOT (CNOT) gates. This level of transpilation is common before a quantum circuit is sent to a quantum processing unit (QPU) because almost all modern QPUs do not perform well on operations that involve more than two qubits.

Table 2 summarizes the related studies and is color-coded concerning how they relate to our proposed system. Red represents studies that pushed the boundaries of the number of qubits an FPGA could run. Green represents hybrid CPU/GPU–FPGA systems that have demonstrated how their approach would outperform nonhybrid approaches. Blue are studies that use HDL/HLS to solve the problem and present the scientific community with open-source material that could be used for others to build on.

Table 2 Related work summarized: reds are studies that push the boundaries of the number of qubits, greens use FPGAs in hybrid quantum simulation systems, and blues use HLS/HDL to implement quantum computing simulations.

The following paragraphs summarize the studies that have been cited:

a.
An FPGA-Based Quantum Computing Emulation Framework Based on Serial–Parallel Architecture [21]: A framework leveraging a serial–parallel hardware architecture to manage the increasing complexity of quantum circuits is proposed. The study demonstrated the efficient emulation of quantum Fourier transform (QFT) and Grover’s search algorithm, with experimental results validating the framework’s scalability and reduced resource utilization.
b.
FPGA Quantum Computing Emulator Using High-Level Design Tools [23]: An emulator implemented using Vivado HLS is introduced, focusing on QFT. Using FPGA’s parallel processing capabilities, the authors demonstrated improved emulation efficiency compared to traditional software simulations. High-level design tools simplified resource management and the overall emulation process.
c.
Scaling Reconfigurable Emulation of Quantum Algorithms at High Precision and High Throughput [11]: FPGA-based emulation models are explored for quantum circuits. The study implemented QFT and QHT using on-chip-generated operational matrices, demonstrating enhanced scalability and reduced resource usage. The implementations are algorithm-specific, meaning that the FPGA would require reconfiguration if another algorithm is to be executed.
d.
Quantum AI Simulator Using a Hybrid CPU–FPGA Approach [9]: A hybrid architecture that integrates CPUs and FPGAs to simulate quantum circuits is presented. This approach demonstrated significant performance improvements, showcasing the potential of hybrid platforms in quantum computing research.
e.
Towards Complete and Scalable Emulation of Quantum Algorithms on High-Performance Reconfigurable Computers [24]: A scalable emulation framework is proposed leveraging reconfigurable hardware. This study facilitated the development and testing of quantum algorithms on classical hardware, emphasizing scalability and efficient resource utilization.
f.
Highly Optimized Quantum Circuits Synthesized via Data-Flow Engines [10]: FPGA-based data-flow engines are introduced for synthesizing variational quantum circuits. Their method minimized gate depth by 97% compared to Qiskit-generated circuits, achieving near-unity fidelity with minimal error, enabling efficient emulation of quantum programs with up to nine qubits.
g.
Project and Implementation of a Quantum Logic Gate Emulator on FPGA Using a Model-Based Design Approach [25]: An FPGA-based quantum logic gate emulator is described using model-based design. This methodology simplified development workflows, enabling rapid prototyping and testing of quantum algorithms.
h.
Benchmarking Matrix Multiplications for Variable Qubit Size and Depth [27]: Parallelism in quantum circuit simulations is explored, proposing a scalable FPGA architecture. Their implementation on Intel Agilex FPGAs demonstrated enhanced performance and scalability for larger quantum circuits, addressing challenges in gate evaluation and state routing.
i.
A Scalable FPGA Architecture for Quantum Computing Simulation [26]: The study focuses on designing scalable FPGA architectures to handle the increasing demands of quantum circuit simulations. It highlighted innovations in parallel processing and resource management, showcasing significant advancements in FPGA-based emulation efficiency.

3.1.1. Other Studies

a.
FPGA-Accelerated Quantum Computing Emulation and Quantum Key Distillation [28]: It discusses the potential and advances of FPGA-based systems to emulate quantum computing and accelerate quantum key distillation processes. It highlights the benefits of FPGAs in quantum information processing, particularly due to their adaptability, deep pipeline parallelism, and support for custom precision operations. The paper reviews the recent developments in this field, outlining both the challenges and promising research opportunities, particularly focusing on the practical implementations and optimizations that FPGAs facilitate in the realm of quantum computing and secure quantum communications.
b.
Open-source software in quantum computing [29]: It provides a comprehensive review of open-source tools available for quantum computing. The paper discusses the evolution and impact of open-source software in the field, categorizes tools based on their functionality within the quantum computing stack, and evaluates them on various aspects, such as documentation, license compliance, and community engagement. The authors argue that, while the diversity of projects enriches the field, many lack active community involvement and robust development practices. The paper emphasizes the potential of open source to accelerate innovation in quantum computing, akin to its role in machine learning, but notes the need for improved standards and practices to maximize its benefits. As extensive and detailed as this study is, it was published more than 6 years ago, making it a little outdated. Since the publication of Fingerhuth et al.’s 2018 review, the open-source quantum computing landscape has expanded significantly. Major corporations have released comprehensive SDKs, such as IBM’s Qiskit [1] and Google’s Cirq [30], fostering active communities and accelerating development. Platforms like Quantinuum’s TKET [31] also have contributed to a more robust and collaborative ecosystem.

4. Quantum Computing Simulation Methods

Quantum computing simulators employ various techniques to model and analyze quantum systems. Among these, unitary simulation, SV simulation, density-matrix (DM) simulation, and tensor-network (TN) simulation are prominent methods, each with unique advantages and limitations.

4.1. Unitary Simulation

A unitary simulation focuses on evolving quantum states using unitary operators without explicitly forming the full DM. This approach is most suitable for closed quantum systems where dissipation and decoherence can be neglected, as well as for analyzing circuits, where the input state is fixed or irrelevant to the study at hand (e.g., circuit optimization or structural analysis). Unitary methods can significantly reduce computational overhead compared to DM simulations, enabling faster simulations of certain quantum systems. For instance, unitary simulation has been shown to be over two orders of magnitude faster than corresponding DM methods for specific applications [32]. In a unitary simulation, one typically computes and stores the cumulative gate operation by multiplying the gate matrices together:

$$U_{\text{total}} = U_{m} U_{m-1} \cdots U_{2} U_{1}, \tag{1}$$

where each U_i is a 2ⁿ × 2ⁿ matrix acting on n qubits. The final SV is then obtained by applying U_total to the initial state:

$$|\psi_{\text{final}}\rangle = U_{\text{total}} |\psi_{\text{initial}}\rangle. \tag{2}$$

However, because U_i and U_total generally have dimension 2ⁿ × 2ⁿ, the memory requirements and computational complexity can become prohibitive for larger n. Moreover, the full-matrix multiplication approach is especially impractical in resource-constrained environments such as FPGAs, which often have limited on-chip memory and bandwidth. As a result, storing and manipulating large matrices makes unitary simulation challenging for systems of moderate or large qubit numbers on such platforms.
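To make the ordering in Equations (1) and (2) concrete, the sketch below composes gates for the smallest possible case, a single qubit, where each U_i is only 2 × 2. It is an illustration of the unitary method and not part of Q2SV; the names `Mat2`, `multiply`, `compose`, and `apply_unitary` are hypothetical. The first gate of the circuit ends up as the rightmost factor of the product.

```cpp
#include <array>
#include <complex>
#include <vector>

using Amp = std::complex<double>;
using Mat2 = std::array<Amp, 4>;  // row-major 2x2 matrix

// Product a*b of two 2x2 matrices.
Mat2 multiply(const Mat2& a, const Mat2& b) {
    return {a[0] * b[0] + a[1] * b[2], a[0] * b[1] + a[1] * b[3],
            a[2] * b[0] + a[3] * b[2], a[2] * b[1] + a[3] * b[3]};
}

// Equation (1): U_total = U_m ... U_2 U_1, where gates[0] is applied first.
Mat2 compose(const std::vector<Mat2>& gates) {
    Mat2 total = {Amp(1), Amp(0), Amp(0), Amp(1)};  // start from the identity
    for (const Mat2& u : gates) {
        total = multiply(u, total);  // each later gate multiplies from the left
    }
    return total;
}

// Equation (2): |psi_final> = U_total |psi_initial>.
std::array<Amp, 2> apply_unitary(const Mat2& u, const std::array<Amp, 2>& psi) {
    return {u[0] * psi[0] + u[1] * psi[1], u[2] * psi[0] + u[3] * psi[1]};
}
```

For n qubits the same scheme would need 4ⁿ matrix entries instead of four, which is the storage cost that makes the method impractical on an FPGA.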

4.2. SV Simulation

SV simulation is one of the most straightforward methods for simulating quantum systems. It represents the quantum state as a 2ⁿ-dimensional complex vector. For n qubits, this vector requires 2ⁿ complex amplitudes:

$$|\psi\rangle = \begin{bmatrix} \alpha_{0} \\ \alpha_{1} \\ \vdots \\ \alpha_{2^{n}-1} \end{bmatrix}. \tag{3}$$

Each gate U (a 2ⁿ × 2ⁿ operator) updates the state as follows:

$$|\psi'\rangle = U |\psi\rangle. \tag{4}$$

However, in practice, these gates are usually transpiled into smaller single-qubit or two-qubit operations (i.e., 2 × 2 or 4 × 4 matrices), significantly reducing the memory and computational burden of each gate application. Because the entire circuit can be decomposed into these smaller gates, each gate application scales with the size of the SV, that is, O(2ⁿ) operations, rather than with the size of a full 2ⁿ × 2ⁿ matrix multiplication. Hence, while the SV approach still grows exponentially in the number of qubits, it is more tractable than storing and multiplying large 2ⁿ × 2ⁿ matrices outright. This makes SV simulation a more practical alternative on memory- and bandwidth-constrained platforms, such as FPGAs. It is, therefore, the approach chosen for this work. In particular, we support arbitrary single-qubit and CNOT gates applied to the SV, which allows the simulation of any arbitrary quantum computation.
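The O(2ⁿ) cost of a single-qubit gate can be seen in a plain software version of the update. The following sketch is a host-side illustration under our own assumptions, not the Q2SV kernel: it updates the SV in place by visiting each pair of amplitudes that differ only in the target bit, and the name `apply_single_qubit_gate` is hypothetical.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<float>;

// Applies a 2x2 gate (row-major: g[0] g[1] / g[2] g[3]) to qubit `target`
// of a state vector, touching each amplitude pair exactly once.
void apply_single_qubit_gate(std::vector<Amp>& sv, const Amp g[4], unsigned target) {
    const std::size_t stride = std::size_t{1} << target;
    assert(sv.size() % (2 * stride) == 0);
    for (std::size_t base = 0; base < sv.size(); base += 2 * stride) {
        for (std::size_t k = base; k < base + stride; ++k) {
            const Amp lo = sv[k];           // amplitude with target bit = 0
            const Amp hi = sv[k + stride];  // amplitude with target bit = 1
            sv[k] = g[0] * lo + g[1] * hi;
            sv[k + stride] = g[2] * lo + g[3] * hi;
        }
    }
}
```

Each of the 2ⁿ amplitudes is read and written once per gate, so the work per gate is linear in the SV length, and no 2ⁿ × 2ⁿ matrix is ever formed.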

4.3. DM Simulation

DM simulation extends SV simulation by incorporating mixed states and decoherence effects, which are crucial for modeling realistic quantum systems. This method represents the quantum state as a DM, which can describe probabilistic mixtures of pure states. As in a unitary simulation, the DM of an n-qubit system requires 2ⁿ × 2ⁿ complex numbers. While more powerful in capturing noise and decoherence, DM simulation also suffers from exponential scaling with the number of qubits, leading to high computational costs [33]. The DM approach is mainly used to simulate noisy systems with mixed states. It is unnecessary here because our simulation assumes ideal conditions with no noise and pure states.

4.4. TN Simulation

TN simulation leverages the entanglement structure of quantum states to represent and compute quantum many-body systems efficiently. TNs, such as matrix product states (MPS) and projected entangled pair states (PEPS), provide a compact representation of quantum states, making them well-suited for simulating systems with limited entanglement. TN methods can handle larger systems than SV and DM simulations by exploiting low-rank approximations and efficient contraction schemes [34, 35]. These methods have been successfully applied to simulate quantum circuits and physical systems with substantial computational savings [36].

While TN simulation is another alternative that reduces memory requirements for certain quantum circuits, it is optimized for highly scalable parallel computing systems, such as GPUs and distributed clusters. FPGAs, in contrast, lack the extensive software frameworks and hardware capabilities that GPUs provide for tensor computations. As a result, SV simulation strikes the best balance for FPGA-based quantum computing research, providing sufficient fidelity to simulate quantum circuits while remaining within the hardware’s feasible resource constraints.

5. Proposed System

Our proposed system performs its SV simulation by processing a QASM file that contains single-qubit and CNOT operations into a comma-separated value (CSV) file (as seen in Figure 1), placed in a directory that the datacenter accelerator card could reference. CSV is a plain text file format that organizes data into rows and columns, with individual values on each row separated by commas. Using the Qiskit library, a Python script produces a CSV file containing the size of the quantum circuit and the operational matrix of each gate, along with the target qubit and the control qubit (if needed). The workflow of the system is graphically summarized in Figure 2. The HLS kernel is synthesized once; different circuits are streamed at run-time via a compact opcode list, so no bitstream regeneration is required for each circuit.

Software associated with the FPGA (called the “host”) uses the CSV file to construct a quantum SV and then passes the SV, together with the gate matrices, to the datacenter accelerator card, which iterates over the SV and applies each gate to it in turn. After the datacenter accelerator card performs the last gate operation, a file containing the final SV is created in CSV form. The system is designed to require only one synthesis run to simulate any quantum circuit of 29 qubits or fewer; it simply takes the produced CSV file as input to simulate different quantum circuits.
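The text does not specify the column layout of the CSV file, so the following host-side sketch assumes one purely for illustration: one gate per line, with the target qubit, the control qubit (−1 for a single-qubit gate), and the four complex matrix entries written as real and imaginary pairs. The struct `GateRecord`, the function `parse_gate_record`, and this layout are all hypothetical and may differ from the actual Q2SV file format.

```cpp
#include <array>
#include <complex>
#include <sstream>
#include <stdexcept>
#include <string>

// Hypothetical record layout (an assumption, not the Q2SV file format):
//   target,control,re00,im00,re01,im01,re10,im10,re11,im11
// with control = -1 for a single-qubit gate.
struct GateRecord {
    int target = 0;
    int control = -1;
    std::array<std::complex<float>, 4> matrix{};
};

// Parses one comma-separated line; returns false if a field is missing
// or does not start with a number.
bool parse_gate_record(const std::string& line, GateRecord& out) {
    std::istringstream in(line);
    std::string field;
    float values[10];
    for (float& v : values) {
        if (!std::getline(in, field, ',')) return false;
        try {
            v = std::stof(field);
        } catch (const std::exception&) {
            return false;
        }
    }
    out.target = static_cast<int>(values[0]);
    out.control = static_cast<int>(values[1]);
    for (int k = 0; k < 4; ++k) {
        out.matrix[k] = std::complex<float>(values[2 + 2 * k], values[3 + 2 * k]);
    }
    return true;
}
```

Under this assumed layout, the line `3,-1,0,0,1,0,1,0,0,0` would describe a Pauli-X gate on qubit 3.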

Although fixed-point numbers could provide an edge in execution speed, compatibility was our primary concern, because Qiskit uses the floating-point format when creating the CSV file. We chose a floating-point implementation since it is the format that Qiskit and most other high-performing SV simulators use, and HLS tools such as Vitis/Vivado include highly optimized IP for floating-point arithmetic. This choice is validated later in the Discussion section, where the main bottleneck turns out to be memory bandwidth rather than kernel execution speed.

The FPGA kernel implementation for quantum SV processing includes specialized loops for handling single-qubit and CNOT gate operations, as shown in Listings 1 and 2, respectively. For a single-qubit gate, the kernel iterates over all the SV values, applying the gate matrix to update the SV. Each iteration determines the target bit’s value and computes the indices for states to be updated, ensuring only relevant computations are performed. The loop employs HLS directives, including #pragma HLS PIPELINE for pipeline optimization and #pragma HLS UNROLL to enhance parallelism by unrolling the loop with a particular factor. The Methodology section explains loop unrolling in more detail. The single-qubit operation loop is optimized to capitalize on the spatial locality of the SV’s entries, meaning assignments are always done in incrementing order.

Listing 1: HLS kernel loop code for applying a single-qubit gate.

single_qubit_loop: for (int i = 0; i < num_states; ++i) {
#pragma HLS PIPELINE II=1
#pragma HLS UNROLL factor=3
    // Get the bit at the target position.
    int bit = (i >> target) & 1;
    // Compute the partner index by flipping the target bit.
    int partner = i ^ (1 << target);
    if (bit == 0) {
        // When the target bit is 0, i is the lower index.
        output_state_vector[i] = gate_matrix[0] * state_vector[i] +
                                 gate_matrix[1] * state_vector[partner];
    } else {
        // When the target bit is 1, i is the higher index.
        output_state_vector[i] = gate_matrix[2] * state_vector[partner] +
                                 gate_matrix[3] * state_vector[i];
    }
}

Listing 2: HLS kernel loop code for handling a CNOT gate.

CNOT_loop:
for (int i = (1 << control); i < num_states; i += (1 << (control + 1))) {
    //HLS PIPELINE
    // Each block [block_start, block_end) has the control bit set to 1.
    int block_start = i;
    int block_end = i + (1 << control);
    if (target >= control) { // Target bit is constant in this block.
        if ((block_start & (1 << target)) == 0) {
            // Swap each element in the block with its partner.
            for (int idx = block_start; idx < block_end; ++idx) {
                int flipped = idx ^ (1 << target);
                swap(output_state_vector[idx], output_state_vector[flipped]);
            }
        }
    } else { // Block spans several target periods.
        int period = 1 << (target + 1);
        int half = 1 << target;
        int idx = block_start;
        while (idx < block_end) {
            int offset = idx % period;
            if (offset < half) {
                // Target bit is 0: swap with the partner whose target bit is 1.
                int flipped = idx ^ (1 << target);
                swap(output_state_vector[idx], output_state_vector[flipped]);
                ++idx;
            } else {
                // Target bit is 1: jump to the start of the next period.
                idx += (period - offset);
            }
        }
    }
}

If the gate is a CNOT gate, the kernel has to account for both the control and target qubits. The loop identifies each state's control and target bits and applies a conditional operation: if the control bit is 1, the target bit determines whether a state swap is required. The partner index is computed by flipping the target bit, and the amplitudes of the two states are then exchanged. Similarly, HLS directives ensure efficient hardware synthesis by optimizing loop execution and resource utilization. Our CNOT algorithm minimizes the number of iterations by exploiting the regular layout of the SV and processing only the blocks relevant to the value of the control qubit. In other words, it iterates only over the SV entries whose control bit is 1 and skips those whose control bit is 0. Together, these parts of the kernel provide a scalable and hardware-optimized approach to simulating quantum gates on FPGA platforms. CNOT is implemented as the sole two-qubit gate because single-qubit gates together with CNOT form a universal gate set: arbitrary controlled-U operations can be decomposed into single-qubit rotations plus CNOTs, so supporting CNOT suffices to execute general quantum circuits while keeping the FPGA kernel simple and memory-efficient.
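The swap rule described above can be cross-checked against a brute-force version that visits every index. This is a minimal host-side sketch, not the kernel itself; the name `apply_cnot_reference` and the in-place `std::vector` interface are assumptions for illustration.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <utility>
#include <vector>

using Amp = std::complex<float>;

// Hypothetical brute-force CNOT reference. It swaps only when the control
// bit is 1 and the target bit is 0, so every eligible pair is swapped once.
void apply_cnot_reference(std::vector<Amp>& state, unsigned control, unsigned target) {
    assert(control != target);
    const std::size_t cmask = std::size_t{1} << control;
    const std::size_t tmask = std::size_t{1} << target;
    // Both qubits must exist in this state vector.
    assert(cmask < state.size() && tmask < state.size());

    for (std::size_t i = 0; i < state.size(); ++i) {
        if ((i & cmask) != 0 && (i & tmask) == 0) {
            std::swap(state[i], state[i ^ tmask]);
        }
    }
}
```

Unlike Listing 2, this version tests every index, so it does twice the loop iterations; it is useful only as a correctness oracle for the block-skipping traversal.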

As previously mentioned, all quantum circuits are transpiled on the host before being executed on the FPGA. In this stage, the front end rewrites composite operations into our universal set of single-qubit and CNOT gates. The FPGA kernel, therefore, receives only these primitives and never needs dedicated RTL for higher-level controlled gates. The following multiqubit gates are examples of constructions that use only single-qubit gates and CNOT gates:

•
Controlled-Z gate (CZ): A controlled-Z with control c and target t is constructed by conjugating a single $\mathrm{CNOT}_{c \to t}$ with Hadamards on the target, using HXH = Z: $\mathrm{CZ}_{c \to t} = (I \otimes H_{t})\, \mathrm{CNOT}_{c \to t}\, (I \otimes H_{t}).$ (5)
Transpilation inserts the two H_t gates around the CNOT; the kernel executes one CNOT operation plus two single-qubit operations.
•
Toffoli gate (CCX): A Toffoli with controls c1, c2, and target t could be compiled to a fixed sequence using six CNOT gates and a constant number of single-qubit gates (e.g., H, S, T, T†); well-known relative-phase variants reduce the CNOT count to four at the cost of a correctable phase. The transpiler selects the appropriate form and emits only single-qubit and CNOT gates to the backend. We previously published a study that closely examined different Toffoli gate decomposition techniques and their circuit depth build-up [37].
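The CZ construction can be verified numerically with 4 × 4 matrices. The sketch below assumes the basis ordering |c t⟩ with the control as the most-significant bit, so that I ⊗ H is block-diagonal; the helper names `multiply`, `cz_from_cnot`, and `is_close` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Mat4 = std::array<std::array<double, 4>, 4>;

// Plain 4x4 matrix product.
Mat4 multiply(const Mat4& a, const Mat4& b) {
    Mat4 c{};
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            for (std::size_t k = 0; k < 4; ++k)
                c[i][j] += a[i][k] * b[k][j];
    return c;
}

// Builds (I (x) H) * CNOT * (I (x) H) in the assumed |c t> ordering.
Mat4 cz_from_cnot() {
    const double h = 1.0 / std::sqrt(2.0);
    const Mat4 i_tensor_h = {{{h, h, 0.0, 0.0},
                              {h, -h, 0.0, 0.0},
                              {0.0, 0.0, h, h},
                              {0.0, 0.0, h, -h}}};
    const Mat4 cnot = {{{1.0, 0.0, 0.0, 0.0},
                        {0.0, 1.0, 0.0, 0.0},
                        {0.0, 0.0, 0.0, 1.0},
                        {0.0, 0.0, 1.0, 0.0}}};
    return multiply(i_tensor_h, multiply(cnot, i_tensor_h));
}

// Element-wise comparison within an absolute tolerance.
bool is_close(const Mat4& a, const Mat4& b, double tol) {
    assert(tol >= 0.0);
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            if (std::abs(a[i][j] - b[i][j]) > tol) return false;
    return true;
}
```

The product should match diag(1, 1, 1, −1) up to rounding, since the control-0 block reduces to HH = I and the control-1 block to HXH = Z.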

The proposed system's files could serve as a template for later improvements to HLS SV synthesis, as they provide scaffolding on which other members of the scientific community can build to optimize various aspects of quantum computing SV simulation. The most evident candidates for improvement are memory handling using sparse matrix representation (SMR), memory bandwidth utilization, and gate matrix processing. The contents of the files used in our experimental setup are as follows:

•
Host Code: Contains the code responsible for handling the data on the workstation side that is input into the FPGA or received from it. It feeds the data to the synthesized FPGA in a digestible format and stores the FPGA’s output data.
•
Kernel Code: This code is translated from C++ to RTL by an HLS compiler (in our setup, Vivado/Vitis 2023.2). The resulting RTL is then used to configure the FPGA to perform the code's designated function.
•
Memory Configuration File: This file designates the FPGA’s buffer allocation. In our case, the Alveo U200 has four 16 GB DDR4 memory chips, which can be used as buffers to handle the data coming in from the workstation to the FPGA.

The code with all its versions and the scripts used to run it can be found at [38].

5.1. Algorithmic and Computational Complexity of Listings 1 and 2

Consider an n-qubit pure state stored as a length-$2^{n}$ complex vector $\psi \in \mathbb{C}^{2^{n}}$ in row-major order.

5.1.1. Listing 1: Single-Qubit Gate

The kernel iterates over stride-$2^{q}$ pairs $(\psi_{i}, \psi_{i + 2^{q}})$, where q is the target qubit. There are 2ⁿ⁻¹ such pairs, and for each pair, the algorithm performs four fused-multiply-adds (FMAs): two complex multiplies and two complex adds.¹ Hence, $T_{single}(n) = \Theta(2^{n})\ \text{complex FMAs}.$ (6)

The operation count is optimal because every amplitude must be read at least once. Bandwidth dominates: each pair incurs two 16-byte loads and two 16-byte stores (double precision), that is, 64 bytes per pair, so the memory traffic is $B_{single}(n) = 64\ \text{B} \times 2^{n - 1} = \Theta(2^{n})\ \text{bytes}.$ (7)

5.1.2. Listing 2: CNOT Gate

CNOT is a permutation: amplitudes are swapped only for index pairs (i, i ⊕ 2^t) whose control bit c is 1. The kernel iterates only over blocks with the control bit set and, within each block, touches only the half-period where the target bit is 0, so each eligible pair is swapped exactly once. As a result, the kernel performs 2ⁿ⁻² swaps (touching 2ⁿ⁻¹ amplitudes) and does no floating-point math:

$S_{CNOT}(n) = 2^{n - 2}\ \text{swaps}, \quad \text{I/O} = \Theta(2^{n})\ \text{bytes (about half of a single-qubit update)}.$ (8) In practice, branchy control flow and strided access reduce pipeline efficiency and burst locality, so run-time need not scale to exactly half of Listing 1; our measurements show it remains bandwidth-dominated with additional control-flow overhead.
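The swap count can be confirmed by enumeration for small n. The following sketch walks the same control-bit-1 blocks that Listing 2 visits and counts the indices whose target bit is 0, without touching any amplitudes; the function name `count_cnot_swaps` is hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Counts the swaps a CNOT performs on an n-qubit state vector by visiting
// only the blocks whose control bit is 1 (hypothetical helper for checking
// the 2^(n-2) swap count).
std::uint64_t count_cnot_swaps(unsigned n, unsigned control, unsigned target) {
    assert(n < 63 && control < n && target < n && control != target);
    const std::uint64_t num_states = std::uint64_t{1} << n;
    const std::uint64_t block = std::uint64_t{1} << control;
    const std::uint64_t tmask = std::uint64_t{1} << target;

    std::uint64_t swaps = 0;
    // Blocks with the control bit set start at 2^control and repeat
    // every 2^(control+1) indices.
    for (std::uint64_t start = block; start < num_states; start += 2 * block) {
        for (std::uint64_t idx = start; idx < start + block; ++idx) {
            if ((idx & tmask) == 0) ++swaps;  // lower index of a swap pair
        }
    }
    return swaps;
}
```

For any valid choice of control and target, the count equals 2ⁿ⁻², independent of which qubits are chosen; only the access pattern changes.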

5.2. Measurement

Because the simulator retains the full SV $|\psi\rangle \in \mathbb{C}^{2^{n}}$ until the end of circuit execution, the probability of any computational basis state k is simply $P(k) = {|\psi_{k}|}^{2}$. A dedicated read-out, or measurement operation, is therefore not needed in the kernel: the accelerator does not require an in-kernel measurement to progress, and the host can obtain all desired classical outcomes by operating on the final SV. Concretely, once the vector has been copied to the host, a cumulative distribution (or histogram) can be constructed in one linear pass over |ψ⟩ in O(2ⁿ), consistent with the bandwidth-bound nature of our kernels.

To clarify even further, we note how different read-outs map to this same pass without altering the FPGA kernel:

i.
Full distribution: compute $P (k) = {|ψ_{k}|}^{2}$ for all k.
ii.
Marginals (subset measurement): select a qubit subset M and accumulate probabilities into a 2^|M|-bin histogram by packing the measured bits from each index; this costs one pass over |ψ⟩ plus a small on-host reduction.
iii.
Multishot sampling: draw N outcomes via inverse-CDF sampling from the (full or marginal) histogram with O(N) extra work.
iv.
Optional postmeasurement collapse: when conditional logic is required, amplitudes inconsistent with the chosen outcome are set to zero and the remaining amplitudes are renormalized, again in a single linear pass.
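The four read-out pathways can be sketched as host-side routines. This is an illustrative outline under stated assumptions rather than the code used in this study: the function names, the choice of `std::complex<float>` amplitudes, the convention that qubit 0 is the least-significant index bit, and the use of a binary search for the inverse-CDF lookup are all assumptions.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <random>
#include <vector>

using Amp = std::complex<float>;

// (i) Full distribution: P(k) = |psi_k|^2.
std::vector<double> probabilities(const std::vector<Amp>& psi) {
    std::vector<double> p(psi.size());
    for (std::size_t k = 0; k < psi.size(); ++k)
        p[k] = static_cast<double>(std::norm(psi[k]));
    return p;
}

// (ii) Marginal over the qubits in `measured`; measured[j] becomes bit j of the bin.
std::vector<double> marginal(const std::vector<double>& p,
                             const std::vector<unsigned>& measured) {
    std::vector<double> hist(std::size_t{1} << measured.size(), 0.0);
    for (std::size_t k = 0; k < p.size(); ++k) {
        std::size_t bin = 0;
        for (std::size_t j = 0; j < measured.size(); ++j)
            bin |= ((k >> measured[j]) & std::size_t{1}) << j;
        hist[bin] += p[k];
    }
    return hist;
}

// (iii) Multishot sampling by inverse-CDF lookup (binary search per shot).
std::vector<std::size_t> sample(const std::vector<double>& hist, std::size_t shots,
                                std::mt19937_64& rng) {
    assert(!hist.empty());
    std::vector<double> cdf(hist.size());
    double running = 0.0;
    for (std::size_t k = 0; k < hist.size(); ++k) cdf[k] = (running += hist[k]);
    assert(running > 0.0);
    std::uniform_real_distribution<double> uniform(0.0, running);
    std::vector<std::size_t> outcomes(shots);
    for (std::size_t& o : outcomes) {
        const auto it = std::upper_bound(cdf.begin(), cdf.end(), uniform(rng));
        o = std::min(static_cast<std::size_t>(it - cdf.begin()), hist.size() - 1);
    }
    return outcomes;
}

// (iv) Collapse `qubit` onto `bit` (0 or 1), then renormalize in place.
void collapse(std::vector<Amp>& psi, unsigned qubit, unsigned bit) {
    double kept = 0.0;
    for (std::size_t k = 0; k < psi.size(); ++k) {
        if (((k >> qubit) & 1u) != bit) psi[k] = Amp(0.0f, 0.0f);
        else kept += static_cast<double>(std::norm(psi[k]));
    }
    assert(kept > 0.0);  // the chosen outcome must have nonzero probability
    const float scale = static_cast<float>(1.0 / std::sqrt(kept));
    for (Amp& a : psi) a *= scale;
}
```

Each routine makes one linear pass over its input; the binary search adds a logarithmic factor per shot, which could be replaced by an alias table if the shot count is large.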

These pathways make an in-kernel measurement operation unnecessary, preserving the design goal of a single, universal synthesis while aligning with the memory-bound profile already established for our kernels.

6. Methodology

We used the AMD Alveo U200 Data Center Accelerator Card to set up the XCU200 FPGA, which was attached to a workstation that housed an Intel i9-10980XE as a CPU. The Intel Xeon Platinum 8358 Processor with 32 cores was used as an HPC node to carry out the Qiskit Aer SV simulator (Version 0.15.1) runs to compare execution time and precision loss. Qiskit Version 1.3 was used as the operating tool for handling quantum circuit code. To explore further optimization, we implemented synthesis runs using half-precision values. Our experiments also used loop unrolling as a hardware synthesis optimization technique and recorded their effect on execution speed.

Our experiment aims to demonstrate that our proposed system can simulate a substantial number of qubits, crossing the boundaries from medium-sized quantum circuits to large-sized quantum circuits (according to the QASMBench [39] rating). For benchmarking, we used QASMBench’s quantum circuits [39]. QASMBench is a benchmark suite specifically designed for quantum computing, comprising a diverse set of QASM programs derived from real-world quantum algorithms and applications. Numerous studies have used it to evaluate quantum hardware, simulators, and compilers.

The HLS kernel was written in C++ and synthesized with Vitis 2023.2 on a RHEL 7.9 machine. We tried numerous techniques to optimize the HLS code. Most did not prove worthwhile, but loop unrolling did reduce execution time; the Results section shows how far it helps before other bottlenecks limit further gains. Some approaches resulted in slower implementations rather than faster ones. Expanding the buffer size to its maximum capacity led to increased latency even when simulating large quantum circuits, likely due to memory access and data management overheads. Additionally, inputting all gates as a pointer upfront instead of feeding them iteratively from the host to the FPGA caused a notable slowdown. These observations underscore the importance of balancing hardware capabilities with algorithm design choices. We utilized three of the four DDR4 chips available in the Alveo U200 data center accelerator card. The DDR4-to-SV mapping can be seen in the code used in this study [38].

Loop unrolling is an optimization technique in computer programming and hardware design where multiple iterations of a loop are combined into a single iteration to reduce loop overhead and improve execution performance. Decreasing the number of iterations minimizes branching instructions and allows for better utilization of pipeline stages in modern processors or hardware accelerators like FPGAs. For example, a loop with an iteration count of 100 can be unrolled by a factor of 4, reducing the number of iterations to 25 while executing four iterations' worth of work per unrolled iteration. This optimization is particularly effective in cases where loop iterations are independent, as it enables parallel execution, reduces branch misprediction, and improves data locality. However, it also increases code size and requires careful consideration to avoid memory bandwidth bottlenecks or register spills. The technique is widely discussed in compiler design and the HPC literature [40, 41], and it is applied in domains such as image processing, scientific computing, and quantum circuit simulation.
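To make the transformation concrete, the sketch below shows a rolled summation loop next to a version unrolled by hand with a factor of 4. It is plain C++ for illustration only; in Vitis HLS the same effect is requested with `#pragma HLS UNROLL`, as in Listing 1, and the tool generates the replicated hardware. The function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of loop iterations after unrolling by `factor` (rounded up).
std::size_t unrolled_trip_count(std::size_t iterations, std::size_t factor) {
    assert(factor > 0);
    return (iterations + factor - 1) / factor;
}

// Rolled loop: one addition and one loop test per element.
float sum_rolled(const std::vector<float>& x) {
    float total = 0.0f;
    for (std::size_t i = 0; i < x.size(); ++i) total += x[i];
    return total;
}

// Unrolled by 4: four independent accumulators per iteration, which removes
// three of every four loop tests and exposes four additions that do not
// depend on one another.
float sum_unrolled4(const std::vector<float>& x) {
    const std::size_t n = x.size();
    float a0 = 0.0f, a1 = 0.0f, a2 = 0.0f, a3 = 0.0f;
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        a0 += x[i];
        a1 += x[i + 1];
        a2 += x[i + 2];
        a3 += x[i + 3];
    }
    float total = (a0 + a1) + (a2 + a3);
    for (; i < n; ++i) total += x[i];  // remainder when n is not a multiple of 4
    return total;
}
```

The remainder loop illustrates one cost of unrolling: when the trip count is not a multiple of the factor, extra logic is needed to handle the leftover iterations.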

To evaluate the execution time for individual gate operations on a quantum SV, a series of QASM files were designed. The first series comprises 100 single-qubit gates applied to quantum circuits ranging from 15 to 28 qubits. The second series follows the same structure but exclusively employs CNOT gates instead of single-qubit gates. The overhead from initialization and memory is included in the timing; in other words, the per-gate calculation includes timing from start to finish of the SV simulation, which includes initialization and memory management. The relative difference in these experiments was calculated using the following equation: $\text{Relative Difference} = \frac{\text{CNOT Gate Time} - \text{Single-Qubit Gate Time}}{\text{Single-Qubit Gate Time}}.$ (9)
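As a worked illustration of the relative-difference formula, suppose (hypothetically; these are not measured values) that a single-qubit gate takes 2.0 s and a CNOT gate takes 3.0 s at some qubit count:

```latex
\[
\text{Relative Difference}
  = \frac{3.0\,\text{s} - 2.0\,\text{s}}{2.0\,\text{s}}
  = 0.5
\]
```

A value of 0.5 means the CNOT gate takes 50% longer than the single-qubit gate, a value of 0 means the two take equal time, and a value of 1.0 means the CNOT gate takes twice as long.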

To run 29-qubit circuits, the code had to be adjusted. Instead of using a single buffer for each SV (input and output), two buffers were used, each holding one part of the SV's values. Using two buffers per SV caused issues with the HLS compiler that prevented us from using multiple DRAMs. This issue appears to be a software bug rather than a hardware limitation; hence, scaling this implementation up to simulate 31 qubits is currently blocked by a minor bug in the HLS compiler. Note that the 29-qubit Walsh–Hadamard transform (WHT) circuit is the only quantum circuit used that is not part of the QASMBench set; it was created because the 29-qubit circuit in QASMBench contains a high number of gates (2059).

7. Results

The effects of loop unrolling on execution performance are evident across both floating-point and half-precision implementations. As seen in Figure 3, the float implementation is noticeably faster when the unroll factor is set to an even number. When the unroll factor is set to 6, it performs faster in almost all QASMBench circuits. The half-precision implementation, shown in Figure 4, demonstrates a more consistent linear reduction in execution time with increasing unroll factors, reaching an optimal performance at an unroll factor of 4. Beyond this point, the execution reduction effect of unrolling stagnates, with slightly worse results observed when the unroll factor is set to odd numbers. The unroll execution reduction is more pronounced in the half-precision implementation than in the float implementation.

[IMAGE OMITTED. SEE PDF]

The FPGA implementations exhibit a clear advantage over the Qiskit Aer SV simulator in terms of output file size. Figure 5 highlights that, on average, half-precision implementations produce smaller SV files than floating-point implementations, which in turn are significantly smaller than Aer's output. This reduction in file size can be crucial for handling larger quantum systems, where data storage and transfer become limiting factors. Benchmarking against Qiskit Aer on a 32-core Intel Xeon Platinum 8358 shows that the CPU completes the same circuits far faster than our current U200 implementation. Accordingly, we concentrate on a detailed analysis of the FPGA's performance across configurations instead of presenting a direct timing table.

[IMAGE OMITTED. SEE PDF]

The precision analysis, summarized in Table 3, reports the average difference between the probability vector produced by each implementation and the one produced by Aer's SV simulator, which serves as the golden model. The probability vector is obtained from a quantum SV by computing the squared magnitude of each complex amplitude; the resulting distribution should be normalized. While the float implementation shows a lower median precision loss, the half-precision implementation exhibits a lower overall average difference. We used a cutoff value of ϵ = 1 × 10⁻¹³ so that near-zero values do not skew the precision analysis, making the comparison robust and meaningful.
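One way to compute such a metric is sketched below. The exact procedure used in this study is not spelled out in the text, so two points are assumptions: the difference is taken as the absolute difference of probabilities, and the cutoff ϵ is applied by skipping entries whose golden-model probability is below ϵ. The names `PrecisionLoss` and `precision_loss` are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <complex>
#include <cstddef>
#include <vector>

struct PrecisionLoss {
    double mean;
    double median;
};

// Compares |amplitude|^2 of a test state vector against a golden model.
// ASSUMPTION: entries whose golden probability is below epsilon are skipped.
template <typename T>
PrecisionLoss precision_loss(const std::vector<std::complex<double>>& golden,
                             const std::vector<std::complex<T>>& test,
                             double epsilon = 1e-13) {
    assert(golden.size() == test.size());
    std::vector<double> diffs;
    for (std::size_t k = 0; k < golden.size(); ++k) {
        const double p_ref = std::norm(golden[k]);
        if (p_ref < epsilon) continue;  // assumed cutoff rule
        const double p_test = std::norm(std::complex<double>(test[k]));
        diffs.push_back(std::abs(p_ref - p_test));
    }
    if (diffs.empty()) return PrecisionLoss{0.0, 0.0};

    double sum = 0.0;
    for (double d : diffs) sum += d;

    std::sort(diffs.begin(), diffs.end());
    const std::size_t mid = diffs.size() / 2;
    const double median = (diffs.size() % 2 == 1)
                              ? diffs[mid]
                              : 0.5 * (diffs[mid - 1] + diffs[mid]);
    return PrecisionLoss{sum / static_cast<double>(diffs.size()), median};
}
```

Reporting both the mean and the median is useful because a few large outliers can raise the mean while leaving the median almost unchanged.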

Table 3 Precision loss comparison between the float and half implementations using Qiskit Aer values as the golden model.

Benchmark	Float	Half
qft_n15	5.35E-09	6.04E-06
qram_n20	5.00E-06	6.44E-15
knn_n25	2.68E-13	6.45E-09
ising_n26	5.38E-14	1.49E-08
wstate_n27	3.45E-08	1.28E-06
adder_n28	1.00E-05	5.96E-08
wht_n29*	2.34E-15

The per-gate execution time is shown in Figure 6. The results are as expected: with each additional qubit, the SV doubles in size, and the execution time for both single-qubit and CNOT gates rises exponentially; all per-gate execution times nevertheless remain below one minute. The relative difference starts low, gradually climbs to a peak, and then plateaus. This is because the overhead of initialization and memory management dominates the execution time at small qubit counts, and the real relative difference only materializes once that overhead becomes comparatively small.

[IMAGE OMITTED. SEE PDF]

Figure 6 shows that the CNOT kernel can take up to twice as long as the single-qubit kernel. Although both kernels ultimately stream the same 2ⁿ complex amplitudes, three implementation details in Listing 2 widen the latency gap:

1.
Two nested branches and pipeline stalls. The outer loop chooses between the if (target >= control) clause and the else branch, while the inner while loop contains its own if (offset < half) test. These run-time branches prevent Vitis HLS from fully unrolling the loops; the generated FSM raises the initiation interval to 2–3 cycles and inserts pipeline bubbles.
2.
Sparse, strided locality. The outer loop starts at i = 1 << control and increments by a stride $s = 2^{control + 1}$, touching addresses that are s × 16 B apart. When the control qubit is not the most-significant bit, those addresses lie in different DRAM rows, degrading burst locality and trimming effective bandwidth (our profiler reports 10%–15% lower throughput).
3.
Swap-store overhead and index arithmetic. Each iteration performs a swap of two amplitudes, which requires two reads, two writes, and a temporary register, plus an XOR operation for every access. The extra load–store pair and address calculation add latency that is absent from the single-qubit multiply-accumulate path.

Together, branch divergence, poorer spatial locality, and the swap-store overhead explain why the arithmetic-free CNOT kernel runs longer than the floating-point-intensive single-qubit kernel.

The results highlight a delicate balance between performance, precision, and resource utilization. Although float implementations offer higher accuracy on average, half-precision implementations provide better scalability and reduced file sizes, making them a compelling choice for storage-restricted applications. The clear performance gains from loop unrolling and the advantages of FPGA-based execution demonstrate the potential of this approach to accelerate quantum circuit simulations.

7.1. Discussion

The float implementation showed a clear advantage when setting the unroll factor to an even number, and this should come as no surprise as numerous studies have confirmed that in general, even-numbered unrolling factors provide an advantage [42]. Odd unrolling factors can introduce misalignment, reducing instruction-level parallelism and increasing loop overhead.

The performance variations among the three implementations (float, half-precision, and double buffer) primarily stem from the complexity of the required workarounds. The single-precision implementation exhibited superior performance due to extensive HLS intellectual property (IP) support, eliminating the need for additional modifications. Furthermore, we applied algorithmic optimizations to enhance its efficiency. In contrast, the half-precision implementation necessitated the separation of real and imaginary components into distinct arrays, while the double-buffer implementation required partitioning the SV across two arrays. Both approaches encountered challenges related to HLS compiler constraints, particularly when attempting to map separate DDR4 memory banks for input and output SVs. These limitations prevented the exploitation of multibank memory access, thereby restricting potential performance gains.

Although our implementation only used medium- to large-scale quantum circuits, synthesis could be modified to run numerous smaller circuits to supplement hybrid approaches [9]. The flexible nature of our proposed system’s FPGA synthesis offers a higher ceiling for parallel computing than the CPU approaches.

Our Aer SV simulation outperformed our FPGA implementation by a wide margin. However, as discussed earlier, many different optimizations could be applied to narrow this margin, if not close it entirely. Bandwidth utilization with respect to the current configuration remained at 20% for writing and 10% for reading when running the adder_n28 circuit; these utilization percentages are even lower in an ideal port configuration, reading at 1.3% and 2.5%, respectively. The use of the dataflow pragma, unfortunately, prolonged execution time rather than increasing workflow parallelism; hence, attaching a read/write scheduler is a promising optimization for this problem.

The Alveo U200 contains four 16 GB DDR4 memory units. While this opens up the possibility of SV simulations of more than 31 qubits, we were limited by a Vitis software constraint that restricts each buffer transfer to under 4 GB. Hence, a separate implementation that uses two buffers per SV was built, and its results can be seen in Figure 7. Due to the slight differences, the minimum execution time in this case was reached when the loop unroll factor was set to 4. The use of multiple DRAMs was also prevented in this implementation by bugs in the HLS compiler. Hence, this double-buffer implementation was limited to 29 qubits rather than the 30 qubits that would be possible if multiple DRAMs could be used.
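The buffer-size constraint can be illustrated with a short calculation. The sketch below assumes 8 bytes per amplitude (two 32-bit floats, as would be the case for a single-precision complex value) and interprets the limit as 4 GiB; both figures, as well as the rule of splitting into a power-of-two number of equal buffers, are assumptions for illustration, and the function names are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Bytes needed to hold all 2^n amplitudes of an n-qubit state vector.
std::uint64_t state_vector_bytes(unsigned n_qubits, std::uint64_t bytes_per_amplitude) {
    assert(n_qubits < 56 && bytes_per_amplitude > 0 && bytes_per_amplitude <= 128);
    return (std::uint64_t{1} << n_qubits) * bytes_per_amplitude;
}

// Smallest power-of-two number of equal buffers such that each buffer is
// strictly smaller than limit_bytes.
std::uint64_t buffers_needed(std::uint64_t total_bytes, std::uint64_t limit_bytes) {
    assert(limit_bytes > 0);
    std::uint64_t buffers = 1;
    while (total_bytes / buffers >= limit_bytes) buffers *= 2;
    return buffers;
}
```

Under these assumptions, a 28-qubit SV occupies 2 GiB and fits in one buffer, whereas a 29-qubit SV occupies exactly 4 GiB and therefore has to be split in two, which matches the need for the double-buffer implementation described above.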

[IMAGE OMITTED. SEE PDF]

8. Conclusion

By achieving a 29-qubit universal gate SV simulation on an FPGA through HLS, this study connects existing research on hardware-accelerated quantum simulators with future efforts to optimize both hardware and software designs. Our proposed system does not skimp on imaginary numbers or application flexibility and could become a complete alternative to CPU and GPU SV simulation. Future advances in quantum circuit optimization could be readily applied without any changes to our proposed system's synthesis. Moreover, our system's flexibility allows hybrid implementations, such as [9], to execute larger cluster sizes robustly. Compared to the Aer SV simulator, our proposed system requires less data storage, especially as the number of qubits in the SV increases.

The findings of this work highlight the need for further studies to:

•
Enhance the algorithmic implementation of SV operations on FPGAs to reduce resource usage and execution time. Merged gate implementations [43] have considerable potential to optimize this system. SMR could increase the size of the circuits that can be simulated and reduce execution time when the SV is sparse [44].
•
Address hardware bottlenecks that currently constrain larger simulations, such as memory bandwidth, software constraints, and parallel computation limitations. Incorporating customized floating-point formats would reduce memory requirements while consuming less bandwidth.
•
Explore hybrid FPGA–CPU or FPGA–GPU systems that can combine the strengths of different platforms for large-scale quantum simulations.

In summary, this study provides a critical missing piece in the mosaic of quantum SV simulators. It establishes FPGAs as a viable and scalable platform, paving the way for future research to extend the limits of FPGA-based quantum SV simulations. This work opens new avenues for achieving efficient and large-scale quantum circuit simulations by contributing to the hardware and systems-level understanding of SV simulators.

Appendix

Appendix A - Quantum Computing Simulation Basics

Appendix A1. Quantum Computing: Vector and Matrix Representations

Quantum computing harnesses quantum mechanical phenomena such as superposition and entanglement to perform computation. At the core of quantum computing is the quantum bit, or qubit, which, unlike a classical bit, can be in a superposition of both 0 and 1 states.

Appendix A1.1. Single-Qubit Quantum States and Operations

In the vector representation, the state of a qubit is described by a vector in a two-dimensional Hilbert space. This vector is a linear combination of the basis states |0⟩ and |1⟩, generally represented as follows: $|\psi\rangle = \alpha |0\rangle + \beta |1\rangle,$ (A.1) where α and β are complex numbers that satisfy the normalization condition: ${|\alpha|}^{2} + {|\beta|}^{2} = 1.$ (A.2)

Here, |0⟩ and |1⟩ are column vectors: $|0\rangle = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad |1\rangle = \begin{bmatrix} 0 \\ 1 \end{bmatrix}.$ (A.3)

Thus, the state |ψ⟩ can be expressed as follows: $|\psi\rangle = \begin{bmatrix} \alpha \\ \beta \end{bmatrix}.$ (A.4)

Operations on qubits are performed using quantum gates, which are represented by unitary matrices. For a single qubit, any quantum operation can be represented as a 2 × 2 unitary matrix U that acts on the SV |ψ⟩.

The application of a quantum gate U to the state |ψ⟩ is represented by the following equation: $|\psi'\rangle = U |\psi\rangle,$ (A.5) where U must satisfy the unitary condition $U^{\dagger} U = U U^{\dagger} = I$, with I being the identity matrix and $U^{\dagger}$ the conjugate transpose of U (which is also its inverse: $U^{\dagger} = U^{-1}$).

For example, the Pauli-X gate, which acts as a quantum NOT operation (changing |0⟩ to |1⟩ and vice versa), can be represented by the following matrix: $X = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}.$ (A.6)

When this gate is applied to the state |ψ⟩, the new state |ψ′⟩ is as follows: $|\psi'\rangle = X |\psi\rangle = \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} \alpha \\ \beta \end{bmatrix} = \begin{bmatrix} \beta \\ \alpha \end{bmatrix}.$ (A.7)
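The matrix-vector product in this example can be reproduced in a few lines of C++. This is a minimal sketch; the type aliases `State1` and `Gate1` and the function names are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <complex>

using Amp = std::complex<double>;
using State1 = std::array<Amp, 2>;                 // {alpha, beta}
using Gate1 = std::array<std::array<Amp, 2>, 2>;   // 2x2 matrix, row-major

// |alpha|^2 + |beta|^2, which is 1 for a valid single-qubit state.
double norm_squared(const State1& psi) {
    return std::norm(psi[0]) + std::norm(psi[1]);
}

// Computes U|psi> for a single qubit.
State1 apply_gate(const Gate1& u, const State1& psi) {
    assert(std::abs(norm_squared(psi) - 1.0) < 1e-9);  // input must be normalized
    State1 out{};
    out[0] = u[0][0] * psi[0] + u[0][1] * psi[1];
    out[1] = u[1][0] * psi[0] + u[1][1] * psi[1];
    return out;
}

// The Pauli-X matrix [[0, 1], [1, 0]].
Gate1 pauli_x() {
    Gate1 x{};
    x[0][1] = 1.0;
    x[1][0] = 1.0;
    return x;
}
```

Applying `pauli_x()` to a state with α = 0.6 and β = 0.8i returns the amplitudes in swapped order, and the norm is preserved because X is unitary.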

This matrix representation allows us to apply linear algebraic techniques to analyze and simulate quantum algorithms efficiently.

Complex numbers allow quantum algorithms to utilize interference, where the phases of different paths can add constructively or destructively. Phase factors, which are complex exponentials, play a critical role in this context: $e^{i\theta} = \cos(\theta) + i \sin(\theta).$ (A.8)

These phase factors can alter the probability amplitudes in a way that real numbers alone could not, enabling unique computational capabilities such as those seen in the QFT and Grover’s algorithm, as explored by Mermin [45]. These aspects highlight the indispensable role of complex numbers in the foundational structure of quantum computing, enhancing its capability beyond classical computational methods.

Appendix A1.2. Multiple-Qubit Quantum States and Operations

Quantum states of multiple qubits are represented by the tensor product of the SVs of individual qubits. For two qubits, if the state of the first qubit is |ψ₁⟩ = α|0⟩ + β|1⟩ and the state of the second qubit is |ψ₂⟩ = γ|0⟩ + δ|1⟩, the combined state of the system is given by the following equation: $|\Psi\rangle = |\psi_{1}\rangle \otimes |\psi_{2}\rangle = (\alpha |0\rangle + \beta |1\rangle) \otimes (\gamma |0\rangle + \delta |1\rangle).$ (A.9)

Expanding the tensor product, we obtain the following: $|\Psi\rangle = \alpha\gamma |00\rangle + \alpha\delta |01\rangle + \beta\gamma |10\rangle + \beta\delta |11\rangle.$ (A.10)
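The expansion corresponds to a Kronecker product of the two amplitude vectors. The sketch below uses the hypothetical name `kron` and places the first operand's qubit in the most-significant position, so that the output order is |00⟩, |01⟩, |10⟩, |11⟩ as in the expansion.

```cpp
#include <cassert>
#include <complex>
#include <cstddef>
#include <vector>

using Amp = std::complex<double>;

// Kronecker product of two state vectors: out[i * b.size() + j] = a[i] * b[j].
// The qubits of `a` occupy the most-significant index bits.
std::vector<Amp> kron(const std::vector<Amp>& a, const std::vector<Amp>& b) {
    assert(!a.empty() && !b.empty());
    std::vector<Amp> out(a.size() * b.size());
    for (std::size_t i = 0; i < a.size(); ++i)
        for (std::size_t j = 0; j < b.size(); ++j)
            out[i * b.size() + j] = a[i] * b[j];
    return out;
}
```

Repeated application builds an n-qubit product state, which is also why the SV length doubles with every additional qubit.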

Each of these states represents a vector in a four-dimensional Hilbert space. The basis states for two qubits are as follows: $|00\rangle = \begin{bmatrix} 1 \\ 0 \\ 0 \\ 0 \end{bmatrix}, \quad |01\rangle = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \quad |10\rangle = \begin{bmatrix} 0 \\ 0 \\ 1 \\ 0 \end{bmatrix}, \quad |11\rangle = \begin{bmatrix} 0 \\ 0 \\ 0 \\ 1 \end{bmatrix}.$ (A.11)

Quantum gates that act on multiple qubits can also be represented by larger unitary matrices. For example, the CNOT gate, which flips the second qubit (target) if the first qubit (control) is in state |1⟩, can be represented by the following matrix: $\mathrm{CNOT} = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 1 \\ 0 & 0 & 1 & 0 \end{bmatrix}.$ (A.12)

When applied to a two-qubit system, the CNOT gate demonstrates how quantum gates can manipulate the states to generate entanglement, a key feature of quantum computation.
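A standard illustration of this point is the preparation of a Bell state: a Hadamard on the first qubit followed by the CNOT matrix above maps |00⟩ to (|00⟩ + |11⟩)/√2. The sketch below uses real-valued 4 × 4 matrices, which suffice here because all entries are real; the names `apply` and `bell_state` are hypothetical.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<std::array<double, 4>, 4>;

// 4x4 matrix times 4-vector.
Vec4 apply(const Mat4& m, const Vec4& v) {
    Vec4 out{};
    for (std::size_t i = 0; i < 4; ++i)
        for (std::size_t j = 0; j < 4; ++j)
            out[i] += m[i][j] * v[j];
    return out;
}

// Prepares CNOT * (H (x) I) * |00>, with the first qubit as control.
Vec4 bell_state() {
    const double h = 1.0 / std::sqrt(2.0);
    // Hadamard on the first (most-significant) qubit: H (x) I.
    const Mat4 h_on_first = {{{h, 0.0, h, 0.0},
                              {0.0, h, 0.0, h},
                              {h, 0.0, -h, 0.0},
                              {0.0, h, 0.0, -h}}};
    const Mat4 cnot = {{{1.0, 0.0, 0.0, 0.0},
                        {0.0, 1.0, 0.0, 0.0},
                        {0.0, 0.0, 0.0, 1.0},
                        {0.0, 0.0, 1.0, 0.0}}};
    const Vec4 zero_zero = {1.0, 0.0, 0.0, 0.0};
    const Vec4 out = apply(cnot, apply(h_on_first, zero_zero));
    // Unitary gates preserve normalization.
    double norm = 0.0;
    for (double amplitude : out) norm += amplitude * amplitude;
    assert(std::abs(norm - 1.0) < 1e-12);
    return out;
}
```

The result has equal amplitudes on |00⟩ and |11⟩ and none on |01⟩ or |10⟩, so it cannot be written as a tensor product of two single-qubit states.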

Data Availability Statement

The datasets generated and/or analyzed during the current study are available from the corresponding author upon reasonable request. The source code used in this study, together with scripts and instructions to reproduce the experiments, is openly available at .

Conflicts of Interest

The authors declare no conflicts of interest.

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

Acknowledgments

The authors thank Shubham Pandey and Naveen Gopalakrishnan for their assistance in refining the operational template for synthesis on the Alveo U200 device.

References

1 Abraham H., Qiskit: An Open-Source Framework for Quantum Computing, Qiskit Open Source Project. (2019) https://github.com/Qiskit/qiskit.

2 NVIDIA Corporation, cuQuantum: Accelerating Quantum Circuit Simulations with GPUs, 2022, NVIDIA Developer Documentation.

3 Arute F., Arya K., Babbush R. et al., Quantum Supremacy Using a Programmable Superconducting Processor, Nature. (2019) 574, no. 7779, 505–510, https://doi.org/10.1038/s41586-019-1666-5, 2-s2.0-85074074842.

4 Häner T., Steiger D. S., Smelyanskiy M., and Troyer M., High Performance Emulation of Quantum Circuits, 874, Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (SC 2016), 2017, IEEE.

5 Zhang B., Fang B., Ye F. et al., Overcoming Memory Constraints in Quantum Circuit Simulation with a High-Fidelity Compression Framework, 2024.

6 Bayraktar H., Charara A., Clark D. et al., cuQuantum SDK: A High-Performance Library for Accelerating Quantum Science, 1, 2023 IEEE International Conference on Quantum Computing and Engineering (QCE), 2023, IEEE, 1050–1061, https://doi.org/10.1109/qce57702.2023.00119.

7 Moawad Y., Brown A., Steijl R., and Vanderbauwhede W., Optimising Iteration Scheduling for Full-State Vector Simulation of Quantum Circuits on FPGAs, 2024.

8 Wei K., Amano H., Niwase R., and Yamaguchi Y., A Data Compressor for FPGA-based State Vector Quantum Simulators, Proceedings of the 14th International Symposium on Highly Efficient Accelerators and Reconfigurable Technologies, 2024, 63–70, https://doi.org/10.1145/3665283.3665293.

9 Suzuki T., Miyazaki T., Inaritai T., and Otsuka T., Quantum AI Simulator Using a Hybrid CPU–FPGA Approach, Scientific Reports. (2023) 13, no. 1, https://doi.org/10.1038/s41598-023-34600-2.

10 Rakyta P., Morse G., Nádori J., Majnay-Takács Z., Mencer O., and Zimborás Z., Highly Optimized Quantum Circuits Synthesized via Data-Flow Engines, Journal of Computational Physics. (2024) 500, https://doi.org/10.1016/j.jcp.2024.112756.

11 Mahmud N., El-Araby E., and Caliga D., Scaling Reconfigurable Emulation of Quantum Algorithms at High Precision and High Throughput, Quantum Engineering. (2019) 1, no. 2, https://doi.org/10.1002/que2.19.

12 Nielsen M. A. and Chuang I. L., Quantum Computation and Quantum Information, 2001, Cambridge University Press.

13 Aaronson S., Quantum Computing Since Democritus, 2013, Cambridge University Press.

14 Preskill J., Quantum Computing in the NISQ Era and Beyond, Quantum. (2018) 2, https://doi.org/10.22331/q-2018-08-06-79.

15 Pednault E., Gunnels J. A., Nannicini G. et al., Breaking the 49-Qubit Barrier in the Simulation of Quantum Circuits, 2018, Lawrence Livermore National Laboratory (LLNL).

16 Boixo S., Isakov S. V., Smelyanskiy V. N. et al., Characterizing Quantum Supremacy in Near-Term Devices, Nature Physics. (2018) 14, no. 6, 595–600, https://doi.org/10.1038/s41567-018-0124-x, 2-s2.0-85051265057.

17 Putnam A., Caulfield A. M., Chung E. S. et al., A Reconfigurable Fabric for Accelerating Large-Scale Datacenter Services, Communications of the ACM. (2016) 59, no. 11, 114–122, https://doi.org/10.1145/2996868, 2-s2.0-84994551954.

18 AMD, Alveo U55C Data Center Accelerator Card, 2025, https://www.amd.com/en/products/accelerators/alveo/u55c/a-u55c-p00g-pq-g.html.

19 Intel, IA-860M Intel Agilex FPGA Card, 2025, https://www.intel.com/content/www/us/en/partner/showcase/offering/a5b3b000000MNB1AAO/ia860m-intel-agilex-fpga-card.html.

20 Fowers J., Ovtcharov K., Papamichael M. et al., A Configurable Cloud-Scale DNN Processor for Real-Time AI, 2018 ACM/IEEE 45th Annual International Symposium on Computer Architecture (ISCA), 2018, IEEE, 1–14.

21 Lee Y. H., Khalil-Hani M., Marsono M. N. et al., An FPGA-Based Quantum Computing Emulation Framework Based on Serial-Parallel Architecture, International Journal of Reconfigurable Computing. (2016) 2016, 1–18, https://doi.org/10.1155/2016/5718124.

22 Rodríguez-Borbón J. M., Kalant A., Yamijala S. S., Oviedo M. B., Najjar W., and Wong B. M., Field Programmable Gate Arrays for Enhancing the Speed and Energy Efficiency of Quantum Dynamics Simulations, Journal of Chemical Theory and Computation. (2020) 16, no. 4, 2085–2098, https://doi.org/10.1021/acs.jctc.9b01284.

23 Silva A. and Zabaleta O. G., FPGA Quantum Computing Emulator Using High Level Design Tools, 2017 Eight Argentine Symposium and Conference on Embedded Systems (CASE), 2017, IEEE, 1–6.

24 El-Araby E., Mahmud N., Jeng M. J. et al., Towards Complete and Scalable Emulation of Quantum Algorithms on High-Performance Reconfigurable Computers, IEEE Transactions on Computers. (2023) 72, no. 8, 2350–2364, https://doi.org/10.1109/tc.2023.3248276.

25 Giorgio A., Project and Implementation of a Quantum Logic Gate Emulator on FPGA Using a Model-Based Design Approach, IEEE Access. (2024) 12, 41317–41353, https://doi.org/10.1109/access.2024.3377458.

26 Belfore II L., A Scalable FPGA Architecture for Quantum Computing Simulation, 2024.

27 Mazhar M. I., Tiwari A., Ali N., Patil A., Neiwal R. K. et al., Benchmarking Matrix Multiplications for Variable Qubit Size and Depth, 2024 ITU Kaleidoscope: Innovation and Digital Transformation for a Sustainable World (ITU K), 2024, IEEE, 1–7.

28 Li H. and Pang Y., FPGA-Accelerated Quantum Computing Emulation and Quantum Key Distillation, IEEE Micro. (2021) 41, no. 4, 49–57, https://doi.org/10.1109/mm.2021.3085431.

29 Fingerhuth M., Babej T., and Wittek P., Open Source Software in Quantum Computing, PLoS One. (2018) 13, no. 12, https://doi.org/10.1371/journal.pone.0208561.

30 Google Quantum AI, Cirq: A Python Framework for NISQ Quantum Circuits, 2025, https://quantumai.google/cirq.

31 Quantinuum, TKET: A Retargetable Compiler for Quantum Computing, https://docs.quantinuum.com/tket/.

32 Talkington S. and Jiang H., Efficient Unitary Method for Simulation of Driven Quantum Dot Systems, Journal of Physics Communications. (2020) 4, no. 5, https://doi.org/10.1088/2399-6528/ab8ff8.

33 Moueddene A. A., Khammassi N., Bertels K., and Almudever C. G., Realistic Simulation of Quantum Computation Using Unitary and Measurement Channels, Physical Review A. (2020) 102, no. 5, https://doi.org/10.1103/physreva.102.052608.

34 Sandvik A. W. and Vidal G., Variational Quantum Monte Carlo Simulations With Tensor-Network States, Physical Review Letters. (2007) 99, no. 22, https://doi.org/10.1103/physrevlett.99.220602.

35 Pang Y., Hao T., Dugad A., Zhou Y., and Solomonik E., Efficient 2D Tensor Network Simulation of Quantum Systems, SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, 2020, IEEE, 1–14.

36 Gillman E., Carollo F., and Lesanovsky I., Numerical Simulation of Critical Dissipative Non-equilibrium Quantum Systems With an Absorbing State, New Journal of Physics. (2019) 21, no. 9, https://doi.org/10.1088/1367-2630/ab43b0.

37 Bennakhi A., Byrd G., and Franzon P., Analyzing Quantum Circuit Depth Reduction with Ancilla Qubits in MCX Gates, 2024 IEEE International Conference on Quantum Computing and Engineering (QCE), 2024, 510–511, https://doi.org/10.1109/qce60285.2024.10380.

38 Bennakhi A., SV-FPGA: State Vector Processing on FPGAs, 2025, https://github.com/aabennak/SV-FPGA.

39 Pacific Northwest National Laboratory (PNNL), QASMBench: A Collection of Quantum Assembly Benchmarks, 2025, https://github.com/pnnl/QASMBench/tree/master.

40 Hennessy J. L. and Patterson D. A., Computer Architecture: A Quantitative Approach, 2011, Elsevier.

41 Rosen K. H., Discrete Mathematics & Applications, 1999, McGraw-Hill.

42 Murtovi A., Georgakoudis G., Parasyris K., Liao C., Laguna I., and Steffen B., Enhancing Performance Through Control-Flow Unmerging and Loop Unrolling on GPUs, 2024 IEEE/ACM International Symposium on Code Generation and Optimization (CGO), 2024, 106–118, https://doi.org/10.1109/cgo57630.2024.10444819.

43 Zhang C., Song Z., Wang H., Rong K., and Zhai J., HyQuas: Hybrid Partitioner Based Quantum Circuit Simulation System on GPU, Proceedings of the ACM International Conference on Supercomputing, 2021, 443–454, https://doi.org/10.1145/3447818.3460357.

44 Bellante A. and Zanero S., Quantum Matching Pursuit: A Quantum Algorithm for Sparse Representations, Physical Review A. (2022) 105, no. 2, https://doi.org/10.1103/physreva.105.022414.

45 Mermin N. D., Quantum Computer Science: An Introduction, 2007, Cambridge University Press.


© 2025. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.
