Artificial Intelligence Accelerators Based on

Full text

Turn on search term navigation

Introduction

The past half-decade has seen unprecedented growth in machine learning (ML) algorithms and their applications. For example, deep neural networks (DNNs) represent the state-of-the-art performance in a variety of context, such as large-scale computer vision, natural language processing, and data mining. DNNs have also impacted practical technologies such as web search, autonomous vehicles, and financial analysis.^[^] However, most ML algorithms have substantial computational and memory requirements, which greatly limit their training and deployment in resource-constrained environments. To address these challenges, there has been a significant trend in building high-performance specific hardware platforms, such as field-programmable gate arrays^[^] and application-specific integrated circuits.^[^] However, with the end of Dennard scaling and Moore's law, the power consumption and density of integrated electronic circuits have hit a bottleneck of processing more complex ML algorithms, especially when state-of-the-art data structures and the number of arithmetic operations are growing from a scale of millions to trillions. For example, the inference of one high-dimensional image on a ResNet-50^[^] requires $3 \times 10^{11}$ floating point operations per second (FLOPs s⁻¹), and a single training epoch requires $3 \times 10^{13}$ FLOPs s⁻¹ to update 25 million parameters.

Recent efforts on leveraging emerging techniques for efficient ML hardware focus on accelerating the key tensor-level multiply-accumulation (MAC) operations in ML algorithms, which are known as the most computation-intensive operations. For example, analog DNN hardware focuses on accelerating matrix multiplication, such as matrix–vector multiplying modules,^[^] mixed-mode MAC units,^[^] and memristor-based MAC.^[^] On the other hand, all-optical and hybrid optoelectronic implementations in early works have offered promising alternative routes to microelectronic implementations^[^] because of the advantages of executing MAC operation at the speed of light, high throughput, and very low or even nearly zero energy consumption. Recently, an integrated nanophotonic processor based on reconfigurable Mach–Zehnder interferometers at telecommunication wavelength demonstrated the advantages of optical DNN acceleration,^[^] where the matrix–vector multiplication (MVM) operation was decomposed into a series of multiplications following singular value decomposition. Moreover, optical approaches utilizing free-space optical components show great promise for ultrafast and ultraefficient hardware for ML because of highly parallel architectures and large-scale optoelectronic devices. For example, a large-scale photonic accelerator based on photoelectric multiplication and using the massive spatial multiplexing enabled by standard free-space optical components can operate at high speeds and very low energies.^[^] Furthermore, multiple 3D-printed diffractive optical layers in the terahertz range^[^] have shown the capability of performing linear classification, although they are not reconfigurable for new models as weights are physically hardcoded in passive diffractive layers. The reconfigurability for manipulating free-space propagation can be implemented using spatial light modulators (SLMs). For instance, optical convolutional layers in convolutional neural networks and random projections have been implemented using digital micromirror devices (DMDs) with binary amplitude modulation;^[^] optical Ising machines for computing the minima of spin Hamiltonians have been demonstrated by encoding spin variables in binary phases on SLMs;^[^] and a coherent vector–matrix multiplier has been achieved using DMDs and liquid-crystal SLMs.^[^]

In this article, we report a new high-performance optoelectronic architecture of performing general MVM operations by exploiting the extraordinary properties of graphene. Specifically, the architecture consists of a 2D array of SLMs and a 2D array of photodetectors with electrically controllable photoresponse, which are both constructed out of the combination of large-scale graphene monolayers and optical metamaterials. As graphene is gapless, these optoelectronic devices can be tailored to operate in an ultrabroad frequency range. Considering the inevitable nonuniformity of material properties and associated device variation, especially for large-scale polycrystalline graphene, we evaluate the influence of various contributing factors and conceive a methodology of performing accurate calculation even with imperfect devices and systems. Finally, we demonstrate a few representative ML algorithms showing the versatility and generality of the hardware platform.

Results and Discussion

Figure shows an illustration and the operation principle of the designed architecture to perform a general MVM operation $\vec{o} = W \vec{v}$ , consisting of a 2D array of SLMs for encoding vector information and a photodetector array with tunable photoresponsivity for encoding matrix element information. The input light is incoherent, such as high-efficiency narrow-band visible and near-infrared light emitting diodes and selective thermal emitters in the midinfrared range,^[^] so that any coherent interference effect is not involved. An N-dimensional vector $\vec{v} = (v_{1}, v_{2},…, v_{N})$ is mapped onto one row of SLMs, and the vector information is also replicated on other rows. This replication has two advantages: 1) It removes the necessity of involving beam splitting components that restrict chip integration and complicate optical alignment and 2) it relaxes the requirements of high-quality devices with large-scale uniformity and has a large tolerance of device variation. Each electro-optic unit of SLMs has an electrically controllable optical transmission function $T_{i} (V_{gv})$ encoding the information of $v_{i}$ , and the input power $P_{0}$ is modulated to $P_{0} T_{i} (V_{gv})$ after the passage. Afterward, the modulated light is detected by an array of photodetectors, where the photoresponsivity of each element can be electrically controlled. Each element $w_{j i}$ in the matrix W is encoded corresponding to the photoresponsivity of a photodetector in the array $R_{j i} (V_{gw})$ . As a result, the obtained photocurrent $I_{j i}$ is $P_{0} T_{i} (V_{gv}) R_{j i} (V_{gw})$ , and eventually generated photocurrents are added across columns in the same row that will be converted to voltage for further processing using electronic circuits, such as the implementation of nonlinear activation functions for DNNs. Mathematically, each element $o_{j}$ in $\vec{o} = (o_{1}, o_{2},…, o_{N})$ corresponds to $Σ_{i} I_{j i} = Σ_{i} P_{0} T_{i} (V_{gv}) R_{j i} (V_{gw})$ . Physically, both optical intensity and readout photocurrent are always-positive values. To perform mathematical calculations having both negative and positive real numbers, each element $v_{i}$ in $\vec{v}$ and $w_{i j}$ in the matrix W can be represented as a difference of two positive values $v_{i} = v_{i}^{+} - v_{i}^{-}$ and $w_{j i} = w_{j i}^{+} - w_{j i}^{-}$ . Thus, the MVM can be done through four multiplications $\vec{o} = W^{+} {\vec{v}}^{+} + W^{-} {\vec{v}}^{-} - W^{+} {\vec{v}}^{-} - W^{-} {\vec{v}}^{+}$ .

[IMAGE OMITTED. SEE PDF]

In addition, we lay out a design flowchart including electromagnetics (EM) simulation, system abstraction and integration, and performance benchmarking and evaluation; see Figure . This flowchart can be generalized to future design and optimization of other architectures. Specifically, the EM simulations connect material optical and optoelectronic properties with device-level response. The system modeling incorporates individual device input–output relation to construct a high-level computing architecture with sufficient software interface; ML algorithms utilize such an interface to run and evaluate performance, and in turn guide a better material and device design in the bottom level. This “closed-loop”-style design features an ultimate picture of “computer-designed computer” of the future. In this article, we use such a design methodology to demonstrate a graphene-based MVM operating in the midinfrared range, where there have been many applications, such as thermal imaging,^[^] as well as chemical and biomolecular sensing.^[^]

The detailed implementation and characterization of the array of SLMs and photodetectors are summarized in Figure . Figure shows the design of graphene-based SLMs, which consist of monolayer graphene and an extraordinary optical transmission (EOT) metamaterial on top. Since the first discovery of the EOT effect in subwavelength apertures,^[^] varieties of EOT structures have been extensively studied and utilized in many application scenarios to enhance light–matter interaction due to the enhanced electric field inside apertures. For example, metallic ring aperture EOT structures have demonstrated a $10^{10}$ sensitivity enhancement in a biomolecular sensor.^[^] Particularly, EOT structures have been used to construct electro-optic modulators for free-space optics. For instance, the combination of graphene with EOT metallic ring apertures shows a high-contrast-ratio terahertz electro-optic amplitude modulator through the adjustment of the graphene Fermi level.^[^] In addition to graphene, semiconductor p–i–n junctions can also be used to modulate free-space terahertz radiation through the control of carrier densities around EOT metallic ring apertures.^[^] Moreover, EOT metallic apertures with other geometries, such as a metallic slit array, are combined with graphene for the electro-optic control of midinfrared radiation.^[^] In our design, the EOT metamaterial unit has a 340 nm outer radius (r), a 50 nm gap (s), and the periodicity (p) of the array is 1 μm. The resulting transmission resonance is positioned around 4.5 μm. The graphene layer sits on the top of a dielectric thin layer that acts as an insulating layer for electrostatic doping to control the graphene Fermi level ( $E_{F}$ ) and modify its optical properties. The graphene only needs to be present around the aperture area.^[^] The scattering rate of graphene is assumed to be 2 meV, corresponding to a carrier mobility $\approx 6500 {cm}^{2} V^{- 1} s^{- 1}$ when $E_{F} = 0.5 eV$ ; see Methods for detailed conversion. Large-scale graphene films of such quality can be readily obtained nowadays using chemical vapor deposition (CVD)^[^] and can be easily manufactured using standard micro/nanofabrication methods. The EOT array serves a dual purpose: One is to enhance the light–matter interaction in graphene so that the modulation efficiency can be significant and the second is to be used as a top electrode for electrostatic control. Underneath each pixel is a transparent electrode, such as nickle ultrathin films^[^] and carbon nanotube thin films^[^] in the midinfrared range, for addressing each unit modulator. Note here that because our architecture is based on incoherent radiation, the only functionality requirement for SLMs is the light amplitude modulation and the associated phase modulation is irrelevant. Moreover, our architecture does not function based on optical diffraction. Thus, the unique features in our architecture significantly simplify the design of SLMs, which is generally challenging because of the requirements of decoupled modulation of amplitude and phase, as well as large diffraction angles.^[^] In addition, we only presented one possible example of SLM design and many other designs of high-performance SLMs can be also utilized and adapted in our architecture.^[^]

[IMAGE OMITTED. SEE PDF]

Full EM simulation results obtained from the commercial software Lumerical FDTD, shown in Figure , display the tunable device transmission at various $E_{F}$ from 0 to 0.22 eV. When the photon energy (0.27 eV for 4.5 μm) is greater than 2 $E_{F}$ , interband transition is allowed with substantial optical absorption and is further enhanced by the EOT resonance structure. When $E_{F}$ is greater than half the photon energy, due to Pauli blocking, the absorptive transition is forbidden, rendering larger transmittance. In a simple parallel-plate capacitor model, $E_{F}$ is proportional to the square root of gate bias. The accurate relation between the gate voltage and Fermi level is strongly dependent on graphene quality, uniformity, and electric gating circuitry. However, such a relation is not crucial practically because the relation between transmission and gate bias is of central interest and can be experimentally determined. The relationship between transmission at 4.5 μm wavelength and $E_{F}$ is fitted through a parabolic function and used in system modeling incorporating device variation; see Methods for more details. Note that, in practice, such fitting is not necessary and a look-up table for each device can be used to retrieve the applied gate voltage for a given output response.

Similarly, we designed graphene photoconductive detectors as shown in Figure , also consisting of EOT metamaterials on top of a monolayer graphene. Given a constant bias between the inner and outer metals of EOT structures, the generated photocurrent and photoresponsivity are electrostatically controlled by the individual bottom electrode and address wire. All the inner metals of EOT structures are connected together to harvest currents from each pixel along the same row to implement addition operation. The collected current can be converted to voltage and amplified for next-stage processing, such as implementing a nonlinear activation function. The electrically controllable Fermi level and Pauli blocking switch tune the absorption in graphene. Ideally, if we assume one photon generates a pair of electron and hole that both contribute to the measured photocurrent, Figure shows the photoresponsivity spectra and values at 4.5 μm under various Fermi levels, and the latter is also fitted using a parabolic function. Opposite to transmission, the responsivity decreases with increasing Fermi levels because of blocked interband absorption. Similar to the design of SLMs, there are many other possible designs for photodetector arrays, such as the demonstration in Yao et al. and Cakmakyapan et al.^[^]

In contrast to electronic implementation, this graphene optoelectronic architecture has ultrahigh parallelism, where all elements in both vector and matrix are computed simultaneously and at the speed of light. In addition, the ultrahigh carrier mobility of graphene promises high operation bandwidth of SLMs and photodetectors, which can be readily above 1 GHz to even tens of gigahertz.^[^] These two factors suggest the ultrahigh data throughput of the system. Furthermore, the subwavelength EOT structures with a highly confined electric field enable power-efficient modulation and low-noise graphene photodetectors can reduce energy consumption. This architecture also has potential of being integrated into a single chip, thanks to the large-scale CVD growth of graphene and its compatibility with modern micro/nanofabrication processes. We quantitatively evaluated the power efficiency and energy consumption of our proposed architecture for inference purpose in terms of FLOPs s⁻¹ per watt (FLOPs s⁻¹ W⁻¹) and joule per MAC (J/MAC), respectively. In the inference mode, the system works with high-throughput input vector data at ultrafast speed encoded by SLMs, while the matrix element encoded by the photodetector array is not changing or changing at a much slower speed. The quantity FLOPs s⁻¹W⁻¹ can be converted to the quantity J/MAC by taking the inverse if 1 FLOP is considered as 1 MAC, which is the case in most GPU platforms. The estimation is based on the assumption of $N \times N$ array size for both SLMs and photodetectors. The first contribution of energy consumption comes from the minimum detectable power of photodetectors, which is characterized by noise equivalent power (NEP). The total minimum optical power incident onto the photodetector array is given by $N^{2} \times NEP \times \sqrt{BW}$ , where BW is the bandwidth of photodetectors. Thus, the total input optical power into the whole system required for photodetectors to achieve minimum detectable power is given by $N^{2} \times NEP \times \sqrt{BW} / T_{avg}$ , where $T_{avg}$ is the average transmission in SLMs. A $N \times N$ system can perform $2 N^{2}$ FLOPs or MACs at the speed of BW. As a result, the J/MAC can be calculated as $NEP / 2 \sqrt{BW} T_{avg}$ . With realistic and feasible performance of graphene photodetectors^[^] in the midinfrared range, for example, BW = 10 GHz and NEP = 10 pW/ $\sqrt{Hz}$ , and $T_{avg} \sim 0.4$ for our SLM structure, the energy consumption is $\approx 125$ aJ/MAC. By further improving the operation bandwidth and NEP of detectors and transmission of SLMs, the energy consumption can be further reduced.

In addition to photodetectors, the switching energy of SLMs for encoding input information needs to be taken into account. The subwavelength EOT apertures feature tiny graphene active areas $\approx 0.1 μ m^{2}$ and the $E_{F}$ tunable range is $\approx 0.22 eV$ . In the parallel capacitor model, $E_{F} = ℏ v_{F} \sqrt{π n V_{g}}$ , where ℏ is the reduced Planck constant, $v_{F} {= 10}^{6}$ m s⁻¹ is the Fermi velocity, n is the carrier density, and $V_{g}$ is the applied dynamic gate voltage.^[^] $E_{F}$ is initially at the charge neutral point under zero or some constant static bias. n is the charge density and given by $ϵ_{0} ϵ_{d} / e d$ , where $ϵ_{0}$ is vacuum electric permittivity, $ϵ_{d}$ is the dielectric constant of the gate oxide, e is the electron charge, and d is the thickness of the gate oxide. For a 25 nm thick ${Al}_{2} O_{3}$ gate dielectric material ( $ϵ_{d} = 9$ ), $V_{g}$ is $\approx 1.8 V$ and the corresponding capacitance C is $ϵ_{0} ϵ_{d} A_{s} / d$ , where $A_{s} \approx 0.1 μ m^{2}$ is the active graphene area. The modulation speed $f_{s}$ is given by $1 / 2 π R_{s} C$ , where $R_{s}$ is the resistance in the capacitor charging circuit that mainly comes from graphene and contact resistance. If taking a reasonable resistance value $\approx 5 k Ω$ ^[^], $f_{s} \approx 100$ GHz. Indeed, graphene transistors have been demonstrated to be able to operate at $> 20$ GHz,^[^] which can match photodetector BW of $\approx 10 GHz$ . Furthermore, the energy consumption associated with charging and discharging capacitors can be estimated as $0.5 C V_{g}^{2}$ , which is $\approx 505 aJ$ . Because $N^{2}$ times switching of modulators corresponds to $2 N^{2}$ FLOPs or MACs, the energy consumption for switching SLMs is $\approx 252 aJ {MAC}^{- 1}$ . As a result, the total energy consumption related to SLMs and photodetectors in our architecture under inference mode is $\approx 377 aJ {MAC}^{- 1}$ , corresponding to $\approx 2.6 \times 10^{15}$ FLOPs s⁻¹ W⁻¹. Compared to the current state of the art of electronic hardware accelerators with the typical power efficiency of $\approx 10^{12}$ FLOPs s⁻¹ W⁻¹ and energy consumption of picojoules per MAC,^[^] our architecture is promising for three orders of magnitude improvement. As a final note, a reasonable near-term estimation of energy consumption from electronic components at the interface of optical and electrical domains, such as transimpedance amplifiers and other postprocessing electronic circuits for arithmetic calculations, is a few picojoules for one out of N channels.^[^] Thus, a large-scale system, for example, with potentially $N {> 10}^{3}$ ,^[^] can bring this energy consumption down to a level similar to the energy consumption from optoelectronic devices, which is a few pJ $\times N / 2 N^{2} \approx$ sub-fJ MAC⁻¹ to fJ MAC⁻¹. The proposed system has the estimated total power efficiency of $\approx 10^{15}$ FLOPs s⁻¹W⁻¹ and energy consumption of approximately femtojoules per MAC.

One important issue of emerging architectures bearing photonic and generally analog computing is scalability, which is especially notorious, involving unconventional materials, such as graphene nanomaterials. In the current example, there are inevitable device variations and nonuniformity when the array scale is large, which can be due to the polycrystalline nature of graphene and micro/nanofabrication variation. Performing accurate calculation with imperfect components is thus crucial for practical deployment, and the procedure of correcting such imperfection is necessary. Figure shows an $8 \times 8$ array of SLMs with 20% transmission variations at the same graphene Fermi level or gate voltage and similar variation also occurs in the responsivity of the photodetector array; see Methods for details about how the strength of variation is added onto device parameters.

[IMAGE OMITTED. SEE PDF]

In our correction procedure, for each row, we sweep applied gate voltages on each unit of SLM and the corresponding photodetectors pair by pair, and for each pair, we sweep the gate voltage of the SLM unit and that of the photodetector unit separately. From the readout, we obtain tuning curves for each unit of SLM and photodetector. Due to the nonuniformity of devices, the tuning range from each pair can vary, as shown in Figure . The developed strategy is to define the minimum tuning range on that row as the physical quantity unit so that any other reading from the row readout can be converted to algebraic values by dividing by this unit. Also, defining the unit as the minimum tuning range guarantees that each pair can achieve this range. This methodology highlights the advantages of replicating vector encoding on the rows of SLMs, through which the correction for each row is independent from others. In contrast, the calibration in the structures involving beam splitting elements is cross-linked between rows and is significantly more complicated. The detailed mathematical analysis and proof-of-correction procedure to generate the accurate output results are provided in Section , Supporting Information.

The accuracy of the graphene multiplier is evaluated by comparing the calculation results with those obtained from the standard linear algebra multiplication function. Figure shows a representative calculation error distribution of 10 000 multiplication calculations of a random $8 \times 8$ matrix and a random $8 \times 1$ vector with all elements within $- 1$ to 1. The histogram is fitted using a normal distribution and the standard deviation is the figure of merit for evaluation. Figure displays the standard deviation error for various degrees of device variation from 0% to $20 %$ . This variation applies to both SLMs and photodetectors. The error is nearly constant using the correction procedure described previously, proving the effectiveness of this procedure. Note here, the residual error for perfect devices originates from the finite precision of the applied gate voltage, which is assumed to be 8 bit. In addition to the limit of finite precision in applied gate voltages, the readout from detectors can also have finite precision. For example, commercially available digital CCD cameras in the visible range generally have 10 bit precision. We also investigated the influence of detector bit precision on accuracy performance, and as shown in Figure , the error drastically increases with small bit precision (e.g., 5 bit). Finally, we investigated the influence of noise in the system, which is modeled as a Gaussian noise added onto the readout end. The noise effect is reflected onto the error dependence on input power. Note here, the detector responsivity has been ideally modeled and in practice the responsivity can be quite different. Thus, the unit of input power on the x-axis is an arbitrary unit. As expected, as the input power and thus the signal-to-noise ratio decrease, the error increases. More error histograms for these contributing variations and noise are provided in Section , Supporting Information.

Finally, we utilize our graphene multiplier for running multiple ML algorithms. We emulated and corrected an $8 \times 8$ multiplier, and established a general matrix–matrix multiplication (GEMM) by segmenting the matrix into multiple blocks to fit the dimension of our multiplier emulator; see Methods and the Supporting Information for more details. We compared the quality of results of selected ML algorithms obtained with our GEMM multiplier with the results from a general-purpose processor (GPP), which is an Intel Xeon Gold 6230 processor in this work. First, we evaluated the graphene GEMM for image reconstruction, in which the image was compressed using singular value decomposition (SVD). The original image is shown in Figure , and has been compressed using SVD such that $image = U \times Σ \times V^{T}$ , where the dimensionalities of the $image, U, Σ, V^{T}$ are $R^{m \times n \times 3}$ , $R^{m \times p}$ , $R^{n \times p}$ , and $R^{p \times p}$ , respectively. Specifically, our experiments were conducted on image $\in R^{768 \times 512 \times 3}$ ( $m = 768$ , $n = 512$ ). Although the top singular vectors capture most of the variation, instead of using all the singular vectors and multiplying them as shown in SVD decomposition, we reconstructed the image using top-K singular vectors. The reconstructed image ( $K = 50$ ) with GPPs (Figure ) has the same quality as that of the image reconstructed using the graphene multiplier (Figure ). The second ML algorithm we evaluated with graphene GEMM is unsupervised learning using a support vector machine (SVM) algorithm on the Blobs dataset. As shown in Figure , the clustering results generated with our GEMM multiplier match the results obtained on GPPs, where the loss differs $< 0.2 %$ .

[IMAGE OMITTED. SEE PDF]

Figure displays another demonstration on ML algorithms conducted on multilayer perceptron (MLP) neural networks. Specifically, we built and trained a two-layer MLP network without using a nonlinear activation function for two multiclass image classification datasets, MNIST10 and Fashion-MNIST10. Details about training settings can be found in Methods. Figure display the prediction confusion matrices for these two datasets, where the graphene multiplier achieved 88.7% accuracy for MNIST10 and 76.8% accuracy for Fashion-MINST10. In comparison, the GPP achieved slightly better prediction performance using the same MLP architecture, 92.3% and 78.7% accuracy for MNIST10 and Fashion-MNIST10, respectively. Although we demonstrate that the graphene GEMM multiplier can achieve similar results to GPPs for the first two ML algorithms, there are noticeable accuracy degradations for image classification tasks using MLPs. We found that the accuracy degradations are mainly caused by initialization and training algorithms, such that the learned parameters in MLPs are very small with the mean close to zero. Due to such distribution of MLP parameters, the inevitable errors from the graphene multiplier associated with noise and finite precision become noticeable compared to other applications. However, we believe that the impacts of errors introduced by the graphene multiplier will be much smaller when applied to more robust and larger neural network architectures.

Conclusion

In summary, we report a new high-performance optoelectronic architecture of performing general MVM and GEMM operations by exploiting the extraordinary properties of graphene. Specifically, this architecture consists of a 2D array of SLMs and photodetectors with electrically controllable transmission and photoresponse, which are both constructed out of the combination of large-scale graphene monolayers and optical EOT metamaterials. This system possesses ultrahigh data throughput and ultralow energy consumption because of the extreme parallelism of the architecture, the ultrahigh carrier mobility of graphene, and the high-power-efficiency electro-optic control. From the perspective of practically deploying in a large-scale system, we design a methodology of performing accurate calculation with imperfect devices and systems and evaluate the influence of imperfection, considering inevitable nonuniformity of material properties and associated device variation. Finally, we demonstrate a few ML algorithms showing the versatility and generality of the hardware.

Experimental Section

Graphene Model in Lumerical FDTD

The graphene monolayer was modeled as a 2D rectangle conducting sheet in the Lumerical material library, including both interband and intraband contributions. The Fermi level and scattering rate are two parameters used to calculate dielectric constants used for Lumerical. The scattering rate used in Lumerical can be converted to mobility as follows. The damping constant $γ = \frac{q ℏ v_{F}^{2}}{μ E_{F}}$ is twice the scattering rate setting in the Lumerical library, where $q = 1.6 \times 10^{- 19}$ C is the value of electron charge, $ℏ = 1.05 \times 10^{- 34} J s$ is the reduced Planck constant, $v_{F} = 1 \times 10^{6}$ m⁻¹ s⁻¹ is the Fermi velocity, μ is the carrier mobility, and $E_{F}$ is the Fermi level. Thus, in this study, a 2 meV damping rate, corresponding to a 1 meV scattering rate in the Lumerical setting, was used and is equivalent to carrier mobility of $\approx 6500 {cm}^{2} V^{- 1} s^{- 1}$ at $E_{F} = 0.5 eV$ .

Device Response Fitting and Variation Modeling

The simulated transmission of SLMs and absorption of photodetectors using EM simulators as a function of various Fermi levels were fitted using second-order polynomials. Concretely, $T (V_{g}) = a_{t}^{(2)} V_{g}^{2} + a_{t}^{(1)} V_{g} + a_{t}^{(0)}$ and $R (V_{g}) = b_{r}^{(2)} V_{g}^{2} + b_{r}^{(1)} V_{g} + b_{r}^{(0)}$ . The device variation was modeled in the way that fitting parameters $a_{t}^{(i)}$ and $b_{r}^{(i)}$ vary across different devices in the array of SLMs and photodetectors. Specifically, taking SLMs for example, the parameter vector $\vec{a} = (a_{t}^{(2)}, a_{t}^{(1)}, a_{t}^{(0)})$ varies as $\vec{\tilde{a}} = - p X \vec{a} + (1.0 + p / 2) \vec{a}$ , where X is a random number between 0 and 1 with uniform distribution generated for each unit and p denotes the strength of variation. $\vec{\tilde{a}}$ is in the range between $(1.0 - p / 2) \vec{a}$ and $(1.0 + p / 2) \vec{a}$ . A 20% variation means $p = 0.2$ and for each unit of the array, X is randomly generated.

Implementation of GEMM

GEMM is a common algorithm in linear algebra, ML, statistics, and many other domains. Mostly, this includes using blocking, inner products, outer products, and systolic array techniques, which breaks the computations of GEMM to better utilize vector multiplication or MVM. Specifically, for this work, an optoelectronic GEMM was developed by utilizing the proposed optoelectronic MVM, where the targeted matrices were decomposed into block matrices (also known as block partitioning). GEMM was then implemented recursively using a divide-and-conquer algorithm, which was used to execute the ML algorithms discussed in Figure . See the Supporting Information for more details and illustration.

Implementation and Training of ML Algorithms

PyTorch was used for the implementation and evaluation of the ML algorithms and the fundamental computational kernels generally used were PyTorch matrix multiplications (torch.mm; see ). For example, the batched inference process of the MLP network for classifying the MNIST and Fashion-MNIST tasks (Figure ) was executed based on 3D matrix multiplications using torch.mm, and also for SVD and SVM algorithms. As discussed in the GEMM implementation, such 3D matrix multiplications can be executed using the graphene optoelectronic MVM-based GEMM kernel. In addition to torch.mm in PyTorch, a customized Python GEMM Application Programming Interface (API) was implemented that takes two matrices as input and returns the resulting matrix as output, where the multiplications are completed through our graphene optoelectronic MVM emulator. Then, instead of calling torch.mm for matrix multiplications in ML algorithms, all matrix multiplications can be replaced with the customized GEMM API. Therefore, through the comparison of the results obtained from calling torch.mm for matrix multiplications with those from the customized GEMM API, the final prediction performance of those ML algorithms using the optoelectronic architecture herein can be evaluated.

The Autograd mechanism in PyTorch was used for the training of ML algorithms. Autograd is a reverse automatic differentiation system, which records a graph representation of all the operations that encode the input–output mappings of ML models. As a result, it returns as a directed acyclic graph whose leaves are the input tensors and roots are the output tensors. By tracing this graph from roots to leaves, gradients can be automatically computed based on the chain-rule for gradient-based backpropogation algorithms. While the evaluated ML algorithms are all implemented with GEMM operators, the autograd graphs can simply be constructed using the PyTorch-autograd mechanism and the gradient descent algorithm Adam can be deployed to train the ML models according to a given loss function. Specifically, mean-square-error loss was used herein to train the SVM-based clustering application, and negative-log-likelihood-loss to train the MNIST10 and Fashion-MNIST10 classification tasks. The Adam backpropagation algorithm settings herein include learning rate $lr = 0.1$ , $β_{1} = 0.9$ , $β_{2} = 0.999$ , $ϵ {= 10}^{- 8}$ , and without $L 2$ penalty.

Acknowledgements

W.G. and R.C. thank the support from the University of Utah start-up fund. C.Y. thanks the support from grants NSF-2019336 and NSF-2008144.

Conflict of Interest

The authors declare no conflict of interest.

Data Availability Statement

The data that support the findings of this study are available from the corresponding author upon reasonable request.

References

Y. LeCun, Y. Bengio, G. Hinton, Nature 2015, 521, 436.

C. Zhang, P. Li, G. Sun, Y. Guan, B. Xiao, J. Cong, in Proc. of the 2015 ACM/SIGDA Int. Symp. on Field-Programmable Gate Arrays, Association for Computing Machinery, New York, NY, USA 2015 pp. 161–170.

H. Sharma, J. Park, D. Mahajan, E. Amaro, J. K. Kim, C. Shao, A. Mishra, H. Esmaeilzadeh, in The 49th Annual IEEE/ACM Int. Symp. on Microarchitecture (MICRO), IEEE, Piscataway, NJ 2016 pp. 1–12.

F. Akopyan, J. Sawada, A. Cassidy, R. Alvarez-Icaza, J. Arthur, P. Merolla, N. Imam, Y. Nakamura, P. Datta, G.-J. Nam, B. Taba, M. Beakes, B. Brezzo, J. B. Kuang, R. Manohar, W. P. Risk, B. Jackson, D. S. Modha, IEEE Trans. Comput.-Aided Des. Integr. Circuits Syst. 2015, 34, 1537.

Y.-H. Chen, T. Krishna, J. S. Emer, V. Sze, IEEE J. Solid-State Circuits 2016, 52, 127.

K. He, X. Zhang, S. Ren, J. Sun, in Proc. of the IEEE Conf. on Computer Vision and Pattern Recognition, IEEE 2016, pp. 770–778.

C. R. Schlottmann, P. E. Hasler, IEEE J. Emerg. Select. Topics Circuits Syst. 2011, 1, 403.

Z. Wang, S. Joshi, S. Savel'ev, W. Song, R. Midya, Y. Li, M. Rao, P. Yan, S. Asapu, Y. Zhuo, H. Jiang, P. Lin, C. Li, J. H. Yoon, N. K. Upadhyay, J. Zhang, M. Hu, J. P. Strachan, M. Barnell, Q. Wu, H. Wu, S. Williams, Q. Xia, J. J. Yang, Nat. Electron. 2018, 1, 137.

D. Bankman, L. Yang, B. Moons, M. Verhelst, B. Murmann, IEEE J. Solid-State Circuits 2018, 54, 158.

Z. Wang, S. Joshi, S. E. Savel'ev, H. Jiang, R. Midya, P. Lin, M. Hu, N. Ge, J. P. Strachan, Z. Li, Q. Wu, M. Barnell, G.‐L. Li, H. L. Xin, R. S. Williams, Q. Xie, J. J. Yang, Nat. Mater. 2017, 16, 101.

I. Boybat, M. Le Gallo, S. Nandakumar, T. Moraitis, T. Parnell, T. Tuma, B. Rajendran, Y. Leblebici, A. Sebastian, E. Eleftheriou, Nat. Commun. 2018, 9, 1.

M. Hu, C. E. Graves, C. Li, Y. Li, N. Ge, E. Montgomery, N. Davila, H. Jiang, R. S. Williams, J. J. Yang, Q. Xia, J. P. Strachan, Adv. Mater. 2018, 30, 1705914.

J. W. Goodman, A. Dias, L. Woody, Opt. Lett. 1978, 2, 1.

J. J. Hopfield, Proc. Natl. Acad. Sci. U.S.A. 1982, 79, 2554.

D. Psaltis, N. Farhat, Opt. Lett. 1985, 10, 98.

T. Lu, S. Wu, X. Xu, T. Francis, Appl. Opt. 1989, 28, 4908.

G. Dunning, Y. Owechko, B. Soffer, Opt. Lett. 1991, 16, 928.

M. Reck, A. Zeilinger, H. J. Bernstein, P. Bertani, Phys. Rev. Lett. 1994, 73, 58.

I. Bar-Tana, J. P. Sharpe, D. J. McKnight, K. M. Johnson, Opt. Lett. 1995, 20, 303.

Y. Shen, N. C. Harris, S. Skirlo, M. Prabhu, T. Baehr-Jones, M. Hochberg, X. Sun, S. Zhao, H. Larochelle, D. Englund, M. Solijačić, Nat. Photonics 2017, 11, 441.

N. C. Harris, J. Carolan, D. Bunandar, M. Prabhu, M. Hochberg, T. Baehr-Jones, M. L. Fanto, A. M. Smith, C. C. Tison, P. M. Alsing, D. Englund, Optica 2018, 5, 1623.

R. Hamerly, L. Bernstein, A. Sludds, M. Soljačić, D. Englund, Phys. Rev. X 2019, 9, 021032.

X. Lin, Y. Rivenson, N. T. Yardimci, M. Veli, Y. Luo, M. Jarrahi, A. Ozcan, Science 2018, 361, 1004.

J. Chang, V. Sitzmann, X. Dun, W. Heidrich, G. Wetzstein, Sci. Rep. 2018, 8, 1.

A. Saade, F. Caltagirone, I. Carron, L. Daudet, A. Drémeau, S. Gigan, F. Krzakala, in 2016 IEEE Int. Conf. on Acoustics, Speech and Signal Processing, IEEE, Piscataway, NJ 2016 pp. 6215–6219.

D. Pierangeli, G. Marcucci, C. Conti, Phys. Rev. Lett. 2019, 122, 213902.

J. Spall, X. Guo, T. D. Barrett, A. Lvovsky, Opt. Lett. 2020, 45, 5752.

T. Inoue, M. De Zoysa, T. Asano, S. Noda, Opt. Express 2016, 24, 15101.

Y. Gong, K. Li, N. Copner, H. Liu, M. Zhao, B. Zhang, A. Pusch, D. L. Huffaker, S. S. Oh, Nanophotonics 2021, 1.

M. Vollmer, K.-P. Möllmann, Infrared Thermal Imaging: Fundamentals, Research and Applications, John Wiley & Sons, Hoboken, NJ 2017.

B. Mizaikoff, Chem. Soc. Rev. 2013, 42, 8683.

T. W. Ebbesen, H. J. Lezec, H. Ghaemi, T. Thio, P. A. Wolff, Nature 1998, 391, 667.

H. Liu, P. Lalanne, Nature 2008, 452, 728.

F. J. Garcia-Vidal, L. Martin-Moreno, T. Ebbesen, L. Kuipers, Rev. Mod. Phys. 2010, 82, 729.

L. Xie, W. Gao, J. Shu, Y. Ying, J. Kono, Sci. Rep. 2015, 5 8671.

W. Gao, J. Shu, K. Reichel, D. V. Nickel, X. He, G. Shi, R. Vajtai, P. M. Ajayan, J. Kono, D. M. Mittleman, Q. Xu, Nano Lett. 2014, 14, 1242.

J. Shu, W. Gao, K. Reichel, D. Nickel, J. Dominguez, I. Brener, D. M. Mittleman, Q. Xu, Opt. Express 2014, 22, 3747.

S. Kim, M. S. Jang, V. W. Brar, Y. Tolstova, K. W. Mauser, H. A. Atwater, Nat. Commun. 2016, 7, 1.

L. Banszerus, M. Schmitz, S. Engels, J. Dauber, M. Oellers, F. Haupt, K. Watanabe, T. Taniguchi, B. Beschoten, C. Stampfer, Sci. Adv. 2015, 1, e1500222.

D. Ghosh, L. Martinez, S. Giurgola, P. Vergani, V. Pruneri, Opt. Lett. 2009, 34, 325.

Z. Wu, Z. Chen, X. Du, J. M. Logan, J. Sippel, M. Nikolou, K. Kamaras, J. R. Reynolds, D. B. Tanner, A. F. Hebard, A. G. Rinzler, Science 2004, 305, 1273.

S.-Q. Li, X. Xu, R. M. Veetil, V. Valuckas, R. Paniagua-Domnguez, A. I. Kuznetsov, Science 2019, 364, 1087.

C. Qiu, J. Chen, Y. Xia, Q. Xu, Sci. Rep. 2012, 2, 1.

C. Qiu, T. Pan, W. Gao, R. Liu, Y. Su, R. Soref, Opt. Lett. 2015, 40, 4480.

C. Peng, R. Hamerly, M. Soltani, D. R. Englund, Opt. Express 2019, 27, 30669.

Y. Yao, R. Shankar, M. A. Kats, Y. Song, J. Kong, M. Loncar, F. Capasso, Nano Lett. 2014, 14, 6526.

S. Cakmakyapan, P. K. Lu, A. Navabi, M. Jarrahi, Light Sci. Appl 2018, 7, 1.

B. Jafari, H. Soofi, Appl. Opt. 2019, 58, 6280.

W. Gao, J. Shu, C. Qiu, Q. Xu, ACS Nano 2012, 6, 7806.

Y.-M. Lin, K. A. Jenkins, A. Valdes-Garcia, J. P. Small, D. B. Farmer, P. Avouris, Nano Lett. 2009, 9, 422.

L. Anzi, A. Mansouri, P. Pedrinazzi, E. Guerriero, M. Fiocco, A. Pesquera, A. Centeno, A. Zurutuza, A. Behnam, E. A. Carrion, E. Pop, R. Sordan, 2D Mater. 2018, 5, 025014.

A. Reuther, P. Michaleas, M. Jones, V. Gadepally, S. Samsi, J. Kepner, in 2019 IEEE High Performance Extreme Computing Conf., IEEE, Piscataway, NJ 2019, pp. 1–9.

J. W. Beletic, R. Blank, D. Gulbransen, D. Lee, M. Loose, E. C. Piquette, T. Sprafke, W. E. Tennant, M. Zandian, J. Zino, in High Energy, Optical, and Infrared Detectors for Astronomy III, Vol. 7021, International Society for Optics and Photonics, Bellingham, WA 2008, p. 70210H.

K. Fan, J. Y. Suen, W. J. Padilla, Opt. Express 2017, 25, 25318.

Word count: 5828

Show less

Abstract

Translate

Optical and optoelectronic approaches of performing matrix–vector multiplication (MVM) operations have shown the great promise of accelerating machine learning (ML) algorithms with unprecedented performance. The incorporation of nanomaterials into the system can further improve the device and system performance thanks to their extraordinary properties, but the nonuniformity and variation of nanostructures in the macroscopic scale pose severe limitations for large‐scale hardware deployment. Here, a new optoelectronic architecture is presented, consisting of spatial light modulators and tunable responsivity photodetector arrays made from graphene to perform MVM. The ultrahigh carrier mobility of graphene, high‐power‐efficiency electro‐optic control, and extreme parallelism suggest ultrahigh data throughput and ultralow energy consumption. Moreover, a methodology of performing accurate calculations with imperfect components is developed, laying the foundation for scalable systems. Finally, a few representative ML algorithms are demonstrated, including singular value decomposition, support vector machine, and deep neural networks, to show the versatility and generality of the platform.

Details

Title

Artificial Intelligence Accelerators Based on Graphene Optoelectronic Devices

Author

Gao, Weilu¹

; Yu, Cunxi¹; Chen, Ruiyang¹

¹ Department of Electrical and Computer Engineering, The University of Utah, Salt Lake City, UT, USA

Section

Research Articles

Publication year

2021

Publication date

Jun 1, 2021

Publisher

John Wiley & Sons, Inc.

ISSN

26999293

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1002/adpr.202100048

ProQuest document ID

3089860804

Artificial Intelligence Accelerators Based on Graphene Optoelectronic Devices

Jump to:

Full text

Abstract

Details

Suggested sources