With the rapid development of deep learning algorithms in recent years, artificial neural networks (ANNs) have achieved remarkable breakthroughs in various fields of artificial intelligence (AI), such as face recognition, voice recognition, natural language processing, robotics, and autonomous driving.[1–5] However, when performing continual learning on sequential tasks, ANNs suffer from catastrophic forgetting.[6–10] That is, when these networks are trained on new tasks, previously learned tasks are quickly forgotten (see Figure 1). To address this issue, several continual learning neural networks, e.g., gradient episodic memory (GEM),[11] elastic weight consolidation (EWC),[12] continual learning through synaptic intelligence (SI),[13] general replay (GR),[14] and progressive neural networks (PNN),[15] have been proposed. These models effectively mitigate catastrophic forgetting and achieve outstanding performance on continual learning benchmarks such as split-MNIST and split-FashionMNIST. Nonetheless, implementing the aforementioned models on conventional CMOS circuit systems faces great challenges due to their complex network structures and enormous numbers of parameters. The frequent and massive data shuttling between memory and processing units incurs unaffordable computing costs, especially for resource-limited edge applications.
Figure 1. Schematic of catastrophic forgetting. When ANNs perform multitask continual learning, the hidden weights are expected to be plastic enough to adapt to new tasks while remaining stable enough to avoid being overwritten and losing previous tasks' information. This trade-off is known as the plasticity–stability dilemma. Catastrophic forgetting is caused by failing to balance plasticity and stability, leading to a sharp accuracy decline on the previous cat task (the orange curve) while learning the present dog task (the green curve).
One promising approach is to perform continual learning using the in-memory computing (IMC) paradigm, where certain arithmetic operations are carried out by the memory itself. As a result, the IMC paradigm offers superior parallelism and energy efficiency, especially when handling data-intensive tasks.[16–18] Resistive nonvolatile memories (NVMs), such as resistance random access memory (RRAM), phase-change memory, and magnetic random access memory, have been actively pursued for demonstrating the highly efficient IMC paradigm.[19–21] However, due to device nonidealities, e.g., device variation, programming nonlinearity and asymmetry, and limited conductance states, present IMC devices still struggle to match the high-precision (e.g., 32-bit floating-point) calculations that are critical for these continual learning neural network models.
Human brains excel at continual learning in a lifelong manner. The ability of the brain to adapt incrementally is mediated by a rich set of neurophysiological processing principles that regulate the stability–plasticity balance of synapses. Synaptic plasticity is an essential feature of the brain that allows us to learn, remember, and forget.[22] On the one hand, the synaptic weights need to be plastic enough to maintain the learning potential for new tasks; on the other hand, the weights need to remain stable to avoid being overwritten extensively during training. Metaplasticity, which refers to activity-dependent changes in neural functions that modulate subsequent synaptic plasticity,[23] has been viewed as an important rule for balancing the stability and plasticity of synapses.[24] Recently, Laborieux et al. reported a binarized neural network (BNN) with metaplastic hidden weights to mitigate catastrophic forgetting, where the metaplasticity of a synapse was used as a criterion of its importance for the tasks learned so far.[25] This type of BNN not only embodies the potential benefits of the synergy between neuroscience and machine learning research but also provides a low-precision way to achieve continual learning with resistive IMC technology. However, the characterization and hardware implementation of such BNNs are still missing, and thus, the impact of device nonidealities and weight precision on system performance remains to be elucidated.
Here, we propose a metaplasticity-inspired mixed-precision continual learning (MPCL) model for the hardware implementation of continual learning. The balance between plasticity and stability is regulated by the underlying sensitivity to data changes. The MPCL model is deployed on a hybrid analogue–digital computing platform equipped with a 256 kb RRAM IMC chip to perform 5-split-MNIST and 5-split-FashionMNIST continual learning tasks. By taking advantage of the IMC paradigm of the RRAM chip, both an ≈200× reduction of the energy consumption for the multiply-and-accumulation (MAC) operations compared to traditional CMOS digital systems and high average recognition accuracies comparable to the state-of-the-art performance have been demonstrated during the inference phase, providing a promising solution to future autonomous edge AI systems.
Mixed-Precision Continual Learning Model
To reduce the precision requirement, we propose the MPCL neural network model with mixed-precision weights. Floating-point weights, which are precise enough to reflect the minor weight changes associated with learning, serve as the plastic memory that selectively forgets unimportant information when learning subsequent tasks. In contrast, binary weights, which are less likely to respond to small weight changes, serve as the stable memory for storing important information. The entire training procedure is shown in Figure 2a. Inspired by metaplasticity, the MPCL adopts an asymmetric weight update strategy regulated by the memory coefficient m for the floating-point weights' back-propagation during training and uses the binary weights for task inference. When training on a new task, if a floating-point weight updates in the same direction as its sign, the corresponding binary weight does not change, so the MPCL retains the high inference accuracy of the learned tasks. If a floating-point weight updates in the direction opposite to its sign, the corresponding binary weight may flip during training, leading to an accuracy decline on the previous task. The maximum allowed change per update of those floating-point weights whose update direction opposes their sign can be regulated through m, which gives the binary weights closely tied to the new task the possibility to switch. In contrast, the remaining floating-point weights, whose changes share the same sign as the weights, can update freely (see Experimental Section). Owing to the additional computations, the MPCL model adds only a small amount of training time compared to traditional BNNs, while the inference time cost is the same (see Figure S4, Supporting Information). Thanks to the asymmetric update strategy, the MPCL limits the switching of binary weights that are less relevant to the new task, avoiding massive weight overwriting and thus effectively mitigating catastrophic forgetting.
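As a rough illustration of this asymmetric rule, the Python sketch below bounds only those weight updates that oppose the current sign of the weight, while letting the remaining updates pass through unchanged. The function name, the use of the memory coefficient m directly as the bound, and the array shapes are illustrative assumptions rather than the exact MPCL formulation.

import numpy as np

def asymmetric_update(w_float, delta, m):
    """Illustrative sketch of an MPCL-style asymmetric update.

    delta: optimizer-proposed update (e.g., from Adam).
    m:     memory coefficient; here assumed to directly bound the magnitude of
           updates that oppose the weight's sign (an assumption for illustration).
    """
    opposing = (delta * w_float) < 0                        # updates that could flip the binary weight
    bounded = np.sign(delta) * np.minimum(np.abs(delta), m)
    return np.where(opposing, w_float + bounded, w_float + delta)

# Binary weights used for inference are simply the signs of the floating-point weights.
rng = np.random.default_rng(0)
w_float = rng.uniform(-0.1, 0.1, size=(64, 256))
delta = 0.01 * rng.standard_normal((64, 256))
w_float = asymmetric_update(w_float, delta, m=0.005)
w_binary = np.sign(w_float)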
Figure 2. a) Flowchart of the MPCL and the weight update strategy. The MPCL model is a fully connected feed-forward neural network (layer sizes 64-256-10) with batch normalization and sign activation. The hidden layers use mixed-precision weights, where the binary weights are used in forward-propagation and loss function evaluation, and the floating-point weights are used for back-propagation. b) Simulation results of the weight switch rate and the two-task average accuracy under different m values. The weight switch rate is defined as the fraction of binary weights that switch after learning the current task. The best MPCL performance is obtained when the weight switch rate equals 3.48% and m equals 0.005. c) Confusion matrices of the recognition accuracy on the 5-split-FashionMNIST dataset with m = 0 and m = 0.005. d) Inference accuracy of the MPCL and other reported continual learning models on the 5-split-FashionMNIST dataset. We choose the GEM,[11] EWC,[12] and SI[13] models for comparison. e) Comparison of noise immunity. Normally distributed noise is introduced to the hidden-layer weights represented by the conductance of the RRAM cells. The "conductance noise" denotes the magnitude of the normally distributed noise introduced to the hidden-layer weights during simulation. Thanks to the binary weights, the MPCL shows stronger robustness than the reported floating-point-type models, including GEM,[11] EWC,[12] and SI.[13]
As an appropriate m is crucial for the MPCL to achieve high-performance multitask classification, we further investigate the strategy to refine m. First, to analyze the correlation between the MPCL performance and m, we randomly select two tasks out of the 5-split-FashionMNIST as the previous and current tasks. The multitask average accuracy is defined as acc_avg = (n_pre × acc_pre + n_cur × acc_cur)/(n_pre + n_cur), where acc_pre and acc_cur represent the accuracy of the previous and current tasks, respectively, and n_pre and n_cur represent the data sizes of the previous and current tasks, respectively. Figure 2b illustrates the impact of m on the binary weight switch rate and the average accuracy. The simulation shows that, when m is 0.001, the binary weight switch rate exceeds 45% after learning the current task because the update step sizes of the corresponding floating-point weights are barely bounded, resulting in severe catastrophic forgetting of the previous task. With an increased m, the accuracy on the previous task gradually recovers owing to the decreased rate of binary weight flipping. The best average accuracy over the two tasks appears at m = 0.005 (see Figure S1, Supporting Information). When m is increased further, the weight switch rate drops sharply to 1%, revealing that the network loses the capability to learn the new task due to the lack of plastic weights. Figure 2c shows the confusion matrices of recognition accuracy obtained by simulating training on the 5-split-FashionMNIST, clearly demonstrating the continual learning ability of the MPCL with an appropriate m value.
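For reference, the two quantities swept in Figure 2b can be computed as in the short sketch below; the function names and the NumPy representation of the binary weights are assumptions for illustration.

import numpy as np

def average_accuracy(acc_pre, n_pre, acc_cur, n_cur):
    """Data-size-weighted average accuracy over the previous and current tasks."""
    return (n_pre * acc_pre + n_cur * acc_cur) / (n_pre + n_cur)

def weight_switch_rate(w_binary_before, w_binary_after):
    """Fraction of binary weights that flipped after learning the current task."""
    return float(np.mean(w_binary_before != w_binary_after))

print(average_accuracy(acc_pre=0.97, n_pre=2000, acc_cur=0.98, n_cur=2000))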
We then compare the MPCL with other existing continual learning algorithms on the 5-split-FashionMNIST dataset. As baselines, the model freely updated on each new task is denoted as None, while the model trained jointly on the data of both current and previous tasks is referred to as Joint. As shown in Figure 2d, the MPCL effectively mitigates catastrophic forgetting and achieves continual learning accuracy (average accuracy of 97.7% over the five tasks) comparable to other reported continual learning models. Finally, to probe the robustness of the MPCL, normally distributed noise is introduced to the hidden-layer weights (see Experimental Section) to simulate the impact of device variation on the system's inference performance. As shown in Figure 2e, the MPCL is significantly less prone to noise than the floating-point-type models. Therefore, the MPCL model is a natural choice for hardware implementation of continual learning, considering the inevitable device nonidealities of resistive NVMs.
It should be pointed out that the MPCL model will still face catastrophic forgetting or loss of learning capability when the number of learning tasks becomes overwhelming. According to Gido et al.,[26] this challenge applies to all existing continual learning models[11–14] as the number of learning tasks scales up. A possible explanation for this phenomenon is that the hypothesis space for the weight search gradually shrinks as the number of learned tasks grows, preventing the network from reaching the globally optimal solution.[27] This limitation can be mitigated by expanding the network architecture and increasing the network's capacity, e.g., with progressive neural networks.[15]
Hybrid Analogue–Digital Hardware System for Continual Learning
To leverage the IMC paradigm for continual learning, a hybrid analogue–digital hardware system has been developed to implement the MPCL neural network (Figure 3). The MPCL contains a three-layer fully connected feed-forward neural network with both binary and floating-point weights. During continual learning, the binary weights are used in forward-propagation and loss evaluation, while the floating-point weights are used for the weight update in back-propagation and for updating the corresponding binary weights through a sign function (see Experimental Section).
Figure 3. The hybrid analogue–digital computing system. The MPCL consists of a three-layer fully connected feed-forward neural network with binary (pink) and floating-point (blue) mixed-precision weights. By replacing the most computationally expensive floating-point VMM operations with binary ones, the MPCL significantly lowers the requirements for weight precision during the inference phase. The hybrid analogue–digital system consists of a 256 kb RRAM computing-in-memory chip, a general digital processor, and a PCB. During hardware implementations, the binary weights are physically represented by the normalized conductance of the RRAM differential pairs, and the floating-point operations are carried out by the general digital processor. The 5-split-FashionMNIST and 5-split-MNIST continual learning datasets are used to benchmark the continual learning performance.
The hybrid analogue–digital hardware system integrates an RRAM computing-in-memory chip and a general digital processor on a printed circuit board (PCB). The RRAM chip physically embodies the binary weights to accelerate the low-precision binary vector–matrix multiplication (VMM) in the MPCL forward-propagation. The high-precision digital processor consists of a field-programmable gate array (FPGA) together with an advanced RISC machine (ARM) processor, which performs the floating-point operations, e.g., normalization and activation, and commands the RRAM chip. When deploying the MPCL to the hybrid analogue–digital system, the RRAM array is programmed according to the pretrained binary weights, where each +1 or −1 binary weight is encoded by an RRAM differential pair (see Experimental Section and Supplementary Text 1, Supporting Information). As weight storage and processing take place on the same RRAM chip, the energy and time spent transferring data between processor and memory are minimized during the inference phase.
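A minimal sketch of this mapping is given below, assuming each +1 is stored as an (LRS, HRS) differential pair and each −1 as (HRS, LRS), and that the signed result is recovered by subtracting the two accumulated currents; the conductance values and function names are illustrative placeholders, not the chip's actual parameters.

import numpy as np

G_LRS = 100e-6   # assumed low-resistance-state conductance (S), illustrative only
G_HRS = 1e-6     # assumed high-resistance-state conductance (S), illustrative only

def map_binary_weights(w_binary):
    """Encode each +1/-1 weight as a differential conductance pair (G_pos, G_neg).
    The effective weight is proportional to G_pos - G_neg."""
    g_pos = np.where(w_binary > 0, G_LRS, G_HRS)
    g_neg = np.where(w_binary > 0, G_HRS, G_LRS)
    return g_pos, g_neg

def differential_vmm(x, g_pos, g_neg):
    """Ideal (noise-free) VMM with differential pairs: the accumulated currents of the
    positive and negative arrays are subtracted to recover signed outputs."""
    return x @ g_pos - x @ g_neg

w_b = np.sign(np.random.default_rng(0).standard_normal((64, 256)))
g_pos, g_neg = map_binary_weights(w_b)
x = np.ones(64)                      # a binarized input vector of normalized read voltages
y = differential_vmm(x, g_pos, g_neg)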
To validate the continual learning performance, we synthesize the 5-split-FashionMNIST and 5-split-MNIST datasets by evenly splitting the FashionMNIST and MNIST datasets into five tasks. Each task contains 12 000 training images and 2000 test images from two categories. During the inference phase, all images are first down-sampled and binarized before being converted into voltage signals that are fed into the RRAM chip through digital-to-analogue converters (DACs). The VMM results are in the form of accumulated currents that are sampled by analogue-to-digital converters (ADCs) and fetched by the general digital processor for the downstream normalization and activation (see Experimental Section).
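The inference-side preprocessing described above might look like the following sketch, where the 8 × 8 target resolution (matching the 64 inputs of the first layer) and the 0.5 binarization threshold are assumptions for illustration.

import numpy as np

def preprocess(image, size=8, threshold=0.5):
    """Down-sample a 28x28 grayscale image by block averaging and binarize it to
    0/1 levels, which are subsequently converted to read voltages by the DACs.
    The target size and threshold here are illustrative assumptions."""
    img = image.astype(float) / 255.0
    bh, bw = img.shape[0] // size, img.shape[1] // size
    img = img[:bh * size, :bw * size]                       # crop to a multiple of the block size
    down = img.reshape(size, bh, size, bw).mean(axis=(1, 3))
    return (down > threshold).astype(float).flatten()       # 64-dimensional binary input vector

x = preprocess(np.random.default_rng(0).integers(0, 256, size=(28, 28)))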
In Situ Fine-Tuning Method
Although resistive IMC reduces the O(N²) computational complexity of VMM down to O(1) by exploiting Kirchhoff's law and Ohm's law,[1] which is crucial for computationally intensive tasks such as neural networks,[18] increasing the hardware parallelism tends to accumulate errors rapidly and eventually degrades hardware performance due to nonideal factors such as device noise and IR drop.[28,29] To reduce the impact of these hardware nonidealities, we design an in situ fine-tuning method that allows the hardware to accelerate the MPCL through massively parallel processing (MPP) without significantly compromising inference accuracy, as shown in Figure 4.
Figure 4. In situ fine-tuning method. The in situ fine-tuning method is designed to optimize hardware performance under parallel processing. Step 1: the equivalent hardware weights are read out from the programmed RRAM array. The Wpara used in hardware VMM operations is obtained by applying read voltages to multiple BLs simultaneously on both the positive and negative memristor differential-pair arrays. Step 2: the normalized weights, which contain nonideal noise, are converted into fixed-point weights by the Int operation through ADC quantization. The Int operation significantly reduces the noise caused by the conductance fluctuations of the RRAM devices. Step 3: the equivalent hardware parallel weights are decomposed into two sparse matrices WL and WR to reproduce the hardware computational flow in the software simulation. By enforcing the Int operation on the VMM results of WL and the input vector x, the simulation results match the hardware implementation. Step 4: the first two layers of the MPCL network are fixed, and the in situ fine-tuning method is used to retrain the last layer. Finally, the retrained weights are remapped to the RRAM array.
On the hardware side, we first map the software weights onto programmed RRAM differential pairs. These hardware weights are processed in parallel in the forward VMM operations by simultaneously applying read voltages to several bit lines (BLs) of the RRAM chip, and the parallelism is thus defined as the number of BLs (rows) driven simultaneously. Owing to RRAM device variation, the modeled equivalent values of these hardware weights follow a quasi-Gaussian distribution, which differs from the pretrained weights and causes deviations between the hardware and software results. Using the built-in quantization of the ADC, we introduce an Int operation during the current summation, converting the fluctuating hardware weights into stable fixed-point weights. Although the Int operation lowers the precision of the weights, it significantly reduces the randomness caused by the conductance fluctuation of the RRAM during computation and also reduces the computational cost.
On the software side, we develop the same computational flow as the hardware. To reconstruct the hardware's Int operation, we propose a matrix decomposition method, which decomposes the equivalent hardware weights into two sparse matrices. The left matrix WL is the equivalent hardware weight matrix divided according to the hardware computational parallelism and is used to simulate the parallel hardware operations. The right matrix WR is a zero-one matrix, which sums the intermediate results after the weight decomposition and restores the output matrix to its original dimension (see Figure S2, Supporting Information), as sketched below. The weight matrix decomposition avoids using an entire row or column during the VMM. Meanwhile, the Int operations inserted after the parallel multiply–accumulation further regulate the VMM results, converting them into the fixed-point type in accordance with the hardware implementation. As the hardware computational flow is precisely reflected by the software one, we obtain matched parallel weights (Wpara) in both software and hardware. The parallel weights are stable enough for the subsequent in situ fine-tuning, where the last layer of the MPCL neural network is further optimized with the remaining layers fixed before being mapped again to the RRAM array.
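The sketch below mimics this computational flow in a simplified form: the equivalent hardware weight matrix is processed in row blocks whose size equals the parallelism (the role played by WL), each block's partial sum is quantized by an Int operation emulating the ADC, and the quantized partial sums are accumulated (the role played by WR). The quantization step lsb and the blockwise loop are illustrative assumptions rather than the exact WL/WR construction.

import numpy as np

def parallel_vmm_with_int(x, w_equiv, parallelism, lsb=0.01):
    """Simulate the parallel hardware VMM with an Int operation on each partial sum.

    x:           input vector (length = number of rows / BLs).
    w_equiv:     equivalent hardware weight matrix (rows x columns).
    parallelism: number of BLs driven simultaneously.
    lsb:         assumed quantization step standing in for the ADC resolution.
    """
    y = np.zeros(w_equiv.shape[1])
    for start in range(0, w_equiv.shape[0], parallelism):
        block = slice(start, start + parallelism)
        partial = x[block] @ w_equiv[block, :]     # one parallel analogue accumulation
        y += np.round(partial / lsb) * lsb         # Int operation (fixed-point partial sum)
    return y

rng = np.random.default_rng(0)
y = parallel_vmm_with_int(rng.integers(0, 2, 64).astype(float),
                          rng.normal(0, 1, (64, 256)), parallelism=16)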
Hardware Implementation of MPCL
To benchmark the continual learning performance of the MPCL and the designed in situ fine-tuning method, we deploy the MPCL on the hybrid analogue–digital hardware system. Figure 5a illustrates the reliability of the binary weights on the RRAM chip. The normalized conductance map of the RRAM array reveals a weight mapping accuracy of 99% thanks to the high yield of the RRAM array. The narrow quasi-Gaussian distributions of the low resistance state (LRS) and high resistance state (HRS) improve the precision of the binary weights. Figure 5b demonstrates the conductance retention over 10 000 reading operations. The majority of the RRAM devices show stable nonvolatile conductance when carrying out IMC. It should be pointed out that the RRAM state before forming is defined as the HRS in this case, as it provides high energy efficiency owing to its low conductance. Figure 5c shows the relationship between the computational parallelism and the average inference accuracy of the hardware-implemented MPCL. The in situ fine-tuning significantly suppresses the error accumulation caused by the parallel hardware operations, achieving 95.3% accuracy on the 5-split-FashionMNIST dataset with a parallelism of 16. Thanks to the improved parallelism, the estimated number of VMM operations in a single image inference decreases dramatically to 3600 operations (OPs), a 17× improvement over the conventional hardware implementation (62 000 OPs), as shown in Figure 5d. Finally, Figure 5e shows the simulated and hardware-measured recognition accuracies of the MPCL on both the 5-split-MNIST and 5-split-FashionMNIST datasets. The average hardware-measured accuracies on the 5-split-MNIST and 5-split-FashionMNIST datasets are 94.9% and 95.3%, respectively, only 2.1% and 2.4% lower than the software baselines of 97.0% and 97.7%.
Figure 5. a) The normalized conductance map (left) and resistance distribution (right) of the RRAM array. The conductance map contains 220 × 256 RRAM differential pairs (56 300 devices on each of the positive and negative arrays) and is selectively formed by applying electric pulses (3 V, 10 μs) according to the predesigned patterns. The resistance of the RRAM array shows two quasi-Gaussian distributions, constituting the low-noise binary weights. b) Retention of the LRS and HRS. The inset shows the LRS fluctuation rate, represented by the maximum conductance deviation (see Experimental Section), which is lower than 15%. c) Hardware implementation accuracies before and after in situ fine-tuning with increasing parallelism. d) The estimated MAC operations with increasing parallelism for single-picture inference. e) Simulated and hardware-measured recognition accuracies on the 5-split-FashionMNIST and 5-split-MNIST datasets.
A comparison between the MPCL results and existing continual learning models is shown in Table 1. In software simulation, the MPCL achieves accuracy comparable to the reported continual learning models while lifting the strict requirement on weight precision, making it friendly to resistive NVMs for IMC. In addition, by taking advantage of the IMC paradigm of the RRAM computing-in-memory chip, the energy consumption of the MAC operations during the inference phase is reduced by ≈200× compared to traditional CMOS digital systems. In the hardware implementation, the MPCL shows an ≈40× improvement in MAC energy compared to the reported RRAM-based IMC continual learning system,[30] where 4-bit weight precision is used for a hybrid network consisting of a convolutional neural network (CNN) and a spiking neural network (SNN).
Table 1 Comparison of this work with recent works
Approach | EWC[12] | GEM[11] | SI[13] | This work | CNN + SNN[30] | This work |
Implementation | Simulation | Simulation | Simulation | Simulation | Hardware | Hardware |
Average accuracy (%)a) | 99.0 | 94.5 | 95.7 | 97.5 | 82.0 | 95.2 |
Class no. | 10 | 10 | 10 | 10 | 3 | 10 |
Inference precision | FP | FP | FP | 1 bit | 4 bits | 1 bit |
Noise tolerance (%)b) | 13 | 15 | 10 | 36 | N/A | N/A |
MAC energy (J)c) | 5.4 × 10−7 | 5.4 × 10−7 | 5.4 × 10−7 | 2.7 × 10−8 | ≈10−7 | 2.3 × 10−9 |
a)The average accuracy is obtained on the 5-split-MNIST under the same network architecture 192-256-256-10;
b)The noise tolerance is calculated when the model's inference accuracy drops to 90%;
c)The MAC energy refers to the energy consumption of the MAC operations for a single picture inference. The simulation MAC energy is estimated based on 45 nm standard CMOS technology, while that of the hardware implementation is estimated based on memristor units.
Conclusion
In this work, we have developed a metaplasticity-inspired MPCL model and experimentally implemented it on a hybrid analogue–digital computing system to solve continual learning problems. The MPCL leverages the different precisions of floating-point and binary weights to balance the plasticity and stability of synapses, thus mitigating catastrophic forgetting. The in situ fine-tuning method further reduces the impact of RRAM device nonidealities on the parallel operation. Using the hybrid analogue–digital hardware system, we achieve average recognition accuracies of 94.9% and 95.3% (software baselines 97.0% and 97.7%) on the 5-split-MNIST and 5-split-FashionMNIST continual learning datasets, respectively. Relying on the IMC paradigm of the RRAM chip, the energy efficiency of the MAC operations during the inference phase is improved by ≈200× compared to traditional CMOS digital systems. The ability to balance synaptic plasticity and stability through mixed-precision weights makes the MPCL an ideal model for continual learning tasks in autonomous edge AI systems.
Experimental Section
The RRAM Chip Fabrication
The RRAM chip with a crossbar structure contains 256 kb RRAM cells (512 rows × 512 columns). Each cell is integrated on the 40 nm standard logic platform between metal 4 (M4) and metal 5 (M5) and comprises a top electrode (TE), a TaOx-based resistive layer, and a bottom electrode (BE). The TE, consisting of 3 nm Ta and 40 nm TiN, is deposited by sputtering in sequence. The resistive layer stack, consisting of 10 nm TaN and 5 nm Ta, is deposited by physical vapor deposition on the BE via, where the Ta is further oxidized in an oxygen atmosphere to form an 8 nm TaOx dielectric layer. The BE via, with a size of 60 nm, is patterned by photolithography and etching; the via is filled with TaN by physical vapor deposition and then polished by chemical mechanical polishing (CMP). After fabrication, the logic BEOL metal is deposited following the standard logic process, with cells in the same column sharing a TE and cells in the same row sharing a BE to form the RRAM array chip. Finally, the chip is annealed at 400 °C for 30 min.
The Hybrid Analogue–Digital Computing System
The hybrid analogue–digital computing platform consists of three parts: the RRAM computing-in-memory chip, the general digital processor (Xilinx ZYNQ XC7Z020 system-on-chip), and the high-speed PCBs. The high-speed PCBs contain an 8-channel digital-to-analogue converter (DAC80508, Texas Instruments, 16-bit resolution) with two 8-bit shift registers (SN74HC595, Texas Instruments), providing 64-way parallel analogue input voltages ranging from 0 to 5 V. During VMM calculations, the input vectors are converted into DC voltage signals and applied to the BLs of the RRAM chip through a 4-channel analogue multiplexer (CD4051B, Texas Instruments). The calculation results, represented by the accumulated currents on the source lines (SLs), are converted into voltage outputs through trans-impedance amplifiers (OPA4322-Q1, Texas Instruments). Finally, the voltage outputs are read by ADCs (ADS8324, Texas Instruments, 14-bit resolution) and sent to the general digital processor for the subsequent calculations.
Details of the Algorithms
Algorithm 1: The MPCL model. The MPCL is a three-layer fully connected feed-forward neural network that uses a binary weight and a floating-point weight as a set of mixed-precision weights. During the training phase, the binary weights W_b used for forward-propagation are obtained by applying the sign function to the corresponding floating-point weights W_f, i.e., W_b = sign(W_f).
After calculating the entropy loss, the MPCL uses the momentum-based Adam optimizer,[31] where ΔW_f denotes the update of the floating-point weight. The asymmetric weight update is determined by the sign of the product of ΔW_f and W_f. If ΔW_f · W_f < 0, the floating-point weight is updated in the direction opposite to its sign, and the maximum change is limited during each update step.[Image Omitted. See PDF]
The update ΔW_f is further regulated by the memory coefficient m to obtain the maximum allowed update ΔW_max for each update step.[Image Omitted. See PDF]
To avoid massive binary weight switching, the floating-point weight is updated using the smaller (in magnitude) of the proposed update ΔW_f and the maximum allowed update ΔW_max,[Image Omitted. See PDF] where η is the learning rate. If ΔW_f · W_f > 0, the floating-point weight updates in the same direction as its sign and is updated freely.[Image Omitted. See PDF]
Finally, the number of remaining update steps is decremented by one to conclude one update process.[Image Omitted. See PDF]
If the model's inference accuracy does not increase within five epochs, training is stopped early to avoid overfitting.
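A generic sketch of this early-stopping criterion is shown below; the callables train_epoch and evaluate, as well as the epoch budget, are assumed placeholders supplied by the surrounding training code.

def train_with_early_stopping(train_epoch, evaluate, max_epochs=100, patience=5):
    """Stop training when the inference accuracy has not improved for `patience`
    consecutive epochs. This is an illustrative wrapper, not the exact training loop."""
    best_acc, stall = 0.0, 0
    for _ in range(max_epochs):
        train_epoch()
        acc = evaluate()
        if acc > best_acc:
            best_acc, stall = acc, 0
        else:
            stall += 1
            if stall >= patience:
                break
    return best_acc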
Algorithm 2: The noise injection. To probe the robustness of the MPCL against the reported floating-point-type models, we simulate the impact of device variation during inference by introducing normally distributed noise δ to the hidden-layer weights W, obtaining the nonideal weights W′ = W + δ. Both W′ and δ follow normal distributions.
The standard deviation of the noise distribution is set by ρ, defined as the conductance noise percentage, which determines the magnitude of the conductance noise during reading operations and ranges from 0% to 100%. After noise injection, all continual learning models are tested on the 5-split-FashionMNIST.
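A possible sketch of this noise injection is given below; the assumption that the noise is zero-mean and that its standard deviation scales as ρ times the weight magnitude is made for illustration and may differ from the exact parameterization used in the paper.

import numpy as np

def inject_conductance_noise(w, rho, seed=0):
    """Return nonideal weights W' = W + delta, where delta is Gaussian noise whose
    standard deviation is rho times the weight magnitude (illustrative assumption).
    rho is the conductance noise percentage expressed as a fraction (0.0-1.0)."""
    rng = np.random.default_rng(seed)
    delta = rng.normal(loc=0.0, scale=rho * np.abs(w))
    return w + delta

w_binary = np.sign(np.random.default_rng(1).standard_normal((64, 256)))
w_noisy = inject_conductance_noise(w_binary, rho=0.10)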
Algorithm 3: The maximum conductance deviation. The maximum conductance deviation is used to quantify the memristor's conductance fluctuation during reading operations and is defined as[Image Omitted. See PDF] where G_max, G_min, and G_avg are the maximum, minimum, and average conductance of the memristor during reading operations.
The Method of Estimating Power Consumption
To estimate the energy consumption of the MAC operations in the inference of a single picture, we calculate the energy cost of the RRAM cells holding the binary weights. During inference, as the computational parallelism of the RRAM chip is smaller than the horizontal dimension of the weight matrices, some of the summation operations need to be carried out on the ARM core. The method of estimating power consumption is given by[Image Omitted. See PDF] where N_mul and N_add are the numbers of multiplication and addition operations, respectively. According to the weight matrices and the computational parallelism, the numbers of multiplication and summation operations are calculated according to[Image Omitted. See PDF][Image Omitted. See PDF] where layer_num is the number of hidden layers. The MAC operations performed on the RRAM chip are carried out by read operations of the memristor cells, and the corresponding energy per multiplication is given by[Image Omitted. See PDF] where V_read and t_read are the amplitude of the reading pulse and the memristor's read-response time, respectively (see Figure S3, Supporting Information), and R_LRS and R_HRS are the average resistances of the LRS and HRS, respectively. All the parameters used for the evaluation of the energy consumption are summarized in Table S1, Supporting Information. Note that a few of the RRAM devices (<1%) are reset to about 135 kΩ as the value of R_HRS during in situ fine-tuning, which has a negligible effect on the energy estimation. The energy of the summation operations performed on the ARM core is calculated based on the 45 nm standard CMOS technology. By taking advantage of the IMC paradigm, the energy consumption of the MAC operations during inference is reduced by ≈200× compared to the 45 nm ARM digital system. As the MAC operations account for the majority of the energy consumption in the forward-propagation, the RRAM computing-in-memory chip significantly improves the energy efficiency of the MPCL model during inference.
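As a back-of-the-envelope illustration of the RRAM-side estimate, the sketch below charges each binary multiplication with one read pulse through both branches of a differential pair; all numerical values (read voltage, read time, resistances) are placeholders rather than the measured parameters listed in Table S1.

def rram_mac_energy(n_mul, v_read=0.2, t_read=50e-9, r_lrs=10e3, r_hrs=135e3):
    """Estimate the energy of n_mul read-based multiplications on RRAM differential
    pairs: each read dissipates V^2 * t / R in the LRS branch and the HRS branch.
    All device parameters here are illustrative placeholders."""
    e_pair = v_read**2 * t_read * (1.0 / r_lrs + 1.0 / r_hrs)
    return n_mul * e_pair

# e.g., a 64x256 binary layer requires 64*256 multiplications for one input vector
print(rram_mac_energy(64 * 256))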
Acknowledgements
This work was supported by the National High Technology Research Development Program under grant no. 2018YFA0701500, the National Natural Science Foundation of China under grant nos. 61874138, 61888102, 61834009, and 62055406, and in part by the Strategic Priority Research Program of the Chinese Academy of Sciences under grant no. XDB44000000. This research was also supported by the Hong Kong Research Grant Council—Early Career Scheme (grant no. 27206321) and the National Natural Science Foundation of China—Excellent Young Scientists Fund (Hong Kong and Macau) (grant no. 62122004). This research was also partially supported by ACCESS—AI Chip Center for Emerging Smart Systems, sponsored by the Innovation and Technology Fund (ITF), Hong Kong SAR.
Conflict of Interest
The authors declare no conflict of interest.
Data Availability Statement
The data that support the findings of this study are available from the corresponding author upon reasonable request.
© 2022. This work is published under http://creativecommons.org/licenses/by/4.0/ (the “License”).
Abstract
Artificial neural networks have achieved remarkable success in the field of artificial intelligence. However, they suffer from catastrophic forgetting when dealing with continual learning problems, i.e., the loss of previously learned knowledge upon learning new information. Although several continual learning algorithms have been proposed, it remains a challenge to implement them efficiently on conventional digital systems due to the physical separation between memory and processing units. Herein, a software–hardware codesigned in-memory computing paradigm is proposed, where a mixed-precision continual learning (MPCL) model is deployed on a hybrid analogue–digital hardware system equipped with a resistance random access memory chip. Software-wise, the MPCL effectively alleviates catastrophic forgetting and circumvents the requirement for high-precision weights. Hardware-wise, the hybrid analogue–digital system takes advantage of the colocation of memory and processing units, greatly improving energy efficiency. By combining the MPCL with an in situ fine-tuning method, high classification accuracies of 94.9% and 95.3% (software baselines 97.0% and 97.7%) are achieved on the 5-split-MNIST and 5-split-FashionMNIST, respectively. The proposed system reduces the energy consumption of the multiply-and-accumulation operations during the inference phase by ≈200 times compared to conventional digital systems. This work paves the way for future autonomous AI systems at the edge.
Details

1 Key Laboratory of Microelectronics Devices and Integrated Technology, Institute of Microelectronics, Chinese Academy of Sciences, Beijing, China; University of Chinese Academy of Sciences, Beijing, China
2 Key Laboratory of Microelectronics Devices and Integrated Technology, Institute of Microelectronics, Chinese Academy of Sciences, Beijing, China
3 Department of Electronics Engineering, Tsinghua University, Beijing, China
4 Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong, Hong Kong