Numerical‐discrete‐scheme‐incorporated recurrent

Full text

Turn on search term navigation

INTRODUCTION

Deep learning has made rapid progress and fulfilled great success in a wide spectrum of applications, such as natural language processing (NLP) [1, 2], speech recognition [3], and computer vision [4, 5]. Behind these accomplishments lies the powerful function approximation capability of deep neural networks. In NLP tasks, modelling sequential data is a challenging problem. Plenty of work has been presented to solve the problem of sequential data modelling. Among them, recurrent neural networks (RNNs) achieve satisfactory performance due to their recurrent mechanism [6, 7]. Despite the accomplishment obtained by the RNNs, the training of the RNN is rather difficult. Back-propagation through time (BPTT) is needed to train the model. When the input sequence is long, the multiplication term produced by the chain derivation mechanism is numerically unstable, which makes RNN suffer from the gradient explosion and gradient vanishing problem [8]. In Ref. [9], linear time-delayed connections were added to RNNs to alleviate the gradient vanishing problem during training. However, it is only examined for some small-scale tasks. The advent of the long short-term memory (LSTM) conspicuously enhanced the performance of the recurrent architecture and solved the gradient vanishing problem from the perspective of structural design [10]. LSTM designs a fairly elaborated structure to create a gating mechanism. This kind of gating mechanism uses learnable gates to implement a sophisticated feedback approach, which makes it easier for gradient information to flow back. Nevertheless, due to its structural complexity, training the LSTM is time-consuming and it is hard for LSTM to scale for large tasks. Gated feedback connections were used in Ref. [11] between layers of stacked RNNs to adaptively adjust the connection patterns between the adjoint layers. To capture long-term contextual semantic information in sequential data, more memory units were used in Ref. [12] to keep track of previous hidden states, where the weighted connections were directly linked to multiple preceding hidden states. With feedback paths provided by these connections, residual signals can propagate to the farther preceding hidden states to better model the long-term memory. In addition, these connections give more feedback paths to a model to smooth its update during training. In Ref. [13], the authors discussed the high-order connections in the Markov property framework. With the reduction in parameters by projecting hidden-state vectors to a low dimension and weighted connections on hidden states, the model presented achieves noticeable gains both in accuracy and efficiency when applied to acoustic modelling. However, its wiring pattern does not change in essence. It aggregates historical information by adding weighted connections. By combining high- and low-order LSTM with a pruning technique, Zhang et al. [14] introduced a recurrent model called MO-BILSTM, which achieved promising results evaluated in two named entity recognition datasets. A generalisation of LSTM called multiple-history LSTM was investigated in Ref. [15], where different LSTM units were connected with high-order feedback and maintained historical information at different time steps.

From the topology of RNNs, it allows connections among the preceding hidden units. Through these connections, RNNs maintain a special mechanism of recurrent feedback that summarises the past sequence of the inputs, enabling themselves to capture correlations between the temporally distant events in the data. Previous research has shown that the recurrent mechanism of the RNN model has the same form as the explicit forward Euler scheme. If unfold in time, simple RNN can be regarded as a kind of discrete forward Euler scheme from the perspective of numerical analysis. Inspired by the connection between RNNs and ordinary differential equations (ODEs), this study hammers at designing a Taylor-type recurrent neural network (T-RNN) guided by a Taylor-type discrete scheme. Therefore, considering the shortcomings of RNN, such as the difficulty of obtaining context information in a long distance, it can be explained by the fact that the forward Euler scheme only uses first-order derivative information for estimation, which causes a large truncation error. To improve the simple RNN from the perspective of numerical analysis, a Taylor-type discrete scheme with a tiny truncation error deduced from the Taylor expansion is presented in this paper. In addition, the Taylor-type discrete scheme is used as an orientation to construct the T-RNN to further explore the connection between neural networks and ODEs and to improve the performance of the RNNs. As shown in Figure 1, extensive experiments are conducted to evaluate the performance of T-RNN. Experimental results indicate that T-RNN has higher accuracy and can capture longer contextual information compared to the existing RNN model. The main contributions in this paper are summarised as follows:

Relevant analysis is given to show the connection between the numerical formulas and the neural network structure. Detailed derivation is given to introduce the Taylor-type numerical formulas, and on this base, the T-RNN is designed.
Sufficient experiments are conducted on benchmark datasets to evaluate the T-RNN. Experimental results demonstrate its superiority when compared to simple RNN.
The performance gains obtained by the T-RNN echoes the connection between the neural networks and the numerical discretisation schemes and indicates that it is promising to improve the neural networks from the perspective of numerical analysis.

[IMAGE OMITTED. SEE PDF]

RELATED WORK

The performance of the neural network especially relies on its topological structure, which is an important issue in deep learning. The past decade has witnessed the emergence of many neural network architectures. As for a convolution neural network (CNN), it is typically a concatenation of many non-linear layers and ends with a fully connected layer. An obvious experience of CNN architectures is that the number of layers is rising, ranging from the AlexNet [16] with five convolutional layers, the VGG [17] network with 19 layers to the GoogleNet [18] with 22 layers. Nevertheless, as the layer goes deep, it suffers from gradient explosion, gradient vanishing problems, and model degradation [8]. The advent of ResNet sheds light on training very deep neural networks by introducing the mechanism of identity mapping, which can maintain the stability and correlation of gradient during back-propagation [19]. From then on, the idea of introducing the skipping connection between layers springs up. In Ref. [20], the DenseNet was presented by connecting each layer to every other layer in a forward fashion. For each layer, its input is the output of all preceding layers. Similarly, the cliqueNet constructed the layers as a loop, where each layer was both the input and output of any other layer in a block [21]. These kinds of elaborate wiring patterns make full use of the feature information between layers. Analogously, authors in [12] tried to connect preceding hidden states with weighted links to gather more historical information. In Ref. [13], a neural network was presented under the Markov theory framework, which improved the neural network by introducing the skip connection in essence. In addition to adding skipping connections, Zagoruyko and Komodakis [22] increased the number of channels to explore what the width of a network influences. In addition, authors in [23] used the LSTM to construct a controller and introduced a neural network by resorting to reinforcement learning. In Ref. [24], the random graph was used to design a randomly wired neural network. Despite the abundance of neural architecture, the design of neural networks still lacks guidance. It is time-consuming and artificially involved to verify the performance of a neural network model. In Refs. [25, 26], the connection between neural networks and dynamic systems was revealed and authors indicated that deep neural networks could be deemed as different discretizations of ODEs. Chen et al. [27] provided an original perspective of neural network and brought the neural ODEs to view. In Ref. [28], authors indicated that several existing networks, such as ResNet [19] and RevNet [29], were related to correspondingly numerical discrete schemes and put forward a neural network architecture oriented by the linear multistep method solving ODEs, which further verified the relationship between numerical discrete formulations and multi-layer neural networks. Luo et al. [30] used the Runge-Kutta discrete schemes as a principle to guide the stack of layers to design a neural network and brought remarkable performance improvement. In Ref. [31], partial differential equations were used to design the CNNs and related numerical techniques were also used to solve and optimise the neural network. As for the T-RNN, by reconsidering the RNN from the perspective of numerical analysis and deeming it as a discrete form of the forward Euler scheme, the T-RNN is proposed based on a higher-order Taylor-type discrete scheme deduced from the Taylor expansion.

At present, with its unique position encoding to capture long-range word-order information, transformer has achieved the state-of-art records in many fields, such as language model tasks [32] and computer vision [33]. Despite its success in many tasks, for data with strict timing information requirements, there are still some disadvantages to using transform for time series data [34]. However, RNNs have innate advantages for time series data processing with its continuous-time hidden-state mechanisms [35]. In addition, the structure of transform contains six layers of encoder and six layers of decoder [36]. The complicated structure makes it require a large amount of data for training. When input sequence length increases, the consumption of memory and computation is massive if its parallel computation is not supported. Therefore, in some application scenarios with limited computation or few data samples, it is still necessary and appliable to resort to RNNs.

PRELIMINARIES

In this section, the deduction of the Taylor-type discrete scheme is demonstrated in detail. With the connections between RNN and the discrete forward Euler scheme explained, the T-RNN is constructed guided by the Taylor-type discrete scheme.

Deduction

Due to its recurrent structure allowing connections among hidden units relevant to a time delay, the RNNs have achieved the prominent performance on a large number of tasks [37]. For a typical simple RNN, at time step t, given an input x _t and a hidden state h _t−1 generated in the previous time step, the update process of RNN is demonstrated as 1 ${h}_{t}={\Phi }\left({W}_{xh}{x}_{t}+{W}_{hh}{h}_{t-1}+{b}_{xh}+{b}_{hh}\right),$

The symbol Φ(⋅) represents the non-linear activation function usually being rectified linear (ReLU) or hyperbolic tangent function (Tanh), which adds non-linear factors and improves the expressiveness of the neural network. W _xh is the weighted matrix in the input connection, and W _hh is the weighted matrix of the hidden state. Typically, the RNN accepts the input and the hidden state of the previous time step as the input of the activation function, and the output of the activation function becomes the hidden state at the current time. The corresponding interpretation is demonstrated in Figure 2.

[IMAGE OMITTED. SEE PDF]

With the Taylor expansion, the following rules can be obtained: 2 $\begin{align*}\hfill {\Phi }\left({t}_{3}\right)& ={\Phi }\left({t}_{2}+\tau \right)\hfill \\ \hfill & ={\Phi }\left({t}_{2}\right)+\tau \dot{{\Phi }}\left({t}_{2}\right)+\frac{{\tau }^{2}}{2!}{{\Phi }}^{(2)}\left({t}_{2}\right)+\frac{{\tau }^{3}}{3!}{{\Phi }}^{(3)}\left({h}_{1}\right),\hfill \end{align*}$ where h ₁ ∈ (t ₂, t ₃). Subsequently, we obtain the term 3 $\dot{{\Phi }}\left({t}_{2}\right)=\frac{{\Phi }\left({t}_{3}\right)-{\Phi }\left({t}_{2}\right)}{\tau }-\frac{\tau }{2!}{{\Phi }}^{(2)}\left({t}_{2}\right)-\frac{{\tau }^{2}}{3!}{{\Phi }}^{(3)}\left({h}_{1}\right).$

Likewise, by applying one- and two-step backward expansion, we have 4 $\begin{align*}\hfill {\Phi }\left({t}_{1}\right)& ={\Phi }\left({t}_{2}-\tau \right)\hfill \\ \hfill & ={\Phi }\left({t}_{2}\right)-\tau \dot{{\Phi }}\left({t}_{2}\right)+\frac{{\tau }^{2}}{2!}{{\Phi }}^{(2)}\left({t}_{2}\right)-\frac{{\tau }^{3}}{3!}{{\Phi }}^{(3)}\left({h}_{2}\right),\hfill \end{align*}$ and 5 $\begin{align*}\hfill {\Phi }\left({t}_{0}\right)& ={\Phi }\left({t}_{2}-2\tau \right)\hfill \\ \hfill & ={\Phi }\left({t}_{2}\right)-2\tau \dot{{\Phi }}\left({t}_{2}\right)+\frac{4{\tau }^{2}}{2!}{{\Phi }}^{(2)}\left({t}_{2}\right)-\frac{8{\tau }^{3}}{3!}{{\Phi }}^{(3)}\left({h}_{3}\right),\hfill \end{align*}$ where h ₂ and h ₃ lie between (t ₁, t ₂) and (t ₀, t ₂), respectively. The above equations can be paraphrased as follows: 6 $\dot{{\Phi }}\left({t}_{2}\right)=\frac{{\Phi }\left({t}_{2}\right)-{\Phi }\left({t}_{1}\right)}{\tau }+\frac{\tau }{2!}{{\Phi }}^{(2)}\left({t}_{2}\right)-\frac{{\tau }^{2}}{3!}{{\Phi }}^{(3)}\left({h}_{2}\right),$ and 7 $\dot{{\Phi }}\left({t}_{2}\right)=\frac{{\Phi }\left({t}_{2}\right)-{\Phi }\left({t}_{0}\right)}{2\tau }+\tau {{\Phi }}^{(2)}\left({t}_{2}\right)-\frac{2{\tau }^{2}}{3}{{\Phi }}^{(3)}\left({h}_{3}\right).$

Then, let Equation (5) add Equation (7), then minus Equation (6), and we have 8 $\begin{align*}\hfill \dot{{\Phi }}\left({t}_{2}\right)=\frac{2{\Phi }\left({t}_{3}\right)-3{\Phi }\left({t}_{2}\right)+2{\Phi }\left({t}_{1}\right)-{\Phi }\left({t}_{0}\right)}{2\tau }\\ \hfill +{\tau }^{2}\left(-\frac{1}{3!}{{\Phi }}^{(3)}\left({c}_{1}\right)-\frac{2}{3}{{\Phi }}^{(3)}\left({c}_{3}\right)-\frac{1}{3!}{{\Phi }}^{(3)}\left({c}_{2}\right)\right).\end{align*}$

As the term $\left(-\frac{1}{3!}{{\Phi }}^{(3)}\left({c}_{1}\right)-\frac{2}{3}{{\Phi }}^{(3)}\left({c}_{3}\right)-\frac{1}{3!}{{\Phi }}^{(3)}\left({c}_{2}\right)\right)$ is independent to t, we can paraphrase Equation (8) as 9 $\dot{{\Phi }}\left({t}_{2}\right)=\frac{2{\Phi }\left({t}_{3}\right)-3{\Phi }\left({t}_{2}\right)+2{\Phi }\left({t}_{1}\right)-{\Phi }\left({t}_{0}\right)}{2\tau }+O\left({\tau }^{2}\right).$

With the item O(τ ²) discarded, we have 10 $\dot{{\Phi }}\left({t}_{2}\right)\approx \frac{2{\Phi }\left({t}_{3}\right)-3{\Phi }\left({t}_{2}\right)+2{\Phi }\left({t}_{1}\right)-{\Phi }\left({t}_{0}\right)}{2\tau }.$

Note that Equation (10) is the Taylor-type 1-step-ahead numerical differential scheme because it has the term Φ(t ₃), which is one step ahead of the Φ(t ₂). By setting the interval τ to be 1 and moving item Φ(t ₃) to the left, we obtain 11 ${\Phi }\left({t}_{3}\right)\doteq \frac{3}{2}{\Phi }\left({t}_{2}\right)-{\Phi }\left({t}_{1}\right)+\frac{1}{2}{\Phi }\left({t}_{0}\right)+\dot{{\Phi }}\left({t}_{2}\right),$ where $\dot{=}$ denotes the computational assignment operation. The different time points, t ₁, t ₂, and t ₃, can be discretized into different and adjacent layers in neural networks, and Equation (11) can be rewritten as 12 ${\Phi }\left({t}_{k+2}\right)\doteq \frac{3}{2}{\Phi }\left({t}_{k+1}\right)-{\Phi }\left({t}_{k}\right)+\frac{1}{2}{\Phi }\left({t}_{k-1}\right)+\dot{{\Phi }}\left({t}_{k+1}\right).$

Now, Equation (12) is the Taylor-type discrete scheme, which uses the information of the three historical time steps and the gradient information of the previous moment to estimate the value of the current step. It is worth pointing out that the Taylor-type discrete scheme has longer-term dependencies upon historical data compared to the discretisation of the forward Euler method. Mathematically, the truncation error of the Taylor-type discrete scheme is O(τ ³), which has higher precision.

T-RNN with a Taylor-type discrete scheme

In fact, we paraphrase Equation (1) in a general form, and Equation (13) is obtained, 13 ${y}_{t}=f\left({y}_{t-1},{x}_{t},t\right).$

Traditionally, for an ODE $\dot{x}(t)=f(x(t),t)$ , the forward Euler scheme is a commonly used numerical solution whose general form is presented as 14 ${x}_{t+h}={x}_{t}+hf\left({x}_{t},t\right),$ where h denotes the step size. It can be found that Equation (14) is in compliance with Equation (13), which indicates the connections between the RNN and the forward Euler scheme. The same observations are also found in the field of CNN, where ResNet [19], RevNet [29], and LM-ResNet [28] all can be connected to different numerically discrete schemes. Inspired by this connection, this paper hammers at constructing a recurrent network structure based on the presented Taylor-type discrete scheme. This part begins with demonstrating the meaning of Equation (12) and explaining the connections between T-RNN and it. Actually, Equation (12) predicts the function value of the current data point based on the linear combination of three historical data points and the derivative of the previous step. It has long-term dependencies upon historical data because it needs three pieces of information of the past. When modelling the sequential data, it is crucial to capture the contextual semantic information. Inspired by this idea, the T-RNN is constructed guided by the Taylor-type discrete scheme.

Specifically, three neural units are used to represent the three historical terms of the Taylor-type scheme. At each moment, three neural units calculate the relationship between the current input and three hidden states of history. We use the activation value of the Tanh function at the previous moment to replace the derivative term in the Taylor discrete scheme. The output result is calculated in the form of the Taylor discrete scheme and is the hidden state at the current moment. More specifically, at time t + 1, if the bias term is omitted, the update process of T-RNN can be presented from Equations (15) to (18). 15 ${Q}_{1}={\Phi }\left({W}_{xh}^{t-2}{x}_{t}+{W}_{hh}^{t-2}{h}_{t-2}\right),$ 16 ${Q}_{2}={\Phi }\left({W}_{xh}^{t-1}{x}_{t}+{W}_{hh}^{t-1}{h}_{t-1}\right),$ 17 ${Q}_{3}={\Phi }\left({W}_{xh}^{t}{x}_{t}+{W}_{hh}^{t}{h}_{t}\right),$ 18 ${h}_{t+1}=\frac{3}{2}{Q}_{3}-{Q}_{2}+\frac{1}{2}{Q}_{1}+\mathrm{tanh}\left({Q}_{3}\right).$

From Equations (15) to (18), the three terms Q ₁, Q ₂, and Q ₃ represent three correspondingly historical data in the formula and demonstrate the longer-term context in the neural network model. The terms ${W}_{xh}^{t-2},{W}_{xh}^{t-1},{W}_{xh}^{t}$ , ${W}_{hh}^{t-2},{W}_{hh}^{t-1},{W}_{hh}^{t2}$ are correspondingly weighted matrix and the new hidden state at t + 1 is denoted as h _t+1. The corresponding interpretation is demonstrated in Figure 3.

[IMAGE OMITTED. SEE PDF]

DATASETS AND EXPERIMENTS

In this section, different experiments are conducted to evaluate the performance of T-RNN. The corresponding results are also manifested.

Sentiment analysis

The target of sentiment analysis is to analyse subjective texts with emotional colours to determine the views, preferences, and emotional tendencies of the text. Our model is evaluated on sentiment analysis tasks based on Internet Movie Database (IMDB). The IMDB dataset contains 50,000 reviews, which are divided into two equal parts being a training set and a test set. For each part, positive and negative reviews each account for half. In our experiments, we set the learning rate to 0.1 and use the mini-batch stochastic gradient descent (SGD) algorithm to train the models. In addition, the batch size is fixed to 64 and the training epoch is set to 550. A hard clipping is set to 1.0 to avoid gradient explosion during training. The vocabulary is composed of the top 25,000 words with the highest frequency. It is worth mentioning that the text sequence length of the IMDB dataset varies a lot, we use the zero padding technique to fill the input sequence to a fixed length. To reduce the impact of the zero padding, we set the sequence length to 16. Moreover, the input size is set to 100 and the hidden size is set to 256 in each hidden layer. We use the dropout regularisation [38] in all experiments with the dropout coefficient being 0.5. In addition, we use the Glove representation technique to initiate the input vector to improve the training efficiency [39].

The corresponding results are demonstrated in Figure 4, where we can find that T-RNN outperforms simple RNN on test accuracy. As shown in Figures 5 and 6, we can find that T-RNN has a slow rate in the initial period and suffers from oscillation in the early training process. However, the T-RNN reaches a higher accuracy eventually. We discuss the reasons for this phenomenon in the discussion section. In addition, after 100 epochs, the test loss of RNN rises, which means that the RNN occurs overfitting, but T-RNN avoids this situation.

[IMAGE OMITTED. SEE PDF]

Statistic language model

Traditionally, a statistical language model is used to describe the probability distribution over different grammatical units of words, sentences, and even the entire document in natural languages. In language modelling tasks, it is quite important to take advantage of the long-term dependency of natural language. We evaluate the proposed architecture on the Penn Treebank (PTB) and Text8. It is worth mentioning that considering the limitation of computing resources, we have tailored the Text8 corpus and the corresponding details are shown in Table 1. In our experiments, all neural networks are trained by the SGD algorithm. In order to circumvent the gradient explosion during the BPTT, a simple strategy that sets a hard-clipping to 1.0 is adopted during training. When training the model, we set the initial learning rate to 0.5 and halve the learning rate every 100 epochs. The vocabulary size of PTB is limited to 10k and that of Text8 is set to 50k. To improve the generalisation of the model, the dropout regularisation is adopted and the dropout coefficient is set to 0.5 [38]. We use 500 nodes in each hidden layer and the input size to 300 for the PTB and Text8 data set. To improve the training efficiency, we use Glove representation technique [39]. In our experiment, we set different input sequence lengths ranging from 64, 128 to 256 on PTB. Similarly, we set sequence lengths ranging from 128, 200 to 256 on Text8.

TABLE 1 Details about PTB and Text8 datsets

Corpus	Train size	Test size
PTB	4.9M	0.44M
Text8	20M	1.2M

The corresponding experimental results on different datasets are demonstrated in Table 2, where we can find that the T-RNN can achieve lower test perplexity compared to the RNN on two datasets, which means the superiority of T-RNN compared to RNN.

TABLE 2 Results about different RNNs models on PTB and Text8 datasets

Models	PTB	Text8	Criteria
64	128	256	64	200	256
LSTM	94.9	101.5	115.9	195.9	215.7	226.1	Test PPL
GRU	96.4	96.4	107.5	195.9	204.0	211.5	Test PPL
RNN	110.0	110.1	117.4	256.3	251.8	255.6	Test PPL
T-RNN	108.4	106.2	110.5	238.3	230.3	232.3	Test PPL

Text classification

Text classification is an important part in text process, which has abundant applications, such as Garbage Filtering, News classification, and Part-Of-Speech Tagging. In this part, we evaluate the proposed T-RNN and RNNs models on text classification tasks. R8 and R52 datasets are chosen to train the models, which are all the subset of Reuters-21578 datasets. Their details are shown in Table 3. As for the experimental setting, we set the hard-clipping to 1.0, the learning rate to 0.1, and the back-propagation step to 64 for two datasets. When training on the R8 dataset, the input size is 100, the hidden size is fixed to 256 and the batch size is 64, and the training epoch is 800. As for the R52 dataset, the input size is 300, the hidden size is fixed to 500, and the batch size is 32, and the training epoch is 1000. In addition, we use the hidden state of the last step as the result of model to predict the final result. We choose the highest test accuracy as the final result and the complete results are presented in Table 4. We choose accuracy, loss value, and F1 score as the criterion to demonstrate the degree of performance gains, Different RNN models, such as bi-directional recurrent neural network (BIRNN) and bi-directional gated recurrent unit (BIGRU), are used as comparisons to evaluate the performance of T-RNN.

TABLE 3 Details about R8 and R52 datasets

Corpus	Total size	Train item	Test item	Class
R8	4.5M	5485	2189	8
R52	5.5M	6097	3003	52

TABLE 4 Comparison of different RNNs models for three evaluation metrics under two datasets (Acc: accuracy; loss: test loss; F1: F1 score)

Models	R8 dataset	R52 dataset
Acc	Loss	F1	Acc	Loss	F1
BIRNN	0.875 ± 0.002	0.016 ± 0.001	0.870 ± 0.002	0.862 ± 0.003	0.034 ± 0.002	0.852 ± 0.003
BILSTM	0.920 ± 0.006	0.006 ± 0.001	0.931 ± 0.006	0.899 ± 0.005	0.020 ± 0.001	0.899 ± 0.005
BIGRU	0.927 ± 0.006	0.006 ± 0.001	0.927 ± 0.006	0.891 ± 0.004	0.025 ± 0.003	0.886 ± 0.004
LSTM	0.913 ± 0.007	0.011 ± 0.003	0.915 ± 0.007	0.859 ± 0.007	0.036 ± 0.003	0.857 ± 0.008
GRU	0.911 ± 0.009	0.012 ± 0.000	0.912 ± 0.010	0.842 ± 0.004	0.044 ± 0.002	0.836 ± 0.005
RNN	0.765 ± 0.005	0.019 ± 0.001	0.750 ± 0.004	0.664 ± 0.004	0.059 ± 0.010	0.633 ± 0.010
T-RNN	0.816 ± 0.013	0.015 ± 0.003	0.809 ± 0.012	0.682 ± 0.009	0.049 ± 0.004	0.660 ± 0.007

From Table 4, we find that the accuracy of T-RNN on the R8 dataset is 81.6% and 5.1% higher than that of RNN, and the accuracy on the R52 dataset is 1.8% higher. There are also corresponding improvements in test loss and F1 score. Although the performance of T-RNN is lower than that of LSTM or GRU, it is worth pointing out that the performance improvement of T-RNN compared to RNN demonstrates the promising feasibility to improve the simple RNN model from a numerical method perspective, which provides a reference for improving the performance of neural network models.

DISCUSSION

When conducting experiments on deep learning tasks, we observe some consistent experimental phenomena on three experimental tasks, which are shown in Figures 5–10. In the experiment of sentiment analysis, we conclude from Figures 5 and 6 that, at the beginning of the training, the test accuracy of T-RNN fluctuates and its convergence rate is slow, but eventually T-RNN reaches a higher accuracy in the end. Similarly, when evaluated on statistical language model tasks, as shown in Figures 9 and 10, for the same input sequence length, T-RNN reaches the test perplexity that lags behind RNN in the early stage of training. However, after a certain epoch, the performance of T-RNN exceeds RNN. For the fixed epoch, as the input sequence increases, the performance gap between models becomes larger, which is demonstrated by the increasing margin between two curves in Figure 7. The results on Text8 also have the same tendency and the corresponding difference values are 18, 21, 23 ranging from (a), (b) to (c) in Figure 8. We explain this phenomenon by resorting to numerical experiments.

[IMAGE OMITTED. SEE PDF]

Numerical experiment

Nonlinear function optimisation problem is fairly common in many scientific problems. Numerical methods are widely used to solve them. For example, a non-linear function optimisation problem is presented as 19 $\begin{align*}\hfill \underset{f\left({\zeta }_{i}\right)\in {\mathbb{R}}^{4}}{\min \,}& {\digamma}\left(\,f\left({\zeta }_{i}\right),\left.{\zeta }_{i}\right)\right)\hfill \\ \hfill & =\left(\,{f}_{1}\left({\zeta }_{i}\right)+\sin \left({\zeta }_{i}\right)\right){f}_{3}\left({\zeta }_{i}\right)\hfill \\ \hfill & \quad +0.1\left({\zeta }_{i}-1\right){f}_{3}\left({\zeta }_{i}\right){f}_{4}\left({\zeta }_{i}\right)\hfill \\ \hfill & \quad -\left(\,{f}_{1}\left({\zeta }_{i}\right)+\log \left(0.1{\zeta }_{i}+1\right)\right)\left(\,{f}_{2}\left({\zeta }_{i}\right)+\sin \left({\zeta }_{i}\right)\right)\hfill \\ \hfill & \quad +{\left(\,{f}_{3}\left({\zeta }_{i}\right)-\mathrm{exp}\left(-{\zeta }_{i}\right)\right)}^{2}+{\left(\,{f}_{4}\left({\zeta }_{i}\right)+\mathrm{exp}\left(-{\zeta }_{i}\right)\right)}^{2}\hfill \\ \hfill & \quad +{\left(\,{f}_{1}\left({\zeta }_{i}\right)+{\zeta }_{i}\right)}^{2}+{\left(\,{f}_{2}\left({\zeta }_{i}\right)+{\zeta }_{i}\right)}^{2}.\hfill \end{align*}$

We use the forward Euler scheme and the Taylor-type discrete scheme to solve this non-linear optimisation problem. In our experiments, f(ζ _i) = [f ₁(ζ _i), f ₂(ζ _i), f ₃(ζ _i), f ₄(ζ _i)]. We use the L2-norm of δ(ζ _i) = ∂ϝ(f(ζ _i), ζ _i)/∂f(ζ _t) as the residual error and fix calculation interval to [0, 10]. Different sampling intervals h are set to analyse the convergence performance of the two discrete schemes. The corresponding results are demonstrated in Figure 11, we conclude that compared with the forward Euler scheme, the Taylor-type discrete scheme achieves higher accuracy in the end, which shows its superiority. In addition, at the initial stage of the iteration, the Taylor-type discrete scheme suffers from fluctuations. The historical term can be understood as the time delay that affects the stability of the dynamic system. The Taylor-type discrete scheme needs three previous pieces of information to estimate the value of the next step, so it is more prone to instability at the beginning of the iteration. Therefore, the Taylor-type scheme requires more iterations to be stable and achieve higher accuracy. This explains the experimental phenomenon in Figures 5 and 6. Similarly, in the statistical language model tasks, the perplexity of T-RNN lags behind the RNN, but it eventually performs better. When the number of the epoch is fixed, the long input sequence means a large number of iterations, which makes the performance gap between T-RNN and RNN wider. Our results in the numerical experiment are consistent with the results of T-RNN in different tasks. The Taylor-type scheme suffers from vibration in the beginning but reaches stable and results in less error at the end of iteration.

[IMAGE OMITTED. SEE PDF]

Taking the time cost and accuracy into consideration, we can conclude from Figure 12 that training of T-RNN is more time-consuming than RNN, which attributes to the three history items of T-RNN. The three items need more time to learn the long-distance contextual information, which can be used to memorise longer contextual information. It is worth pointing out that T-RNN is suitable for processing sequence datasets with small quantity such as text or voice data as RNN. Compared with RNN, T-RNN can utilise longer-distance context information, which results in its superiority.

[IMAGE OMITTED. SEE PDF]

CONCLUSION

In this paper, spurred by the connection between neural networks and discretizations of ODEs, we have proposed the T-RNN guided by a Taylor-type discrete scheme deduced from the Taylor expansion. Systematic experiments have been conducted to testify the performance of the proposed model. The noticeable performance gains upon many tasks have indicated that it is feasible to design effective and powerful neural networks by following certain discrete schemes. The relations between neural networks and the discrete numerical scheme also manifest that plenty of mathematical tools from optimal control and dynamic systems can be used to design optimisation algorithms to train the neural network. In addition, the research about robustness and generalisation of the corresponding neural networks is also worthy of diving into from the perspective of the numerical analysis in the future.

ACKNOWLEDGEMENTS

This work was supported in part by the National Natural Science Foundation of China under Grant 62176109, in part by the Tibetan Information Processing and Machine Translation Key Laboratory of Qinghai Province under Grant 2021-Z-003, in part by the Natural Science Foundation of Gansu Province under Grant 21JR7RA531 and Grant 22JR5RA487, in part by the Fundamental Research Funds for the Central Universities under Grant lzujbky-2022-23, in part by the CAAI-Huawei MindSpore Open Fund under Grant CAAIXSJLJJ-2022-020A, and in part by the Supercomputing Center of Lanzhou University, in part by Sichuan Science and Technology Program No. 2022nsfsc0916.

CONFLICT OF INTEREST

The author declares that there is no conflict of interest that could be perceived as prejudicing the impartiality of the research reported.

DATA AVAILABILITY STATEMENT

The data that support the findings of this study are available from the corresponding author upon reasonable request.

References

Zhou, B. , et al.: Cross‐domain sequence labelling using language modelling and parameter generating. CAAI Trans. Intell. Technol. 7(4), 710–720 (2022). [DOI: https://dx.doi.org/10.1049/cit2.12107]

Pachitariu, M. , Sahani, M. : Regularization and nonlinearities for neural language models: when are they needed? arXiv preprint arXiv:1301.5650 (2013)

Pennington, J. , Socher, R. , Manning, C.D. : Glove: global vectors for word representation. In: Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing, pp. 1532–1543 (2014)

Word count: 5542

Show less

© 2023. This work is published under http://creativecommons.org/licenses/by/4.0/ (the "License"). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

A variety of neural networks have been presented to deal with issues in deep learning in the last decades. Despite the prominent success achieved by the neural network, it still lacks theoretical guidance to design an efficient neural network model, and verifying the performance of a model needs excessive resources. Previous research studies have demonstrated that many existing models can be regarded as different numerical discretizations of differential equations. This connection sheds light on designing an effective recurrent neural network (RNN) by resorting to numerical analysis. Simple RNN is regarded as a discretisation of the forward Euler scheme. Considering the limited solution accuracy of the forward Euler methods, a Taylor‐type discrete scheme is presented with lower truncation error and a Taylor‐type RNN (T‐RNN) is designed with its guidance. Extensive experiments are conducted to evaluate its performance on statistical language models and emotion analysis tasks. The noticeable gains obtained by T‐RNN present its superiority and the feasibility of designing the neural network model using numerical methods.

Details

Title

Numerical‐discrete‐scheme‐incorporated recurrent neural network for tasks in natural language processing

Author

Liu, Mei¹; Luo, Wendi¹; Cai, Zangtai¹; Du, Xiujuan¹; Zhang, Jiliang²; Li, Shuai¹

¹ The State Key Laboratory of Tibetan Intelligent Information Processing and Application, Qinghai Normal University, Xining, China
² Department of Electronic and Electrical Engineering, The University of Sheffield, Sheffield, UK

Pages

1415-1424

Section

REGULAR ARTICLES

Publication year

2023

Publication date

Dec 1, 2023

Publisher

John Wiley & Sons, Inc.

e-ISSN

24682322

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1049/cit2.12172

ProQuest document ID

3091950219

Numerical‐discrete‐scheme‐incorporated recurrent neural network for tasks in natural language processing

Jump to:

Full text

Abstract

Details

Suggested sources