1. Introduction
Several machine learning techniques are posed as optimization problems and owe their success to gradient-based methods. The success is twofold: first, the optimization task itself, and second, the widespread adoption by the AI community. For example, in machine learning and specifically in neural networks, the multilayer perceptron learning technique defines a training error surface that depends on synaptic weights as free parameters that can be optimized with the backpropagation algorithm, which is a gradient-descent-based algorithm. It finds the optimal parameters to minimize the training error. The success lies not only in solving the optimization problem by minimizing the training error but also in maximizing the generalization capacity on a test dataset, avoiding overfitting, which has led to wide use of multilayer perceptrons in various applications by the artificial intelligence community [1].
Recently, with the boom in research and technological development of machine learning, the need to propose improved optimizers has become more acute, leading to the search for new gradient-based optimizers. In this respect, the authors of [2] performed a comparison of 15 optimizers chosen from a large list of 144 optimizers and schedulers, showing the variety of techniques that continues to evolve and grow. Of course, they include the fundamental SGD and Adam optimizers: the former because it is the cornerstone of all gradient-based techniques [3], and the latter because it calculates adaptive learning rates and is considered state of the art in deep learning [4].
Additionally, a survey of optimization algorithms for neural networks is presented in [5], where modifications to basic optimization algorithms are studied. The paper [6] presents the latest contributions for deep learning based on stochastic gradient-descent methods and a summary of applications of network architectures together with the methods used for specific purposes.
A brief summary of relevant variants of SGD, Adam, Adagrad, and Adadelta is also presented in [7], along with a review of their parameter-update formulas that reveals the combination of concepts, including momentum, velocity, learning rate adaptation, parameter normalization, and gradient-memory.
Although the majority of optimizers consider these concepts to enhance training and the capacity of generalization, in this work the ancient but powerful concept of fractional derivatives is applied to several gradient-based optimizers available in PyTorch [8,9]. In this way, several fractional versions of optimizers have been implemented for PyTorch, and they are presented as generalizations of the first-order derivative. In other words, the fractional derivative of order $\nu$ (with $0 < \nu < 2$) includes the classical first-order gradient when $\nu = 1$. From this point of view, it provides an additional degree of freedom among the hyperparameters that allows us to exploit the properties and advantages of fractional derivatives, including the effect of non-locality, which gathers information from the neighborhood of the derivation point by applying integro-differential operators [10,11].
Certainly, the application of fractional derivatives is not recent, and it can be verified in previous works in different areas, including linear viscoelasticity [12], partial differential equations [13], signal processing [14], and image processing [15], among others.
With respect to neural networks, there is also evidence of applications of fractional derivatives. For example, in [16], Fractional Physics-Informed Neural Networks are developed, employing partial differential equations embedded in feedforward neural network architectures with automatic differentiation to optimize the network parameters. Another work is [17], on the study of two-layer neural networks trained with backpropagation and fractional derivatives. The authors of [18] studied a fractional deep backpropagation algorithm for neural networks with $L^2$ regularization, and in [19] the stability of Hopfield neural networks of fractional order is investigated, to mention just a few works, although the list of fractional-gradient applications continues to grow promisingly.
Although there are many works that use fractional-order derivatives in neural networks, it is notable that they focus on ad hoc solutions and do not offer easy adaptation or reusability for other applications. Consequently, the need and importance of implementing fractional optimizers in frameworks such as PyTorch [9] has been identified, since such frameworks offer the versatility and flexibility to apply gradient-based optimizers to different areas of great interest to the machine learning community.
Regarding machine learning frameworks, a related work is [7], which presents a Keras–TensorFlow [20,21] implementation of several fractional optimizers successfully applied to human activity recognition.
Since these frameworks have become popular and powerful tools that take advantage of high-performance computing using GPUs and cloud platforms, this article aims to contribute to the implementation of fractional optimizers by extending current versions of the integer-order gradient algorithms available in PyTorch. After describing how the fractional optimizers are implemented, two case studies are presented: the first on generative adversarial networks (GAN) [22] and the second on natural language processing (NLP) with Bidirectional Encoder Representations from Transformers (BERT) [23]. Many other applications are possible, but for now only these two are shown. The results are encouraging and are expected to provide sufficient motivation and justification for applying fractional calculus concepts in machine learning.
The remainder of the paper has the following structure. In Section 2, fundamental concepts of fractional derivatives are reviewed in order to propose a gradient-update formula based on the Caputo definition, and the fractional implementations for PyTorch are described. In Section 3, comparative experiments are presented that aim to show how the fractional versions of gradient-based optimizers can improve performance on GAN and NLP applications with BERT. Finally, Section 4 presents a discussion of the experiments and comments on directions for future work.
2. Materials
The following topics are covered in this section: the Caputo fractional derivative, the backpropagation update formula for multilayer perceptrons (MLP) and the implementations of fractional gradient optimizers for PyTorch. It provides the necessary materials to develop the experiments that support the conclusions.
2.1. Caputo Fractional Derivative
Let $\nu > 0$, $n - 1 < \nu \leq n$ with $n \in \mathbb{N}$, and let $f \in C^{n}[a,b]$. The Caputo fractional derivative of order $\nu$ of $f$ is [17]:
${}^{C}_{a}D^{\nu}_{x} f(x) = \frac{1}{\Gamma(n-\nu)} \int_{a}^{x} \frac{f^{(n)}(\tau)}{(x-\tau)^{\nu-n+1}}\, d\tau$ (1)
It is one of the most preferred definitions of the fractional derivative since, for $\nu = n$, $n \in \mathbb{N}$, it reduces to the classical $n$-th derivative $f^{(n)}$ [18]. In particular, it agrees with classical differential calculus in that the derivative of a constant is zero. In general, this does not hold for other definitions such as the Riemann–Liouville or Grünwald–Letnikov derivatives [17].
In Equation (1), a convolutional kernel $(x-\tau)^{n-\nu-1}$ is used, and for $f(x) = (x-a)^p$, $p > n-1$, it yields [24]:
${}^{C}_{a}D^{\nu}_{x} (x-a)^p = \frac{\Gamma(p+1)}{\Gamma(p-\nu+1)}\, (x-a)^{p-\nu}$ (2)
Equation (2) represents a relevant property since it allows the integer-order gradient optimizers to be extended to their fractional versions, as described in Section 2.2.
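As a concrete instance (a worked case of Equation (2) that is used in Section 2.2), taking $p = 1$ and $a = 0$ gives
${}^{C}_{0}D^{\nu}_{w}\, w = \frac{\Gamma(2)}{\Gamma(2-\nu)}\, w^{1-\nu} = \frac{w^{1-\nu}}{\Gamma(2-\nu)},$
and for $\nu = 1$ this reduces to $w^{0}/\Gamma(1) = 1$, that is, the classical derivative $dw/dw = 1$. This is precisely the factor that appears in the fractional gradient update derived in the next subsection.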
2.2. Backpropagation Update Formula for MLP
Given the original backpropagation formula used to update the parameters of an MLP, the corresponding fractional versions will be obtained.
Let $\{(X_i, d_i)\}_{i=1}^{N}$ be a training set with $N$ samples, and consider a neural network architecture described as follows:
$X$ is the input layer (input data),
$H$ is the number of hidden layers,
$O$ is the output layer,
$L = H + 1$ is the number of layers, because of the $H$ hidden layers and the output layer,
$W^l = [w^l_{kj}]$ is a matrix of synaptic weights, $1 < l \leq L$, where $w^l_{kj}$ connects neuron $k$ of layer $l$ with neuron $j$ of layer $l-1$,
$w^1_{kj}$ are the synaptic weights ($l = 1$) that connect the first hidden layer with $X$,
$d_{ik}$ is the desired output of neuron $k$ at the output layer when the $i$-th input data is presented,
$f(\cdot)$ is the activation function in the $L$ layers,
$y_{ik} = f(z^O_k)$ is the output of neuron $k$ at the output layer $O$ when the $i$-th input data is presented,
$z^l_k = \sum_j w^l_{kj}\, y^{l-1}_j$ is the potential activation of neuron $k$ at layer $l$, $1 < l \leq L$, with inputs $y^{l-1}_j$. For $l = 1$, $z^1_k = \sum_j w^1_{kj}\, x_j$, considering $x_j$ the $j$-th component of $X$,
$y^l_k = f(z^l_k)$ is the output of neuron $k$ at a hidden layer $l$, $1 \leq l < L$.
Note that, at the output layer, the error of neuron $k$ is $e_{ik} = d_{ik} - y_{ik}$. The subindex $i$ means that the $i$-th input pattern is presented to the neural network. For all the neurons, the error at the output layer is:
$E_i = \frac{1}{2} \sum_{k} (d_{ik} - y_{ik})^2$ (3)
and the cumulative error over the $N$ training samples is $E$:
$E = \sum_{i=1}^{N} E_i$ (4)
The main goal of the backpropagation algorithm is to find optimal values of the free parameters of the weight matrix that minimize E.
In the backpropagation algorithm, the error of the output layer O is propagated to the hidden layers in reverse order until it reaches the input layer, and the gradient-descent updates are applied to each layer.
The optimization with the gradient-descent method applied to the weight updates is:
$\Delta w^l_{kj} = -\eta\, \frac{\partial E}{\partial w^l_{kj}}$ (5)
which points in the direction where $E$ decays. Here, $\eta > 0$ is the learning rate. It should be clarified that in Equation (5) the first-order derivative $\partial E / \partial w^l_{kj}$ can also be written as $D^{1}_{w^l_{kj}} E$, in order to match the nomenclature of the Caputo fractional derivative of Section 2.1.
At this point, the local gradient $\delta^l_k$ is defined as:
$\delta^l_k = -\frac{\partial E}{\partial z^l_k}$ (6)
and since
$\frac{\partial E}{\partial w^l_{kj}} = \frac{\partial E}{\partial z^l_k}\, \frac{\partial z^l_k}{\partial w^l_{kj}} = \frac{\partial E}{\partial z^l_k}\; y^{l-1}_j$ (7)
then, $\Delta w^l_{kj}$ can be expressed as:
$\Delta w^l_{kj} = \eta\, \delta^l_k\, y^{l-1}_j$ (8)
For $l = L$ (the output layer $O$), Equation (6) becomes $\delta^O_k = -\frac{\partial E}{\partial y_{ik}}\, \frac{\partial y_{ik}}{\partial z^O_k}$ and then, at the output layer $O$:
$\delta^O_k = (d_{ik} - y_{ik})\, f'(z^O_k)$ (9)
For $1 \leq l < L$, the local gradient for the hidden layers is:
$\delta^l_k = f'(z^l_k) \sum_{m} \delta^{l+1}_m\, w^{l+1}_{mk}$ (10)
and consequently, the weight updates are:
$w^l_{kj}(t+1) = w^l_{kj}(t) + \eta\, \delta^l_k\, y^{l-1}_j$ (11)
Formulas (3) to (11) are well known by the neural network community. However, to make way for the fractional optimizers, the same approach used for the first-order derivative can be followed with the fractional gradient $D^{\nu}_{w^l_{kj}} E$. In such a case, the chain rule yields [18]:
$D^{\nu}_{w^l_{kj}} E = \frac{\partial E}{\partial z^l_k}\, \frac{\partial^{\nu} z^l_k}{\partial (w^l_{kj})^{\nu}} = \frac{\partial E}{\partial z^l_k}\; y^{l-1}_j\; \frac{(w^l_{kj})^{1-\nu}}{\Gamma(2-\nu)}$ (12)
Equation (12) looks identical to Equation (7) except for the factor $\frac{(w^l_{kj})^{1-\nu}}{\Gamma(2-\nu)}$, which is obtained when Equation (2) is applied to $z^l_k$ as a function of $w^l_{kj}$. Note that if $\nu = 1$, Equation (12) becomes the classical integer case. Thus, Equation (12) represents a gradient-descent generalization for $0 < \nu < 2$.
In practice, it is necessary to avoid two conditions:
When the synaptic weights take zero values, which yields an indetermination of $(w^l_{kj})^{1-\nu}$ for $\nu > 1$,
When $\nu$ is rational, say $1 - \nu = r/s$ with $s$ even (for example, $\nu = 0.5$ and $\nu = 1.5$), because if $w^l_{kj} < 0$, then complex values will be generated, as illustrated below.
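As a quick numeric illustration (the values here are chosen only for this example): for $\nu = 1.5$ the exponent is $1 - \nu = -1/2$, so
$w^{1-\nu}\big|_{w = -0.25} = (-0.25)^{-1/2} = \frac{1}{\sqrt{-0.25}} \notin \mathbb{R}, \qquad w^{1-\nu}\big|_{w = 0} = 0^{-1/2} \ \text{is undefined},$
whereas with the replacement $w^l_{kj} \mapsto |w^l_{kj}| + \epsilon$ introduced next, both evaluations remain real and finite.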
These situations have been explored previously in [7], and a solution consists of replacing $w^l_{kj}$ by $|w^l_{kj}| + \epsilon$, for a small $\epsilon > 0$. In this way, the fractional gradient factor is defined as:
$\frac{(|w^l_{kj}| + \epsilon)^{1-\nu}}{\Gamma(2-\nu)}$ (13)
and the limit exists and is equal to 1 for $\nu \to 1$ as $\epsilon \to 0$. Hence, Equation (12) becomes:
$D^{\nu}_{w^l_{kj}} E \approx \frac{\partial E}{\partial w^l_{kj}}\; \frac{(|w^l_{kj}| + \epsilon)^{1-\nu}}{\Gamma(2-\nu)}$ (14)
which generalizes the well-known gradient-descent update rule. It is worth noting that the factor of Equation (13) is not negative for $0 < \nu < 2$. Therefore, the fractional gradient of Equation (14) modifies the magnitude of the classical gradient $\partial E / \partial w^l_{kj}$ but preserves the negative sign of the gradient-descent update of Equation (5). Hence, the fractional gradient also points in the same descent direction on the error surface given by Equation (3), and thus in the direction in which a loss function for the neural network will decay.
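To make Equations (13) and (14) concrete, the following minimal sketch (not the PyTorch implementation described in Section 2.3, which operates inside the optimizer classes) applies the fractional factor to a plain gradient-descent step; the function names fractional_factor and fractional_gd_step are only illustrative.

import torch

def fractional_factor(w: torch.Tensor, nu: float, eps: float = 1e-6) -> torch.Tensor:
    # Equation (13): (|w| + eps)^(1 - nu) / Gamma(2 - nu); it equals 1 when nu = 1.
    gamma = torch.exp(torch.lgamma(torch.tensor(2.0 - nu)))
    return torch.pow(torch.abs(w) + eps, 1.0 - nu) / gamma

def fractional_gd_step(w: torch.Tensor, grad: torch.Tensor, lr: float, nu: float) -> torch.Tensor:
    # Equation (14) combined with Equation (5): the classical gradient is rescaled
    # elementwise and the update keeps the usual descent direction.
    return w - lr * grad * fractional_factor(w, nu)

For $\nu = 1$ the factor is identically 1 and the classical update of Equation (5) is recovered.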
2.3. Fractional Gradient Optimizers for PyTorch
PyTorch is a Python-based scientific computing package for machine learning. As a framework, PyTorch serves two purposes: (i) to use GPUs, and (ii) to provide automatic differentiation for neural networks [9].
The package torch.optim [8] implements various optimization algorithms such as SGD [3], Adam [4], Adadelta [25], Adagrad [26], AdamW [27], and RMSProp [28], among others.
To apply an optimizer in PyTorch, it is enough to use a line of code like the following for the SGD optimizer:
opt = optim.SGD(model.parameters(), lr=0.001, momentum=0.9)
whereas the Adam optimizer can be used as follows:
opt = optim.Adam(model.parameters(), lr=0.001)
Now, since the main idea is to apply Equation (14) to obtain fractional gradient optimizers in PyTorch, it is enough to multiply the integer-order gradient by the corresponding fractional factor. For this purpose, a new class is defined in PyTorch with the prefix “F” for each existing optimizer. In the case of SGD, the new class is FSGD, and the line
__all__ = ['SGD', 'sgd']
is replaced by
__all__ = ['FSGD', 'fsgd']
Moreover, the source code of the update method _single_tensor_sgd is modified as shown in Listing 1.
Listing 1. FSGD class definition and _single_tensor_sgd method modification.
# Parameters: Set v = Cnnu.nnu, 0 < v < 2.0
Cnnu.nnu = 1.75
eps = 0.000001

class FSGD(Optimizer):
    ...
    def _single_tensor_sgd(...):
        ...
        for i, param in enumerate(params):
            d_p = d_p_list[i] if not maximize else -d_p_list[i]
            v = Cnnu.nnu
            t1 = torch.pow(abs(d_p) + eps, 1 - v)
            t2 = torch.exp(torch.lgamma(torch.tensor(2.0 - v)))
            d_p = d_p * t1 / t2
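As a usage sketch (assuming the FSGD class is exposed through the torch.Foptim package introduced in Section 2.5, and that the order parameter lives in a Cnnu module as in Listing 1; both locations are assumptions), a training step with the fractional SGD might look as follows:

import torch
import torch.nn as nn
import torch.Foptim as Foptim     # fractional optimizers package (see Section 2.5)
from torch.Foptim import Cnnu     # assumed location of the fractional-order parameter

model = nn.Linear(10, 1)          # toy model for illustration
criterion = nn.MSELoss()

Cnnu.nnu = 1.75                   # fractional order v, 0 < v < 2.0 (v = 1.0 recovers plain SGD)
opt = Foptim.FSGD(model.parameters(), lr=0.001, momentum=0.9)

x, y = torch.randn(8, 10), torch.randn(8, 1)
opt.zero_grad()
loss = criterion(model(x), y)
loss.backward()
opt.step()                        # gradients are rescaled by the fractional factor internally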
The same procedure can be applied to other gradient-descent optimizers. Let us consider another example with Adam.
The new fractional optimizer is FAdam, and it is obtained by modifying the _single_tensor_adam method of the FAdam class, as shown in Listing 2.
Listing 2. FAdam class definition and _single_tensor_adam method modification.
# Parameters: Set v = Cnnu.nnu, 0 < v < 2.0
Cnnu.nnu = 1.75
eps = 0.000001

class FAdam(Optimizer):
    ...
    def _single_tensor_adam(...):
        ...
        for i, param in enumerate(params):
            grad = grads[i] if not maximize else -grads[i]
            v = Cnnu.nnu
            t1 = torch.pow(abs(grad) + eps, 1 - v)
            t2 = torch.exp(torch.lgamma(torch.tensor(2.0 - v)))
            grad = grad * t1 / t2
For the purposes of this paper, the fractional versions of AdamW, RMSProp, and Adadelta optimizers were also implemented. However, the same methodology can be applied to other optimizers.
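The modification shared by Listings 1 and 2 can be summarized in a small helper; the following sketch (the name apply_fractional is not part of the actual package, only an illustration) shows the step that could be inserted at the top of any _single_tensor_* loop body:

import torch

def apply_fractional(grad: torch.Tensor, v: float, eps: float = 1e-6) -> torch.Tensor:
    # Rescale the incoming gradient by (|grad| + eps)^(1 - v) / Gamma(2 - v),
    # exactly as in Listings 1 and 2; v = 1.0 leaves the gradient unchanged.
    t1 = torch.pow(torch.abs(grad) + eps, 1.0 - v)
    t2 = torch.exp(torch.lgamma(torch.tensor(2.0 - v)))
    return grad * t1 / t2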
2.4. Fractional GAN
Generative Adversarial Networks (GAN) constitute a representative case of artificial creativity where two artificial neural networks are confronted: the generative G that proposes instances and the discriminative D that tries to detect the degree of falsehood of those instances. After repeating the algorithm, the result is a set of objects that share many characteristics of the training objects but are not identical to them.
If G and D use MLPs, then backpropagation can be used to train the whole system [22]. In this way, the generative and discriminative models can apply gradient-descent optimizers, and consequently it is possible to create fractional versions of G and D. Thus, a Fractional Generative Adversarial Network (FGAN) is obtained.
In connection with the above, the proposed FGAN minibatch stochastic gradient-descent training algorithm is the one shown in Algorithm 1, which is based on the integer-gradient version of GAN training described in [22]. Essentially, both stochastic gradients, for the discriminator and for the generator, are updated with the fractional factor of Equation (13) (see lines 5 and 8 of Algorithm 1, respectively). In this sense, the FGAN represents a generalization of the GAN version.
Algorithm 1. Fractional GAN minibatch stochastic gradient-descent training algorithm.
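The body of Algorithm 1 is not reproduced above. The following Python sketch reconstructs its structure from the description in this section and from the integer-order GAN training algorithm of [22], using the binary cross-entropy form of the losses and applying the fractional gradient scaling of Listings 1 and 2 in both updates; the network sizes, hyperparameters, and the placeholder data batch are assumptions, not the exact setup of the experiments.

import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 784), nn.Tanh())
D = nn.Sequential(nn.Linear(784, 256), nn.LeakyReLU(0.2), nn.Linear(256, 1), nn.Sigmoid())
bce = nn.BCELoss()

def fractional_step(params, lr, v, eps=1e-6):
    # Gradient-descent step with the gradient rescaled as in Listings 1 and 2.
    gamma = torch.exp(torch.lgamma(torch.tensor(2.0 - v)))
    with torch.no_grad():
        for p in params:
            if p.grad is not None:
                p -= lr * p.grad * torch.pow(torch.abs(p.grad) + eps, 1.0 - v) / gamma

v, lr, m, k = 1.75, 0.001, 64, 1
for step in range(1000):
    for _ in range(k):                          # k discriminator updates per generator update
        x = torch.rand(m, 784)                  # placeholder for a real (e.g., MNIST) minibatch
        z = torch.randn(m, 100)
        D.zero_grad()
        d_loss = bce(D(x), torch.ones(m, 1)) + bce(D(G(z).detach()), torch.zeros(m, 1))
        d_loss.backward()
        fractional_step(D.parameters(), lr, v)  # fractional discriminator update (cf. line 5)
    z = torch.randn(m, 100)
    G.zero_grad()
    g_loss = bce(D(G(z)), torch.ones(m, 1))     # non-saturating generator objective
    g_loss.backward()
    fractional_step(G.parameters(), lr, v)      # fractional generator update (cf. line 8)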
2.5. Fractional BERT
BERT is the acronym of Bidirectional Encoder Representations from Transformers; it is a machine learning and language-representation model that involves the transformer architecture, with encoder and decoder modules, to extract patterns or representations from data [23]. BERT was developed in the context of computational linguistics and uses bidirectional transformers to learn from both the left and right contexts of a vocabulary. BERT combines two complementary stages: pre-training and fine-tuning. Pre-training uses a large amount of unlabeled data to train the model. Fine-tuning is a transfer-learning step where the previous learning is leveraged on specific labeled data for different applications.
The encoder, composed of a self-attention layer and a feed-forward neural network, aims to map words to intermediate representations together with their relationships.
The decoder has the same structure as the encoder but inserts a middle Encoder–Decoder Attention layer.
The main goal is to model patterns in long sequences in order to overcome some drawbacks of previous approaches, such as LSTM [29], which only models a single context direction.
Since BERT includes neural network modules, and they are frequently optimized via gradient methods, fractional gradient optimizers can be applied to obtain a fractional BERT version (FBERT). Essentially, the only difference is the use of the fractional optimizers described in Section 2.3 instead of others based on integer-order derivatives.
The fractional optimizers of this paper have been included in a torch.Foptim package, which refers to Fractional Optimization for PyTorch. Then, instead of using a PyTorch optimizer such as Adam from the torch.optim package with a line of code like this:
optim = optim.Adam(model.parameters(), lr=0.001)
a fractional optimizer from the torch.Foptim package can be used. In the case of the fractional Adam (FAdam), the code is as follows:
optim = Foptim.FAdam(model.parameters(), lr=0.001)
It is emphasized that for $\nu = 1$, the fractional case reduces to the well-known integer case.
3. Results
In this section, two experiments are described, and their results are shown.
3.1. Experiment 1: FGAN
A first experiment implements an FGAN based on [30], which presents a GAN trained on the MNIST [31] dataset of grayscale images of 28 × 28 pixels. The discriminator network D considers both real and fake images as one-dimensional 1 × 784 vectors. The cost function is:
$J = -\frac{1}{m} \sum_{i=1}^{m} \left[ \log D\left(x^{(i)}\right) + \log\left(1 - D\left(G\left(z^{(i)}\right)\right)\right) \right]$ (15)
where $D(x^{(i)})$ is the output of D with real images as inputs, and $D(G(z^{(i)}))$ corresponds to the output of D with fake images as inputs. The FGAN was executed 30 times with FSGD and FAdam for different values of $\nu$. Figures 1 and 2 allow us to compare FSGD and FAdam with $\nu = 1$; in other words, they represent the integer case of GAN+SGD vs. GAN+Adam (here, + means “optimized with”). Note that in Figure 1, GAN+SGD fails completely, since it does not produce any digit shape, whereas in Figure 2 GAN+Adam performs better, since it successfully produces 22 of 30 digit images.
Other experiments were developed with $\nu \neq 1$ but, for reasons of space, only a few of them are reported. From these experiments, it was observed that FGAN + FSGD with a suitable fractional value of $\nu$ gave the best results, because it produced a digit shape in all 30 executions, as illustrated in Figure 3.
In an attempt to obtain a similar result with FAdam, the FGAN was trained with several values of $\nu$. The results for five of these values are reported in Figures 4–8, respectively. From these figures, it can be deduced that as $\nu$ grows there is a greater number of failures, because more images look noisy (no digit shape is visible), and the best FGAN+FAdam result was the one shown in Figure 4, with 3 failures.
Experimentally, it was not possible to find a $\nu$-value for FGAN+FAdam that always produced digits, as FSGD did. This suggests that SGD is still competitive in certain applications, such as GANs, when the fractional gradient is introduced.
3.2. Experiment 2: FBERT
The second experiment is based on [32], which implements a BERT architecture in PyTorch with four modules: Preprocessing, Building model, Loss and Optimization, and Training.
Preprocessing section. This module defines text data and applies several tasks, including converting the sentences to lowercase and creating a vocabulary, as well as defining special tokens as follows:
CLS: token classification,
SEP: sentence separation,
END: end of sentence,
PAD: equal length and sentence truncation,
MASK: mask creation and word replacement.
Additionally, embedding and masking tasks are included, and they are briefly described below.
Embedding tasks. Three embedding tasks are developed: token embedding to insert the special tokens and to replace each token with its index, segment embedding to separate two sentences from each other, and position embedding that assigns positions to the embeddings of a sequence.
Masking tasks. Masks are randomly assigned to 15% of the sequence, except for the special tokens, and the model then aims to predict these masked words. Additionally, padding is used to ensure that all sentences are equally long.
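As an illustrative sketch of this masking step (a simplification of the tutorial in [32]; the token names and identifiers below are assumptions, and the 15% rate follows the standard BERT practice mentioned above):

import random

special = {'[CLS]', '[SEP]', '[PAD]'}

def mask_tokens(tokens, mask_rate=0.15):
    # Randomly replace about 15% of the non-special tokens with '[MASK]'
    # and remember the original words to be predicted.
    tokens = list(tokens)
    candidates = [i for i, t in enumerate(tokens) if t not in special]
    n_mask = max(1, int(round(mask_rate * len(candidates))))
    targets = {}
    for i in random.sample(candidates, n_mask):
        targets[i] = tokens[i]      # position -> original word
        tokens[i] = '[MASK]'
    return tokens, targets

masked, targets = mask_tokens(['[CLS]', 'my', 'dog', 'is', 'cute', '[SEP]'])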
The experiment focuses on the Next-Word Prediction case study, where a label is created to predict consecutive sentences. A true value is assigned for consecutive sentences, in the sense that the positions of the first sentence and the second sentence are in the same context.
Building model section. Given the previously described tasks, the building model section involves four components of BERT: the Embedding layer, the Attention Mask, the Encoder layer, and the BERT Assembling.
The embedding layer applies the embedding tasks. The attention mask applies the masking tasks and attempts to predict a masked word randomly selected from the input. The encoder establishes representations and patterns from the embedding and masking tasks by combining the embedding information, through the three variables Query, Key, and Value, with the attention information to produce a score via a scaled dot-product operator. This operator has two outputs, the attention and context vectors, which are evaluated in a linear layer.
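The scaled dot-product step mentioned above can be sketched as follows (a generic formulation, not the exact code of [32]); Q, K, and V are the Query, Key, and Value tensors and d_k is the key dimension:

import math
import torch

def scaled_dot_product_attention(Q, K, V, mask=None):
    # scores = Q K^T / sqrt(d_k); masked positions receive a large negative value before softmax.
    d_k = Q.size(-1)
    scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(d_k)
    if mask is not None:
        scores = scores.masked_fill(mask == 0, -1e9)
    attention = torch.softmax(scores, dim=-1)   # attention weights
    context = torch.matmul(attention, V)        # context vectors
    return context, attention

# toy usage: a batch of 2 sequences of length 5 with key dimension 8
Q = K = V = torch.randn(2, 5, 8)
context, attention = scaled_dot_product_attention(Q, K, V)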
Loss and Optimization section. The original experiment of [23] uses only the Adam optimizer. In this experiment, the fractional Foptim.FAdam, Foptim.FSGD (with and without momentum), Foptim.FAdamW, Foptim.FAdan, Foptim.FRMSProp, and Foptim.FAdadelta optimizers are used. Essentially, it is enough to change the optimizer of the original line:
optim = optim.Adam(model.parameters(), lr=0.001)
to the corresponding fractional one. For example, for the fractional Foptim.FAdam, the following line of code can be used:
optim = Foptim.FAdam(model.parameters(), lr=0.001)
and similarly for the other fractional optimizers.
The loss function is the same as in [23], and it is the CrossEntropyLoss defined in Equation (15).
Training section. As in the original work, it runs for 100 epochs and reports the loss function every 10 epochs.
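A minimal sketch of this training loop with a fractional optimizer is shown below; a stand-in model replaces the actual BERT of [32] to keep the example self-contained, and the location of the Cnnu order parameter is an assumption, as before.

import torch
import torch.nn as nn
import torch.Foptim as Foptim
from torch.Foptim import Cnnu          # assumed location of the fractional-order parameter

model = nn.Sequential(nn.Embedding(30, 32), nn.Flatten(), nn.Linear(32 * 8, 2))  # stand-in for BERT
criterion = nn.CrossEntropyLoss()
Cnnu.nnu = 1.5                          # illustrative fractional order
optimizer = Foptim.FAdam(model.parameters(), lr=0.001)

tokens = torch.randint(0, 30, (4, 8))   # toy batch of token ids
labels = torch.randint(0, 2, (4,))      # toy next-sentence labels

for epoch in range(100):
    optimizer.zero_grad()
    loss = criterion(model(tokens), labels)
    loss.backward()
    optimizer.step()
    if (epoch + 1) % 10 == 0:
        print(f"epoch {epoch + 1:4d}  loss {loss.item():.6f}")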
In our experiment, seven fractional optimizers were considered, with the self-descriptive labels FSGD, FSGDm (using momentum), FAdam, FAdamW, FAdan, FRMSProp, and FAdadelta. The order $\nu$ of the derivative is controlled by the variable Cnnu.nnu, and several values of $\nu$, including $\nu = 1$, were considered in this experiment. As previously stated, for $\nu = 1$ the fractional optimizers reduce to the first-order integer case.
The text used for training was the same as that originally used in [23]. The FBERT training results are reported in the boxplot of Figure 9, where the following can be observed:
Focusing on FAdam, and considering the original experiment with Adam (i.e., FAdam with $\nu = 1$), it is suboptimal and is outperformed by other configurations with fractional derivatives,
Focusing on FSGD and FSGDm, the best results are obtained for fractional values of $\nu$,
Focusing on FAdan, the best results are likewise obtained for fractional values of $\nu$,
FRMSProp and FAdadelta do not show a competitive performance,
Among all 42 boxes in the boxplot, the best results are for FSGD, FSGDm, and FAdam, and the minimum is achieved by FSGD with a fractional value of $\nu$.
In addition to the fact that the best optimizers turned out to be fractional, they achieved better consistency in the boxplot with respect to those with $\nu = 1$. The numerical data of Figure 9 are not included for simplicity and to save space.
Figure 9. Boxplot of the loss functions for FBERT trained with the fractional optimizers and several values of $\nu$.
The source code of all the fractional optimizers in this paper is available for download.
4. Discussion
One of the optimizers considered state of the art in machine learning is Adam, along with the “Adam flavors”, which focus mainly on the history and the self-tuning of hyperparameters, in particular the learning rate. In contrast, this work focuses on applying the ancient but powerful concept of the fractional derivative to give an additional degree of freedom to existing optimizers.
Two experiments were developed to show how the fractional versions of gradient-based optimizers could improve their performance on GAN and natural language applications with BERT.
For Experiment 1, it is worth mentioning that the author of the original program admits the complexity of finding a set of hyperparameters that gives satisfactory results with GAN+Adam on MNIST (SGD is not even considered by many authors). This paper shows that using Adam (i.e., FAdam with $\nu = 1$) does not always lead to successful results and that, on the contrary, SGD does, not for $\nu = 1$ but for a fractional value of $\nu$. With these encouraging results, there are reasons to affirm that fractional gradients favor controlled artificial creativity, which is useful in neural networks such as GANs.
In Experiment 2, the fractional BERT was successfully implemented. Certainly, it was not trained on a large dataset, because the objective was to appreciate and compare the influence of the fractional gradient. The running time for each combination of fractional optimizer and $\nu$-value was not prohibitive, and all runs were successfully executed on a single GPU.
The experimental results show that SGD can be as competitive as other optimizers, which can also improve their performance when considering the fractional gradient. Indeed, fractional SGD has shown better performance in artificial creativity applications with GAN and NLP applications with BERT.
In future work, with the current background, it is proposed to apply fractional optimizers on large data sets, with transfer-learning from pre-trained models, as well as explore other application areas.
An open research area is to study the existence of an optimal value of the fractional derivative determined by the data, the self-tuning of the fractional $\nu$-value, and the use of other definitions of the fractional derivative.
The source code of the fractional optimizers for PyTorch resulting from this work is available online with the aim of being used and improved. At the same time, we hope that more members of the AI community will learn about, apply, and see the benefits of fractional calculus concepts.
Conceptualization, O.H.-A.; Methodology, O.H.-A.; Software, O.H.-A. and J.R.C.-A.; Validation, O.H.-A. and J.R.C.-A.; Writing—review and editing, O.H.-A. and J.R.C.-A. All authors have read and agreed to the published version of the manuscript.
The source code of the fractional optimizers of this paper is available for download at:
The authors declare no conflict of interest.
References
1. Haykin, S.S. Neural Networks and Learning Machines; 3rd. ed. Pearson Education: Upper Saddle River, NJ, USA, 2009.
2. Schmidt, R.M.; Schneider, F.; Hennig, P. Descending through a Crowded Valley—Benchmarking Deep Learning Optimizers. Proceedings of the 38th International Conference on Machine Learning (PMLR), Virtual Event, 18–24 July 2021; Meila, M., Zhang, T., Eds.; 2021; Volume 139, pp. 9367-9376.
3. Robbins, H.; Monro, S. A stochastic approximation method. Ann. Math. Stat.; 1951; pp. 400-407. [DOI: https://dx.doi.org/10.1214/aoms/1177729586]
4. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv; 2014; arXiv: 1412.6980
5. Abdulkadirov, R.; Lyakhov, P.; Nagornov, N. Survey of Optimization Algorithms in Modern Neural Networks. Mathematics; 2023; 11, 2466. [DOI: https://dx.doi.org/10.3390/math11112466]
6. Tian, Y.; Zhang, Y.; Zhang, H. Recent Advances in Stochastic Gradient Descent in Deep Learning. Mathematics; 2023; 11, 682. [DOI: https://dx.doi.org/10.3390/math11030682]
7. Herrera-Alcántara, O. Fractional Derivative Gradient-Based Optimizers for Neural Networks and Human Activity Recognition. Appl. Sci.; 2022; 12, 9264. [DOI: https://dx.doi.org/10.3390/app12189264]
8. PyTorch-Contributors. TOPTIM: Implementing Various Optimization Algorithms. 2023; Available online: https://pytorch.org/docs/stable/optim.html (accessed on 1 May 2023).
9. Paszke, A.; Gross, S.; Massa, F.; Lerer, A.; Bradbury, J.; Chanan, G.; Killeen, T.; Lin, Z.; Gimelshein, N.; Antiga, L. et al. PyTorch: An Imperative Style, High-Performance Deep Learning Library. Advances in Neural Information Processing Systems 32; Curran Associates, Inc.: Red Hook, NY, USA, 2019; pp. 8024-8035.
10. Oldham, K.B.; Spanier, J. The Fractional Calculus; Academic Press [A Subsidiary of Harcourt Brace Jovanovich, Publishers]: New York, NY, USA, London, UK, 1974; Volume 111, xiii+234.
11. Miller, K.; Ross, B. An Introduction to the Fractional Calculus and Fractional Differential Equations; Wiley: Hoboken, NJ, USA, 1993.
12. Mainardi, F. Fractional Calculus and Waves in Linear Viscoelasticity; 2nd ed. Number 2 World Scientific: Singapore, 2022; 628.
13. Yousefi, F.; Rivaz, A.; Chen, W. The construction of operational matrix of fractional integration for solving fractional differential and integro-differential equations. Neural Comput. Applic; 2019; 31, pp. 1867-1878. [DOI: https://dx.doi.org/10.1007/s00521-017-3163-9]
14. Gonzalez, E.A.; Petráš, I. Advances in fractional calculus: Control and signal processing applications. Proceedings of the 2015 16th International Carpathian Control Conference (ICCC); Szilvasvarad, Hungary, 27–30 May 2015; pp. 147-152. [DOI: https://dx.doi.org/10.1109/CarpathianCC.2015.7145064]
15. Henriques, M.; Valério, D.; Gordo, P.; Melicio, R. Fractional-Order Colour Image Processing. Mathematics; 2021; 9, 457. [DOI: https://dx.doi.org/10.3390/math9050457]
16. Pang, G.; Lu, L.; Karniadakis, G.E. fPINNs: Fractional Physics-Informed Neural Networks. SIAM J. Sci. Comput.; 2019; 41, pp. A2603-A2626. [DOI: https://dx.doi.org/10.1137/18M1229845]
17. Wang, J.; Wen, Y.; Gou, Y.; Ye, Z.; Chen, H. Fractional-order gradient descent learning of BP neural networks with Caputo derivative. Neural Netw.; 2017; 89, pp. 19-30. [DOI: https://dx.doi.org/10.1016/j.neunet.2017.02.007] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/28278430]
18. Bao, C.; Pu, Y.; Zhang, Y. Fractional-Order Deep Backpropagation Neural Network. Comput. Intell. Neurosci.; 2018; 2018, 7361628. [DOI: https://dx.doi.org/10.1155/2018/7361628] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/30065757]
19. Wang, H.; Yu, Y.; Wen, G. Stability analysis of fractional-order Hopfield neural networks with time delays. Neural Netw.; 2014; 55, pp. 98-109. [DOI: https://dx.doi.org/10.1016/j.neunet.2014.03.012] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/24819875]
20. Chollet, F.; Zhu, Q.; Rahman, F.; Lee, T.; Marmiesse, G.; Zabluda, O.; Qian, C.; Jin, H.; Watson, M.; Chao, R. et al. Keras. 2015; Available online: https://keras.io/ (accessed on 1 May 2023).
21. Abadi, M.; Agarwal, A.; Barham, P.; Brevdo, E.; Chen, Z.; Citro, C.; Corrado, G.S.; Davis, A.; Dean, J.; Devin, M. et al. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. 2015; Available online: https://tensorflow.org (accessed on 1 May 2023).
22. Goodfellow, I.; Pouget-Abadie, J.; Mirza, M.; Xu, B.; Warde-Farley, D.; Ozair, S.; Courville, A.; Bengio, Y. Generative adversarial nets. Proceedings of the Advances in Neural Information Processing Systems; Montreal, QC, Canada, 8–13 December 2014; pp. 2672-2680.
23. Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics; Minneapolis, MN, USA, 2019; pp. 4171-4186. [DOI: https://dx.doi.org/10.18653/v1/N19-1423]
24. Garrappa, R.; Kaslik, E.; Popolizio, M. Evaluation of Fractional Integrals and Derivatives of Elementary Functions: Overview and Tutorial. Mathematics; 2019; 7, 407. [DOI: https://dx.doi.org/10.3390/math7050407]
25. Zeiler, M.D. ADADELTA: An Adaptive Learning Rate Method. arXiv; 2012; [DOI: https://dx.doi.org/10.48550/ARXIV.1212.5701] arXiv: 1212.5701
26. Duchi, J.; Hazan, E.; Singer, Y. Adaptive Subgradient Methods for Online Learning and Stochastic Optimization. J. Mach. Learn. Res.; 2011; 12, pp. 2121-2159.
27. Zhuang, Z.; Liu, M.; Cutkosky, A.; Orabona, F. Understanding adamw through proximal methods and scale-freeness. arXiv; 2022; arXiv: 2202.00089
28. Tieleman, T.; Hinton, G. Neural Networks for Machine Learning; Technical Report COURSERA: Napa County, CA, USA, 2012.
29. Hochreiter, S.; Schmidhuber, J. Long short-term memory. Neural Comput.; 1997; 9, pp. 1735-1780. [DOI: https://dx.doi.org/10.1162/neco.1997.9.8.1735] [PubMed: https://www.ncbi.nlm.nih.gov/pubmed/9377276]
30. Seo, J.D. Only Numpy: Implementing GAN (General Adversarial Networks) and Adam Optimizer Using Numpy with Interactive Code. 2023; Available online: https://towardsdatascience.com/only-numpy-implementing-gan-general-adversarial-networks-and-adam-optimizer-using-numpy-with-2a7e4e032021 (accessed on 1 May 2023).
31. Deng, L. The mnist database of handwritten digit images for machine learning research. IEEE Signal Process. Mag.; 2012; 29, pp. 141-142. [DOI: https://dx.doi.org/10.1109/MSP.2012.2211477]
32. Barla, N. How to code BERT using PyTorch. Available online: https://neptune.ai/blog/how-to-code-bert-using-pytorch-tutorial (accessed on 1 May 2023).
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Abstract
Machine learning is a branch of artificial intelligence that dates back more than 50 years. It is currently experiencing a boom in research and technological development. With the rise of machine learning, the need to propose improved optimizers has become more acute, leading to the search for new gradient-based optimizers. In this paper, the ancient concept of fractional derivatives has been applied to some optimizers available in PyTorch. A comparative study is presented to show how the fractional versions of gradient optimizers could improve their performance on generative adversarial networks (GAN) and natural language applications with Bidirectional Encoder Representations from Transformers (BERT). The results are encouraging for both state-of-the-art algorithms, GAN and BERT, and open up the possibility of exploring further applications of fractional calculus in machine learning.