Purpose
The purpose of this paper is three-fold: to review the categories explaining mainly the optimization algorithms (techniques) needed to improve the generalization performance and learning speed of the Feedforward Neural Network (FNN); to discover the change in research trends by analyzing all six categories (i.e. gradient learning algorithms for network training, gradient free learning algorithms, optimization algorithms for learning rate, bias and variance (underfitting and overfitting) minimization algorithms, constructive topology neural networks, metaheuristic search algorithms) collectively; and to recommend new research directions for researchers and help users understand the real-world applications of these algorithms in solving complex management, engineering and health sciences problems.
Design/methodology/approach
The FNN has gained much attention from researchers over the last few decades for its ability to support more informed decisions. The literature survey focuses on the learning algorithms and optimization techniques proposed in the last three decades. This paper (Part II) is an extension of Part I. For the sake of simplicity, the paper entitled “Machine learning facilitated business intelligence (Part I): Neural networks learning algorithms and applications” is referred to as Part I. To keep the study consistent with Part I, the approach and survey methodology in this paper are kept similar to those in Part I.
Findings
Combining the work performed in Part I, the authors studied a total of 80 articles identified through popular keyword searching. The FNN learning algorithms and optimization techniques identified in the selected literature are classified into six categories based on their problem identification, mathematical model, technical reasoning and proposed solution. Previously, in Part I, the two categories focusing on the learning algorithms (i.e. gradient learning algorithms for network training, gradient free learning algorithms) were reviewed together with their real-world applications in management, engineering and health sciences. Therefore, in the current paper, Part II, the remaining four categories, exploring optimization techniques (i.e. optimization algorithms for learning rate, bias and variance (underfitting and overfitting) minimization algorithms, constructive topology neural networks, metaheuristic search algorithms), are studied in detail. The explanation of each algorithm is enriched by discussing its technical merits, limitations and applications within its respective category. Finally, the authors recommend new future research directions which can contribute to strengthening the literature.
Research limitations/implications
The FNN contributions are rapidly increasing because of its ability to support reliably informed decisions. As with the learning algorithms reviewed in Part I, the focus is to enrich the comprehensive study by reviewing the remaining categories focusing on the optimization techniques. However, future efforts may be needed to incorporate other algorithms into the six identified categories, or to suggest new categories, in order to continuously monitor the shift in research trends.
Practical implications
The authors studied the shift in research trends over three decades by collectively analyzing the learning algorithms and optimization techniques together with their applications. This may help researchers to identify future research gaps for improving generalization performance and learning speed, and users to understand the application areas of the FNN. For instance, research contributions in FNN over the last three decades have shifted from complex gradient-based algorithms to gradient free algorithms, from the trial and error fixed-topology approach for hidden units to cascade topology, from initial guessing of hyperparameters to their analytical calculation, and toward converging algorithms at a global minimum rather than a local minimum.
Originality/value
The existing literature surveys include comparative studies of the algorithms and identify algorithm application areas while focusing on specific techniques, so they may not be able to identify algorithm categories, the shift in research trends over time, the application areas most frequently analyzed, common research gaps and collective future directions. Parts I and II attempt to overcome these limitations by classifying articles into six categories covering a wide range of algorithms proposed to improve the FNN generalization performance and convergence rate. The classification of algorithms into six categories helps to analyze the shift in research trends, which makes the classification scheme significant and innovative.
1. Introduction
The Feedforward Neural Network (FNN) has gained much attention from researchers in the last few decades (Abdel-Hamid et al., 2014; Babaee et al., 2018; Chen et al., 2018; Chung et al., 2017; Deng et al., 2019; Dong et al., 2016; Ijjina and Chalavadi, 2016; Kastrati et al., 2019; Kummong and Supratid, 2016; Mohamed Shakeel et al., 2019; Nasir et al., 2019; Teo et al., 2015; Yin and Liu, 2018; Zaghloul et al., 2009) because of its ability to extract useful patterns and make more informed decisions from high dimensional data (Kumar et al., 1995; Tkáč and Verner, 2016; Tu, 1996). With modern information technology advancement, the challenging issues of high dimensional, non-linear, noisy and unbalanced data are continuously growing and varying at a rapid rate, which demands efficient learning algorithms and optimization techniques (Shen, Choi and Chan, 2019; Shen and Chan, 2017). The data may become a costly resource if not analyzed properly in the process of business intelligence. Machine learning is gaining significant interest in facilitating business intelligence in the process of data gathering, analysis and knowledge extraction to help users make better informed decisions (Bottani et al., 2019; Hayashi et al., 2010; Kim et al., 2019; Lam et al., 2014; Li et al., 2018; Mori et al., 2012; Wang et al., 2005; Wong et al., 2018). Efforts are being made to overcome the challenges by building optimal machine learning FNNs that may extract useful patterns from the data and generate information in real-time for better-informed decision making. Extensive knowledge and theoretical information are required to build FNNs having the characteristics of better generalization performance and learning speed. Generalization performance and learning speed are the two criteria that play an essential role in deciding on the learning algorithms and optimization techniques used to build optimal FNNs. Depending upon the application and data structure, the user might prefer either better generalization performance or faster learning speed, or a combination of both. Some of the drawbacks that may affect the generalization performance and learning speed of FNNs include local minima, saddle points, plateau surfaces, hyperparameter adjustment, trial and error experimental work, tuning connection weights, deciding on hidden units and layers, and many others. The drawbacks that limit FNN applicability may become worse with inadequate user expertise and insufficient theoretical information. Several questions were identified in Part I of the study that may be the causes of the above drawbacks, for instance, how to define the network size, hidden units, hidden layers, connection weights, learning rate, topology and so on.
Previously, in Part I of the study, we made an effort to answer the key questions by reviewing two categories explaining the learning algorithms. The categories were named gradient learning algorithms for network training and gradient free learning algorithms. In the current paper, Part II, we review the remaining four categories explaining optimization techniques (i.e. optimization algorithms for learning rate, bias and variance (underfitting and overfitting) minimization algorithms, constructive topology neural networks, metaheuristic search algorithms). Part II is an extension of Part I. For each category, researchers' efforts to demonstrate the effectiveness of their proposed optimization techniques in solving real-world management, engineering, and health sciences problems are also explained to enrich the content and make users familiar with FNN application areas. Moreover, all categories are collectively analyzed in the current paper in order to discover the shift in research trends. Based on the review of the existing literature, the authors recommend future research directions for strengthening the literature. In-depth knowledge from the survey will help researchers to design new, simple and compact algorithms having better generalization performance characteristics, and to generate results in the shortest possible time. Similarly, users may be able to decide on and select the algorithm that best suits their application area.
The paper is organized as follows: Section 2 describes the survey methodology. Section 3 briefly overviews Part I of the study. In Section 4, the four categories that focus on optimization techniques are reviewed with a detailed description of each algorithm in terms of its merits, limitations, and real-world management, engineering, and health sciences applications. Section 5 discusses future directions to improve FNN generalization performance and learning speed. Section 6 concludes the paper.
2. Survey methodology
2.1 Source of literature and philosophy of review work
The sources of the literature and philosophy of review work are identical to those in Part I. Combining the review articles of Part I, the authors studied a total of 80 articles, of which 63 (78.75 percent) were journal papers, 10 (12.50 percent) conference papers, 3 (3.75 percent) online arXiv archives, 2 (2.50 percent) books, 1 (1.25 percent) a technical report and 1 (1.25 percent) an online academic lecture. Previously, in Part I, only 38 articles were reviewed, mainly in the learning algorithm categories. In the current paper, the remaining articles are reviewed in Section 4, which explains the optimization techniques needed to improve the generalization performance and learning speed of the FNN. In this section, all 80 articles (Parts I and II) are collectively analyzed to discover the shift in research trends.
2.2 Classification schemes
The classification in this paper is an extension of Part I and focuses on the optimization techniques recommended in the last three decades for improving the generalization performance and learning speed of the FNN. In this subsection, all categories are collectively analyzed to discover the shift in research trends. Combining the categories explained in Part I, the six categories in total are:
gradient learning algorithms for network training;
gradient free learning algorithms;
optimization algorithms for learning rate;
bias and variance (underfitting and overfitting) minimization algorithms;
constructive topology FNN; and
metaheuristic search algorithms.
Categories one and two are extracted from Part I, whereas categories three to six are reviewed in the current paper. In Part I, the first category considered gradient learning algorithms that need first order or second order gradient information to build FNNs, and the second category contained gradient free learning algorithms that determine connection weights analytically rather than by first or second order gradient tuning. Categories three to six are reviewed in the current paper and are summarized as follows: the third category contains optimization algorithms that vary the learning rate at different iterations to improve generalization performance by avoiding divergence from the minimum of the loss function; the fourth category contains algorithms to avoid underfitting and overfitting to improve generalization performance at the cost of additional training time; the fifth category contains constructive topology learning algorithms that avoid the trial and error approaches needed for determining the number of hidden units in fixed-topology networks; and the sixth category contains metaheuristic global optimization algorithms that search the loss function in the global search space instead of the local space.
Figure 1 illustrates the classification of algorithms into the six categories. We identified a total of 54 unique algorithms proposed in the 80 articles. Among the 54 unique algorithms, 27 learning algorithms were reviewed in Part I and the remaining 27 optimization algorithms are reviewed in Section 4 of the current paper. Figure 2 illustrates the number of algorithms identified in each category over time. Other than the proposed algorithms, a small number of articles that support or criticize the identified algorithms are also included to widen the review. The unique algorithms, together with the supportive and critical articles, result in a total of 80 articles. Figure 3 illustrates the total number of articles reviewed in each category. Table I provides a detailed summary of the algorithms identified in each category along with references to the total number of papers reviewed. In the table, the references of the articles reviewed in each specific category are given; however, articles referenced in more than one category are identified with an asterisk (*). For instance, the articles Rumelhart et al. (1986), Setiono and Hui (1995), Hinton et al. (2012), Zeiler (2012) and Hunter et al. (2012) appear in the first, third and fifth categories. The major contributions of Rumelhart et al. (1986) and Setiono and Hui (1995) are in the first category, those of Hinton et al. (2012) and Zeiler (2012) are in the third category, and that of Hunter et al. (2012) is in the fifth category. To avoid repetition, Figure 3 is plotted showing the major contribution of each article in its respective category. This is the reason that the distribution of the gradient learning algorithm category in Part II differs from Part I. Initially, in Part I, the authors mentioned that 17 articles were reviewed for the gradient learning algorithm category (Figure 5 of Part I). However, in Part II, 14 articles are reported. References Hinton et al. (2012), Zeiler (2012) and Hunter et al. (2012) are cited in the gradient learning algorithm category to support the relevant literature, but their original work relates to optimization algorithms. This implies that Hinton et al. (2012) and Zeiler (2012) have their major contributions in the third category, and the major contribution of Hunter et al. (2012) is in the fifth category. To avoid repetition, the difference (i.e. 17−14=3) is reported in the third and fifth categories, reflecting the original work. The distribution in Figures 1–3 and the content in Table I show researchers' interest and the trend in specific categories. The distribution and content show that research directions are changing from complex gradient-based algorithms to gradient free algorithms, from fixed topology to constructive topology, from initial guesswork for hyperparameters to their analytical calculation, and toward converging algorithms at a global minimum rather than a local minimum. In another sense, the categories may be considered as opportunities for discovering research gaps where further improvement can bring a significant contribution.
3. Part I: an overview
The two categories reviewed in Part I are briefly explained in this section, which is divided into two subsections. The first subsection gives a concise overview of the FNN working method and the second subsection summarizes the two categories that are explained in detail in Part I.
3.1 FNN: an overview
The FNN processes information in a parallel structure by connecting the input layer to the output layer in a forward direction without forming a cycle or loop. The input layer and output layer are connected through hidden units by a series of connection weights (Hecht-Nielsen, 1989). Figure 4 illustrates a simple FNN with three layers. The weight connections linking the input layer, consisting of input features x with an added bias bu, to the hidden units u are known as input connection weights wicw. Similarly, the weight connections linking u, with an added bias bo, to the output layer are known as output connection weights wocw. The output p is calculated in a forward propagation step by applying the activation function f(z) to the product of wocw and u, and is compared with the target vector a to determine the loss function E. This can be expressed mathematically as:
Such that:
The objective is to minimize the loss function E of the network by achieving a p value approximately equal to a:
At each instance h, the error eh can be expressed as:
If E is larger than the predefined expected error ε, the error is backpropagated by taking the derivative of E with respect to each incoming weight in the direction of the descending gradient, following the chain rule:
where ∇E is a partial derivative of the error with respect to weight w; outi the activation output; neti the weighted sum of inputs of the hidden unit. For the sake of simplicity in this study, the connection weights w in the equations refer to all types of connection weights in the network, unless otherwise specified. This continues, and the connection weights are updated at each iteration i in gradient descent (GD) direction to bring p closer to a:
where ∝ is a learning rate hyperparameter. A similar method is followed for the second order ∇2E learning algorithms.
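To make the forward propagation and gradient descent update above concrete, the following is a minimal sketch in Python (NumPy); the variable names mirror the notation above (wicw, wocw, the biases and the loss E) but are assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch of forward propagation and one GD weight update for a
# three-layer FNN with sigmoid activations and a squared-error loss E.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w_icw, b_u, w_ocw, b_o):
    u = sigmoid(w_icw @ x + b_u)       # hidden unit outputs u
    p = sigmoid(w_ocw @ u + b_o)       # network output p
    return u, p

def gd_step(x, a, w_icw, b_u, w_ocw, b_o, lr=0.1):
    u, p = forward(x, w_icw, b_u, w_ocw, b_o)
    e = p - a                          # error at each output unit
    E = 0.5 * np.sum(e ** 2)           # loss E
    # Chain rule: dE/dw = dE/dout * dout/dnet * dnet/dw (sigmoid derivative p(1-p))
    delta_o = e * p * (1 - p)
    delta_u = (w_ocw.T @ delta_o) * u * (1 - u)
    # Update each weight in the direction of the descending gradient
    w_ocw -= lr * np.outer(delta_o, u)
    b_o   -= lr * delta_o
    w_icw -= lr * np.outer(delta_u, x)
    b_u   -= lr * delta_u
    return E
```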
3.2 Part I classification
In Part I, the authors identified a total of 27 unique algorithms proposed in 38 articles. Other than the proposed algorithms, a small number of articles that support or criticize the identified algorithms were also incorporated for a broader review. The identified 27 algorithms were classified into two main categories based on their problem identification, mathematical model, technical reasoning and proposed solution. The categories were named: gradient learning algorithms for network training; and gradient free learning algorithms.
Table I illustrates the classification of algorithms into their respective categories and summarizes the algorithms in each category with the respective citation references. The gradient learning algorithm category includes first order and second order learning algorithms used to train the network following the backpropagation (BP) delta rule. The category was further divided into two subcategories. Subcategory one covers algorithms belonging to the first order derivative rule, whereas subcategory two covers algorithms belonging to the second order derivative rule. The gradient free learning category describes algorithms that do not need gradient information for the iterative tuning of connection weights, and was further classified into three subcategories. Subcategory one covers probabilistic and general regression neural networks, subcategory two covers the extreme learning machine (ELM) and its variant algorithms, which randomly generate hidden units and analytically calculate output connection weights, and subcategory three describes algorithms adopting a hybrid approach of gradient and gradient-free methods.
4. Optimization techniques
This section covers the remaining four categories (i.e. optimization algorithms for learning rate, bias and variance (underfitting and overfitting) minimization algorithms, constructive topology neural networks, metaheuristic search algorithms), explaining mainly the optimization techniques as summarized in Table I. The optimization algorithms (techniques) are explained along with their merits and technical limitations in order to understand the FNN drawbacks and answer the questions involving user expertise and theoretical information, so as to identify research gaps and future directions. The list of applications mentioned in each category gives an indication of the successful implementation of the optimization techniques in various real-world management, engineering, and health sciences problems. The subsections below explain the algorithm categories and their subcategories.
4.1 Optimization algorithms for learning rate
BP gradient learning algorithms are powerful and have been extensively studied to improve their convergence. The important global hyperparameter that sets the position of a new weight for BP is the learning rate ∝. If ∝ is too large, the algorithm may diverge from the minimum of the loss function and oscillate around it; however, if ∝ is too small, it will converge very slowly. A suboptimal learning rate can cause the FNN to fall into a local minimum or saddle point. The objective of the FNN is to determine the global minimum of the loss function, which truly represents the domain of the function, rather than a local minimum. A saddle point can be defined as a point where the function has both minimum and maximum directions (Gori and Tesi, 1992).
The algorithms in this category, which improve the convergence of BP by optimizing the learning rate, are based on the exponential moving weighted average (EMA) statistical technique (Lucas and Saccucci, 1990) and its bias correction (Kingma and Ba, 2014). Before moving into the details of each optimization algorithm, it is important to understand the basic intention behind EMA and the role of bias correction in better optimization. EMA is best suited to noisy data, and it helps to denoise them by taking a moving average of the previous values to define the next sequence in the data. With an initial guess of Δwo=0, the next sequence can be computed from the expression:
where β is a moving exponential hyperparameter with a value in [0, 1]. The hyperparameter β plays an important role in EMA. First, it averages Δwi over approximately 1/(1−β) previous values, which means the higher the value of β, the more values are averaged and the less noisy the data trend, whereas a lower value of β averages fewer values and the trend fluctuates more. Second, for a higher value of β, more importance is given to the previous weights compared to the derivative values. Moreover, it is found that during the initial sequence the trend is biased and lies further away from the original function because of the initial guess Δwo=0. This results in much lower values, which can be improved by computing the bias correction such that:
The effect of 1−βi in the denominator decreases with increasing iteration i. Therefore, 1−βi has more influence on the starting iterations and can generate a sequence with better results. The EMA and its bias correction techniques have been extensively utilized in FNN research to optimize learning and convergence with respect to error in a minimum number of iterations.
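As an illustration (not taken from the cited works), the EMA update Δwi = βΔwi−1 + (1−β)∇Ei and its bias correction Δwi/(1−βi) described above can be sketched as follows; the function and variable names are assumptions.

```python
# Minimal sketch of an exponential moving average with bias correction.
def ema_with_bias_correction(grads, beta=0.9):
    delta_w, corrected = 0.0, []
    for i, g in enumerate(grads, start=1):
        delta_w = beta * delta_w + (1 - beta) * g    # EMA with initial guess 0
        corrected.append(delta_w / (1 - beta ** i))  # correction dominates early iterations
    return corrected
```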
When the learning rate is small, the gradient moves slowly toward the minimum, whereas a large learning rate makes the gradient oscillate along the long and short axes of the minimum error valley. This reduces gradient movement along the long axis, which points toward the minimum. Momentum dampens the oscillation along the short axis while adding more contribution along the long axis, moving the gradient in larger steps toward the minimum with fewer iterations (Rumelhart et al., 1986). Equation (7) can be modified for momentum:
where βa is the momentum hyperparameter controlling the exponential decay. The new weight is a linear function of both the current gradient and the weight change during the previous step (Qian, 1999). Riedmiller and Braun (1993) argued that, in practice, it is not always true that momentum will make learning more stable. The momentum parameter βa is as problematic as the learning parameter ∝, and no general improvement can be achieved. Besides ∝, the second factor that affects Δwi is the unforeseeable behavior of the derivative ∇Ei, whose magnitude differs for different weights. The resilient propagation (RProp) algorithm was developed to avoid this problem of blurred adaptivity and changes the size of Δwi directly without considering the magnitude of ∇Ei. For each weight wi + 1, the size of the update value Δi is computed based on the local gradient information:
where η+ and η− are the increasing and decreasing hyperparameters. The basic intention is that when the algorithm jumps over the minimum, Δi is decreased by the η− factor, and if the gradient retains its sign, Δi is accelerated by the η+ increasing factor. The weight update Δwi is decreased if the derivative is positive and increased if the derivative is negative such that:
RProp has an advantage over BP in that it gives equal importance to all weights during learning. In BP, the size of the derivative decreases for weights that are far from the output, so they are modified less than other weights, which makes learning slower. RProp depends on the sign rather than the magnitude of the derivative, which gives equal importance to far away weights. Simulation results demonstrate that RProp is, respectively, four times and one time faster than the popular GD and quick prop (QP). RProp is effective for batch learning, but it does not work with mini-batches. Hinton et al. (2012) noted that RProp does not work with mini-batches because it treats all mini-batches as equivalent. The possibility exists that a weight will grow greatly in a way that is not noticed by RProp. They proposed RMSProp by combining the robustness of RProp, the efficiency of mini-batches and the averaging of gradients over mini-batches. RMSProp is a mini-batch version of RProp that keeps a moving average of the squared gradient such that:
Like βa, βb is a hyperparameter that controls the exponential decay. When the gradient is along the short axis, the moving average is large, which slows the gradient down, whereas along the long axis the moving average is small, which accelerates the gradient. This reduces the depth of the gradient oscillation and moves it faster along the long axis. Another advantage of this algorithm is that a larger learning rate can be used to move more quickly toward the minimum of the loss function. For optimal results, Hinton et al. (2012) recommended using βb=0.9.
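The following simplified sketch (assumed variable names, and without RProp's weight backtracking step) contrasts the three update rules discussed above: momentum keeps a moving average of past weight changes, RProp adapts a per-weight step size from the sign of the gradient only, and RMSProp divides the step by the root of a moving average of squared gradients.

```python
import numpy as np

def momentum_step(w, grad, velocity, lr=0.01, beta_a=0.9):
    velocity = beta_a * velocity + lr * grad              # EMA of weight changes
    return w - velocity, velocity

def rprop_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
               step_min=1e-6, step_max=50.0):
    sign_change = grad * prev_grad
    step = np.where(sign_change > 0, np.minimum(step * eta_plus, step_max), step)
    step = np.where(sign_change < 0, np.maximum(step * eta_minus, step_min), step)
    return w - np.sign(grad) * step, grad, step           # gradient magnitude is ignored

def rmsprop_step(w, grad, sq_avg, lr=0.001, beta_b=0.9, eps=1e-8):
    sq_avg = beta_b * sq_avg + (1 - beta_b) * grad ** 2   # EMA of squared gradient
    return w - lr * grad / (np.sqrt(sq_avg) + eps), sq_avg
```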
In contrast, the Adaptive Gradient Algorithm (AdaGrad) performs informative gradient-based learning by incorporating knowledge of the geometry of the data observed during each iteration (Duchi et al., 2011). Features that occur infrequently are assigned a larger learning rate because they are more informative, while features that occur frequently are given a small learning rate. Standard BP follows a predetermined procedure in which all weights are updated at once with the same learning rate, which makes convergence poor. AdaGrad computes a matrix G as the l2 norm of all previous gradients:
This algorithm is more suitable for highly sparse, high dimensional data. The above equations increase the learning rate for more sparse features and decrease the learning rate for less sparse features. The learning rate is scaled based on G and ∇Ei to give a parameter-specific learning rate. The main drawback of AdaGrad is that it accumulates the squared gradients, which grow after each iteration. This shrinks the learning rate until it becomes infinitesimally small and defeats the purpose of gaining additional information (Zeiler, 2012). Therefore, AdaDelta was proposed, based on ideas extracted from AdaGrad, to improve two main issues in learning: the shrinkage of the learning rate; and the manual selection of the learning rate (Zeiler, 2012). Accumulating the sum of the squared gradients shrinks the learning rate in AdaGrad, and this can be controlled by restricting the accumulation of previous gradients to a fixed window size s. This restriction prevents the accumulated gradient from growing to infinity, so more importance goes to the local gradient. The gradient can be restricted by accumulating an exponentially decaying average of the squared gradient. Assume at iteration i the running average of the squared gradient is E[∇E2]i; then:
Like AdaGrad, taking the square root RMS of the parameter E[∇E2]i:
where the constant
The performance of this method on FNNs is better, with no manual setting of the learning rate, insensitivity to hyperparameters, and robustness to large gradients and noise. However, it requires some extra computation per iteration compared to GD and requires expertise in selecting the best hyperparameters.
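A minimal sketch (not the implementations of Duchi et al., 2011 or Zeiler, 2012; names and defaults are assumptions) contrasting the two rules: AdaGrad accumulates all past squared gradients, which keeps shrinking the step, while AdaDelta replaces the growing sum by decaying averages and removes the manual learning rate.

```python
import numpy as np

def adagrad_step(w, grad, G, lr=0.01, eps=1e-8):
    G = G + grad ** 2                                       # accumulated squared gradients (grows)
    return w - lr * grad / (np.sqrt(G) + eps), G

def adadelta_step(w, grad, avg_sq_g, avg_sq_dw, rho=0.95, eps=1e-6):
    avg_sq_g = rho * avg_sq_g + (1 - rho) * grad ** 2       # decaying average of squared gradients
    dw = -np.sqrt(avg_sq_dw + eps) / np.sqrt(avg_sq_g + eps) * grad
    avg_sq_dw = rho * avg_sq_dw + (1 - rho) * dw ** 2       # decaying average of squared updates
    return w + dw, avg_sq_g, avg_sq_dw
```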
Similarly, the concept of both first and second moments of the gradients was incorporated in Adaptive Moment Estimation (Adam). It requires only first order gradient information and computes individual learning rates for the parameters from exponential moving averages of the first and second moments of the gradients (Kingma and Ba, 2014). For parameter updating, it combines the idea of an exponential moving average of the gradients, as in momentum, with an exponential moving average of the squared gradients, as in RMSProp and AdaGrad. The moving average of the gradient is an estimate of the first moment (the mean) and that of the squared gradient is an estimate of the second moment (the uncentered variance) of the gradient. As with momentum and RMSProp, the first-moment and second-moment estimates can be expressed as in Equations (9) and (14), respectively. These estimates are initialized with vectors of 0s, which makes them biased toward zero during the initial iterations. This can be corrected by computing the bias correction such that:
The parameters are updated by:
The parameter update rule in Adam is based on scaling the gradients to be inversely proportional to the l2 norm of their individual present and past gradients. In AdaMax, the l2 norm based update rule is extended to an lp norm based update rule (Kingma and Ba, 2014). For a more stable solution and to avoid the instability of large p, let p→∞; then:
where Δwid is the exponentially weighted infinity norm. The updated parameter can be expressed as:
The advantages of Adam and AdaMax are that they combine the characteristics of AdaGrad in dealing with sparse gradients and of RMSProp in dealing with non-stationary objectives. Adam and its extension AdaMax require less memory and are suitable for both convex and non-convex optimization problems.
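A minimal sketch (hyperparameter defaults assumed) of the Adam and AdaMax updates described above: Adam combines bias-corrected moving averages of the first and second moments, while AdaMax replaces the l2 norm by an exponentially weighted infinity norm.

```python
import numpy as np

def adam_step(w, grad, m, v, i, lr=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad               # first-moment estimate
    v = b2 * v + (1 - b2) * grad ** 2          # second-moment estimate
    m_hat = m / (1 - b1 ** i)                  # bias corrections (i starts at 1)
    v_hat = v / (1 - b2 ** i)
    return w - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

def adamax_step(w, grad, m, u_inf, i, lr=0.002, b1=0.9, b2=0.999):
    m = b1 * m + (1 - b1) * grad
    u_inf = np.maximum(b2 * u_inf, np.abs(grad))   # exponentially weighted infinity norm
    return w - (lr / (1 - b1 ** i)) * m / u_inf, m, u_inf
```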
4.1.1 Application of optimization algorithms for learning rate
The learning rate optimization algorithms find application in solving complex domain problems. High dimensional data with many features may make a task difficult to solve with an initial guess and manual setting of parameters such as the learning rate. Choosing a large learning rate may make the network unstable, causing poor generalization performance, whereas a small learning rate may reduce the learning speed. The algorithms in this category help to adjust the learning rate on each iteration and move the gradients faster along the long axis of the valley for better performance. Table II lists some of the applications to complex problem solving. AdaGrad solved the problem of newspaper article classification by improving the performance of various categories, on average, by 9–109 percent. Similarly, for the subcategory image classification problem, AdaGrad was able to improve the precision by 2.09–9.8 percent. The graphical representation of the AdaDelta results showed that it attained the highest accuracy on speech recognition of English data. The application of Adam to multiclass logistic regression, multilayer neural networks and convolutional neural networks for the classification of handwritten images, object images and movie reviews graphically showed that Adam has better generalization performance and faster convergence per iteration. The above problems illustrate that optimization algorithms for the learning rate are most suitable for high dimensional, complex data sets where manual adjustment of the learning rate is to be avoided.
4.2 Bias and variance (underfitting and overfitting) minimization algorithms
The best FNN should be able to truly approximate both the training and testing data. Researchers have extensively studied the FNN to avoid high bias and high variance and thereby improve convergence and approximation. High bias is referred to as a problem known as underfitting, and high variance is referred to as overfitting. Underfitting occurs when the network is not properly trained and the patterns in the training data are not fully discovered. It occurs before the convergence point and, in this case, the generalization performance (also known as test data performance) does not deviate much from the training data performance. This is the reason that the underfitting region has high bias and low variance. In contrast, overfitting occurs after the convergence point, when the network is overtrained. In this scenario, the training error keeps decreasing whereas the testing error starts increasing. Overfitting is known as having the properties of low bias and high variance. The high variance between the training and testing errors occurs because the network learns the noise in the training data, which may not be part of the actual model data. The best choice is to select a model which balances bias and variance, which is known as the bias and variance trade-off. The bias and variance trade-off allows the FNN to discover all patterns in the training data and simultaneously give better generalization performance (Geman et al., 1992).
For simplicity, the bias-and-variance trade-off point is referred to as the convergence point. The FNN should be stopped from further training when it approaches the convergence point in order to balance bias (underfitting) and variance (overfitting). The most popular method to identify and stop the FNN at the convergence point is to select suitable validation data. Validation data sets and test data sets are used interchangeably and are unseen data held back from the training data to check the network performance. The error estimated on the training data is not a useful estimator, and comparing its performance with the generalization performance on the validation data set can help to identify the convergence point (Zaghloul et al., 2009). One of the drawbacks associated with the validation technique is that the data set should be large enough to split (Reed, 1993). This makes it unsuitable for small data sets, especially complicated data with few instances and many variables. When sufficient data are not available, the n-fold cross-validation technique can be considered as an alternative to the validation technique (Seni and Elder, 2010). The n-fold cross-validation technique splits the data randomly into n equal-sized subsamples. The FNN is trained n times, allocating one subsample for testing and the remaining n−1 for training the model. Each subsample is allocated exactly once for testing. Finally, the n results are averaged to obtain a single accurate estimate. The suitable selection of the fold size n depends on the data set complexity and network size, which makes this technique challenging for complicated applications.
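A minimal sketch of the n-fold cross-validation procedure described above, assuming NumPy arrays and a hypothetical user-supplied function `train_and_score` that trains an FNN on the training split and returns its test score.

```python
import numpy as np

def n_fold_cv(X, y, train_and_score, n=5, seed=0):
    idx = np.random.default_rng(seed).permutation(len(X))
    folds = np.array_split(idx, n)                      # n roughly equal subsamples
    scores = []
    for k in range(n):
        test = folds[k]                                 # each fold is used once for testing
        train = np.concatenate([folds[j] for j in range(n) if j != k])
        scores.append(train_and_score(X[train], y[train], X[test], y[test]))
    return float(np.mean(scores))                       # averaged estimate over the n folds
```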
The underfitting problem is easily detectable and therefore the literature has focused on proposing techniques to mitigate overfitting. Overfitting is a common problem in nonparametric models and it greatly influences generalization performance. Overfitting mainly results from hyperparameter initialization and adjustment. For instance, the number of hidden units needed for the successful training of the network is not always evident. A large network has a greater possibility of overfitting compared to a smaller one (Setiono and Hui, 1995). Similarly, the chance of overfitting increases when the network degrees of freedom (such as the weights) increase significantly relative to the training sample (Reed, 1993). These two problems have gained greater attention and led to the development of the techniques, explained below, to avoid overfitting analytically rather than by trial and error approaches. The algorithms in this category are subcategorized into regularization, pruning and ensembles. The algorithms and techniques in this category are discussed by considering only their contribution to the FNN. For instance, the dropout technique, as discussed below, is not limited to the FNN but also applies to graphical models such as Boltzmann Machines.
4.2.1 Regularization algorithms
Overfitting occurs when the information in the training examples does not match the network complexity. The network complexity depends upon the free parameters, such as the number of weights and added biases. Often it is desirable to lessen the network complexity by using fewer weights to train the network. This can be done by limiting the weight growth using a technique known as weight decay (Krogh and Hertz, 1992). It penalizes large weights from growing too large by adding a penalty term to the loss function:
where
l1 regularization adds the absolute value, while l2 regularization adds the squared value of the weights to the loss function. Krogh and Hertz's (1992) experimental work demonstrated that an FNN with the weight decay technique can avoid overfitting and improve the generalization performance. The improvement mentioned in their paper was less significant. Several higher-level techniques, as discussed below, were proposed to achieve better results.
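A minimal sketch of the penalized loss discussed above, where a hypothetical hyperparameter `lam` scales the l1 (absolute) or l2 (squared) weight penalty added to the loss E.

```python
import numpy as np

def penalized_loss(E, weights, lam=1e-4, kind="l2"):
    w = np.concatenate([v.ravel() for v in weights])              # all connection weights
    penalty = np.sum(w ** 2) if kind == "l2" else np.sum(np.abs(w))
    return E + lam * penalty                                      # weight decay discourages large weights
```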
A renowned regularization technique, dropout, was proposed for large fully connected networks to overcome the issues of overfitting and the need for FNN ensembles (Srivastava et al., 2014). With a limited data set, the estimate obtained from noisy training data does not approximate the testing data, which leads to overfitting. First, this happens because each parameter during the gradient update influences the other parameters, which creates a complex coadaptation among the hidden units. This coadaptation can be broken by removing some hidden units from the network. Second, during FNN ensembling, combining the average of many outputs of separately trained FNNs is an expensive task. Training many FNNs requires many trial and error approaches to find the best possible hyperparameters, which makes it a daunting task that needs a lot of computational effort. Dropout avoids overfitting by temporarily dropping units (hidden and visible), along with all their connections, with a certain random probability during the network forward steps. Training an FNN with dropout amounts to training 2n thinned networks, where n is the number of units in the FNN. At testing time, it is not recommended to take an average of the predictions from all thinned trained FNNs. It is best to use a single FNN at testing time without dropout. The weights of this single FNN are a scaled down version of all the trained weights. If the unit output is denoted by
where r is a vector of independent Bernoulli random variables and * denotes an element-wise product. The downside of dropout is that its training time is two to three times longer than that of a standard FNN with the same architecture. However, the experimental results achieved by dropout compared to the standard FNN are remarkable. DropConnect, a generalization of dropout, is another regularization technique for large fully connected FNN layers to avoid overfitting (Wan et al., 2013). The key difference between dropout and DropConnect is in their dropping mechanism: dropout temporarily drops the units along with their connection weights, whereas DropConnect randomly drops a subset of the weight connections with a random probability during the forward propagation of the network. The output
Experimental results on MNIST, CIFAR-10, SVHN and NORB demonstrate that DropConnect outperforms no-drop (backpropagation) and dropout. The training time of DropConnect is slightly higher than no-drop and dropout due to a feature extractor bottleneck in large models. The advantage of DropConnect is that it allows training a large network without overfitting, with better generalization performance. The limitation of both dropout and DropConnect is that they are suitable only for fully connected FNN layers. Other regularization techniques include Shakeout, which was proposed to remove unimportant weights to enhance FNN prediction performance (Kang et al., 2017). The performance of the FNN may not be severely affected if most of the unimportant connection weights are removed. One technique is to train a network, prune the connections and fine-tune the weights. However, this technique can be simplified by imposing sparsity-inducing penalties during the training process. During implementation, Shakeout randomly chooses to reverse or enhance each unit's contribution to the next layer in the training forward stage to avoid overfitting. Dropout can be considered a special case of Shakeout obtained by keeping the enhancement factor at one and the reverse factor at zero. Shakeout induces l0 and l1 regularization, which penalizes the magnitude of the weights and leads to sparse weights that truly represent the connections among units. This sparsity-induced penalty (l0 and l1 regularization) is combined with l2 regularization to obtain a more effective prediction. It randomly modifies the weights based on r and can be expressed as:
where
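As an illustration of the two dropping mechanisms (a sketch under assumed shapes, not the reference implementations), dropout masks unit outputs with a Bernoulli vector r, whereas DropConnect masks individual weight entries during forward propagation.

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout_forward(u, keep_prob=0.5):
    r = rng.binomial(1, keep_prob, size=u.shape)   # Bernoulli mask over unit outputs
    return u * r                                   # at test time, weights are scaled instead

def dropconnect_forward(x, W, keep_prob=0.5):
    M = rng.binomial(1, keep_prob, size=W.shape)   # Bernoulli mask over individual weights
    return (W * M) @ x
```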
Besides dropping hidden units or weights, the batch normalization method avoids overfitting and improves the learning speed by preventing the internal covariate shift problem, which occurs due to changes in the parameters during the training process of the FNN (Ioffe and Szegedy, 2015). Updating the parameters during training changes the input distribution of the deeper layers, which slows training and requires more careful hyperparameter initialization. The batch normalization technique eliminates the internal covariate shift and accelerates network training. This simple technique improves the layer distributions by normalizing the activations of each layer to zero mean and unit variance. Unlike BP, where a higher learning rate results in gradient overshoot, divergence or getting stuck in a local minimum, the stable distribution of activation values produced by batch normalization during training can tolerate larger learning rates. Effective results can be achieved by performing the normalization over mini-batches and backpropagating the error through the normalization parameters. The goal of batch normalization during training is to accomplish a stable distribution of activations, which results in either the reduction or elimination of overfitting. This makes it a regularization technique and eliminates the need for dropout in most cases.
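A minimal sketch of the batch normalization transform described above: activations in a mini-batch are normalized to zero mean and unit variance and then scaled and shifted by learnable parameters (called gamma and beta here, following common usage rather than the survey's notation).

```python
import numpy as np

def batch_norm(z, gamma, beta, eps=1e-5):
    mu = z.mean(axis=0)                      # per-feature mini-batch mean
    var = z.var(axis=0)                      # per-feature mini-batch variance
    z_hat = (z - mu) / np.sqrt(var + eps)    # zero mean, unit variance
    return gamma * z_hat + beta              # learnable scale and shift
```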
Similarly, the validation technique and its variants may not capture overfitting, so it might go undetected, which costs more. The optimized approximation algorithm (AOA) avoids overfitting without any need for a separate validation set (Liu et al., 2008). It uses a stopping criterion formulated in the original paper, known as the signal-to-noise ratio figure (SNRF), to detect overfitting during the training error measurement. It calculates the SNRF at each iteration and, if it is less than the defined threshold value, the algorithm is stopped. Experimental work with different numbers of hidden nodes and iterations has validated the AOA's effectiveness.
4.2.2 Pruning FNN
Training a larger network requires more time because of the many hyperparameter adjustments, which can lead to overfitting if not initialized and controlled properly, whereas a smaller network will learn faster but may converge to a poor local minimum or be unable to learn the data. Therefore, it is often desirable to obtain an optimal network size with no overfitting, better generalization performance, fewer hyperparameters and fast convergence. An effective way is to train a network with the largest possible architecture and remove the parts that are insensitive to performance. This approach is known as pruning because it prunes unnecessary units from the large network and reduces its depth to the most optimal possible small network with few parameters. This technique is considered effective in eliminating the overfitting problem and in improving generalization performance but, at the same time, it requires a lot of effort and training time to construct a large network and then reduce it. The simplest pruning procedure involves training a network, pruning the unnecessary parts and retraining. The training and retraining of weights in pruning will substantially increase the FNN training time. The training time can be reduced to some extent by calculating the sensitivity of each weight with respect to the global loss function (Karnin, 1990). The weight sensitivity S can be calculated from the expression:
The idea is to calculate the sensitivity concurrently during the training process without interfering with the learning process. At the end of the training process, the sensitivities are sorted in descending order and weights with lower sensitivity values are pruned. This method has an advantage over traditional approaches in that, instead of training and retraining, which consumes more training time, it prunes in one pass the weights that are insensitive and less important for error reduction. Reed (1993) explained that, besides sensitivity methods, penalty methods (weight decay or regularization) are also a type of pruning. In sensitivity methods, weights or units are removed, whereas in penalty methods, weights or units are driven to zero, which is like removing them from the FNN. Kovalishyn et al. (1998) applied the pruning technique to constructive FNN algorithms and demonstrated that the generalization performance of constructive algorithms (also known as cascade algorithms) improves significantly compared to fixed topology FNNs. Han et al. (2015) demonstrated the effectiveness of the pruning method by performing experimental work on AlexNet and VGG-16. The pruning method reduced the parameters from 61m to 6.37m and from 138m to 10.3m for AlexNet and VGG-16, respectively, without losing accuracy. They used a three-stage iterative pruning technique: learning with a normal FNN, pruning low-weight connections and retraining the pruned network to learn the final weights. Using the pruned weights without retraining may significantly impact the accuracy of the FNN.
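A minimal sketch of magnitude-based pruning in the spirit of the train-prune-retrain procedure above (the pruned fraction is an assumed hyperparameter): connections with the smallest absolute weights are set to zero, after which the surviving weights would be retrained with the mask held fixed.

```python
import numpy as np

def prune_low_weights(W, fraction=0.9):
    threshold = np.quantile(np.abs(W), fraction)   # drop the smallest `fraction` of weights
    mask = np.abs(W) >= threshold
    return W * mask, mask                          # retrain while keeping the mask fixed
```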
4.2.3 Ensembles of FNNs
Another effective method of avoiding overfitting is ensembles of FNNs, which means training multiple networks and combining their predictions, as opposed to a single FNN. The combination is done through averaging in regression and majority voting in classification, or through a weighted combination of FNNs. More advanced techniques include stacking, boosting, bagging and many others. The error and variance obtained by the collective decision of an ensemble are considered to be less than those of the individual networks (Hansen and Salamon, 1990). Many ensembles combine three stages: defining the structure of the FNNs in the ensemble, training each FNN, and combining the results. Krogh and Vedelsby (1995) explained that, for uniform weights, the generalization error of an ensemble is less than the average of the individual network errors. If the ensemble members are strongly biased, the ensemble generalization error will be equal to the individual network error, whereas, if the variance (diversity) among members is high, the ensemble generalization error will be lower than the individual network error. Therefore, the combination reduces the variance (overfitting) by reducing the uncertainty in prediction. Ensembles are best suited to dealing with complex, large problems which are difficult to solve with a single FNN. Islam et al. (2003) explained that the drawback of existing ensembles is that the number of FNNs in the ensemble and the number of hidden units in each individual FNN need to be predefined for ensemble training. This makes them suitable only for applications where rich prior knowledge and FNN experts are available. This traditional method involves a lot of trial and error approaches requiring tremendous effort. The above-mentioned drawbacks can be overcome by the constructive NN ensemble (CNNE), which determines both the number of FNNs in the ensemble and the number of hidden units in individual FNNs based on negative correlation learning. Negative correlation learning promotes diversity among individual FNNs by learning different patterns of the training data, which enables the ensemble to learn better as a whole. The stopping criterion is defined as an increase in the ensemble error rather than an increase in individual error. FNN ensembles are still being used in many applications, with different strategies, by researchers. The above discussion covers some of the techniques for effective ensembles; more can be found in the literature.
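A minimal sketch of combining trained ensemble members as described above, assuming each model exposes a hypothetical `predict` method and that classifiers return integer class labels: predictions are averaged for regression and majority-voted for classification.

```python
import numpy as np

def ensemble_regression(models, X):
    return np.mean([m.predict(X) for m in models], axis=0)        # average of member predictions

def ensemble_classification(models, X):
    votes = np.stack([m.predict(X) for m in models]).astype(int)  # shape (n_models, n_samples)
    counts = np.apply_along_axis(np.bincount, 0, votes, minlength=votes.max() + 1)
    return counts.argmax(axis=0)                                  # majority class per sample
```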
4.2.4 Applications of bias and variance minimization algorithms
The algorithms in this category improve the generalization performance of the network and address applications that cannot be solved by a simple network, where more sophisticated knowledge is needed to avoid overfitting and gain better performance. The regularization, pruning and ensemble methods are implemented by training many networks simultaneously to obtain the average best results. The limitation of this category is that it requires more learning time for convergence compared to standard networks. This limitation can be reduced by adopting a hybrid approach combining learning rate optimization algorithms with this category. Besides, the limitation of regularization techniques such as dropout, DropConnect and Shakeout is that they are suitable only for fully connected neural networks.
Table III shows some of the applications in areas such as classification, recognition, segmentation and forecasting. Shakeout's work on the classification of high dimensional digital images and complex handwritten digit images, comprising more than 50,000 images, improved the accuracy by about 0.95–2.85 percent and 1.63–4.55 percent, respectively, for fully connected neural networks. For classifying more than 50,000 house number images generated from Google Street View, DropConnect slightly enhanced the accuracy, by 0.02–0.03 percent. Similarly, DropConnect's work on 3D object recognition achieved a better accuracy by 0.34 percent. Shakeout and DropConnect show promising results on highly complex data; however, the popularity and success of the dropout technique in many applications is comparatively high. Some applications of dropout include classifying species images into their classes and subclasses, recognizing speech in different dialects, classifying newspaper articles into different categories, detecting human diseases by predicting alternative splicing and classifying hundreds of thousands of images into different categories. Dropout, because of its property of dropping hidden units to avoid overfitting, achieved on average more than 5 percent better accuracy for the mentioned applications.
The application of the ensemble technique CNNE to various problems, such as assessing credit card requests, diagnosing breast cancer, diagnosing diabetes, identifying objects used in conducting crime, categorizing heart diseases, identifying images of English letters and determining the types of defects in soybean, was on average one to three times better in generalization performance compared to other popular ensemble techniques. Some other problems that have gained popularity in this category are predicting energy consumption, estimating the angular acceleration of a robot arm, analyzing semiconductor manufacturing by examining the number of dies, and predicting credit defaulters.
4.3 Constructive topology FNN
FNNs with a fixed number of hidden layers and units are known as fixed-topology networks, which can be either shallow (a single hidden layer) or deep (more than one) depending on the application task, whereas an FNN that starts with the simplest network and then adds hidden units until the error converges is known as a constructive (or cascade) topology network. The drawback associated with fixed topology FNNs is that the hidden layers and units need to be predefined before training initialization. This requires a lot of trial and error to find the optimal number of hidden units in the layers. If too many hidden layers and units are selected, the training time will increase, whereas, if too few layers and hidden units are selected, it might result in poor convergence. Constructive topology networks have the advantage over fixed topologies that they start with a minimal simple network and then add hidden layers with some predefined hidden units until error convergence occurs. This eliminates the need for trial and error approaches and automatically constructs the network. Recent studies have shown that constructive algorithms are more powerful than fixed ones. Hunter et al. (2012), in their comparative study of FNNs, explained that a constructive network can solve the Parity-63 problem with only six hidden units compared to a fixed network, which required 64 hidden units. The training time is proportional to the number of hidden units. Adding more hidden units will cause the network to be slow because of more computational work. The computational training time of constructive algorithms is much better than that of a fixed topology. However, this does not mean that the generalization performance of constructive networks will always be superior to that of a fixed network. A more careful approach is needed to handle hidden units with many connections and parameters, and to stop the addition of further hidden units to avoid decreasing generalization performance (Kwok and Yeung, 1997).
The most popular algorithm for constructive topology networks in the literature is the Cascade-Correlation Neural Network (CCN) (Fahlman and Lebiere, 1990). CCN was developed to address the slowness of BP-based fixed topology FNNs (BPFNN). The factor that contributes to the slowness of BPFNN is that each hidden unit faces a constantly changing environment when all weights in the network are changed collectively. CCN is initialized as a simple network by linearly connecting the input x to the output p by the output connection weights wocw. Most often, the QP learning algorithm is applied for the repetitive tuning of wocw. When training converges and E is still greater than ε, a new hidden unit u is added, receiving input connections wicw from x and from any pre-existing hidden units. The unit u is trained to maximize the magnitude of its covariance Scov with E, as expressed below:
where
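The expression for Scov is not reproduced above; for reference, the covariance objective of the original Cascade-Correlation algorithm (Fahlman and Lebiere, 1990) has the form sketched below, where uh denotes the candidate unit's output for training instance h, eh,o the residual error at output o, and bars denote averages over the training set (this notation is assumed here, not taken from the survey).

```latex
S_{\mathrm{cov}} \;=\; \sum_{o}\Big|\sum_{h}\big(u_{h}-\bar{u}\big)\big(e_{h,o}-\bar{e}_{o}\big)\Big|
```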
The Evolving Cascade Neural Network (ECNN) was proposed to select the most informative features from the data and thereby resolve the issue of overfitting in CCN (Schetinin, 2003). Overfitting in CCN results from noise and redundant features in the training data, which affect its performance. ECNN selects the most informative features by initially adding one input unit, and the network is evolved by adding new input units as well as hidden units. The final ECNN has a minimal number of hidden units and the most informative input features in the network. The selection criterion for a neuron is based on the regularity criterion C extracted from the Group Method of Data Handling algorithm (Farlow, 1981). The higher the value of C, the less informative the hidden unit. The selection criterion for hidden units can be expressed as:
The selection criterion states that if Cr, calculated for the current hidden unit, is less than Cr−1 for the previous hidden unit, then the current hidden unit is more informative and relevant than the previous one and will be added to the network.
Huang et al. (2012) explained that the CCN covariance objective function is maximized by training wicw, which cannot guarantee a maximum error reduction when a new hidden unit is added. This slows down convergence and more hidden units are needed, which reduces the generalization performance. In addition, the repetitive tuning of wocw after each hidden unit addition is time consuming. They proposed an algorithm named the Orthogonal Least Squares-based Cascade Network (OLSCN), which uses an orthogonal least squares technique and derives a new objective function Sols for input training, expressed below:
where γ are the elements of the orthogonal matrix obtained by performing QR factorization of the output of hidden unit r. This objective function is optimized iteratively using a second order algorithm, as expressed below:
where H is the Hessian matrix; I an identity matrix; μ the damping hyperparameter; and gr the gradient of the new objective function Sols with respect to wicw for hidden unit r. The information generated from input training is then used to calculate wocw using the back substitution method for linear equations. The wicw values, after being randomly initialized, are trained based on the above objective function; however, the wocw values are calculated after all u are added and thus there is no need for repeated training. The benefit of OLSCN is that it needs fewer hidden units compared to the original CCN for the same training examples, with some improvement in generalization performance.
In addition, the Faster Cascade Neural Network (FCNN) was proposed to improve the generalization performance of CCN and to address the existing drawbacks of OLSCN: first, the linear dependence of candidate units on the existing u can cause mistakes in the new objective function in OLSCN; second, the modification of wicw by the modified Newton method is based on a second order Hessian H, which may result in a local minimum and slow convergence due to heavy computation (Qiao et al., 2016). In FCNN, hidden nodes are generated randomly and remain unchanged, as inspired by random mapping algorithms. The wocw connecting both input and hidden units to output units are calculated after the addition of all the necessary input and hidden units. FCNN initializes with no input and hidden units in the network. The bias unit is added to the network and input units are added one by one, with the error calculated for each input unit. When there are no more input units to be added, a pool of candidate units is randomly generated. The candidate unit with the maximum capability for error reduction, computed from the reformulated modified index, is added as a hidden unit to the network. When a maximum number of hidden units or a defined threshold is reached, the addition of hidden units is stopped. Finally, the output weights are calculated by back substitution, as in OLSCN. The experimental comparison among FCNN, OLSCN and CCN demonstrated that FCNN achieved better generalization performance and fast learning; however, the network architecture size increases many times.
Nayyeri et al. (2018) proposed using a correntropy-based objective function with a sigmoid kernel, rather than the covariance (correlation) objective function, to adjust the $w_{icw}$ of a newly added hidden unit. The success of correlation relies heavily on Gaussianity and linearity assumptions, whereas correntropy makes the network more robust in nonlinear and non-Gaussian settings. The new algorithm with a correntropy objective function based on a sigmoid kernel is named the Cascade Correntropy Network (CCOEN). As in the other cascade algorithms described above, $w_{icw}$ is optimized using the correntropy objective function $S_{ct}$ with a sigmoid kernel:
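Under the standard empirical correntropy estimator and the sigmoid kernel defined below, the objective would take roughly the form

$$S_{ct} = \frac{1}{N}\sum_{i=1}^{N} \tanh\!\big(a\,\langle e_{r-1}(x_i),\, f_r(x_i)\rangle + c\big),$$

where $e_{r-1}$ denotes the residual error remaining after the previously added hidden units and $f_r$ the output of the new candidate hidden unit; this particular form is an assumption for illustration rather than the verbatim published expression.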
where $\tanh(a\langle\cdot,\cdot\rangle + c)$ is the sigmoid kernel with hyperparameters $a$ and $c$, and $C \in \{0, 1\}$ is a constant: if the new hidden unit and the residual error are orthogonal, $C = 0$ ensures algorithm convergence; otherwise $C = 1$. The $w_{ocw}$ is adjusted as:
An experimental study was performed on regression problems, with and without noise and outliers, comparing CCOEN with six other objective functions defined in Kwok and Yeung (1997), the CCN covariance objective function shown in Equation (35), and a single-hidden-layer FNN. The study demonstrates that the CCOEN correntropy objective function, in most cases, achieved better generalization performance and increased robustness to noise and outliers compared to the other objective functions. The CCOEN network size was slightly less compact than with the other objective functions; however, no specific objective function was found to produce a more compact network in general.
4.3.1 Application of constructive FNN
The benefit of constructive over fixed topology FNNs, which makes them more favorable, is that hidden units are added to each hidden layer until the error converges. This eliminates the need for extensive experimental work to find an optimal network for a fixed topology FNN. The learning speed of constructive algorithms is better than that of fixed topology algorithms; however, the generalization performance is not always guaranteed to be optimal. Table IV shows some of the applications of constructive FNNs.
The application of CCOEN to regression prediction of human body fat, automobile prices, voters, weather, winning teams, strike volumes, earthquake strength, heart diseases, house prices and species age showed that the algorithm gives generalization performance that is, on average, four times better in most cases. CCOEN theoretically guarantees that the solution will be a global minimum; however, the improvement in learning speed is not clear. The prediction accuracy and learning speed of FCNN are considered to be 5 percent better and more than 20 times faster, respectively, on predicting beverage quality, classifying soil into different types, identifying black and white images as English letters, identifying objects used in crime, classifying images into different types of vehicles and segmenting outdoor images. The problems of vowel recognition and car fuel consumption are considered to be better solved by OLSCN: vowel recognition showed an improvement in accuracy of 8.61 percent, and car fuel consumption was estimated with an improvement in performance of 0.15 times. Other problems demonstrating the applications of this category include predicting molecular biological activities, equalizing bursts of bits and classifying artifacts and normal segments in clinical electroencephalograms.
4.4 Metaheuristic search algorithms
The major drawback of the FNN is that it can become stuck at a local minimum with a poor convergence rate, and this becomes worse on plateau surfaces where the rate of change of the error at each iteration is very slow. This increases the learning time and the coadaptation among hyperparameters. Trial and error approaches are therefore applied to find optimal hyperparameters, which makes gradient learning algorithms even more complex and makes it difficult to select the best FNN. Moreover, the unavailability of gradient information in some applications makes the FNN ineffective. To solve these two main issues of the FNN, namely finding the best hyperparameters and making it usable when no gradient information is available, metaheuristic algorithms such as the genetic algorithm (GA), particle swarm optimization (PSO) and the whale optimization algorithm (WOA) are implemented in combination with the FNN. The major contribution of metaheuristic algorithms is that they may converge at a global minimum rather than a local minimum by moving from a local search to a global search; they are therefore more suitable for global optimization problems. Metaheuristic algorithms are good at identifying the best hyperparameters and converging at a global minimum, but their drawbacks are high memory requirements and processing time.
4.4.1 Genetic algorithm (GA)
GA is a metaheuristic technique belonging to the class of evolutionary algorithms inspired by Darwinian theory of evolution and natural selection, and was invented by John Holland in the 1960s (Mitchell, 1998). It finds solutions to optimization problems through chromosome encoding, fitness selection, crossover, and mutation. Like BP, it can be used to train FNNs to find the best possible hyperparameters. The genetic algorithm based FNN (GAFNN) is initialized with a population of several possible FNN solutions with randomly assigned hyperparameters. The hyperparameters are encoded in a chromosome string of 0 and 1 values. The fitness function used to evaluate each chromosome in the population is the reciprocal of the FNN loss function: the higher the fitness value, the lower the loss function and the better the chances of reproduction. Initially, two chromosomes are selected from the population, either those with the highest fitness values or by using selection techniques such as roulette wheel, tournament, steady state and ranked position selection. Reproduction of the selected chromosomes is performed by crossover and mutation. Crossover produces new offspring, and possible techniques include single point crossover, multiple point crossover, shuffle crossover and uniform crossover. Mutation slightly alters the new offspring to create a better string, and possible methods include bit string, flip bit, shift, inverse, and Gaussian mutation. The two newly generated offspring are added to the population, and the selection and reproduction cycle is repeated until a maximum number of generations has been reached; the old population is deleted to create space for new offspring. When the algorithm stops, it returns the chromosome that represents the best hyperparameters for the FNN. The GAFNN computational method explained above is simple and has been adopted by many researchers with different techniques for selection, crossover, and mutation. Researchers have demonstrated that the generalization performance of GA is better than that of BP (Mohamad et al., 2017), at the cost of additional training time. Liang and Dai (1998) highlighted the same issue by applying GA to the FNN: the performance of GA was better than BP, but the training time increased. Ding et al. (2011) proposed that GA performance can be improved by integrating both BP gradient information and GA genetic information to train an FNN for superior generalization performance. GA is good for global searching, whereas BP is appropriate for local searching. The algorithm first uses GA to optimize the initial weights by searching for a better search space and then uses BP to fine tune and search for an optimal solution. Experimental results demonstrated that an FNN trained with the hybrid approach of GA and BP achieved better generalization performance than either BPFNN or GAFNN alone. Moreover, in terms of learning speed, the hybrid approach is faster than GAFNN but still slower than BPFNN. A minimal sketch of the GAFNN loop is given below.
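As a concrete illustration of the GAFNN loop described above, the following Python sketch evolves the flattened weight vector of a one-hidden-layer FNN using roulette-wheel selection, single-point crossover and Gaussian mutation. The real-valued encoding (in place of the binary string described above), the fixed mutation rate, the population size and all names are illustrative assumptions rather than a published implementation.

```python
import numpy as np

def fnn_loss(weights, X, y, hidden=5):
    """Mean squared error of a one-hidden-layer FNN with flattened weights."""
    d = X.shape[1]
    W1 = weights[: d * hidden].reshape(d, hidden)          # input-to-hidden weights
    w2 = weights[d * hidden : d * hidden + hidden]         # hidden-to-output weights
    return np.mean((np.tanh(X @ W1) @ w2 - y) ** 2)

def ga_train(X, y, hidden=5, pop_size=30, generations=100, seed=0):
    rng = np.random.default_rng(seed)
    n_weights = X.shape[1] * hidden + hidden
    pop = rng.standard_normal((pop_size, n_weights))       # random initial population
    for _ in range(generations):
        # Fitness is the reciprocal of the loss, as described in the text.
        fitness = np.array([1.0 / (1e-9 + fnn_loss(c, X, y, hidden)) for c in pop])
        probs = fitness / fitness.sum()                    # roulette-wheel selection
        parents = pop[rng.choice(pop_size, size=pop_size, p=probs)]
        point = rng.integers(1, n_weights)                 # single-point crossover
        children = parents.copy()
        children[::2, point:], children[1::2, point:] = (
            parents[1::2, point:], parents[::2, point:])
        mutate = rng.random(children.shape) < 0.02         # Gaussian mutation
        children[mutate] += rng.standard_normal(mutate.sum()) * 0.1
        pop = children                                      # replace old population
    fitness = np.array([1.0 / (1e-9 + fnn_loss(c, X, y, hidden)) for c in pop])
    return pop[np.argmax(fitness)]                          # best chromosome found
```

In practice, the returned weight vector would be decoded back into the FNN and, as in the hybrid approach of Ding et al. (2011), could then be fine-tuned with BP.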
4.4.2 Particle swarm optimization (PSO)
Eberhart and Kennedy (1995) argued that the crossover operator of a GA used to find hyperparameters may not be suitable: the two chromosomes selected for their high fitness values might be very different from one another, so reproduction will not be effective. Particle swarm optimization (PSO) based FNNs (PSOFNN) can overcome this issue by searching for the best solution through the movement of particles in space, a concept motivated by the flocking of birds. As with GA, each particle in PSO can be considered a candidate set of hyperparameters to be optimized through global rather than local search. PSO is an iterative algorithm in which a particle $w$ adjusts its velocity $v$ at each iteration based on its momentum, the best position it has achieved so far, $p_{pbest}$, and the best position achieved by the global search, $p_{gbest}$. This can be expressed mathematically as:
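A sketch of the update, consistent with the terms defined below (standard PSO typically draws independent random numbers and uses separate acceleration constants for the two attraction terms), is:

$$v_{t+1} = v_t + c\,r\,(p_{pbest} - w_t) + c\,r\,(p_{gbest} - w_t), \qquad w_{t+1} = w_t + v_{t+1}.$$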
where $r$ is a random number in [0, 1] and $c$ is an acceleration constant. More robust generalization may be achieved by balancing the global and local search, with PSO modified to Adaptive PSO (APSO) by multiplying the current velocity by an inertia weight $w_{inertia}$ such that (Shi and Eberhart, 1998):
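With the inertia weight, the velocity update becomes (again as a sketch consistent with the description rather than a verbatim reproduction):

$$v_{t+1} = w_{inertia}\,v_t + c\,r\,(p_{pbest} - w_t) + c\,r\,(p_{gbest} - w_t).$$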
Zhang et al. (2007) argued that PSO converges quickly during the global search but slows down around the global optimum, whereas BP converges quickly at a global optimum because of its efficient local search. The speed of APSO in the global search at the initial stages of learning, combined with the local search strength of BP, motivates a hybrid approach known as particle swarm optimization backpropagation (PSO-BP) for training the FNN hyperparameters. Experimental results indicate that the generalization performance and learning speed of PSO-BP are better than those of both PSOFNN and BPFNN. The critical point in PSO-BP is deciding when to shift from PSO to BP during learning. This can be done heuristically: if a particle has not changed for many iterations, learning should be shifted to gradient-based BP. One of the difficulties associated with PSO algorithms is the selection of optimal hyperparameters such as $r$, $c$ and $w_{inertia}$; the performance of the PSO algorithm is highly influenced by hyperparameter adjustment and is currently being investigated by many researchers. Shaghaghi et al. (2017) presented a comparative analysis of a GMDH neural network optimized by GA and PSO and concluded that GA is more efficient than PSO. However, in the literature, the performance advantage of GAFNN over PSOFNN, or vice versa, is still unclear.
4.4.3 Whale optimization algorithm (WOA)
The applications of GA and APSO have been investigated widely in the literature; however, according to the no free lunch (NFL) theorem, no single algorithm is superior for all optimization problems. The NFL theorem motivated the WOA, which was proposed to address the problems of local minima, slow convergence and dependency on hyperparameters. The WOA is inspired by the bubble-net hunting strategy of humpback whales, which trap prey with bubbles while moving around it in a spiral path (Mirjalili and Lewis, 2016). This behavior can be expressed mathematically as:
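In the standard formulation of Mirjalili and Lewis (2016), which is consistent with the parameter definitions that follow, the position $\vec{X}$ of a search agent is updated relative to the best solution found so far, $\vec{X}^{*}$, as:

$$\vec{X}(t+1) = \begin{cases} \vec{X}^{*}(t) - \vec{A}\cdot\vec{D}, & p < 0.5 \\[4pt] \vec{D}'\, e^{bl}\cos(2\pi l) + \vec{X}^{*}(t), & p \ge 0.5 \end{cases}$$

with $\vec{D} = |\vec{C}\cdot\vec{X}^{*}(t) - \vec{X}(t)|$ and $\vec{D}' = |\vec{X}^{*}(t) - \vec{X}(t)|$; when $|A| \ge 1$, a randomly chosen agent replaces $\vec{X}^{*}$ to encourage exploration.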
where $A = 2ar - a$ and $C = 2r$, with $a$ linearly decreased from 2 to 0 and $r$ a random number in [0, 1]; $b$ is a constant that defines the shape of the spiral; $l$ is a random number in [−1, 1]; and $p$ is a random number in [0, 1].
Ling et al. (2017) noted that the WOA, owing to its better performance compared with other known metaheuristics, has not only gained the attention of researchers but has also been increasingly applied in areas such as neural networks and optimization. However, it has the disadvantages of slow convergence, low precision and being trapped in local minima because of a lack of population diversity. They argued that the WOA is more useful for low dimensional problems, whereas for high dimensional and multi-modal problems its solutions are not quite optimal. To overcome these drawbacks, an extension known as the Lévy flight WOA was proposed to enhance convergence by avoiding local minima. The findings suggest that adopting the Lévy flight trajectory in the WOA yields a better trade-off between exploration and exploitation and leads to global optimization.
4.4.4 Application of metaheuristic search algorithms
The basic purpose of metaheuristic search algorithms is to serve applications with incomplete gradient information. However, their success in many problems has attracted the attention of researchers, and at present their application is not limited to incomplete gradient information but extends to other problems of searching for the best hyperparameters that can converge at a global minimum. Table V shows some of the applications researchers have used to demonstrate the effectiveness of metaheuristic search algorithms. Mohamad et al. (2017) recommended GAFNN as a reliable technique for solving complex problems in the field of excavatability: their work on predicting ripping production, from experimentally collected inputs such as the weather zone, joint spacing, point load strength index and sonic velocity, shows that GAFNN achieved generalization performance 0.42 times better than the FNN. Shaghaghi et al. (2017) applied a GA-optimized neural network to estimate the width of a river and found a 0.40 times better generalization performance; they also reported that neural networks optimized by GA are more efficient than those optimized by PSO. Ding et al. (2011), working on predicting flower species and beverage types, identifying objects used in crime and classifying images, showed that the hybrid approach of GA and BP achieved a prediction accuracy on average 2–3 percent better than GA and BP. The study further explained that the accuracy of BP on these problems is better than that of GA. Furthermore, it was concluded that GA needs more learning time than BP, and this learning speed can be improved to some extent by the hybrid approach: the hybrid approach achieved a learning speed 0.05 times better than GA but was still slower than BP.
5. Discussion of future research directions
The classification of learning algorithms and optimization techniques into six categories provides insight that the trend in FNN research is changing with time. From a broad perspective, the categories themselves can be considered large research gaps, and further improvements may bring significant contributions. Based on the extensive review and discussion of the optimization algorithms in the current paper, Part II, the authors recommend four new future research directions to add to the existing literature.
5.1 Data structure
Future research may include designing FNN architectures of lower complexity that can learn from noisy data without overfitting. Few attempts have been made to study the effect of data size and features on FNN algorithms. The availability of training data in many applications (for instance, medical science) is limited (small data rather than big data) and costly (Choi et al., 2018; Shen, Choi and Minner, 2019). Once trained, the same model may become unstable for smaller or larger data sets within the same application and may lose generalization ability. Each time, for the same application area and algorithm, a new model needs to be trained with a different set of parameters. Future research on designing an algorithm that can approximate the problem task equally well, regardless of data size (instances) and shape (features), would be a breakthrough achievement.
5.2 Eliminating the need for data pre-processing and transformation
Good results from machine learning algorithms depend heavily on the data pre-processing and transformation steps. In pre-processing, various algorithms/techniques are applied to clean the data and reduce noise, outliers and missing values, whereas in transformation, various algorithms/techniques are applied to convert the data, by encoding or standardization, into formats and forms appropriate for machine learning. Both steps are performed before feeding data into the FNN and are adopted in many algorithms. Insufficient knowledge or inappropriate application of pre-processing and transformation techniques may lead the algorithm to wrong conclusions. Future algorithms may be designed to be less sensitive to noise, outliers and missing values, and to not require feature magnitudes to be rescaled to a common range.
5.3 A hybrid approach for real-world applications
Researchers demonstrate the effectiveness of a proposed algorithm on benchmark problems, on real-world problems, or on a combination of both. The study of the applications across all six categories clearly indicates that researchers' interest in demonstrating proposed algorithms on real-world problems is rapidly increasing. Traditionally, in 2000 and earlier, experimental work demonstrating the effectiveness of proposed algorithms was limited to artificial benchmark problems; the possible reasons were the unavailability of database sources and user reluctance to use the FNN. The literature survey shows that nowadays researchers most often use real-world data to ensure consistency and to compare their results with other popular published algorithms. Frequently using the same real-world application data may turn specific data sets into de facto benchmarks. The same issue has been observed in all categories, especially the second. This should be avoided because such data may become unsuitable for practical application with the passage of time: the role of high dimensional, nonlinear, noisy and unbalanced big data is changing at a rapid rate. The best practice may be to use a hybrid approach of well-known data in the field along with new application data during comparative studies of algorithms. This may increase and maintain user interest in the FNN over time.
5.4 Hidden units’ analytical calculation
The successful application of learning and regularization algorithms to complex high dimensional big data application areas, involving more than 50,000 instances, to improve the convergence of deep neural networks is noteworthy. The maximum error reduction property of hidden units in deep neural networks depends on the connection weights and the choice of activation function. More research is needed on the analytical calculation of hidden units to avoid the trial and error approach. Calculating the optimal number of hidden units and connection weights for single or multiple hidden layers, such that no linear dependence exists, may help to achieve better and more stable generalization performance. Similarly, future research may give users clearer direction on the application of different activation functions for enhancing business decisions.
6. Conclusions
The FNN is gaining popularity because of its ability to solve complex high dimensional nonlinear problems accurately. The growing and varying nature of big data demands the design of compact and efficient learning algorithms and optimization techniques for the FNN. Traditional FNNs are far slower and may not be suitable for many applications. The drawbacks that may affect the generalization performance and learning speed of FNNs include local minima, saddle points, plateau surfaces, tuning connection weights, deciding the number of hidden units and layers, and many others. Inappropriate user expertise and insufficient theoretical information may make these drawbacks worse. A great deal of trial and error work may be needed to obtain an optimal network by varying hidden units and layers, changing connection weights, altering the learning rate and many other hyperparameters. This raises important questions that need to be resolved before applying the FNN. A comprehensive literature review was carried out to understand the reasons behind the FNN drawbacks, and efforts were made to answer these questions by studying the merits, technical limitations and applications of learning algorithms and optimization techniques. Selected databases were searched with popular keywords, and the screened articles were classified into six broad categories: gradient learning algorithms for network training; gradient free learning algorithms; optimization algorithms for learning rate; bias and variance (underfitting and overfitting) minimization algorithms; constructive topology FNN; and metaheuristic search algorithms. Reviewing and discussing all six categories in one paper would make it too long; therefore, the authors divided the six categories into two parts (i.e. Part I and Part II). Part I mainly reviews the two categories explaining the learning algorithms, and Part II reviews the remaining four categories, explaining the optimization techniques in detail. The main conclusions related to the six categories are:
First order gradient learning algorithms may be more time consuming than second order algorithms, and the increased number of iterations of a first-order gradient may result in the network becoming stuck at a local minimum. Second order algorithms are considered more efficient than first order algorithms, with the additional need for computational memory to store the Hessian matrix. However, second order algorithms may not be suitable for all types of applications; for instance, the Levenberg-Marquardt (LM) gradient algorithm is considered faster than GD, with the limitation that it can only be applied with the least squares loss function and a fixed topology FNN. The wide-ranging applications of gradient learning algorithms to facilitate business intelligence in improving business decisions are explored in Part I.
The use of gradient information may make the task difficult to solve, and a great deal of experimental work may be needed to search for an optimal network, which is time-consuming. This can be improved by randomly generating hidden units and analytically calculating the connection weights using gradient free learning algorithms. The better learning speed and, in most cases, generalization performance of gradient free learning algorithms are evident. However, network complexity may increase because more hidden units are needed than in the compact networks of gradient learning algorithms, which may increase the chances of overfitting. The wide-ranging applications of gradient free learning algorithms to facilitate business intelligence in improving business decisions are explored in Part I.
Finding a minimum of the loss function by taking steps proportional to the negative of the gradient is influenced by the learning rate. Various proposed algorithms increase convergence by assigning a higher learning rate in the initial stages of training and a lower learning rate as the minimum of the function is approached. Traditionally, the learning rate was fixed to a certain value based on user expertise, which caused many difficulties. Keeping the learning rate at a low value is better for descending a slope and reaching the minimum of the function, but it can also mean converging very slowly and becoming stuck in a plateau region; a large learning rate, in contrast, makes the gradient oscillate along the long and short axes of the minimum error valley, reducing movement along the long axis, which points toward the minimum. Learning rate optimization algorithms damp the oscillation along the short axis while accumulating contributions along the long axis, moving the gradient in larger steps toward the minimum (the classical momentum update sketched below is one example). The applications in this category include problems with high dimensional complex data, where adjusting the learning rate to stabilize the network improved prediction accuracy by 2.09 to 109 percent.
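As one example from this category, the classical momentum update accumulates past gradients so that components along the long axis reinforce while oscillating components along the short axis cancel (a standard formulation, with symbols chosen here for illustration):

$$v_{t+1} = \alpha v_t - \eta \nabla E(w_t), \qquad w_{t+1} = w_t + v_{t+1},$$

where $\eta$ is the learning rate and $\alpha$ the momentum coefficient.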
The regularities in training examples and the generalization performance on unseen data can be improved by a bias and variance trade-off. Underfitting the network results in higher bias and low variance, in that it does not discover regularities in the training data, whereas overfitting results in low bias and high variance between the training and testing data sets. Overfitting decreases the generalization performance because the network complexity (the number of connection weights) grows relative to the number of training examples. It is therefore often desirable to keep the network simple by keeping the number of weights small and limiting their growth with suitable weight decay algorithms (one standard formulation is sketched below). Other ways to simplify the network include limiting the number of hidden units, normalizing the input distribution to hidden units, pruning, and ensembles. The major applications include classification, recognition, segmentation, and forecasting. These algorithms achieved accuracy improvements of 0.02–5 percent when predicting high dimensional data with more than 50,000 instances. The limitation of this category is that it requires more learning time than standard networks to converge; a possible solution is to combine learning rate optimization algorithms with this category to increase learning speed.
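For instance, $\ell_2$ weight decay limits weight growth by adding a penalty on the weight magnitudes to the loss function (a standard formulation shown for illustration, not a specific published variant):

$$E_{reg}(w) = E(w) + \frac{\lambda}{2}\sum_{j} w_j^{2},$$

where $\lambda$ controls how strongly large weights are penalized.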
The issue with a fixed topology FNN is that it requires many experimental trials to determine the number of hidden units in the hidden layer. For deep architectures, such trials are even more time consuming, and it is difficult to determine the ideal combination of hidden units across multiple hidden layers. Constructive topology FNN algorithms have an advantage over fixed topology in that they determine the network topology by adding hidden units step by step to the hidden layers. This category was able to improve regression performance by 0.5–4 times and classification prediction accuracy by 5–8.61 percent, with an average learning speed 20 times faster, in solving various real-world problems.
The unavailability of gradient information, the problem of local minima, and the difficulty of finding optimal hyperparameters can make the FNN ineffective. Metaheuristic global optimization algorithms in combination with the FNN can overcome such problems by assisting the search for a global minimum of the loss function with the best hyperparameters, but at the cost of additional memory and learning time. Applications of metaheuristic algorithms to regression problems show a 0.40 times better performance, whereas for classification problems the prediction accuracy is enhanced by 2–3 percent. In most cases, the learning speed of metaheuristic algorithms is lower than that of the traditional FNN; however, it can be improved to some extent by using a hybrid approach of metaheuristic and gradient learning algorithms to train the FNN.
The categories explain the noteworthy contributions made by researchers to improving the generalization performance and learning speed of FNNs. The study shows a major change in research trends, and future research surveys may add knowledge to the existing categories or identify new categories by studying further algorithms. The successful application of FNN learning algorithms and optimization techniques to real-world management, engineering, and health sciences problems demonstrates their advantages in enhancing decision making for practical operations. Based on the existing literature and research trends, the authors suggest a total of seven research directions (Parts I and II) that can contribute to enhancing existing knowledge. The three future research directions suggested in Part I were studying the role of various activation functions, designing efficient and compact algorithms with fewer hyperparameters, and determining optimal connection weights. The four new future research directions proposed in the current paper (Part II) are: designing algorithms able to approximate any problem task equally well regardless of data size (instances) and shape (features); eliminating the need for data pre-processing and transformation; using a hybrid approach of well-known data and new data to demonstrate the effectiveness of algorithms; and analytically calculating hidden units.
This work was supported by a grant from the Research Committee of The Hong Kong Polytechnic University under the account code RLKA, and supported by RGC (Hong Kong) – GRF, with the Project Number: PolyU 152131/17E.
Distribution of algorithms by category
Algorithms proposed over time
Distribution of papers by category
Feedforward neural network with three layers (input, hidden and output)
Classification of published FNN algorithms
| No. | Category | Algorithms published | References |
|---|---|---|---|
| 1 | Gradient learning algorithms for network training | Gradient descent, stochastic gradient descent, mini-batch gradient descent, Newton method, Quasi-Newton method, conjugate gradient method, Quickprop, Levenberg-Marquardt Algorithm, Neuron by Neuron | Hecht-Nielsen (1989), Bianchini and Scarselli (2014), LeCun et al. (2015), Wilamowski and Yu (2010), Rumelhart et al. (1986)*, Wilson and Martinez (2003), Wang et al. (2017), Hinton et al. (2012)*, Ypma (1995), Zeiler (2012)*, Shanno (1970), Lewis and Overton (2013), Setiono and Hui (1995)*, Fahlman (1988), Hagan and Menhaj (1994), Wilamowski et al. (2008), Hunter et al. (2012)* |
| 2 | Gradient free learning algorithms | Probabilistic Neural Network, General Regression Neural Network, Extreme learning machine (ELM), Online Sequential ELM, Incremental ELM (I-ELM), Convex I-ELM, Enhanced I-ELM, Error Minimized ELM (EM-ELM), Bidirectional ELM, Orthogonal I-ELM (OI-ELM), Driving Amount OI-ELM, Self-adaptive ELM, Incremental Particle Swarm Optimization EM-ELM, Weighted ELM, Multilayer ELM, Hierarchical ELM, No propagation, Iterative Feedforward Neural Networks with Random Weights | Huang et al. (2015), Ferrari and Stengel (2005), Specht (1990), Specht (1991), Huang, Zhu and Siew (2006), Huang, Zhou, Ding and Zhang (2012), Liang et al. (2006), Huang, Chen and Siew (2006), Huang and Chen (2007), Huang and Chen (2008), Feng et al. (2009), Yang et al. (2012), Ying (2016), Zou et al. (2018), Wang et al. (2016), Han et al. (2017), Zong et al. (2013), Kasun et al. (2013), Tang et al. (2016), Widrow et al. (2013), Cao et al. (2016) |
| 3 | Optimization algorithms for learning rate | Momentum, Resilient Propagation, RMSprop, Adaptive Gradient Algorithm, AdaDelta, Adaptive Moment Estimation, AdaMax | Gori and Tesi (1992), Lucas and Saccucci (1990), Rumelhart et al. (1986)*, Qian (1999), Riedmiller and Braun (1993), Hinton et al. (2012)*, Duchi et al. (2011), Zeiler (2012)*, Kingma and Ba (2014) |
| 4 | Bias and variance (underfitting and overfitting) minimization algorithms | Validation, n-fold cross-validation, weight decay (l1 and l2 regularization), Dropout, DropConnect, Shakeout, Batch normalization, Optimized approximation algorithm (Signal to Noise Ratio Figure), Pruning Sensitivity Methods, Ensembles Methods | Geman et al. (1992), Zaghloul et al. (2009), Reed (1993), Seni and Elder (2010), Setiono and Hui (1995)*, Krogh and Hertz (1992), Srivastava et al. (2014), Wan et al. (2013), Kang et al. (2017), Ioffe and Szegedy (2015), Liu et al. (2008), Karnin (1990), Kovalishyn et al. (1998)*, Han et al. (2015), Hansen and Salamon (1990), Krogh and Vedelsby (1995), Islam et al. (2003) |
| 5 | Constructive topology neural networks | Cascade Correlation Learning, Evolving cascade Neural Network, Orthogonal Least Squares-based Cascade Network, Faster Cascade Neural Network, Cascade Correntropy Network | Hunter et al. (2012)*, Kwok and Yeung (1997), Fahlman and Lebiere (1990), Lang (1989), Hwang et al. (1996), Lehtokangas (2000), Kovalishyn et al. (1998)*, Schetinin (2003), Farlow (1981), Huang, Song and Wu (2012), Qiao et al. (2016), Nayyeri et al. (2018) |
| 6 | Metaheuristic search algorithms | Genetic algorithm, Particle Swarm Optimization (PSO), Adaptive PSO, Whale optimization algorithm (WOA), Lévy flight WOA | Mitchell (1998), Mohamad et al. (2017), Liang and Dai (1998), Ding et al. (2011), Eberhart and Kennedy (1995), Shi and Eberhart (1998), Zhang et al. (2007), Shaghaghi et al. (2017), Mirjalili and Lewis (2016), Ling et al. (2017) |
Applications of optimization algorithms for learning rate
| Application | Description |
|---|---|
| Game decision rule | Predicting decision rules for the Nine Men’s Morris game (Riedmiller and Braun, 1993) |
| Census | Predicting whether the individual has income above or below the average income based on certain demographic and employment-related information (Duchi et al., 2011) |
| Newspaper articles | Classifying newspaper articles into four major categories such as economics, commerce, medical, and government, with multiple more specific categories (Duchi et al., 2011) |
| Subcategories image classification | Classifying thousands of images in each of their individual subcategory (Duchi et al., 2011) |
| Handwritten images classification | Classifying images of the handwritten digits (Duchi et al., 2011; Kingma and Ba, 2014; Zeiler, 2012) |
| English data recognition | Recognizing speech from several hundred hours of the US English data collected from voice IME, voice search and read data (Zeiler, 2012) |
| Movie reviews | Classifying movie reviews as either positive or negative to know the sentiment of the reviewers (Kingma and Ba, 2014) |
| Digital images classification | Classifying the images into one of several categories such as an airplane, deer, ship, frog, horse, truck, cat, dog, bird and automobile (Kingma and Ba, 2014) |
Applications of bias and variance minimization algorithms
| Application | Description |
|---|---|
| Handwritten images classification | Classifying the images of hand-written numerals with 20% flip rate (Geman et al., 1992) |
| Text to speech conversion | Learning to convert English text to speech (Krogh and Hertz, 1992) |
| Credit card | Deciding to approve or reject a credit card request based on the available information such as credit score, income level, gender, age, sex, and many others (Islam et al., 2003) |
| Breast cancer | Diagnosing breast cancer as a malignant or benign based on the feature extracted from the cell nucleus (Islam et al., 2003) |
| Diabetes | Diagnosing whether the patient has diabetes based on certain diagnostic measurements (Islam et al., 2003) |
| Crime | Identifying glass type used in crime scene based on chemical oxide content such as sodium, potassium, calcium, iron and many others (Islam et al., 2003) |
| Heart diseases | Diagnosing and categorizing the presence of heart diseases in a patient by studying the previous history of drug addiction, health issues, blood tests and many others (Islam et al., 2003) |
| English letters | Identifying a black and white image as one of the 26 English capital letters (Islam et al., 2003) |
| Soybean defects | Determining the type of defect in soybean based on physical characteristics of the plant (Islam et al., 2003) |
| Energy consumption | Predicting the hourly consumption of building electricity and associated cost based on environmental and weather conditions (Liu et al., 2008) |
| Robot arm acceleration | Estimating the angular acceleration of the robot arm based on a position, velocity, and torque (Liu et al., 2008) |
| Semiconductor manufacturing | Analyzing semiconductor manufacturing by examining the number of dies in a wafer that pass electrical tests (Seni and Elder, 2010) |
| Credit defaulter | Predicting credit defaulters based on the credit score information (Seni and Elder, 2010) |
| 3D object recognition | Recognizing 3D objects by classifying the images into generic categories (Wan et al., 2013) |
| Google street images classification | Classifying the images containing information of house numbers collected by Google Street View (Srivastava et al., 2014; Wan et al., 2013) |
| Class and superclass | Classifying the images into class (e.g. shark) and their superclass (e.g. fish) (Srivastava et al., 2014) |
| Speech recognition | Recognizing speech from different dialects of the American language (Srivastava et al., 2014) |
| Newspaper articles | Classifying newspaper articles into major categories such as finance, crime, and many others (Srivastava et al., 2014) |
| Ribonucleic acid | Understanding human disease by predicting alternative splicing based on ribonucleic acid features (Srivastava et al., 2014) |
| Subcategories image classification | Classifying thousands of images in each of their individual subcategory (Han et al., 2015; Ioffe and Szegedy, 2015; Srivastava et al., 2014) |
| Handwritten digits classification | Classifying images of the handwritten digits (Han et al., 2015; Kang et al., 2017; Srivastava et al., 2014; Wan et al., 2013) |
| Digital images classification | Classifying the images into one of several categories such as an airplane, deer, ship, frog, horse, truck, cat, dog, bird and automobile (Kang et al., 2017; Srivastava et al., 2014; Wan et al., 2013) |
Applications of constructive algorithms
| Application | Description |
|---|---|
| Biological activity | Predicting the biological activity of molecules such as benzodiazepine derivatives with anti-pentylenetetrazole activity, antimycin analogs with antifilarial activity and many others from molecular structure and physicochemical properties (Kovalishyn et al., 1998) |
| Communication channel | Equalizing bursts of bits transferred through a communication channel (Lehtokangas, 2000) |
| Clinical electroencephalograms | Classifying artifacts and a normal segment in clinical electroencephalograms (Schetinin, 2003) |
| Vowel recognition | Recognizing vowels of the same or different languages in speech (Huang, Song and Wu, 2012) |
| Cars fuel consumption | Determining the fuel consumption of cars in terms of engine specification and car characteristics (Huang, Song and Wu, 2012) |
| Beverages quality | Determining the quality of the same class of beverages based on relevant ingredients (Huang, Song and Wu, 2012; Qiao et al., 2016) |
| House price | Estimating the price of houses based on the availability of clean quality air (Huang, Song and Wu, 2012; Nayyeri et al., 2018) |
| Species | Determining the age of species from their known physical measurements (Huang, Song and Wu, 2012; Nayyeri et al., 2018) |
| Soil classification | Classifying images according to soil type, such as gray soil, vegetation soil, red soil and many others, based on a database of multi-spectral images (Qiao et al., 2016) |
| English letters | Identifying a black and white image as one of the 26 English capital letters (Qiao et al., 2016) |
| Crime | Identifying glass type used in crime scene based on chemical oxide content such as sodium, potassium, calcium, iron and many others (Qiao et al., 2016) |
| Silhouette vehicle images classification | Classifying images into different types of vehicles based on features extracted from the silhouette (Qiao et al., 2016) |
| Outdoor objects segmentation | Segmenting the outdoor images into many different classes such as window, path, sky and many others (Huang, Song and Wu, 2012; Qiao et al., 2016) |
| Human body fats | Determining the percentage of human body fats from key physical factors such as weight, age, chest size and other body parts circumference (Nayyeri et al., 2018) |
| Automobile prices | Determining the prices of automobiles based on various specifications, the degree to which an auto is riskier than its price suggests, and the average loss per auto per year (Nayyeri et al., 2018) |
| Presidential election | Estimating the proportion of voters in the presidential election based on key factors such as education, age and income (Nayyeri et al., 2018) |
| Weather forecasting | Forecasting weather in terms of cloud appearance (Nayyeri et al., 2018) |
| Basketball winning | Predicting basketball winning team based on players, team formation and actions information (Nayyeri et al., 2018) |
| Industrial strike volume | Estimating the industrial strike volume for the next fiscal year considering key factors such as unemployment, inflation and labor unions (Nayyeri et al., 2018) |
| Earthquake strength | Forecasting the strength of earthquake given its latitude, longitude and focal point (Nayyeri et al., 2018) |
| Heart diseases | Diagnosing and categorizing the presence of heart diseases in a patient by studying the previous history of drug addiction, health issues, blood tests and many others (Nayyeri et al., 2018) |
Applications of Metaheuristic search algorithms
| Application | Description |
|---|---|
| Ripping production | Predicting ripping production, used as an alternative to blasting for ground loosening and breaking in mining and civil engineering, from experimental collected input such as weather zone, joint spacing, point load strength index and sonic velocity (Mohamad et al., 2017) |
| Crime | Identifying glass type used in crime scene based on chemical oxide content such as sodium, potassium, calcium, iron and many others (Ding et al., 2011) |
| Flowers species | Classifying the flowers into different species from available information on the width and length of petals and sepals (Ding et al., 2011) |
| River width | Estimating the width of a river, to minimize erosion and deposition, from the fluid discharge rate, river bed sediments, Shields parameter and many others (Shaghaghi et al., 2017) |
| Beverages | Identifying the type of beverages in terms of their physical and chemical characteristics (Ding et al., 2011) |
| Silhouette vehicle images classification | Classifying images into different types of vehicles based on features extracted from the silhouette (Ding et al., 2011) |
