Optimization of Deep Neural Networks Using a

Full text

Turn on search term navigation

1. Introduction

Machine learning is a branch of artificial intelligence dedicated to the development of specialized algorithms for constructing complex mathematical models from large volumes of data [1].

Traditionally, feature extraction for various machine learning tasks has been performed through human engineering, where a designer manually identifies a set of descriptors for a specific problem (hand-crafted features). However, in the paradigm known as deep learning, the set of descriptors is automatically learned. Thus, deep learning is a branch of machine learning in which a computer learns through hierarchies of concepts at distinct levels of abstraction. This approach reduces the need for domain experts to manually extract attributes from datasets [2].

In recent years, deep learning models have achieved high performance on complex classification problems [3]. In particular, the convolutional neural network (CNN) represents the most widely used deep learning model to address computer vision tasks. Examples of such applications include object detection, semantic segmentation, and medical image classification, among others [4,5,6].

A CNN is an artificial neural network (ANN) designed to process data with a grid-like topology (e.g., images). This type of network is composed of two main parts: the first is a feature (attribute) extractor that performs convolution and pooling operations on the input data. The second part consists of one or more fully connected (FC) layers, in which neurons connect to all neurons in the previous layer to solve a supervised learning problem (e.g., classification).

Learning in CNNs requires a massive amount of training data, which poses several challenges. First, training a CNN architecture involves a high computational cost [7]. Additionally, data collection problems can also arise, which is a limitation for various real-world applications where a sufficiently large training dataset is often not available [8].

1.1. Transfer Learning

The paradigm known as transfer learning (TL) has been introduced in recent years to try to address these challenges. The main idea is that a model obtained to solve a classification problem in one domain can be applied to another different, but related, classification problem. In the context of CNN training, TL transfers the network parameters from the source domain to the target domain. This approach is effective in reducing the risk of overfitting in a CNN model trained on a small dataset [9]. This seeks to improve the CNN feature selector in the target domain, such that only the tuning of the parameters related to the FC layers is required. These parameters can be organized in two levels: the weights of the network connections and the parameters that refer to the model design, known as hyperparameters [10].

The optimal adjustment of the number of layers and neurons in each layer of the FC part of a CNN remains an open problem in the research field. Therefore, this task is commonly posed as an optimization problem, where the aim is to maximize the classification accuracy and minimize the complexity of the network architecture. The goal is to obtain a simpler network topology that enhances classification accuracy and reduces the computational cost in the training process [11].

The number of neurons in the input layer of the FC part of a CNN model corresponds to the number of attributes obtained from the convolutional layers. Feature selection (FS) seeks to reduce the space of input variables by selecting a subset of attributes that offer a better description of the problem. Thus, the objective is to maximize classification accuracy while minimizing the complexity of the network architecture by reducing the number of neurons in the input layer. Therefore, several methods have been proposed to adjust the number of neurons in this layer, performing the FS task to select those attributes that allow increasing the classification accuracy of the model [12].

Techniques designed to reduce the number of parameters in the architecture of a CNN, both in the convolutional layers and in the FC-type layers, are called pruning algorithms. These algorithms seek to reduce the complexity of the model by minimizing the total number of layers in the network and the number of neurons in each layer while attempting to avoid negative impacts on classification accuracy [13].

1.2. Optimization Algorithms

Genetic algorithms (GAs) are meta-heuristics inspired by Darwin’s theory of evolution, in which natural selection plays a crucial role. In GAs, this process is conceptualized as an optimization task. Given an environment (optimization problem) with a population of individuals (candidate solutions) competing to survive and reproduce, each individual is assigned a fitness value (its evaluation in the objective function). The higher the fitness, the greater the probability of survival and reproduction. Based on their fitness values, a selection of parents is made, and crossover and mutation are applied to generate offspring. These offspring then compete with their parents to become part of the population of the next generation [14].

The GA starts with a randomly generated population, where the population size is user-determined, typically ranging from 50 to 200 individuals. The fitness of each element of the population is evaluated, and new solutions are generated through the application of crossover and mutation. After applying these genetic operators, it is common to use a simple elitism scheme to preserve the best solution from each generation.

GAs have been shown to outperform algorithms based on a single solution, such as gradient-based methods, particularly when applied to challenging problems involving local optima or large plateaus. That is why, in specialized literature, there is an increasing use of GAs to perform pruning tasks in both the convolutional and FC layers of CNNs.

However, GAs are computationally intensive, mainly because of the large populations to be evaluated at each generation. This way, the high computational cost involved in evaluating a large population that encodes the hyperparameters related to convolutional layers has led to the use of pruning methods with GAs, transferring pre-trained parameters more efficiently [8]. Additionally, there have been proposals to keep the pre-trained model (i.e, the convolutional layers) intact and pruning only the FC-type layers with a GA, as in [13], where they also consider the optimization of the FS task.

Despite these advancements, the use of GAs remains computationally expensive. In addition to populations of up to 200 individuals, they encode variables from 3 to 12,416 dimensions [8,15]. For this reason, we propose a method based on a micro genetic algorithm ( $μ$ GA) that uses only four individuals in the population [16]. Some theoretical studies suggest that a small population of even three or four individuals is sufficient to achieve convergence in an optimization problem [17]. However, practical implementations of this idea are scarce, with the $μ$ GA being one of the most notable approaches. By using this algorithm, this proposal seeks to balance the global exploration capabilities of traditional GAs with the reduced computational burden of gradient-based numerical methods.

Our approach involves using a $μ$ GA to simultaneously optimize both the performance (by adjusting the learning rate) and the architecture of the FC layers (by adjusting the number of input and hidden neurons) of a CNN. The primary indicators used are classification accuracy and complexity of the network architecture. Reducing the number of input neurons in the FC part of a CNN involves not only determining how many to include but also identifying which input features are most relevant for the model (i.e., performing the FS task). All of this is performed within the framework of the recently proposed TL paradigm.

2. State of the Art

A significant amount of research has been conducted on optimizing the architecture of traditional ANNs (i.e., networks that do not use convolutional layers). Some approaches focus on designing the input layer [15,18,19], others on the hidden layer [20,21,22], and some on optimizing both [23,24]. However, these methods do not involve tuning weights using GAs, since they primarily concentrate on the optimization of the network architecture and/or the adjustment of the parameters of the backpropagation-based training algorithms.

For CNN models, existing works are centered around TL, with approaches oriented to the design of convolutional layers [8,25,26,27,28]. These works address the tasks of optimizing network architecture and layer selection using GAs. However, optimizing the convolutional layers involves a considerably higher computational cost than optimizing the FC layers due to the large number of parameters.

2.1. Approaches with Transfer Learning That Optimize Fully Connected Layers

A recent trend in the literature is to optimize the FC layers while keeping the convolutional architecture intact from the source domain in TL. This approach is based on the assumption that the features extracted in the source domain are representative enough to support the effective design and training of FC layers in a given target domain. There are two main approaches in the literature using CNNs with TL that aim to optimize the FC layers.

On the one hand, Bibi et al. [29] proposed a content-based image retrieval method using a TL approach with a pre-trained CNN model, a variant of the standard GA, and an ANN model known as the extreme machine learning classifier.

On the other hand, Poyatos et al. [13] developed a method called EvoPruneDeepTL (evolutionary pruning model for deep neural networks based on transfer learning). This approach performs evolutionary pruning by using TL to import the weights of a pre-trained CNN and a steady-state GA that fine-tunes the architecture of the FC layers. Optimization criteria, including maximizing classification accuracy and minimizing model complexity, which depends on the number of neurons in the hidden layers, were considered. However, the objective function only optimizes classification performance, since this method employs a selection criterion that favors solutions with fewer neurons.

During the feature extraction stage, the weights of a pre-trained CNN model (ResNet50 [30]) are used. Additionally, two variants of the proposed method are presented, both using the same cost function. The first variant minimizes the number of neurons in the hidden layers, while the second performs pruning exclusively in the input layer, functioning as a feature selector.

The results demonstrated that the optimization process involving neuron pruning obtained better classification accuracy than pruning connections between neurons. Furthermore, this method achieved better performance in terms of both classification accuracy and model architectural complexity compared to the reference models adopted.

Our proposed approach also focuses on optimizing the FC layers using TL.

2.2. Limitations for Optimization Algorithms

Due to the large number of parameters in CNN models, training is only feasible with the use of GPUs. Depending on the architecture, a single run of the training algorithm can take days or even weeks [31]. As a result, methods that optimize the architecture of pre-trained CNN models are often limited to a small number of evaluations of the objective function. Alternatively, weight-tuning algorithms may be configured with a smaller training batch size and a limited number of epochs.

This limitation in the number of evaluations and configurations for the weight tuning algorithm can lead to issues such as lack of convergence, inadequate exploration of the search space, or, in general, lower-quality solutions. Therefore, it is crucial to find a balance between the quality of solutions and the computational burden in the evolutionary optimization process for these models.

Developing algorithms that accelerate the parameter training process of CNN models is of great relevance due to the significant time and computational resources required for optimization. In this paper, we propose the use of a $μ$ GA with a small population size. This approach allows for a larger number of generations and achieves reliable performance in search space exploration while keeping computational effort bounded.

Using the $μ$ GA in this context has several advantages, such as reducing the computational time and the number of objective function evaluations. By keeping the population small, the algorithm can focus on promising regions of the search space, allowing faster convergence to high-quality solutions. However, the risk of using small populations is stagnation or premature convergence. In this work, we adopted an elitist re-initialization mechanism that helps preserve the best solutions found so far and ensures that the algorithm keeps exploring the search space, thus preventing stagnation.

The approach proposed in this work seeks to balance computational efficiency and solution quality by optimizing the learning rate and the architecture in both input and hidden layers of CNN models. By combining the advantages of $μ$ GA with the elitist re-initialization mechanism, we seek to obtain promising results in terms of performance and reduction in model complexity.

A comparison of previous works, together with our proposed approach, is summarized in Table 1. Several columns are informative for a full comparison, but we would like to highlight columns TL model, GA type Name, GA Pop., FS and HL, as they contain the novel features of our work. Our method, in the last line, proposes the use of TL (in contrast to the first eight approaches listed), together with a $μ$ GA with a reduced population size of just four individuals, applied to the optimization of both the input (performing the FS task) and hidden layers (in contrast to the previous approaches using TL) while maintaining model performance.

3. A Deep Neural Network Based on a Micro Genetic Algorithm ( $μ$ GA-DNN)

This section introduces the proposed algorithm, which consists of a $μ$ GA with a population of four individuals to optimize a weighted objective function that considers two criteria: (1) maximizing the classification accuracy of a DNN model and (2) minimizing the complexity of the network architecture (FC layers of a CNN model).

To achieve this, three different representations in the binary domain have been designed to encode the individuals in the population, considering the number of hidden neurons, the learning rate for training the DNN model, and the features of the dataset to perform the FS task.

Additionally, four variants of the objective function were developed to guide the search process toward solutions that meet the mentioned optimization criteria. By combining the different representations with at least two objective functions, a total of eight variants of the proposed method were obtained. Figure 1 shows the general scheme of the proposed algorithm, named the Deep Neural Network based on a Micro Genetic Algorithm ( $μ$ GA-DNN).

The $μ$ GA-DNN algorithm seeks to find a balance between classification accuracy and complexity of the network architecture, thus allowing it to obtain efficient and effective DNN models for classification tasks. By combining different representations and objective functions, the proposed algorithm is designed to adapt to various scenarios and provide high-quality solutions. In summary, the proposed $μ$ GA-DNN algorithm uses an evolutionary approach to optimize the architecture of the FC layers of a CNN model, the learning rate of the DNN model training algorithm, and the performance of the FS task.

In this work, we use a pre-trained CNN model, ResNet50 [30], which has been pre-trained on the ImageNet [32] dataset. This approach is based on the idea of TL, where the parameters of the pre-trained model are transferred and adapted to solve another problem, known as the target domain.

The $μ$ GA algorithm is applied to tune the architecture and parameters of the DNN training algorithm, tailoring them to the specific requirements of the target domain, thereby improving the performance of the model classification tasks. Once an optimized architecture has been obtained, the model is trained with the target domain dataset to adjust the DNN parameters to the specific characteristics of the problem. The performance of the trained model is then evaluated using an independent test dataset, allowing one to obtain an estimate of the model’s actual performance in the classification task.

The use of a $μ$ GA [16] promotes rapid convergence and is efficient in locating promising regions of the search space, accelerating the evaluation process of the entire population since it works with very few individuals. However, small populations cannot maintain diversity over multiple generations. To address this, a mechanism that restarts the population is included when diversity is compromised. Additionally, a simple elitism scheme is used that allows preserving the best individual in the population. This approach helps prevent convergence to local minima, maintaining the diversity of the population with the information of genes of the reset individuals.

3.1. Objective Functions

This section outlines the design of the four proposed objective functions.

3.1.1. Maximizing Classification Accuracy

The first objective function considers only the classification accuracy criterion of the DNN model, which is measured in terms of the metric ACC.

Definition 1 (Classification Accuracy (ACC)).

The hit rate that measures the classification performance of the network model on an independent test set of patterns [33]. The higher the accuracy, the better the performance of the model.

Therefore, the optimization problem is defined as

(1) $\begin{matrix} Maximize & f_{1} (q) = ACC \\ s . t . & q \in Ω \land \min (L_{i}) \geq 25, i = 1, \dots, m \end{matrix}$

where

f_{1} \in [0, 1]

is the objective function,

q

is the binary vector encoding a DNN architecture, and the learning rate value of the model training algorithm, while

Ω

is the set of feasible solutions. On the other hand,

\min (L_{i}) \geq 25

is a condition indicating the minimum number of neurons that the optimized architecture can have in the i-th hidden layer, which is denoted as

L_{i}

. In this work, a minimum limit of 25 neurons is considered, which was established empirically.

In this work, we consider a reference architecture with $m = 2$ hidden layers to minimize the complexity of the DNN architecture. $L_{i}$ represents the total number of neurons in the i-th hidden layer, with $i = 1, \dots, m$ (Figure 2).

Since the optimization criterion is maximizing classification accuracy ( $ACC$ ), an additional mechanism is needed to address the minimization of the complexity of the network architecture in cases of a tie. To carry this out, a selection rule is introduced, inspired by a constrained optimization work [34], where the objective function is always the dominant criterion and the constraint violation is secondary. This rule is used both in the selection of parents for crossover and in the elitism mechanism of $μ$ GA and is as follows:

Definition 2 (Selection rule $S$ ).

If the algorithm finds two or more solutions with identical classification accuracy, the solution with the fewest neurons in the hidden layers will be selected.

This rule $S$ balances the search for solutions that maximize classification accuracy while minimizing the complexity of the network architecture. As a result, simpler and more efficient solutions can be found that provide a classification performance comparable to that of more complex architectures.

3.1.2. Maximizing Classification Accuracy and Percentage of Hidden Neurons Removed

The second objective function seeks to maximize classification accuracy and minimize the architectural complexity of the DNN. This is achieved through a linear combination of ACC and the performance measure known as the hidden neuron ratio (HNR).

Definition 3 (Hidden Neuron Ratio (HNR)).

The proportion of hidden neurons relative to the maximum number of hidden neurons in the baseline architecture. A lower value of HNR implies that more hidden neurons have been pruned, which in turn can reduce model complexity and help prevent overfitting.

In this scenario, a second criterion in the case of a tie is unnecessary, since a reduction in the number of hidden neurons is explicitly considered in the objective function by including the term HNR. Thus, the optimization problem is formulated as

(2) $\begin{matrix} Maximize & f_{2} (q) = w_{1} \cdot ACC + w_{2} \cdot (1 - HNR) \\ s . t . & q \in Ω \land \min (L_{i}) \geq 25, i = 1, \dots, m \end{matrix}$

where

w_{1}

and

w_{2}

are the weights of relevance of each optimization criterion. In this study, we use

w_{1} = w_{2} = 0.5

, which indicate that both criteria,

ACC

and

HNR

, are equally important.

If $f_{2} \to 1$ , it means that the DNN architecture encoded in the solution $q$ achieves high performance in terms of classification accuracy and obtains a low percentage of hidden neurons. Conversely, $f_{2} \to 0$ suggests poor performance in both aspects.

3.1.3. Maximizing Classification Accuracy and MRMR Feature Selection Criteria

For this objective, we employ the minimum-redundancy and maximum-relevance indicator, widely used in FS.

Definition 4 (Minimum Redundancy and Maximum Relevance (MRMR)).

The criterion that measures the interdependence between attributes (redundancy) and their association with target variables or class labels (relevance) [35].

The concept of feature redundancy refers to determining the attributes that have a high correlation with each other, often computed using the Pearson correlation coefficient and denoted as W. Feature relevance refers to those attributes that have the highest influence on the variability of class labels in a given dataset, computed via point biserial correlation and denoted as V.

The original authors [35] proposed two ways to combine these criteria: $V - W$ and $V / W$ . In this work, we adopt a normalized version of the $V - W$ formulation. So a value near 0 means that the selected features have a high redundancy between each pair of attributes and a low relevance in relation to the class label variable. On the other hand, a value near 1 indicates that there is a low redundancy of information between the attributes and a high relevance of this information in terms of its link with the class label variable.

Thus, the third objective function considers maximizing both the classification accuracy ( $ACC$ ) and the $MRMR$ criterion:

(3) $\begin{matrix} Maximize & f_{3} (q) = w_{1} \cdot ACC + w_{3} \cdot MRMR \\ s . t . & q \in Ω \land \min (L_{i}) \geq 25, i = 1, \dots, m \end{matrix}$

Here, $w_{1} = w_{3} = 0.5$ , indicating that both $ACC$ and $MRMR$ are of equal importance in the optimization process.

Like the objective function $f_{1}$ (Equation (1)), $f_{3}$ does not explicitly consider the criterion for minimizing the complexity of the DNN architecture. Therefore, the selection rule described in Section 3.1.1 is also incorporated.

Thus, if $f_{3} \to 1$ , this indicates that the DNN architecture encoded in the solution $q$ achieves high performance in terms of $ACC$ and selects a feature set that maximizes the $MRMR$ criterion. On the other hand, if $f_{3} \to 0$ , it represents the opposite.

3.1.4. Maximizing Classification Accuracy, Percentage of Hidden Neurons Removed, and MRMR Criterion

The fourth objective function performs a linear combination of $ACC$ , $HNR$ , and $MRMR$ to maximize the classification accuracy, minimize the complexity of the network architecture, and maximize the $MRMR$ criterion simultaneously by selecting a subset of attributes. This optimization problem is defined as follows:

(4) $\begin{matrix} Maximize & f_{4} (q) = w_{1} \cdot ACC + w_{2} \cdot (1 - HNR) + w_{3} \cdot MRMR \\ s . t . & q \in Ω \land \min (L_{i}) \geq 25, i = 1, \dots, m \end{matrix}$

In this work, we adopted equal weights, $w_{1} = w_{2} = w_{3} = 0 . \bar{33}$ , so $ACC$ , $HNR$ and $MRMR$ are equally important.

Therefore, if $f_{4} \to 1$ , it means that the network architecture encoded in the binary vector $q$ achieves high performance in terms of classification accuracy, minimizes the number of hidden neurons, and selects a subset of features that maximize the $MRMR$ criterion. Conversely, if $f_{4} \to 0$ , it means the opposite.

3.2. Representations of Solutions

In this paper, three representations are proposed to encode the solutions of the $μ$ GA-DNN algorithm. This problem is modeled as a search task in a binary space, with each solution encoding the learning rate of the training algorithm, the number of input neurons, and the number of hidden neurons in the network.

Figure 3 shows the network topology used as a reference, where the complexity is to be reduced. The architecture consists of 2048 neurons in the input layer and two hidden layers of 512 neurons each. The number of neurons in the input layer is due to the use of the pre-trained model, ResNet50 [30], which automatically extracts a total of 2048 descriptors from the dataset. The maximum number of neurons in the hidden layers was chosen according to one of the reference architectures [13].

Each of the proposed representations is used with at least two of the objective functions detailed in Section 3.1.

3.2.1. First Representation: $μ$ GA-1

In the first representation, the actual value of the learning rate and the number of neurons in both hidden layers are encoded. It should be noted that in this design, the encoding of the neurons in the input layer is not considered, so the FS task is not performed.

Therefore, the chromosome used in the $μ$ GA is a binary vector consisting of three parts or blocks, the first part encodes the number of neurons in the first hidden layer, the second part encodes the number of neurons in the second hidden layer, and the third part encodes the learning rate used in the DNN model training algorithm.

Decoding is carried out using a linear mapping rule [36], as in all the following representations, where the desired precision $p r$ , together with the maximum $x^{(U)}$ and minimum $x^{(L)}$ values of the variable, determine the number of bits $n_{b}$ needed for each variable. Table 2 shows the mentioned data.

Figure 4 shows an example of a chromosome that uses the $μ$ GA-1 representation to encode a binary vector $q = [q_{1}, \dots, q_{35}]$ with $q_{i} \in {0, 1}$ for $i = 1, \dots, 35$ . This representation generates a search space with $2^{35}$ possible solutions.

3.2.2. Second Representation: $μ$ GA-FS

In the second representation, the number of hidden neurons, the continuous value of the learning rate, and a binary string representing a subset of features are explicitly encoded, indicating that the optimization process includes the FS task.

Thus, the chromosome used in the $μ$ GA is a binary vector consisting of four parts or blocks. The first two parts correspond to the encoding of the number of neurons in the two hidden layers, the third part encodes the value of the learning rate, while the fourth part explicitly encodes the selected features. The aim is to reduce the complexity of the network architecture, also in terms of the input layer.

As a result, a bit is used to represent the possible selection of each of the 2048 features extracted by the pre-trained ResNet50 model, thus adding 2048 binary dimensions to the problem. Therefore, in the last 2048 blocks of the chromosome, if the value of a bit at a certain position is 1, it means that the feature is selected, while if it is 0, it indicates the opposite.

Table 3 shows the search space bounds for the blocks in this representation, along with the resulting number of bits, the minimum and maximum values, and the desired precision.

Figure 5 shows an example of a chromosome that uses the $μ$ GA-FS representation to encode a binary vector $q = [q_{1}, \dots, q_{2083}]$ . Thus, this representation has a search space of $2^{2083}$ solutions, which is considerably larger than the search space of $μ$ GA-1.

3.2.3. Third Representation: $μ$ GA-MRMR

In the third representation, the number of hidden neurons, the continuous value of the learning rate, and the number of predictive variables selected from a feature ordering based on the $MRMR$ criterion are encoded.

Therefore, the chromosome used in the $μ$ GA consists of a binary vector with four sections. The first two parts correspond to the encoding of the number of neurons in the two hidden layers of the network, the third section corresponds to the encoding of the learning rate value, and the fourth part encodes an integer value indicating the number of selected features.

Table 4 shows the search space bounds for the blocks in this representation, as well as the resulting number of bits, the minimum and maximum values, and the desired precision.

Figure 6 shows an example of a chromosome using the $μ$ GA-MRMR representation to encode a binary vector $q = [q_{1}, \dots, q_{46}]$ . Thus, this representation has a search space of $2^{46}$ solutions, which is larger than the search space of $μ$ GA-1. However, it is much smaller than the search space of the $μ$ GA-FS representation, although $μ$ GA-MRMR also considers the FS task with the help of an additional process.

3.3. Naming Variants of $μ$ GA-DNN

From the different representations and objective functions, different variants are obtained. To name them, we use a subscript and a superscript. The variants of the proposed method that use the selection rule described in Section 3.1.1 are indicated with the subscript $S$ , and those that do not consider this rule do not have the subscript.

Regarding the superscript, it is related to the FS task, and we have W, H and R. They indicate that the FS task is addressed from a Wrapper (W) approach, which minimizes the classification error; a Hybrid (H) approach, based on the feature selection criterion $MRMR$ and the classification error; and a Ranking (R) approach, focused on selecting features from a ranking based on the criterion $MRMR$ , respectively. Hence, the variants are F, $F_{S}$ (these two do not perform the FS task), $F^{W}$ , $F_{S}^{W}$ , $F^{H}$ , $F_{S}^{H}$ , $F^{R}$ , and $F_{S}^{R}$ . Table 5 shows the relationship between the three representations and the four proposed objective functions, whose combination generates the eight variants of $μ$ GA-DNN.

3.4. Pseudocode of the Proposed Algorithm

Algorithm 1 shows the pseudocode of the proposed method. Four individuals are used in the population, each encoding the parameters of a DNN architecture and the learning rate of the algorithm for training the synaptic weights of the network. As input data, it receives the crossover probability, the number of generations, and the training and validation datasets necessary for the evaluation of the individual using a 5-fold cross-validation scheme.

Algorithm 1

μ

GA-DNN

Require:
Crossover probability ( $p_{c}$ ), number of generations ( $g_{m a x}$ ), training data ( $Z_{t r a i n}$ ) and validation data ( $Z_{v a l}$ ), search range of learning rate, search range of number of nodes of hidden layers
Ensure:
Parameters of the best DNN individual $λ_{b e s t}$ and its fitness value $f (λ_{b e s t})$
1:. Initialize population of individuals randomly: $Q (0) = {q_{0, 0}, \dots, q_{3, 0}}$
2:. Evaluate fitness of $Q (0)$ : ${f (q_{0, 0}), \dots, f (q_{3, 0})}$
3:. Identify the best individual of the current generation: $q_{b e s t, 0}$
4:. for $g = 1$ to $g_{m a x}$ do
5:. Select $Q^{'}$ from $Q (g - 1)$ with binary tournament and criterion $S$ // Section 3.1.1
6:. Apply two-point crossover with probability $p_{c}$ to $Q^{'}$
7:. Evaluate fitness of $Q^{'}$ : ${f (q_{0, g}), \dots, f (q_{3, g})}$
8:. Apply elitism strategy considering the criterion $S$ // Section 3.1.1
9:. Compute homogeneity of $Q^{'}$ with Hamming distance
10:. if $Q^{'}$ is homogeneous then
11:. Reset population: $Q^{'} = {q_{0, g}, \dots, q_{3, g}}$
12:. Evaluate fitness of $Q^{'}$ : ${f (q_{0, g}), \dots, f (q_{3, g})}$
13:. Apply elitism strategy considering the criterion $S$ // Section 3.1.1
14:. end if
15:. Get new population: $Q (g) \leftarrow Q^{'}$
16:. Identify the best individual of the current generation: $q_{b e s t, g}$
17:. end for
18:. Decode the best solution obtained $q_{b e s t}$ to obtain $λ_{b e s t}$
19:. return $λ_{b e s t}$ and $f (λ_{b e s t})$

Additionally, $μ$ GA-DNN uses the selection rule $S$ described in Section 3.1.1 to perform parent selection ( $Q^{'}$ ) and when applying the elitism operator. This is important as it helps to select DNN solutions or architectures with fewer training parameters.

Algorithm 2 presents the evaluation process of a solution in the objective function. The different representations to build the architecture of a DNN (Section 3.2) are decoded in lines 2, 6 and 10. Likewise, lines 22, 24, 26 and 28 show the computation of the different objective functions described in Section 3.1.

Algorithm 2 Evaluating a solution

q

Require:
Binary individual $q = {q_{1}, \dots, q_{B}}$ , training data ( $Z_{t r a i n}$ ) and validation data ( $Z_{v a l}$ ) split into $k = 5$ folds, parameters of the reference DNN architecture ( $L_{1}$ and $L_{2}$ ), DNN training algorithm configuration (optimizer, number of epochs and batch size), objective function weights vector ( $w$ ), variant of the proposed method (v).
Ensure:
Individual’s fitness: f
1:. if $v \in {F_{S}, F}$ then
2:. Decode $q$ to obtain $λ = [L_{1}, L_{2}, l_{r}]$
3:. Build a DNN architecture from $L_{1}$ and $L_{2}$
4:. $X_{t r a i n} = Z_{t r a i n}$ , $X_{v a l} = Z_{v a l}$ // No feature selection is performed
5:. else if $v \in {F_{S}^{W}, F^{W}, F_{S}^{H}, F^{H}}$ then
6:. Decode $q$ to obtain $λ = [L_{1}, L_{2}, l_{r}, η]$
7:. Build a DNN architecture from $L_{1}$ , $L_{2}$ and $η$
8:. Select the features of $Z_{t r a i n}$ and $Z_{v a l}$ with $η$ to obtain $X_{t r a i n}$ and $X_{v a l}$
9:. else if $v \in {F_{S}^{R}, F^{R}}$ then
10:. Decode $q$ to obtain $λ = [L_{1}, L_{2}, l_{r}, b]$
11:. Build a DNN architecture from $L_{1}$ , $L_{2}$ and b
12:. Select the features of $Z_{t r a i n}$ and $Z_{v a l}$ with b to obtain $X_{t r a i n}$ and $X_{v a l}$
13:. end if
14:. for $i = 1$ to k do
15:. Train with $X_{t r a i n, i}$ the DNN model of the obtained architecture
16:. Validate the trained DNN model with $X_{v a l, i}$ : ${ACC}_{v a l} (i)$
17:. end for
18:. $ACC = \frac{1}{k} \sum_{i = 1}^{k} {ACC}_{v a l} (i)$
19:. Compute $HNR$
20:. Compute $MRMR$
21:. if $v \in {F_{S}, F_{S}^{W}, F_{S}^{R}}$ then
22:. $f = ACC$
23:. else if $v \in {F, F^{W}, F^{R}}$ then
24:. $f = w_{1} \cdot ACC + w_{2} \cdot HNR$
25:. else if $v = F_{S}^{H}$ then
26:. $f = w_{1} \cdot ACC + w_{2} \cdot MRMR$
27:. else if $v = F^{H}$ then
28:. $f = w_{1} \cdot ACC + w_{2} \cdot HNR + w_{3} \cdot MRMR$
29:. end if
30:. return f

4. Experimental Framework

In the present experimental framework, the experiments performed are described, and the performances of eight variants of the proposed algorithm are compared against two reference approaches.

4.1. Datasets

We employed twelve datasets used to solve image classification problems. These datasets cover different domains and problem types, from synthetic data to X-ray images. These data were collected from various sources (some have been previously used in the literature). Table 6 shows the characteristics of the datasets.

In this work, we used the TL paradigm to import the pre-trained CNN model ResNet50 [30] to automatically extract features from image datasets. The source domain of the pre-trained model is the ImageNet dataset, which contains fourteen million images [32].

4.2. Reference Methods

This section shows the two benchmark approaches that were compared with $μ$ GA-DNN. These approaches are (1) a state-of-the-art method called EvoPruneDeepTL [13] and (2) a conventional DNN architecture.

4.2.1. EvoPruneDeepTL Algorithm

This consists of a steady-state GA that aims to find the subset of features that maximize the classification accuracy of a supervised learning model, which is based on an ANN whose architecture consists of an input layer with $d = 2048$ neurons, $m = 1$ hidden layer of the FC type, with $L_{1} = 512$ neurons, and an output layer with c neurons. Additionally, this method uses the ResNet50 architecture to extract features from the datasets.

The binary representation of the solutions in the EvoPruneDeepTL algorithm allows exploring the search space for the features to be used in the ANN model. By using a binary string of length $d = 2048$ , up to $2^{2048}$ combinations of features can be represented.

On the other hand, the objective function maximizes the classification performance of the model in terms of accuracy ( $ACC$ ).

Table 7 presents the parameters used in the EvoPruneDeepTL algorithm for experimentation in this work. These parameters were taken from the original proposal [13]. However, it is important to mention that the number of epochs used here is lower than that used by the original authors (600 epochs) due to computational time limitations and to match the amount used by our proposal. EvoPruneDeepTL will be denoted as $E_{S}$ during the experiments.

4.2.2. DNN Architecture

A DNN architecture based on the network topology shown in Figure 3 was also adopted as a reference method for comparison. It has $d = 2048$ neurons in the input layer, $m = 2$ hidden layers (with $L_{1} = L_{2} = 512$ ), and c neurons in the output layer. This model was trained with a version of the backpropagation algorithm based on the Adam [49] optimizer, using a batch size of 32 and a total of 100 epochs. This network architecture was chosen because it is one of the architectures used in the previous reference work [13].

4.3. Parameters of the Proposed Algorithm

Table 8 summarizes the parameters used by the eight variants of the proposed $μ$ GA-DNN in the experimentation stage. These variants are described in detail in Section 3.3. Note that the number of epochs is the same as the number used by EvoPruneDeepTL.

4.4. Resampling Method

A twice-repeated five-fold cross-validation method was employed to obtain a more accurate assessment of the performance of the proposed algorithms. This approach involves splitting the data into five subsets, performing cross-validation five times, and repeating the process twice, resulting in a total of ten independent experiments. Cross-validation is a commonly used technique to evaluate the performance of machine learning models, as it allows obtaining more reliable estimates of model performance on unseen data.

Using this validation method reduces the influence of chance introduced by splitting the data, thus providing a more robust assessment of the performance of the proposed algorithms [50].

5. Results and Comparisons

The presentation of the results is divided into two parts:

Results of the variants of the $μ$ GA-DNN method and the EvoPruneDeepTL algorithm. In this first section, the experimental results of all the variants of the proposed method ( $F_{S}$ , $F_{S}^{W}$ , $F_{S}^{H}$ , $F_{S}^{R}$ , F, $F^{W}$ , $F^{H}$ and $F^{R}$ ) are compared with the experimental results of $E_{S}$ .
Results of the best variant of each group of the proposed method. The variants are divided into two groups: algorithms that employ the selection rule ( $F_{S}$ , $F_{S}^{W}$ , $F_{S}^{H}$ , and $F_{S}^{R}$ ) and those that do not (F, $F^{W}$ , $F^{H}$ , and $F^{R}$ ). These results are compared to those obtained by the model that was trained from the reference DNN based on the architecture in Figure 3.

5.1. Results of $μ$ GA-DNN Variants and EvoPruneDeepTL

The results are presented and analyzed for several performance measures.

5.1.1. Classification Accuracy

Table 9 shows the results for the $ACC$ indicator of the variants of the proposed method and the reference method $E_{S}$ for each of the datasets used. Additionally, it presents some statistics, namely the mean, standard deviation (STD), median, median absolute deviation (MAD), maximum, minimum, and the count of the highest values obtained by each algorithm. The results indicate that the variants of the proposed algorithm based on the selection rule ( $S$ ) achieved a better performance than their counterparts that do not use the rule $S$ . For example, the $F_{S}^{R}$ method achieved the highest average-accuracy performance among the proposed techniques that use $S$ ( $ACC = 0.814$ ), while its counterpart $F^{R}$ achieved a slightly lower value ( $ACC = 0.804$ ), being the technique that obtained the highest value among those that do not use $S$ . Even though the reference method $E_{S}$ obtained a higher count on the best values across different datasets, it achieved an average classification accuracy similar to the variant $F_{S}^{R}$ ( $ACC = 0.814$ ).

Figure 7 shows the distribution of ACC results for each of the compared algorithms using boxplots. The average value (mean) and its respective p-value corresponding to the Wilcoxon rank sum test ( $α = 0.05$ ) are printed on the top of each plot. In this case, all variants of the proposed algorithm obtained $p > 0.05$ . Therefore, there is no statistically significant difference between any variant of $μ$ GA-DNN and the $E_{S}$ method in terms of ACC.

5.1.2. Feature Selected Ratio

This experiment reports the results of the FSR measure, which is described below.

Definition 5 (Feature Selected Ratio (FSR)).

The percentage of features selected by the evaluated methods. In the input layer, each feature is a neuron, so it also represents the percentage of input neurons.

Table 10 presents the $FSR$ results for the variants of the proposed method and the reference method $E_{S}$ on each of the datasets used, along with the corresponding statistics.

The results show that the proposed variants that rank the features of each dataset according to the MRMR criterion, i.e., $F_{S}^{R}$ and $F^{R}$ , achieved the lowest $FSR$ . In particular, $F^{R}$ achieved the lowest ratio ( $FSR = 0.443$ ). On the other hand, the $E_{S}$ method had a higher ratio compared to all the proposed variants performing the feature-selection task, with a $FSR = 0.659$ .

Figure 8 shows the results for the $FSR$ indicator. The top part shows the p-value of the Wilcoxon rank sum test ( $α = 0.05$ ) that was used to evaluate the statistical significance between $E_{S}$ and the variants of the proposed method with respect to $F^{R}$ (the variant that obtained the best results in terms of the $FSR$ indicator). Thus, $E_{S}$ and most of the variants of the proposed method present a statistically significant difference with respect to $F^{R}$ ( $p < 0.05$ ), with $F_{S}^{R}$ ( $p = 0.46$ ) being the only one that does not present this difference. This is because both $F^{R}$ and $F_{S}^{R}$ use the same method to perform the FS task, i.e., an ordering based on the MRMR criterion. Furthermore, it is worth noting that all the variants of $μ$ GA-DNN that conduct the FS task obtained better results than the reference method $E_{S}$ in terms of the indicator $FSR$ .

These results indicate that the variants of the proposed method are more effective in reducing the number of selected features compared to $E_{S}$ .

5.1.3. Hidden Neurons Ratio

Table 11 presents the results of the $HNR$ metric. The results indicate that the four variants of the proposed method that do not use the $S$ rule achieved a lower average $HNR$ compared to their counterparts that do use the rule. The $F^{R}$ variant achieved the best performance in this metric ( $HNR = 0.097$ ).

These lower values of $HNR$ for the variants without $S$ are because the value of $HNR$ is considered in the objective functions employed by these methods, as shown in Equations (2) and (4). In these functions, equal weight is given to the classification performance in terms of $ACC$ and to the architectural complexity of the hidden layers of the DNN in terms of $HNR$ .

Figure 9 presents the results for the $HNR$ metric. The top section shows the p-value of the Wilcoxon rank sum test ( $α = 0.05$ ) used to assess statistical significance between all the variants with respect to $F^{R}$ (the variant that performed best in terms of the $HNR$ indicator).

The $E_{S}$ method keeps all neurons in the hidden layer, as it only performs the FS task. On the other hand, the variants of the proposed algorithm that employ the rule $S$ achieved a significant reduction in the number of hidden neurons, even when the objective functions used by these variants do not include the $HNR$ indicator in their formulation. $F_{S}^{W}$ and $F_{S}^{H}$ achieved a reduction of more than 60% ( $HNR < 0.40$ ), while $F_{S}$ and $F_{S}^{R}$ achieved a reduction of more than 70% ( $HNR = 0.298$ ). As for the variants that do not use the rule $S$ , they achieved a reduction of more than 80%, with $F^{R}$ obtaining the best result ( $HNR = 0.097$ ).

Finally, the $F^{R}$ variant presents a statistically significant difference compared to all other methods in the study, except for F ( $p = 0.31$ ). It should be noted that F and $F^{R}$ optimize the objective function $f_{2}$ (2) and use the $μ$ GA-1 and $μ$ GA-MRMR representations, respectively. These encoding methods use 35-bit and 46-bit binary vectors (Section 3.2.1 and Section 3.2.3). As a result, both F and $F^{R}$ explore search spaces of comparable size. On the other hand, although $F^{W}$ and $F^{H}$ present a statistically significant difference with respect to $F^{R}$ , they also achieved reduction ratios of over 80%. However, it is important to mention that these methods use the $μ$ GA-FS representation, which employs a 2083-bit binary vector (Section 3.2.2), so the search space is much larger than that explored by $F^{R}$ . As a result, the $F^{W}$ and $F^{H}$ variants are more susceptible to obtaining local minima.

5.1.4. Model Complexity

Table 12 shows the results obtained in terms of the MC indicator for each variant and each dataset.

Definition 6 (Model Complexity (MC)).

The number of trainable parameters of the classification model. For a DNN classifier, the total number of synaptic weights in the network is taken into account. A lower value of $MC$ implies a less complex model, which can help reduce training time.

The Equation (5) describes how the model complexity is calculated for the variants of the proposed algorithm.

(5) $MC = L_{1} (d + 1) + L_{2} (L_{1} + 1) + c (L_{2} + 1)$

Again, d is the number of neurons in the input layer, $L_{1}$ and $L_{2}$ are the number of neurons in the hidden layers, and c is the number of neurons in the output layer. The model complexity of the ANN architecture obtained by EvoPruneDeepTL is calculated as follows:

(6) $MC = L_{1} (d + 1) + c (L_{1} + 1)$

The results indicate that the variants of the proposed algorithm that do not use $S$ achieved a greater reduction in $MC$ , with examples including the methods F ( $MC =$ 1.1 × 10⁵), $F^{W}$ ( $MC =$ 8.5 × 10⁴), $F^{H}$ ( $MC =$ 9.3 × 10⁴), and $F^{R}$ ( $MC =$ 4.5 × 10⁴), the latter having the lowest average $MC$ and the lowest value in 11 out of 12 datasets.

Figure 10 shows a summary of the results described above, showing that the variants using $S$ achieved a higher $MC$ value, with $E_{S}$ achieving an even higher value than all variants of the proposed algorithm.

5.1.5. Number of Objective Function Evaluations

Table 13 shows the results for the number of evaluations performed by each variant. The mean results indicate that the variants that employ the $μ$ GA-FS representation performed fewer evaluations on their respective objective function ( $No . Eval . < 249.0$ ). Notably, $F_{S}^{H}$ used a lower number of evaluations on nine out of twelve datasets, and therefore, a lower mean value than its counterparts ( $No . Eval = 244.2$ ). Conversely, $F^{R}$ obtained the highest number of evaluations, with a difference of only eight evaluations ( $No . Eval = 252.0$ ).

Figure 11 illustrates the notable difference in the number of evaluations performed by the variants of the proposed method and the $E_{S}$ method. Here, the number of evaluations of $E_{S}$ was predefined ( $No . Eval = 300$ ), so this was the expected result.

5.1.6. Hamming Distance Results

Table 14 shows the results for the percentage of similarity of selected feature vectors in terms of the Hamming distance. Results for $F_{S}$ and F are excluded since the same feature vectors were used in all experiments. The results indicate that the variants using the $μ$ GA-FS representation generally obtained a Hamming distance of $0.500$ for $F_{S}^{W}$ , $F_{S}^{R}$ and $F^{R}$ and a Hamming distance of $0.499$ for $F^{W}$ . Variants using the $μ$ GA-MRMR representation obtained a smaller Hamming distance, such as $F^{R}$ with $0.288$ and $F_{S}^{R}$ with $0.282$ . The algorithm $E_{S}$ obtained a Hamming distance of $0.391$ , lower than the variants with $μ$ GA-FS representation but higher than the variants with $μ$ GA-MRMR representation. Therefore, $F_{S}^{R}$ was the variant with the lowest Hamming distance value among all the methods. This is because $F_{S}^{R}$ makes an ordering of the features and then tries to optimize the number of features to select, so it is quite common that some features are repeated, resulting in a lower Hamming distance value.

5.1.7. Runtime Results

Table 15 shows the average runtime (in seconds) of ten experiments corresponding to each dataset for each variant of the proposed method and the reference method $E_{S}$ . Figure 12 also shows the average runtime for each variant of the proposed algorithm and the algorithm $E_{S}$ .

It should be noted that these results are informational only, as the experiments were performed on different hardware, preventing a reliable comparison of the efficiency of these algorithms.

5.2. Comparative Results with the Full Reference Model

This section presents the comparison between the full reference model (DNN) and the proposed algorithm variants $F_{S}^{R}$ and $F^{R}$ , which performed best in terms of $ACC$ (see Section 5.1.1). Table 16 shows the $ACC$ results for the $F_{S}^{R}$ and $F^{R}$ variants and the DNN model for each dataset. The summary of results includes the statistics of the mean, STD, median, MAD, maximum, minimum, and the count of highest values obtained by each algorithm. Results indicate that DNN obtained a better performance ( $ACC = 0.821$ ) compared to the proposed variants $F_{S}^{R}$ ( $ACC = 0.814$ ) and $F^{R}$ ( $ACC = 0.804$ ). However, Figure 13 shows that there is no statistically significant difference between the proposed variants and DNN according to the p-value obtained with the Wilcoxon rank sum statistical test ( $α = 0.05$ ).

The experiment confirms that both $F_{S}^{R}$ and $F^{R}$ can achieve remarkably similar performance with respect to the full reference DNN model in terms of $ACC$ , using a reduced number of features and neurons in the hidden layers (resulting in a lower value of $MC$ ).

It is important to mention that minimizing the $MC$ indicator aims to reduce the computational effort of the training algorithm used to tune neuron weights associated with the FC layers. Consequently, the training and classification times of the network will be decreased, since the number of basic mathematical operations will be reduced by having a network architecture that is composed of a smaller number of input and hidden neurons.

6. Conclusions

This paper presents $μ$ GA-DNN, an evolutionary optimization method using a $μ$ GA algorithm to tune the architecture of a DNN and the learning rate parameter of the model training algorithm.

The general framework of the proposed approach uses TL. This consists of using pre-tuned parameters of a CNN model trained on a source domain to automatically extract features from a different problem, such as image datasets from different domains. Then, the dataset from the target domain is used to tune the number of neurons in the input layer (FS task) and hidden layers of the DNN architecture, using a GA with a population with few individuals. Finally, the model is trained with the obtained architecture, and its performance is evaluated on a test dataset.

Four objective functions and three different representations of the solution in the binary domain are proposed, resulting in eight variants of the proposed algorithm.

In the first scenario (Section 5.1), the proposed method was compared with EvoPruneDeepTL ( $E_{S}$ ), using four criteria: (1) classification accuracy ( $ACC$ ), (2) feature selected ratio ( $FSR$ ), (3) hidden neurons ratio ( $HNR$ ) and (4) model complexity ( $MC$ ). Additional analysis includes the number of evaluations of the objective function and the value of the Hamming distance of the feature vectors used by each method. Runtime measurements (in seconds) of the different variants of the proposed algorithm and the $E_{S}$ method are provided for informational purposes only due to the variety of hardware configurations used in the experimentation stage.

The results showed that the variants of the proposed method did not present a statistically significant difference with respect to its counterpart in terms of the indicator $ACC$ . Moreover, the results regarding the second criterion ( $FSR$ ) showed that the variants of $μ$ GA-DNN that perform the FS task outperformed the reference method $E_{S}$ in terms of this indicator. As a result, the statistical significance was compared with respect to $F^{R}$ (the best variant of the proposed algorithm regarding this criterion), and the results showed that only $F_{S}^{R}$ did not present a statistically significant difference. Notably, both $F_{S}^{R}$ and $F^{R}$ use the same method to perform the FS task (an ordering based on the MRMR criterion).

In terms of the indicator $HNR$ , the variant $F^{R}$ performed best. Furthermore, the results of the statistical test indicated that only the F variant did not obtain a statistically significant difference with respect to $F^{R}$ . It is important to note that both F and $F^{R}$ are modeled in such a way that the size of the search space explored with both methods is not noticeably different, since they use a binary representation with a similar number of bits and the same objective function.

On the other hand, the results in terms of the $MC$ indicator showed that the variants that do not use the selection rule ( $S$ ) obtained the best performance. This is because these methods focus on explicitly reducing the number of neurons in the hidden layer by introducing the $HNR$ indicator in their respective objective functions. Thus, a considerable reduction in the model architecture was achieved.

Furthermore, a comparison was made regarding the number of evaluations of the objective function. Even though the proposed method used fewer evaluations than the reference method, the $μ$ GA-DNN variants achieved competitive results, even surpassing $E_{S}$ on some occasions in terms of the performance measures mentioned above.

Additionally, the value of the Hamming distance between the binary vectors indicating the features selected by the methods that perform the FS task was evaluated. The results showed that the variants of the proposed method achieved a greater reduction in this distance, so they are more robust in terms of the repeatability of the selected features; in particular, $F^{R}$ and $F_{S}^{R}$ obtained the best results.

In the second scenario (Section 5.2), a comparison was made in terms of the indicator $ACC$ between the best variants of the proposed algorithm ( $F^{R}$ and $F_{S}^{R}$ ) and the reference DNN model. The results showed that the compared methods obtained similar values and that there is no statistically significant difference with respect to the variants of the proposed method. Overall, considering all the experimentas, the recommended variants are $F^{R}$ and $F_{S}^{R}$ , in that order.

Thus, within the analysis conducted throughout the experimentation of this work, it was shown that no variant of $μ$ GA-DNN presented a statistically significant difference with respect to $E_{S}$ and a reference DNN model in terms of the $ACC$ indicator. Furthermore, the proposed method is computationally efficient, since all variants managed to reduce the number of neurons in the FC layers. This allows a reduction in the complexity of the network architecture (while decreasing the $MC$ indicator), which implies a lower number of operations within the DNN model.

Limitations and Future Work

The proposed algorithm and reference methods were evaluated on datasets of classification problems with variables in the domain of real numbers. Generalization to other types of variables (e.g., categorical) or other types of problems (e.g., regression) is not straightforward. An extension to other domains is initially possible; however, more experimentation is needed.

Therefore, it is essential to consider the scope of the study when applying the proposed algorithm to different contexts and problems.

Additionally, optimizing FC layers only was a design decision in this work because the optimization of convolutional layers entails a much higher computational cost. While the results obtained are promising, the improvements in terms of accuracy and complexity when optimizing convolutional layers are a matter of further studies. Future research should examine whether possible improvements on some metrics justify the added computational cost.

Author Contributions

Conceptualization, G.T.; methodology, R.L. and G.T.; software, D.T.-A.; validation, R.L. and D.T.-A.; formal analysis, D.T.-A.; investigation, R.L. and D.T.-A.; resources, R.L. and G.T.; data curation, D.T.-A.; writing—original draft preparation, R.L.; writing—review and editing, R.L.; visualization, D.T.-A.; supervision, R.L. and G.T.; project administration, G.T.; funding acquisition, G.T. All authors have read and agreed to the published version of the manuscript.

Institutional Review Board Statement

Not applicable.

Data Availability Statement

The original contributions presented in this study are included in the article. Further inquiries can be directed to the corresponding author.

Conflicts of Interest

The authors declare no conflicts of interest.

Abbreviations

The following abbreviations are used in this manuscript:

ANN	Artificial neural network
CNN	Convolutional neural network
FC	Fully connected
FS	Feature selection
GA	Genetic algorithm
TL	Transfer learning

Footnotes

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Figures and Tables

View Image - Figure 1. General scheme of the proposed method. Top: CNN model trained on a source domain. Center: Transfer learning of the pre-trained model parameters and tuning of the model weights of a DNN (FC layers) using the [Forumla omitted. See PDF.]GA-DNN algorithm. Bottom: Schematic illustrating the operation of the proposed method.

Figure 1. General scheme of the proposed method. Top: CNN model trained on a source domain. Center: Transfer learning of the pre-trained model parameters and tuning of the model weights of a DNN (FC layers) using the [Forumla omitted. See PDF.]GA-DNN algorithm. Bottom: Schematic illustrating the operation of the proposed method.

View Image - Figure 2. Example of an FC layer architecture: The input layer is related to the d features automatically extracted by the convolutional layers of the CNN model. In this architecture, there are m hidden layers, where [Forumla omitted. See PDF.] indicate the number of neurons in each. Finally, the output layer provides a response (prediction) [Forumla omitted. See PDF.], with [Forumla omitted. See PDF.], for each of the c classes of the input dataset.

Figure 2. Example of an FC layer architecture: The input layer is related to the d features automatically extracted by the convolutional layers of the CNN model. In this architecture, there are m hidden layers, where [Forumla omitted. See PDF.] indicate the number of neurons in each. Finally, the output layer provides a response (prediction) [Forumla omitted. See PDF.], with [Forumla omitted. See PDF.], for each of the c classes of the input dataset.

View Image - Figure 3. Example of an FC layer architecture. The input layer is related to the features of the dataset. Thus, this scheme assumes that the problem has a dimensionality of [Forumla omitted. See PDF.]. Additionally, it has [Forumla omitted. See PDF.] hidden layers, each with 512 neurons, and the output layer has c neurons, where c is the number of classes of the problem.

Figure 3. Example of an FC layer architecture. The input layer is related to the features of the dataset. Thus, this scheme assumes that the problem has a dimensionality of [Forumla omitted. See PDF.]. Additionally, it has [Forumla omitted. See PDF.] hidden layers, each with 512 neurons, and the output layer has c neurons, where c is the number of classes of the problem.

View Image - Figure 4. Example of a chromosome using the binary representation [Forumla omitted. See PDF.]GA-1, which is composed of three binary blocks representing the first hidden layer ([Forumla omitted. See PDF.]), the second hidden layer ([Forumla omitted. See PDF.]), and the learning rate ([Forumla omitted. See PDF.]).

Figure 4. Example of a chromosome using the binary representation [Forumla omitted. See PDF.]GA-1, which is composed of three binary blocks representing the first hidden layer ([Forumla omitted. See PDF.]), the second hidden layer ([Forumla omitted. See PDF.]), and the learning rate ([Forumla omitted. See PDF.]).

View Image - Figure 5. Example of a chromosome that uses the binary representation [Forumla omitted. See PDF.]GA-FS, which is composed of 2051 binary blocks. The first two blocks consist of 9 bits each, the third of 17 bits, and the last 2048 blocks each consist of 1 bit.

Figure 5. Example of a chromosome that uses the binary representation [Forumla omitted. See PDF.]GA-FS, which is composed of 2051 binary blocks. The first two blocks consist of 9 bits each, the third of 17 bits, and the last 2048 blocks each consist of 1 bit.

View Image - Figure 6. Example of a chromosome using the binary representation [Forumla omitted. See PDF.]GA-MRMR, which is composed of four binary blocks.

Figure 6. Example of a chromosome using the binary representation [Forumla omitted. See PDF.]GA-MRMR, which is composed of four binary blocks.

View Image - Figure 7. Results of the [Forumla omitted. See PDF.] indicator obtained by each method are shown using boxplots. The best values are indicated in bold. The mean of [Forumla omitted. See PDF.] and the p-value of the Wilcoxon rank sum test are shown at the top of each plot.

Figure 7. Results of the [Forumla omitted. See PDF.] indicator obtained by each method are shown using boxplots. The best values are indicated in bold. The mean of [Forumla omitted. See PDF.] and the p-value of the Wilcoxon rank sum test are shown at the top of each plot.

View Image - Figure 8. Results of the mean [Forumla omitted. See PDF.] obtained by each method. The top part shows the p-value of the Wilcoxon rank sum test obtained by comparing each method with [Forumla omitted. See PDF.] (the best variant of the proposed method in terms of this indicator). In bold, [Forumla omitted. See PDF.].

Figure 8. Results of the mean [Forumla omitted. See PDF.] obtained by each method. The top part shows the p-value of the Wilcoxon rank sum test obtained by comparing each method with [Forumla omitted. See PDF.] (the best variant of the proposed method in terms of this indicator). In bold, [Forumla omitted. See PDF.].

View Image - Figure 9. Results of the mean [Forumla omitted. See PDF.] obtained by each method. The top part shows the p-value of the Wilcoxon rank sum test obtained by comparing each method with [Forumla omitted. See PDF.] (the best variant of the proposed method in terms of this indicator). In bold, [Forumla omitted. See PDF.].

Figure 9. Results of the mean [Forumla omitted. See PDF.] obtained by each method. The top part shows the p-value of the Wilcoxon rank sum test obtained by comparing each method with [Forumla omitted. See PDF.] (the best variant of the proposed method in terms of this indicator). In bold, [Forumla omitted. See PDF.].

Figure 10. Results of the average [Forumla omitted. See PDF.] obtained by each method. The best values are indicated in bold.

View Image - Figure 11. Results of the average number of evaluations of the objective function obtained by each method. The best values are indicated in bold.

Figure 11. Results of the average number of evaluations of the objective function obtained by each method. The best values are indicated in bold.

Figure 12. Results of the average runtime (in seconds) of each method.

Figure 13. Comparison of classification accuracy for the proposed models and the complete reference model.

Table 1

Comparison of methods. B. S. stands for batch size, Gens. means the maximum number of generations, Pop. means the number of individuals in population, FS indicates if the method performs the FS task, and HL indicates if the method reduces the hidden layer size. NS means not specified in the original source.

Method	ANN Model				GA Type			FS	HL
Method	TL Model	B. S.	Epochs	Optimizer	Name	Gens.	Pop.	FS	HL
Ledesma et al. [15]	–	NS	500	NS	GA	8	100	✔
Saxena et al. [18]	–	NS	NS	NS	GA	25	60	✔
WGGA [19]	–	NS	NS	NS	GA	80	10	✔
GNN [20]	–	NS	1000	SGD	GA	1000	28		✔
				SGD, ADAM,
GA-DFNN [21]	–	NS	12,000	NADAM,	GA	30, 40	20		✔
				ADAMAX
				RMSPROP,
				ADAM, SGD,
GA-ANN [22]	–	1024	25	ADADELTA,	GA	25	20		✔
				ADAMAX,
				NADAM.
CEV-MLP [23]	–	NS	NS	NS	GA	120	20, 30, 200	✔	✔
Pham et al. [24]	–	NS	NS	Quasi-Newton, SGD, ADAM	GA	200	25	✔	✔
Baldominos et al. [25]	NS	25–200	5, 30	SGD, ADAM	GA, GE	100	50
Tian y Shyu [26]	Inception V3ResNet50MobileNetDenseNet201	NS	NS	NS	GE	NS	10	✔
MLTGA [27]	Inception V3	16	5, 50	SGD	GA	5	20
EvoNAS-TL [8]	VGG-16	256	3	SGD	KGEA	30, 500 ¹	30, 200 ²
Li et al. [28]	MobileNetV2	NS	NS	NS	GA	14	50
Bibi et al. [29]	VGG-19	NS	NS	SGD	GA	NS	NS	✔
EvoPruneDeepTL [13]	ResNet50	32	600	SGD	GA	10	30	✔ ³
$μ$ GA-DNN (Our proposed approach)	ResNet50	32	100	ADAM	$μ$ GA	50	4	✔	✔

¹ 30 for global search, 500 for local search. ² 30 for global search, 500 for local search. ³ EvoPruneDeepTL provides two variants, one for FS and one for HL, but the one for FS outperformed in the original source and is the one used here for comparison.

Table 2

Bounds of the search space in each parameter, its desired precision, and the resulting number of bits ( $μ$ GA-1 representation).

Binary Block	$x^{(L)}$	$x^{(U)}$	$pr$	$n_{b}$
Hidden layer 1 ( $L_{1}$ )	1	512	0	9
Hidden layer 2 ( $L_{2}$ )	1	512	0	9
Learning rate ( $l_{r}$ )	$1 \times 10^{- 6}$	$1 \times 10^{- 1}$	6	17

Table 3

Bounds of the search space in each parameter, its desired precision, and the resulting number of bits ( $μ$ GA-FS representation).

Binary Block	$x^{(L)}$	$x^{(U)}$	$pr$	$n_{b}$
Hidden layer 1 ( $L_{1}$ )	1	512	0	9
Hidden layer 2 ( $L_{2}$ )	1	512	0	9
Learning rate ( $l_{r}$ )	$1 \times 10^{- 6}$	$1 \times 10^{- 1}$	6	17
$S_{1}$	0	1	0	1
⋮	⋮	⋮	⋮	⋮
$S_{2048}$	0	1	0	1

Table 4

Bounds of the search space in each parameter, its desired precision, and the resulting number of bits ( $μ$ GA-MRMR representation).

Binary Block	$x^{(L)}$	$x^{(U)}$	$pr$	$n_{b}$
Hidden layer 1 ( $L_{1}$ )	1	512	0	9
Hidden layer 2 ( $L_{2}$ )	1	512	0	9
Learning rate ( $l_{r}$ )	$1 \times 10^{- 6}$	$1 \times 10^{- 1}$	6	17
Number of selected features ( $n_{s}$ )	1	2048	0	11

Table 5

Relationship between the three proposed representations and the four designed objective functions. Each combination of a representation with an objective function forms one of the eight variants of the proposed algorithm.

Representation	$f_{1}$	$f_{2}$	$f_{3}$	$f_{4}$
$μ$ GA-1	$F_{S}$	F
$μ$ GA-FS	$F_{S}^{W}$	$F^{W}$	$F_{S}^{H}$	$F^{H}$
$μ$ GA-MRMR	$F_{S}^{R}$	$F^{R}$

Table 6

Description of the adopted datasets. n is the number of instances, d is the number of predictor variables, and c indicates the number of classes.

Name	n	d	c	Source
Cataract	601	2048	4	[37]
Chessman	556	2048	6	[38]
COVID-19	317	2048	3	[39]
Flowers	3670	2048	5	[40]
Leaves	596	2048	4	[41]
MIT-IS	15,620	2048	67	[42]
Painting	8577	2048	5	[43]
Plants	2576	2048	27	[44]
RPS	2892	2048	3	[45]
Skincancer	3297	2048	2	[46]
SRSMAS	409	2048	14	[47]
Weather	1125	2048	4	[48]

Table 7

Parameters of EvoPruneDeepTL.

Parameter	Value
Steady-state GA
Population size	30
Number of evaluations	300
Crossover probability (uniform)	0.5
Mutation probability	0.07
SGD optimizer
Learning rate ( $η$ )	0.001
Moment Nesterov	0.9
Batch size	32
Number of epochs	100

Table 8

Parameters of the proposed algorithm.

$μ$ GA	Value
Population size ( $n_{p}$ )	4
Number of generations ( $g_{m a x}$ )	50
Crossover probability ( $p_{c}$ )	0.9
Convergence threshold	0.05
Adam optimizer
Batch size	32
Number of epochs	100

Table 9

Experimental results of ACC obtained on each dataset. Seven statistics summarizing the results for each method are shown below. The best values are indicated in bold.

Name	$F_{S}$	$F_{S}^{W}$	$F_{S}^{H}$	$F_{S}^{R}$	F	$F^{W}$	$F^{H}$	$F^{R}$	$E_{S}$
Cataract	0.616	0.576	0.570	0.600	0.611	0.542	0.567	0.586	0.607
Chessman	0.798	0.775	0.769	0.800	0.778	0.764	0.741	0.801	0.756
COVID-19	0.948	0.972	0.868	0.970	0.976	0.967	0.951	0.975	0.973
Flowers	0.881	0.866	0.862	0.878	0.874	0.793	0.848	0.869	0.879
Leaves	0.899	0.891	0.865	0.894	0.888	0.894	0.857	0.886	0.901
MIT Indoor Scenes	0.697	0.679	0.666	0.694	0.638	0.573	0.556	0.673	0.735
Painting	0.937	0.933	0.931	0.938	0.934	0.918	0.922	0.930	0.935
Plants	0.349	0.357	0.323	0.366	0.320	0.226	0.281	0.327	0.376
RPS	1.000	1.000	1.000	1.000	1.000	0.999	1.000	1.000	1.000
Skincancer	0.819	0.861	0.854	0.859	0.855	0.810	0.847	0.859	0.869
SRSMAS	0.803	0.801	0.800	0.804	0.785	0.751	0.780	0.782	0.777
Weather	0.964	0.960	0.960	0.964	0.966	0.952	0.954	0.962	0.960
Statistic
Mean	0.809	0.806	0.789	0.814	0.802	0.766	0.775	0.804	0.814
STD	0.176	0.180	0.182	0.176	0.188	0.214	0.202	0.186	0.172
Median	0.850	0.864	0.858	0.869	0.865	0.802	0.848	0.864	0.874
MAD	0.093	0.092	0.081	0.082	0.094	0.133	0.105	0.090	0.098
Maximum	1.000	1.000	1.000	1.000	1.000	0.999	1.000	1.000	1.000
Minimum	0.349	0.357	0.323	0.366	0.320	0.226	0.281	0.327	0.376
Count	3	1	1	3	3	0	1	2	5

Table 10

Experimental results of $FSR$ obtained on each dataset. Seven statistics summarizing the results for each method are shown below. The best values are indicated in bold.

Name	$F_{S}$	$F_{S}^{W}$	$F_{S}^{H}$	$F_{S}^{R}$	F	$F^{W}$	$F^{H}$	$F^{R}$	$E_{S}$
Cataract	1.000	0.508	0.504	0.522	1.000	0.501	0.495	0.468	0.536
Chessman	1.000	0.492	0.501	0.223	1.000	0.502	0.502	0.333	0.667
COVID-19	1.000	0.503	0.498	0.333	1.000	0.498	0.497	0.233	0.496
Flowers	1.000	0.506	0.494	0.669	1.000	0.499	0.497	0.568	0.746
Leaves	1.000	0.501	0.507	0.349	1.000	0.504	0.498	0.357	0.614
MIT Indoor Scenes	1.000	0.497	0.501	0.629	1.000	0.501	0.504	0.488	0.935
Painting	1.000	0.499	0.503	0.693	1.000	0.497	0.505	0.586	0.728
Plants	1.000	0.502	0.499	0.506	1.000	0.501	0.500	0.555	0.856
RPS	1.000	0.500	0.497	0.298	1.000	0.498	0.501	0.474	0.240
Skincancer	1.000	0.499	0.501	0.530	1.000	0.495	0.500	0.500	0.667
SRSMAS	1.000	0.501	0.501	0.346	1.000	0.504	0.499	0.332	0.762
Weather	1.000	0.501	0.494	0.576	1.000	0.503	0.499	0.420	0.656
Statistic
Mean	1.000	0.501	0.500	0.473	1.000	0.500	0.500	0.443	0.659
STD	0.000	0.004	0.004	0.151	0.000	0.003	0.003	0.105	0.173
Median	1.000	0.501	0.501	0.514	1.000	0.501	0.500	0.471	0.667
MAD	0.000	0.002	0.003	0.160	0.000	0.003	0.002	0.091	0.087
Maximum	1.000	0.508	0.507	0.693	1.000	0.504	0.505	0.586	0.935
Minimum	1.000	0.492	0.494	0.223	1.000	0.495	0.495	0.233	0.240
Count	0	0	2	2	0	2	0	5	1

Table 11

Experimental results of HNR obtained on each dataset. Seven statistics summarizing the results for each method are shown below. The best values are indicated in bold.

Name	$F_{S}$	$F_{S}^{W}$	$F_{S}^{H}$	$F_{S}^{R}$	F	$F^{W}$	$F^{H}$	$F^{R}$	$E_{S}$
Cataract	0.352	0.367	0.447	0.384	0.096	0.117	0.132	0.093	1.000
Chessman	0.333	0.454	0.344	0.410	0.098	0.157	0.202	0.099	1.000
COVID-19	0.221	0.338	0.183	0.215	0.092	0.141	0.143	0.080	1.000
Flowers	0.326	0.382	0.344	0.273	0.100	0.127	0.116	0.098	1.000
Leaves	0.290	0.350	0.239	0.340	0.102	0.154	0.149	0.091	1.000
MIT Indoor Scenes	0.491	0.593	0.465	0.454	0.129	0.247	0.195	0.117	1.000
Painting	0.180	0.309	0.318	0.199	0.105	0.154	0.153	0.093	1.000
Plants	0.422	0.634	0.453	0.461	0.092	0.158	0.189	0.114	1.000
RPS	0.122	0.148	0.134	0.113	0.099	0.172	0.141	0.093	1.000
Skincancer	0.233	0.360	0.322	0.217	0.086	0.139	0.116	0.081	1.000
SRSMAS	0.454	0.499	0.492	0.345	0.130	0.161	0.216	0.111	1.000
Weather	0.151	0.284	0.192	0.170	0.100	0.147	0.116	0.099	1.000
Statistic
Mean	0.298	0.393	0.328	0.298	0.102	0.156	0.156	0.097	1.000
STD	0.115	0.129	0.116	0.112	0.013	0.031	0.034	0.011	0.000
Median	0.308	0.364	0.333	0.307	0.100	0.154	0.146	0.096	1.000
MAD	0.101	0.067	0.117	0.098	0.004	0.010	0.030	0.004	0.000
Maximum	0.491	0.634	0.492	0.461	0.130	0.247	0.216	0.117	1.000
Minimum	0.122	0.148	0.134	0.113	0.086	0.117	0.116	0.080	1.000
Count	0	0	0	0	2	0	0	10	0

Table 12

Experimental results of MC obtained on each dataset. Seven statistics summarizing the results for each method are shown below. The best values are indicated in bold.

Name	$F_{S}$	$F_{S}^{W}$	$F_{S}^{H}$	$F_{S}^{R}$	F	$F^{W}$	$F^{H}$	$F^{R}$	$E_{S}$
Cataract	5.5 × 10⁵	2.2 × 10⁵	3.4 × 10⁵	2.8 × 10⁵	1.0 × 10⁵	6.3 × 10⁴	7.5 × 10⁴	5.2 × 10⁴	5.6 × 10⁵
Chessman	3.6 × 10⁵	2.9 × 10⁵	2.1 × 10⁵	1.5 × 10⁵	9.9 × 10⁴	7.4 × 10⁴	1.4 × 10⁵	2.9 × 10⁴	7.0 × 10⁵
COVID-19	1.8 × 10⁵	2.5 × 10⁵	1.0 × 10⁵	8.4 × 10⁴	9.5 × 10⁴	7.8 × 10⁴	8.9 × 10⁴	2.1 × 10⁴	5.2 × 10⁵
Flowers	4.6 × 10⁵	2.5 × 10⁵	2.4 × 10⁵	2.1 × 10⁵	1.1 × 10⁵	7.0 × 10⁴	6.6 × 10⁴	4.6 × 10⁴	7.8 × 10⁵
Leaves	3.9 × 10⁵	2.5 × 10⁵	1.5 × 10⁵	1.7 × 10⁵	1.1 × 10⁵	1.0 × 10⁵	7.8 × 10⁴	3.1 × 10⁴	6.5 × 10⁵
MIT Indoor Scenes	5.6 × 10⁵	4.1 × 10⁵	3.6 × 10⁵	4.3 × 10⁵	1.5 × 10⁵	1.4 × 10⁵	1.3 × 10⁵	7.0 × 10⁴	1.0 × 10⁶
Painting	2.1 × 10⁵	1.9 × 10⁵	2.0 × 10⁵	1.6 × 10⁵	1.2 × 10⁵	6.9 × 10⁴	8.4 × 10⁴	4.4 × 10⁴	7.7 × 10⁵
Plants	5.8 × 10⁵	5.1 × 10⁵	3.0 × 10⁵	3.4 × 10⁵	1.1 × 10⁵	8.4 × 10⁴	1.5 × 10⁵	7.0 × 10⁴	9.1 × 10⁵
RPS	1.6 × 10⁵	9.8 × 10⁴	7.7 × 10⁴	4.7 × 10⁴	1.0 × 10⁵	9.6 × 10⁴	6.3 × 10⁴	6.0 × 10⁴	2.5 × 10⁵
Skincancer	2.6 × 10⁵	2.5 × 10⁵	2.3 × 10⁵	1.8 × 10⁵	6.9 × 10⁴	7.6 × 10⁴	6.3 × 10⁴	3.9 × 10⁴	7.0 × 10⁵
SRSMAS	5.7 × 10⁵	3.5 × 10⁵	3.2 × 10⁵	1.7 × 10⁵	1.7 × 10⁵	9.1 × 10⁴	1.2 × 10⁵	3.7 × 10⁴	8.1 × 10⁵
Weather	1.9 × 10⁵	2.1 × 10⁵	1.3 × 10⁵	1.2 × 10⁵	1.2 × 10⁵	7.3 × 10⁴	6.7 × 10⁴	3.5 × 10⁴	6.9 × 10⁵
Statistic
Mean	3.7 × 10⁵	2.7 × 10⁵	2.2 × 10⁵	2.0 × 10⁵	1.1 × 10⁵	8.5 × 10⁴	9.3 × 10⁴	4.5 × 10⁴	7.0 × 10⁵
STD	1.6 × 10⁵	1.0 × 10⁵	9.1 × 10⁴	1.0 × 10⁵	2.4 × 10⁴	2.0 × 10⁴	3.0 × 10⁴	1.5 × 10⁴	1.9 × 10⁵
Median	3.8 × 10⁵	2.5 × 10⁵	2.2 × 10⁵	1.7 × 10⁵	1.1 × 10⁵	7.7 × 10⁴	8.1 × 10⁴	4.1 × 10⁴	7.0 × 10⁵
MAD	1.8 × 10⁵	4.0 × 10⁴	8.6 × 10⁴	4.4 × 10⁴	1.1 × 10⁴	7.9 × 10³	1.7 × 10⁴	1.1 × 10⁴	9.4 × 10⁴
Maximum	5.8 × 10⁵	5.1 × 10⁵	3.6 × 10⁵	4.3 × 10⁵	1.7 × 10⁵	1.4 × 10⁵	1.5 × 10⁵	7.0 × 10⁴	1.0 × 10⁶
Minumum	1.6 × 10⁵	9.8 × 10⁴	7.7 × 10⁴	4.7 × 10⁴	6.9 × 10⁴	6.3 × 10⁴	6.3 × 10⁴	2.1 × 10⁴	2.5 × 10⁵
Count	0	0	0	1	0	0	0	11	0

Table 13

Experimental results of the number of objective function evaluations obtained on each dataset. Seven statistics summarizing the results for each method are shown below. The best values are indicated in bold.

Name	$F_{S}$	$F_{S}^{W}$	$F_{S}^{H}$	$F_{S}^{R}$	F	$F^{W}$	$F^{H}$	$F^{R}$	$E_{S}$
Cataract	249.2	250.8	247.6	249.2	251.2	249.6	247.2	253.2	300.0
Chessman	251.6	247.2	252.4	253.6	254.0	248.4	250.4	253.2	300.0
COVID-19	249.2	248.0	243.6	246.8	250.8	250.0	248.8	252.0	300.0
Flowers	249.2	248.8	241.2	253.6	252.8	246.8	250.8	250.4	300.0
Leaves	245.2	252.0	242.8	248.0	250.0	250.4	246.8	250.4	300.0
MIT Indoor Scenes	251.2	250.0	244.4	252.0	249.6	246.4	250.0	255.2	300.0
Painting	252.0	241.2	236.8	252.4	251.6	250.8	246.0	248.0	300.0
Plants	250.4	248.0	245.6	248.8	253.6	248.0	247.6	254.0	300.0
RPS	246.8	240.4	239.2	243.6	248.4	241.2	247.2	248.0	300.0
Skincancer	250.8	246.8	244.0	250.4	252.4	249.6	248.0	248.8	300.0
SRSMAS	249.6	255.2	254.8	249.2	253.2	253.2	247.6	258.0	300.0
Weather	250.4	250.0	237.6	250.0	253.6	244.8	248.0	252.4	300.0
Statistic
Mean	249.6	248.2	244.2	249.8	251.8	248.3	248.2	252.0	300.0
STD	1.887	3.978	5.227	2.788	1.724	3.021	1.438	2.896	0.000
Median	250.0	248.4	243.8	249.6	252.0	249.0	247.8	252.2	300.0
MAD	0.800	1.600	3.200	2.000	1.400	1.600	0.800	1.800	0.000
Maximum	252.0	255.2	254.8	253.6	254.0	253.2	250.8	258.0	300.0
Minimum	245.2	240.4	236.8	243.6	248.4	241.2	246.0	248.0	300.0
Count	0	1	9	0	0	0	2	0	0

Table 14

Results of the percentage of similarity between the binary vectors of selected features in each independent experiment in terms of the Hamming distance obtained in each dataset. Seven statistics summarizing the results for each method are shown below. The best values are indicated in bold.

Name	$F_{S}$	$F_{S}^{W}$	$F_{S}^{H}$	$F_{S}^{R}$	F	$F^{W}$	$F^{H}$	$F^{R}$	$E_{S}$
Cataract	0.000	0.498	0.499	0.333	0.000	0.499	0.497	0.369	0.500
Chessman	0.000	0.501	0.504	0.164	0.000	0.498	0.501	0.257	0.445
COVID-19	0.000	0.500	0.503	0.283	0.000	0.501	0.501	0.158	0.501
Flowers	0.000	0.499	0.500	0.276	0.000	0.498	0.497	0.219	0.379
Leaves	0.000	0.500	0.502	0.313	0.000	0.501	0.500	0.301	0.478
MIT Indoor Scenes	0.000	0.500	0.498	0.261	0.000	0.502	0.502	0.230	0.122
Painting	0.000	0.499	0.500	0.246	0.000	0.501	0.500	0.298	0.396
Plants	0.000	0.503	0.501	0.313	0.000	0.499	0.501	0.268	0.247
RPS	0.000	0.498	0.500	0.304	0.000	0.499	0.500	0.421	0.365
Skincancer	0.000	0.498	0.501	0.359	0.000	0.497	0.501	0.337	0.442
SRSMAS	0.000	0.499	0.497	0.190	0.000	0.500	0.501	0.231	0.365
Weather	0.000	0.500	0.500	0.343	0.000	0.499	0.499	0.362	0.454
Statistic
Mean	0.000	0.500	0.500	0.282	0.000	0.499	0.500	0.288	0.391
STD	0.000	0.001	0.002	0.057	0.000	0.001	0.001	0.072	0.106
Median	0.000	0.499	0.500	0.294	0.000	0.499	0.500	0.283	0.419
MAD	0.000	0.001	0.001	0.036	0.000	0.001	0.001	0.054	0.054
Maximum	0.000	0.503	0.504	0.359	0.000	0.502	0.502	0.421	0.501
Minimum	0.000	0.498	0.497	0.164	0.000	0.497	0.497	0.158	0.122
Count	0	0	0	6	0	0	0	4	2

Table 15

Runtime results (in seconds) for each independent experiment obtained on each dataset. Seven statistics summarizing the results for each method are shown.

Name	$F_{S}$	$F_{S}^{W}$	$F_{S}^{H}$	$F_{S}^{R}$	F	$F^{W}$	$F^{H}$	$F^{R}$	$E_{S}$
Cataract	3979.2	3848.3	4147.0	8806.3	5995.5	4211.9	8245.6	3216.7	5143.2
Chessman	4222.0	3843.2	4491.5	9413.0	6734.1	5138.0	7530.5	3443.4	6653.2
COVID-19	2642.1	3186.6	3186.6	3810.2	7141.2	3122.8	6294.3	2538.0	5521.4
Flowers	14,724.3	12,703.0	23,039.7	19,934.2	22,521.7	14,444.0	25,522.2	15,989.5	23,206.4
Leaves	4202.4	4107.8	4111.9	5991.3	6670.5	3240.6	7113.7	7219.4	7278.7
MIT Indoor Scenes	46,470.8	40,690.1	39,701.5	51,172.4	46,451.4	46,016.7	43,993.5	54,461.5	59,243.7
Painting	34,243.5	34,206.0	32,419.4	36,665.5	34,821.8	33,535.0	49,336.5	42,956.4	49,177.9
Plants	12,191.3	10,965.9	14,134.0	12,613.0	16,206.6	9014.7	18,021.1	11,433.4	16,839.1
RPS	11,306.1	9309.9	15,451.0	19,411.4	16,854.8	11,502.3	21,079.9	19,926.0	17,154.9
Skincancer	14,809.9	12,778.0	20,445.7	14,653.2	21,197.2	14,366.5	22,900.5	23,124.5	19,775.6
SRSMAS	3035.0	3376.3	3988.6	3536.4	5545.1	3679.4	6441.5	3114.4	4588.2
Weather	5463.2	6263.2	5925.4	6276.9	13,010.8	6269.7	10,675.6	10,159.1	7741.5
Statistic
Mean	13,107.5	12,106.5	14,253.5	16,023.7	16,929.2	12,878.5	18,929.6	16,465.2	18,527.0
STD	13,167.0	11,922.4	11,874.6	13,839.4	12,285.5	12,902.5	14,097.6	15,994.6	17,204.4
Median	8384.6	7786.5	10,029.7	11,013.0	14,608.7	7642.2	14,348.3	10,796.2	12,290.3
MAD	4877.5	4176.8	5979.5	6112.2	7893.8	4182.2	7570.7	7630.7	6958.0
Maximum	46,470.8	40,690.1	39,701.5	51,172.4	46,451.4	46,016.7	49,336.5	54,461.5	59,243.7
Minimum	2642.1	3186.6	3186.6	3536.4	5545.1	3122.8	6294.3	2538.0	4588.2

Table 16

Comparison of the experimental results of ACC obtained on each dataset, between the models obtained by the methods $F_{S}^{R}$ , $F^{R}$ and the model trained from the reference DNN architecture. Seven statistics summarizing the results for each method are shown below. The best values are indicated in bold.

Name	$F_{S}^{R}$	$F^{R}$	DNN
Cataract	0.600	0.586	0.631
Chessman	0.800	0.801	0.791
COVID-19	0.970	0.975	0.981
Flowers	0.878	0.869	0.881
Leaves	0.894	0.886	0.901
MIT Indoor Scenes	0.694	0.673	0.700
Painting	0.938	0.930	0.941
Plants	0.366	0.327	0.374
RPS	1.000	1.000	1.000
Skincancer	0.859	0.859	0.868
SRSMAS	0.804	0.782	0.815
Weather	0.964	0.962	0.966
Statistic
Mean	0.814	0.804	0.821
STD	0.176	0.186	0.172
Median	0.869	0.864	0.875
MAD	0.082	0.090	0.087
Maximum	1.000	1.000	1.000
Minimum	0.366	0.327	0.374
Count	1	2	11

References

1. Samuel, A.L. Machine learning. Technol. Rev.; 1959; 62, pp. 42-45.

2. Bengio, Y. Deep Learning; Adaptive Computation and Machine Learning Series; MIT Press: London, UK, 2016.

3. Pouyanfar, S.; Sadiq, S.; Yan, Y.; Tian, H.; Tao, Y.; Reyes, M.P.; Shyu, M.L.; Chen, S.C.; Iyengar, S.S. A survey on deep learning: Algorithms, techniques, and applications. ACM Comput. Surv.; 2019; 51, pp. 1-36. [DOI: https://dx.doi.org/10.1145/3234150]

4. Pathak, A.R.; Pandey, M.; Rautaray, S. Application of deep learning for object detection. Procedia Comput. Sci.; 2018; 132, pp. 1706-1717. [DOI: https://dx.doi.org/10.1016/j.procs.2018.05.144]

5. Liu, X.; Deng, Z.; Yang, Y. Recent progress in semantic image segmentation. Artif. Intell. Rev.; 2019; 52, pp. 1089-1106. [DOI: https://dx.doi.org/10.1007/s10462-018-9641-3]

6. Yadav, S.S.; Jadhav, S.M. Deep convolutional neural network based medical image classification for disease diagnosis. J. Big Data; 2019; 6, 113. [DOI: https://dx.doi.org/10.1186/s40537-019-0276-2]

7. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, L.; Polosukhin, I. Attention is All You Need. Proceedings of the 31st International Conference on Neural Information Processing Systems (NIPS’17); Long Beach, CA, USA, 4–9 December 2017; pp. 6000-6010.

8. Wen, Y.W.; Peng, S.H.; Ting, C.K. Two-stage evolutionary neural architecture search for transfer learning. IEEE Trans. Evol. Comput.; 2021; 25, pp. 928-940. [DOI: https://dx.doi.org/10.1109/TEVC.2021.3097937]

9. Zhuang, F.; Qi, Z.; Duan, K.; Xi, D.; Zhu, Y.; Zhu, H.; Xiong, H.; He, Q. A Comprehensive Survey on Transfer Learning. Proc. IEEE; 2021; 109, pp. 43-76. [DOI: https://dx.doi.org/10.1109/JPROC.2020.3004555]

10. Aggarwal, C.C. Neural Networks and Deep Learning; Springer: Cham, Switzerland, 2018.

11. Liu, W.; Wang, Z.; Liu, X.; Zeng, N.; Liu, Y.; Alsaadi, F.E. A survey of deep neural network architectures and their applications. Neurocomputing; 2017; 234, pp. 11-26. [DOI: https://dx.doi.org/10.1016/j.neucom.2016.12.038]

12. Barraza, J.F.; Droguett, E.L.; Martins, M.R. Towards Interpretable Deep Learning: A Feature Selection Framework for Prognostics and Health Management Using Deep Neural Networks. Sensors; 2021; 21, 5888. [DOI: https://dx.doi.org/10.3390/s21175888]

13. Poyatos, J.; Molina, D.; Martinez, A.D.; Del Ser, J.; Herrera, F. EvoPruneDeepTL: An evolutionary pruning model for transfer learning based deep neural networks. Neural Netw.; 2022; 158, pp. 59-82. [DOI: https://dx.doi.org/10.1016/j.neunet.2022.10.011]

14. Goldberg, D.E. Genetic Algorithms in Search, Optimization, and Machine Learning; Addison Wesley: Boston, MA, USA, 1989.

15. Ledesma, S.; Cerda, G.; Aviña, G.; Hernández, D.; Torres, M. Feature Selection Using Artificial Neural Networks. MICAI 2008: Advances in Artificial Intelligence; Springer: Berlin/Heidelberg, Germany, 2008; pp. 351-359. [DOI: https://dx.doi.org/10.1007/978-3-540-88636-5_34]

16. Krishnakumar, K. Micro-Genetic Algorithms For Stationary And Non-Stationary Function Optimization. Intelligent Control and Adaptive Systems; Rodriguez, G. SPIE: Bellingham, WA, USA, 1990; [DOI: https://dx.doi.org/10.1117/12.969927]

17. Goldberg, D.E. Sizing populations for serial and parallel genetic algorithms. Proceedings of the Third International Conference on Genetic Algorithms; San Francisco, CA, USA, 1 June 1989; pp. 70-79.

18. Dubey, A.; Saxena, A. An evolutionary feature selection technique using polynomial neural network. Int. J. Comput. Sci. Issues; 2011; 8, 494.

19. Mohammed, T.A.; Alhayali, S.; Bayat, O.; Uçan, O.N. Feature Reduction Based on Hybrid Efficient Weighted Gene Genetic Algorithms with Artificial Neural Network for Machine Learning Problems in the Big Data. Sci. Program.; 2018; 2018, 2691759. [DOI: https://dx.doi.org/10.1155/2018/2691759]

20. Üstün, O.; Bekiroğlu, E.; Önder, M. Design of highly effective multilayer feedforward neural network by using genetic algorithm. Expert Syst.; 2020; 37, e12532. [DOI: https://dx.doi.org/10.1111/exsy.12532]

21. Luo, X.; Oyedele, L.O.; Ajayi, A.O.; Akinade, O.O.; Delgado, J.M.D.; Owolabi, H.A.; Ahmed, A. Genetic algorithm-determined deep feedforward neural network architecture for predicting electricity consumption in real buildings. Energy AI; 2020; 2, 100015. [DOI: https://dx.doi.org/10.1016/j.egyai.2020.100015]

22. Arroyo, J.C.T.; Delima, A.J.P. An Optimized Neural Network Using Genetic Algorithm for Cardiovascular Disease Prediction. J. Adv. Inf. Technol.; 2022; 13, pp. 95-99. [DOI: https://dx.doi.org/10.12720/jait.13.1.95-99]

23. Souza, F.; Matias, T.; Araójo, R. Co-evolutionary genetic Multilayer Perceptron for feature selection and model design. Proceedings of the International Conference on Emerging Technologies and Factory Automation (ETFA2011); Toulouse, France, 5–9 September 2011; pp. 1-7. [DOI: https://dx.doi.org/10.1109/ETFA.2011.6059084]

24. Pham, T.A.; Tran, V.Q.; Vu, H.L.T.; Ly, H.B. Design deep neural network architecture using a genetic algorithm for estimation of pile bearing capacity. PLoS ONE; 2020; 15, e0243030. [DOI: https://dx.doi.org/10.1371/journal.pone.0243030]

25. Baldominos, A.; Saez, Y.; Isasi, P. Hybridizing Evolutionary Computation and Deep Neural Networks: An Approach to Handwriting Recognition Using Committees and Transfer Learning. Complexity; 2019; 2019, 2952304. [DOI: https://dx.doi.org/10.1155/2019/2952304]

26. Tian, H.; Chen, S.C.; Shyu, M.L. Genetic Algorithm Based Deep Learning Model Selection for Visual Data Classification. Proceedings of the 2019 IEEE 20th International Conference on Information Reuse and Integration for Data Science (IRI); Los Angeles, CA, USA, 30 July–1 August 2019; [DOI: https://dx.doi.org/10.1109/iri.2019.00032]

27. de Lima Mendes, R.; da Silva Alves, A.H.; de Souza Gomes, M.; Bertarini, P.L.L.; do Amaral, L.R. Many Layer Transfer Learning Genetic Algorithm (MLTLGA): A New Evolutionary Transfer Learning Approach Applied To Pneumonia Classification. Proceedings of the 2021 IEEE Congress on Evolutionary Computation (CEC); Krakow, Poland, 28 June–1 July 2021; [DOI: https://dx.doi.org/10.1109/cec45853.2021.9504912]

28. Li, C.; Jiang, J.; Zhao, Y.; Li, R.; Wang, E.; Zhang, X.; Zhao, K. Genetic Algorithm based hyper-parameters optimization for transfer convolutional neural network. Proceedings of the International Conference on Advanced Algorithms and Neural Networks (AANN 2022); Zhuhai, China, 25–27 February 2022; [DOI: https://dx.doi.org/10.48550/ARXIV.2103.03875]

29. Bibi, R.; Mehmood, Z.; Munshi, A.; Yousaf, R.M.; Ahmed, S.S. Deep features optimization based on a transfer learning, genetic algorithm, and extreme learning machine for robust content-based image retrieval. PLoS ONE; 2022; 17, e0274764. [DOI: https://dx.doi.org/10.1371/journal.pone.0274764]

30. He, K.; Zhang, X.; Ren, S.; Sun, J. Deep Residual Learning for Image Recognition. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR); Las Vegas, NV, USA, 27–30 June 2016.

31. Casillas, E.S.M.; Osuna-Enciso, V. Architecture Optimization of Convolutional Neural Networks by Micro Genetic Algorithms. Metaheuristics in Machine Learning: Theory and Applications; Springer International Publishing: Cham, Switzerland, 2021; pp. 149-167. [DOI: https://dx.doi.org/10.1007/978-3-030-70542-8_7]

32. Deng, J.; Dong, W.; Socher, R.; Li, L.J.; Li, K.; Fei-Fei, L. ImageNet: A large-scale hierarchical image database. Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition; Miami, FL, USA, 20–25 June 2009; pp. 248-255. [DOI: https://dx.doi.org/10.1109/CVPR.2009.5206848]

33. Cormen, T.H.; Leiserson, C.E.; Rivest, R.L.; Stein, C. Introduction to Algorithms; 2nd ed. MIT Press: London, UK, 2001.

34. Deb, K. An efficient constraint handling method for genetic algorithms. Comput. Methods Appl. Mech. Eng.; 2000; 186, pp. 311-338. [DOI: https://dx.doi.org/10.1016/S0045-7825(99)00389-8]

35. Ding, C.; Peng, H. Minumum Redundancy Feature Selection from Microarray Gene Expression Data. J. Bioinform. Comput. Biol.; 2005; 3, pp. 185-205. [DOI: https://dx.doi.org/10.1142/S0219720005001004]

36. Deb, K. Optimization for Engineering Design: Algorithms and Examples; Prentice-Hall of India Private Limited: Delhi, India, 2012.

37. github user: Yiweichen04. Cataract Dataset. 2016; Available online: https://github.com/yiweichen04/retina_dataset (accessed on 17 August 2023).

38. kaggle user: Nitesh Yadav. Chessman Image Dataset. 2016; Available online: https://www.kaggle.com/datasets/niteshfre/chessman-image-dataset (accessed on 17 August 2023).

39. kaggle user: Pranav Raikote. COVID-19 Image Dataset. 2020; Available online: https://www.kaggle.com/datasets/pranavraikokte/covid19-image-dataset (accessed on 17 August 2023).

40. Team, T.T. Flowers. 2019; Available online: http://download.tensorflow.org/example_images/flower_photos.tgz (accessed on 17 August 2023).

41. Rauf, H.T.; Saleem, B.A.; Lali, M.I.U.; Khan, M.A.; Sharif, M.; Bukhari, S.A.C. A Citrus Fruits and Leaves Dataset for Detection and Classification of Citrus Diseases through Machine Learning. 2019; Available online: https://data.mendeley.com/datasets/3f83gxmv57/2 (accessed on 17 August 2023).

42. kaggle user: Muhammad Ahmad. MIT Indoor Scenes. 2019; Available online: https://www.kaggle.com/datasets/itsahmad/indoor-scenes-cvpr-2019 (accessed on 17 August 2023).

43. Museum, V.R. Art Images: Drawing/Painting/Sculptures/Engravings. 2018; Available online: https://www.kaggle.com/datasets/thedownhill/art-images-drawings-painting-sculpture-engraving (accessed on 17 August 2023).

44. Singh, D.; Jain, N.; Jain, P.; Kayal, P.; Kumawat, S.; Batra, N. PlantDoc: A Dataset for Visual Plant Disease Detection. 2020; Available online: https://github.com/pratikkayal/PlantDoc-Dataset (accessed on 17 August 2023).

45. Moroney, L. Rock, Paper, Scissors Dataset. 2019; Available online: https://www.tensorflow.org/datasets/catalog/rock_paper_scissors?hl=es-419 (accessed on 17 August 2023).

46. Collaboration, I.S.I. Skin Cancer: Malignant vs. Benign. 2019; Available online: https://www.kaggle.com/datasets/fanconic/skin-cancer-malignant-vs-benign (accessed on 17 August 2023).

47. Gómez-Ríos, A.; Tabik, S.; Luengo, J.; Shihavuddin, A.; Herrera, F. Coral Species Identification with Texture or Structure Images Using a Two-Level Classifier Based on Convolutional Neural Networks. 2019; Available online: https://sci2s.ugr.es/CNN-coral-image-classification (accessed on 17 August 2023).

48. Oluwafemi, A.G.; Zenghui, W. Multi-Class Weather Classification from Still Image Using Said Ensemble Method. 2019; Available online: https://www.kaggle.com/datasets/somesh24/multiclass-images-for-weather-classification (accessed on 17 August 2023).

49. Kingma, D.P.; Ba, J. Adam: A Method for Stochastic Optimization. arXiv; 2014; [DOI: https://dx.doi.org/10.48550/ARXIV.1412.6980]

50. Zhou, Z.H. Ensemble Methods: Foundations and Algorithms; CRC Press: Boca Raton, FL, USA, 2012.

Word count: 13581

Show less

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/). Notwithstanding the ProQuest Terms and Conditions, you may use this content in accordance with the terms of the License.

Abstract

Translate

This work proposes the use of a micro genetic algorithm to optimize the architecture of fully connected layers in convolutional neural networks, with the aim of reducing model complexity without sacrificing performance. Our approach applies the paradigm of transfer learning, enabling training without the need for extensive datasets. A micro genetic algorithm requires fewer computational resources due to its reduced population size, while still preserving a substantial degree of the search capabilities found in algorithms with larger populations. By exploring different representations and objective functions, including classification accuracy, hidden neuron ratio, minimum redundancy, and maximum relevance for feature selection, eight algorithmic variants were developed, with six variants performing both hidden layers reduction and feature-selection tasks. Experimental results indicate that the proposed algorithm effectively reduces the architecture of the fully connected layers in the convolutional neural network. The variant achieving the best reduction used only 44% of the convolutional features in the input layer, and only 9.7% of neurons in the hidden layers, without negatively impacting (statistically confirmed) classification accuracy when compared to a network model based on a full reference architecture and a representative method from the literature.

Details

Title

Optimization of Deep Neural Networks Using a Micro Genetic Algorithm

Author

Landa, Ricardo¹

; Tovias-Alanis, David¹; Toscano, Gregorio²

¹ Tamaulipas Campus, Cinvestav, Cd. Victoria 87130, Mexico; [email protected]
² Department of Electrical Engineering and Computer Science, The Catholic University of America, Washington, DC 20064, USA; [email protected]

First page

2651

Publication year

2024

Publication date

2024

Publisher

MDPI AG

e-ISSN

26732688

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.3390/ai5040127

ProQuest document ID

3149498890

Optimization of Deep Neural Networks Using a Micro Genetic Algorithm

Jump to:

Full text

1.1. Transfer Learning

1.2. Optimization Algorithms

2. State of the Art

2.1. Approaches with Transfer Learning That Optimize Fully Connected Layers

2.2. Limitations for Optimization Algorithms

3. A Deep Neural Network Based on a Micro Genetic Algorithm ( $μ$ GA-DNN)

3.1. Objective Functions

3.1.1. Maximizing Classification Accuracy

3.1.2. Maximizing Classification Accuracy and Percentage of Hidden Neurons Removed

3.1.3. Maximizing Classification Accuracy and MRMR Feature Selection Criteria

3.1.4. Maximizing Classification Accuracy, Percentage of Hidden Neurons Removed, and MRMR Criterion

3.2. Representations of Solutions

3.2.1. First Representation: $μ$ GA-1

3.2.2. Second Representation: $μ$ GA-FS

3.2.3. Third Representation: $μ$ GA-MRMR

3.3. Naming Variants of $μ$ GA-DNN

3.4. Pseudocode of the Proposed Algorithm

4. Experimental Framework

4.1. Datasets

4.2. Reference Methods

4.2.1. EvoPruneDeepTL Algorithm

4.2.2. DNN Architecture

4.3. Parameters of the Proposed Algorithm

4.4. Resampling Method

5.1.1. Classification Accuracy

5.1.2. Feature Selected Ratio

5.1.3. Hidden Neurons Ratio

5.1.4. Model Complexity

5.1.5. Number of Objective Function Evaluations

5.1.6. Hamming Distance Results

5.1.7. Runtime Results

5.2. Comparative Results with the Full Reference Model

Abstract

Details

Suggested sources

Optimization of Deep Neural Networks Using a Micro Genetic Algorithm

Jump to:

Full text

1.1. Transfer Learning

1.2. Optimization Algorithms

2. State of the Art

2.1. Approaches with Transfer Learning That Optimize Fully Connected Layers

2.2. Limitations for Optimization Algorithms

3. A Deep Neural Network Based on a Micro Genetic Algorithm (μGA-DNN)

3.1. Objective Functions

3.1.1. Maximizing Classification Accuracy

3.1.2. Maximizing Classification Accuracy and Percentage of Hidden Neurons Removed

3.1.3. Maximizing Classification Accuracy and MRMR Feature Selection Criteria

3.1.4. Maximizing Classification Accuracy, Percentage of Hidden Neurons Removed, and MRMR Criterion

3.2. Representations of Solutions

3.2.1. First Representation: μGA-1

3.2.2. Second Representation: μGA-FS

3.2.3. Third Representation: μGA-MRMR

3.3. Naming Variants of μGA-DNN

3.4. Pseudocode of the Proposed Algorithm

4. Experimental Framework

4.1. Datasets

4.2. Reference Methods

4.2.1. EvoPruneDeepTL Algorithm

4.2.2. DNN Architecture

4.3. Parameters of the Proposed Algorithm

4.4. Resampling Method

5.1.1. Classification Accuracy

5.1.2. Feature Selected Ratio

5.1.3. Hidden Neurons Ratio

5.1.4. Model Complexity

5.1.5. Number of Objective Function Evaluations

5.1.6. Hamming Distance Results

5.1.7. Runtime Results

5.2. Comparative Results with the Full Reference Model

Abstract

Details

3. A Deep Neural Network Based on a Micro Genetic Algorithm ( $μ$ GA-DNN)

3.2.1. First Representation: $μ$ GA-1

3.2.2. Second Representation: $μ$ GA-FS

3.2.3. Third Representation: $μ$ GA-MRMR

3.3. Naming Variants of $μ$ GA-DNN