1. Introduction
Analysis of gene expression levels is essential for studying and detecting gene functions. According to Chandra and Tripathi [1], genes that have similar gene expression levels are likely to be involved in similar biological functions. The authors showed that the clustering process was quite useful for identifying co-expressed genes within a group and, in addition, for detecting unique genes across different groups. Therefore, clustering can be quite helpful for extracting valuable knowledge from large amounts of biological data [2], which could lead to advances in prevention, prognosis, and treatment in biomedical research.
Cai et al. [3] developed a random walk-based technique to cluster similar genes. The authors showed that the proposed method was useful for strengthening the interactions between genes by considering the types of interactions that exist within the same group of genes. Many previous random walk-based methods managed to extract local information from a large graph without knowledge of the whole graph [4]. In a random walk-based method, a gene is important if it interacts with many other genes [5,6,7,8]. As illustrated in Figure 1, gene 1 has a higher degree (two outgoing links) than gene 3, which has only one outgoing link to gene 4. In this case, gene 1 is the most important gene among the four genes shown in the hypothetical gene network.
Several previous studies have noted the importance of clustering for identifying co-expressed genes in one cluster and inactive genes in another [1,9]. Clustering can also discover the fundamental hidden structure of biomedical data, which can be used for diagnosis and treatment [9]. In addition, clustering is vital for identifying cancer subtypes and detecting tumors.
Researchers typically approach clustering by assuming the number of clusters beforehand, as can be seen in [10,11]. This assumption can prevent clustering techniques from obtaining an optimal number of centroids and hence results in poor-quality clusters [11,12]. In previous studies, several proposed approaches managed to discover the optimal number of clusters by tuning and optimizing the parameters of the clustering method. This can be done by analyzing the eigenvalues of the affinity matrix, where the number of dominant eigenvalues corresponds to the number of desired clusters [13]. In addition, rotating normalized eigenvectors and squared-loss mutual information (SMI) can be employed in the clustering process to obtain an optimal number of clusters [14,15]. The elbow method and the average silhouette method are other examples used to identify the optimal number of clusters in previous studies [15,16]. The elbow method identifies the optimal number of clusters by calculating the sum of squared errors for each number of clusters (k) over a range of k values. The average silhouette method computes the average silhouette value of the genes for each value of k and then selects the number of clusters with the maximum average silhouette value from that range. Optimization of the objective function and validation of clustering can improve the quality of clusters [11]. Optimizing the objective function of clustering identifies the best solution among a set of candidate solutions. Clustering validation, on the other hand, determines clusters in the data using an appropriate measurement [17]; it can also evaluate the goodness of the clustering structure against given class labels [18]. Thus, validation is an essential step because it helps to identify which cluster is more informative than the others [19].
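To make the elbow and average-silhouette procedures concrete, the following is a minimal sketch in Python using scikit-learn; the synthetic data matrix and the range of k values are illustrative assumptions, not taken from the reviewed studies.

```python
# Minimal sketch of the elbow and average-silhouette methods for choosing k.
# The data matrix is a synthetic stand-in for a gene-expression matrix.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 20))               # genes x conditions (placeholder)

sse, sil = {}, {}
for k in range(2, 11):                       # candidate numbers of clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse[k] = km.inertia_                     # within-cluster SSE (elbow method)
    sil[k] = silhouette_score(X, km.labels_) # average silhouette width

best_k = max(sil, key=sil.get)               # silhouette method: take the maximum
print(best_k, sse[best_k], sil[best_k])
```

The elbow is read off the `sse` curve where the decrease flattens, while the silhouette method selects `best_k` directly as the maximizer of the average silhouette value.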
This paper focuses on reviewing existing computational methods for gene clustering, with attention to optimization of the objective function and clustering validation.
2. Gene Network Clustering Techniques
In general, clustering can be categorized into partitioning, hierarchical, grid-based, and density-based techniques [11,17,20,21,22]. Table 1 shows the differences among these categories, including time complexity, computing efficiency, convergence rate, scalability, and whether the number of clusters must be initialized. Partitioning clustering assigns the data objects to a number of clusters fixed beforehand: it identifies a set of centroids and assigns each object to the nearest centroid. Hierarchical clustering groups the data based on the distances between objects to form clusters; it can either start from the whole dataset and split it into smaller groups, or start from small groups and merge them until all the data form one large group. Grid-based clustering divides each dimension of the data space to form a grid structure. Density-based clustering separates the data according to the density of the objects. Traditionally, hierarchical, grid-based, and density-based techniques do not require the cluster number as an input parameter [20,23]. In the view of Jain [17], hierarchical clustering is more versatile than partitioning clustering. Grid-based and density-based techniques have been particularly useful for discovering clusters with good robustness and flexibility [24]. They are also helpful for dealing with large spatial data and for making proper use of expert knowledge. Grid-based and density-based techniques also aim to identify data densities and to split the data space into grid structures when looking for groupings [25]. Grid-based clustering techniques are more efficient than density-based clustering techniques; however, their use of summarized information makes them lose effectiveness as the number of dimensions increases [26].
Table 2 presents several examples of clustering techniques used by previous researchers, together with their advantages and disadvantages. From this table, k-means clustering is the most popular technique, even though it requires the number of potential clusters to be identified before clustering.
Based on the clustering techniques reviewed in Table 2, this experimental work investigates which category of clustering techniques performs better in clustering genes. Gene expression data from the leukemia microarray study by Golub et al. [49] are used; these data consist of 3051 genes and 38 tumor mRNA samples (27 acute lymphoblastic leukemia (ALL) and 11 acute myeloid leukemia (AML)) [50]. The clustering techniques investigated are k-means clustering (partitioning), agglomerative nesting (AGNES) (hierarchical), clustering in quest (CLIQUE) (grid-based), and density-based spatial clustering of applications with noise (DBSCAN) (density-based). The results, in terms of classification accuracy, are shown in Table 3. The experimental work was carried out using stratified ten-fold cross-validation with a support vector machine as the classifier, and the selected clusters in Table 3 were validated based on silhouette width. According to Table 3, CLIQUE achieved the highest classification accuracy on the leukemia dataset compared to the other clustering techniques. Table 3 also shows that several genes were biologically validated as prognostic markers for leukemia using PubMed text mining. A prognostic marker is commonly used to differentiate between good and poor disease outcomes [51], and this validation was done to show the relationship between genes and prognostic markers of leukemia [52]. Although CLIQUE achieved the best classification accuracy, it identified only 67 prognostic markers of leukemia out of the 919 genes in its selected cluster (7.29%). K-means, on the other hand, had the best performance in identifying prognostic markers of leukemia (8%). The remaining techniques achieved between 6% and 8% in identifying prognostic markers of leukemia relative to the number of genes in their selected clusters.
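The sketch below illustrates this evaluation protocol (cluster the genes, select the cluster with the best silhouette width, then classify the samples with an SVM under stratified ten-fold cross-validation). It is a plausible reconstruction, not the authors' code; the preprocessing and parameter choices are assumptions.

```python
# Hedged reconstruction of the cluster-then-classify protocol described above.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.svm import SVC

def evaluate(expr, sample_labels, k=2):
    """expr: genes x samples matrix; sample_labels: tumor class per sample."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(expr)
    sil = silhouette_samples(expr, km.labels_)        # silhouette width per gene
    # keep the cluster whose genes have the highest mean silhouette width
    best = max(range(k), key=lambda c: sil[km.labels_ == c].mean())
    selected = expr[km.labels_ == best]               # genes in the chosen cluster
    cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
    # classify samples (rows) using only the selected genes as features
    scores = cross_val_score(SVC(), selected.T, sample_labels, cv=cv)
    return best, selected.shape[0], scores.mean()
```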
2.1. Category 1: Partitioning Clustering
Detection of clusters using partitioning clustering has low time complexity and high computing efficiency [53]. However, there are specific problems with this technique. One of them is that it is not appropriate for detecting clusters in non-convex data, possibly because the technique cannot spatially separate such data [54]. Other disadvantages are the need to initialize the number of clusters beforehand and the sensitivity of the clustering result to that intended number of clusters. Fuzzy C Means (FCM), k-means clustering, Partitioning Around Medoids (PAM), and Self-Organizing Maps (SOM) are all examples of partitioning clustering [9,10,11,12,27,28,29,30,31,32,33,34,35,36,37,38]. PAM is a variation of k-means clustering [55] and is more robust in terms of accuracy than k-means clustering, for instance, when applied to classify cancer types [56,57].
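As an illustration of why PAM can be more robust than k-means, the sketch below implements a naive PAM-style k-medoids: because medoids must be actual data points, extreme values pull the cluster representative less than they pull a mean. This alternating version is a simplified stand-in for the full PAM swap procedure, not a production implementation.

```python
# Naive PAM-style k-medoids for illustration; real PAM uses a costed swap phase.
import numpy as np

def k_medoids(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    D = np.linalg.norm(X[:, None] - X[None, :], axis=-1)  # pairwise distances
    medoids = rng.choice(len(X), size=k, replace=False)   # initial medoid indices
    for _ in range(n_iter):
        labels = D[:, medoids].argmin(axis=1)             # assign to nearest medoid
        new = []
        for c in range(k):
            members = np.where(labels == c)[0]
            intra = D[np.ix_(members, members)].sum(axis=1)  # distance to cluster mates
            new.append(members[intra.argmin()])           # point minimizing total distance
        new = np.array(new)
        if np.array_equal(new, medoids):                  # converged
            break
        medoids = new
    return medoids, labels
```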
2.2. Category 2: Hierarchical Clustering
Hierarchical clustering’s scalability is relatively high in cluster detection [53]. One benefit of the method is that it can easily detect the hierarchical relationships among clusters. However, its major drawback is the high computational cost. Hierarchical clustering is categorized into agglomerative (bottom-up) and divisive (top-down) approaches [2,35,58], which differ in the way clusters are merged and in how node levels are identified [58]. Agglomerative hierarchical clustering (AHC) combines the most adjacent pair of clusters, forming groups from the bottom up [59]. Several strategies are used in AHC to identify the distance between clusters: single linkage, complete linkage, centroid linkage, average linkage, Ward’s method, and the probability-based method [25,58,59]. Divisive hierarchical clustering, on the other hand, is useful for identifying clusters with different densities and shapes [58,59]. It starts with all samples in one group and then splits them into two sub-clusters, which are divided into further sub-clusters, and so on [58]. For AHC, the node level is the diameter of the new cluster formed at each merging step; for divisive methods, the node level is the diameter of the group being divided. Agglomerative nesting (AGNES), EISEN clustering, and divisive analysis (DIANA) are examples of hierarchical clustering [19,34]. Garzón and González [19] used these clustering techniques to group similar genes before the gene selection step.
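A short AGNES-style example with SciPy follows, showing how the linkage strategies listed above are selected; the data and the cut into two clusters are placeholders.

```python
# Agglomerative (AGNES-style) clustering with SciPy.
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(1)
X = rng.normal(size=(100, 10))            # placeholder observation matrix

# Linkage strategy chooses the between-cluster distance:
# "single", "complete", "average", "centroid", or "ward".
Z = linkage(X, method="average")
labels = fcluster(Z, t=2, criterion="maxclust")  # cut the tree into two clusters
```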
2.3. Category 3: Grid-Based Clustering
Grid-based clustering divides the entire data space into multiple non-overlapping grid structures [24,59]. It performs faster than density-based clustering, since dividing the data space into grids reduces the time complexity [22,60]. CLIQUE, the grid-clustering technique for high-dimensional very large spatial databases (GCHL), and the statistical information grid (STING) are examples of grid-based clustering [39,40,41,42,43]. The GCHL technique can discover concave (deeper) and convex (higher) regions when applied in medical and geographical fields, using the average eight direction (AED) technique [26,41]. However, these techniques struggle to identify complex clusters in high-dimensional data. CLIQUE partitions the data space into cells and searches subspaces by counting the number of points in each cell [61]. Searching for a suitable set of dimensions for each cluster forms the candidate subspace for the cluster centroid, and different groups of points are clustered in different subspaces [62].
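The grid-counting idea behind CLIQUE can be sketched in a few lines: partition each dimension into equal-width intervals, count the points per cell, and keep the dense cells. Real CLIQUE additionally searches dense subspaces bottom-up, which this toy version omits; the bin count and density threshold are arbitrary.

```python
# Toy illustration of the grid-counting step behind CLIQUE.
import numpy as np

def dense_cells(X, bins=10, density_threshold=5):
    hist, edges = np.histogramdd(X, bins=bins)       # point counts per grid cell
    return np.argwhere(hist >= density_threshold), edges  # indices of dense cells

rng = np.random.default_rng(2)
X = rng.normal(size=(500, 2))                        # placeholder 2-D data
cells, _ = dense_cells(X, bins=8, density_threshold=10)
```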
2.4. Category 4: Density-Based Clustering
Density-based clustering treats regions of the data space that contain points with high density as clusters [59], and the clustering can be characterized by its mechanism of density aggregation [45]. A significant advantage of density-based clustering is that it can discover clusters of different shapes and detect noise in the data [22,24,63]. However, density-based clustering has a high runtime when detecting clusters [64]. DBSCAN, random walk, and Relative Core Merge (RECOME) are examples of density-based clustering [44,45,46,47,48]. Historically, the random walk builds on the theory of Markov chains [48,65]. In most studies, the random walk has been used to infer and to optimize the structural properties of networks [65,66]. Much of the current literature on the random walk concerns ranking genes by their specific probabilities from high to low [67,68]. In the literature, a random walk mostly uses topological similarity in networks to identify genes associated with the same disease.
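A minimal DBSCAN example with scikit-learn follows; the `eps` and `min_samples` values are illustrative and would need tuning, which is exactly the parameter-selection difficulty noted in Table 2.

```python
# Minimal DBSCAN example; points labeled -1 are treated as noise.
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(3)
X = rng.normal(size=(200, 5))                 # placeholder data

db = DBSCAN(eps=1.5, min_samples=5).fit(X)    # neighborhood radius and density
labels = db.labels_                           # cluster ids; -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
```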
3. Optimization for Objective Function of Partitioning Clustering Techniques
Optimization of the objective function can improve the efficiency of partitioning clustering techniques during initialization of the intended cluster number [11,33]. Swarm intelligence is widely used to optimize the objective function of a clustering problem, and the number of intended clusters can be predicted from the search patterns of the swarm [69,70]. Swarm intelligence can also be applied by maximizing or minimizing the objective function of clustering [69,71,72]. In most studies, swarm intelligence has been used mainly in the field of optimization [73,74].
Swarm intelligence refers to the collective behavior of decentralized, self-organized systems of living creatures. Swarm intelligence systems typically consist of a population of simple agents or boids interacting locally with one another and with their environment. The inspiration often comes from nature, especially biological systems [75,76].
Swarm-based techniques model the behavior of animals and insects such as bees, ants, birds, and fish [74,77]. Recent studies have used swarm intelligence to solve challenging real-world problems in networking, traffic routing, robotics, economics, industry, games, and so on [73,74]. Hence, clustering techniques can benefit from swarm intelligence [74].
Swarm intelligence can optimize the objective function of clustering based on population and evolution strategies [11,33]. The objective function is usually used to determine the fitness of each particle: the population consists of a set of particles (known as a swarm), and each particle represents a candidate solution. Table 4 compares the use of optimization in population and evolution strategies. Both strategies are designed to imitate the best features found in nature and to produce better-quality solutions efficiently [78,79]. Previous studies have explored optimization with population sizes of more than 1000 per generation before convergence, but this was not computationally efficient [80].
Table 5 summarizes existing swarm intelligence techniques by strategy, together with their usage. Xu et al. [81] found particle swarm optimization (PSO) to be faster than both the artificial bee colony (ABC) and the genetic algorithm (GA) because PSO can perform without any complicated evolution. Previous studies have also noted drawbacks of ABC: limited exploitation ability, slow convergence speed, and low-quality solutions [82]. In their review of GA and PSO algorithms, Gandomi et al. [79] identified the main purpose of these techniques as solving significant problems quickly.
3.1. Strategy 1: Population-Based Optimization
Population-based optimization is performed in terms of exploration and exploitation [69,100]. Exploration is the ability of a technique to reach the best solution within the search space, while exploitation expresses its ability to reach a global optimum around the obtained local solutions. Metaheuristic search can be applied in this form of optimization to find globally optimal solutions using informative parameters. However, the optimization still has difficulty avoiding local minima and early convergence [11,33,101]. Several examples of population-based optimization are reviewed here: ant colony optimization (ACO), ant lion optimization (ALO), the firefly algorithm (FA), and particle swarm optimization (PSO) [11,33,70,71,77,81,83,86,89].
In the literature related to PSO, most previous studies used PSO because it does not involve any complex evolution operators [81]. Fister et al. [70] found that FA is suitable for multi-modal optimization and converges quickly.
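To connect the strategy to the clustering objective, the sketch below shows one common formulation (not tied to any single reviewed study): each particle encodes k candidate centroids, fitness is the within-cluster sum of squared errors, and the swarm follows the standard PSO velocity update. The inertia and acceleration coefficients are conventional defaults, assumed for illustration.

```python
# PSO sketch for clustering: particles are sets of k centroids, fitness is SSE.
import numpy as np

def pso_cluster(X, k, n_particles=20, n_iter=50, w=0.7, c1=1.5, c2=1.5, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pos = rng.uniform(X.min(0), X.max(0), size=(n_particles, k, d))
    vel = np.zeros_like(pos)

    def sse(cents):  # fitness: within-cluster sum of squared errors
        dist = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(-1)
        return dist.min(1).sum()

    pbest = pos.copy()
    pbest_fit = np.array([sse(p) for p in pos])
    gbest = pbest[pbest_fit.argmin()]                 # best centroids seen so far
    for _ in range(n_iter):
        r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
        # standard velocity update: inertia + cognitive + social terms
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos += vel
        fit = np.array([sse(p) for p in pos])
        improved = fit < pbest_fit
        pbest[improved], pbest_fit[improved] = pos[improved], fit[improved]
        gbest = pbest[pbest_fit.argmin()]
    return gbest  # best set of k centroids found by the swarm
```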
3.2. Strategy 2: Evolution-Based Optimization
Evolution-based optimization involves the processes of selection, recombination, and mutation [102]. In evolution strategies, selection can fail to deal with changing environments and can endanger the self-adaptation of the control parameters (the internal model) [103,104]. Recombination (in its discrete and intermediate forms) operates with control parameters on object variables, standard deviations, and rotation angles. The mutation mechanism lets the technique evolve its control parameters (standard deviations and covariances). Evolution-based optimization can optimize mathematical functions with continuously changeable parameters and can be extended to solve discrete optimization problems. This strategy can deliver high-quality solutions and allows the technique to move toward better solutions in the search space using a population [105,106]. GA is one technique using the evolution strategy; it is commonly used for clustering and is based on selection, crossover, and mutation. In previous studies, several algorithms were derived from GA, such as evolution strategy (ES) and evolutionary programming (EP) [92]. The memetic algorithm is an extension of GA that includes local search optimization for problem solving [97,98,99]. Genetic programming (GP) is another extension of GA that has been successfully applied to many problems [95,96]. Moreover, gene expression programming (GEP) uses linear chromosomes and has been applied to symbolic regression and block stacking [93,94].
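In the same clustering setting as the PSO sketch above, a minimal GA-style counterpart can be written as follows: chromosomes encode k centroids and evolve through selection, uniform crossover (recombination), and Gaussian mutation. The population size and rates are illustrative assumptions.

```python
# Minimal GA sketch for clustering: selection, crossover, and mutation on centroids.
import numpy as np

def ga_cluster(X, k, pop_size=30, n_gen=50, mut_rate=0.1, seed=0):
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    pop = rng.uniform(X.min(0), X.max(0), size=(pop_size, k, d))

    def sse(cents):  # fitness: within-cluster sum of squared errors
        dist = ((X[:, None, :] - cents[None, :, :]) ** 2).sum(-1)
        return dist.min(1).sum()

    for _ in range(n_gen):
        fit = np.array([sse(ind) for ind in pop])
        parents = pop[fit.argsort()[: pop_size // 2]]   # selection: keep fitter half
        children = []
        for _ in range(pop_size - len(parents)):
            a, b = parents[rng.choice(len(parents), 2, replace=False)]
            mask = rng.random((k, d)) < 0.5             # uniform crossover
            child = np.where(mask, a, b)
            # Gaussian mutation applied element-wise with probability mut_rate
            child = child + (rng.random((k, d)) < mut_rate) * rng.normal(0, 0.1, (k, d))
            children.append(child)
        pop = np.concatenate([parents, np.array(children)])
    fit = np.array([sse(ind) for ind in pop])
    return pop[fit.argmin()]  # best set of k centroids found
```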
4. Clustering Validation in Measurements
Previous studies have evaluated identified gene clusters in terms of distance [1]: if genes are not within a given distance of a specified gene in each experimental condition, the specified gene is classified as an inactive gene; otherwise, the specified gene is co-expressed.
Clustering validation can be measured using internal and external criteria [17,18,100,107]. Table 6 summarizes the differences between internal and external validation. In general, internal criteria assess the fit between the clustering structure and the data, while external criteria measure performance by matching the cluster structure to prior information. As mentioned by Handl et al. [23], internal validation suffers from bias regarding the number of clusters and the partitioning structure of the data. Internal validation is measured in terms of compactness and separation [18,107]. Compactness measures how close the objects in a cluster are, based on variance; separation measures how distinct or well separated a cluster is from other clusters. Handl et al. [23] held the view that external validation can suffer from biases in a partitioning according to the cluster number and the distribution of group and class sizes.
Table 7 sets out examples of measurements used to validate the quality of clusters. As can be seen from the table, previous studies commonly used Euclidean distance and silhouette width. In general, silhouette width validates clustering performance in terms of the pairwise difference of between- and within-cluster distances [18,107]; the maximum value of the silhouette width identifies an optimal number of clusters.
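A brief sketch of internal versus external validation with scikit-learn follows: silhouette width and the Davies-Bouldin index require only the data and the cluster labels (internal), while the adjusted Rand index, used here in place of the entropy and F-measure listed in Table 7, compares the labels against known classes (external). The data and class labels are synthetic placeholders.

```python
# Internal vs. external validation of a clustering result.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (adjusted_rand_score, davies_bouldin_score,
                             silhouette_score)

rng = np.random.default_rng(4)
X = rng.normal(size=(200, 10))                    # placeholder data
true_classes = rng.integers(0, 2, size=200)       # placeholder known labels

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Internal criteria: use only the data and the assigned labels.
internal = (silhouette_score(X, labels), davies_bouldin_score(X, labels))
# External criterion: compare assigned labels with prior class information.
external = adjusted_rand_score(true_classes, labels)
```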
5. Discussion
An efficient clustering technique is one capable of extracting useful information about the behavior of a gene. According to Oyelade et al. [114], ensemble clustering (a combination of two or more phases of clustering) can generate more robust and better-quality clusters than single clustering. Table 8 summarizes the ensemble methods for clustering used by previous researchers. In addition, Oyelade et al. [114] showed that hierarchical clustering is more suitable than partitioning clustering for handling real datasets, such as image data, but it is computationally expensive. Advanced technological developments can isolate large groups of cells, and the resulting biological data can provide a better understanding of complex biological processes. For example, single-cell RNA sequencing can help to expose biological processes and medical insights [115]. K-means clustering typically performs better than hierarchical clustering on smaller datasets, but it requires a long computational time [114,115]. In addition, large amounts of bulk data can address biological dynamics and cancer heterogeneity. Tang et al. [115] proposed High-order Correlation Integration (HCI), which uses k-means clustering and Pearson’s correlation coefficient; their results showed that HCI outperforms existing methods (k-means clustering and hierarchical clustering) on single-cell and bulk RNA-seq datasets. Unsupervised clustering is one of the powerful techniques used in single-cell RNA sequencing to define cell types based on the transcriptome [116]. Fully unsupervised clustering techniques (e.g., intelligent k-means and kernel k-means) have been applied to analyze genes in colorectal carcinoma [117]. Random walk-based clustering, GCHL, and CLIQUE are also used in unsupervised settings [26,41,46,47,48,61,67].
The purpose of optimizing the objective function and of validation is to achieve quality clusters. Most previous studies used swarm intelligence to optimize the parameters of clustering techniques and to identify the optimal number of possible clusters [118]. For clustering techniques, optimizing the objective function means maximizing the accuracy of the centroid or cluster center, especially for partitioning clustering techniques, because partitioning clustering needs the number of clusters or centroids to be initialized beforehand. Furthermore, clustering validation is essential for measuring within- and between-cluster quality [19].
In this research, leukemia data containing 3051 genes and 38 samples [49] were used to evaluate the performance of each category of clustering techniques. The genes obtained by the clustering techniques differed from one category to another; however, the number of target clusters was the same for all techniques. The grid-based clustering technique provided higher classification accuracy than the other clustering techniques and identified 7.29% of the prognostic markers in the leukemia data. On the other hand, k-means clustering achieved the highest percentage (8%) of identified prognostic markers of leukemia, but its classification accuracy was comparatively poor.
A summary of the optimal cluster analyses performed by previous researchers is shown in Table 9. According to the table, k-means clustering was the most used technique. Integration of optimization is critical to its use because optimization can address the requirement of k-means clustering to initialize the number of clusters beforehand [10,11].
6. Conclusions
In summary, this paper reviewed examples of existing computational methods for clustering genes with similar biological functions. Partitioning, hierarchical, grid-based, and density-based techniques are the main categories of clustering. Clustering can identify high-quality clusters that are helpful for understanding biological mechanisms and could lead to the identification of new genes related to known or suspected cancer genes [67,117,123].
Among the categories of clustering, grid-based and density-based techniques are more suitable for clustering objects in large spatial data; they are inappropriate for artificial and biological datasets such as the iris, wine, breast tissue, blood transfusion, and yeast datasets [24,114]. On the other hand, density-based clustering techniques are useful for clustering gene expression data [114]. Moreover, hierarchical clustering techniques are useful for handling synthetic and real datasets (e.g., image data), although they have limitations when the data are very large [114]. Finally, partitioning clustering techniques are inappropriate for non-convex data but are suitable for smaller datasets [53,114,115].
Grid-based clustering (CLIQUE) was more efficient than the other categories of clustering (e.g., k-means clustering, DBSCAN, and AGNES), but it had difficulty identifying multiple clusters in high-dimensional data. Although k-means clustering (category: partitioning) was sensitive to the initialization of the number of clusters, it provided a higher chance of identifying prognostic markers of leukemia. A prognostic marker is useful for identifying a disease outcome, which can be helpful in cancer treatment and drug discovery as well [52]. However, the quality of clusters is usually affected by the initialization of the intended number of clusters, especially for partitioning clustering. Therefore, optimization of the objective function and validation can help clustering techniques to identify the optimal number of clusters with better quality [11,89]. This paper also described the two types of optimization strategies, population-based and evolution-based; most existing optimization techniques use population strategies. Carneiro et al. [124] also concluded that the use of optimization could produce better classification when combined with clustering and topological data. In addition, this paper reviewed clustering validation and its measurement criteria: internal and external criteria are commonly used to assess the cluster structure. Genes in clusters can also belong to a specific pathway, which can reflect the genes’ functions in biological processes [125]. For example, the BCL2-associated X apoptosis regulator (BAX) was among the genes identified in our experimental work and is also a prognostic marker of leukemia. The BAX gene encodes a pro-apoptotic protein, and increasing its expression while decreasing the expression of anti-apoptotic genes (e.g., the Bcl-2 gene) has been explored in the treatment of leukemia [126,127]. Moreover, clustered genes can reveal metabolic gene clusters related to metabolite discovery in bacteria and fungi [127]. Identifying genes in clusters not only allows us to discover informative genes and prognostic markers for a specific disease, but it can also provide clues about clusters dictated by signature enzymes. A signature enzyme can catalyze reactions and further tailor the product; hence, the clustered genes can encode the enzymes of a pathway.
Based on the experimental work, the CLIQUE and k-means clustering techniques produced the best results in terms of classification accuracy and identification of cancer markers, respectively. Therefore, this review suggests combining clustering techniques such as CLIQUE and k-means to yield more accurate gene clustering.
Although the optimal cluster analysis is the focus of this review, the findings can be applied to different areas.
Author Contributions
Conceptualization, H.W.N., Z.Z., M.S.M. and W.H.C.; Methodology, H.W.N., Z.Z., M.S.M., W.H.C. and N.Z.; Resources, H.W.N.; Writing—Original Draft Preparation, H.W.N.; Writing—Review and Editing, Z.Z., M.S.M., W.H.C., N.Z., R.O.S., S.N., P.C., S.O., J.M.C.; supervision, Z.Z., M.S.M. and W.H.C.
Funding
This research was funded by Fundamental Research Grant Scheme—Malaysia’s Research Star Award (FRGS-MRSA) and Fundamental Research Grant Scheme (R.J130000.7828.4F973) from Ministry of Education Malaysia, ICT funding agency from United Arab Emirates University (G00001472), and Research University Grant from Universiti Teknologi Malaysia (Q.J130000.2628.14J68). The authors also would like to thank Universiti Teknologi Malaysia (UTM) for the support of UTM’s Zamalah Scholarship.
Acknowledgments
The authors acknowledge support from the Ministry of Education Malaysia, United Arab Emirates University (UAEU), University of Salamanca (USAL), and Universiti Teknologi Malaysia (UTM).
Conflicts of Interest
The authors declare no conflict of interest.
Figure and Tables
Figure 1. A hypothetical gene network to illustrate the importance of genes in a random walk.
Table 1. Differences among categories of clustering techniques.
Categories | Time Complexity | Computing Efficiency | Convergence Rate | Scalability | Initialization of Cluster Number |
---|---|---|---|---|---|
Partitioning | Low | High | Low | Low | Yes |
Hierarchical | High | High | Low | High | No |
Grid-based | Low | High | Low | High | No |
Density-based | Middle | High | High | High | No |
Table 2. Examples of popular clustering techniques along with their advantages and disadvantages.
Clustering Techniques | Categories | Advantages | Disadvantages | References |
---|---|---|---|---|
Fuzzy C Means (FCM) | Partitioning | Minimize the error function belonging to its objective function and solve the partition factor of the classes. | Unable to achieve high convergence. | [27,28] |
K-means Clustering | Partitioning | Use a minimum “within-class sum of squares from the centers” criterion to select the clusters. | Need to initialize the number of clusters beforehand. | [9,10,11,12,29,30,31,32,33] |
Partitioning Around Medoids (PAM) | Partitioning | Deal with interval-scaled measurements and general dissimilarity coefficients. | Consumes large central memory size. | [34] |
Self-Organizing Maps (SOM) | Partitioning | Suitable for data survey and getting good insight into the cluster structure of data for data mining purposes. | Distance dissimilarity is ignored. | [35,36,37,38] |
Agglomerative Nesting (AGNES) | Hierarchical (agglomerative) | Build a hierarchy of clustering from a small cluster and then merge until all data are in one large group. | Starts with details and then works up to large clusters, which is affected by unfortunate decisions in the first step. | [19,34] |
EISEN Clustering | Hierarchical (agglomerative) | Carry out a clustering in which a mean vector represents each cluster from data in the group. | Starts with details and then works up to large clusters, which can be affected by unfortunate decisions in the first step. | [19] |
Divisive Analysis (DIANA) | Hierarchical (divisive) | Perform a task starting from a large cluster containing all data to only a single dataset. | Not generally available and rarely applied in most studies. | [19,34] |
Clustering in Quest (CLIQUE) | Grid-based | Can automatically find subspaces in lower-dimensional subspaces with high-density clusters. | Ignores all projections of dimensional subspaces. | [39,40] |
Grid-Clustering Technique for High-Dimensional and Large Spatial Databases (GCHL) | Grid-based | Efficient and scalable while handling high dimensionality issue. | Insensitive to noise. | [26,41] |
Statistical Information Grid (STING) | Grid-based | Facilitate several kinds of spatial queries and less computational cost. | Difficult to identify multiple clusters. | [42,43] |
Density-Based Spatial Clustering of Applications with Noise (DBSCAN) | Density-based | Can detect clusters with different shapes and able to handle ones with different densities. | Optimization issue. Difficult to select appropriate parameter values. | [44,45] |
Random Walk based Clustering | Density-based | Reflect the topological features of a functional network. | Considers the interaction between two genes. | [46,47,48] |
Relative Core Merge (RECOME) | Density-based | Can characterize based on a step function of its parameter. | Scalability issue. Hard to handle a large volume of data. | [45] |
Table 3. Comparative results of the clustering techniques applied on leukemia gene expression data.
Categories | Clustering Techniques | Parameter (s) | Number of Genes in the Selected Cluster | Number of Prognostic Markers | Accuracy (%) |
---|---|---|---|---|---|
Partitioning | K-means | k = 2 | 275 | 22 | 71.50 |
Hierarchical | AGNES | k = 2 | 339 | 22 | 78.50 |
Grid-based | CLIQUE | k = 2 | 919 | 67 | 89.00 |
Density-based | DBSCAN | k = 2 | 1548 | 103 | 73.00 |
Note: k is the number of clusters to be selected; dimensions are divided into several equal-width intervals; density is the density threshold; minPts is the minimum size of clusters.
Table 4. Comparison of the use of optimization between population and evolution strategies.
Strategies | Population-Based | Evolution |
---|---|---|
Functions | Exploration and exploitation. | Selection, recombination, and mutation. |
Between technique and solution | Exploration: the technique can reach the best solution within the search space. Exploitation: expresses the ability of the technique to reach the global optimum solution around the obtained local solutions. | Optimizes the mathematical functions of the technique with continuously changeable parameters and extends to solve discrete optimization problems. |
Application | Metaheuristic search for global optimal solutions using informative parameters. | Processes of selection, recombination, and mutation. |
Weakness | Difficult to avoid problems of local minima and early convergence. | Need to control and adjust parameters. |
Aim | Imitate the best features in nature and produce a better quality of solution efficiently. | Imitate the best features in nature and produce a better quality of solution efficiently. |
Table 5. Summary of existing techniques of swarm intelligence.
Techniques | Strategies | Usage | Fitness | References |
---|---|---|---|---|
Artificial Bee Colony (ABC) | Population | Can stimulate searching food process of bees based on the found food sources quality. | Position and nectar amount of a food source. | [37,82] |
Ant Colony Optimization (ACO) | Population | Mimic ant behavior to solve optimization problems. | Pheromone values. | [77,83] |
Ant Lion Optimization (ALO) | Population | High exploitation to explore search space and quickly converge to a global optimum. | Ant location. | [11,33] |
Bat Algorithm | Population | Uses the frequency-based tuning and pulse emission rate changes that can lead to better convergence. | Bat behavior. | [78,80,84] |
Bee Algorithm | Population | Imitate food foraging behavior of swarms of honeybees to find the optimal solution. | Frequency of the dance. | [85] |
Cuckoo Search (CS) | Population | Combine the obligate brood parasitic behavior of some cuckoo species with Lévy flight behavior of some birds and fruit flies. | Quality of cuckoo bird eggs. | [79] |
Firefly Algorithm (FA) | Population | Carry out nonlinear design optimization and solve unconstrained stochastic functions. | Brightness of the firefly. | [70,86] |
Gravitational Search Algorithm (GSA) | Population | Emulate the law of Newtonian gravity to solve various nonlinear optimization problems. | Intelligence factors. | [87,88] |
Particle Swarm Optimization (PSO) | Population | Balance the weights of a neural network and sweep the search space using a swarm of particles. | A “space” where the particles “move”. | [71,77,81,89] |
Simulated Annealing (SA) | Population | Use principles of statistical mechanics regarding the behavior of many atoms at low temperature. | Single bit-flips. | [90,91] |
Differential Evolution (DE) | Evolution | Maintain a population of target vectors at each iteration for stochastic search and global optimization. | Global minimum. | [71] |
Evolution Strategy (ES) | Evolution | Emphasize the use of normally distributed random mutations (main operator). | Several operators needed to consider in the analysis. | [92] |
Evolutionary Programming (EP) | Evolution | Use the self-adaptation principle to evolve the parameters on searching. | No recombination operator and difficult to identify useful values for parameter tuning. | [92] |
Gene Expression Programming (GEP) | Evolution | Extremely versatile and greatly surpasses the existing evolutionary techniques. | Several genetic operators needed to function on selected chromosomes during reproduction. | [93,94] |
Genetic Algorithm (GA) | Evolution | Use genes with mechanisms to mimic survival of the fittest and inspire the genetics with the evolution of populations. | Priority of the genetic strings. | [71] |
Genetic Programming (GP) | Evolution | Can select variables and operators automatically then assemble into suitable structures. | No clearly defined termination point in biological processes operating. | [95,96] |
Memetic Algorithm | Evolution | Useful on the property of global convexity in the search space. | Genetic operators (crossover and mutation) needed to consider in the analysis. | [97,98,99] |
Table 6. Differences in measurements between internal and external validation.
Criteria of Validation Measurements | Internal | External |
---|---|---|
Aim | Assess the fitness between the clustering structure and the data. | Measure the performance by matching the cluster structure to prior information. |
Suffer from bias | Bias regarding the number of clusters and the partitioning structure of the data [23]. | Bias in a partitioning according to the cluster number and the distribution of group and class sizes [23]. |
Table 7. Examples of previous studies in clustering validation.
Measurements | Categories | Usage | References |
---|---|---|---|
Average of sum of intra-cluster distances | Internal | Measure assessing cluster compactness or homogeneity. | [11,33] |
Connectivity | Internal | Degree of the connectedness of clusters. | [1,23] |
Davies and Bouldin (DB) index | Internal | Measure intra- and inter-cluster using spatial dissimilarity function. | [108] |
Dunn index | Internal | Ratio of the smallest distance between observations in different clusters to the largest intra-cluster distance. | [1,23] |
Euclidean distance | Internal | Compute distances between the objects to quantify their degree of dissimilarity. | [19,31,34,109] |
Inter-cluster distance | Internal | Quantify the degree of separation between individual clusters. | [11] |
Manhattan distance | Internal | Correspond to the sum of lengths of the other two sides of a triangle. | [34] |
Pearson correlation coefficients (PCC) | Internal | Measure between-state functional similarity. | [23,110] |
Silhouette width | Internal | Measure the degree of confidence in a clustering assignment; values lie in the interval [−1, +1], with well-clustered observations near +1 and poorly clustered observations near −1. | [1,18,19,31,32,109] |
Square sum function of the error | Internal | Measure the quality of cluster either by compactness or homogeneity. | [12,23,111] |
Entropy | External | Measure mutual information based on the probability distribution of random variables. | [30,112,113] |
F-measure | External | Assess the quality of clustering result at the level of entire partitioning and not for an individual cluster only. | [11,23,30,33] |
Table 8. Summary of the existing ensemble methods used in clustering.
References | Ensemble Methods | Clustering Techniques | Use |
---|---|---|---|
Deng et al. [24] | Grid-based and Density-based Spatial Clustering (GRIDEN) | Grid-based; Density-based (DBSCAN) | Enhances clustering speed. |
Oyelade et al. [114] | Microarray Data Clustering using Binary Splitting (M-CLUBS) | Hierarchical (divisive and agglomerative) | Overcomes the effect of size and shape of clusters, number of clusters, and noise for gene expression data. |
Oyelade et al. [114] | Efficient Agglomerative Hierarchical Clustering (KnA) | Hierarchical (agglomerative); Partitioning (k-means) | Relatively consistent on synthetic data. |
Bouguettaya et al. [120] | Cohesion-based Self-Merging (CSM) | Partitioning (k-means) | Clusters datasets of arbitrary shapes very efficiently. |
Darong and Peng [122] | Grid-based DBSCAN Technique with Referential Parameters (GRPDBSCAN) | Grid-based | Finds clusters of arbitrary shape and removes noise. |
Table 9. Summary of optimal cluster analysis.
References | Clustering Techniques | Optimization for Objective Function of Partitioning Clustering Techniques | Clustering Validation |
---|---|---|---|
Majhi and Biswal [11,33] | K-means clustering | Ant Lion Optimization (ALO) | |
Ye et al. [12] | K-means clustering | Cuckoo Search | Square sum function of the error |
Mary and Raja [30] | K-means clustering | Ant Colony Optimization (ACO) | |
Garg and Batra [32] | | Cuckoo Search Optimization (CSO) | |
Acharya et al. [91] | Multi-Objective Based Bi-Clustering | Simulated Annealing (SA) | Euclidean distance |
Labed et al. [111] | K-Harmonic Means | Cuckoo Search Algorithm (CSA) | |
Shanmugam and Sekaran [118] | Fuzzy C Means (FCM) | Ant Lion Optimization (ALO) | Square sum function of the error |
Carneiro et al. [124] | Network-based techniques | Particle Swarm Optimization (PSO) | Euclidean distance |
© 2019 by the authors.
Abstract
Clustering techniques can group genes based on similarity in biological functions. However, the drawback of using clustering techniques is the inability to identify an optimal number of potential clusters beforehand. Several existing optimization techniques can address this issue. In addition, clustering validation can predict the possible number of potential clusters and hence increase the chances of identifying biologically informative genes. This paper reviews and provides examples of existing methods for clustering genes, optimizing the objective function, and validating clustering. Clustering techniques can be categorized into partitioning, hierarchical, grid-based, and density-based techniques, and we highlight the advantages and disadvantages of each category. To optimize the objective function, we introduce the swarm intelligence technique and compare the performance of related methods. Moreover, we discuss the differences between internal and external criteria for validating cluster quality. We also investigate the performance of several clustering techniques by applying them to a leukemia dataset. The results show that grid-based clustering techniques provide better classification accuracy; however, partitioning clustering techniques are superior in identifying prognostic markers of leukemia. Therefore, this review suggests combining clustering techniques such as CLIQUE and k-means to yield high-quality gene clusters.
Affiliations


1 School of Computing, Faculty of Engineering, Universiti Teknologi Malaysia, Skudai 81310, Johor, Malaysia
2 Institute for Artificial Intelligence and Big Data, Universiti Malaysia Kelantan, Kota Bharu 16100, Kelantan, Malaysia
3 Department of Computer Science and Software Engineering, College of Information Technology, United Arab Emirate University, Al Ain 15551, UAE
4 School of Computing and Information Systems, University of Melbourne, Parkville 3010, Victoria, Australia
5 Faculty of Biotechnology and Biomolecular Sciences, Universiti Putra Malaysia, Serdang 43400, Selangor, Malaysia
6 BISITE Research Group, Digital Innovation Hub, University of Salamanca, Edificio I+D+i, C/ Espejos s/n, 37007 Salamanca, Spain
7 Division of Data-Driven Smart Systems Design, Digital Monozukuri (Manufacturing) Education and Research Center, Hiroshima University, #210, 3-10-31 Kagamiyama, Higashi-Hiroshima 739-0046, Hiroshima Prefecture, Japan