1. Introduction
The classification problem is an essential research topic in data mining. Among classification approaches, the decision tree is particularly effective: it offers high classification accuracy, few parameters, and strong interpretability. Decision trees have been widely adopted in business, medical care, and other fields, and have achieved remarkable results. Data generated in daily life is increasing rapidly, which brings both opportunities and challenges for decision trees in solving large-scale data classification problems. The decision tree is an inductive learning algorithm based on examples. With in-depth research on decision tree algorithms and diversified needs in practical applications, a variety of learning algorithms and models for constructing decision trees have been proposed.
1.1. Information Entropy Decision Tree
Quinlan proposed the ID3 decision tree algorithm. This algorithm employs information gain from information theory to evaluate feature quality when splitting tree nodes, and the feature with the largest information gain, together with the corresponding split point, is used to construct the node [1]. The ID3 algorithm is clear and fast, but it also has some obvious shortcomings. First, it is not suitable for datasets containing continuous features; second, it tends to choose conditional features with more values as the optimal splitting features. In the same year, Schlimmer and Fisher proposed the ID4 decision tree method, which constructs decision trees in an incremental manner [2]. Two years later, the ID5 algorithm was presented by Utgoff, which allows the structure of an existing decision tree to be modified by adding new training instances without retraining [3]. In addition, Xiaohu Liu and his colleagues discussed decision tree construction that considers both the information gain of the conditional feature at the current node and the information gain of the conditional feature at the next node when selecting the split feature [4]. Aiming at the shortcomings of ID3 in selecting features, Xizhao Wang et al. merged appropriate decision tree branches during construction; this increases comprehensibility, improves generality, and reduces complexity [5]. The C4.5 decision tree, proposed by Quinlan in 1993, improves on the ID3 algorithm [6]. To address the bias of information gain when selecting split features, it employs the information gain ratio as the splitting metric. C4.5 can also process continuous features, discrete features, and incomplete datasets, giving it broader applicability than ID3. When constructing the C4.5 decision tree, a pruning operation is adopted, which enhances efficiency, reduces the scale of the tree, and effectively avoids overfitting. Domingos and his colleagues presented a very fast decision tree for data streams, called VFDT [7]; the algorithm shortens training time by using sampling technology. Hulten designed the CVFDT algorithm, which extends VFDT but requires users to supply parameters in advance [8]. Recently, researchers have also proposed many other decision tree algorithms based on information entropy [9–16].
1.2. Gini Coefficient Decision Tree
In 1984, Breiman and his colleagues designed the CART algorithm, which adopts the Gini coefficient as the metric for feature splitting and chooses the conditional feature with the smallest Gini coefficient as the splitting feature of a tree node [17]. When generating the tree, if the purity of a node is greater than or equal to a threshold assigned in advance, the node stops splitting and the majority label of the instances covered by the node is used as the label of the leaf node. Meanwhile, the approach adopts resampling to analyze the accuracy of the constructed decision tree and to perform pruning, and the decision tree with high accuracy and the smallest size is selected as the final tree. Here, pruning uses minimal cost-complexity pruning (MCCP), which alleviates overfitting, reduces tree size, and improves interpretability. CART also has its own shortcomings: because of computer memory limitations, it cannot effectively process large-scale datasets. Addressing this memory limitation, Mehta et al. proposed the SLIQ decision tree in 1996 [18]. When building the tree, instances are presorted, and a breadth-first strategy is used to choose the optimal split feature. SLIQ has the advantages of fast computation and the ability to process larger datasets; when the data exceed memory capacity, it employs feature lists, class lists, and class histograms. However, when the amount of data grows large enough, the algorithm still faces insufficient memory [19]. In 1998, Rastogi et al. presented the PUBLIC decision tree algorithm [20]. The algorithm integrates tree construction with tree adjustment: by computing the cost function value of a node, it judges whether the node should be pruned or expanded, and nodes that are not expanded are marked as leaf nodes. Combining construction and adjustment greatly increases training efficiency. RainForest is a decision tree framework proposed by Gehrke and his colleagues in 2000 [21]; its aim is to enhance the scalability of decision trees while reducing the consumption of memory resources as much as possible.
1.3. Rough Set Decision Tree
Incorporating rough set theory into the decision tree enables it to handle uncertain, incomplete, and inconsistent data. When using rough sets to generate a decision tree, the main research focus is how to use rough set theory to choose node splitting features. Miao and his colleagues designed a rough-set-based multivariate decision tree algorithm; the method first selects the conditional features in the core to construct a multivariate test and then generates a new feature to split the node [22]. The advantage of this algorithm is its relatively high training efficiency, but because each node involves many variables, the resulting tree is hard to interpret. Wang et al. designed a fuzzy decision tree based on rough sets, which uses fuzzy integration to keep the results consistent [23]. In 2011, Zhai and his colleagues adopted the fuzzy rough set to generate a decision tree and presented a new selection criterion for fuzzy conditional features [24]. Wei et al. constructed a decision tree based on the variable precision rough set, which allows small errors in the classification process and improves the generalization ability of the decision tree [25]. Jiang and his colleagues designed an incremental decision tree learning method via rough sets and applied it to network intrusion detection [26]. In 2012, Hu and others proposed a monotonic ordered mutual information decision tree; it employs dominance rough set theory to establish a new metric of feature quality for constructing tree nodes, is resistant to noise, handles monotonic classification problems, and also performs well on general classification problems [27]. On the basis of this algorithm, Qian and his colleagues designed a fused monotonic decision tree: feature selection is used to generate multiple feature distributions of the data, multiple decision trees are constructed, and these trees make decisions jointly [28]. Pei and his colleagues designed a multivariate decision tree with monotonicity constraints; the algorithm first uses the ordered mutual information splitting criterion to generate different data subsets and then optimizes these subsets to build the decision tree [29].
1.4. Parallel Decision Tree
Nonparallel (serial) decision trees have received extensive and in-depth research, and many decision tree models and algorithms have been proposed; however, because of the recursive nature of decision trees and the constraints of computing platforms, research on decision tree parallelization is comparatively less extensive. The following briefly reviews the current research status of parallel decision trees. Research on parallel decision trees started with SPRINT, proposed by Shafer et al. in 1996 [30]. The algorithm tries to avoid insufficient memory by improving the data structure during tree growth, but when computing a splitting node it requires broadcasting the entire instance set among the processors. Kufrin discussed the parallelization of decision trees and introduced a parallelization framework in 1997 [31]. One year later, Joshi and his team proposed a parallel decision tree similar to SPRINT; unlike the traditional depth-first construction of decision trees, it grows the tree breadth-first within a parallel framework, which avoids possible load imbalance [32]. Srivastava and his team presented two parallel decision tree models based on synchronous and asynchronous construction, but both suffer from large communication overhead and load imbalance [33]. Sheng et al. gave a parallel decision tree that divides the input instances into four subsets to build a decision tree and applied it to the user authentication problem [34]. With further research on parallel decision trees and the emergence of the MapReduce distributed computing framework, Panda et al. designed a parallel decision tree in 2009 that relies on multiple distributed computing frameworks [35]. Walkowiak and his colleagues focused on the parallelization of decision trees and proposed an optimization model of network computing for the distributed implementation of decision trees [36]. In 2012, Yin et al. adopted a scalable regression tree algorithm to give a parallel implementation of the Planet algorithm under the MapReduce framework [37]. To address overlearning (overfitting), Wang Ran et al. proposed an extreme learning machine decision tree based on a parallel computing framework, but it can only handle numerical datasets, not mixed datasets [38]. The parallel C4.5 decision tree proposed in [39] considers both overfitting and mixed data types. For the ordered mutual information decision tree widely used in monotonic classification, Mu and his colleagues presented a fast version together with a parallel implementation [40]. Other parallel decision tree approaches have also been proposed, such as the distributed fuzzy decision tree [41] and the parallel Pearson correlation coefficient decision tree [42]. In addition, Li and other researchers proposed classification and alignment algorithms from the perspective of granular computing, which show good performance [43–49].
2. Contributions
In this study, a decision tree is constructed in granular space to solve the classification problem. The main contributions are as follows:
(i) We propose AGRC, which can adaptively determine the optimal cluster centers; it is a global optimization method and avoids falling into local optimal solutions.
(ii) We design a parallel granulation method based on the above clustering algorithm, which solves the problem of the high complexity of traditional serial granulation and enhances granulation efficiency.
(iii) In granular space, we define fuzzy granules and related operators and select features based on the information gain ratio to construct a fuzzy granular decision tree for classification. To avoid overlearning, we also design the corresponding pruning algorithm. The method presented can solve binary or multiclass classification problems and gives feature importance according to the order in which tree nodes are generated.
3. The Problem
Let
4. The Algorithm Description
To solve the classification problem described in Section 3, the model proceeds as follows. First, to enhance efficiency as much as possible, we cluster the data; for this, an adaptive clustering algorithm is designed that can obtain the number of clusters and the cluster centers automatically. Second, on the basis of the number of centers and the cluster centers, parallel granulation is executed by calculating the distance between instances and cluster centers and dividing the instance set into instance subsets. Third, the problem of instance classification is converted into the problem of granule classification in granular space; fuzzy granules, related operators, and a cost function are defined in granular space, and splitting nodes are found by optimizing the cost function to build a fuzzy granular decision tree. Fourth, a pruning algorithm for RFGDT is designed. Finally, the label of a test instance can be predicted by the RFGDT. The overview is described in Figure 1.
[figure omitted; refer to PDF]
4.1. Theory of AGRC
K-means is an unsupervised clustering approach in which the cluster centers and their number must be specified in advance. The result depends on these initial centers, and if they are poorly chosen the algorithm falls into a local optimal solution. We therefore propose an Adaptive Global Random Clustering (AGRC) algorithm, which adaptively selects the cluster centers and their number; the initial selection of centers is random, and the method is a global optimization approach. The idea is that a clustering is good when the between-cluster variance is large relative to the within-cluster variance, so AGRC searches for the center set that maximizes the ratio of the between-cluster standard deviation to the within-cluster standard deviation. The procedure is as follows.
Step I: remove instances of missing feature values.
Step II: normalize feature values of instances into [0, 1].
Step III: assign maximum iterations
Step IV: assign current cluster center set
Step V: randomly select one instance
Step VI: calculate the shortest distance between the remaining instances and all cluster centers
and the probability of an instance being selected as the next cluster center is
Step VII: if
Step VIII: if
Step IX: calculate the between-cluster standard deviation and the within-cluster standard deviation, and update the evaluation set, that is,
Step X: update iteration
Step XI: if
Step XII: in evaluation set
and the quantity of cluster center corresponding to it,
Step XIII: end.
The principle of Adaptive Global Random Clustering has been given above. On the basis of this principle, the algorithm is shown in Algorithm 1.
Algorithm 1: Adaptive global random clustering.
Input: instance set
Output: optimal cluster center set
(1) Remove instances of missing feature values.
(2) Normalize feature values of instances into
(3)
(4)
(5) While
(6)
(7)
(8)
(9)
(10)
(11) While
(12)
(13)
(14) p = GenProbability ();
(15) If
(16) End while
(17)
(18)
(19) End while
(20)
(21)
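As a concrete illustration of Steps I–XIII, the following Python sketch implements the AGRC idea: centers are added one at a time with a distance-weighted random choice, and every candidate center set is scored by the ratio of between-cluster to within-cluster standard deviation, keeping the best-scoring set. The squared-distance weighting, the exact score definition, and the stopping rule are illustrative assumptions rather than the paper's precise formulas.

```python
import numpy as np

def _pick_next_center(X, centers, rng):
    """Steps V-VI: choose the next center with probability proportional to the
    squared distance to the nearest existing center (an assumed weighting)."""
    d2 = np.min(((X[:, None, :] - np.asarray(centers)[None, :, :]) ** 2).sum(-1), axis=1)
    p = d2 / d2.sum()
    return X[rng.choice(len(X), p=p)]

def _score(X, centers):
    """Step IX (assumed form): between-cluster std divided by mean within-cluster std."""
    centers = np.asarray(centers)
    labels = np.argmin(((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1), axis=1)
    within = np.mean([X[labels == j].std() for j in range(len(centers)) if np.any(labels == j)])
    return centers.std() / (within + 1e-12)

def agrc(X, k_max=20, seed=0):
    """AGRC sketch: grow the center set and keep the set with the largest score."""
    rng = np.random.default_rng(seed)
    X = (X - X.min(0)) / (X.max(0) - X.min(0) + 1e-12)   # Step II: normalize features
    centers = [X[rng.integers(len(X))]]                   # Step V: random first center
    best_score, best_centers = -np.inf, list(centers)
    for _ in range(2, k_max + 1):                         # Steps VI-XI: add centers
        centers.append(_pick_next_center(X, centers, rng))
        s = _score(X, centers)
        if s > best_score:                                # Step XII: keep the best set
            best_score, best_centers = s, list(centers)
    return np.asarray(best_centers)
```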
4.2. From Data to Fuzzy Granules
Now, we introduce how to implement parallel fuzzy granulation of data via cluster centers. We can adopt AGRC to get the cluster center set
For simplicity, it can also be written as
Here “
Now we give operators of fuzzy granules. For
Here
The symbol “+” denotes union and the symbol “–” represents separator. The cardinal number of the fuzzy granular array can be written as
Now, we have the operators between fuzzy granular arrays. Let
The difference between two fuzzy granular vectors can be written as follows:
From the information granulation above, we can see that fuzzy granules and fuzzy granular arrays are generated together with their operators. These fuzzy granules constitute the space called the fuzzy granular space.
Theorem 1.
For
Theorem 2.
For
Proof 2.
According to equation (13), if
Below we give an example to explain the granulation process and measurement.
Example 1.
As shown in Table 1, let
We take instance
In the same way, fuzzy granules generated by
According to
Similarly, we also obtain
Hence, the distance between instance
Table 1
Fuzzy granulation and measurement of similarity.
0.3 | 0.1 | 0.1 | +1 | 0.5 | |
0.2 | 0.2 | 0.1 | +1 | 0.5 | |
0.4 | 0.5 | 0.1 | –1 | 0.5 | |
0.1 | 0.2 | 0.7 | –1 | 0.5 | |
0.15 | 0.25 | 0.15 | — | — | |
0.35 | 0.45 | 0.65 | — | — |
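To make the granulation step tangible, here is a small Python sketch that maps a normalized instance to a fuzzy granular array of membership degrees, one per cluster center, and measures the distance between two such arrays. The inverse-distance membership and the mean-absolute-difference distance are assumptions standing in for the paper's omitted formulas; the numbers reuse the feature values from Table 1.

```python
import numpy as np

def fuzzy_granulate(x, centers):
    """Map one normalized instance x to a fuzzy granular array: one membership
    degree per cluster center (inverse-distance form is an illustrative assumption)."""
    d = np.linalg.norm(np.asarray(centers) - np.asarray(x), axis=1)
    return 1.0 / (1.0 + d)

def granule_distance(g1, g2):
    """Assumed dissimilarity between two fuzzy granular arrays:
    mean absolute difference of the membership degrees."""
    return float(np.mean(np.abs(np.asarray(g1) - np.asarray(g2))))

# Example: two instances from Table 1 granulated against two cluster centers.
centers = [[0.15, 0.25, 0.15], [0.35, 0.45, 0.65]]
g_a = fuzzy_granulate([0.3, 0.1, 0.1], centers)
g_b = fuzzy_granulate([0.2, 0.2, 0.1], centers)
print(g_a, g_b, granule_distance(g_a, g_b))
```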
4.3. Random Fuzzy Granular Decision Tree
RFGDT has a tree structure and expresses the process of classifying instances on the basis of features. It can be viewed as a set of if-then rules, or as a conditional probability distribution defined on the fuzzy granular space and the label space. Its strengths are readability and high efficiency. During learning, we employ the training data to generate an RFGDT by minimizing a cost function; during prediction, test data are classified by the model. Learning an RFGDT involves three steps: feature selection, tree construction, and tree pruning.
An RFGDT describes the feature structure used to classify fuzzy granules in the fuzzy granular space and is composed of nodes and directed edges. An internal node represents a feature, and a leaf node represents a label.
During classification, the model starts from the root node, tests a certain feature of the instance's fuzzy granule, and assigns the fuzzy granule to one of the node's children according to the result; each child corresponds to one value of that feature. In this way, fuzzy granules are tested and allocated recursively until they reach a leaf node, and each fuzzy granule finally receives the label of its leaf node.
Now, the definition of the fuzzy granular rule set is written as follows.
Definition 1.
Suppose that
Suppose that
4.3.1. IF-Then Rule
An RFGDT can also be regarded as a set of if-then rules. It is converted into such a set as follows: one rule is constructed for every path from the root node to a leaf node; the features of the internal nodes on the path correspond to the conditions of the rule, and the label of the leaf node corresponds to its conclusion. The paths of an RFGDT (and hence the corresponding if-then rules) have a key property: they are mutually exclusive and complete. This means that every fuzzy granular array is covered by exactly one path or rule, where coverage means that the features of the fuzzy granular array are consistent with the features on the path.
4.3.2. Learning
Learning an RFGDT means generalizing a set of classification rules from the training data. There may be more than one RFGDT consistent with the training data (i.e., able to classify the training data correctly). Our purpose is to find an RFGDT that has little contradiction with the training data and strong generalization ability. In other words, RFGDT learning estimates a conditional probability model from the training data; infinitely many such models exist under different divisions of the fuzzy granular space, and the chosen model should both fit the training data well and predict well on test data. RFGDT learning expresses this aim with a cost function: as described below, the cost function is a regularized maximum likelihood function, and the learning strategy is to minimize it.
The RFGDT learning algorithm recursively selects the optimal feature and segments the training data according to that feature, so that each subdataset is classified as well as possible. This process corresponds to dividing the fuzzy granular space and forming the RFGDT. At the beginning, the root node is constructed and all fuzzy granular arrays are placed at the root. The algorithm chooses an optimal feature and divides the training data into several subsets according to this feature, so that each subset is classified as well as possible under the current conditions. If a subset is already classified correctly, the algorithm constructs a leaf node and assigns the subset to it; if some subsets cannot yet be classified correctly, the algorithm selects new optimal features for them, continues to segment them, constructs the corresponding nodes, and proceeds recursively until all subsets are basically classified correctly or no suitable feature remains. Finally, each subset is assigned to a leaf node, i.e., has a clear category. This generates an RFGDT.
The RFGDT produced in this way may classify the training data well but classify test data poorly, that is, overfitting may occur. We therefore prune the tree from bottom to top to simplify it and enhance its generalization ability. Specifically, we remove leaf nodes that are too finely subdivided, retract them to their parent node or even an ancestor node, and turn that parent or ancestor into a new leaf node.
If the number of features is large, features can also be selected at the beginning of RFGDT learning, keeping only those with sufficient classification ability for the training data. We conclude that the learning algorithm consists of feature selection, RFGDT construction, and RFGDT pruning. Since an RFGDT denotes a conditional probability distribution, RFGDTs of different depths correspond to probability models of different complexity. Generating an RFGDT corresponds to local model selection, whereas pruning corresponds to global model selection: generation considers only the local optimum, while pruning considers the global optimum.
4.3.3. Feature Selection and Cost Function Construction
Selecting important features can enhance the efficiency of RFGDT learning. If the result of using a feature for classification is not very different from random classification, the feature is said to have no classification ability, and empirically discarding such features has little effect on accuracy. Here, we redefine the information gain ratio and use this criterion as the cost function for constructing an RFGDT. First, the empirical entropy of the dataset
The empirical conditional entropy of feature
The information gain is calculated as follows:
We can now write the ratio
Here,
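For reference, the quantities used above follow the standard C4.5-style definitions, restated here over the set of fuzzy granular arrays D with conditional feature A; the notation below is the conventional one and is assumed, since the paper's own symbols are not reproduced.

```latex
\begin{aligned}
H(D) &= -\sum_{k=1}^{K} \frac{|C_k|}{|D|}\,\log_2\frac{|C_k|}{|D|}
  && \text{(empirical entropy)}\\
H(D \mid A) &= \sum_{v=1}^{V} \frac{|D_v|}{|D|}\, H(D_v)
  && \text{(empirical conditional entropy)}\\
g(D, A) &= H(D) - H(D \mid A)
  && \text{(information gain)}\\
g_R(D, A) &= \frac{g(D, A)}{H_A(D)}, \quad
H_A(D) = -\sum_{v=1}^{V} \frac{|D_v|}{|D|}\,\log_2\frac{|D_v|}{|D|}
  && \text{(information gain ratio)}
\end{aligned}
```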
4.3.4. RFGDT Generation
We adopt the information gain ratio criterion to select features and recursively build a RFGDT. The specific method is as follows.
The information gain ratio of each feature of each subdataset is calculated in the Map stage. Then, in the Reduce stage, the information gain ratios for the corresponding features of each subdataset are summed. The feature with the largest sum of information gain ratios is chosen as the feature of the node, and child nodes are constructed from the different feature values. We call the above approach recursively on the child nodes to build an RFGDT until the information gain ratios of all features are very small or there are no features left to choose from. The algorithm is as follows:
Step I:
Step II: in the Map phase, the Map function uses each feature as the key and the information gain ratio as the value, namely,
Step III: in the Reduce phase, the Hadoop distributed system first aggregates the output results of all Map functions according to the key, and then uses these aggregated intermediate results as the input to the Reduce phase. The intermediate results after aggregation are as follows:
Step IV: if all fuzzy granular arrays in
Step V: if
Step VI: or else, calculate the sum of the information gain ratio of each feature in
Step VII: if the information gain ratio of
Step VIII: if not, for each possible value
Step IX: for the node
The algorithm is described as in Algorithm 2.
Algorithm 2: Construct RFGDT.
Input: instance set
Output: root
(1) Normalize instances into
(2) Calculate cluster center set
(3)
(4)
For
For
end for
Build a fuzzy granular array
Get label of
A rule can be built.
End for
(5)
(6) Map stage. In the Map function, each feature
(7) IF for
(8) Reduce stage. Between the Map phase and the Reduce phase, the Hadoop distributed system first aggregates the output results of all Map functions according to the key.
Then, these aggregated intermediate results are used as the input of the Reduce stage, and the intermediate results after aggregation are as follows.
(9) If
Or else,
(10) If the information gain of
Or else, for each possible value
(11) For the node
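The following single-machine Python sketch mirrors the Map/Reduce split selection described above: each map task computes the information gain ratio of every feature on its partition and emits (feature, ratio) pairs, the reduce step sums the ratios per feature, and the feature with the largest sum is chosen as the node's split feature. Hadoop job details are omitted, and the helper assumes discrete feature values.

```python
import math
from collections import Counter, defaultdict

def entropy(labels):
    n = len(labels)
    return -sum(c / n * math.log2(c / n) for c in Counter(labels).values())

def gain_ratio(rows, labels, f):
    """Information gain ratio of feature index f on one data partition."""
    groups = defaultdict(list)
    for row, y in zip(rows, labels):
        groups[row[f]].append(y)
    n = len(labels)
    cond = sum(len(g) / n * entropy(g) for g in groups.values())
    split_info = -sum(len(g) / n * math.log2(len(g) / n) for g in groups.values())
    gain = entropy(labels) - cond
    return gain / split_info if split_info > 0 else 0.0

def map_task(rows, labels, features):
    """Map stage: emit (feature, gain ratio) pairs for this partition."""
    return [(f, gain_ratio(rows, labels, f)) for f in features]

def reduce_and_select(emitted):
    """Reduce stage: sum the gain ratios per feature and return the best feature."""
    totals = defaultdict(float)
    for f, r in emitted:
        totals[f] += r
    return max(totals, key=totals.get)

def select_split_feature(partitions, features):
    """partitions: list of (rows, labels) chunks held by different mappers."""
    emitted = [kv for rows, labels in partitions for kv in map_task(rows, labels, features)]
    return reduce_and_select(emitted)
```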
4.3.5. Pruning
The algorithm recursively generates the RFGDT until it cannot continue. A tree generated in this way is often very accurate on the training data but less accurate on test data, i.e., overfitting occurs. The main reason is that too much attention is paid to correctly classifying the training data, producing an overly complex tree. The solution is to reduce the tree's complexity and simplify it; this simplification is called pruning. Specifically, pruning cuts some subtrees or leaf nodes from the constructed tree and uses their root node or parent node as new leaf nodes, thereby simplifying the classification tree. Pruning of the RFGDT is achieved by minimizing the overall cost function of the tree.
Suppose that the quantity of leaf nodes of the tree
In equation (31),
Step I: empirical entropy of each node is calculated.
Step II: recursively retract upward from the leaf nodes of the tree. Suppose that the whole tree before and after a group of leaf nodes retract to its parent node is
then pruning, that is, the parent node becomes a new leaf node.
Step III: go to Step II, until it cannot continue, and get the subtree with the smallest cost function
The algorithm is shown as in Algorithm 3.
Algorithm 3: Pruning of RFGDT.
Input: fuzzy granular decision tree
Output: pruned subtree
(1) Calculate the empirical entropy of each node
(2) Recursively retract upward from the leaf nodes of the tree.
Suppose that the whole tree before and after a group of leaf nodes retract to its parent node is
(3) If
(4) Return to Step 2 until it cannot continue, and get the subtree with the smallest cost function
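Below is a sketch of the bottom-up retraction in Algorithm 3, assuming the usual regularized cost C_α(T) = Σ_t N_t·H_t + α·|leaves(T)| (the paper's equation (31) is not reproduced, so this exact form and the node representation are assumptions): a parent whose children are all leaves is collapsed whenever collapsing does not increase the cost.

```python
import math
from collections import Counter

class Node:
    def __init__(self, labels, children=None, feature=None):
        self.labels = labels            # training labels reaching this node
        self.children = children or {}  # feature value -> child Node
        self.feature = feature          # split feature index (None for leaves)

    def is_leaf(self):
        return not self.children

def node_cost(node):
    """N_t * H_t: empirical entropy of the node weighted by its sample count."""
    n = len(node.labels)
    return -sum(c * math.log2(c / n) for c in Counter(node.labels).values())

def leaves(node):
    return [node] if node.is_leaf() else [l for c in node.children.values() for l in leaves(c)]

def prune(node, alpha):
    """Recursively retract leaf groups whenever the regularized cost does not increase."""
    for child in node.children.values():
        prune(child, alpha)
    if node.is_leaf() or not all(c.is_leaf() for c in node.children.values()):
        return node
    cost_subtree = sum(node_cost(l) for l in leaves(node)) + alpha * len(leaves(node))
    cost_leaf = node_cost(node) + alpha
    if cost_leaf <= cost_subtree:      # retract: the parent becomes a new leaf
        node.children = {}
    return node
```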
4.3.6. Label Prediction
After the RFGDT is constructed, a test instance is first transformed into a fuzzy granular array and then classified by the trained RFGDT. The method is described in Algorithm 4.
Algorithm 4: RFGDT prediction.
Input: test instance
Output: label
(1) Obtain fuzzy clustering granular vector
(2) Judge the path from the root node to the leaf node according to the feature value, and get the label according to the leaf node.
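A short sketch of Algorithm 4, reusing the Node class and the fuzzy_granulate helper from the earlier sketches (both are assumed representations, not the paper's code): the test instance is granulated with the stored cluster centers and the tree is walked from the root by the value of each split feature until a leaf is reached.

```python
from collections import Counter

def predict(tree, instance, centers):
    """Algorithm 4 sketch: granulate the test instance, then follow the path
    determined by the split features until a leaf is reached."""
    g = fuzzy_granulate(instance, centers)            # from the granulation sketch above
    node = tree
    while not node.is_leaf():
        value = round(g[node.feature], 1)             # assumed discretization of membership
        node = node.children.get(value) or next(iter(node.children.values()))
    return Counter(node.labels).most_common(1)[0][0]  # majority label of the leaf
```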
5. Experimental Analysis
This paper employs 4 datasets from the UC Irvine Machine Learning Repository as data sources for the experimental tests and constructs 8 further datasets by adding 1% and 3% noise, respectively, as shown in Table 2. Tenfold cross-validation was adopted in the experiments. Data of
Table 2
Datasets from the UC Irvine Machine Learning Repository.
No. | Datasets | Number of instances | Number of features | Number of classes
1 | Wine Quality | 4898 | 12 | 11 |
2 | Bank Marketing | 45211 | 17 | 2 |
3 | Localization Data for Person Activity | 164860 | 8 | 8 |
4 | IDA2016Challenge | 76000 | 171 | 2 |
[figures omitted; refer to PDF]
[figures omitted; refer to PDF]
[figures omitted; refer to PDF]
Fuzzy granulation performed as a serial computing task has low efficiency and cannot meet practical needs. If the computing task is parallelizable, the serial task can be converted into a parallel one, and this paper adopts parallel granulation via clustering. MapReduce is commonly employed to handle parallel tasks on large-scale data: the fuzzy granulation of a large-scale dataset can be divided into several subtasks, and the subdatasets can be assigned to computing nodes through a hierarchical computing model. Thanks to its simple, easy-to-use programming interface, MapReduce has been widely used as a parallel programming model and computing framework. The main idea is as follows: a job is divided into multiple independently runnable Map tasks; these Map tasks are distributed to several processors for execution and produce intermediate results; the Reduce tasks then combine these intermediate results to produce the final output. The MapReduce calculation process thus has two parts, Map and Reduce. The input
Table 3
Map process input.
Task of map | Value | |
Map 1 | ||
… | … | |
Map 2 | ||
… | … | |
Map… | … | … |
Map B | ||
… | … |
Table 4
Map process output.
Task of map | Value | |
Map 1 | ||
… | … | |
… | … | |
… | … | |
Map 2 | ||
… | … | |
… | … | |
Map… | … | … |
Map B | ||
… | … | |
… | … |
The output of Reduce is like
Table 5
Reduce process output.
Value | |
… | … |
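The Map/Reduce flow summarized in Tables 3–5 can be mimicked on one machine with Python's multiprocessing: the cluster centers are shared with every worker, each map task granulates its chunk of instances, and the reduce step simply concatenates the per-chunk results. The chunking, worker count, and membership formula are illustrative; a real Hadoop job would replace the pool with mappers and reducers.

```python
import numpy as np
from multiprocessing import Pool
from functools import partial

def map_granulate(chunk, centers):
    """Map task: convert one chunk of instances into fuzzy granular arrays."""
    centers = np.asarray(centers)
    d = np.linalg.norm(chunk[:, None, :] - centers[None, :, :], axis=2)
    return 1.0 / (1.0 + d)            # same assumed membership as the earlier sketch

def parallel_granulate(X, centers, n_workers=4):
    chunks = np.array_split(np.asarray(X), n_workers)        # split the instance set
    with Pool(n_workers) as pool:
        parts = pool.map(partial(map_granulate, centers=centers), chunks)
    return np.vstack(parts)                                   # Reduce: merge the results
```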
Figure 2 compares the efficiency of traditional granulation, serial granulation with clustering, and the parallel granulation with clustering proposed, where the abscissa expresses the quantity of instances and the ordinate denotes the average time taken for granulation. Below
Taking 90% of the Wine Quality dataset as the training set, we obtained the following results. As shown in Figure 3(a), when the number of cluster centers was less than 3006, the average accuracy of RFGDT was lower than that of the other three methods. As the number of cluster centers rose, the average accuracy of RFGDT increased rapidly. In particular, when the number of cluster centers was 3006, it reached a peak of 0.969, while the average accuracies of SVMs, C4.5, and CNN were 0.961, 0.952, and 0.948 (i.e., improvements of 0.83%, 1.79%, and 2.22%, respectively). When the number of cluster centers exceeded 3006, the average accuracy of RFGDT decreased slightly but remained higher than that of the other three methods. After adding 1% noise to the data, as illustrated in Figure 3(b), when the number of cluster centers was 3006, the average accuracy of RFGDT reached a peak of 0.964, while the average accuracies of SVMs, C4.5, and CNN were 0.923, 0.914, and 0.892, respectively; RFGDT improved on them by 4.44%, 5.47%, and 8.07%, respectively. Comparing the noisy and noise-free settings, SVMs, C4.5, and CNN dropped by 3.95%, 3.99%, and 5.91%, respectively, while RFGDT dropped by only 0.52% at the peak. These statistics show that RFGDT is more robust and stable to noise. When we added 3% noise to the data, as shown in Figure 3(c), SVMs, C4.5, CNN, and RFGDT decreased by about 0.54%, 1.97%, 0.56%, and 0.42%, respectively, compared with 1% noise. We then took 70% of the data as the training set to verify performance; overall, the average accuracies of all four methods declined. From Figures 3(d)–3(f), RFGDT performs better than the other three algorithms when the number of clusters is more than 2000.
As illustrated in Figure 4(a), the Bank Marketing dataset contained nearly 50,000 instances, about 10 times the scale of the Wine Quality dataset. The shape of the average accuracy curve of RFGDT was similar to Figure 3: high in the middle and low on both sides. When the quantity of cluster centers was
The quantity of instances in dataset Localization Data for Person Activity was more than 160,000. As illustrated in Figure 5(a), without noise, when
The dimension in dataset IDA2016Challenge is much higher than the other three datasets. We took 90% and 70% of data to test as training sets, respectively. On the basis of this, we added 1% and 3% noise into the dataset, respectively. The detailed results are as follows. As shown in Figure 6(a), when
Besides the datasets mentioned above, we also applied the algorithm to predicting Alzheimer's disease from voice. This dataset was from the University of Pittsburgh and was stored as speech and transcripts from participants comprising elderly controls, people with possible Alzheimer's disease, and people with other dementia diagnoses. The corpus included 1263 instances. Mel-frequency cepstral coefficients (MFCCs) of the corpus were extracted as features for prediction: we calculated the first 20 dimensions together with their first-order and second-order differences, which were concatenated to obtain 57-dimensional features. The precision of RFGDT reached a maximum of 0.932. Predicting Alzheimer's disease from voice is a simple, low-cost method compared with magnetic resonance imaging and is very meaningful and valuable for diagnosis.
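For the voice experiment, MFCC features of the kind described above can be extracted with librosa roughly as follows; the 20-coefficient setting and time-averaging are assumptions, and note that stacking 20 MFCCs with their first- and second-order differences yields 60 values per frame, so the 57-dimensional vector reported in the paper presumably involves an additional selection step not detailed here.

```python
import numpy as np
import librosa

def mfcc_features(wav_path, n_mfcc=20):
    """Extract MFCCs, their first-order and second-order differences, and
    average over time to obtain one feature vector per recording (assumed setup)."""
    y, sr = librosa.load(wav_path, sr=None)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)   # (n_mfcc, frames)
    d1 = librosa.feature.delta(mfcc)                          # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)                 # second-order difference
    stacked = np.vstack([mfcc, d1, d2])                       # (3 * n_mfcc, frames)
    return stacked.mean(axis=1)                               # time-averaged vector
```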
From the above analysis, the average accuracy of RFGDT is better than that of SVMs, C4.5, and CNN on the six datasets. On smaller datasets, CNN performs worse than SVMs and C4.5. In particular, on datasets containing noise, the average accuracy of RFGDT is stable and less sensitive to noise. Judging from the shape of its average accuracy curve, RFGDT is high in the middle and low on both sides. When the value of
[figures omitted; refer to PDF]
6. Discussion
In this study, we propose an RFGDT suitable for binary or multiclass classification problems. The algorithm introduces the idea of parallel distributed granulation, which improves the efficiency of data granulation; in the parallel granulation process, we design AGRC for granulation. We transform the data classification problem into the fuzzy granular space to solve it. In the fuzzy granular space, we define fuzzy granules and fuzzy granular arrays on the basis of the operators designed, and we use the information gain ratio to select features as split points to recursively construct the fuzzy granular decision tree. To avoid overfitting, we also design a pruning algorithm for RFGDT, which improves performance further. In the future, we will apply it to cloud computing and big data.
Acknowledgments
This work was supported in part by the Scientific Research “Climbing” Program of Xiamen University of Technology under Grant nos. XPDKT20027 and XPDKQ18011, in part by the National Natural Science Foundation of China under Grant nos. 61976183 and 41804019, in part by the Natural Science Foundation of Fujian Province of China under Grant nos. 2019J01850 and 2018J01480, in part by Project of Industry-University-Research of Xiamen University and Scientific Research Institute under Grant no. 3502Z20203064, in part by University Natural Sciences Research Project of Anhui Province of China under Grant no. KJ2020A0660, and in part by the Natural Science Foundation of Anhui Province of China under Grant no. 2008085MF202.
[1] J. R. Quinlan, "Induction of decision trees," Machine Learning, vol. 1 no. 1, pp. 81-106, DOI: 10.1007/bf00116251, 1986.
[2] J. C. Schlimmer, D. Fisher, "A case study of incremental concept induction," Proceedings of the 1986 National Conference on Artificial Intelligence, pp. 496-501, .
[3] P. E. Utgoff, "Id5: An incremental id3," Proceedings of the Fifth International Conference on Machine Learning, .
[4] X. Liu, S. Li, "An optimized algorithm of decision tree," Journal of Software, vol. 9 no. 10, pp. 797-800, 1998. (in Chinese)
[5] X. Wang, C. Yang, "Merging-branches impact on decision tree induction," Chinese Journal of Computers, vol. 30 no. 8, pp. 1251-1258, 2007. (in Chinese)
[6] J. R. Quinlan, C4.5: Programs for Machine Learning, 1993.
[7] P. Domingos, G. Hulten, "Mining high-speed data streams," Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, .
[8] G. Hulten, L. Spencer, P. Domingos, "Mining time-changing data streams," Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, .
[9] J. Su, H. Zhang, "A fast decision tree learning algorithm," Procedings of 21th National Conference on Artificial Intelligence, pp. 500-505, .
[10] A. E. Khedr, A. M. Idrees, A. I. El Seddawy, "Enhancing iterative dichotomiser 3 algorithm for classification decision tree," Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, vol. 6 no. 2, pp. 70-79, DOI: 10.1002/widm.1177, 2016.
[11] A. Franco-Arcega, J. A. Carrasco-Ochoa, G. Sánchez-Díaz, J. Fco Martínez-Trinidad, "Building fast decision trees from large training sets," Intelligent Data Analysis, vol. 16 no. 4, pp. 649-664, DOI: 10.3233/ida-2012-0542, 2012.
[12] C. Li, Y. Zhang, X. Li, "OcVFDT: one-class very fast decision tree for one-class classification of data stream," Proceedings of the 3rd International Workshop on Knowledge Discovery from Sensor Data, DOI: 10.1145/1601966.1601981, .
[13] G. Liu, H. Cheng, Z. Qin, Q. Liu, "E-cvfdt: an improving cvfdt method for concept drift data stream," Proceedings of the 2013 International conference on Communications, Circuits and Systems, pp. 315-318, DOI: 10.1109/ICCCAS.2013.6765241, .
[14] M. Batra, R. Agrawal, "Comparative analysis of decision tree algorithms," Nature Inspired Computing, pp. 31-36, DOI: 10.1007/978-981-10-6747-1_4, 2018.
[15] A. Cherfi, K. Nouira, A. Ferchichi, "Very fast c4.5 decision tree algorithm," Applied Artificial Intelligence, vol. 32 no. 2, pp. 119-137, DOI: 10.1080/08839514.2018.1447479, 2018.
[16] X. Dong, M. Qian, R. Jiang, "Packet classification based on the decision tree with information entropy," The Journal of Supercomputing, vol. 76 no. 6, pp. 4117-4131, DOI: 10.1007/s11227-017-2227-z, 2020.
[17] L. Breiman, J. H. Friedman, R. A. Olshen, C. J. Stone, Classification and Regression Trees, 1984.
[18] M. Mehta, R. Agrawal, J. Rissanen, "SLIQ: a fast scalable classifier for data mining," Proceedings of the 1996 International Conference on Extending Database Technology, .
[19] B. Chandra, P. Pallath, "On improving efficiency of SLIQ decision tree algorithm," Proceedings of the 2007 International Joint Conference on Neural Networks, .
[20] R. Rastogi, K. Shim, "Public: a decision tree classifier that integrates building and pruning," Data Mining and Knowledge Discovery, vol. 4 no. 4, pp. 315-344, DOI: 10.1023/a:1009887311454, 2000.
[21] J. Gehrke, R. Ramakrishnan, V. Ganti, "Rainforest-a framework for fast decision tree construction of large datasets," Data Mining and Knowledge Discovery, vol. 4 no. 2/3, pp. 127-162, DOI: 10.1023/a:1009839829793, 2000.
[22] D. Miao, J. Wang, "Rough sets based approach for multivariate decision tree construction," Journal of Software, vol. 6, pp. 425-431, 1997. (in Chinese)
[23] X.-Z. Wang, J.-H. Zhai, S.-X. Lu, "Induction of multiple fuzzy decision trees based on rough set technique," Information Sciences, vol. 178 no. 16, pp. 3188-3202, DOI: 10.1016/j.ins.2008.03.021, 2008.
[24] J.-h. Zhai, "Fuzzy decision tree based on fuzzy-rough technique," Soft Computing, vol. 15 no. 6, pp. 1087-1096, DOI: 10.1007/s00500-010-0584-0, 2011.
[25] J.-M. Wei, S.-Q. Wang, M.-Y. Wang, J.-P. You, D.-Y. Liu, "Rough set based approach for inducing decision trees," Knowledge-Based Systems, vol. 20 no. 8, pp. 695-702, DOI: 10.1016/j.knosys.2006.10.001, 2007.
[26] F. Jiang, Y. Sui, C. Cao, "An incremental decision tree algorithm based on rough sets and its application in intrusion detection," Artificial Intelligence Review, vol. 40 no. 4, pp. 517-530, DOI: 10.1007/s10462-011-9293-z, 2013.
[27] Q. Hu, X. Che, L. Zhang, D. Yu, "Rank entropy-based decision trees for monotonic classification," IEEE Transactions on Knowledge and Data Engineering, vol. 24 no. 11, pp. 2052-2064, DOI: 10.1109/TKDE.2011.149, 2012.
[28] Y. Qian, H. Xu, J. Liang, B. Liu, J. Wang, "Fusing monotonic decision trees," IEEE Transactions on Knowledge and Data Engineering, vol. 27 no. 10, pp. 2717-2728, DOI: 10.1109/tkde.2015.2429133, 2015.
[29] S. Pei, Q. Hu, C. Chen, "Multivariate decision trees with monotonicity constraints," Knowledge-Based Systems, vol. 112, pp. 14-25, DOI: 10.1016/j.knosys.2016.08.023, 2016.
[30] J. C. Shafer, R. Agrawal, M. Mehta, "Sprint: a scalable parallel classifier for data mining," Proceedings of the 1996 22th International Conference on Very Large Data Bases, pp. 544-555, .
[31] R. Kufrin, "Decision trees on parallel processors," Machine Intelligence and Pattern Recognition, vol. 20, pp. 279-306, DOI: 10.1016/S0923-0459(97)80014-6, 1997.
[32] M. V. Joshi, G. Karypis, V. Kumar, "Scalparc: a new scalable and efficient parallel classification algorithm for mining large datasets," Proceedings of the 1998 First Merged International Parallel Processing Symposium and Symposium on Parallel and Distributed Processing, pp. 573-579, .
[33] A. Srivastava, E.-H. Han, V. Kumar, V. Singh, "Parallel formulations of decision-tree classification algorithms," Data Mining and Knowledge Discovery, vol. 3 no. 3, pp. 237-261, DOI: 10.1023/a:1009832825273, 1999.
[34] Y. Sheng, V. V. Phoha, S. M. Rovnyak, "A parallel decision tree-based method for user authentication based on keystroke patterns," IEEE Transactions on Systems, Man and Cybernetics, Part B (Cybernetics), vol. 35 no. 4, pp. 826-833, DOI: 10.1109/tsmcb.2005.846648, 2005.
[35] B. Panda, J. S. Herbach, S. Basu, R. J. Bayardo, "Planet," Proceedings of the VLDB Endowment, vol. 2 no. 2, pp. 1426-1437, DOI: 10.14778/1687553.1687569, 2009.
[36] K. Walkowiak, M. Wozniak, "Decision tree induction methods for distributed environment," Man-Machine Interactions, pp. 201-208, 2009.
[37] W. Yin, Y. Simmhan, V. K. Prasanna, "Scalable regression tree learning on hadoop using openplanet," Proceedings of the 2012 3rd International Workshop on MapReduce and Its Applications, MapReduce ’12,DOI: 10.1145/2287016.2287027, .
[38] R. Wang, Y.-L. He, C.-Y. Chow, F.-F. Ou, J. Zhang, "Learning elm-tree from big data based on uncertainty reduction," Fuzzy Sets and Systems, vol. 258, pp. 79-100, DOI: 10.1016/j.fss.2014.04.028, 2015.
[39] Y. Mu, X. Liu, Z. Yang, X. Lin, "A parallel c4.5 decision tree algorithm based on mapreduce," Concurrency and Computation: Practice and Experience, vol. 29 no. 8,DOI: 10.1002/cpe.4015, 2017.
[40] Y. Mu, L. Wang, X. Liu, "A fast rank mutual information based decision tree and its implementation via map-reduce," Concurrency and Computation: Practice and Experience, vol. 30 no. 10,DOI: 10.1002/cpe.4387, 2018.
[41] A. Segatori, F. Marcelloni, W. Pedrycz, "On distributed fuzzy decision trees for big data," IEEE Transactions on Fuzzy Systems, vol. 26 no. 1, pp. 174-192, DOI: 10.1109/tfuzz.2016.2646746, 2018.
[42] Y. Mu, X. Liu, L. Wang, "A Pearson’s correlation coefficient based decision tree and its parallel implementation," Information Sciences, vol. 435, pp. 40-58, DOI: 10.1016/j.ins.2017.12.059, 2018.
[43] W. Li, Y. Chen, H. Hu, C. Tang, "Using granule to search privacy preserving voice in home iot systems," IEEE Access, vol. 8, pp. 31957-31969, DOI: 10.1109/access.2020.2972975, 2020.
[44] W. Li, Z. Wei, Y. Chen, C. Tang, Y. Song, "Fuzzy granular hyperplane classifiers," IEEE Access, vol. 8, pp. 112066-112077, DOI: 10.1109/access.2020.3002904, 2020.
[45] W. Li, J. Xue, Y. Chen, "Fuzzy granule manifold alignment preserving local topology," IEEE Access, vol. 8, pp. 178695-178705, DOI: 10.1109/ACCESS.2020.3027311, 2020.
[46] W. Li, K. Zhang, Y. Chen, C. Tang, X. Ma, Y. Luo, "Random fuzzy clustering granular hyperplane classifier," IEEE Access, vol. 8, pp. 228767-228778, DOI: 10.1109/access.2020.3046224, 2020.
[47] W. Li, Y. Chen, Y. Song, "Boosted k-nearest neighbor classifiers based on fuzzy granules," Knowledge-Based Systems, vol. 195,DOI: 10.1016/j.knosys.2020.105606, 2020.
[48] C. Fu, W. Lu, W. Pedrycz, J. Yang, "Rule-based granular classification: a hypersphere information granule-based method," Knowledge-Based Systems, vol. 194,DOI: 10.1016/j.knosys.2020.105500, 2020.
[49] P. Singh, Y.-P. Huang, S.-I. Wu, "An intuitionistic fuzzy set approach for multi-attribute information classification and decision-making," International Journal of Fuzzy Systems, vol. 22 no. 5, pp. 1506-1520, DOI: 10.1007/s40815-020-00879-w, 2020.
Copyright © 2021 Wei Li et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. https://creativecommons.org/licenses/by/4.0/
Abstract
In this study, the classification problem is solved from the viewpoint of granular computing; that is, the classification problem is equivalently transformed into the fuzzy granular space and solved there. Most classification algorithms can handle only numerical data; the random fuzzy granular decision tree (RFGDT) can handle not only numerical data but also nonnumerical data such as information granules. The approach proceeds in four steps. First, an adaptive global random clustering (AGRC) algorithm is proposed, which can adaptively find the optimal cluster centers, maximizes the ratio of the between-cluster to the within-cluster standard deviation, and avoids falling into local optimal solutions. Second, on the basis of AGRC, a parallel model is designed for the fuzzy granulation of data to construct the granular space, which greatly enhances efficiency compared with serial granulation. Third, in the fuzzy granular space, we design the RFGDT to classify the fuzzy granules, selecting important features as tree nodes based on the information gain ratio and avoiding overfitting with the proposed pruning algorithm. Finally, we employ datasets from the UC Irvine Machine Learning Repository for verification. Theory and experimental results show that RFGDT has high efficiency and accuracy and is robust in solving classification problems.
Details








1 School of Computer and Information Engineering, Xiamen University of Technology, Xiamen 361024, China
2 School of Artificial Intelligence and Big Data, Hefei University, Hefei 230601, China