Content area
Full text
PSYCHOMETRIKAVOL. 72, NO. 4, 583600 DECEMBER 2007
DOI: 10.1007/S11336-007-9013-4
A COMPARISON OF HEURISTIC PROCEDURES FOR MINIMUM WITHIN-CLUSTER SUMS OF SQUARES PARTITIONING
MICHAEL J. BRUSCO
FLORIDA STATE UNIVERSITY
DOUGLAS STEINLEY
UNIVERSITY OF MISSOURI-COLUMBIA
Perhaps the most common criterion for partitioning a data set is the minimization of the within-cluster sums of squared deviation from cluster centroids. Although optimal solution procedures for within-cluster sums of squares (WCSS) partitioning are computationally feasible for small data sets, heuristic procedures are required for most practical applications in the behavioral sciences. We compared the performances of nine prominent heuristic procedures for WCSS partitioning across 324 simulated data sets representative of a broad spectrum of test conditions. Performance comparisons focused on both percentage deviation from the best-found WCSS values, as well as recovery of true cluster structure. A real-coded genetic algorithm and variable neighborhood search heuristic were the most effective methods; however, a straightforward two-stage heuristic algorithm, HK-means, also yielded exceptional performance. A follow-up experiment using 13 empirical data sets from the clustering literature generally supported the results of the experiment using simulated data. Our ndings have important implications for behavioral science researchers, whose theoretical conclusions could be adversely affected by poor algorithmic performances.
Key words: combinatorial data analysis, cluster analysis, heuristics, sum of squares criterion.
1. Introduction
Monte Carlo comparisons in the cluster analysis literature are especially prevalent because of the large number of methodological alternatives and implementation decisions faced by analysts. Reviews of many of these comparisons are offered by Arabie and Hubert (1992, 1996) and Steinley (2006a). In this paper we focus on a Monte Carlo comparison of methods for one of the most frequently selected criteria in the clustering literature: minimizing the within-cluster sums of squares (WCSS).
We consider a collection of N objects contained in the set S ={o1,o2,...,oN } with the corresponding index set C ={1, 2,...,N}. Each object is measured on V variables and the data are contained in the N V matrix, X =[xiv], whose elements contain the measure of variable v on object oi (for all 1 i N and 1 v V). We assume that the desired number of clusters, K, is prespecied, and the number of feasible partitions of the N objects into K clusters can, therefore, be computed as a...