Simrec: a similarity measure recommendation system for mixed data clustering algorithms

Abstract

Clustering algorithms play a pivotal role in data mining, offering powerful tools for uncovering hidden patterns and structures within datasets. These algorithms divide data points into coherent groups based on their similarities or dissimilarities, making complex data easier to explore and understand. Because clustering algorithms typically rely on similarity measures to assess the likeness between data points, selecting a suitable similarity measure is crucial for achieving satisfactory clustering outcomes. This decision can be challenging, especially for non-experts: the literature offers a plethora of similarity measures, and their performance is closely linked to the specific dataset, clustering algorithm, and cluster validity index employed. The difficulty is even greater for mixed data clustering. Mixed data refers to heterogeneous data characterized by both numerical and categorical attributes. In this context, the same similarity measure cannot be used for both types of attributes because of their different nature. Commonly, two similarity measures are combined, one for numerical attributes and one for categorical attributes, which adds a layer of complexity because two similarity measures must be selected instead of one. This paper introduces SIMREC, a similarity measure recommendation system for mixed data clustering. The system uses meta-learning to mine the relationship between dataset characteristics and the performance of similarity measures for different mixed data clustering algorithms and cluster validity indices. Given a mixed dataset, a mixed data clustering algorithm, and a cluster validity index, the system can therefore recommend suitable pairs of numerical and categorical similarity measures based on the characteristics of the dataset.
We implemented the proposed system using 130 pairs of similarity measures (10 numerical and 13 categorical), four commonly used mixed data clustering algorithms (K-Prototypes, LSH-K-Prototypes, K-Medoids, and Hierarchical Clustering), and three cluster validity indices (Silhouette, Clustering Accuracy, and Adjusted Rand Index). Our experiments on 185 publicly available mixed datasets show that the pairs of similarity measures recommended by SIMREC outperform the baseline pairs, including pairs classically used in the literature.
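
To make the idea of pairing one numerical and one categorical measure concrete, the following is a minimal sketch and not SIMREC's implementation. It assumes the two parts are combined additively with a weight gamma, in the style commonly associated with K-Prototypes. All function names, the example records, and the choice of measures (Euclidean, Manhattan, simple matching) are illustrative assumptions, not taken from the paper.

```python
import math
from itertools import product


def euclidean(a, b):
    """Euclidean distance between two numerical vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a, b):
    """Manhattan (city-block) distance between two numerical vectors."""
    return sum(abs(x - y) for x, y in zip(a, b))


def overlap(a, b):
    """Simple matching dissimilarity: fraction of categorical attributes that differ."""
    return sum(x != y for x, y in zip(a, b)) / len(a)


def mixed_distance(p, q, num_idx, cat_idx, num_measure, cat_measure, gamma=1.0):
    """Combine a numerical and a categorical measure into one mixed-data distance.

    Assumption: additive combination weighted by gamma (hypothetical choice).
    """
    num_part = num_measure([p[i] for i in num_idx], [q[i] for i in num_idx])
    cat_part = cat_measure([p[i] for i in cat_idx], [q[i] for i in cat_idx])
    return num_part + gamma * cat_part


# Two hypothetical mixed records: two numerical and two categorical attributes.
p = (1.0, 2.0, "red", "small")
q = (4.0, 6.0, "red", "large")

# Candidate pairs are the Cartesian product of the two measure families.
numerical = {"euclidean": euclidean, "manhattan": manhattan}
categorical = {"overlap": overlap}

distances = {
    (n_name, c_name): mixed_distance(p, q, [0, 1], [2, 3], n_fn, c_fn)
    for (n_name, n_fn), (c_name, c_fn) in product(numerical.items(), categorical.items())
}
print(distances)
```

With 10 numerical and 13 categorical measures, the same Cartesian product yields the 130 candidate pairs mentioned above; a recommender then has to rank those pairs for a given dataset, algorithm, and validity index.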

Details

Title

Simrec: a similarity measure recommendation system for mixed data clustering algorithms

Publication year

2025

Publication date

Feb 2025

Publisher

Springer Nature B.V.

e-ISSN

2196-1115

Source type

Scholarly Journal

Language of publication

English

DOI

https://doi.org/10.1186/s40537-024-01052-y

ProQuest document ID

3169669025
