Data come in many forms. Nearly any collection of data can be analyzed to produce novel insights. Over the past century, as computers have grown in speed and capacity, new machine-learning techniques have emerged that allow researchers to discover hidden patterns or new knowledge beyond what traditional statistical analyses could provide. While underutilized in library and information science (LIS) research, particularly research related to information provision and library services, these approaches hold great potential for improving our understanding of LIS issues, from information technology adoption to reading habits to library circulation trends (Cordell, 2020; Lund, 2020). This review focuses on one type of machine-learning technique, cluster analysis, which classifies and sorts datasets into groups, or clusters, based on a similarity measure. This method, which shares some commonalities with content analysis (using automated procedures rather than manual ones), may be used in LIS research to identify patterns and logical groupings in large datasets that would otherwise be untenable to analyze manually.
What is cluster analysis?
Cluster analysis is a common data mining process that identifies latent patterns and groupings in a dataset (Härdle and Simar, 2019; Ma and Lund, 2020; Romesburg, 2004). It evolved over nearly a century of computerization and machine-learning development (Jain, 2008). Cluster analysis is an example of an unsupervised machine-learning procedure, in which patterns among data are identified without input from the user (e.g. providing some type of ontology with which the algorithm could classify data) (Kettenring, 2006).
Two of the most common types of cluster analysis are k-means and k-medoids. These two heuristic methods share many similarities, with one key distinction: a k-means cluster is centered on the mean of its members (which need not be an observed value), while a k-medoids cluster is centered on an actual data point. In both techniques, the researcher or analyst must define the value of "k," the number of clusters to be calculated. The number selected may be based on a theory or hypothesis, a statistical estimation, or trial and error. Both approaches will then assign every value in a dataset to one of the k clusters.
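The distinction between the two methods can be illustrated for a single cluster: k-means takes the arithmetic mean of the cluster's members as its center, while k-medoids selects the member that minimizes total distance to all the others. A minimal sketch in plain Python, using a small hypothetical one-dimensional dataset chosen for illustration:

```python
# Hypothetical 1-D cluster; not from any real study.
points = [2.0, 3.0, 4.0, 10.0, 12.0]

# k-means centre: the arithmetic mean of the cluster's members
# (6.2 here, which is not itself an observed data point).
mean_centre = sum(points) / len(points)

# k-medoids centre: the actual member with the smallest summed
# distance to every other member.
def total_distance(candidate, data):
    return sum(abs(candidate - x) for x in data)

medoid_centre = min(points, key=lambda p: total_distance(p, points))

print(mean_centre)    # 6.2
print(medoid_centre)  # 4.0
```

Because the medoid must be an observed point, k-medoids is often described as more robust to outliers: the extreme values 10.0 and 12.0 pull the mean upward but cannot pull the medoid off the data.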
The k-means algorithm begins by creating k clusters, then sorting each data point into the cluster with the nearest mean value. In the initial iteration, the clusters will often be very imperfect, as the "means" are more or less arbitrarily defined by the algorithm. After all the points have been assigned to a cluster, new...
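The assign-and-update loop described above can be sketched in plain Python. This is a toy one-dimensional illustration under assumed inputs, not a production implementation; real analyses would typically rely on an established statistical package:

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Toy 1-D k-means: repeatedly assign each point to the nearest
    mean, then recompute each mean from its assigned points."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # arbitrary initial means
    for _ in range(iterations):
        # Assignment step: each point joins the cluster whose mean is nearest.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: abs(p - means[i]))
            clusters[nearest].append(p)
        # Update step: each mean becomes the average of its cluster's members.
        new_means = [sum(c) / len(c) if c else means[i]
                     for i, c in enumerate(clusters)]
        if new_means == means:  # converged: assignments no longer change
            break
        means = new_means
    return means, clusters

# Hypothetical data with two obvious groupings.
data = [1.0, 1.5, 2.0, 9.0, 9.5, 10.0]
means, clusters = kmeans(data, k=2)
print(sorted(means))  # [1.5, 9.5]
```

As the paragraph notes, the first iteration's clusters are usually poor because the initial means are arbitrary; it is the repeated recomputation of the means that drives the clusters toward a stable grouping.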