A Method for Estimating the Number of Clusters Using the Gap Statistic
In the era of big data, where massive volumes of information are collected at high velocity from many sources, data mining has become a crucial tool for organizations seeking a competitive advantage. Among its core tasks, clustering uncovers hidden patterns in unlabeled data by grouping similar objects into distinct clusters. Widely used methods such as k-means and its robust counterpart PAM (Partitioning Around Medoids) require the number of clusters, k, to be specified in advance, a task that remains a major challenge despite extensive research. This study addresses the problem of selecting the number of clusters by proposing three novel enhancements to the widely used gap statistic method: the 1stDaccSEmax heuristic rule, the recursive gap strategy, and the two-way bootstrapping technique. Collectively termed the new gap, this approach aims to overcome the limitations of the original gap statistic, particularly on datasets with overlapping clusters, hierarchical structure, or large volume. Extensive experiments on both synthetic and real-world datasets (including the Iris, Breast Cancer, Seeds, and Khan gene expression datasets) demonstrate that the new gap method outperforms traditional techniques such as the elbow method, silhouette analysis, and the original gap statistic in both accuracy and computational efficiency. Although PAM was used throughout the experiments for its robustness, the proposed approach is algorithm-agnostic and can be integrated with any other clustering method that requires k to be chosen. The results suggest that the new gap method provides a more reliable and scalable way to determine the number of clusters, thereby making clustering-based data analysis more effective in real-world applications.
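For orientation, the original gap statistic that serves as the baseline here can be sketched as follows. This is a minimal illustration of the commonly described procedure, not the new gap method proposed in this study: it compares log W_k on the data with its average over uniform reference datasets and picks the smallest k with Gap(k) >= Gap(k+1) - s_{k+1}. Several things are assumptions made for the sketch: a plain k-means with farthest-first initialisation is swapped in for PAM to keep the code dependency-free, the reference data are drawn uniformly over the bounding box of the data, and all function names (`kmeans_labels`, `log_dispersion`, `best_log_wk`, `gap_statistic`, `make_blobs`) are hypothetical.

```python
import math
import random


def kmeans_labels(points, k, rng, n_iter=50):
    """Lloyd's k-means with farthest-first initialisation (assumed stand-in for PAM)."""
    centers = [rng.choice(points)]
    while len(centers) < k:
        # Next centre: the point farthest from every centre chosen so far.
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = None
    for _ in range(n_iter):
        new_labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        if new_labels == labels:  # assignments stable: converged
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centers[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels


def log_dispersion(points, labels, k):
    """log W_k: log of the pooled within-cluster sum of squared distances to the mean."""
    total = 0.0
    for j in range(k):
        members = [p for p, lab in zip(points, labels) if lab == j]
        if members:
            mean = tuple(sum(col) / len(members) for col in zip(*members))
            total += sum(math.dist(p, mean) ** 2 for p in members)
    return math.log(total)


def best_log_wk(points, k, rng, n_init=3):
    """Smallest log W_k over a few restarts with different first centres."""
    return min(log_dispersion(points, kmeans_labels(points, k, rng), k)
               for _ in range(n_init))


def gap_statistic(points, k_max, n_ref=10, seed=0):
    """Return (chosen k, list of Gap(1..k_max)) for the original gap rule."""
    rng = random.Random(seed)
    lows = [min(col) for col in zip(*points)]
    highs = [max(col) for col in zip(*points)]
    gaps, errs = [], []
    for k in range(1, k_max + 1):
        ref_logs = []
        for _ in range(n_ref):
            # Reference dataset: uniform over the bounding box of the data.
            ref = [tuple(rng.uniform(lo, hi) for lo, hi in zip(lows, highs))
                   for _point in points]
            ref_logs.append(best_log_wk(ref, k, rng))
        mean_ref = sum(ref_logs) / n_ref
        sd = math.sqrt(sum((v - mean_ref) ** 2 for v in ref_logs) / n_ref)
        gaps.append(mean_ref - best_log_wk(points, k, rng))
        errs.append(sd * math.sqrt(1 + 1 / n_ref))  # s_k
    for k in range(1, k_max):
        # gaps[k - 1] is Gap(k); gaps[k] and errs[k] belong to k + 1.
        if gaps[k - 1] >= gaps[k] - errs[k]:
            return k, gaps
    return k_max, gaps


def make_blobs(centers, n_per, sd, seed=1):
    """Hypothetical helper: 2-D Gaussian blobs around the given centres."""
    rng = random.Random(seed)
    return [(rng.gauss(cx, sd), rng.gauss(cy, sd))
            for cx, cy in centers for _ in range(n_per)]


if __name__ == "__main__":
    data = make_blobs([(0, 0), (10, 0), (5, 10)], n_per=30, sd=0.5)
    k_hat, gap_values = gap_statistic(data, k_max=5)
    print("estimated k:", k_hat)
    print("gap values:", [round(g, 3) for g in gap_values])
```

On well-separated synthetic blobs this baseline rule usually recovers the true number of groups; the overlapping, hierarchical, and large-volume cases named above are where the abstract reports that it falls short.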