Proposing a New Gap Statistic-Based Estimation for the Number of Clusters

Dang Ha Thanh Huong; Jaekyung Yang

Indexing: KCI listed

Korean title (translated): A Method for Estimating the Number of Clusters Using the Gap Statistic

Language: Korean
URL: https://db.koreascholar.com/Article/Detail/445031

Journal of Society of Korea Industrial and Systems Engineering

Vol.48 No.3 (2025.09)
pp.104-111

Publisher: Society of Korea Industrial and Systems Engineering

Abstract

In the era of big data, where massive volumes of information are collected at high velocity from various sources, data mining has become a crucial tool for organizations seeking competitive advantage. Among its core tasks, clustering plays a key role in uncovering hidden patterns within unlabeled data by grouping similar objects into distinct clusters. Widely used methods such as k-means and its robust counterpart PAM (Partitioning Around Medoids) require the number of clusters, k, to be specified in advance, a task that remains a major challenge despite extensive research. This study addresses the problem of selecting the optimal number of clusters by proposing three novel enhancements to the widely used gap statistic method: the 1stDaccSEmax heuristic rule, the recursive gap strategy, and the two-way bootstrapping technique. Collectively termed the new gap, this approach aims to overcome the limitations of the original gap statistic, particularly in datasets with overlapping clusters, hierarchical structures, or large volumes. Extensive experiments on both synthetic and real-world datasets (including the Iris, Breast Cancer, Seeds, and Khan gene expression datasets) demonstrate that the new gap method outperforms traditional techniques such as the elbow method, silhouette analysis, and the original gap statistic in both accuracy and computational efficiency. Although PAM was used throughout the experiments for its robustness, the proposed approach is algorithm-agnostic and can be integrated with other clustering methods that require the selection of k. The results suggest that the new gap method provides a more reliable and scalable solution for determining the number of clusters, thereby enhancing the effectiveness of clustering-based data analysis in real-world applications.
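
For orientation, the original gap statistic that the abstract builds on can be sketched as follows. This is a minimal illustration, not the paper's new gap method: the 1stDaccSEmax rule, the recursive gap strategy, and two-way bootstrapping are not specified on this page and are not implemented here. The sketch assumes a small NumPy k-means in place of PAM, a uniform reference distribution over the data's bounding box, and the standard selection rule (the smallest k with Gap(k) >= Gap(k+1) - s(k+1)). All function names and parameters (`kmeans_wss`, `gap_statistic`, `n_refs`, and so on) are hypothetical.

```python
import numpy as np


def init_centers(X, k, rng):
    """k-means++ style seeding: later centers favor points far from earlier ones."""
    centers = [X[rng.integers(len(X))]]
    for _ in range(1, k):
        d2 = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centers)


def kmeans_wss(X, k, rng, n_init=5, n_iter=50):
    """Smallest within-cluster sum of squares over a few Lloyd runs.

    Assumption: plain k-means stands in for PAM, which the paper uses.
    """
    best = np.inf
    for _ in range(n_init):
        centers = init_centers(X, k, rng)
        for _ in range(n_iter):
            # Assign each point to its nearest center.
            d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
            labels = d2.argmin(axis=1)
            # Recompute centers; keep the old center if a cluster is empty.
            new_centers = np.array([
                X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                for j in range(k)
            ])
            if np.allclose(new_centers, centers):
                break
            centers = new_centers
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        best = min(best, d2.min(axis=1).sum())
    return best


def gap_statistic(X, k_max=6, n_refs=10, seed=0):
    """Original gap statistic with the standard one-standard-error rule."""
    rng = np.random.default_rng(seed)
    lo, hi = X.min(axis=0), X.max(axis=0)
    gaps = np.empty(k_max)
    s = np.empty(k_max)
    for i, k in enumerate(range(1, k_max + 1)):
        log_w = np.log(kmeans_wss(X, k, rng))
        # Reference datasets: uniform over the bounding box of X (an assumption).
        ref_log_w = np.array([
            np.log(kmeans_wss(rng.uniform(lo, hi, size=X.shape), k, rng))
            for _ in range(n_refs)
        ])
        gaps[i] = ref_log_w.mean() - log_w
        s[i] = ref_log_w.std() * np.sqrt(1 + 1 / n_refs)
    # Smallest k such that Gap(k) >= Gap(k+1) - s(k+1); fall back to k_max.
    for i in range(k_max - 1):
        if gaps[i] >= gaps[i + 1] - s[i + 1]:
            return i + 1, gaps, s
    return k_max, gaps, s


# Usage on synthetic data: three well-separated Gaussian blobs in 2-D.
data_rng = np.random.default_rng(42)
blob_centers = np.array([[0.0, 0.0], [10.0, 0.0], [0.0, 10.0]])
X = np.vstack([c + data_rng.normal(scale=0.5, size=(50, 2)) for c in blob_centers])
k_hat, gaps, s = gap_statistic(X)
print("estimated number of clusters:", k_hat)
```

The selection rule is the part the abstract proposes to change: with overlapping or hierarchically nested clusters, the first k that satisfies the inequality can come too early, which is one of the limitations the new gap is intended to address.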

Keywords

Clustering, Gap Statistic, Number of Clusters

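1. Introduction
2. Literature Review
    2.1 Elbow Method
    2.2 Average Silhouette Method
    2.3 Hartigan Statistic Method
    2.4 Gap Statistic Method
3. Proposed Methodology
    3.1 Limitations of the Gap Statistic Method
    3.2 New Gap
    3.3 Experimental Results
4. Conclusion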
References

Authors

Dang Ha Thanh Huong (MIS Department, Shinhan Viet Nam Finance Company Ltd.)
Jaekyung Yang (Department of Industrial and Information Systems Engineering, Jeonbuk National University), corresponding author
