A K-Means based clustering algorithm with balanced constraints
-
Abstract: Clustering is a core data analysis technique that is widely applied across many fields. Owing to the intrinsic distribution of data, however, traditional clustering algorithms cannot guarantee balanced clustering results, which is inconsistent with many real-world requirements. To address this problem, this paper proposes a K-Means based clustering algorithm with balanced constraints. The proposed algorithm modifies the data-point assignment step in each K-Means iteration so that an upper bound is enforced on the number of data points each cluster may contain. The maximum cluster size is user-configurable, accommodating different balanced-clustering requirements. The algorithm is simple and efficient: only two parameters need to be specified, the number of clusters and the maximum cluster size, and its time complexity is low. Experiments on six real UCI (University of California Irvine) datasets show that the proposed balanced clustering algorithm achieves better clustering quality and running time than other balanced clustering algorithms.
-
Key words:
- balanced constraints
- clustering
- greedy algorithm
- data management
-
Algorithm 1 Balanced constrained clustering algorithm
Input: dataset $T$; number of clusters $K$; upper bound $\tau$ on the number of data points a cluster may contain
Output: set of $K$ clusters $cd=\{d_1, d_2, \cdots, d_K\}$

Procedure AddToNearestCluster($x_i$):
1: compute the Euclidean distance from $x_i$ to every center in $C=\{\mu_j \vert j=1, 2, \cdots, K\}$, and sort the clusters by this distance in ascending order (nearest first) to obtain ${\rm STC}=\{\gamma_j \vert j=1, 2, \cdots, K\}$;
2: for $j \leftarrow 1$ to $K$ do
3:   if $\vert d_j \vert < \tau$
4:     insert $x_i$ into $d_j$;
5:     record $x_i$ and its distance $\gamma_j$ to the $j$-th center in ${\rm CS}_j=\{(x_i, \gamma_j), \cdots\}$;
6:     break
7:   else
8:     if the distance $\gamma_j$ from $x_i$ to the $j$-th center is greater than the largest distance stored in ${\rm CS}_j$
9:       continue
10:    else
11:      remove the boundary point $z$ (the point farthest from the center) from the $j$-th cluster and insert $x_i$ into the $j$-th cluster
12:      AddToNearestCluster($z$)

Main procedure:
13: randomly select $K$ data points from $T$ as the initial centers $C=\{\mu_j \vert j=1, 2, \cdots, K\}$
14: repeat
15:   for $i \leftarrow 1$ to $N$ do
16:     AddToNearestCluster($x_i$)
17:   recompute the center of each cluster
18: until the clusters no longer change or the maximum number of iterations is reached
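The assignment strategy of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function and variable names (`balanced_kmeans`, `add_to_nearest_cluster`, `tau`) are ours, Euclidean distance and list-based clusters are assumed, and the sketch requires $K\tau \ge N$ so that every point can be placed.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def add_to_nearest_cluster(x, centers, clusters, tau):
    """Greedy assignment step of Algorithm 1 (sketch).

    Tries clusters from nearest to farthest; a full cluster may evict its
    boundary (farthest) point to make room for a closer point, and the
    evicted point is then reinserted recursively.
    """
    # Visit clusters in ascending order of distance from x to their centers.
    order = sorted(range(len(centers)), key=lambda j: euclidean(x, centers[j]))
    for j in order:
        d = euclidean(x, centers[j])
        if len(clusters[j]) < tau:            # cluster j still has room
            clusters[j].append((x, d))
            return
        # Cluster j is full: compare x against its current boundary point.
        far = max(range(len(clusters[j])), key=lambda i: clusters[j][i][1])
        if d > clusters[j][far][1]:
            continue                          # x is farther; try the next cluster
        z, _ = clusters[j].pop(far)           # evict boundary point z, insert x
        clusters[j].append((x, d))
        add_to_nearest_cluster(z, centers, clusters, tau)
        return

def balanced_kmeans(points, k, tau, max_iter=100, seed=0):
    """Balanced K-Means sketch: at most `tau` points per cluster.

    Assumes k * tau >= len(points); otherwise some point cannot be placed.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # K random initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in points:
            add_to_nearest_cluster(x, centers, clusters, tau)
        # Recompute each center as the mean of its cluster (keep old if empty).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*(p for p, _ in cl)))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:            # convergence: centers unchanged
            break
        centers = new_centers
    return [[p for p, _ in cl] for cl in clusters]
```

On two well-separated groups of three points each, `balanced_kmeans(points, 2, tau=3)` recovers the groups while capping each cluster at three members, whereas plain K-Means with an unlucky initialization could place all six points in one cluster.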
Tab. 1 Clustering results of six clustering algorithms on six datasets
| Metric | Dataset | Size | #Clusters | KM | FCM | SC | BKM | BCLS | BCGS |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | Wine | 148 | 3 | 0.9375 | 0.8888 | 0.9722 | 0.8750 | 0.9861 | 0.9722 |
| Accuracy | Ionosphere | 252 | 2 | 0.7698 | 0.7539 | 0.6706 | 0.7857 | 0.7460 | 0.7857 |
| Accuracy | Iris | 150 | 3 | 0.9000 | 0.9200 | 0.8866 | 0.9466 | 0.8533 | 0.9200 |
| Accuracy | Cryotherapy | 84 | 2 | 0.7380 | 0.5476 | 0.5476 | 0.7857 | 0.7857 | 0.8095 |
| Accuracy | User Model | 200 | 4 | 0.5950 | 0.4250 | 0.3850 | 0.4600 | 0.5550 | 0.6150 |
| Accuracy | Vehicle | 796 | 4 | 0.3856 | 0.3706 | 0.4246 | 0.4183 | 0.4258 | 0.4334 |
| NMI | Wine | 148 | 3 | 0.8009 | 0.6970 | 0.8859 | 0.6260 | 0.9385 | 0.8949 |
| NMI | Ionosphere | 252 | 2 | 0.2220 | 0.1951 | 0.1645 | 0.2504 | 0.1824 | 0.2504 |
| NMI | Iris | 150 | 3 | 0.7436 | 0.7773 | 0.7629 | 0.8308 | 0.6802 | 0.7773 |
| NMI | Cryotherapy | 84 | 2 | 0.1874 | 0.0065 | 0.0065 | 0.2504 | 0.2504 | 0.2975 |
| NMI | User Model | 200 | 4 | 0.4523 | 0.2373 | 0.2259 | 0.2641 | 0.3725 | 0.4073 |
| NMI | Vehicle | 796 | 4 | 0.1131 | 0.1013 | 0.1743 | 0.1432 | 0.1059 | 0.1567 |
| Normalized entropy | Wine | 148 | 3 | 0.9945 | 0.9994 | 0.9994 | 1.0000 | 1.0000 | 1.0000 |
| Normalized entropy | Ionosphere | 252 | 2 | 0.9992 | 0.9998 | 0.9957 | 1.0000 | 1.0000 | 1.0000 |
| Normalized entropy | Iris | 150 | 3 | 0.9998 | 0.9999 | 0.9722 | 1.0000 | 1.0000 | 1.0000 |
| Normalized entropy | Cryotherapy | 84 | 2 | 0.9566 | 0.9983 | 0.9983 | 1.0000 | 1.0000 | 1.0000 |
| Normalized entropy | User Model | 200 | 4 | 0.9359 | 0.9884 | 0.9932 | 1.0000 | 0.7867 | 1.0000 |
| Normalized entropy | Vehicle | 796 | 4 | 0.9948 | 0.9959 | 0.8264 | 1.0000 | 0.9833 | 1.0000 |
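The balance of a partition, reported above as normalized information entropy, is commonly defined as the Shannon entropy of the cluster-size distribution divided by $\log K$, so that a perfectly balanced partition scores 1. A minimal sketch (our own helper, assuming this standard definition):

```python
import math

def normalized_entropy(sizes):
    """Normalized information entropy of a partition's cluster sizes.

    sizes: list of cluster sizes n_1, ..., n_K.
    Returns 1.0 for perfectly balanced clusters; tends toward 0 as one
    cluster absorbs all points.
    """
    n = sum(sizes)
    k = len(sizes)
    # Shannon entropy of the size distribution, skipping empty clusters.
    h = -sum((s / n) * math.log(s / n) for s in sizes if s > 0)
    return h / math.log(k)   # normalize by the maximum entropy log K
```

For example, a split of 100/100 points scores 1.0, while a 99/1 split scores about 0.08, which is why the balanced algorithms in Tab. 1 cluster near 1.0 on this metric.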