A K-Means based clustering algorithm with balanced constraints
-
Abstract: Clustering is a core data analysis technique that is widely applied across many fields. Owing to the intrinsic distribution of data, however, traditional clustering algorithms cannot guarantee balanced clustering results, which is inconsistent with many real-world requirements. To address this problem, this paper proposes a K-Means based clustering algorithm with balanced constraints. The proposed algorithm modifies the data-point assignment step in each K-Means iteration so that an upper bound is enforced on the number of data points each cluster may contain. The maximum cluster size is user-configurable, accommodating different balanced-clustering requirements. The algorithm is simple and efficient: only two parameters need to be specified, the number of clusters and the maximum cluster size, and its time complexity is low. Experiments on six real UCI (University of California Irvine) datasets show that the proposed balanced clustering algorithm achieves better clustering quality and running time than other balanced clustering algorithms.
-
Key words:
- balanced constraints
- clustering
- greedy algorithm
- data management
-
Algorithm 1 Balanced constrained clustering algorithm
Input: dataset $T$; number of clusters $K$; upper bound $\tau$ on the number of data points a cluster may contain
Output: set of $K$ clusters $cd=\{d_1, d_2, \cdots, d_K\}$

Procedure AddToNearestCluster($x_i$):
1: compute the Euclidean distance from $x_i$ to every center in $C=\{\mu_j \vert j=1, 2, \cdots, K\}$, and sort the clusters by this distance in ascending order (nearest first) to obtain ${\rm STC}=\{\gamma_j \vert j=1, 2, \cdots, K\}$;
2: for $j \leftarrow 1$ to $K$ do
3:   if $\vert d_j \vert < \tau$
4:     insert $x_i$ into $d_j$;
5:     record $x_i$ and its distance $\gamma_j$ to the $j$-th center in ${\rm CS}_j=\{(x_i, \gamma_j), \cdots\}$;
6:     break
7:   else
8:     if the distance $\gamma_j$ from $x_i$ to the $j$-th center is greater than the largest distance stored in ${\rm CS}_j$
9:       continue
10:    else
11:      remove the boundary point $z$ (the point farthest from the center) from the $j$-th cluster and insert $x_i$ into the $j$-th cluster
12:      AddToNearestCluster($z$)

Main procedure:
13: randomly select $K$ data points from $T$ as the initial centers $C=\{\mu_j \vert j=1, 2, \cdots, K\}$
14: repeat
15:   for $i \leftarrow 1$ to $N$ do
16:     AddToNearestCluster($x_i$)
17:   recompute the center of each cluster
18: until the clusters no longer change or the maximum number of iterations is reached
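The assignment strategy of Algorithm 1 can be sketched in Python as follows. This is a minimal illustration, not the authors' implementation: the function and variable names (`balanced_kmeans`, `add_to_nearest_cluster`, `tau`) are ours, Euclidean distance and list-based clusters are assumed, and the sketch requires $K\tau \ge N$ so that every point can be placed.

```python
import math
import random

def euclidean(a, b):
    """Euclidean distance between two points given as tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def add_to_nearest_cluster(x, centers, clusters, tau):
    """Greedy assignment step of Algorithm 1 (sketch).

    Tries clusters from nearest to farthest; a full cluster may evict its
    boundary (farthest) point to make room for a closer point, and the
    evicted point is then reinserted recursively.
    """
    # Visit clusters in ascending order of distance from x to their centers.
    order = sorted(range(len(centers)), key=lambda j: euclidean(x, centers[j]))
    for j in order:
        d = euclidean(x, centers[j])
        if len(clusters[j]) < tau:            # cluster j still has room
            clusters[j].append((x, d))
            return
        # Cluster j is full: compare x against its current boundary point.
        far = max(range(len(clusters[j])), key=lambda i: clusters[j][i][1])
        if d > clusters[j][far][1]:
            continue                          # x is farther; try the next cluster
        z, _ = clusters[j].pop(far)           # evict boundary point z, insert x
        clusters[j].append((x, d))
        add_to_nearest_cluster(z, centers, clusters, tau)
        return

def balanced_kmeans(points, k, tau, max_iter=100, seed=0):
    """Balanced K-Means sketch: at most `tau` points per cluster.

    Assumes k * tau >= len(points); otherwise some point cannot be placed.
    """
    rng = random.Random(seed)
    centers = rng.sample(points, k)           # K random initial centers
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for x in points:
            add_to_nearest_cluster(x, centers, clusters, tau)
        # Recompute each center as the mean of its cluster (keep old if empty).
        new_centers = [
            tuple(sum(c) / len(cl) for c in zip(*(p for p, _ in cl)))
            if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:            # convergence: centers unchanged
            break
        centers = new_centers
    return [[p for p, _ in cl] for cl in clusters]
```

On two well-separated groups of three points each, `balanced_kmeans(points, 2, tau=3)` recovers the groups while capping each cluster at three members, whereas plain K-Means with an unlucky initialization could place all six points in one cluster.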
Tab. 1 Clustering results of six clustering algorithms on six datasets
| Metric | Dataset | Size | #Clusters | KM | FCM | SC | BKM | BCLS | BCGS |
|---|---|---|---|---|---|---|---|---|---|
| Accuracy | Wine | 148 | 3 | 0.9375 | 0.8888 | 0.9722 | 0.8750 | 0.9861 | 0.9722 |
| Accuracy | Ionosphere | 252 | 2 | 0.7698 | 0.7539 | 0.6706 | 0.7857 | 0.7460 | 0.7857 |
| Accuracy | Iris | 150 | 3 | 0.9000 | 0.9200 | 0.8866 | 0.9466 | 0.8533 | 0.9200 |
| Accuracy | Cryotherapy | 84 | 2 | 0.7380 | 0.5476 | 0.5476 | 0.7857 | 0.7857 | 0.8095 |
| Accuracy | User Model | 200 | 4 | 0.5950 | 0.4250 | 0.3850 | 0.4600 | 0.5550 | 0.6150 |
| Accuracy | Vehicle | 796 | 4 | 0.3856 | 0.3706 | 0.4246 | 0.4183 | 0.4258 | 0.4334 |
| NMI | Wine | 148 | 3 | 0.8009 | 0.6970 | 0.8859 | 0.6260 | 0.9385 | 0.8949 |
| NMI | Ionosphere | 252 | 2 | 0.2220 | 0.1951 | 0.1645 | 0.2504 | 0.1824 | 0.2504 |
| NMI | Iris | 150 | 3 | 0.7436 | 0.7773 | 0.7629 | 0.8308 | 0.6802 | 0.7773 |
| NMI | Cryotherapy | 84 | 2 | 0.1874 | 0.0065 | 0.0065 | 0.2504 | 0.2504 | 0.2975 |
| NMI | User Model | 200 | 4 | 0.4523 | 0.2373 | 0.2259 | 0.2641 | 0.3725 | 0.4073 |
| NMI | Vehicle | 796 | 4 | 0.1131 | 0.1013 | 0.1743 | 0.1432 | 0.1059 | 0.1567 |
| Normalized entropy | Wine | 148 | 3 | 0.9945 | 0.9994 | 0.9994 | 1.0000 | 1.0000 | 1.0000 |
| Normalized entropy | Ionosphere | 252 | 2 | 0.9992 | 0.9998 | 0.9957 | 1.0000 | 1.0000 | 1.0000 |
| Normalized entropy | Iris | 150 | 3 | 0.9998 | 0.9999 | 0.9722 | 1.0000 | 1.0000 | 1.0000 |
| Normalized entropy | Cryotherapy | 84 | 2 | 0.9566 | 0.9983 | 0.9983 | 1.0000 | 1.0000 | 1.0000 |
| Normalized entropy | User Model | 200 | 4 | 0.9359 | 0.9884 | 0.9932 | 1.0000 | 0.7867 | 1.0000 |
| Normalized entropy | Vehicle | 796 | 4 | 0.9948 | 0.9959 | 0.8264 | 1.0000 | 0.9833 | 1.0000 |
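The balance of a partition, reported above as normalized information entropy, is commonly defined as the Shannon entropy of the cluster-size distribution divided by $\log K$, so that a perfectly balanced partition scores 1. A minimal sketch (our own helper, assuming this standard definition):

```python
import math

def normalized_entropy(sizes):
    """Normalized information entropy of a partition's cluster sizes.

    sizes: list of cluster sizes n_1, ..., n_K.
    Returns 1.0 for perfectly balanced clusters; tends toward 0 as one
    cluster absorbs all points.
    """
    n = sum(sizes)
    k = len(sizes)
    # Shannon entropy of the size distribution, skipping empty clusters.
    h = -sum((s / n) * math.log(s / n) for s in sizes if s > 0)
    return h / math.log(k)   # normalize by the maximum entropy log K
```

For example, a split of 100/100 points scores 1.0, while a 99/1 split scores about 0.08, which is why the balanced algorithms in Tab. 1 cluster near 1.0 on this metric.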