Fast Adaptive Clustering by Synchronization on Large-Scale Datasets
Abstract: The existing synchronization clustering algorithm Sync treats every attribute of every sample as a phase oscillator during the synchronization process. Its time complexity is therefore high, which severely limits its use on large-scale datasets. To address this problem, we propose a fast adaptive KDE-based clustering by synchronization algorithm (FAKCS). FAKCS first compresses the large-scale dataset with a fast compression method based on the reduced set density estimator (RSDE) and the center-constrained minimum enclosing ball (CCMEB) technique. It then performs synchronization clustering on the compressed set, selecting the parameter ε adaptively with the Davies-Bouldin index and using a newly defined order parameter to measure the degree of local synchronization. We also study the relationship between this order parameter and kernel density estimation (KDE), which reveals, in theory, the nature of local synchronization in terms of probability density. FAKCS can detect clusters of arbitrary shape, number, and density on large-scale datasets without the number of clusters being specified in advance. Experiments on image segmentation and on large UCI datasets demonstrate the effectiveness of FAKCS.
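To make the adaptive step concrete, the following is a minimal sketch, not the FAKCS implementation. It assumes a Kuramoto-style update in the spirit of Sync, in which each coordinate of a point is pulled toward its ε-neighbors through a sine coupling, and it picks ε from a candidate grid by the lowest Davies-Bouldin index. The RSDE/CCMEB compression step and the paper's new order parameter are not reproduced here. All function names, the candidate grid, the tolerances, and the rule that rejects degenerate partitions are illustrative assumptions, and the data are assumed to be scaled to a small range so that the sine coupling is close to linear.

```python
import math


def sync_step(points, eps):
    """One synchronous update: every coordinate moves by the mean of
    sin(neighbor - self) over the eps-neighborhood (self included)."""
    updated = []
    for x in points:
        nbrs = [y for y in points if math.dist(x, y) <= eps]
        updated.append(tuple(
            xi + sum(math.sin(y[d] - xi) for y in nbrs) / len(nbrs)
            for d, xi in enumerate(x)
        ))
    return updated


def sync_cluster(points, eps, max_iter=50, tol=1e-3):
    """Iterate until points stop moving, then group points that have
    collapsed onto (nearly) the same location. Returns one label per point."""
    state = [tuple(p) for p in points]
    for _ in range(max_iter):
        nxt = sync_step(state, eps)
        moved = max(math.dist(a, b) for a, b in zip(state, nxt))
        state = nxt
        if moved < tol:
            break
    labels, reps = [], []
    for s in state:
        for k, r in enumerate(reps):
            if math.dist(s, r) < 10 * tol:  # assumed merge tolerance
                labels.append(k)
                break
        else:
            reps.append(s)
            labels.append(len(reps) - 1)
    return labels


def davies_bouldin(points, labels):
    """Davies-Bouldin index (lower is better). Assumption: partitions with
    one cluster or only singletons are rejected by returning infinity."""
    ks = sorted(set(labels))
    if len(ks) < 2 or len(ks) == len(points):
        return math.inf
    cent, scat = {}, {}
    for k in ks:
        members = [p for p, l in zip(points, labels) if l == k]
        c = tuple(sum(col) / len(members) for col in zip(*members))
        cent[k] = c
        scat[k] = sum(math.dist(p, c) for p in members) / len(members)
    return sum(
        max((scat[i] + scat[j]) / math.dist(cent[i], cent[j])
            for j in ks if j != i)
        for i in ks
    ) / len(ks)


def adaptive_sync(points, eps_grid):
    """Try each candidate eps and keep the partition with the lowest index."""
    best = None
    for eps in eps_grid:
        labels = sync_cluster(points, eps)
        score = davies_bouldin(points, labels)
        if best is None or score < best[0]:
            best = (score, eps, labels)
    return best
```

Calling `adaptive_sync(points, eps_grid)` returns a `(score, eps, labels)` tuple. The neighbor search in this sketch is quadratic in the number of points, which illustrates why a compression step becomes attractive before synchronization is run on a large dataset.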