Clustering Algorithm of High-Dimensional Data Based on Units

Xie Kunwu, Bi Xiaoling, and Ye Bin

Abstract

Clustering is a data mining problem that has received significant attention from the database community. Data set size, dimensionality, and sparsity have all been identified as factors that make clustering more difficult. Clustering in high-dimensional spaces is a hard problem that recurs in many domains, for example in image analysis. A high-dimensional data space has many dimensions, its data points are sparsely distributed, and their density is nearly uniform, so discovering clusters in it is difficult. For distance-based methods, the bottleneck in clustering high-dimensional data sets is computing the distances between data points, and this cost grows with the number of dimensions. Current research concentrates mainly on density methods built on grid-based and feature-based approaches, and it usually emphasizes improving the performance of the clustering process itself, including locating cluster centers accurately and removing noise. Instead of computing distances, CAHD (clustering algorithm of high-dimensional data) first uses a bidirectional search strategy, proceeding bottom-up and top-down at the same time, to find the dense units in the specified n-dimensional space or its subspaces, and then clusters these dense units using bitwise AND. The bidirectional search reduces the search space and thus improves efficiency, and because clustering the dense units needs only two kinds of machine instruction, bitwise AND and bit shift, the algorithm becomes more efficient still. CAHD can therefore handle the clustering of high-dimensional data effectively. Experiments on data sets indicate that the algorithm is effective.
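The abstract does not say how dense units are encoded, so the following is only a minimal sketch of how bitwise AND and bit shift can group neighbouring dense units. It assumes a two-dimensional grid in which each row is an integer bitmask (bit j set means the unit in column j is dense) and merges neighbours with a small union-find. The function name `cluster_dense_units`, the row-bitmask layout, and the union-find step are illustrative assumptions, not the authors' CAHD implementation.

```python
def cluster_dense_units(rows, width):
    """Group adjacent dense units of a 2-D grid into clusters.

    Assumed encoding (not taken from the paper): rows[i] is an int
    bitmask, and bit j being set means unit (i, j) is dense.
    """
    full = (1 << width) - 1
    rows = [mask & full for mask in rows]  # ignore bits outside the grid

    # Every dense unit starts as its own cluster (union-find forest).
    parent = {}
    for i, mask in enumerate(rows):
        for j in range(width):
            if (mask >> j) & 1:
                parent[(i, j)] = (i, j)

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    def union(a, b):
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    for i, mask in enumerate(rows):
        # Shift + AND: bit j survives only if units j and j+1 are both dense.
        horiz = mask & (mask >> 1)
        for j in range(width):
            if (horiz >> j) & 1:
                union((i, j), (i, j + 1))
        # AND of consecutive rows: bit j survives only if both rows are dense at j.
        if i + 1 < len(rows):
            vert = mask & rows[i + 1]
            for j in range(width):
                if (vert >> j) & 1:
                    union((i, j), (i + 1, j))

    clusters = {}
    for cell in parent:
        clusters.setdefault(find(cell), []).append(cell)
    return sorted(sorted(cells) for cells in clusters.values())


if __name__ == "__main__":
    grid = [0b0011, 0b0010, 0b1000]  # three rows, four columns
    for cluster in cluster_dense_units(grid, 4):
        print(cluster)
```

In this sketch the adjacency tests are the only place where unit contents are compared, and each test is a single AND (preceded by a shift for neighbours in the same row), which is the kind of saving over distance computation that the abstract describes. Extending the idea to n dimensions would need an encoding the abstract does not give.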
