Abstract:
Clustering is a data mining problem that has received significant attention from the database community. Data set size, dimensionality, and sparsity have been identified as factors that make clustering more difficult. Clustering in high-dimensional spaces is a hard problem that recurs in many domains, for example in image analysis. In a high-dimensional space the data points are sparsely distributed and the average density is low, so discovering the clusters in the data is quite difficult. The bottleneck of distance-based methods for clustering high-dimensional data sets is calculating the distances between data points. Current research concentrates mainly on grid-based density methods and feature-based methods, and usually aims to improve the performance of the clustering process, including accurately locating cluster centers and removing noise. The algorithm CAHD (clustering algorithm high-dimensional data) is proposed for high-dimensional data sets. Instead of calculating distances, CAHD searches for dense units in the n-dimensional space and its subspaces in the bottom-up and top-down directions simultaneously, and then clusters these dense units using bitwise AND. The search strategy reduces the search space, and because clustering uses only bitwise and bit-shift machine instructions, the algorithm becomes more efficient. Experiments on the data set indicate that the algorithm is effective.
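
The abstract does not specify CAHD's data structures, so the following is only a minimal sketch of the general idea of replacing distance calculations with bitwise AND, not the authors' implementation. It assumes a uniform grid over each dimension and represents the points falling in each one-dimensional interval as an integer bitmask; a unit in a subspace is then the AND of its per-dimension masks, and its density is the number of set bits. The names `interval_masks` and `dense_units`, the parameters `bins` and `threshold`, and the value range [0, 1] are all hypothetical.

```python
from itertools import product


def interval_masks(points, dim, bins, lo=0.0, hi=1.0):
    """Return one bitmask per interval of dimension `dim`.

    Bit i of masks[b] is set when point i falls in interval b.
    Assumes values lie in [lo, hi] (an assumption of this sketch).
    """
    masks = [0] * bins
    width = (hi - lo) / bins
    for i, p in enumerate(points):
        b = min(int((p[dim] - lo) / width), bins - 1)
        masks[b] |= 1 << i  # bit shift marks point i as a member
    return masks


def dense_units(points, dims, bins, threshold):
    """Return {cell: mask} for grid cells of the subspace `dims`
    (a non-empty tuple of dimension indices) that contain at least
    `threshold` points. No distances are computed."""
    per_dim = [interval_masks(points, d, bins) for d in dims]
    result = {}
    for cell in product(range(bins), repeat=len(dims)):
        mask = per_dim[0][cell[0]]
        for axis in range(1, len(cell)):
            mask &= per_dim[axis][cell[axis]]  # intersect point sets
        if bin(mask).count("1") >= threshold:  # population count
            result[cell] = mask
    return result


# Hypothetical toy data: six points in two dimensions.
points = [(0.1, 0.1), (0.15, 0.2), (0.2, 0.05),
          (0.8, 0.9), (0.85, 0.8), (0.9, 0.1)]
units_2d = dense_units(points, dims=(0, 1), bins=2, threshold=2)
```

In this toy run, `units_2d` keeps the cells `(0, 0)` and `(1, 1)`; the cell `(1, 0)` holds a single point and is dropped as sparse. A full search strategy would also prune candidate units using the dense units already found in lower- or higher-dimensional subspaces, which this sketch omits.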