Abstract:
DBSCAN (density-based spatial clustering of applications with noise) is one of the most widely used and studied density clustering algorithms for its simplicity and easy implementation. However, the high time complexity (
O(
n2)) yields large-scale data that it is unable to deal with, due to that DBSCAN has great number of redundant distance computations in the process of calculating density. Therefore, accelerating it, which aims to make it suitable for big data environment, has become a research hotspot, and much fruitful work has emerged. From the perspective of acceleration goals, these efforts can be broadly divided into two categories: reducing redundant computations and parallelization. In terms of specific acceleration means, they can be divided into six main categories: distributed technique, sampling, approximation, fast neighbor, space division and GPU acceleration. According to this classification, the existing work is thoroughly combed and cross compared. It is found that the fusion acceleration algorithms of multiple technologies are better than those that only use single acceleration technology; approximate fuzziness, parallelism and distribution are the most effective methods to accelerate DBSCAN at present; high-dimensional data are still difficult to deal with. In addition, the applications of fast DBSCAN in many fields are tracked and reported. Finally, the future direction of rapid DBSCAN is prospected.