高级检索

    面向大规模数据的DBSCAN加速算法综述

    Survey on DBSCAN Acceleration Algorithms for Large Scale Data

    • 摘要: DBSCAN(density-based spatial clustering of applications with noise)是应用最广的密度聚类算法之一. 然而,它时间复杂度过高(O(n2)),无法处理大规模数据. 因而,对它进行加速成为一个研究热点,众多富有成效的工作不断涌现. 从加速目标上看,这些工作大体上可分为减少冗余计算和并行化两大类;就具体加速手段而言,可分为6个主要类别:基于分布式、基于采样化、基于近似模糊、基于快速近邻、基于空间划分以及基于GPU加速技术. 根据该分类,对现有工作进行了深入梳理与交叉比较,发现采用多重技术的融合加速算法优于单一加速技术;近似模糊化、并行化与分布式是当前最有效的手段;高维数据仍然难以应对. 此外,对快速化DBSCAN算法在多个领域中的应用进行了跟踪报告. 最后,对本领域未来的方向进行了展望.

       

      Abstract: DBSCAN (density-based spatial clustering of applications with noise) is one of the most widely used and studied density clustering algorithms for its simplicity and easy implementation. However, the high time complexity (O(n2)) yields large-scale data that it is unable to deal with, due to that DBSCAN has great number of redundant distance computations in the process of calculating density. Therefore, accelerating it, which aims to make it suitable for big data environment, has become a research hotspot, and much fruitful work has emerged. From the perspective of acceleration goals, these efforts can be broadly divided into two categories: reducing redundant computations and parallelization. In terms of specific acceleration means, they can be divided into six main categories: distributed technique, sampling, approximation, fast neighbor, space division and GPU acceleration. According to this classification, the existing work is thoroughly combed and cross compared. It is found that the fusion acceleration algorithms of multiple technologies are better than those that only use single acceleration technology; approximate fuzziness, parallelism and distribution are the most effective methods to accelerate DBSCAN at present; high-dimensional data are still difficult to deal with. In addition, the applications of fast DBSCAN in many fields are tracked and reported. Finally, the future direction of rapid DBSCAN is prospected.

       

    /

    返回文章
    返回