高级检索
    何洪辉 王丽珍 周丽华. pgi-distance:一种高效的并行KNN-join处理方法[J]. 计算机研究与发展, 2007, 44(10): 1774-1781.
    引用本文: 何洪辉 王丽珍 周丽华. pgi-distance:一种高效的并行KNN-join处理方法[J]. 计算机研究与发展, 2007, 44(10): 1774-1781.
    He Honghui, Wang Lizhen, and Zhou Lihua. pgi-distance: An Efficient Method Supporting Parallel KNN-join Process[J]. Journal of Computer Research and Development, 2007, 44(10): 1774-1781.
    Citation: He Honghui, Wang Lizhen, and Zhou Lihua. pgi-distance: An Efficient Method Supporting Parallel KNN-join Process[J]. Journal of Computer Research and Development, 2007, 44(10): 1774-1781.

    pgi-distance:一种高效的并行KNN-join处理方法

    pgi-distance: An Efficient Method Supporting Parallel KNN-join Process

    • 摘要: KNN-join是一种新近才提出的操作,它在数据挖掘中有着广泛的应用.利用KNN-join的“一次一个集合”的性质,一些数据挖掘任务,例如分类、例外挖掘和聚类等,就会更加容易地进行. MuX和Goreder则是两种专为KNN-join设计的算法.为了综合利用这两种方法的优点,一种新的KNN-join并行处理方法——pgi-distance(parallel grid index-distance)——被提了出来. pgi-distance使用双层结构,可以对I/O和CPU进行同时优化;基于距离的索引能够让它更好地适应数据维度和分布的变化.由于采用的是各DBMS厂商广泛支持的B\++树索引,这让pgi-distance得以成为一种更为实用的KNN-join处理方法.在合成数据集和真实数据集上的测试也表明pgi-distance是实用的和高效的.

       

      Abstract: The KNN-join has extensive applications where high-dimensional data is involved, because it is powerful and can provide more meaningful results than range-join. With its set-a-time nature, the KNN-join is able to facilitate many data mining tasks such as classification, outlier detection and clustering. MuX and Goreder are two algorithms specially designed for KNN-join. Since the MuX makes use of R-tree to reduce CPU cost, it suffers like all of the algorithms based on R-tree. In order to optimize both I/O and CPU cost, the Goreder applies a grid on data space. With the difference of data distribution, however, it is difficult to determine the cell's appropriate granularity. In the consequence, the Goreder tends to perform poorly. In this paper, a novel parallel method—pgi-distance (parallel grid index-distance)—is proposed, which combines the advantages of MuX and Goreder. On the one hand, the pgi-distance utilizes a two-tier structure to optimize I/O and CPU time separately. On the other hand, the index based on distance makes the pgi-distance adapted well to data dimensionality and data distribution. Especially the index in the pgi-distance is B\++-tree, which is supported by all commercial DBMS. So, this method becomes more applicable than that based on R-tree. Extensive experiments on both synthetic datasets and real datasets are conducted, and the results illustrate that pgi-distance is efficient.

       

    /

    返回文章
    返回