Abstract:
The KNN-join has extensive applications where high-dimensional data is involved, because it is powerful and can provide more meaningful results than range-join. With its set-a-time nature, the KNN-join is able to facilitate many data mining tasks such as classification, outlier detection and clustering. MuX and Goreder are two algorithms specially designed for KNN-join. Since the MuX makes use of R-tree to reduce CPU cost, it suffers like all of the algorithms based on R-tree. In order to optimize both I/O and CPU cost, the Goreder applies a grid on data space. With the difference of data distribution, however, it is difficult to determine the cell's appropriate granularity. In the consequence, the Goreder tends to perform poorly. In this paper, a novel parallel method—pgi-distance (parallel grid index-distance)—is proposed, which combines the advantages of MuX and Goreder. On the one hand, the pgi-distance utilizes a two-tier structure to optimize I/O and CPU time separately. On the other hand, the index based on distance makes the pgi-distance adapted well to data dimensionality and data distribution. Especially the index in the pgi-distance is B\++-tree, which is supported by all commercial DBMS. So, this method becomes more applicable than that based on R-tree. Extensive experiments on both synthetic datasets and real datasets are conducted, and the results illustrate that pgi-distance is efficient.