Abstract:
The similarity join, an important data mining primitive, can be successfully applied to speeding up applications such as similarity search, data analysis and data mining. So far most of researches focus on the execution of high-dimensional joins over large amounts of disk based data. The increasing sizes of main memory available on current computers, and the need for efficient processing of spatial joins suggest that spatial joins for a large class of problems can be processed in main memory. Δ-tree is a novel multi-level index structure, it can speed up the high-dimensional query in main memory environment and has been proven to be an efficient index method. Each level in the Δ-tree represents the data space at different dimensionalities: the number of dimensions increases towards the leaf level which contains the data at their full dimensions. The remaining dimensions are obtained using principal component analysis. Using the properties of Δ-tree, a similarity join algorithm on the basis of index structure Δ-tree, Δ-tree-join, is presented. The top-down scheme can use fewer number of dimensions, compute the distances and efficiently complete join processing. Experimental results indicate that Δ-tree-join outperforms the state-of-the-art algorithm, EGO-join, and EGO\+*-join by a wide margin, and is an efficient similarity join method.