A Multi-Way Spatial Join Query Processing Algorithm Based on Spark

乔百友; 朱俊海; 郑宇杰; 申木川; 王国仁

doi:10.7544/issn1000-1239.2017.20160558

Abstract

To address spatial join query processing in cloud environments, this paper proposes BSMWSJ, a multi-way spatial join query processing algorithm based on Spark. The algorithm uses grid partitioning to divide the entire data space into equal-sized grid cells and assigns the spatial objects of each data set to the cells that correspond to their spatial locations; the objects in different cells are then joined in parallel. During multi-way join processing, a boundary filtering method removes useless data: the MBRs (minimum bounding rectangles) of the candidate results produced by the preceding join operations are computed and used to filter the data sets of the subsequent joins. This discards objects that cannot contribute to the join and reduces redundant projection and replication of spatial objects. A duplicate avoidance strategy additionally reduces the output of duplicate results, which further lowers the cost of subsequent join computation. Extensive experiments on synthetic and real data sets show that BSMWSJ clearly outperforms existing multi-way spatial join query processing algorithms.
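The three ideas in the abstract (grid partitioning, boundary filtering with MBRs, and duplicate avoidance) can be illustrated with a small single-machine sketch. This is not the BSMWSJ implementation. It uses plain Python rather than Spark, all function names are hypothetical, and it makes several assumptions: objects are represented only by their MBRs `(xmin, ymin, xmax, ymax)`, the join is a chain (R with S, then S with T), duplicate avoidance uses the common reference-point technique (the abstract does not say which strategy the paper uses), and the filtering bound is computed once globally rather than per partition.

```python
from collections import defaultdict
from itertools import product

def intersects(a, b):
    """True if two MBRs (xmin, ymin, xmax, ymax) overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def cells_of(mbr, cell):
    """All grid cells (col, row) of side length `cell` that the MBR overlaps."""
    x0, y0 = int(mbr[0] // cell), int(mbr[1] // cell)
    x1, y1 = int(mbr[2] // cell), int(mbr[3] // cell)
    return product(range(x0, x1 + 1), range(y0, y1 + 1))

def partition(objs, cell):
    """Replicate each (id, mbr) object into every grid cell it overlaps."""
    grid = defaultdict(list)
    for oid, mbr in objs:
        for c in cells_of(mbr, cell):
            grid[c].append((oid, mbr))
    return grid

def grid_join(left, right, cell):
    """Cell-by-cell MBR intersection join that reports each pair once."""
    gl, gr = partition(left, cell), partition(right, cell)
    out = []
    for c in gl.keys() & gr.keys():      # cells could be processed in parallel
        for (a, ma), (b, mb) in product(gl[c], gr[c]):
            if not intersects(ma, mb):
                continue
            # Reference point: lower-left corner of the intersection.
            # Only the cell containing it emits the pair (assumed strategy).
            ref = (int(max(ma[0], mb[0]) // cell), int(max(ma[1], mb[1]) // cell))
            if ref == c:
                out.append(((a, ma), (b, mb)))
    return out

def bounding_mbr(mbrs):
    """Smallest rectangle enclosing all given MBRs."""
    x0, y0, x1, y1 = zip(*mbrs)
    return (min(x0), min(y0), max(x1), max(y1))

def three_way_join(R, S, T, cell):
    """Chain join R-S-T with boundary filtering applied to T."""
    rs = grid_join(R, S, cell)
    if not rs:
        return []
    # Boundary filter: T objects outside the MBR of the matched S objects
    # cannot join, so they are dropped before being replicated into cells.
    bound = bounding_mbr([ms for _, (_, ms) in rs])
    t_kept = [(t, mt) for t, mt in T if intersects(mt, bound)]
    s_matched = {s: ms for _, (s, ms) in rs}
    t_by_s = defaultdict(list)
    for (s, _), (t, _) in grid_join(list(s_matched.items()), t_kept, cell):
        t_by_s[s].append(t)
    return sorted((r, s, t) for (r, _), (s, _) in rs for t in t_by_s[s])

R = [("r1", (0, 0, 2, 2)), ("r2", (10, 10, 11, 11))]
S = [("s1", (1, 1, 6, 3)), ("s2", (20, 20, 21, 21))]
T = [("t1", (5, 2, 7, 4)), ("t2", (30, 30, 31, 31)), ("t3", (0, 0, 0.5, 0.5))]
print(three_way_join(R, S, T, cell=4))   # [('r1', 's1', 't1')]
```

In the example, `t2` and `t3` are removed by the boundary filter before the second join, so they are never replicated into grid cells. In a Spark setting the per-cell loop in `grid_join` would presumably correspond to a grouped or partitioned operation keyed by cell, but that mapping is an assumption and is not taken from the paper.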
