ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development, 2017, Vol. 54, Issue (7): 1592-1602. doi: 10.7544/issn1000-1239.2017.20160558

• Software Technology •

A Multi-Way Spatial Join Querying Processing Algorithm Based on Spark

Qiao Baiyou1,2, Zhu Junhai1, Zheng Yujie1, Shen Muchuan1, Wang Guoren1

  1 (School of Computer Science and Engineering, Northeastern University, Shenyang 110819); 2 (Department of Computer Science, Brigham Young University, Provo, Utah, USA 84602) (qiaobaiyou@mail.neu.edu.cn)
  • Published: 2017-07-01
  • Online: 2017-07-01
  • Supported by:
    National Natural Science Foundation of China (61073063, 61332006); Special Scientific Research Fund for the National Marine Public Welfare Industry (201105033)

Abstract: To address spatial join query processing in cloud environments, this paper proposes BSMWSJ, a multi-way spatial join query processing algorithm based on Spark. The algorithm uses grid partitioning to divide the whole data space into grid cells of equal size and assigns the spatial objects of each data set to these cells according to their spatial locations; the spatial objects in different grid cells are then joined in parallel. During multi-way spatial join processing, a boundary filtering method removes useless data: it computes the MBRs (minimum bounding rectangles) of the candidate results produced by the preceding join operations and uses these MBRs to filter the data sets of the subsequent joins. This filters out useless join objects and reduces the redundant projection and replication of spatial objects. In addition, a duplication avoidance strategy reduces the output of duplicate results, which further lowers the cost of the subsequent join computation. Extensive experiments on synthetic and real data sets show that the proposed algorithm clearly outperforms existing multi-way join query processing algorithms.
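The three ideas named in the abstract (equal-size grid partitioning, boundary filtering by MBR, and duplication avoidance) can be illustrated with a small single-machine sketch. It is not the BSMWSJ implementation and does not use Spark. All function names are hypothetical. The reference-point test is one commonly used duplicate-avoidance technique and is assumed here, since the abstract does not say which strategy the authors use. The filter uses a single global MBR for simplicity, and the three-way query is assumed to be a chain join (A with B, then B with C).

```python
from collections import defaultdict
from itertools import product

# Rectangles (MBRs) are tuples (xmin, ymin, xmax, ymax). All names are illustrative.

def intersects(a, b):
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]

def cells_of(rect, cell):
    """Return every grid cell (col, row) of side `cell` that `rect` overlaps."""
    cols = range(int(rect[0] // cell), int(rect[2] // cell) + 1)
    rows = range(int(rect[1] // cell), int(rect[3] // cell) + 1)
    return product(cols, rows)

def partition(rects, cell):
    """Assign each rectangle to all equal-size cells it overlaps (replication)."""
    grid = defaultdict(list)
    for r in rects:
        for c in cells_of(r, cell):
            grid[c].append(r)
    return grid

def grid_join(left, right, cell):
    """Pairwise intersection join; each cell is independent, so cells could run in parallel."""
    lgrid, rgrid = partition(left, cell), partition(right, cell)
    out = []
    for c, lrects in lgrid.items():
        for a, b in product(lrects, rgrid.get(c, [])):
            if intersects(a, b):
                # Assumed duplicate avoidance: report the pair only in the cell
                # that contains the lower-left corner of the intersection.
                ref = (max(a[0], b[0]), max(a[1], b[1]))
                if (int(ref[0] // cell), int(ref[1] // cell)) == c:
                    out.append((a, b))
    return out

def boundary_filter(candidates, rects):
    """Keep only rectangles that intersect the MBR of the candidate results."""
    if not candidates:
        return []
    flat = [r for pair in candidates for r in pair]
    mbr = (min(r[0] for r in flat), min(r[1] for r in flat),
           max(r[2] for r in flat), max(r[3] for r in flat))
    return [r for r in rects if intersects(mbr, r)]

def three_way_join(a_set, b_set, c_set, cell):
    """Chain join A-B-C: join A with B, filter C by the candidates' MBR, then match C against B."""
    ab = grid_join(a_set, b_set, cell)
    c_kept = boundary_filter(ab, c_set)
    # A nested loop is used for the last step only to keep the sketch short.
    return [(a, b, c) for a, b in ab for c in c_kept if intersects(b, c)]
```

In a Spark setting, the cell identifier would presumably serve as the key by which objects are grouped, so that each cell's local join runs as a separate task; that mapping is an assumption and is not spelled out in the abstract.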

Key words: cloud computing, Spark platform, multi-way spatial join query, boundary filtering, duplication avoidance

CLC Number: