
    Optimizing Duplicate-Elimination Queries in Large-Scale Data-Intensive Systems

    Duplication Elimination in Large Scale Data Intensive Systems

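    • Summary: To address the challenges of duplicate-elimination queries in large-scale data-intensive systems built on a shared-nothing architecture, this work proposes an effective data distribution strategy and a parallel processing method that optimize duplicate elimination over related and unrelated attributes, respectively: an adaptive data distribution strategy that combines hashing with histograms, and an asynchronous parallel query middleware. The former keeps data balanced as it is written and automatically adjusts the data distribution when skew occurs. The latter fully exploits the coarse-grained pipeline parallelism in duplicate-elimination query processing, eliminates the overhead of synchronous waiting across nodes, and returns results as early as possible. Tests on the production system DBroker show that the data distribution strategy greatly improves duplicate-elimination query performance on related attributes, while the asynchronous parallel query engine fully exploits parallelism and significantly improves duplicate-elimination query performance on unrelated attributes.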

       

      Abstract: As emerging data-intensive applications attract growing attention from researchers, duplication elimination over large volumes of data in a shared-nothing environment has become a severe challenge. The authors propose an effective and adaptive data placement method that combines hash partitioning with histograms, together with the design of an asynchronous parallel query engine (APQE) for duplication elimination. Hash partitioning divides data into mutually unrelated subsets to reduce data migration during duplication elimination, while the histogram method keeps data sizes balanced across nodes. Furthermore, the adaptive approach rebalances data sizes when data skew occurs. The parallel query engine maximizes the degree of pipeline parallelism for large-scale data processing by employing coarse-grained pipelining, and the asynchronous method further eliminates the synchronization overhead of multi-node parallelism. APQE starts merging data as soon as some of the database nodes return intermediate results, returns part of the final result as soon as the slowest node has returned the relevant data, and then frees the memory space. Experimental results on DBroker, a large-scale production system, demonstrate that the combined data placement strategy and adaptive method work well for duplication elimination on related attributes, and that the asynchronous parallel query engine greatly improves the performance of duplication elimination over large volumes of data in a cluster environment.
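
      The reason hash partitioning reduces data migration can be illustrated with a minimal sketch. It assumes that the attribute being deduplicated is also the partitioning key, so equal values always land on the same node and each node can deduplicate locally. The function names, the choice of CRC32 as the hash, and the sample rows are hypothetical and are not taken from DBroker or APQE; the histogram-based balancing and skew adjustment are not modeled here.

```python
import zlib
from collections import defaultdict


def node_for(key: str, num_nodes: int) -> int:
    """Map a key to a node with a stable hash (CRC32 is an illustrative choice)."""
    return zlib.crc32(key.encode("utf-8")) % num_nodes


def partition(rows, num_nodes):
    """Place each (key, payload) row on the node selected by its key."""
    nodes = defaultdict(list)
    for key, payload in rows:
        nodes[node_for(key, num_nodes)].append((key, payload))
    return nodes


def local_distinct(nodes):
    """Deduplicate keys independently on every node, with no cross-node traffic."""
    return {node: {key for key, _ in rows} for node, rows in nodes.items()}


# Hypothetical sample data: keys "a" and "b" each appear twice.
rows = [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("b", 5)]
per_node = local_distinct(partition(rows, num_nodes=3))

# Equal keys share a node, so the per-node sets are disjoint and the
# global answer is simply their concatenation.
merged = sorted(key for keys in per_node.values() for key in keys)
```

      Because no key can appear on two nodes, the coordinator only concatenates the per-node results. Deduplicating on an attribute unrelated to the partitioning key lacks this property and needs a cross-node merge, which is the case the asynchronous engine targets.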

       
