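  • China's Outstanding Sci-Tech Journal
  • CCF-Recommended Class A Chinese Journal
  • T1-Class High-Quality Sci-Tech Journal in the Computing Field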
Song Huaiming, An Mingyuan, Wang Yang, Yuan Chunyang, Sun Ninghui. Duplication Elimination in Large Scale Data Intensive Systems[J]. Journal of Computer Research and Development, 2010, 47(4): 581-588.

Duplication Elimination in Large Scale Data Intensive Systems

More Information
  • Published Date: April 14, 2010
  • Abstract: As emerging data-intensive applications attract growing attention from researchers, duplication elimination over large data volumes in a shared-nothing environment has become a severe challenge. The authors propose an effective, adaptive data placement method that combines hash partitioning with histograms, together with the design of an asynchronous parallel query engine (APQE) for duplication elimination. Hash partitioning divides the data into mutually unrelated subsets to reduce data migration during duplication elimination, while the histogram method keeps data sizes balanced across nodes. In addition, the adaptive approach rebalances data sizes when data skew occurs. The parallel query engine exploits the maximum degree of pipeline parallelism for large-scale data processing by employing coarse-grained pipelining, and its asynchronous method further removes the synchronization overhead of multi-node parallelism. APQE starts merging data as soon as some database nodes return intermediate results; it returns part of the final result as soon as the slowest node has returned the relevant data, and then frees the corresponding memory. Experimental results on DBroker, a large-scale production system, show that the combined data placement strategy and the adaptive method work well for duplication elimination on related attributes, and that the asynchronous parallel query engine greatly improves the performance of duplication elimination over large data volumes in a cluster environment.
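The placement and merging ideas summarized in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation and does not reproduce DBroker or APQE. The bucket count, the CRC32 hash, the greedy largest-bucket-first assignment, and all function names are assumptions made for illustration. The sketch shows only the core idea: equal keys hash to the same bucket, a histogram of bucket sizes guides a balanced assignment of buckets to nodes, and per-node distinct sets are therefore disjoint and can be merged in whatever order the nodes finish.

```python
import zlib
from collections import Counter
from concurrent.futures import ThreadPoolExecutor, as_completed

NUM_BUCKETS = 16  # hypothetical: more buckets than nodes, so buckets can be moved


def bucket_of(key: str) -> int:
    """Stable hash: equal keys always map to the same bucket."""
    return zlib.crc32(key.encode("utf-8")) % NUM_BUCKETS


def build_histogram(keys):
    """Count how many rows fall into each hash bucket."""
    counts = Counter(bucket_of(k) for k in keys)
    return [counts.get(b, 0) for b in range(NUM_BUCKETS)]


def assign_buckets(histogram, num_nodes):
    """Greedy balancing (an assumption, not the paper's algorithm):
    give the largest remaining bucket to the least-loaded node."""
    load = [0] * num_nodes
    owner = {}
    for b in sorted(range(len(histogram)), key=lambda b: -histogram[b]):
        node = min(range(num_nodes), key=lambda n: load[n])
        owner[b] = node
        load[node] += histogram[b]
    return owner, load


def place(keys, owner, num_nodes):
    """Route every row to the node that owns its bucket."""
    nodes = [[] for _ in range(num_nodes)]
    for k in keys:
        nodes[owner[bucket_of(k)]].append(k)
    return nodes


def local_distinct(rows):
    """Duplicate elimination on a single node."""
    return set(rows)


def distinct_async(nodes):
    """Merge each node's result as soon as it arrives, without a barrier.
    Per-node results are disjoint, so no cross-node comparison is needed."""
    result = []
    with ThreadPoolExecutor(max_workers=len(nodes)) as pool:
        futures = [pool.submit(local_distinct, rows) for rows in nodes]
        for fut in as_completed(futures):
            result.extend(fut.result())
    return result


if __name__ == "__main__":
    keys = [f"user{i % 50}" for i in range(1000)]  # synthetic data with duplicates
    owner, load = assign_buckets(build_histogram(keys), num_nodes=4)
    nodes = place(keys, owner, num_nodes=4)
    print("rows per node:", load)
    print("distinct keys:", len(distinct_async(nodes)))
```

Rerunning `build_histogram` and `assign_buckets` after the data distribution shifts would be one simple way to mimic the adaptive rebalancing mentioned in the abstract; how the paper actually detects and corrects skew is not specified on this page.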
