ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2020, Vol. 57 ›› Issue (12): 2683-2693. doi: 10.7544/issn1000-1239.2020.20190683


An Efficient Data Partitioning Method in Distributed Heterogeneous Bandwidth Environment

Ma Qingyun1, Ji Hangxu1, Zhao Yuhai1, Mao Keming2, Wang Guoren3   

  1(School of Computer Science and Engineering, Northeastern University, Shenyang 110169); 2(Software College, Northeastern University, Shenyang 110169); 3(School of Computer Science and Technology, Beijing Institute of Technology, Beijing 100081)
  • Online:2020-12-01
  • Supported by: 
    This work was supported by the National Key Research and Development Program of China (2018YFB1004402) and the National Natural Science Foundation of China (61772124).

Abstract: Distributed big data processing frameworks transmit large quantities of data over the network while a job runs, so the time spent transferring data between nodes becomes one of the main costs of a job. Data partitioning is a necessary step in big data processing, and an inefficient partitioning method significantly increases job running time. When node bandwidth is heterogeneous, traditional methods such as hash partitioning or range partitioning are inefficient because nodes with low bandwidth become bottlenecks. We therefore propose a model of data transmission between nodes that reduces transfer time in distributed networks with heterogeneous bandwidth. Given each node's uplink bandwidth, downlink bandwidth, and initial data size, the model calculates the node's optimal data distribution ratio, that is, the ratio that minimizes data transfer time. Based on this model, we design a bandwidth-based data partitioning method that lets each node allocate data according to its optimal distribution ratio. We implement the method in the Apache Flink framework. Extensive experimental results show that it effectively reduces the time consumption of data partitioning under heterogeneous bandwidth conditions.
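
The abstract does not give the model's equations, so the following is only an illustrative sketch of the idea of bandwidth-aware distribution ratios, not the authors' formulation. It assumes that node i initially holds size[i] units of data, that every node forwards a fraction r[j] of its data to node j and keeps the fraction r[i] locally, and that each node sends and receives concurrently, so its time is the larger of its send time and its receive time. All function and parameter names (transfer_time, ratio_bounds, optimal_ratios, up, down, size) are hypothetical. Under these assumptions the smallest achievable transfer time can be found by binary search, since a larger time budget only widens the admissible range of each ratio.

```python
from typing import List, Optional, Tuple


def transfer_time(ratios: List[float], up: List[float],
                  down: List[float], size: List[float]) -> float:
    """Slowest node's time under the assumed model (send and receive overlap)."""
    total = sum(size)
    return max(
        max(s * (1 - r) / u, r * (total - s) / d)
        for r, u, d, s in zip(ratios, up, down, size)
    )


def ratio_bounds(t: float, up: List[float], down: List[float],
                 size: List[float]) -> Optional[List[Tuple[float, float]]]:
    """Per-node (min, max) ratio that fits time budget t, or None if infeasible."""
    total = sum(size)
    bounds = []
    for u, d, s in zip(up, down, size):
        # Sending s*(1-r) over the uplink within t requires r >= 1 - t*u/s.
        lo = max(0.0, 1.0 - t * u / s) if s > 0 else 0.0
        # Receiving r*(total-s) over the downlink within t requires r <= t*d/(total-s).
        hi = min(1.0, t * d / (total - s)) if total > s else 1.0
        if lo > hi:
            return None
        bounds.append((lo, hi))
    # The ratios must sum to 1, so 1 has to lie between the two sums.
    if sum(b[0] for b in bounds) > 1.0 or sum(b[1] for b in bounds) < 1.0:
        return None
    return bounds


def optimal_ratios(up: List[float], down: List[float],
                   size: List[float], iterations: int = 100) -> List[float]:
    """Binary-search the smallest feasible time, then pick ratios inside the bounds.

    Assumes at least two nodes, positive bandwidths, and a nonzero total size.
    """
    n = len(size)
    # Twice the time of the uniform split is always a feasible starting budget.
    t_low, t_high = 0.0, 2.0 * transfer_time([1.0 / n] * n, up, down, size)
    for _ in range(iterations):
        mid = (t_low + t_high) / 2
        if ratio_bounds(mid, up, down, size) is None:
            t_low = mid
        else:
            t_high = mid
    bounds = ratio_bounds(t_high, up, down, size)
    sum_lo = sum(lo for lo, _ in bounds)
    slack = sum(hi - lo for lo, hi in bounds)
    # Spread the remaining mass over each node's slack so the ratios sum to 1.
    share = (1.0 - sum_lo) / slack if slack > 0 else 0.0
    return [lo + share * (hi - lo) for lo, hi in bounds]
```

For example, with two nodes holding 50 units each, uplinks of 10 on both, and downlinks of 10 and 1, calling optimal_ratios([10, 10], [10, 1], [50, 50]) assigns most of the data to the node with the fast downlink, whereas an even split would leave the slow downlink as the bottleneck.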

Key words: data partitioning, Apache Flink, load balancing, heterogeneous bandwidth, distributed system

CLC Number: