An Efficient Data Partitioning Method in Distributed Heterogeneous Bandwidth Environment

Ma Qingyun, Ji Hangxu, Zhao Yuhai, Mao Keming, Wang Guoren

Citation: Ma Qingyun, Ji Hangxu, Zhao Yuhai, Mao Keming, Wang Guoren. An Efficient Data Partitioning Method in Distributed Heterogeneous Bandwidth Environment[J]. Journal of Computer Research and Development, 2020, 57(12): 2683-2693. DOI: 10.7544/issn1000-1239.2020.20190683

  • CLC number: TP311

An Efficient Data Partitioning Method in Distributed Heterogeneous Bandwidth Environment

Funds: This work was supported by the National Key Research and Development Program of China (2018YFB1004402) and the National Natural Science Foundation of China (61772124).
  • Abstract: When a job runs in a distributed big data processing framework, a large amount of data is transmitted over the network, and the time needed to move data between nodes has become one of the main costs of running the job. When nodes have heterogeneous bandwidth, traditional data partitioning methods such as hash partitioning or range partitioning are inefficient because of bandwidth bottleneck nodes. Data partitioning is necessary for big data processing, and an inefficient partitioning method can significantly increase job running time. To address this problem, we establish a data transmission model between nodes. With the goal of reducing data transmission time, the model calculates the optimal data distribution ratio for each node from the uplink and downlink bandwidth and the initial data size of each node. Based on this model, we design a bandwidth-based data partitioning method that makes each node allocate data according to the optimal data distribution ratio. Finally, we implement the method in the Apache Flink framework and verify it experimentally. The experimental results show that, under heterogeneous bandwidth conditions, the bandwidth-based data partitioning method effectively reduces the time required for data partitioning.
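    To make the idea of a bandwidth-aware distribution ratio concrete, the following is a minimal Python sketch. It is not the model from the paper, whose equations are not reproduced on this page. It assumes a simplified cost model: node i holds `data[i]` units, keeps its own share locally, uploads the rest at `up[i]` units per second, and downloads its share of every other node's data at `down[i]` units per second. The function names `feasible` and `optimal_ratios` and the bisection approach are illustrative assumptions.

```python
from typing import List, Optional, Tuple


def feasible(t: float, data: List[float], up: List[float],
             down: List[float]) -> Optional[List[Tuple[float, float]]]:
    """Return per-node (lo, hi) ratio bounds if all transfers fit in time t, else None.

    Assumed model (not taken from the paper): node i receives a fraction p_i of
    every node's data, so it uploads data[i] * (1 - p_i) and downloads
    p_i * (total - data[i]).
    """
    total = sum(data)
    bounds = []
    for d, u, w in zip(data, up, down):
        # Upload limit: d * (1 - p) / u <= t  =>  p >= 1 - t * u / d
        lo = max(0.0, 1.0 - t * u / d) if d > 0 else 0.0
        # Download limit: p * (total - d) / w <= t  =>  p <= t * w / (total - d)
        hi = min(1.0, t * w / (total - d)) if total > d else 1.0
        if lo > hi:
            return None
        bounds.append((lo, hi))
    # The ratios must sum to 1, so 1 has to lie between the summed bounds.
    if sum(lo for lo, _ in bounds) <= 1.0 <= sum(hi for _, hi in bounds):
        return bounds
    return None


def optimal_ratios(data: List[float], up: List[float], down: List[float],
                   iters: int = 60) -> Tuple[List[float], float]:
    """Bisect on the completion time t; assumes all bandwidths are positive."""
    total = sum(data)
    # At this t every node could ship all data alone, so it is always feasible.
    t_lo, t_hi = 0.0, max(total / min(up), total / min(down))
    for _ in range(iters):
        mid = (t_lo + t_hi) / 2
        if feasible(mid, data, up, down) is None:
            t_lo = mid
        else:
            t_hi = mid
    bounds = feasible(t_hi, data, up, down)
    slack = 1.0 - sum(lo for lo, _ in bounds)
    room = sum(hi - lo for lo, hi in bounds)
    # Start from the lower bounds and spread the remaining mass proportionally.
    ratios = [lo + (slack * (hi - lo) / room if room > 0 else 0.0)
              for lo, hi in bounds]
    return ratios, t_hi


if __name__ == "__main__":
    # Four nodes with equal data; the last node has a slow downlink.
    ratios, t = optimal_ratios([100, 100, 100, 100], [10, 10, 10, 10], [10, 10, 10, 1])
    print(ratios, t)
```

    Under this assumed model, a node with a slow downlink receives a smaller ratio than its peers, which is the behavior the abstract attributes to bandwidth-based partitioning. Equal-ratio hash partitioning would instead be limited by the bottleneck node.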
Publication history
  • Published: 2020-11-30
