An Efficient Data Partitioning Method in Distributed Heterogeneous Bandwidth Environment
Abstract: Distributed big data processing frameworks transmit large volumes of data over the network while a job runs, so the time spent moving data between nodes has become one of the main costs of job execution. When nodes have heterogeneous bandwidth, traditional data partitioning methods such as hash partitioning or range partitioning are inefficient because of bandwidth bottleneck nodes. Data partitioning is necessary for big data processing, and an inefficient partitioning method significantly increases job running time. We therefore propose a model of data transmission between nodes that aims to reduce transfer time in distributed networks with heterogeneous bandwidth. Based on each node's uplink and downlink bandwidth and its initial data size, the model calculates the optimal data distribution ratio for each node so as to minimize data transfer time. Building on this model, we design a bandwidth-based data partitioning method under which each node distributes data according to the optimal ratio. We implement the method in the Apache Flink framework and evaluate it experimentally. The results show that, under heterogeneous bandwidth conditions, the bandwidth-based data partitioning method effectively reduces the time required for data partitioning.
Keywords:
- data partitioning
- Apache Flink
- load balancing
- heterogeneous bandwidth
- distributed system
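To make the idea of a bandwidth-aware distribution ratio concrete, the following is a minimal sketch under a simplified model of our own, not the model defined in the paper. It assumes that every node splits its data using the same ratio vector, that the share a node assigns to itself stays local, that links are full duplex, and that the transfer finishes when the slowest send or receive finishes. The function names (`transfer_time`, `optimal_ratios`) and the example numbers are hypothetical.

```python
EPS = 1e-12  # tolerance for floating-point comparisons


def transfer_time(ratios, data, up, down):
    """Completion time of a shuffle in which node j receives share ratios[j]
    of every other node's data (assumed model, not the paper's)."""
    total = sum(data)
    send = [d * (1 - p) / u for p, d, u in zip(ratios, data, up)]
    recv = [p * (total - d) / v for p, d, v in zip(ratios, data, down)]
    return max(send + recv)


def _bounds(t, data, up, down):
    """Per-node lower and upper limits on the ratio so that sending and
    receiving both finish within time t."""
    total = sum(data)
    lo = [max(0.0, 1 - t * u / d) if d > 0 else 0.0 for d, u in zip(data, up)]
    hi = [min(1.0, t * v / (total - d)) if total > d else 1.0
          for d, v in zip(data, down)]
    return lo, hi


def _feasible(t, data, up, down):
    lo, hi = _bounds(t, data, up, down)
    return (all(l <= h + EPS for l, h in zip(lo, hi))
            and sum(lo) <= 1 + EPS and sum(hi) >= 1 - EPS)


def optimal_ratios(data, up, down, iters=100):
    """Binary-search the smallest feasible completion time, then pick
    ratios inside the resulting bounds that sum to 1."""
    n = len(data)
    t_lo, t_hi = 0.0, transfer_time([1 / n] * n, data, up, down)
    for _ in range(iters):
        mid = (t_lo + t_hi) / 2
        if _feasible(mid, data, up, down):
            t_hi = mid
        else:
            t_lo = mid
    lo, hi = _bounds(t_hi, data, up, down)
    slack = sum(hi) - sum(lo)
    w = (1 - sum(lo)) / slack if slack > 0 else 0.0
    return [l + w * (h - l) for l, h in zip(lo, hi)]


# Hypothetical cluster: two nodes with 100 MB each, 10 MB/s uplinks,
# and one node whose downlink is only 1 MB/s.
data, up, down = [100.0, 100.0], [10.0, 10.0], [10.0, 1.0]
uniform = [0.5, 0.5]
ratios = optimal_ratios(data, up, down)
print(transfer_time(uniform, data, up, down))
print(transfer_time(ratios, data, up, down))
```

In this toy setting, an even split is limited by the node with the slow downlink, while the bandwidth-aware ratios route most of the data to the node that can receive it quickly. A real partitioner would also have to keep records with the same key on the same node, which this sketch does not address.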