A Survey of Coding Techniques for Improving the Performance of Large-Scale Distributed Machine Learning

王艳; 李念爽; 王希龄; 钟凤艳

doi:10.7544/issn1000-1239.2020.20190286

(School of Software, East China Jiaotong University, Nanchang 330013) (wangyann@189.cn)

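Funding: This work was supported by the National Natural Science Foundation of China (61402172) and the Natural Science Foundation of Jiangxi Province of China (20192BAB217006).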

CLC number: TP399
Publication history
- Published: 2020-02-29

Coding-Based Performance Improvement of Distributed Machine Learning in Large-Scale Clusters

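Abstract

Abstract: Distributed computing systems have attracted wide attention in recent years because they provide large-scale computing power for big-data analysis. In such systems, some compute nodes slow down in a random fashion owing to various factors, which lengthens the execution time of machine learning algorithms running on the cluster; these nodes are called stragglers. This paper reviews research progress on using coding techniques to address these problems and to improve the performance of large-scale machine learning clusters. It first introduces the background of coding techniques and large-scale machine learning clusters. It then groups the related work by application scenario into matrix multiplication, gradient computation, data shuffling, and some other applications, and presents and analyzes each in turn. Finally, it summarizes the difficulties facing the relevant coding techniques and discusses future research trends.
- coding technology
- machine learning
- distributed computing
- straggler tolerance
- performance optimization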
Abstract: As models and data sets grow, running large-scale machine learning algorithms on distributed clusters has become common practice. This approach divides the machine learning algorithm and its training data into several tasks, each of which runs on a different worker node; the master node then combines the results of all tasks to obtain the result of the whole algorithm. When a distributed cluster contains a large number of nodes, some worker nodes, called stragglers, inevitably run slower than the others because of resource contention and other factors, so tasks on these nodes take significantly longer than tasks on other nodes. Compared with running replica tasks on multiple nodes, coded computing uses computation and storage redundancy more efficiently to alleviate the effect of stragglers and communication bottlenecks in large-scale machine learning clusters. This paper reviews research progress on using coding technology to solve the straggler problem and improve the performance of large-scale machine learning clusters. First, we introduce the background of coding technology and large-scale machine learning clusters. Second, we divide the related research into categories by application scenario: matrix multiplication, gradient computation, data shuffling, and some other applications. Finally, we summarize the difficulties of applying coding technology in large-scale machine learning clusters and discuss future research trends.
- coding technology
- machine learning
- distributed computing
- straggler tolerance
- performance improvement
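
To make the contrast between replica tasks and coded computing concrete, the following is a minimal sketch of coded matrix-vector multiplication, the first application scenario named in the abstract. It is an illustration under stated assumptions, not a scheme taken from the paper: the three-worker layout, the single parity block `A1 + A2`, and the names `matvec`, `encode`, and `decode` are all hypothetical choices made for this example.

```python
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def matvec(block: Matrix, x: Vector) -> Vector:
    """Work done by one worker: multiply its stored block by x."""
    return [sum(a * b for a, b in zip(row, x)) for row in block]


def encode(A: Matrix) -> List[Matrix]:
    """Split A row-wise into halves A1, A2 and add a parity block A1 + A2.

    Assumed layout: worker 0 stores A1, worker 1 stores A2,
    worker 2 stores the parity block.
    """
    if len(A) % 2 != 0:
        raise ValueError("this sketch assumes an even number of rows")
    half = len(A) // 2
    A1, A2 = A[:half], A[half:]
    parity = [[p + q for p, q in zip(r1, r2)] for r1, r2 in zip(A1, A2)]
    return [A1, A2, parity]


def decode(results: Dict[int, Vector]) -> Vector:
    """Recover A @ x from the outputs of any two of the three workers.

    `results` maps a worker index (0, 1, or 2) to that worker's output.
    """
    if len(results) < 2:
        raise ValueError("need results from at least two workers")
    if 0 in results and 1 in results:
        y1, y2 = results[0], results[1]
    elif 0 in results:
        # Worker 1 straggled: (A1 + A2) x - A1 x = A2 x.
        y1 = results[0]
        y2 = [p - a for p, a in zip(results[2], y1)]
    else:
        # Worker 0 straggled: (A1 + A2) x - A2 x = A1 x.
        y2 = results[1]
        y1 = [p - b for p, b in zip(results[2], y2)]
    return y1 + y2


A = [[1, 2], [3, 4], [5, 6], [7, 8]]
x = [2, 1]
blocks = encode(A)

# Pretend worker 1 is a straggler: only workers 0 and 2 report back.
results = {i: matvec(blocks[i], x) for i in (0, 2)}
recovered = decode(results)
print(recovered)  # [4, 10, 16, 22]
```

In this toy layout the master can finish as soon as any two of the three workers respond. Tolerating one arbitrary straggler by plain replication of the same two halves would need each half stored on two workers, that is, four workers instead of three.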
