Zhu Hongrui, Yuan Guojun, Yao Chengji, Tan Guangming, Wang Zhan, Hu Zhongzhe, Zhang Xiaoyang, An Xuejun. Survey on Network of Distributed Deep Learning Training[J]. Journal of Computer Research and Development, 2021, 58(1): 98-115. DOI: 10.7544/issn1000-1239.2021.20190881

Survey on Network of Distributed Deep Learning Training

Funds: This work was supported by the CAS Strategic Priority Program (B) (XDB24050200), the General Program of the National Natural Science Foundation of China (61972380, 61702484), and the Innovation Fund from the Institute of Computing Technology, Chinese Academy of Sciences (20166060).
  • Published Date: December 31, 2020

Abstract: In recent years, deep learning has achieved better results than traditional algorithms in many fields such as image processing, speech recognition, and natural language processing, and the demands on training speed and data-processing capacity keep growing. The computing power of a single server is limited, however, and cannot meet these demands, so distributed training has become the most effective way to scale out the computing power available for deep learning. At present, distributed deep learning training is bottlenecked by communication over the network, which makes the communication network the most influential factor in training performance. Many studies have addressed network performance optimization for distributed deep learning. This paper first describes the main performance bottlenecks and optimization schemes, then analyzes in detail the current state-of-the-art ultra-large-scale distributed training architectures and their performance optimization methods. Finally, it gives a comparative summary of the performance optimization schemes, discusses the difficulties that still remain in distributed deep learning training, and points out future research directions.
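To make the communication bottleneck concrete, here is a minimal sketch (not from the paper) of one synchronous data-parallel training step in PyTorch: every worker computes gradients on its local mini-batch, then the gradients are averaged across workers over the network before the weight update. It assumes torch.distributed has already been initialized (e.g. via torchrun); model, optimizer, and loss_fn are illustrative placeholders.

    import torch
    import torch.distributed as dist

    def train_step(model, optimizer, loss_fn, inputs, targets):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()                      # local gradient computation

        world_size = dist.get_world_size()
        for param in model.parameters():
            if param.grad is not None:
                # Sum this parameter's gradient across all workers, then
                # average. This all-reduce traffic grows with model size
                # and is the network bottleneck that the optimizations
                # surveyed here (compression, scheduling, topology,
                # in-network aggregation) aim to reduce or hide.
                dist.all_reduce(param.grad, op=dist.ReduceOp.SUM)
                param.grad /= world_size

        optimizer.step()                     # identical update on every worker
        return loss.item()

Production frameworks do not all-reduce parameter by parameter as above; they typically fuse gradients into large buckets and overlap communication with backpropagation, which is one of the classes of optimization the survey compares.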
Cited by

    Periodical citations (18)

    1. Luo Yuzhe, Li Ling, Hou Pengpeng, Yu Jiageng, Cheng Limin, Zhang Changyou, Wu Yanjun, Zhao Chen. A survey of collaborative intelligence for AIoT. Journal of Computer Research and Development. 2025(01): 179-206.
    2. Cheng Yu. Analysis of Byzantine fault-tolerant distributed learning based on a dual-filtering mechanism under topology constraints. Application of IC. 2025(02): 102-103.
    3. Wang Endong, Yan Ruidong, Guo Zhenhua, Zhao Yaqian. A survey of distributed training systems and their optimization algorithms. Chinese Journal of Computers. 2024(01): 1-28.
    4. Li Kaijia, He Jin, Cao Jiabao, Zhang Dongwei, Liu Hao. Research on building energy consumption prediction based on Seq-GRU. Internet of Things Technologies. 2024(04): 55-60.
    5. Hu Tao, Wang Zhongjie, Zhang Lianming, Chen Xiaosuo. Density clustering simulation of unstructured big data based on deep learning. Computer Simulation. 2024(05): 501-505.
    6. Ju Tao, Kang Heting, Liu Shuai, Huo Jiuyuan. Dynamic layer-wise gradient sparsification and gradient merging optimization for deep neural networks. Journal of Xi'an Jiaotong University. 2024(09): 105-116.
    7. Ju Tao, Liu Shuai, Wang Zhiqiang, Li Linjuan. Task partitioning and parallel optimization methods for deep neural network models. Journal of Beijing University of Aeronautics and Astronautics. 2024(09): 2739-2752.
    8. Fang Xin, Chen Bingqi, Peng Shubo, Zhang Xiongchu, Li Yongzheng. Front vehicle detection based on improved YOLOv4. Transducer and Microsystem Technologies. 2024(10): 155-159.
    9. Tang Chunna. Application of deep learning to load balancing in distributed host clusters. China Computer & Communication (Theory Edition). 2024(17): 59-61.
    10. Ju Tao, Liu Shuai, Huo Jiuyuan, Zhang Xuejun. Adaptive parallel computing task scheduling for deep neural network models. Journal of Jilin University (Engineering and Technology Edition). 2024(12): 3601-3613.
    11. Ju Tao, Zhao Yuyang, Liu Shuai, Yang Yang, Yang Wenjie. Parallel optimization of deep learning models for image recognition. Journal of Xi'an Jiaotong University. 2023(01): 141-151.
    12. Wang Rui, Wang Yan, Yin Pu, Qi Jianpeng, Sun Yetao, Li Qian, Zhang Yida, Zhang Meikui. Research progress on collaborative training for edge intelligence. Chinese Journal of Engineering. 2023(08): 1400-1416.
    13. Han Zhonghua, Li Kaijia, Zhou Xiaofeng, Wang Jina, Sun Liangliang. Research on flexible flow-shop scheduling optimization based on deep learning. CAAI Transactions on Intelligent Systems. 2023(03): 468-478.
    14. Ren Gang, Li Xin, Liu Xiaojie, Zhang Yang, Gao Guanglan, Xiao Dongxu. A genetic-algorithm training algorithm for deep feedforward neural networks based on the Spark big-data computing model. Journal of Henan Institute of Technology. 2023(05): 14-22.
    15. Ma Xiang, Shen Guowei, Guo Chun, Cui Yunhe, Chen Yi. A dynamic adaptive parallel acceleration method for heterogeneous distributed machine learning. CAAI Transactions on Intelligent Systems. 2023(05): 1099-1107.
    16. Peng Kun, Ding Xiaobo, Cai Maozhen, Zhong Dixiu, Li Yunyu. Design and research of a distributed image analysis system. Modern Computer. 2022(11): 31-34+40.
    17. Li Xinchun, Zhan Dechuan. Distributed model reuse with multiple classifiers. Journal of Frontiers of Computer Science and Technology. 2022(10): 2310-2319.
    18. Zhong Yunqin, Zhu Yueqin, Jiao Shoutao. Research on predictive modeling methods for edge big data analytics. Chinese High Technology Letters. 2022(10): 1067-1075.

    Other citation types (46)
