Wang Yan, Li Nianshuang, Wang Xiling, Zhong Fengyan. Coding-Based Performance Improvement of Distributed Machine Learning in Large-Scale Clusters[J]. Journal of Computer Research and Development, 2020, 57(3): 542-561. DOI: 10.7544/issn1000-1239.2020.20190286

Coding-Based Performance Improvement of Distributed Machine Learning in Large-Scale Clusters

Funds: This work was supported by the National Natural Science Foundation of China (61402172) and the Natural Science Foundation of Jiangxi Province of China (20192BAB217006).
  • Published Date: February 29, 2020
  • Abstract: With the growth of models and data sets, running large-scale machine learning algorithms in distributed clusters has become common practice. In this approach, the whole machine learning algorithm and its training data are divided into several tasks, each of which runs on a different worker node; the master node then combines the results of all tasks to obtain the result of the whole algorithm. When a distributed cluster contains a large number of nodes, some worker nodes, called stragglers, will inevitably run slower than the others due to resource contention and other reasons, so that the tasks running on those nodes take significantly longer than on the other nodes. Compared with running replicated tasks on multiple nodes, coded computing exploits computation and storage redundancy more efficiently to alleviate the effect of stragglers and communication bottlenecks in large-scale machine learning clusters. This paper surveys the research progress on using coding techniques to mitigate stragglers and improve the performance of large-scale machine learning clusters. We first introduce the background of coding techniques and large-scale machine learning clusters. We then divide the related research into several categories according to the application scenario: matrix multiplication, gradient computation, data shuffling, and other applications. Finally, we summarize the difficulties of applying coding techniques in large-scale machine learning clusters and discuss future research directions.
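To make the coded-computing idea mentioned in the abstract concrete, the following is a minimal sketch (not the paper's method; all parameter choices and helper names are illustrative assumptions) of MDS-coded matrix-vector multiplication: the data matrix is split into k row blocks, encoded into n > k coded blocks with a Vandermonde generator, and the master recovers the full product from the results of any k workers, so up to n - k stragglers can simply be ignored.

```python
import numpy as np

# Illustrative sketch of coded matrix-vector multiplication for straggler mitigation.
# Split A into k row blocks, encode them into n coded blocks, give one block per worker;
# any k worker results suffice to decode A @ x.

def encode_blocks(A, n, k):
    """Return n coded row blocks of A and the n x k generator matrix G."""
    blocks = np.split(A, k, axis=0)                                   # k uncoded row blocks
    G = np.vander(np.arange(1, n + 1), k, increasing=True).astype(float)
    coded = [sum(G[i, j] * blocks[j] for j in range(k)) for i in range(n)]
    return coded, G

def decode(results, worker_ids, G, k):
    """Recover the k uncoded partial products from any k worker results."""
    Gk = G[worker_ids, :]                    # k x k submatrix, invertible for distinct workers
    Y = np.stack(results)                    # one coded partial product per responding worker
    X = np.linalg.solve(Gk, Y)               # uncoded partial products, one per row block
    return np.concatenate(list(X))

if __name__ == "__main__":
    n, k = 5, 3                              # 5 workers, tolerate up to 2 stragglers
    A = np.random.randn(6, 4)                # 6 rows, split into k = 3 blocks of 2 rows
    x = np.random.randn(4)

    coded, G = encode_blocks(A, n, k)
    fast = [0, 2, 3]                         # pretend workers 1 and 4 are stragglers
    results = [coded[i] @ x for i in fast]   # each fast worker multiplies its coded block by x

    y = decode(results, fast, G, k)
    assert np.allclose(y, A @ x)             # master recovers the full product without the stragglers
    print("recovered A @ x from", len(fast), "of", n, "workers")
```

With replication, tolerating two stragglers would require duplicating every block on multiple nodes; here the same tolerance is obtained with only n - k = 2 extra coded blocks in total, which is the storage and computation saving that coded computing targets.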
