    Citation: Li Qin, Zhu Yanchao, Liu Yi, Qian Depei. Accelerator Support in YARN Cluster[J]. Journal of Computer Research and Development, 2016, 53(6): 1263-1270. DOI: 10.7544/issn1000-1239.2016.20148351

    Accelerator Support in YARN Cluster

    Abstract: Accelerators such as GPUs and Intel MIC are widely used in scientific computing and in graphics and image processing, and they have strong potential for big data processing and cloud-based high-performance computing. YARN is the new-generation Hadoop distributed computing framework. Its resource allocation and scheduling mainly target CPUs, and it lacks support for accelerators. This paper extends YARN to support nodes equipped with accelerators. An analysis of heterogeneous node support identifies three difficulties that must be solved to introduce hybrid/heterogeneous computing into YARN. The first is how to manage and schedule accelerator resources in the cluster and share them among heterogeneous nodes; the second is how to collect accelerator status on the master node for monitoring and management; the third is how to resolve contention among multiple accelerator tasks running concurrently on the same node. To solve the first problem, a dynamic node bundling strategy is proposed, in which neighboring nodes are bundled into one resource encapsulation. To solve the second, management functions on the master node collect real-time accelerator status from the working nodes. To solve the third, an accelerator task pipeline on the nodes with accelerators splits each accelerator task into three parts and executes them in parallel. The scheme is verified on a cluster of 4 GPU nodes. The test workload includes LU, QR and Cholesky decomposition from the third-party benchmark MAGMA, and a program that performs feature extraction and clustering on 50000 images. The results demonstrate the effectiveness of the scheme.
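The accelerator task pipeline described in the abstract can be illustrated with a minimal sketch. The abstract does not name the three parts of an accelerator task, so the stages used here (copy input to the device, compute, copy results back) are an assumption, and `run_pipeline`, `copy_in`, `compute` and `copy_out` are hypothetical names, not the paper's interfaces. The sketch uses plain Python threads and queues in place of a real accelerator, so it shows only the overlap between stages, not the authors' implementation.

```python
import queue
import threading

_DONE = object()  # sentinel marking the end of the task stream


def _stage(func, inbox, outbox):
    """Run one pipeline stage: apply func to each item until the sentinel arrives."""
    while True:
        item = inbox.get()
        if item is _DONE:
            outbox.put(_DONE)  # propagate shutdown to the next stage
            return
        outbox.put(func(item))


def run_pipeline(tasks, copy_in, compute, copy_out):
    """Push tasks through three stages that run concurrently on separate threads.

    The stage names are assumptions made for illustration (see the lead-in).
    Each stage has one worker and FIFO queues, so results keep the task order.
    """
    q_in, q_mid, q_out, q_done = (queue.Queue() for _ in range(4))
    stages = [
        threading.Thread(target=_stage, args=(copy_in, q_in, q_mid)),
        threading.Thread(target=_stage, args=(compute, q_mid, q_out)),
        threading.Thread(target=_stage, args=(copy_out, q_out, q_done)),
    ]
    for t in stages:
        t.start()
    for task in tasks:
        q_in.put(task)
    q_in.put(_DONE)

    results = []
    while True:
        item = q_done.get()
        if item is _DONE:
            break
        results.append(item)
    for t in stages:
        t.join()
    return results


if __name__ == "__main__":
    # Stand-in stage functions; a real system would move data to and from a GPU.
    print(run_pipeline(
        [1, 2, 3],
        copy_in=lambda x: [x, x],
        compute=lambda buf: sum(buf),
        copy_out=lambda y: y * 10,
    ))
```

Because each stage owns the accelerator-facing step it serves, only one task occupies the compute stage at a time while the transfers of neighboring tasks overlap with it, which is one plausible way to limit contention among tasks on the same node. The sketch omits error handling: an exception inside a stage function would stall the pipeline.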

       
