Machine Learning Inference Framework on Multi-Core Processor
Abstract: In recent years, deep neural networks have been widely applied across many domains with great success. As model size and computational workload keep growing, GPUs and many newly designed domain-specific accelerators have been adopted to run neural network computation quickly and efficiently. Nevertheless, general-purpose processors remain the most common and readily available computing platform, so exploring how to run neural network algorithms efficiently on them is equally important. In the training phase, a multi-core processor can use data parallelism to raise throughput and speed up training. In the inference phase, however, end-to-end latency often matters more than throughput, because it determines whether the processor is usable in a given scenario. Traditional data parallelism cannot meet the small-batch, low-latency requirements of inference. To make full use of the hardware resources of a multi-core architecture, the computation must therefore be split inside each operator into smaller parts that run on the cores in parallel. Given the computational characteristics of the processor, a fine-grained strategy is also needed so that the split does not degrade computing efficiency on each core. In this paper, we propose a parallel framework based on operator splitting for multi-core general-purpose processors. It divides each operator in the neural network into smaller ones and executes them on multiple cores in parallel. By providing a small set of auxiliary operators, the framework extends a processor from a single-core to a multi-core structure at low cost and can be ported easily to other multi-core processors. The framework also automatically generates an efficient splitting plan for a given network, taking both the network architecture and the characteristics of the underlying hardware into account. Experimental results show that the resulting splitting plans substantially reduce the end-to-end inference latency of various networks on multi-core processors.
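As a rough illustration of what splitting a single operator across cores can look like, the sketch below divides a fully connected operator along its output dimension into sub-operators and concatenates their results. It is not the framework described in the abstract: the function names (`dense`, `split_rows`, `dense_split`) and the even row-wise split are assumptions made for illustration, and the paper's plan generation is not modeled here. Note also that in standard CPython, threads do not speed up pure-Python arithmetic, so this shows only the decomposition, not a latency gain.

```python
from concurrent.futures import ThreadPoolExecutor


def dense(x, weight):
    """Fully connected operator: one dot product per weight row (output unit)."""
    return [sum(xi * wi for xi, wi in zip(x, row)) for row in weight]


def split_rows(weight, num_parts):
    """Split weight rows into num_parts contiguous, near-equal chunks (hypothetical policy)."""
    base, extra = divmod(len(weight), num_parts)
    chunks, start = [], 0
    for i in range(num_parts):
        size = base + (1 if i < extra else 0)
        chunks.append(weight[start:start + size])
        start += size
    return chunks


def dense_split(x, weight, num_cores):
    """Run one dense operator as up to num_cores sub-operators, then concatenate."""
    # Drop empty chunks, which appear when there are more cores than output rows.
    chunks = [c for c in split_rows(weight, num_cores) if c]
    with ThreadPoolExecutor(max_workers=num_cores) as pool:
        # pool.map preserves input order, so outputs concatenate correctly.
        parts = pool.map(lambda chunk: dense(x, chunk), chunks)
        return [y for part in parts for y in part]
```

Calling `dense_split(x, weight, 4)` should return the same vector as `dense(x, weight)`; every sub-operator reads the full input `x` but only its own slice of the weights, which is why a batch size of one can still be spread over several cores.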