Puzzle: A Scalable Framework for Deep Learning Integrated Chips

Wang Mengdi; Wang Ying; Liu Cheng; Chang Kaiyan; Gao Chengsi; Han Yinhe; Li Huawei; Zhang Lei

doi:10.7544/issn1000-1239.202330059

Puzzle: A Scalable Framework for Deep Learning Integrated Chips

Abstract

Chiplet integration is becoming a highly scalable way to agilely customize deep learning chips for different scenarios. By integrating third-party chiplets that have already been designed and verified ("known-good" dies), chip designers can shorten development time and reduce cost while improving design flexibility and yield. In the conventional chip design and business model, a dedicated software toolchain, such as a compiler, is provided as part of the chip solution and plays an important role in chip performance and development. However, when a chip is assembled from multiple third-party chiplets, the dedicated toolchain supplied by each chiplet vendor has no advance knowledge of the resources of the whole chip, so it cannot dispatch tasks to hardware resources or manage the cooperation between the interfaces exposed by the independent chiplets. Designing an entirely new toolchain for each integrated chip is time-consuming and defeats the purpose of agile chip customization. In this paper, we propose Puzzle, a scalable compilation and resource management framework for integrated deep learning chips. Puzzle covers the complete flow from profiling the input workload to run-time management of chip resources, and it adaptively generates efficient task scheduling and resource allocation schemes that reduce redundant memory access and costly inter-chiplet communication. Experimental results show that the task deployment schemes generated by Puzzle adapt to different workloads and hardware resource configurations, and reduce workload latency by 27.5% on average compared with existing methods.
