Puzzle: A Scalable Framework for Deep Learning Integrated Chips

Wang Mengdi; Wang Ying; Liu Cheng; Chang Kaiyan; Gao Chengsi; Han Yinhe; Li Huawei; Zhang Lei

doi:10.7544/issn1000-1239.202330059

Wang Mengdi, Wang Ying, Liu Cheng, Chang Kaiyan, Gao Chengsi, Han Yinhe, Li Huawei, Zhang Lei. Puzzle: A Scalable Framework for Deep Learning Integrated Chips[J]. Journal of Computer Research and Development, 2023, 60(6): 1216-1231. DOI: 10.7544/issn1000-1239.202330059

Abstract

Chiplet integration is becoming a highly scalable way to customize deep learning chips for different scenarios. Many chip designers therefore reduce development cost by integrating "known-good" third-party dies, an approach that offers higher yield, greater design flexibility, and shorter time-to-market. In the conventional chip business model, a dedicated software toolchain, such as a compiler, is provided as part of the chip solution and plays an important role in chip performance and development. However, for a chip that assembles multiple third-party dies, the toolchain must handle configurations that the die vendors' dedicated compilers could not anticipate. In this situation, it becomes necessary to decide how to dispatch tasks to hardware resources and how to coordinate the interfaces provided by independent third-party dies. Moreover, designing a brand-new toolchain for each integrated chip is time-consuming and even defeats the original purpose of agile chip customization. In this paper, we propose Puzzle, a scalable compilation and resource management framework for integrated deep learning chips. Puzzle covers the complete flow from profiling the input workload to run-time management of chip resources, and reduces redundant memory access and expensive inter-die communication through efficient, self-adaptive resource allocation and task distribution. Experimental results show that Puzzle reduces latency by 27.5% on average across various chip configurations and workloads compared with state-of-the-art solutions.
