Abstract:
Chiplet integration is becoming a highly scalable solution for customizing deep learning chips for different scenarios, and many chip designers now reduce development cost by integrating "known-good" third-party dies, which offer higher yield, greater design flexibility, and shorter time-to-market. In the conventional chip business model, a dedicated software toolchain, such as a compiler, is provided as part of the chip solution and plays an important role in chip performance and development. However, for a chip solution that assembles multiple third-party dies, the toolchain must handle configurations that are unknown in advance to the die vendors' dedicated compilers. In such a situation, dispatching tasks to hardware resources and managing cooperation across the interfaces exposed by independent third-party dies becomes a necessity. Moreover, designing a whole new toolchain for each integrated chip is time-consuming and even deviates from the original intention of agile chip customization. In this paper, we propose Puzzle, a scalable compilation and resource management framework for integrated deep learning chips. Puzzle provides a complete flow from profiling the input workload to run-time management of chip resources, and reduces redundant memory accesses and expensive inter-die communication through efficient, self-adaptive resource allocation and task distribution. Experimental results show that Puzzle achieves an average latency reduction of 27.5% over state-of-the-art solutions across various chip configurations and workloads.