Abstract:
Knowledge distillation, a key technique in deep learning, achieves model compression and acceleration by transferring knowledge from a large teacher model to a smaller student model. While maintaining performance, it significantly reduces computational and storage requirements, facilitating the deployment of high-performance models on resource-constrained edge devices. First, this paper provides a systematic review of recent research on knowledge distillation and categorizes it from two perspectives: the type of knowledge and the teacher-student model architecture. We comprehensively summarize distillation methods based on three typical types of knowledge, namely output feature knowledge, intermediate feature knowledge, and relational feature knowledge, as well as distillation methods based on CNN-to-CNN, CNN-to-ViT (Vision Transformer), ViT-to-CNN, and ViT-to-ViT architectures. Next, the paper explores various learning paradigms, including offline distillation, online distillation, self-distillation, data-free distillation, multi-teacher distillation, and assistant distillation. The paper then summarizes distillation optimization methods concerning the distillation process, knowledge structure, temperature coefficient, and loss function, and analyzes the improvements to distillation brought by adversarial techniques, automated machine learning, reinforcement learning, and diffusion models, concluding with the implementation of distillation technology in common applications. Despite significant advances in knowledge distillation, numerous challenges remain in both practical applications and theoretical research. Finally, the paper provides an in-depth analysis of these issues and offers insights into future development directions.
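As a point of reference for the output feature (logit-based) distillation and the temperature coefficient mentioned above, the following is a minimal PyTorch-style sketch of the classic temperature-scaled distillation loss; the function name `kd_loss` and the choices of temperature `T` and weighting factor `alpha` are illustrative assumptions, not specifics from this survey.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Illustrative sketch of temperature-scaled logit distillation (assumed hyperparameters)."""
    # Soften both distributions with the temperature coefficient T
    log_p_student = F.log_softmax(student_logits / T, dim=1)
    p_teacher = F.softmax(teacher_logits / T, dim=1)
    # KL divergence between softened outputs, scaled by T^2 to keep gradient magnitudes comparable
    soft_loss = F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (T * T)
    # Standard cross-entropy against ground-truth labels
    hard_loss = F.cross_entropy(student_logits, labels)
    # Weighted combination of soft (teacher) and hard (label) supervision
    return alpha * soft_loss + (1.0 - alpha) * hard_loss
```

In practice the student is trained by backpropagating this combined loss while the teacher's parameters are kept frozen (in the offline distillation paradigm); the other paradigms and knowledge types surveyed in the paper modify which signals are matched and how the teacher is obtained.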