Fine-grained image recognition is characterized by large intra-class variation and small inter-class variation, and has wide applications in intelligent retail, biodiversity protection, and intelligent transportation. Extracting discriminative multi-granularity features is key to improving recognition accuracy. Most existing methods acquire knowledge at only a single level, ignoring the benefit of multi-level information interaction for extracting robust features; other works introduce attention mechanisms to locate discriminative local regions, which inevitably increases network complexity. To address these issues, this paper proposes MKSMT (multi-level knowledge self-distillation with multi-step training), a model for fine-grained image recognition. The model first learns features in a shallow network, then performs feature learning in a deep network, and uses self-distillation to transfer knowledge from the deep network to the shallow network. The optimized shallow network in turn helps the deep network extract more robust features, improving the performance of the whole model. Experimental results show that MKSMT achieves classification accuracies of 92.8%, 92.6%, and 91.1% on three publicly available fine-grained image datasets, respectively, outperforming most state-of-the-art fine-grained recognition algorithms.
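The abstract states that knowledge is transferred from the deep network to the shallow network via self-distillation, but does not specify the loss. A common choice in knowledge distillation is a temperature-softened KL divergence between the teacher (deep) and student (shallow) output distributions; the minimal sketch below illustrates that idea under this assumption. The function names, the temperature value, and the use of Hinton-style KL distillation are illustrative assumptions, not details taken from the paper.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to a probability distribution, softened by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def self_distillation_loss(deep_logits, shallow_logits, temperature=4.0):
    """Illustrative self-distillation loss (an assumption, not MKSMT's exact loss):
    KL divergence from the deep-network (teacher) distribution to the
    shallow-network (student) distribution. The T^2 factor is the standard
    gradient rescaling used in temperature-based distillation."""
    teacher = softmax(deep_logits, temperature)
    student = softmax(shallow_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(teacher, student))
    return temperature ** 2 * kl

# Identical logits give zero loss; diverging logits give a positive loss.
print(round(self_distillation_loss([2.0, 1.0, 0.1], [2.0, 1.0, 0.1]), 6))  # → 0.0
print(self_distillation_loss([2.0, 1.0, 0.1], [0.1, 1.0, 2.0]) > 0)        # → True
```

In this reading, the shallow network is optimized against the softened predictions of the deep network during the multi-step training schedule, which matches the abstract's claim that the optimized shallow network then helps the deep network learn more robust features.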