Abstract:
In recent years, large language models (LLMs) have shown potential in multivariate time series forecasting. However, existing LLM-based cross-modal methods still suffer from insufficient modality alignment and limited generalization in few-shot learning and long-term forecasting tasks. To address these issues, this paper proposes an enhanced cross-modal LLM fine-tuning framework, termed Enhanced CALF, which integrates dynamic attention and hierarchical distillation. First, the framework constructs a dynamic attention cross-modal matching module that introduces adaptive weight generation and alignment prediction mechanisms; this module dynamically adjusts the allocation of attention weights according to data distribution characteristics and inter-modality correlation strengths, improving the precision of cross-modal feature alignment. Second, a multi-level knowledge distillation and contrastive learning module is built: projection mappings are introduced at each Transformer layer, and a hierarchical distillation loss is combined with an adaptive-temperature contrastive loss to achieve hierarchical feature transfer spanning local details to global semantics, enhancing the consistency of cross-modal representations. Finally, an adaptive alignment mechanism is designed that dynamically adjusts the weights of the total loss function based on quantitative inter-modality alignment scores, thereby optimizing the training process. Experiments on seven real-world datasets demonstrate that Enhanced CALF outperforms existing baseline models in long-term forecasting and few-shot learning tasks.
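The abstract gives no formulas for the adaptive alignment mechanism. As one hedged reading of "dynamically adjusts the weights of the total loss function through quantitative evaluation of inter-modality alignment scores", a cosine-similarity-based weighting between text-branch and time-series-branch features might look like the following minimal sketch; the function names and the specific weighting rule are illustrative assumptions, not taken from the paper:

```python
import math

def cosine(u, v):
    # plain cosine similarity between two feature vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def adaptive_loss_weights(text_feat, ts_feat):
    # Hypothetical alignment score in [0, 1]: rescaled cosine similarity
    # between the two modality representations.
    score = 0.5 * (cosine(text_feat, ts_feat) + 1.0)
    # When alignment is poor (low score), up-weight the alignment /
    # distillation loss; when modalities already agree, favor the
    # forecasting (task) loss. Total loss would then be
    #   L = w_task * L_forecast + w_align * L_align.
    w_task = score
    w_align = 1.0 - score
    return w_task, w_align
```

For perfectly aligned features the task loss dominates (`w_align = 0`), while anti-aligned features push all weight onto the alignment term; any real implementation would likely smooth or clip these weights across training steps.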