Abstract:
Multimodal recommendation methods introduce features such as text and images to enrich item representations, significantly enhancing recommender system performance. In recent years, Large Language Models (LLMs), with their extensive world knowledge and powerful reasoning capabilities, have shown great potential in the field of recommendation. However, effectively integrating LLMs with multimodal information to build LLM-based multimodal recommender systems still faces several challenges. First, there is a significant semantic gap between the textual input format of LLMs and multimodal information, which limits the model's ability to fully understand multimodal features. Second, an information imbalance exists between modalities, so naive feature fusion can degrade performance. To address these challenges, this paper proposes an LLM-based multimodal recommendation algorithm with adaptive modality fusion (AMFRec). The framework introduces multimodal hybrid prompt templates and projection modules, aligning non-textual features with the LLM token space and the recommendation objective through instruction tuning. To mitigate modality imbalance, a dynamic modality-weight allocation strategy based on a gating mechanism is designed, which adaptively adjusts modality weights by combining a mixture-of-experts projection with a gating network. In addition, the algorithm trains only the projection modules and the gating network, leaving the LLM itself frozen, which significantly reduces training cost. Extensive experiments on multiple datasets demonstrate the effectiveness of the proposed framework.
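To make the gating-based adaptive fusion described above concrete, the following is a minimal PyTorch sketch of one plausible realization: per-modality expert projectors map raw features into the LLM embedding space, and a gating network produces input-dependent weights for fusion. All names, dimensions, and the two-modality (text/image) setup are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AdaptiveModalityFusion(nn.Module):
    """Sketch of gated mixture-of-experts fusion (hypothetical names/dims).

    Each modality is projected into the LLM token-embedding space by a small
    expert MLP; a gating network assigns adaptive per-sample weights to the
    modalities before fusion.
    """

    def __init__(self, text_dim: int, image_dim: int, llm_dim: int, hidden_dim: int = 512):
        super().__init__()
        # One expert projector per modality, mapping raw features to llm_dim.
        self.text_expert = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, llm_dim)
        )
        self.image_expert = nn.Sequential(
            nn.Linear(image_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, llm_dim)
        )
        # Gating network: scores the two modalities from the concatenated input.
        self.gate = nn.Linear(text_dim + image_dim, 2)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # text_feat: (batch, text_dim), image_feat: (batch, image_dim)
        h_text = self.text_expert(text_feat)    # (batch, llm_dim)
        h_image = self.image_expert(image_feat) # (batch, llm_dim)
        # Softmax over modalities yields adaptive, input-dependent weights,
        # mitigating the imbalance a fixed-weight (naive) fusion would cause.
        w = torch.softmax(self.gate(torch.cat([text_feat, image_feat], dim=-1)), dim=-1)
        fused = w[:, 0:1] * h_text + w[:, 1:2] * h_image  # (batch, llm_dim)
        # The fused vector would be injected into the multimodal prompt as a
        # soft token; the LLM itself stays frozen.
        return fused
```

Consistent with the training recipe in the abstract, only the parameters of this module (experts and gate) would receive gradients during instruction tuning, while the LLM backbone is kept frozen.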