Abstract:
Multimodal recommendation methods introduce features such as text and images to enrich item representations, significantly enhancing recommender system performance. In recent years, Large Language Models (LLMs), with their extensive world knowledge and powerful reasoning capabilities, have shown great potential in the field of recommendation. However, effectively integrating LLMs with multimodal information to build LLM-based multimodal recommender systems still faces several challenges. First, there is a significant semantic gap between the textual input format of LLMs and multimodal information, which limits the model's ability to fully understand multimodal features. Second, an information imbalance exists between modalities, so naive feature fusion can degrade performance. To address these challenges, this paper proposes an LLM-based multimodal recommendation algorithm with adaptive modality fusion (AMFRec). The framework introduces multimodal hybrid prompt templates and projection modules, aligning non-textual features with the LLM token space and the recommendation objective through instruction tuning. To mitigate modality imbalance, a dynamic modality-weight allocation strategy based on a gating mechanism is designed, which adaptively adjusts modality weights by combining a mixture-of-experts projection with a gating network. In addition, the algorithm trains only the projection modules and the gating network, leaving the LLM itself frozen, which significantly reduces training cost. Extensive experiments on multiple datasets demonstrate the effectiveness of the proposed framework.
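To make the gating-based adaptive fusion described above concrete, the following is a minimal PyTorch sketch of one plausible realization: per-modality expert projectors map raw features into the LLM embedding space, and a gating network produces input-dependent weights for fusion. All names, dimensions, and the two-modality (text/image) setup are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class AdaptiveModalityFusion(nn.Module):
    """Sketch of gated mixture-of-experts fusion (hypothetical names/dims).

    Each modality is projected into the LLM token-embedding space by a small
    expert MLP; a gating network assigns adaptive per-sample weights to the
    modalities before fusion.
    """

    def __init__(self, text_dim: int, image_dim: int, llm_dim: int, hidden_dim: int = 512):
        super().__init__()
        # One expert projector per modality, mapping raw features to llm_dim.
        self.text_expert = nn.Sequential(
            nn.Linear(text_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, llm_dim)
        )
        self.image_expert = nn.Sequential(
            nn.Linear(image_dim, hidden_dim), nn.GELU(), nn.Linear(hidden_dim, llm_dim)
        )
        # Gating network: scores the two modalities from the concatenated input.
        self.gate = nn.Linear(text_dim + image_dim, 2)

    def forward(self, text_feat: torch.Tensor, image_feat: torch.Tensor) -> torch.Tensor:
        # text_feat: (batch, text_dim), image_feat: (batch, image_dim)
        h_text = self.text_expert(text_feat)    # (batch, llm_dim)
        h_image = self.image_expert(image_feat) # (batch, llm_dim)
        # Softmax over modalities yields adaptive, input-dependent weights,
        # mitigating the imbalance a fixed-weight (naive) fusion would cause.
        w = torch.softmax(self.gate(torch.cat([text_feat, image_feat], dim=-1)), dim=-1)
        fused = w[:, 0:1] * h_text + w[:, 1:2] * h_image  # (batch, llm_dim)
        # The fused vector would be injected into the multimodal prompt as a
        # soft token; the LLM itself stays frozen.
        return fused
```

Consistent with the training recipe in the abstract, only the parameters of this module (experts and gate) would receive gradients during instruction tuning, while the LLM backbone is kept frozen.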