Abstract:
Multimodal recommendation systems aim to provide more accurate and personalized recommendation services. However, existing research still faces the following issues: 1) Feature distortion: input embeddings are extracted by small pre-trained language models and deep convolutional neural networks, yielding inaccurate feature representations. 2) Single encoding perspective: the multimodal encoding layers of current models encode from only one perspective, either memory or expansion, leading to information loss. 3) Poor multimodal alignment: embeddings from different modalities lie in different spaces and must be mapped into a shared space for alignment; however, existing methods, which rely on simple multiplication with behavioral information, fail to capture the complex relationships between modalities and therefore cannot align them precisely. To address these issues, a novel model, DPRec, is proposed. It encodes from both the memory and expansion perspectives and introduces hypergraphs for multi-level, precise cross-modal alignment. The proposed model is evaluated on three real-world datasets, and the experimental results validate its effectiveness.