Abstract:
Multimodal emotion recognition aims to integrate data from multiple modalities to infer emotional states accurately. Existing research either treats all modalities equally or adopts fixed fusion strategies based on a single modality, failing to adequately address the imbalance of modality contributions. To tackle this, we propose MoMFE, a dynamic multimodal fusion method based on the Mixture of Experts (MoE) framework that incorporates an adaptive router module and a multimodal fusion expert module. The method dynamically evaluates the contribution of each modality and selects appropriate fusion strategies accordingly. The adaptive router assesses modality contributions through inter-modal correlation analysis and dynamic weighting to guide the selection of fusion experts, and an expert-guided loss function is integrated to further optimize the expert selection process. The multimodal fusion experts perform complementary fusion for different combinations of modality contributions, and a shared expert is introduced to mitigate the loss of global information and reduce parameter redundancy. Comparative experiments on three multimodal emotion recognition benchmarks (MER2024, CMU-MOSEI, and CH-SIMS) demonstrate that MoMFE outperforms state-of-the-art (SOTA) multimodal fusion methods on core metrics, including binary emotion recognition accuracy (Acc-2) and F1 score. Notably, it achieves an average improvement of approximately 2% on the CH-SIMS dataset.
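To make the routed-fusion idea in the abstract concrete, the following is a minimal, hypothetical PyTorch sketch of MoE-style multimodal fusion: a router scores fusion experts from the combined modality features, a few top-ranked experts are applied per sample, and a shared expert is always active. All names, layer sizes, the top-k scheme, and the concatenation-based router input are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MoEFusion(nn.Module):
    """Sketch of MoE multimodal fusion: adaptive router + fusion experts + shared expert.
    Expert count, top-k routing, and layer sizes are illustrative guesses."""

    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        # Router scores each expert from the concatenated modality features.
        self.router = nn.Linear(3 * dim, num_experts)
        # Routed experts: each fuses the concatenated features in its own way.
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim))
             for _ in range(num_experts)]
        )
        # Shared expert applied to every sample to preserve global information.
        self.shared_expert = nn.Sequential(
            nn.Linear(3 * dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, text, audio, vision):
        x = torch.cat([text, audio, vision], dim=-1)   # (batch, 3*dim)
        logits = self.router(x)                        # (batch, num_experts)
        # Keep only the top-k experts per sample and renormalize their weights.
        top_w, top_idx = logits.topk(self.top_k, dim=-1)
        top_w = F.softmax(top_w, dim=-1)
        fused = self.shared_expert(x)                  # always-on shared path
        for k in range(self.top_k):
            idx = top_idx[:, k]                        # chosen expert per sample
            expert_out = torch.stack(
                [self.experts[i](x[b:b + 1]).squeeze(0)
                 for b, i in enumerate(idx.tolist())]
            )
            fused = fused + top_w[:, k:k + 1] * expert_out
        # The router logits could additionally feed an expert-guided routing loss.
        return fused, logits


# Example: fuse 768-d text/audio/vision features for a batch of 8 samples.
model = MoEFusion(dim=768)
t, a, v = (torch.randn(8, 768) for _ in range(3))
fused, router_logits = model(t, a, v)
print(fused.shape)  # torch.Size([8, 768])
```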