Abstract:
Multimodal emotion recognition aims to integrate data from multiple modalities for accurate inference of emotional states. Existing research either treats all modalities equally or adopts fixed fusion strategies based on a single modality, failing to adequately address the imbalance of modality contributions. To tackle this, we propose a dynamic multimodal fusion method MoMFE based on the mixture of experts (MoE) framework, incorporating an adaptive router module and a multimodal fusion expert module. This method dynamically evaluates the contribution of each modality and selects appropriate fusion strategies accordingly. The adaptive router mechanism dynamically assesses modality contributions through inter-modal correlation analysis and dynamic weighting to guide the selection of fusion experts, while an expert-guided loss function is integrated to further optimize the expert selection process. The multimodal fusion experts perform complementary fusion for different modality contribution combinations and introduce a shared expert to mitigate the loss of global information and reduce parameter redundancy. Comparative experiments on three benchmark datasets (MER2024, CMU-MOSEI, and CH-SIMS) for multimodal emotion recognition demonstrate that MoMFE method outperforms state-of-the-art (SOTA) multimodal fusion methods in core metrics, including binary emotion recognition accuracy (Acc-2) and
F1 score. Notably, the method achieves an average improvement of approximately 2% on CH-SIMS dataset.