Multimodal Emotion Recognition Method Based on Mixture of Fusion Experts
Abstract
Multimodal emotion recognition aims to integrate data from multiple modalities to accurately infer emotional states. Existing methods either treat all modalities equally or adopt fixed fusion strategies driven by a single modality, and thus fail to adequately account for the imbalance in modality contributions. To address this, we propose MoMFE, a dynamic multimodal fusion method built on the Mixture of Experts (MoE) framework that incorporates an adaptive router module and a multimodal fusion expert module. The method dynamically evaluates the contribution of each modality and selects appropriate fusion strategies accordingly. The adaptive router assesses modality contributions through inter-modal correlation analysis and dynamic weighting to guide the selection of fusion experts, and an expert-guided loss function is introduced to further optimize expert selection. The multimodal fusion experts perform complementary fusion for different combinations of modality contributions, and a shared expert is added to mitigate the loss of global information and reduce parameter redundancy. Comparative experiments on three multimodal emotion recognition benchmarks (MER2024, CMU-MOSEI, and CH-SIMS) show that MoMFE outperforms state-of-the-art (SOTA) multimodal fusion methods on core metrics, including binary emotion recognition accuracy (Acc-2) and F1 score, with an average improvement of approximately 2% on the CH-SIMS dataset.
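To make the architecture described above concrete, the following is a minimal, illustrative sketch of a mixture-of-fusion-experts layer with an adaptive router and an always-active shared expert. It is not the authors' implementation: the concrete router design (pairwise cosine-similarity correlations followed by a learned projection and softmax weighting), the concat-MLP fusion experts, dense rather than top-k routing, the expert-guided loss (omitted here), and all names and dimensions are assumptions made for illustration only.

```python
# Illustrative sketch only: all module designs, names, and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AdaptiveRouter(nn.Module):
    """Scores fusion experts from per-modality features.

    Assumption: modality contributions are estimated from pairwise cosine
    similarities between modality embeddings plus a learned projection.
    """

    def __init__(self, dim: int, num_modalities: int, num_experts: int):
        super().__init__()
        pair_count = num_modalities * (num_modalities - 1) // 2
        self.proj = nn.Linear(num_modalities * dim + pair_count, num_experts)

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        # Pairwise inter-modal correlations (cosine similarity).
        sims = []
        for i in range(len(feats)):
            for j in range(i + 1, len(feats)):
                sims.append(F.cosine_similarity(feats[i], feats[j], dim=-1))
        router_in = torch.cat(feats + [torch.stack(sims, dim=-1)], dim=-1)
        return F.softmax(self.proj(router_in), dim=-1)  # (batch, num_experts)


class FusionExpert(nn.Module):
    """One fusion strategy; a simple concat-MLP stands in for a real fusion scheme."""

    def __init__(self, dim: int, num_modalities: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(num_modalities * dim, dim), nn.GELU(), nn.Linear(dim, dim)
        )

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        return self.net(torch.cat(feats, dim=-1))


class MixtureOfFusionExperts(nn.Module):
    def __init__(self, dim: int, num_modalities: int = 3, num_experts: int = 4):
        super().__init__()
        self.router = AdaptiveRouter(dim, num_modalities, num_experts)
        self.experts = nn.ModuleList(
            FusionExpert(dim, num_modalities) for _ in range(num_experts)
        )
        self.shared_expert = FusionExpert(dim, num_modalities)  # always active

    def forward(self, feats: list[torch.Tensor]) -> torch.Tensor:
        weights = self.router(feats)                            # (batch, num_experts)
        expert_out = torch.stack([e(feats) for e in self.experts], dim=1)
        routed = (weights.unsqueeze(-1) * expert_out).sum(dim=1)
        # Shared expert preserves global multimodal information regardless of routing.
        return routed + self.shared_expert(feats)


if __name__ == "__main__":
    # Toy usage: text / audio / visual features for a batch of 8 samples.
    dim = 128
    feats = [torch.randn(8, dim) for _ in range(3)]
    fused = MixtureOfFusionExperts(dim)(feats)
    print(fused.shape)  # torch.Size([8, 128])
```

In this sketch the router's softmax weights play the role of the contribution-aware expert selection, while the shared expert contributes to every sample so that routing decisions cannot discard global information; a training objective such as the expert-guided loss would act on the router weights and is not modeled here.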