Abstract:
Multimodal emotion recognition in conversations (MERC) has emerged as a prominent research focus in human-computer intelligent interaction. It is broadly applicable to scenarios such as affective dialogue systems and conversational recommendation systems, where it substantially improves user experience through enhanced emotional engagement. However, identifying abstract emotional semantics in multimodal conversation scenarios remains a significant challenge. Current research predominantly relies on contrastive learning frameworks to extract discriminative features. Although these approaches effectively enforce intra-class feature consistency, they inherently constrain fine-grained feature diversity, which impairs the model’s generalization capacity, particularly for samples from minority classes. In addition, existing dialogue context graph learning captures dependencies only within a fixed window, neglecting the dynamic cross-correlations of the conversation flow, which can lead to either redundant or insufficient contextual information. To address these limitations, an adversarial soft-contrast modulated dynamic graph learning method is proposed for MERC. Specifically, a dynamic discourse graph is first constructed according to the number of utterances emitted by each speaker in the conversation, dynamically modeling the conversational dependency range of each utterance in every modality so that rich contextual information is extracted more precisely. Second, an adversarial soft-contrast training mechanism is designed to enhance the discriminability and robustness of network learning: the class sample space is expanded by adding perturbations to the hidden layers of the modality-specific feature extractors to generate adversarial samples, and soft-contrast learning is exploited to maximize the label semantic consistency between the original and adversarial samples. Finally, a bi-stream graph learning strategy over the different modalities is constructed, which not only explores cross-modal consistent feature interaction guided by text semantics but also learns the dynamic contextual dependencies within the multimodal dialogue flow, collaboratively facilitating the complementary fusion of multimodal conversation data. Extensive experiments on the IEMOCAP and MELD benchmark datasets demonstrate that the proposed model achieves competitive results for MERC compared with current state-of-the-art methods.
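To illustrate the dynamic dependency range mentioned above, the following minimal sketch builds a dialogue graph whose context window adapts to how many utterances each speaker contributes, rather than using one fixed window. The specific rule (a base window enlarged by the speaker's utterance count) and the function name are illustrative assumptions, not the paper's exact construction.

```python
# Sketch (assumed rule): each utterance is linked to neighbours within a window
# whose size depends on its speaker's number of utterances in the conversation.
from collections import Counter

def dynamic_dialogue_edges(speakers, base_window=2):
    """speakers: list of speaker ids per utterance, in conversation order."""
    counts = Counter(speakers)                      # utterances per speaker
    edges = []
    for i, spk in enumerate(speakers):
        window = base_window + counts[spk]          # assumed speaker-dependent window
        for j in range(max(0, i - window), min(len(speakers), i + window + 1)):
            if j != i:
                edges.append((i, j))                # directed edge between utterance nodes
    return edges

# Example: a 6-utterance conversation between speakers A and B.
print(dynamic_dialogue_edges(["A", "B", "A", "A", "B", "A"]))
```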
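The adversarial soft-contrast mechanism can likewise be sketched as follows: hidden features of one modality are perturbed along the gradient of the task loss to produce adversarial samples, and a soft consistency term keeps their predicted label semantics aligned with the original samples. This is a hedged illustration, not the authors' implementation; the KL-based consistency loss, the perturbation magnitude `epsilon`, and the weighting `beta` are assumptions.

```python
# Sketch (assumptions noted above) of adversarial perturbation + soft label consistency.
import torch
import torch.nn.functional as F

def adversarial_soft_contrast_step(encoder, classifier, x, labels, epsilon=1e-2, beta=1.0):
    """One hypothetical training step: task loss plus soft consistency on perturbed features."""
    h = encoder(x)                                   # hidden features of one modality
    logits = classifier(h)
    ce = F.cross_entropy(logits, labels)             # emotion classification loss

    # Perturb hidden features along the task-loss gradient to expand the class sample space.
    grad = torch.autograd.grad(ce, h, retain_graph=True)[0]
    h_adv = h + epsilon * F.normalize(grad, dim=-1)

    # Soft consistency: adversarial samples keep the label semantics of the originals.
    logits_adv = classifier(h_adv)
    consistency = F.kl_div(F.log_softmax(logits_adv, dim=-1),
                           F.softmax(logits, dim=-1).detach(),
                           reduction="batchmean")
    return ce + beta * consistency

# Toy usage with a random batch standing in for one modality's inputs.
enc = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU())
clf = torch.nn.Linear(32, 6)
x, y = torch.randn(8, 16), torch.randint(0, 6, (8,))
adversarial_soft_contrast_step(enc, clf, x, y).backward()
```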