Abstract:
Multimodal emotion recognition in conversations (MERC) has emerged as a prominent research focus in human-computer intelligent interaction. It is broadly applicable to scenarios such as affective dialogue systems and conversational recommendation systems, where it substantially improves user experience through enhanced emotional engagement. However, identifying abstract emotional semantics in multimodal conversation scenarios remains a significant challenge. Current research predominantly relies on contrastive learning frameworks to extract discriminative features. Although these approaches effectively enforce intra-class feature consistency, they inherently constrain fine-grained feature diversity, which impairs the model’s generalization capacity, particularly for samples from minority classes. In addition, existing dialogue context graph learning captures dependencies only within a fixed window, neglecting the dynamic cross-correlations of the conversation flow, which can lead to either redundant or insufficient contextual information. To address these limitations, an adversarial soft-contrast modulated dynamic graph learning method is proposed for MERC. Specifically, a dynamic discourse graph is first constructed according to the number of utterances emitted by each speaker in the conversation, dynamically modeling the conversational dependency range of each utterance in every modality so that rich contextual information is extracted more precisely. Second, an adversarial soft-contrast training mechanism is designed to enhance the discriminability and robustness of network learning: the class sample space is expanded by adding perturbations to the hidden layers of the modality-specific feature extractors to generate adversarial samples, and soft-contrast learning is exploited to maximize the label semantic consistency between the original and adversarial samples. Finally, a bi-stream graph learning strategy over the different modalities is constructed, which not only explores cross-modal consistent feature interaction guided by text semantics but also learns the dynamic contextual dependencies within the multimodal dialogue flow, collaboratively facilitating the complementary fusion of multimodal conversation data. Extensive experiments on the IEMOCAP and MELD benchmark datasets demonstrate that the proposed model achieves competitive results for MERC compared with current state-of-the-art methods.
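To illustrate the dynamic dependency range mentioned above, the following minimal sketch builds a dialogue graph whose context window adapts to how many utterances each speaker contributes, rather than using one fixed window. The specific rule (a base window enlarged by the speaker's utterance count) and the function name are illustrative assumptions, not the paper's exact construction.

```python
# Sketch (assumed rule): each utterance is linked to neighbours within a window
# whose size depends on its speaker's number of utterances in the conversation.
from collections import Counter

def dynamic_dialogue_edges(speakers, base_window=2):
    """speakers: list of speaker ids per utterance, in conversation order."""
    counts = Counter(speakers)                      # utterances per speaker
    edges = []
    for i, spk in enumerate(speakers):
        window = base_window + counts[spk]          # assumed speaker-dependent window
        for j in range(max(0, i - window), min(len(speakers), i + window + 1)):
            if j != i:
                edges.append((i, j))                # directed edge between utterance nodes
    return edges

# Example: a 6-utterance conversation between speakers A and B.
print(dynamic_dialogue_edges(["A", "B", "A", "A", "B", "A"]))
```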
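The adversarial soft-contrast mechanism can likewise be sketched as follows: hidden features of one modality are perturbed along the gradient of the task loss to produce adversarial samples, and a soft consistency term keeps their predicted label semantics aligned with the original samples. This is a hedged illustration, not the authors' implementation; the KL-based consistency loss, the perturbation magnitude `epsilon`, and the weighting `beta` are assumptions.

```python
# Sketch (assumptions noted above) of adversarial perturbation + soft label consistency.
import torch
import torch.nn.functional as F

def adversarial_soft_contrast_step(encoder, classifier, x, labels, epsilon=1e-2, beta=1.0):
    """One hypothetical training step: task loss plus soft consistency on perturbed features."""
    h = encoder(x)                                   # hidden features of one modality
    logits = classifier(h)
    ce = F.cross_entropy(logits, labels)             # emotion classification loss

    # Perturb hidden features along the task-loss gradient to expand the class sample space.
    grad = torch.autograd.grad(ce, h, retain_graph=True)[0]
    h_adv = h + epsilon * F.normalize(grad, dim=-1)

    # Soft consistency: adversarial samples keep the label semantics of the originals.
    logits_adv = classifier(h_adv)
    consistency = F.kl_div(F.log_softmax(logits_adv, dim=-1),
                           F.softmax(logits, dim=-1).detach(),
                           reduction="batchmean")
    return ce + beta * consistency

# Toy usage with a random batch standing in for one modality's inputs.
enc = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU())
clf = torch.nn.Linear(32, 6)
x, y = torch.randn(8, 16), torch.randint(0, 6, (8,))
adversarial_soft_contrast_step(enc, clf, x, y).backward()
```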