A Cross-Modal Entity Linking Model Based on Contrastive Learning

Wang Yuanzheng; Sun Wenxiang; Fan Yixing; Liao Huaming; Guo Jiafeng

doi:10.7544/issn1000-1239.202330731

Abstract

Abstract: Image-text cross-modal entity linking extends the traditional entity linking task: the input is an image containing an entity, and the goal is to link it to the corresponding textual entity in a knowledge base. Existing models usually adopt a dual-encoder architecture that encodes the image and text forms of an entity into separate vectors, computes their similarity with a dot product, and links each image entity to the text entity with the highest similarity. Training usually relies on a cross-modal contrastive learning task based on the InfoNCE loss, which increases the similarity between an entity's vector in one modality and its own vector in the other modality, and decreases its similarity to the other-modality vectors of all other entities. However, this approach overlooks the difference in representation difficulty between the two modalities: similar entities are usually harder to distinguish in the image modality than in the text modality, so visually similar image entities are easily linked incorrectly. To address this, we propose two new contrastive learning tasks that improve the discriminative power of the vectors. The first is self-contrastive learning, which improves the separation among image vectors. The second is hard-negative contrastive learning, which helps a text vector distinguish among several similar image vectors. In experiments on the open-source dataset WikiPerson, with an entity base of 120,000 entities, our model improves accuracy by 4.5 percentage points over the best baseline model trained with the InfoNCE loss.
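
The baseline setup described in the abstract (dual-encoder vectors, dot-product similarity, InfoNCE-based cross-modal contrastive training, and linking by highest similarity) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names `info_nce` and `link`, the `temperature` parameter, and the choice to average the image-to-text and text-to-image directions are assumptions made for illustration, and the encoders themselves are omitted.

```python
import numpy as np


def log_softmax(logits, axis):
    # Numerically stable log-softmax along the given axis.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def info_nce(image_vecs, text_vecs, temperature=1.0):
    """Cross-modal InfoNCE loss for a batch of paired vectors.

    Row i of image_vecs and row i of text_vecs are assumed to describe the
    same entity; every other row in the batch acts as a negative.
    The temperature and the symmetric averaging are illustrative assumptions.
    """
    sim = image_vecs @ text_vecs.T / temperature  # dot-product similarities
    idx = np.arange(sim.shape[0])
    # Image -> text: each image should score its own text highest (row-wise).
    image_to_text = -log_softmax(sim, axis=1)[idx, idx].mean()
    # Text -> image: each text should score its own image highest (column-wise).
    text_to_image = -log_softmax(sim, axis=0)[idx, idx].mean()
    return 0.5 * (image_to_text + text_to_image)


def link(image_vecs, text_vecs):
    # Link each image entity to the index of the most similar text entity.
    return (image_vecs @ text_vecs.T).argmax(axis=1)
```

With correctly paired vectors the loss is small, and it grows when the pairing is shuffled, which is the signal that pulls matching image and text vectors together during training.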
