Abstract:
Image-text cross-modal entity linking extends traditional named entity linking: the inputs are images containing entities, which are linked to textual entities in a knowledge base. Existing models usually adopt a dual-encoder architecture that encodes visual and textual entities into separate vectors, computes their similarities by dot product, and links each image entity to the most similar text entity. Training typically uses a cross-modal contrastive learning objective: for an entity vector in one modality, the objective pulls it toward the corresponding vector in the other modality and pushes it away from the vectors of other entities. However, this approach overlooks the difference in representation difficulty between the two modalities: visually similar entities are often harder to distinguish than textually similar entities, so the former are frequently linked incorrectly. To address this problem, we propose two new contrastive learning tasks that enhance the discriminative power of the vectors. The first is self-contrastive learning, which improves the separation between visual vectors. The second is hard-negative contrastive learning, which helps a textual vector distinguish between similar visual vectors. We conduct experiments on the open-source WikiPerson dataset. With a knowledge base of 120k entities, our model improves accuracy by 4.5% over the previous state-of-the-art model.
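The following is a minimal PyTorch sketch of the three contrastive objectives summarized above: the standard cross-modal loss, the self-contrastive loss over visual vectors, and the hard-negative loss anchored on textual vectors. The exact loss formulations, temperature, augmentation scheme, and hard-negative mining strategy are assumptions for illustration, not the paper's specification.

```python
# Sketch only: loss details (temperature, views, mining) are assumed, not from the paper.
import torch
import torch.nn.functional as F

def cross_modal_loss(img, txt, tau=0.07):
    """Cross-modal InfoNCE: each image vector is pulled toward its paired text
    vector and pushed away from the other text vectors in the batch."""
    img = F.normalize(img, dim=-1)
    txt = F.normalize(txt, dim=-1)
    logits = img @ txt.t() / tau                        # (B, B) similarity matrix
    labels = torch.arange(img.size(0), device=img.device)
    return F.cross_entropy(logits, labels)

def self_contrastive_loss(img, img_aug, tau=0.07):
    """Self-contrastive term within the visual modality: two views of the same
    image entity are pulled together, other images in the batch are pushed away.
    (The two-view augmentation strategy here is an assumption.)"""
    a = F.normalize(img, dim=-1)
    b = F.normalize(img_aug, dim=-1)
    logits = a @ b.t() / tau
    labels = torch.arange(a.size(0), device=a.device)
    return F.cross_entropy(logits, labels)

def hard_negative_loss(txt, img_pos, img_hard_negs, tau=0.07):
    """Hard-negative contrastive term: a text anchor must score its own image
    higher than visually similar images of other entities.
    img_hard_negs: (B, K, D) mined hard negatives (mining strategy assumed)."""
    txt = F.normalize(txt, dim=-1)
    pos = F.normalize(img_pos, dim=-1)
    negs = F.normalize(img_hard_negs, dim=-1)
    pos_sim = (txt * pos).sum(-1, keepdim=True)          # (B, 1)
    neg_sim = torch.einsum("bd,bkd->bk", txt, negs)      # (B, K)
    logits = torch.cat([pos_sim, neg_sim], dim=1) / tau
    labels = torch.zeros(txt.size(0), dtype=torch.long, device=txt.device)
    return F.cross_entropy(logits, labels)
```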