Wang Yuanzheng, Sun Wenxiang, Fan Yixing, Liao Huaming, Guo Jiafeng. A Cross-Modal Entity Linking Model Based on Contrastive Learning[J]. Journal of Computer Research and Development, 2025, 62(3): 662-671. DOI: 10.7544/issn1000-1239.202330731

A Cross-Modal Entity Linking Model Based on Contrastive Learning

Funds: This work was supported by the National Natural Science Foundation of China (62372431), the National Key Research and Development Program of China (2021QY1701, 2023YFA1011602), the Youth Innovation Promotion Association Member Project of Chinese Academy of Sciences (2021100), the Innovation Project of Institute of Computing Technology of Chinese Academy of Sciences (E261090), and the National Defense Science and Technology Innovation Zone Foundation of China.
  • Author Bio:

    Wang Yuanzheng: born in 1997. PhD candidate. His main research interests include information retrieval and natural language processing

    Sun Wenxiang: born in 1998. Master. His main research interests include information retrieval and natural language processing

    Fan Yixing: born in 1990. Associate professor. His main research interests include information retrieval and natural language understanding

    Liao Huaming: born in 1972. Associate professor. Her main research interests include big data application, information retrieval, and distributed data processing

    Guo Jiafeng: born in 1980. Professor. His main research interests include representation learning, and neural models for information retrieval and filtering

  • Received Date: September 11, 2023
  • Revised Date: July 01, 2024
  • Accepted Date: August 08, 2024
  • Available Online: August 15, 2024
  • Abstract: Image-text cross-modal entity linking is an extension of traditional named entity linking. The inputs are images containing entities, which are linked to textual entities in a knowledge base. Existing models usually adopt a dual-encoder architecture that encodes visual and textual entities into separate vectors, computes their similarities by dot product, and links each image entity to the most similar textual entity. Training typically uses a cross-modal contrastive learning task: given an entity vector in one modality, the task pulls the other-modality vector of the same entity closer and pushes the other-modality vectors of different entities away. However, this approach overlooks the difference in representation difficulty between the two modalities: visually similar entities are often harder to distinguish than textually similar ones, so the former are frequently linked incorrectly. To solve this problem, we propose two new contrastive learning tasks that enhance the discriminative power of the vectors. The first is self-contrastive learning, which improves the distinction between visual vectors. The second is hard-negative contrastive learning, which helps each textual vector distinguish between similar visual vectors. We conduct experiments on the open-source dataset WikiPerson. With a knowledge base of 120,000 entities, our model improves accuracy by 4.5% over the previous state-of-the-art model.
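The three training objectives described in the abstract can be summarized with a short sketch. The following PyTorch code is a minimal illustration under stated assumptions, not the authors' released implementation: the temperature value, the top-k hard-negative mining, and the uniformity-style self-contrastive term are choices made here for exposition, and the inputs are assumed to be L2-normalized embedding batches produced by the visual and textual encoders.

```python
# A minimal sketch of the three contrastive objectives (assumptions: PyTorch,
# L2-normalized image/text embedding batches of shape (B, D), temperature 0.07,
# batch size B > k; the paper's exact formulation and hyperparameters may differ).
import torch
import torch.nn.functional as F

def cross_modal_loss(img, txt, tau=0.07):
    """Standard cross-modal InfoNCE: pull each image toward its own text
    description and away from the texts of other entities, and vice versa."""
    logits = img @ txt.t() / tau                      # (B, B) similarity matrix
    labels = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

def self_contrastive_loss(img, tau=0.07):
    """Uniformity-style term pushing visual vectors of different entities
    apart, to sharpen distinctions among visually similar entities."""
    sim = img @ img.t() / tau                         # (B, B)
    mask = torch.eye(img.size(0), dtype=torch.bool, device=img.device)
    return torch.logsumexp(sim.masked_fill(mask, float('-inf')), dim=1).mean()

def hard_negative_loss(img, txt, k=5, tau=0.07):
    """For each textual vector, contrast its own image against the k most
    visually similar images of other entities (mined hard negatives)."""
    vis_sim = img @ img.t()
    vis_sim.fill_diagonal_(float('-inf'))             # never mine the entity itself
    hard_idx = vis_sim.topk(k, dim=1).indices         # (B, k) hard-negative indices
    pos = (txt * img).sum(-1, keepdim=True)           # (B, 1) positive similarity
    neg = (txt.unsqueeze(1) * img[hard_idx]).sum(-1)  # (B, k) negative similarities
    logits = torch.cat([pos, neg], dim=1) / tau
    labels = torch.zeros(img.size(0), dtype=torch.long, device=img.device)
    return F.cross_entropy(logits, labels)            # positive sits at index 0
```

In practice the total objective would plausibly be a weighted sum, e.g. cross_modal_loss(v, t) + w1 * self_contrastive_loss(v) + w2 * hard_negative_loss(v, t), with the weights tuned on validation data; the weighting actually used in the paper may differ.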
