Citation: Guo Wenya, Zhang Ying, Liu Shengzhe, Yang Jufeng, Yuan Xiaojie. Relationship Aggregation Network for Referring Expression Comprehension[J]. Journal of Computer Research and Development, 2023, 60(11): 2611−2623. DOI: 10.7544/issn1000-1239.202220019
In this paper, we focus on the task of referring expression comprehension (REC), which aims to locate the image regions referred to by input expressions. One of the main challenges is to visually ground the object relationships described by the input expressions. Existing mainstream methods mainly score objects based on their visual attributes and their relationships with other objects, and the object with the highest score is predicted as the referred region. However, these methods tend to consider only the relationships between the currently evaluated region and its surroundings, while ignoring the informative interactions among the multiple surrounding regions, which are important for matching the input expression to the visual content of the image. To address this issue, we propose a relationship aggregation network (RAN) that constructs comprehensive relationships and then aggregates them to predict the referred region. Specifically, we construct both of the aforementioned kinds of relationships based on graph attention networks. Then, the relationships most relevant to the input expression are selected and aggregated with a cross-modality attention mechanism. Finally, we compute matching scores according to the aggregated features, based on which we predict the referred region. Additionally, we improve the existing erase strategies in REC by erasing contiguous spans of words to encourage the model to find and use more clues. Extensive experiments on three widely used benchmark datasets demonstrate the superiority of the proposed method.
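As a rough illustration of the two ideas in the abstract, the PyTorch sketch below (not the authors' implementation; the module, dimensions, and the erase_continuous_span helper are all hypothetical) shows how an expression-guided cross-modality attention could select and aggregate per-region relationship features into a matching score, and how a contiguous span of word features could be erased during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationshipAggregation(nn.Module):
    """Hypothetical sketch: aggregate region-relationship features with
    cross-modality attention guided by the expression embedding."""

    def __init__(self, dim=512):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # projects the expression feature
        self.key_proj = nn.Linear(dim, dim)    # projects relationship features
        self.score_fc = nn.Linear(dim, 1)      # matching score head

    def forward(self, expr_feat, rel_feats):
        # expr_feat: (B, D)    pooled expression embedding
        # rel_feats: (B, K, D) K relationship features for a candidate region,
        #                      e.g. produced by graph attention layers
        q = self.query_proj(expr_feat).unsqueeze(1)                      # (B, 1, D)
        k = self.key_proj(rel_feats)                                     # (B, K, D)
        attn = F.softmax((q * k).sum(-1) / k.size(-1) ** 0.5, dim=-1)    # (B, K)
        agg = (attn.unsqueeze(-1) * rel_feats).sum(1)                    # (B, D)
        # matching score between the expression and the aggregated relationships
        return self.score_fc(agg * expr_feat).squeeze(-1)                # (B,)


def erase_continuous_span(word_feats, max_len=3):
    """Hypothetical continuous-word erasing: zero out a random contiguous
    span of word features so the model must rely on the remaining clues."""
    B, T, _ = word_feats.shape
    out = word_feats.clone()
    for b in range(B):
        span = torch.randint(1, max_len + 1, (1,)).item()
        start = torch.randint(0, max(T - span, 1), (1,)).item()
        out[b, start:start + span] = 0.0
    return out
```

The candidate region with the highest score returned by such a module would then be predicted as the referred region.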