Citation: Guo Wenya, Zhang Ying, Liu Shengzhe, Yang Jufeng, Yuan Xiaojie. Relationship Aggregation Network for Referring Expression Comprehension[J]. Journal of Computer Research and Development, 2023, 60(11): 2611−2623. DOI: 10.7544/issn1000-1239.202220019
In this paper, we focus on the task of referring expression comprehension (REC), which aims to locate the image regions referred to by input expressions. One of the main challenges is to visually ground the object relationships described by the input expressions. Existing mainstream methods mainly score objects based on their visual attributes and their relationships with other objects, and the object with the highest score is predicted as the referred region. However, these methods tend to consider only the relationships between the currently evaluated region and its surroundings, while ignoring the informative interactions among the multiple surrounding regions, which are important for matching the input expression to the visual content of the image. To address this issue, we propose a relationship aggregation network (RAN) that constructs comprehensive relationships and then aggregates them to predict the referred region. Specifically, we construct both of the aforementioned kinds of relationships based on graph attention networks. Then, the relationships most relevant to the input expression are selected and aggregated with a cross-modality attention mechanism. Finally, we compute matching scores according to the aggregated features, based on which we predict the referred region. Additionally, we improve the existing erase strategies in REC by erasing contiguous spans of words to encourage the model to find and use more clues. Extensive experiments on three widely used benchmark datasets demonstrate the superiority of the proposed method.
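As a rough illustration of the two ideas in the abstract, the PyTorch sketch below (not the authors' implementation; the module, dimensions, and the erase_continuous_span helper are all hypothetical) shows how an expression-guided cross-modality attention could select and aggregate per-region relationship features into a matching score, and how a contiguous span of word features could be erased during training.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationshipAggregation(nn.Module):
    """Hypothetical sketch: aggregate region-relationship features with
    cross-modality attention guided by the expression embedding."""

    def __init__(self, dim=512):
        super().__init__()
        self.query_proj = nn.Linear(dim, dim)  # projects the expression feature
        self.key_proj = nn.Linear(dim, dim)    # projects relationship features
        self.score_fc = nn.Linear(dim, 1)      # matching score head

    def forward(self, expr_feat, rel_feats):
        # expr_feat: (B, D)    pooled expression embedding
        # rel_feats: (B, K, D) K relationship features for a candidate region,
        #                      e.g. produced by graph attention layers
        q = self.query_proj(expr_feat).unsqueeze(1)                      # (B, 1, D)
        k = self.key_proj(rel_feats)                                     # (B, K, D)
        attn = F.softmax((q * k).sum(-1) / k.size(-1) ** 0.5, dim=-1)    # (B, K)
        agg = (attn.unsqueeze(-1) * rel_feats).sum(1)                    # (B, D)
        # matching score between the expression and the aggregated relationships
        return self.score_fc(agg * expr_feat).squeeze(-1)                # (B,)


def erase_continuous_span(word_feats, max_len=3):
    """Hypothetical continuous-word erasing: zero out a random contiguous
    span of word features so the model must rely on the remaining clues."""
    B, T, _ = word_feats.shape
    out = word_feats.clone()
    for b in range(B):
        span = torch.randint(1, max_len + 1, (1,)).item()
        start = torch.randint(0, max(T - span, 1), (1,)).item()
        out[b, start:start + span] = 0.0
    return out
```

The candidate region with the highest score returned by such a module would then be predicted as the referred region.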