Guo Wenya, Zhang Ying, Liu Shengzhe, Yang Jufeng, Yuan Xiaojie. Relationship Aggregation Network for Referring Expression Comprehension[J]. Journal of Computer Research and Development, 2023, 60(11): 2611-2623. DOI: 10.7544/issn1000-1239.202220019


    Relationship Aggregation Network for Referring Expression Comprehension


Abstract: In this paper, we focus on the task of referring expression comprehension (REC), which aims to locate the image regions referred to by input expressions. One of the main challenges is to visually ground the object relationships described by the input expressions. Existing mainstream methods mainly score objects based on their visual attributes and their relationships with other objects, and the object with the highest score is predicted as the referred region. However, these methods tend to consider only the relationships between the currently evaluated region and its surroundings, ignoring the informative interactions among the multiple surrounding regions, which are important for matching the input expression to the visual content of the image. To address this issue, we propose a relationship aggregation network (RAN) that constructs comprehensive relationships and then aggregates them to predict the referred region. Specifically, we construct both of the aforementioned kinds of relationships with graph attention networks. The relationships most relevant to the input expression are then selected and aggregated via a cross-modal attention mechanism. Finally, we compute matching scores from the aggregated features and predict the referred region accordingly. Additionally, we improve the existing erasing strategies in REC by erasing spans of continuous words, which encourages the model to find and use more clues. Extensive experiments on three widely used benchmark datasets demonstrate the superiority of the proposed method.
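For concreteness, the pipeline sketched in the abstract can be illustrated in a few lines of PyTorch: region features attend to one another over a fully connected graph, pairwise relationship features are weighted by their relevance to the expression embedding, and a per-region matching score is produced. This is a minimal sketch, not the authors' released implementation; the module names and dimensions are assumptions, and standard scaled dot-product attention stands in for the paper's graph attention layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RelationshipAggregator(nn.Module):
    """Illustrative sketch of expression-guided relationship aggregation."""

    def __init__(self, dim=512):
        super().__init__()
        # Attention over region features: every region attends to every
        # other region on a fully connected graph (dot-product stand-in
        # for the graph attention layer described in the paper).
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        # Projects a (region, neighbor-context) pair to a relationship feature.
        self.rel_proj = nn.Linear(2 * dim, dim)

    def forward(self, regions, expr):
        # regions: (N, dim) region features; expr: (dim,) expression embedding
        N, d = regions.shape
        attn = torch.softmax(self.q(regions) @ self.k(regions).T / d ** 0.5, dim=-1)
        context = attn @ self.v(regions)  # (N, dim) per-region neighbor summary
        # Pairwise relationship features for all region pairs.
        pairs = torch.cat([regions.unsqueeze(1).expand(N, N, d),
                           context.unsqueeze(0).expand(N, N, d)], dim=-1)
        rel = torch.tanh(self.rel_proj(pairs))  # (N, N, dim)
        # Cross-modal weights: relevance of each relationship to the expression.
        w = torch.softmax((rel * expr).sum(-1), dim=-1)  # (N, N)
        agg = (w.unsqueeze(-1) * rel).sum(dim=1)  # (N, dim) aggregated relations
        # Matching score per region; the argmax is the predicted referred region.
        return F.cosine_similarity(agg, expr.unsqueeze(0), dim=-1)

# Usage: score 5 candidate regions against one expression embedding.
scores = RelationshipAggregator()(torch.randn(5, 512), torch.randn(512))
print(scores.argmax().item())
```

The key design point the abstract emphasizes is visible here: the expression does not merely score regions in isolation, it reweights the pairwise relationships among all regions before any region-level score is computed.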
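The improved erasing strategy can likewise be sketched briefly: rather than dropping a single word during training, a contiguous span of the expression is masked, so the model cannot lean on one dominant clue. The span bound and mask token below are assumptions for illustration, and the paper's adaptive expansion of the erased range is simplified here to a randomly chosen span length.

```python
import random

def erase_continuous(tokens, max_span=3, mask="<unk>"):
    """Mask a random contiguous span of tokens, keeping at least one word."""
    if len(tokens) <= 1:
        return tokens
    span = random.randint(1, min(max_span, len(tokens) - 1))
    start = random.randint(0, len(tokens) - span)
    return tokens[:start] + [mask] * span + tokens[start + span:]

# e.g. ['the', 'man', '<unk>', '<unk>', '<unk>', 'of', 'the', 'red', 'car']
print(erase_continuous("the man to the left of the red car".split()))
```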

       
