Abstract:
In this paper, we focus on the task of referring expression comprehension (REC), which aims to locate the image regions referred to by natural-language expressions. One of the main challenges is to visually ground the object relationships described by the input expressions. Existing mainstream methods score objects based on their visual attributes and their relationships with other objects, and the object with the highest score is predicted as the referred region. However, these methods tend to consider only the relationships between the currently evaluated region and its surroundings, ignoring the informative interactions among the surrounding regions themselves, which are important for matching the input expressions with the visual content of the image. To address this issue, we propose a relationship aggregation network (RAN) that constructs comprehensive relationships and then aggregates them to predict the referred region. Specifically, we construct both of the aforementioned kinds of relationships with graph attention networks. Then, the relationships most relevant to the input expression are selected and aggregated with a cross-modality attention mechanism. Finally, we compute matching scores from the aggregated features and predict the referred region accordingly. Additionally, we improve the existing erase strategies in REC by erasing contiguous spans of words, which encourages the model to find and use more clues. Extensive experiments on three widely used benchmark datasets demonstrate the superiority of the proposed method.
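
To make the aggregation step concrete, the following is a minimal, hypothetical sketch (PyTorch-style Python) of how per-region relationship features, such as those produced by a graph attention layer, might be weighted by their relevance to the expression embedding, aggregated, and turned into matching scores. All module and tensor names here are illustrative assumptions, not the actual RAN implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CrossModalRelationAggregation(nn.Module):
    """Hypothetical sketch: aggregate per-region relationship features with
    attention guided by the expression embedding, then score each region."""
    def __init__(self, vis_dim, lang_dim, hidden_dim):
        super().__init__()
        # Project visual relationship features and the expression into a shared space.
        self.vis_proj = nn.Linear(vis_dim, hidden_dim)
        self.lang_proj = nn.Linear(lang_dim, hidden_dim)
        self.score_fc = nn.Linear(hidden_dim, 1)

    def forward(self, rel_feats, expr_feat):
        # rel_feats: (num_regions, num_relations, vis_dim) -- relationship features
        #            per candidate region (e.g., outputs of a graph attention layer).
        # expr_feat: (lang_dim,) -- pooled embedding of the referring expression.
        v = self.vis_proj(rel_feats)                # (R, K, H)
        q = self.lang_proj(expr_feat)               # (H,)
        # Cross-modality attention: weight each relationship by its relevance
        # to the expression, then aggregate over the K relationships.
        attn = F.softmax((v * q).sum(-1), dim=-1)   # (R, K)
        agg = (attn.unsqueeze(-1) * v).sum(dim=1)   # (R, H)
        # Matching score per candidate region; argmax gives the predicted region.
        return self.score_fc(agg).squeeze(-1)       # (R,)

if __name__ == "__main__":
    # Toy example: 5 candidate regions, 4 relationships each.
    model = CrossModalRelationAggregation(vis_dim=256, lang_dim=300, hidden_dim=128)
    scores = model(torch.randn(5, 4, 256), torch.randn(300))
    print(scores.shape)  # torch.Size([5])
```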
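
The improved erase strategy can likewise be illustrated with a small sketch. The function below erases a random contiguous span of words from the expression during training, so the model cannot rely on a single dominant clue; the span length, placeholder token, and function name are assumptions for illustration only.

```python
import random

def erase_contiguous_words(tokens, max_span=3):
    """Hypothetical sketch: replace a random contiguous span of tokens with a
    placeholder so the model is pushed to find and use other clues."""
    if len(tokens) <= 1:
        return list(tokens)
    span = random.randint(1, min(max_span, len(tokens) - 1))
    start = random.randint(0, len(tokens) - span)
    return tokens[:start] + ["<unk>"] * span + tokens[start + span:]

# Example: erase_contiguous_words("the man left of the red car".split())
# might yield ['the', 'man', '<unk>', '<unk>', 'the', 'red', 'car'].
```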