    Citation: Xue Zhihang, Xu Zheming, Lang Congyan, Feng Songhe, Wang Tao, Li Yidong. Text-to-Image Generation Method Based on Image-Text Semantic Consistency[J]. Journal of Computer Research and Development, 2023, 60(9): 2180-2190. DOI: 10.7544/issn1000-1239.202220416


    Text-to-Image Generation Method Based on Image-Text Semantic Consistency


      Abstract: In recent years, text-to-image generation methods based on generative adversarial networks (GANs) have become a popular area of research in cross-media convergence. Text-to-image generation methods aim to improve the semantic consistency between text descriptions and generated images by extracting more representational text and image features. Most existing methods model the relationship between global image features and initial text semantic features; they ignore the limitations of the initial text features and do not fully exploit the guidance that semantically consistent generated images can provide for the text features, which reduces the representational power of the text information in text-to-image synthesis. In addition, because the dynamic interaction between generated object regions is not considered, the generation network can only roughly delineate target regions and ignores the potential correspondence between local image regions and text semantic labels. To solve these problems, a text-to-image generation method based on image-text semantic consistency, called ITSC-GAN, is proposed in this paper. The model first designs a text information enhancement module (TEM) that enhances the text information using the generated images, thereby improving the representational ability of the text features. It then proposes an image regional attention module (IRAM) that enhances the representational ability of the image features by mining the relationships between image sub-regions. Jointly utilizing the two modules yields higher consistency between local image features and text semantic labels. Finally, the model uses the generator and discriminator loss functions as constraints to improve the quality of the generated images and their semantic agreement with the text descriptions. Experimental results show that, compared with the mainstream AttnGAN model, ITSC-GAN increases the IS (inception score) by about 7.42%, decreases the FID (Fréchet inception distance) by about 28.76%, and increases the R-precision by about 14.95% on the CUB dataset. Extensive experimental results fully validate the effectiveness and superiority of the ITSC-GAN model.
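
      To make the two modules concrete, the following is a minimal PyTorch sketch of one way they could be realized, inferred only from the abstract above: IRAM as self-attention over image sub-region features, and TEM as word-level attention from text features onto generated-image features with a gated fusion. All class names (RegionalSelfAttention, TextEnhancement), layer choices, and dimensions are illustrative assumptions, not the authors' released code.

import torch
import torch.nn as nn
import torch.nn.functional as F

class RegionalSelfAttention(nn.Module):
    """Hypothetical IRAM: each image sub-region attends to all others."""
    def __init__(self, d: int):
        super().__init__()
        self.query = nn.Linear(d, d)
        self.key = nn.Linear(d, d)
        self.value = nn.Linear(d, d)
        self.scale = d ** -0.5

    def forward(self, regions: torch.Tensor) -> torch.Tensor:
        # regions: (batch, N, d) sub-region features, e.g. a flattened feature map
        q, k, v = self.query(regions), self.key(regions), self.value(regions)
        attn = F.softmax(q @ k.transpose(-2, -1) * self.scale, dim=-1)  # (batch, N, N)
        # Residual connection: keep each region's own content while adding context
        # from related regions, modeling the inter-region interaction the paper targets.
        return regions + attn @ v

class TextEnhancement(nn.Module):
    """Hypothetical TEM: enhance word features with generated-image context."""
    def __init__(self, d_text: int, d_img: int):
        super().__init__()
        self.img_proj = nn.Linear(d_img, d_text)   # map image regions into text space
        self.gate = nn.Linear(2 * d_text, d_text)  # gated fusion of text and visual cues

    def forward(self, words: torch.Tensor, regions: torch.Tensor) -> torch.Tensor:
        # words:   (batch, T, d_text) initial word features from a text encoder
        # regions: (batch, N, d_img)  sub-region features of the generated image
        img = self.img_proj(regions)                              # (batch, N, d_text)
        attn = F.softmax(words @ img.transpose(-2, -1), dim=-1)   # each word -> regions
        context = attn @ img                                      # (batch, T, d_text)
        g = torch.sigmoid(self.gate(torch.cat([words, context], dim=-1)))
        return g * words + (1 - g) * context  # gated mix of text and visual evidence

# Usage with assumed sizes: 64 sub-regions (an 8x8 map) of dim 512, 18 words of dim 256.
iram = RegionalSelfAttention(d=512)
tem = TextEnhancement(d_text=256, d_img=512)
regions = iram(torch.randn(2, 64, 512))
words = tem(torch.randn(2, 18, 256), regions)
print(regions.shape, words.shape)  # torch.Size([2, 64, 512]) torch.Size([2, 18, 256])

      The residual connection in IRAM and the sigmoid gate in TEM are common stabilizing choices for attention-based fusion; the paper's actual fusion mechanism may differ.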

       
