Citation: Xue Zhihang, Xu Zheming, Lang Congyan, Feng Songhe, Wang Tao, Li Yidong. Text-to-Image Generation Method Based on Image-Text Semantic Consistency[J]. Journal of Computer Research and Development, 2023, 60(9): 2180-2190. DOI: 10.7544/issn1000-1239.202220416

Text-to-Image Generation Method Based on Image-Text Semantic Consistency

Funds: This work was supported by the National Natural Science Foundation of China (62072027, 61872032, 62076021) and Beijing Natural Science Foundation (4202057, 4202058, 4202060).
  • Author Bio:

    Xue Zhihang: born in 1997. Master candidate. His main research interests include computer vision and image generation

    Xu Zheming: born in 1996. PhD candidate. Her main research interests include vehicle re-identification and multi-view learning

    Lang Congyan: born in 1978. PhD, professor, PhD supervisor. Member of CCF. Her main research interests include computer vision and multimedia content analysis

    Feng Songhe: born in 1981. PhD, professor, PhD supervisor. Member of CCF. His main research interests include weakly-supervised machine learning and multimedia content analysis

    Wang Tao: born in 1980. PhD, professor. His main research interests include computer vision and machine learning

    Li Yidong: born in 1982. PhD, professor. His main research interests include big data analysis, privacy preservation, information security, and data mining

  • Received Date: May 20, 2022
  • Revised Date: November 20, 2022
  • Available Online: April 13, 2023
  • In recent years, text-to-image generation based on generative adversarial networks has become a popular research area in cross-media convergence. Text-to-image generation methods aim to improve the semantic consistency between text descriptions and generated images by extracting more representative text and image features. Most existing methods model global image features against the initial text semantic features, ignoring the limitations of the initial text features and failing to exploit the generated image as guidance toward image-text semantic consistency, which weakens the representation of textual information in text-to-image synthesis. In addition, because the dynamic interaction between generated object regions is not considered, such generative networks can only roughly delineate the target region and ignore the potential correspondence between local image regions and the semantic labels of the text. To address these problems, a text-to-image generation method based on image-text semantic consistency, called ITSC-GAN, is proposed in this paper. The model first designs a text information enhancement module that enhances the text information with the generated image, thereby improving the representation of text features. Second, the model introduces an image regional attention module that strengthens the representation ability of image features by mining the relationships among image sub-regions. By jointly using the two modules, higher consistency between local image features and text semantic labels is achieved. Finally, the model uses the generator and discriminator loss functions as constraints to improve the quality of the generated images and their semantic agreement with the text description. Experimental results show that, compared with the mainstream AttnGAN model on the CUB dataset, ITSC-GAN increases the IS (inception score) by about 7.42%, decreases the FID (Fréchet inception distance) by about 28.76%, and increases the R-precision by about 14.95%. Extensive experimental results validate the effectiveness and superiority of the ITSC-GAN model.
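    To make the regional attention idea in the abstract concrete, below is a minimal sketch in PyTorch of self-attention over image sub-region features, in which each local region is refined by its affinities to all other regions. The class name, dimensions, and the residual connection are illustrative assumptions for exposition, not the authors' implementation.

        import torch
        import torch.nn as nn
        import torch.nn.functional as F

        class RegionalAttention(nn.Module):
            """Illustrative self-attention over sub-region features: each region
            is re-expressed as a relation-weighted mixture of all regions."""
            def __init__(self, dim: int):
                super().__init__()
                self.query = nn.Linear(dim, dim)   # per-region query projection
                self.key = nn.Linear(dim, dim)     # per-region key projection
                self.value = nn.Linear(dim, dim)   # per-region value projection
                self.scale = dim ** -0.5           # stabilizes softmax logits

            def forward(self, regions: torch.Tensor) -> torch.Tensor:
                # regions: (batch, num_regions, dim) local image features
                q, k, v = self.query(regions), self.key(regions), self.value(regions)
                # pairwise region-to-region affinities: (batch, num_regions, num_regions)
                attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
                # residual addition keeps each original local feature intact
                return regions + attn @ v

        # usage: an 8x8 grid of 256-d sub-region features for a batch of 2 images
        feats = torch.randn(2, 64, 256)
        refined = RegionalAttention(256)(feats)
        print(refined.shape)  # torch.Size([2, 64, 256])

    Mining region-to-region relations in this way is what lets a generator refine a target region using context from related regions, rather than delineating each region in isolation.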

  • [1]
    Antol S, Agrawal A, Lu Jiasen, et al. VQA: Visual question answering[C]// Proc of the 15th IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2015: 2425−2433
    [2]
    Vinyals O, Toshev A, Bengio S, et al. Show and tell: A neural image caption generator[C]// Proc of the 28th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2015: 3156−3164
    [3]
    Reed S, Akata Z, Yan Xinchen, et al. Generative adversarial text to image synthesis[C]// Proc of the 33rd Int Conf on Machine Learning. New York: ACM, 2016: 1060−1069
    [4]
    Goodfellow I, Pouget-Abadie J, Mirza M, et al. Generative adversarial nets[C]// Proc of the 27th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2014: 2672−2680
    [5]
    Xu Tao, Zhang Pengchuan, Huang Qiuyuan, et al. AttnGAN: Fine-grained text to image generation with attentional generative adversarial networks[C]// Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2018: 1316−1324
    [6]
    Qiao Tingting, Zhang Jing, Xu Duanqing, et al. Learn, imagine and create: Text-to-image generation from prior knowledge[C]// Proc of the 32nd Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2019: 885−895
    [7]
    Liang Jiadong, Pei Wenjie, Lu Feng. CPGAN: Content-parsing generative adversarial networks for text-to-image synthesis[C]// Proc of the 16th European Conf on Computer Vision. Berlin: Springer, 2020: 491−508
    [8]
    Cao Guimei, Xie Xuexie, Yang Wenzhe, et al. Feature-fused SSD: Fast detection for small objects[G]// SPIE 10615: Proc of the 9th Int Conf on Graphic and Image Processing. Bellingham, WA: SPIE, 2018: 381−388
    [9]
    Wah C, Branson S, Welinder P, et al. The Caltech-UCSD Birds-200−2011 dataset, CNS-TR-2011−001[R/OL]. 2011[2022-08-12].https://authors.library.caltech.edu/27452/1/CUB_200_2011.pdf
    [10]
    Lin Tsungyin, Maire M, Belongie S, et al. Microsoft COCO: Common objects in context[C]// Proc of the 13th European Conf on Computer Vision. Berlin: Springer, 2014: 740−755
    [11]
    Zhang Han, Xu Tao, Li Hongsheng, et al. StackGAN: Text to photo-realistic image synthesis with stacked generative adversarial networks[C]// Proc of the 16th IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 5908−5916
    [12]
    Zhang Han, Xu Tao, Li Hongsheng, et al. StackGAN++: Realistic image synthesis with stacked generative adversarial networks[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018, 41(8): 1947−1962
    [13]
    Qiao Tingting, Zhang Jing, Xu Duanqing, et al. MirrorGAN: Learning text-to-image generation by redescription[C]// Proc of the 32nd IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 1505−1514
    [14]
    Zhu Minfeng, Pan Pingbo, Chen Wei, et al. DM-GAN: Dynamic memory generative adversarial networks for text-to-image synthesis[C]//Proc of the 31st IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2019: 5802−5810
    [15]
    Zhu Junyan, Park T, Isola P, et al. Unpaired image-to-image translation using cycle-consistent adversarial networks[C]//Proc of the 16th IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2017: 2223−2232
    [16]
    Almahairi A, Rajeshwar S, Sordoni A, et al. Augmented cycleGAN: Learning many-to-many mappings from unpaired data[C]// Proc of the 35th Int Conf on Machine Learning. New York: ACM, 2018: 195−204
    [17]
    Gou Yuchuan, Wu Qiancheng, Li Minghao, et al. SegAttnGAN: Text to image generation with segmentation attention[J]. arXiv preprint, arXiv: 2005.12444, 2020
    [18]
    Vaswani A, Shazeer N, Parmar N, et al. Attention is all you need[C]// Proc of the 30th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2017: 5998−6008
    [19]
    Li L H, Yatskar M, Yin Da, et al. VisualBERT: A simple and performant baseline for vision and language[J]. arXiv preprint, arXiv: 1908.03557, 2019
    [20]
    Su Weijie, Zhu Xizhou, Cao Yue, et al. VL-BERT: Pre-training of generic visual-linguistic representations[J]. arXiv preprint, arXiv: 1908.08530, 2019
    [21]
    Lu Jiasen, Batra D, Parikh D, et al. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks[C]// Proc of the 32nd Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2019: 13−23
    [22]
    Li Gen, Duan Nan, Fang Yuejian, et al. Unicoder-VL: A universal encoder for vision and language by cross-modal pre-training[C]//Proc of the 34th AAAI Conf on Artificial Intelligence. Menlo Park, CA: AAAI, 2020: 11336−11344
    [23]
    Wang Zihao, Liu Xihui, Li Hongsheng, et al. CAMP: Cross-modal adaptive message passing for text-image retrieval[C]//Proc of the 17th IEEE Int Conf on Computer Vision. Piscataway, NJ: IEEE, 2019: 5763−5772
    [24]
    Zhang Han, Koh J Y, Baldridge J, et al. Cross-modal contrastive learning for text-to-image generation[C]//Proc of the 34th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2021: 833−842
    [25]
    Naveen S, Kiran M S S R, Indupriya M, et al. Transformer models for enhancing AttnGAN based text to image generation[J/OL]. Image and Vision Computing, 2021[2022-08-12].https://www.sciencedirect.com/science/article/pii/S026288562100189X
    [26]
    Devlin J, Chang Mingwei, Lee K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding[C]// Proc of the 17th Conf of the North American Chapter of the ACL: Human Language Technologies. Stroudsburg, PA: ACL, 2019: 4171−4186
    [27]
    Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners[EB/OL]. OpenAI Blog, 2019[2022-08-23].https://cdn.openai.com/better-language-models/language_models_are_unsupervised_multitask_learners.pdf
    [28]
    Szegedy C, Vanhoucke V, Ioffe S, et al. Rethinking the inception architecture for computer vision[C]//Proc of the 29th IEEE Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2016: 2818−2826
    [29]
    Krizhevsky A, Sutskever I, Hinton G E. ImageNet classification with deep convolutional neural networks[C]// Proc of the 25th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2012: 1106−1114
    [30]
    Salimans T, Goodfellow I, Zaremba W, et al. Improved techniques for training GANs[C]// Proc of the 29th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2016: 2226−2234
    [31]
    Heusel M, Ramsauer H, Unterthiner T, et al. GANs trained by a two time-scale update rule converge to a local Nash equilibrium[C]// Proc of the 30th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2017: 6626−6637
    [32]
    Zhang Zhiqiang, Fu Chen, Zhou Jinjia, et al. Text to image synthesis based on multi-perspective fusion[C/OL]// Proc of the 31st Int Joint Conf on Neural Networks. Piscataway, NJ: IEEE, 2021[2022-08-12].https://ieeexplore.ieee.org/abstract/document/9533925
    [33]
    Stap D, Bleeker M, Ibrahimi S, et al. Conditional image generation and manipulation for user-specified content[J]. arXiv preprint, arXiv: 2005.04909, 2020