Lin Meng, Dai Chengwei, Guo Tao. A Method for Generating Explanations of Offensive Memes Based on Multimodal Large Language Models[J]. Journal of Computer Research and Development, 2024, 61(5): 1206-1217. DOI: 10.7544/issn1000-1239.202330960

A Method for Generating Explanations of Offensive Memes Based on Multimodal Large Language Models

More Information
  • Author Bio:

    Lin Meng: born in 1991. PhD candidate. Her main research interests include multimodal hate speech detection and multimodal semantic understanding

    Dai Chengwei: born in 2000. Master candidate. His main research interests include model extraction and large language model distillation

    Guo Tao: born in 1974. PhD, Professor, PhD supervisor. His main research interests include cybersecurity, vulnerability analysis and risk assessment

  • Received Date: November 29, 2023
  • Revised Date: March 11, 2024
  • Available Online: March 19, 2024
  • Abstract: With the advancement of 5G technology, offensive speech has increasingly proliferated across social networks in the form of multimodal memes. The detection and interpretive generation of offensive memes therefore play a crucial role in improving the effectiveness of content moderation and maintaining a harmonious and healthy public discourse environment. Existing studies on generating interpretations of offensive memes focus solely on the target and content of the offense, neglecting the societal background knowledge and metaphorical expressions embedded in memes. This oversight prevents comprehensive and accurate interpretation of offensive memes, and thus limits the applicability of the generated interpretations. To address this challenge, we propose a method based on multimodal large language models for generating interpretations of offensive memes. By augmenting the instruction tuning data with elements such as the offense target, the offensive content, and metaphor recognition, instruction tuning effectively improves the multimodal large language model's proficiency in generating interpretations of offensive memes. The experimental results validate three key strengths of our method: first, it achieves a notable 19% improvement over baseline models on the BERTScore evaluation metric; second, its interpretations incorporate comprehensive background knowledge pertinent to offensive metaphorical expressions; third, it exhibits strong generalization when handling previously unseen meme data.
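To make the data-construction step concrete, the sketch below shows how one instruction-tuning record augmented with the offense target, the offensive content, and a metaphor cue might be assembled. It is a minimal illustration in Python; the field names, prompt wording, and record layout are assumptions made for exposition, not the paper's actual schema.

    import json

    def build_instruction_sample(meme: dict) -> dict:
        """Fold the offense target, offensive content, and metaphor cue into a
        single instruction-tuning record for a multimodal LLM. All keys are
        hypothetical placeholders, not the authors' format."""
        cues = (
            f"Offense target: {meme['offense_target']}\n"
            f"Offensive content: {meme['offense_content']}\n"
            f"Metaphor: {meme['metaphor']}"
        )
        return {
            "image": meme["image_path"],
            "instruction": (
                "Explain why this meme is offensive, drawing on the cues below "
                "and any relevant societal background knowledge.\n" + cues
            ),
            "output": meme["explanation"],  # gold human-written interpretation
        }

    if __name__ == "__main__":
        example = {
            "image_path": "memes/0001.png",
            "offense_target": "a religious minority",
            "offense_content": "mocks the group's dietary customs",
            "metaphor": "animal imagery stands in for the targeted group",
            "explanation": "The meme demeans the group by equating it with vermin ...",
        }
        print(json.dumps(build_instruction_sample(example), indent=2))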
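The BERTScore comparison reported above can be reproduced in spirit with the open-source bert-score package (pip install bert-score); the candidate and reference strings here are placeholders rather than data from the paper.

    from bert_score import score

    candidates = ["The meme ridicules the group by likening its members to pests."]
    references = ["This meme is offensive because it dehumanizes the group through pest imagery."]

    # score() returns precision, recall, and F1 tensors, one entry per
    # candidate-reference pair; F1 is the figure typically reported.
    P, R, F1 = score(candidates, references, lang="en", verbose=False)
    print(f"BERTScore F1: {F1.mean().item():.4f}")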
