Lin Meng, Dai Chengwei, Guo Tao. A Method for Generating Explanations of Offensive Memes Based on Multimodal Large Language Models[J]. Journal of Computer Research and Development, 2024, 61(5): 1206-1217. DOI: 10.7544/issn1000-1239.202330960

A Method for Generating Explanations of Offensive Memes Based on Multimodal Large Language Models

More Information
  • Author Bio:

    Lin Meng: born in 1991. PhD candidate. Her main research interests include multimodal hate speech detection and multimodal semantic understanding

    Dai Chengwei: born in 2000. Master candidate. His main research interests include model extraction and large language model distillation

    Guo Tao: born in 1974. PhD, Professor, PhD supervisor. His main research interests include cybersecurity, vulnerability analysis and risk assessment

  • Received Date: November 29, 2023
  • Revised Date: March 11, 2024
  • Available Online: March 19, 2024
  • With the advancement of 5G technology, offensive speech increasingly proliferates across social networks in the form of multimodal memes. Detecting offensive memes and generating explanations for them therefore play a crucial role in improving the effectiveness of content moderation and maintaining a harmonious and healthy public discourse environment. Existing studies on explanation generation for offensive memes focus solely on the targets and content of the offense, neglecting the societal background knowledge and metaphorical expressions embedded in memes. This oversight prevents comprehensive and accurate interpretation of offensive memes, which in turn limits the applicability of the generated explanations. To address this challenge, we propose a method based on multimodal large language models for generating explanations of offensive memes. By augmenting the instruction-tuning data with elements such as the offense target, the offense content, and metaphor recognition, instruction tuning effectively improves the multimodal large model's ability to generate explanations of offensive memes. The experimental results validate three key strengths of our method: first, it achieves a notable 19% improvement in the BERTScore evaluation metric over baseline models; second, its explanations incorporate comprehensive background knowledge about offensive metaphorical expressions; third, it generalizes well to previously unseen meme data.
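  • The abstract describes augmenting instruction-tuning data with the offense target, offense content, and metaphor recognition. As a rough illustration of that idea, the sketch below assembles one such training record; all field names, the prompt wording, and the helper function are illustrative assumptions, not the authors' actual schema.

```python
import json

def build_instruction_record(image_path, ocr_text, target, content, metaphor, explanation):
    """Assemble one instruction-tuning example for a multimodal LLM.

    The instruction is augmented with the three elements the abstract
    names: offense target, offense content, and the recognized metaphor.
    """
    instruction = (
        "Explain why this meme is offensive. "
        f"Offense target: {target}. "
        f"Offense content: {content}. "
        f"Metaphor: {metaphor}."
    )
    return {
        "image": image_path,          # path to the meme image
        "text": ocr_text,             # text overlaid on the meme
        "instruction": instruction,   # prompt augmented with the three elements
        "output": explanation,        # reference explanation used as the label
    }

# Hypothetical example record (all values invented for illustration)
record = build_instruction_record(
    "memes/0001.png",
    "example overlaid caption",
    "a protected group",
    "demeaning comparison",
    "animal metaphor",
    "The meme demeans the target group by likening it to an animal.",
)
print(json.dumps(record, indent=2, ensure_ascii=False))
```

    A corpus of such records could then be fed to any instruction-tuning pipeline for a multimodal model; the key design point from the abstract is that the prompt itself carries the target, content, and metaphor annotations rather than leaving them implicit.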

  • [1]

    Hu Songlin, Zhao Jun, Tang Jie, et al. Preface to the special issue on fake information detection[J]. Journal of Computer Research and Development, 2021, 58(7): 1351−1352 (in Chinese) doi: 10.7544/issn1000-1239.2021.qy0701
    [2]
    Kiela D, Firooz H, Mohan A, et al. The hateful memes challenge: Detecting hate speech in multimodal memes[J]. Advances in Neural Information Processing Systems, 2020, 33: 2611−2624
    [3]
    Zhang Linhao, Jin Li, Sun Xian, et al. TOT: Topology-aware optimal transport for multimodal hate detection[C]//Proc of the AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2023, 37(4): 4884−4892
    [4]
    Cao Rui, Hee MS, Kuek A, et al. Pro-Cap: Leveraging a frozen vision-language model for hateful meme detection[C]//Proc of the 31st ACM Int Conf on Multimedia. New York: ACM, 2023: 5244−5252
    [5]
    Hee M S, Chong W H, Lee R K W. Decoding the underlying meaning of multimodal hateful memes[C]//Proc of the 32nd Int Joint Conf on Artificial Intelligence. Freiburg: IJCAI, 2023: 5995−6003
    [6]
    Sharma S, Agarwal S, Suresh T, et al. What do you MEME? Generating explanations for visual semantic role labelling in memes[C]//Proc of the AAAI Conf on Artificial Intelligence. Washington, DC: AAAI, 2023, 37(8): 9763−9771
    [7]
    Scott K. Memes as multimodal metaphors: A relevance theory analysis[J]. Pragmatics & Cognition, 2021, 28(2): 277−298
    [8]
    Pramanick S, Sharma S, Dimitrov D, et al. MOMENTA: A multimodal framework for detecting harmful memes and their targets[C]//Findings of the Association for Computational Linguistics: EMNLP 2021. Stroudsburg, PA: ACL, 2021: 4439−4455
    [9]
    Zhu Ron. Enhance multimodal Transformer with external label and in-domain pretrain: Hateful meme challenge winning solution[J]. arXiv preprint, arXiv: 2012.08290, 2020
    [10]
    Yang Chuanpeng, Zhu Fuqing, Liu Guihua, et al. Multimodal hate speech detection via cross-domain knowledge transfer[C]//Proc of the 30th ACM Int Conf on Multimedia. New York: ACM, 2022: 4505−4514
    [11]
    Lee R K W, Cao Rui, Fan Ziqing, et al. Disentangling hate in online memes[C]//Proc of the 29th ACM Int Conf on Multimedia. New York: ACM, 2021: 5138−5147
    [12]
    Velioglu R, Rose J. Detecting hate speech in memes using multimodal deep learning approaches: Prize-winning solution to hateful memes challenge[J]. arXiv preprint, arXiv: 2012.12975, 2020
    [13]
    Yin Shukang, Fu Chaoyou, Zhao Sirui, et al. A survey on multimodal large language models[J]. arXiv preprint, arXiv: 2306.13549, 2023
    [14]
    Gupta T, Kembhavi A. Visual programming: Compositional visual reasoning without training[C]//Proc of the IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2023: 14953−14962
    [15]
    Shao Zhenwei, Yu Zhou, Wang Meng, et al. Prompting large language models with answer heuristics for knowledge-based visual question answering[C]//Proc of the IEEE/CVF Conf on Computer Vision and Pattern Recognition. Piscataway, NJ: IEEE, 2023: 14974−14983
    [16]
    Cao Rui, Lee R K W, Chong W H, et al. Prompting for multimodal hateful meme classification[C]//Proc of the 2022 Conf on Empirical Methods in Natural Language Processing. Stroudsburg, PA: ACL, 2022: 321−332
    [17]
    Liu Yinhan, Ott M, Goyal N, et al. RoBERTa: A robustly optimized BERT pretraining approach[J]. arXiv preprint, arXiv: 1907.11692, 2019
    [18]
    Ji Junhui, Ren Wei, Naseem U. Identifying creative harmful memes via prompt based approach[C]//Proc of the ACM Web Conf 2023. New York: ACM, 2023: 3868−3872
    [19]
    Hwang E J, Shwartz V. MemeCap: A dataset for captioning and interpreting memes[J]. arXiv preprint, arXiv: 2305.13703, 2023
    [20]
    Zhu Deyao, Chen Jun, Shen Xiaoqian, et al. MiniGPT-4: Enhancing vision-language understanding with advanced large language models[J]. arXiv preprint, arXiv: 2304.10592, 2023
    [21]
    Zhang Renrui, Han Jiaming, Liu Chris, et al. LLaMA-Adapter: Efficient fine-tuning of language models with zero-init attention[J]. arXiv preprint, arXiv: 2303.16199, 2023
    [22]
    Gao Peng, Han Jiaming, Zhang Renrui, et al. LLaMA-Adapter V2: Parameter-efficient visual instruction model[J]. arXiv preprint, arXiv: 2304.15010, 2023
    [23]
    Liu Haotian, Li Chunyuan, Wu Qingyang, et al. Visual instruction tuning[J]. Advances in Neural Information Processing Systems, 2023, 36. DOI: 10.48550/arXiv.2304.08485
    [24]
    Horawalavithana S, Munikoti S, Stewart I, et al. SciTune: Aligning large language models with scientific multimodal instructions[J]. arXiv preprint, arXiv: 2307.01139, 2023
    [25]
    Wei J, Zou Kai. EDA: Easy data augmentation techniques for boosting performance on text classification tasks[C]//Proc of the 2019 Conf on Empirical Methods in Natural Language Processing and the 9th Int Joint Conf on Natural Language Processing (EMNLP-IJCNLP). Stroudsburg, PA: ACL, 2019: 6382−6388
    [26]
    Wu Xing, Lv Shangwen, Zang Liangjun, et al. Conditional BERT contextual augmentation[C]//Proc of the 19th Int Conf on Computational Science (ICCS 2019). Berlin: Springer, 2019: 84−95
    [27]
    Kumar V, Choudhary A, Cho E. Data augmentation using pre-trained Transformer models[C]//Proc of the 2nd Workshop on Life-long Learning for Spoken Language Systems. Stroudsburg, PA: ACL, 2020: 18−26
    [28]
    Radford A, Wu J, Child R, et al. Language models are unsupervised multitask learners[J]. OpenAI Blog, 2019, 1(8): 9
    [29]
    Devlin J, Chang M W, Lee K, et al. BERT: Pre-training of deep bidirectional Transformers for language understanding[C]//Proc of the 2019 Conf of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. Stroudsburg, PA: ACL, 2019: 4171−4186
    [30]
    Lewis M, Liu Yinhan, Goyal N, et al. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension[C]//Proc of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2020: 7871−7880
    [31]
    Dai Haixing, Liu Zhengliang, Liao Wenxiong, et al. ChatAug: Leveraging ChatGPT for text data augmentation[J]. arXiv preprint, arXiv: 2302.13007, 2023
    [32]
    Radford A, Kim J W, Hallacy C, et al. Learning transferable visual models from natural language supervision[C]//Proc of Int Conf on Machine Learning. New York: PMLR, 2021: 8748−8763
    [33]
    Touvron H, Lavril T, Izacard G, et al. Llama: Open and efficient foundation language models[J]. arXiv preprint, arXiv: 2302.13971, 2023
    [34]
    Xu Bo, Li Tingting, Zheng Junzhe, et al. MET-Meme: A multimodal meme dataset rich in metaphors[C]//Proc of the 45th Int ACM SIGIR Conf on Research and Development in Information Retrieval. New York: ACM, 2022: 2887−2899
    [35]
    Dimitrov D, Ali B B, Shaar S, et al. Detecting propaganda techniques in memes[C]//Proc of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th Int Joint Conf on Natural Language Processing (ACL-IJCNLP 2021). Stroudsburg, PA: ACL, 2021: 6603−6617
    [36]
    Pramanick S, Sharma S, Dimitrov D, et al. MOMENTA: A multimodal framework for detecting harmful memes and their targets[C]//Findings of the Association for Computational Linguistics: EMNLP 2021. Stroudsburg, PA: ACL, 2021: 4439−4455
    [37]
    Cai Yitao, Cai Huiyu, Wan Xiaojun. Multi-modal sarcasm detection in Twitter with hierarchical fusion model[C]//Proc of the 57th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2019: 2506−2515
    [38]
    Li Junnan, Li Dongxu, Savarese S, et al. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models[J]. arXiv preprint, arXiv: 2301.12597, 2023
    [39]
    Papineni K, Roukos S, Ward T, et al. BLEU: A method for automatic evaluation of machine translation[C]//Proc of the 40th Annual Meeting of the Association for Computational Linguistics. Stroudsburg, PA: ACL, 2002: 311−318
    [40]
    Lin Chin-Yew. ROUGE: A package for automatic evaluation of summaries[C]//Text Summarization Branches Out. Stroudsburg, PA: ACL, 2004: 74−81
    [41]
    Zhang Tianyi, Kishore V, Wu Felix, et al. BERTScore: Evaluating text generation with BERT[J]. arXiv preprint, arXiv: 1904.09675, 2019
    [42]
    Raffel C, Shazeer N, Roberts A, et al. Exploring the limits of transfer learning with a unified text-to-text transformer[J]. The Journal of Machine Learning Research, 2020, 21(1): 5485−5551
    [43]
    Fersini E, Gasparini F, Rizzi G, et al. SemEval-2022 Task 5: Multimedia automatic misogyny identification[C]//Proc of the 16th Int Workshop on Semantic Evaluation (SemEval-2022). Stroudsburg, PA: ACL, 2022: 533−549
