Abstract:
Large Vision-Language Models (LVLMs) represent a significant advancement at the intersection of natural language processing and computer vision. By integrating pre-trained visual encoders, vision-language adapters, and large language models, LVLMs can understand both visual and textual information and generate responses in natural language, making them suitable for a range of downstream vision-language tasks such as image captioning and visual question answering. However, these models commonly exhibit hallucinations, generating descriptions that are inconsistent with the actual image content. Such hallucinations significantly limit the application of LVLMs in high-stakes domains like medical image diagnosis and autonomous driving. This survey systematically organizes and analyzes the causes, evaluation, and mitigation of hallucinations, aiming to guide research in the field and enhance the safety and reliability of LVLMs in practical applications. It begins with an introduction to the basic concepts of LVLMs and the definition and classification of hallucinations within them. It then explores the causes of hallucinations from four perspectives: training data, training tasks, visual encoding, and text generation, while also discussing the interactions among these factors. Following this, it reviews mainstream benchmarks for assessing LVLM hallucinations in terms of task settings, data construction, and evaluation metrics. It further examines hallucination mitigation techniques across five aspects: training data, visual perception, training strategies, model inference, and post-hoc correction. Finally, the survey outlines directions for future research on the cause analysis, evaluation, and mitigation of hallucinations in LVLMs.