Abstract:
Text-based person retrieval, a pivotal task in multimodal intelligent surveillance systems, aims to identify target pedestrian images in large-scale databases from free-form textual descriptions, with applications in public security and video forensics such as suspect tracking in criminal investigations and cross-camera forensic analysis. Conventional approaches largely assume perfectly aligned image-text pairs and thus overlook the pervasive presence of noisy correspondences, i.e., incorrect or ambiguous associations between visual instances and their textual annotations that inevitably arise from human annotation biases, web-crawled data impurities, or semantic granularity mismatches between localized visual attributes and holistic textual contexts. To bridge this gap, we propose semantic-informed noisy correspondence learning, a unified framework that addresses both noise identification and robust learning through two complementary mechanisms. First, a semantic-aware noise identification criterion integrates intra-modal semantic consistency with cross-modal interaction signals, enabling precise discrimination of false-positive and false-negative correspondences via adaptive thresholding. Second, a noise-robust complementary learning paradigm applies differentiated optimization strategies: positive learning with momentum contrastive alignment enforces discriminative feature learning on purified reliable pairs, while negative learning via entropy minimization suppresses overfitting to noisy subsets. Extensive experiments on three public benchmarks validate the effectiveness of the proposed approach, which achieves superior performance under both synthetic and real-world noise.
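To make the two mechanisms summarized above concrete, the following is a minimal PyTorch sketch of one plausible instantiation: an adaptive-threshold noise split that combines a cross-modal similarity signal with an intra-modal consistency signal, a momentum-contrastive positive loss on the purified pairs, and an entropy-based negative-learning loss on the suspected-noisy pairs. All function names, the hypothetical paraphrase embedding `txt_emb_intra`, loss forms, and weights are illustrative assumptions, not the paper's reference implementation.

```python
# Illustrative sketch only; the concrete losses and scoring rules are assumptions.
import torch
import torch.nn.functional as F


def split_clean_noisy(img_emb, txt_emb, txt_emb_intra, tau_scale=1.0):
    """Semantic-aware noise identification (assumed form): combine a cross-modal signal
    (image-text cosine similarity) with an intra-modal signal (similarity between the text
    embedding and a hypothetical second view, e.g. a paraphrase), then flag pairs above an
    adaptive threshold (here, a scaled batch mean) as likely clean."""
    cross = F.cosine_similarity(img_emb, txt_emb, dim=-1)        # cross-modal interaction
    intra = F.cosine_similarity(txt_emb, txt_emb_intra, dim=-1)  # intra-modal consistency
    score = 0.5 * (cross + intra)
    return score >= tau_scale * score.mean()                     # True = treated as clean


def momentum_contrastive_loss(img_emb, txt_emb_momentum, temperature=0.07):
    """Positive learning on purified pairs: InfoNCE against text features produced by a
    momentum (EMA) text encoder, with in-batch negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb_momentum = F.normalize(txt_emb_momentum, dim=-1)
    logits = img_emb @ txt_emb_momentum.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return F.cross_entropy(logits, targets)


def negative_learning_loss(img_emb, txt_emb, temperature=0.07):
    """Negative learning on suspected-noisy pairs (one assumed reading of the abstract):
    mask out the annotated, likely wrong pairing on the diagonal and minimize the entropy
    of the remaining matching distribution, so gradients are not driven by the noisy
    annotation itself."""
    sim = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t() / temperature
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    probs = sim.masked_fill(eye, float("-inf")).softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
    return entropy.mean()


if __name__ == "__main__":
    # Toy usage with random features standing in for encoder outputs.
    B, D = 8, 256
    img, txt, txt_intra, txt_mom = (torch.randn(B, D) for _ in range(4))
    clean = split_clean_noisy(img, txt, txt_intra)
    loss = torch.tensor(0.0)
    if clean.any():
        loss = loss + momentum_contrastive_loss(img[clean], txt_mom[clean])
    if (~clean).sum() > 1:
        loss = loss + 0.5 * negative_learning_loss(img[~clean], txt[~clean])
    print(float(loss))
```

The split into positive and negative branches mirrors the complementary learning idea in the abstract: reliable pairs receive a standard discriminative alignment objective, while suspected-noisy pairs receive a weaker objective that avoids reinforcing the possibly wrong annotation; the 0.5 weight and the thresholding rule are placeholders.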