Abstract:
Text-based person retrieval, a pivotal task in multimodal intelligent surveillance systems, aims to identify target pedestrian images in large-scale databases from free-form textual descriptions, with applications in public security and video forensics such as suspect tracking in criminal investigations and cross-camera forensic analysis. Conventional approaches largely assume perfectly aligned image-text pairs and thus overlook the pervasive presence of noisy correspondences, i.e., incorrect or ambiguous associations between visual instances and their textual annotations that inevitably arise from human annotation biases, web-crawled data impurities, or semantic granularity mismatches between localized visual attributes and holistic textual contexts. To bridge this gap, we propose semantic-informed noisy correspondence learning, a unified framework that addresses both noise identification and robust learning through two complementary mechanisms. First, a semantic-aware noise identification criterion integrates intra-modal semantic consistency with cross-modal interaction signals, enabling precise discrimination of false-positive and false-negative correspondences via adaptive thresholding. Second, a noise-robust complementary learning paradigm applies differentiated optimization strategies: positive learning with momentum contrastive alignment enforces discriminative feature learning on purified reliable pairs, while negative learning via entropy minimization suppresses overfitting to noisy subsets. Extensive experiments on three public benchmarks validate the effectiveness of the proposed approach, which achieves superior performance under both synthetic and real-world noise.
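To make the two mechanisms summarized above concrete, the following is a minimal PyTorch sketch of one plausible instantiation: an adaptive-threshold noise split that combines a cross-modal similarity signal with an intra-modal consistency signal, a momentum-contrastive positive loss on the purified pairs, and an entropy-based negative-learning loss on the suspected-noisy pairs. All function names, the hypothetical paraphrase embedding `txt_emb_intra`, loss forms, and weights are illustrative assumptions, not the paper's reference implementation.

```python
# Illustrative sketch only; the concrete losses and scoring rules are assumptions.
import torch
import torch.nn.functional as F


def split_clean_noisy(img_emb, txt_emb, txt_emb_intra, tau_scale=1.0):
    """Semantic-aware noise identification (assumed form): combine a cross-modal signal
    (image-text cosine similarity) with an intra-modal signal (similarity between the text
    embedding and a hypothetical second view, e.g. a paraphrase), then flag pairs above an
    adaptive threshold (here, a scaled batch mean) as likely clean."""
    cross = F.cosine_similarity(img_emb, txt_emb, dim=-1)        # cross-modal interaction
    intra = F.cosine_similarity(txt_emb, txt_emb_intra, dim=-1)  # intra-modal consistency
    score = 0.5 * (cross + intra)
    return score >= tau_scale * score.mean()                     # True = treated as clean


def momentum_contrastive_loss(img_emb, txt_emb_momentum, temperature=0.07):
    """Positive learning on purified pairs: InfoNCE against text features produced by a
    momentum (EMA) text encoder, with in-batch negatives."""
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb_momentum = F.normalize(txt_emb_momentum, dim=-1)
    logits = img_emb @ txt_emb_momentum.t() / temperature
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    return F.cross_entropy(logits, targets)


def negative_learning_loss(img_emb, txt_emb, temperature=0.07):
    """Negative learning on suspected-noisy pairs (one assumed reading of the abstract):
    mask out the annotated, likely wrong pairing on the diagonal and minimize the entropy
    of the remaining matching distribution, so gradients are not driven by the noisy
    annotation itself."""
    sim = F.normalize(img_emb, dim=-1) @ F.normalize(txt_emb, dim=-1).t() / temperature
    eye = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    probs = sim.masked_fill(eye, float("-inf")).softmax(dim=-1)
    entropy = -(probs * probs.clamp_min(1e-8).log()).sum(dim=-1)
    return entropy.mean()


if __name__ == "__main__":
    # Toy usage with random features standing in for encoder outputs.
    B, D = 8, 256
    img, txt, txt_intra, txt_mom = (torch.randn(B, D) for _ in range(4))
    clean = split_clean_noisy(img, txt, txt_intra)
    loss = torch.tensor(0.0)
    if clean.any():
        loss = loss + momentum_contrastive_loss(img[clean], txt_mom[clean])
    if (~clean).sum() > 1:
        loss = loss + 0.5 * negative_learning_loss(img[~clean], txt[~clean])
    print(float(loss))
```

The split into positive and negative branches mirrors the complementary learning idea in the abstract: reliable pairs receive a standard discriminative alignment objective, while suspected-noisy pairs receive a weaker objective that avoids reinforcing the possibly wrong annotation; the 0.5 weight and the thresholding rule are placeholders.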