Abstract:
Test-time adaptation (TTA) seeks to dynamically recalibrate a deployed model through online fine-tuning with unlabeled or sparsely labeled data, addressing performance degradation caused by distribution shift, sensor noise, illumination changes, and other real-world challenges. It has shown great promise in latency- and robustness-critical applications such as autonomous driving, remote healthcare, and video surveillance. However, existing multimodal TTA methods typically overlook the quality variance across views (low-quality or faulty view data can introduce harmful gradients) and lack mechanisms to preserve internal temporal consistency, undermining stability in dynamic scenarios. To tackle these issues, we propose CVPTA (Confidence-guided View Pruning and Temporal contrastive Attention), a unified framework comprising three core modules: 1) confidence-guided dynamic attention, which computes each view's confidence from its predictive entropy and applies a softmax over the inverted entropy scores to attenuate high-entropy views during feature fusion; 2) view pruning, which assesses image-modality quality using the Laplacian variance (for blur) and histogram skewness (for exposure), discards views below a quality threshold, and maintains a constant view count by interpolating adjacent high-quality views or injecting Gaussian noise, substantially reducing noise accumulation; 3) temporal contrastive self-supervision, which treats adjacent frames of the same view as positive pairs and non-adjacent frames as negatives, applying a contrastive loss to enforce temporal feature consistency. We evaluate CVPTA on the public Kinetics-50-C and VGGSound-C perturbation benchmarks under the same online-update settings as prior work. Results show that CVPTA improves Top-1 accuracy by approximately 2.3% and 0.7% on the two benchmarks, respectively, and retains gains of over 0.2% even under extreme noise. Ablation studies further confirm the individual and synergistic contributions of all three modules. CVPTA requires no extra annotations, integrates seamlessly with existing multimodal systems, and delivers both efficiency and robustness, offering both theoretical insight and practical value.
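As a concrete illustration of the confidence-guided dynamic attention described above, the following is a minimal sketch (not the authors' implementation): each view's predictive entropy is computed from its class probabilities, inverted, and passed through a softmax so that high-entropy views receive smaller fusion weights. The function name, tensor shapes, and the temperature parameter `tau` are illustrative assumptions.

```python
# Minimal sketch of confidence-guided view weighting (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def confidence_weighted_fusion(view_logits, view_features, tau=1.0):
    """Fuse per-view features with weights derived from inverted predictive entropy.

    view_logits:   (V, C) tensor of per-view class logits (V views, C classes)
    view_features: (V, D) tensor of per-view features to fuse
    tau:           assumed softmax temperature controlling weight sharpness
    """
    probs = F.softmax(view_logits, dim=-1)                         # per-view class probabilities
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # predictive entropy per view
    weights = F.softmax(-entropy / tau, dim=0)                     # softmax over inverted entropy
    fused = (weights.unsqueeze(-1) * view_features).sum(dim=0)     # attenuate high-entropy views
    return fused, weights
```

Under this scheme, a corrupted view whose predictions are near-uniform (high entropy) contributes little to the fused representation, matching the attenuation behavior described in the abstract.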