Abstract:
Test-time adaptation (TTA) seeks to dynamically recalibrate a deployed model through online fine-tuning with unlabeled or sparsely labeled data, addressing performance degradation caused by distribution shift, sensor noise, illumination changes, and other real-world challenges. It has shown great promise in latency- and robustness-critical applications such as autonomous driving, remote healthcare, and video surveillance. However, existing multimodal TTA methods typically overlook the quality variance across views (low-quality or faulty view data can introduce harmful gradients) and lack mechanisms to preserve internal temporal consistency, undermining stability in dynamic scenarios. To tackle these issues, we propose CVPTA (Confidence-guided View Pruning and Temporal contrastive Attention), a unified framework comprising three core modules: 1) confidence-guided dynamic attention, which computes each view's confidence from its predictive entropy and applies a softmax over the inverted entropy scores to attenuate high-entropy views during feature fusion; 2) view pruning, which assesses image-modality quality using the Laplacian variance (for blur) and histogram skewness (for exposure), discards views below a quality threshold, and maintains a constant view count by interpolating adjacent high-quality views or injecting Gaussian noise, substantially reducing noise accumulation; 3) temporal contrastive self-supervision, which treats adjacent frames of the same view as positive pairs and non-adjacent frames as negatives, applying a contrastive loss to enforce temporal feature consistency. We evaluate CVPTA on the public Kinetics-50-C and VGGSound-C perturbation benchmarks under the same online-update settings as prior work. Results show that CVPTA improves Top-1 accuracy by approximately 2.3% and 0.7% on the two benchmarks, respectively, and retains gains of over 0.2% even under extreme noise. Ablation studies further confirm the individual and synergistic contributions of all three modules. CVPTA requires no extra annotations, integrates seamlessly with existing multimodal systems, and delivers both efficiency and robustness, offering both theoretical insight and practical value.
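As a concrete illustration of the confidence-guided dynamic attention described above, the following is a minimal sketch (not the authors' implementation): each view's predictive entropy is computed from its class probabilities, inverted, and passed through a softmax so that high-entropy views receive smaller fusion weights. The function name, tensor shapes, and the temperature parameter `tau` are illustrative assumptions.

```python
# Minimal sketch of confidence-guided view weighting (illustrative, not the paper's code).
import torch
import torch.nn.functional as F

def confidence_weighted_fusion(view_logits, view_features, tau=1.0):
    """Fuse per-view features with weights derived from inverted predictive entropy.

    view_logits:   (V, C) tensor of per-view class logits (V views, C classes)
    view_features: (V, D) tensor of per-view features to fuse
    tau:           assumed softmax temperature controlling weight sharpness
    """
    probs = F.softmax(view_logits, dim=-1)                         # per-view class probabilities
    entropy = -(probs * probs.clamp_min(1e-12).log()).sum(dim=-1)  # predictive entropy per view
    weights = F.softmax(-entropy / tau, dim=0)                     # softmax over inverted entropy
    fused = (weights.unsqueeze(-1) * view_features).sum(dim=0)     # attenuate high-entropy views
    return fused, weights
```

Under this scheme, a corrupted view whose predictions are near-uniform (high entropy) contributes little to the fused representation, matching the attenuation behavior described in the abstract.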