Abstract:
To tackle the dual challenges of spatiotemporal feature redundancy and limited generalization in violence action recognition, we propose RDT-MoE, a lightweight yet effective framework that integrates Redundancy-Diminishing Tokens (RDT) with a Mixture-of-Experts (MoE) architecture. The RDT module employs spatially and temporally decoupled attention together with a dynamic Top-K token selection strategy to suppress nearly 50% of redundant background tokens, retaining only the most salient motion cues. On top of this, we introduce a LoRA-guided expert architecture trained via a two-stage paradigm: the first stage optimizes the shared representations and LoRA adapters, while the second stage freezes the backbone and fine-tunes the gating mechanism and expert composition. To further mitigate the limitations of existing datasets, such as low resolution and limited diversity, we construct a new dataset, DVAD, which is 50% larger than RWF-2000, offers higher-resolution videos, and incorporates multi-view scenarios for greater real-world representativeness. Comprehensive experiments on RWF-2000 and DVAD demonstrate that RDT-MoE surpasses VideoMAE by 2–2.34% in accuracy while reducing parameters by 50%, showcasing superior efficiency, generalization, and practical applicability.
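The dynamic Top-K token selection mentioned above can be illustrated with a minimal sketch. This is not the paper's implementation: the function name, the use of a fixed keep ratio, and the saliency scores are all assumptions for illustration; in RDT the scores would come from the decoupled spatial/temporal attention.

```python
import numpy as np

def topk_token_selection(tokens, scores, keep_ratio=0.5):
    """Hypothetical sketch: keep the top keep_ratio fraction of tokens
    by saliency score, dropping the rest as background redundancy.

    tokens: (N, D) array of token embeddings
    scores: (N,) saliency scores (assumed to come from attention)
    """
    n = tokens.shape[0]
    k = max(1, int(round(n * keep_ratio)))
    # indices of the k highest-scoring tokens
    idx = np.argsort(scores)[::-1][:k]
    # restore the original token order before returning
    idx = np.sort(idx)
    return tokens[idx], idx

# toy example: 8 tokens with hand-picked saliency scores
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))
scores = np.array([0.1, 0.9, 0.2, 0.8, 0.05, 0.7, 0.3, 0.6])
kept, idx = topk_token_selection(tokens, scores, keep_ratio=0.5)
print(idx)  # → [1 3 5 7]
```

With `keep_ratio=0.5` this retains half of the tokens, matching the roughly 50% redundancy suppression the abstract reports; in practice K could also be chosen dynamically per clip rather than as a fixed ratio.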