Abstract:
To tackle the dual challenges of spatiotemporal feature redundancy and limited generalization in violent action recognition, we present RDT-MoE, a lightweight yet effective framework that couples Redundancy-Diminishing Tokens (RDT) with a Mixture-of-Experts (MoE) architecture. The RDT module employs spatially and temporally decoupled attention to disentangle motion from appearance, and applies a dynamic Top-k token selection strategy that suppresses nearly 50% of background redundancy, retaining only the most salient motion cues for subsequent encoding. Building on these compact tokens, we develop a LoRA-guided expert system trained with a two-stage strategy: we first jointly optimize the encoder and LoRA adapters to learn strong shared representations and stable adaptation signals, then freeze the backbone and refine the gating function and expert composition, further improving robustness, convergence stability, and generalization across diverse scenes. To address the limitations of existing benchmarks, we also introduce the Diverse Violence Action Dataset (DVAD), a higher-resolution, multi-view violence dataset that is 50% larger than RWF-2000 and better reflects real-world variability. Extensive evaluations on RWF-2000 and DVAD, together with component-wise ablations, show that RDT-MoE outperforms VideoMAE by 2.00% and 2.34% in accuracy, respectively, while reducing parameters by about 50%. These results demonstrate strong efficiency, generalization, and practical applicability, supporting deployment on resource-limited surveillance devices while maintaining stable performance across diverse scenes.
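To make the dynamic Top-k token selection concrete, the following is a minimal sketch of the general idea; the scoring function, keep ratio, and tensor shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def topk_token_selection(tokens, scores, keep_ratio=0.5):
    """Keep only the top-scoring fraction of tokens (hypothetical sketch).

    tokens: (N, D) array of token embeddings
    scores: (N,) saliency scores, e.g. attention-derived motion cues
    keep_ratio: fraction of tokens retained (~50% redundancy suppressed)
    """
    k = max(1, int(round(keep_ratio * len(scores))))
    # Indices of the k highest-scoring tokens, restored to original order
    top_idx = np.sort(np.argpartition(scores, -k)[-k:])
    return tokens[top_idx], top_idx

# Toy usage: 8 tokens with 4-dim embeddings
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4])
kept, idx = topk_token_selection(tokens, scores, keep_ratio=0.5)
print(kept.shape)  # (4, 4)
print(idx)         # [0 2 4 6]
```

In a real model the saliency scores would come from the decoupled spatiotemporal attention, and k could vary per clip rather than being a fixed ratio.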