Abstract:
To tackle the dual challenges of spatiotemporal feature redundancy and limited generalization in violent action recognition, we present RDT-MoE, a lightweight yet effective framework that couples Redundancy-Diminishing Tokens (RDT) with a Mixture-of-Experts (MoE) architecture. The RDT module employs spatially and temporally decoupled attention to disentangle motion from appearance, and applies a dynamic Top-k token selection strategy that suppresses nearly 50% of background redundancy, retaining only the most salient motion cues for subsequent encoding. Building on these compact tokens, we develop a LoRA-guided expert system trained with a two-stage strategy: we first jointly optimize the encoder and LoRA adapters to learn strong shared representations and stable adaptation signals, then freeze the backbone and refine the gating function and expert composition, further improving robustness, convergence stability, and generalization across diverse scenes. To address the limitations of existing benchmarks, we also introduce the Diverse Violence Action Dataset (DVAD), a higher-resolution, multi-view violence dataset that is 50% larger than RWF-2000 and better reflects real-world variability. Extensive evaluations on RWF-2000 and DVAD, together with component-wise ablations, show that RDT-MoE outperforms VideoMAE by 2.00% and 2.34% in accuracy, respectively, while reducing parameters by about 50%. These results demonstrate strong efficiency, generalization, and practical applicability, supporting deployment on resource-limited surveillance devices while maintaining stable performance across diverse scenes.
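To make the dynamic Top-k token selection concrete, the following is a minimal sketch of the general idea; the scoring function, keep ratio, and tensor shapes are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def topk_token_selection(tokens, scores, keep_ratio=0.5):
    """Keep only the top-scoring fraction of tokens (hypothetical sketch).

    tokens: (N, D) array of token embeddings
    scores: (N,) saliency scores, e.g. attention-derived motion cues
    keep_ratio: fraction of tokens retained (~50% redundancy suppressed)
    """
    k = max(1, int(round(keep_ratio * len(scores))))
    # Indices of the k highest-scoring tokens, restored to original order
    top_idx = np.sort(np.argpartition(scores, -k)[-k:])
    return tokens[top_idx], top_idx

# Toy usage: 8 tokens with 4-dim embeddings
rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4])
kept, idx = topk_token_selection(tokens, scores, keep_ratio=0.5)
print(kept.shape)  # (4, 4)
print(idx)         # [0 2 4 6]
```

In a real model the saliency scores would come from the decoupled spatiotemporal attention, and k could vary per clip rather than being a fixed ratio.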