
    RDT-MoE: Redundancy-Diminishing Tokens Meet Mixture-of-Experts for Efficient Violent Action Recognition

    • Abstract: To address the dual challenges of spatiotemporal feature redundancy and limited generalization in violent action recognition, we propose RDT-MoE, an efficient framework that integrates a Redundancy-Diminishing Token (RDT) module with a Mixture-of-Experts (MoE) architecture. The RDT module decouples motion features via spatially and temporally separated attention and applies a dynamic Top-K token selection strategy that filters out roughly 50% of redundant background tokens, retaining only the most salient motion cues and markedly improving feature efficiency. The MoE component introduces a LoRA-based parameter-update strategy trained in two stages: the first stage trains the RDT module and the encoder to obtain LoRA parameters; the second stage initializes the expert networks from these LoRA parameters, freezes the RDT module and encoder, and fine-tunes only the gating mechanism and expert-combination strategy, improving generalization and convergence stability. To further mitigate the low resolution, limited scene diversity, and small sample sizes of existing violence datasets, we construct DVAD, a high-resolution, multi-view violent-action dataset roughly 50% larger than RWF-2000 with greater scene diversity and real-world representativeness. Experiments on RWF-2000 and DVAD show that RDT-MoE outperforms existing mainstream methods, surpassing VideoMAE in accuracy by 2.00% and 2.34% on the two datasets respectively while using about 50% fewer parameters, demonstrating strong efficiency and practical applicability.
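The Top-K token selection described above can be illustrated with a minimal NumPy sketch. This is a hypothetical toy, not the authors' implementation: the function name, the saliency scores, and the token shapes are all assumptions; in the paper the scores would come from the decoupled spatiotemporal attention.

```python
import numpy as np

def topk_token_select(tokens, scores, keep_ratio=0.5):
    # Keep the k most salient tokens, where k = round(n * keep_ratio).
    # With keep_ratio=0.5 this discards roughly half of the tokens,
    # mirroring the ~50% background-redundancy filtering in the text.
    n = tokens.shape[0]
    k = max(1, int(round(n * keep_ratio)))
    idx = np.argsort(scores)[-k:]   # indices of the k largest scores
    idx = np.sort(idx)              # restore original token order
    return tokens[idx], idx

rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 4))   # 8 tokens, 4-dim features (toy sizes)
scores = rng.standard_normal(8)        # per-token saliency scores (assumed given)
kept, idx = topk_token_select(tokens, scores, keep_ratio=0.5)
print(kept.shape)  # (4, 4): half of the tokens retained
```

Sorting the surviving indices preserves the tokens' original spatiotemporal order, which matters if downstream attention layers rely on positional structure.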
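The two-stage LoRA-to-experts scheme can likewise be sketched in miniature. The sketch below is an illustrative assumption of how such a design could look, not the paper's architecture: dimensions, the gating form, and the expert initialization are all hypothetical. Stage 1 yields a frozen backbone weight plus a trained low-rank adapter; stage 2 spawns experts from that adapter and combines them with a softmax gate while the backbone stays frozen.

```python
import numpy as np

rng = np.random.default_rng(1)
d, r, n_experts = 8, 2, 4   # toy sizes: feature dim, LoRA rank, expert count

# Stage 1 (assumed): frozen backbone weight W0 and a trained rank-r
# LoRA adapter (A, B) obtained while training the encoder.
W0 = rng.standard_normal((d, d))
A = 0.1 * rng.standard_normal((d, r))
B = 0.1 * rng.standard_normal((r, d))

# Stage 2 (assumed): each expert is a low-rank delta initialized from
# the stage-1 LoRA product; only the gate and experts would be tuned.
experts = [A @ B + 0.01 * rng.standard_normal((d, d)) for _ in range(n_experts)]
gate_w = rng.standard_normal((d, n_experts))

def moe_forward(x):
    logits = x @ gate_w
    g = np.exp(logits - logits.max())
    g = g / g.sum()                           # softmax gate weights
    delta = sum(gi * E for gi, E in zip(g, experts))
    return x @ (W0 + delta)                   # frozen base + gated expert deltas

x = rng.standard_normal(d)
y = moe_forward(x)
print(y.shape)  # (8,)
```

Initializing every expert from the shared stage-1 adapter gives the experts a common, already-useful starting point, which is one plausible reading of why the paper reports improved convergence stability.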

       
