Abstract:
Existing methods for skeleton-based human action recognition often ignore motion domain knowledge, and therefore lack the interpretability of logical decision-making that humans can understand. In this paper, we propose a novel skeleton-based human action recognition method that fuses domain knowledge with an adaptive spatio-temporal transformer to improve both recognition performance and interpretability. First, inspired by short-term motion knowledge, a temporal multi-branch structure is designed to learn and capture the characteristics of short-term sub-actions. Second, a dynamic information fusion module is proposed to learn weight vectors for the different temporal branches and then fuse multiscale short-term motion features. Finally, to learn the relationships between different sub-actions and facilitate motion information interaction between skeleton joints, a multiscale temporal convolution feature fusion module is proposed to capture long-term motion correlations by integrating long-term motion domain knowledge. Experimental evaluations are conducted on four large action datasets: NTU RGB+D, NTU RGB+D 120, FineGym, and InHARD. The results show that the recognition performance of the proposed method is superior to that of several data-driven methods, effectively improving the modelling of short-term motion features and the information interaction between skeleton joints, while providing interpretability.