Abstract:
Existing methods for skeleton-based human action recognition often ignore motion domain knowledge, and therefore lack the interpretability of logical decision-making that humans can understand. In this paper, we propose a novel skeleton-based human action recognition method that fuses domain knowledge with an adaptive spatio-temporal transformer to improve both recognition performance and interpretability. First, inspired by short-term motion knowledge, a temporal multi-branch structure is designed to learn and capture the characteristics of short-term sub-actions. Second, a dynamic information fusion module is proposed to learn weight vectors for the different temporal branches and then fuse multiscale short-term motion features. Finally, to learn the relationships between different sub-actions and facilitate motion information interaction between skeleton joints, a multiscale temporal convolution feature fusion module is proposed to capture long-term motion correlations by integrating long-term motion domain knowledge. Experimental evaluations are conducted on four large action datasets: NTU RGB+D, NTU RGB+D 120, FineGym, and InHARD. The results show that the recognition performance of the proposed method is superior to that of several data-driven methods, effectively improving the modelling of short-term motion features and the information interaction between skeleton joints, while providing interpretability.