Abstract:
Video anomaly detection is of great importance to intelligent video surveillance, public security, and related fields. With the advancement of deep learning, conventional pixel-based video anomaly detection methods have gradually exposed their vulnerability to noise and their high computational cost, while skeleton-based approaches have emerged as a research hotspot owing to their robustness and efficiency. This paper explains in detail the fundamental concepts, learning paradigms, and methodological workflows of human skeleton-based video anomaly detection, and systematically reviews the representative skeleton-based studies published in the past five years, classifying them into prediction-based methods, reconstruction-based methods, hybrid reconstruction and prediction-based methods, hybrid reconstruction and clustering-based methods, and other unclassified methods. An in-depth analysis explores the underlying principles and innovations of each type of method. Moreover, this paper summarizes the existing benchmark datasets and evaluation metrics for skeleton-based video anomaly detection and presents a comparative performance analysis of the mainstream methods across these benchmarks. Skeleton-based anomaly detection methods currently face significant challenges in areas such as skeleton feature extraction and the modeling and understanding of multi-person interaction. This paper therefore proposes scientifically grounded future directions, including robust skeleton feature extraction algorithms, multimodal information integration, and heterogeneous data fusion. It aims to identify the factors that contribute to a robust video anomaly detection framework and to analyze the open problems and applicable research directions of these methods with respect to adaptability, generalization ability, running efficiency, and other aspects.