Abstract:
The task of generating human–scene interaction motions spans multiple disciplines, including computer vision, computer graphics, and robotics. Its primary goal is to use deep learning algorithms to model and learn the relationships between humans and scenes from large-scale interaction motion data, thereby generating diverse human motions that interact with indoor scenes or the objects within them, such as obstacle-avoiding navigation, human–chair interaction, and object grasping. Compared with traditional physics-based simulation approaches, data-driven methods for interaction motion generation do not rely on physics simulation engines, offering higher computational efficiency and stronger generalization capability. As a result, these methods hold broad application prospects in areas such as game design, film production, and human–computer interaction. Nevertheless, research on human–scene interaction motion generation has not yet been the subject of a systematic review or synthesis. To address this gap, this survey systematically organizes and explains the core advances in the field. Specifically, it first introduces data representations for 3D humans and scenes. Building on this foundation, it then summarizes the different types of interaction tasks and their associated technical challenges, and presents the key characteristics of relevant benchmark datasets and evaluation metrics. Finally, it highlights the limitations of existing technical approaches and discusses potential breakthrough directions and development paths for future research.