Abstract:
Visual reinforcement learning has demonstrated enormous potential across a wide range of domains, yet current algorithms still face two core challenges: insufficient generalization and low sample efficiency. Because training data are limited while environments are diverse and complex, reinforcement learning agents often become overly reliant on features specific to the training setting, which hampers their ability to adapt to novel environments. Even minor environmental changes can substantially alter image pixels, perturbing the latent representations learned by the agent and ultimately causing its policy to fail. To learn robust representations, this paper introduces Environment Disentangled Representation Learning (EDRL), a self-supervised algorithm for visual reinforcement learning that extracts environment-invariant features and thereby improves both generalization and sample efficiency. First, periodic data augmentation combined with historical observations simulates complex environmental changes and expands the effective observation range, reducing policy bias caused by training instability. Second, robust, environment-invariant features are obtained by disentangling the latent features and isolating them through reconstruction. Finally, predicting how representations change across time steps, together with a dynamic consistency loss, keeps the representations consistent and robust. Experiments on the DMControl Generalization Benchmark (DMControl-GB) and the Distracting Control Suite (DistractingCS) validate the effectiveness of EDRL: compared with state-of-the-art methods, it improves average performance by over 15% in complex DMControl-GB scenarios and by over 20% in highly distracting DistractingCS environments.