    Visual Reinforcement Learning Representation Optimization Based on Memory Augmentation and Environment Disentanglement

      Abstract: Visual reinforcement learning has demonstrated enormous potential across a wide range of domains. However, existing algorithms still face two core challenges: insufficient generalization capability and low sample efficiency. Because training data are limited and environments are diverse and complex, reinforcement learning agents often become overly reliant on features specific to the training environment, which hampers their ability to adapt to new, unseen environments. Even minor environmental changes can substantially alter image pixels, perturbing the latent representations the agent has learned and ultimately causing its policy to fail. To learn more robust representations, this paper proposes Environment Disentangled Representation Learning (EDRL), a representation algorithm for visual reinforcement learning that uses a self-supervised framework to extract environment-invariant representations, thereby improving the agent's generalization and sample efficiency. First, periodic data augmentation combined with the fusion of historical observations simulates complex environmental changes and expands the effective observation range, reducing policy bias caused by training instability. Second, robust environment-invariant representations are isolated through representation disentanglement and reconstruction. Finally, predicting representation changes across time steps and introducing a dynamic consistency loss ensures that the representations remain consistent and robust. Experiments on the DMControl Generalization Benchmark (DMControl-GB) and the Distracting Control Suite (DistractingCS) validate the effectiveness of EDRL: compared with state-of-the-art methods, it improves average performance by more than 15% in complex DMControl-GB scenarios and by more than 20% in highly distracting DistractingCS environments.
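      The abstract names three self-supervised components: augmentation with historical-observation fusion, disentanglement via reconstruction, and a dynamic consistency loss on predicted latent transitions. Below is a minimal PyTorch-style sketch of the latter two objectives, under stated assumptions only: the network sizes, the split of the latent into a task half and an environment half, the equal loss weighting, and every identifier (EDRLSketch, split, loss) are hypothetical choices for illustration, not the paper's actual architecture; the periodic augmentation and historical-observation fusion steps are omitted.

      # Illustrative sketch (not the paper's implementation): the latent is split
      # into a task half and an environment half; a decoder reconstructs the
      # observation from the environment half, and a latent transition model
      # enforces a dynamic consistency loss on the task half across time steps.
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class EDRLSketch(nn.Module):
          def __init__(self, obs_dim=3 * 84 * 84, latent_dim=64, action_dim=6):
              super().__init__()
              # Encoder outputs both halves of the latent as one vector.
              self.encoder = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(),
                                           nn.Linear(256, 2 * latent_dim))
              # Decoder reconstructs only from the environment half, steering
              # environment-specific pixel detail away from the task half.
              self.decoder = nn.Sequential(nn.Linear(latent_dim, 256), nn.ReLU(),
                                           nn.Linear(256, obs_dim))
              # Transition model predicts the next task latent from (z_task, action).
              self.dynamics = nn.Sequential(nn.Linear(latent_dim + action_dim, 256),
                                            nn.ReLU(), nn.Linear(256, latent_dim))
              self.latent_dim = latent_dim

          def split(self, obs):
              z = self.encoder(obs)
              # First half: task-relevant (intended to be environment-invariant);
              # second half: environment factors, used only for reconstruction.
              return z[:, :self.latent_dim], z[:, self.latent_dim:]

          def loss(self, obs, action, next_obs):
              z_task, z_env = self.split(obs)
              next_z_task, _ = self.split(next_obs)
              # Disentanglement-by-reconstruction term.
              recon_loss = F.mse_loss(self.decoder(z_env), obs)
              # Dynamic consistency term: the predicted next task latent should
              # match the encoded next observation (stop-gradient on the target).
              pred_next = self.dynamics(torch.cat([z_task, action], dim=-1))
              consistency_loss = F.mse_loss(pred_next, next_z_task.detach())
              return recon_loss + consistency_loss

      # Usage with random tensors standing in for flattened 84x84 RGB frames:
      model = EDRLSketch()
      obs = torch.randn(8, 3 * 84 * 84)
      action = torch.randn(8, 6)
      next_obs = torch.randn(8, 3 * 84 * 84)
      print(model.loss(obs, action, next_obs))

      The stop-gradient on the target latent is a common stabilizer in self-supervised latent-dynamics objectives; whether EDRL uses one, and how it weights the two losses, is not stated in the abstract.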
