Active Sampling for Deep Q-Learning Based on TD-error Adaptive Correction

白辰甲; 刘鹏; 赵巍; 唐降龙

doi:10.7544/issn1000-1239.2019.20170812

Abstract: Deep reinforcement learning (DRL) is a research hotspot in artificial intelligence, and deep Q-learning is one of its representative achievements; in some domains its performance matches or exceeds that of human experts. Training a deep Q-learning agent requires a large number of samples, which are obtained through interaction between the agent and its environment. This interaction is costly: it is usually computationally intensive, and the risks it carries cannot always be avoided. To improve the sample efficiency of the experience memory in deep Q-learning, we propose an active sampling method based on TD-error adaptive correction. In deep Q-learning, updates to the storage priorities of samples lag behind updates to the Q-network parameters, so the stored priorities do not accurately reflect the true distribution of TD-error in the experience memory, and many samples are never selected for Q-network training. The proposed method uses the replay periods of samples and the state of the Q-network to build a priority bias model that estimates the true priority of each sample in the experience memory. Samples are selected according to the corrected priorities during Q-network iteration, and the bias model parameters are updated piecewise during learning. We analyze how the learning performance of the Q-network depends on the order of the bias model (the order of its polynomial features) and on the model update period, and we analyze the complexity of the algorithm. Experiments on the Atari 2600 platform show that the method increases the agent's learning speed, reduces the number of interactions between the agent and the environment, and improves both the learning results and the quality of the optimal policy.
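The abstract describes the correction step only in outline, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It assumes, for illustration only, that the priority bias is a polynomial in a normalised replay age and that it is fitted by ordinary least squares; the Q-network state that the abstract also mentions is left out. All function and parameter names (`fit_bias_model`, `corrected_priority`, `sample_indices`, `order`, `eps`) are hypothetical.

```python
import numpy as np


def poly_features(age, order):
    """Return polynomial features [1, age, age**2, ...] up to `order`."""
    age = np.asarray(age, dtype=float)
    return np.stack([age ** k for k in range(order + 1)], axis=-1)


def fit_bias_model(age, stored_priority, true_priority, order=2):
    """Fit (true - stored) priority as a polynomial in replay age.

    Assumption: the true priorities are available for a small probe set,
    e.g. by recomputing TD-errors with the current Q-network.
    """
    features = poly_features(age, order)
    target = np.asarray(true_priority, float) - np.asarray(stored_priority, float)
    weights, *_ = np.linalg.lstsq(features, target, rcond=None)
    return weights


def corrected_priority(age, stored_priority, weights, eps=1e-6):
    """Add the predicted bias to the stale stored priority, keeping it positive."""
    bias = poly_features(age, len(weights) - 1) @ weights
    return np.maximum(np.asarray(stored_priority, float) + bias, eps)


def sample_indices(priority, batch_size, rng):
    """Draw a batch with probability proportional to the given priorities."""
    prob = priority / priority.sum()
    return rng.choice(len(priority), size=batch_size, p=prob)
```

In a training loop of this kind, `fit_bias_model` would be re-run once per update period on a fresh probe set, which is one possible reading of the piecewise update mentioned in the abstract; between refits, `corrected_priority` and `sample_indices` would replace sampling from the stale stored priorities.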
