    Wu Jinjin, Liu Quan, Chen Song, Yan Yan. Averaged Weighted Double Deep Q-Network[J]. Journal of Computer Research and Development, 2020, 57(3): 576-589. DOI: 10.7544/issn1000-1239.2020.20190159


    Averaged Weighted Double Deep Q-Network


      Abstract: The instability and variability of deep reinforcement learning algorithms have an important effect on their performance. Deep Q-Network is the first algorithm to successfully combine deep neural networks with Q-learning, and it has been shown to achieve human-level control on tasks that require both rich perception of high-dimensional raw inputs and policy control. However, deep Q-Network overestimates action values, and such overestimation can degrade the agent's performance. Although double deep Q-Network was proposed to mitigate the impact of overestimation, it still suffers from underestimating action values. In some complex reinforcement learning environments, even a small estimation error may have a large impact on the learned policy. To address the overestimation of action values in deep Q-Network and their underestimation in double deep Q-Network, this paper proposes a new deep reinforcement learning framework, AWDDQN (averaged weighted double deep Q-network), which integrates a weighted double estimator into double deep Q-Network. To further reduce the estimation error of the target value, the target is generated by averaging the previously learned action-value estimates, and the number of averaged action values is determined dynamically from the temporal difference error. Experimental results show that AWDDQN effectively reduces estimation bias and improves the agent's performance in some Atari 2600 games.
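
      To make the target construction concrete, the following is a minimal NumPy sketch of the two ideas the abstract combines: a weighted blend of the single (online) and double (target) estimators, averaged over the K most recently learned estimates, with K chosen from the temporal difference error. The function names, the weighting form beta = gap / (c + gap), and the linear TD-error-to-K rule are illustrative assumptions in the spirit of weighted double Q-learning, not the paper's exact equations.

          import numpy as np

          def choose_k(td_error, k_min=1, k_max=10, tau=1.0):
              # Assumed rule: a larger TD error suggests a noisier estimate, so
              # average over more of the previously learned action values.
              # The abstract only states that K is set dynamically from the TD error.
              return int(min(k_max, k_min + abs(td_error) // tau))

          def awddqn_target(reward, gamma, next_q_online, next_q_targets, c=1.0):
              # next_q_online:  action values Q(s', .) from the online network, shape [A]
              # next_q_targets: action-value vectors for s' from the K most recently
              #                 learned target networks, each of shape [A]
              a_star = int(np.argmax(next_q_online))  # greedy action under the online net
              a_low  = int(np.argmin(next_q_online))  # lowest-valued action under the online net

              estimates = []
              for q_tgt in next_q_targets:
                  # Weight between the single estimator (prone to overestimation)
                  # and the double estimator (prone to underestimation).
                  gap  = abs(float(q_tgt[a_star] - q_tgt[a_low]))
                  beta = gap / (c + gap)
                  estimates.append(beta * float(next_q_online[a_star])
                                   + (1.0 - beta) * float(q_tgt[a_star]))

              # Averaging over the K previously learned estimates damps the
              # variance of the target, further reducing the estimation error.
              return reward + gamma * float(np.mean(estimates))

      In this form, beta approaching 1 recovers deep Q-Network's single estimator and beta approaching 0 recovers double deep Q-Network's estimator, so the blend interpolates between the two biases the abstract describes; the outer average over K past estimates then trades a little staleness for lower target variance.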

       
