A Deep Q-Network Method Based on Upper Confidence Bound Experience Sampling

Zhu Fei (朱斐); Wu Wen (吴文); Liu Quan (刘全); Fu Yuchen (伏玉琛)

doi:10.7544/issn1000-1239.2018.20180148


Abstract

Deep reinforcement learning (DRL), which combines deep learning (DL) with reinforcement learning (RL), is currently a hot topic in artificial intelligence. DRL has achieved major breakthroughs in finding optimal policies for tasks with high-dimensional inputs. To reduce the temporal correlation among observed transitions, the traditional deep Q-network uses an experience replay mechanism that samples transitions at random from the memory buffer. However, random sampling does not take into account the priority of each transition in the memory buffer. As a result, training may draw too many low-information samples while overlooking highly informative ones, which lengthens training time and yields unsatisfactory results. To address this problem, we introduce the notion of priority into the traditional deep Q-network and propose a sampling algorithm based on the upper confidence bound (UCB), in which the reward, the time step, and the number of times a sample has been drawn jointly determine its priority in the experience pool. The approach raises the selection probability of samples that have not yet been chosen, samples that carry more information, and samples that performed well, which preserves the diversity of the sampled transitions and enables the agent to select actions more effectively. Finally, simulation experiments on several Atari 2600 game environments verify the effectiveness of the algorithm.
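The abstract does not give the priority formula, so the following is only a minimal sketch of how a UCB-style score could rank transitions in a replay buffer. It assumes the classic UCB1 form (reward plus an exploration bonus that shrinks as a transition is sampled more often), which may differ from the authors' actual definition. The names `ucb_priorities`, `sample_batch`, `step`, and the constant `c` are hypothetical, and the deterministic top-k selection is a simplification of the probabilistic selection the abstract describes.

```python
import math
import random


def ucb_priorities(rewards, sample_counts, step, c=1.0):
    """Return one UCB1-style score per stored transition.

    Assumption (not from the paper): score = reward + c * sqrt(ln(step) / count).
    Transitions that have never been sampled get an infinite score so that
    they are tried first.
    """
    scores = []
    for reward, count in zip(rewards, sample_counts):
        if count == 0:
            scores.append(math.inf)
        else:
            bonus = c * math.sqrt(math.log(max(step, 1)) / count)
            scores.append(reward + bonus)
    return scores


def sample_batch(rewards, sample_counts, step, batch_size, c=1.0):
    """Pick the batch_size highest-scoring transitions and update their counts."""
    scores = ucb_priorities(rewards, sample_counts, step, c)
    # Shuffle before the stable sort so that equal scores break ties randomly.
    indices = list(range(len(scores)))
    random.shuffle(indices)
    indices.sort(key=lambda i: scores[i], reverse=True)
    chosen = indices[:batch_size]
    for i in chosen:
        sample_counts[i] += 1  # sampled transitions earn a smaller bonus next time
    return chosen
```

Under these assumptions, a transition that has never been drawn always outranks the others, and a frequently drawn transition needs a higher reward to keep being selected, which matches the abstract's stated goal of keeping the sampled transitions diverse.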
