ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2018, Vol. 55 ›› Issue (8): 1694-1705. doi: 10.7544/issn1000-1239.2018.20180148

Special Topic: 2018 Advances in the Frontiers of Data Mining

• Artificial Intelligence •

  • Supported by: 
    This work was supported by the National Natural Science Foundation of China (61303108, 61373094, 61772355), the Jiangsu College Natural Science Research Key Program (17KJA520004), the Program of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education (Jilin University) (93K172014K04), the Suzhou Industrial Application of Basic Research Program (SYG201422), the Program of the Provincial Key Laboratory for Computer Information Processing Technology (Soochow University) (KJS1524), and the China Scholarship Council Project (201606920013).

A Deep Q-Network Method Based on Upper Confidence Bound Experience Sampling

Zhu Fei1,2,3, Wu Wen1, Liu Quan1,3, Fu Yuchen1,4

  1(School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006); 2(Provincial Key Laboratory for Computer Information Processing Technology (Soochow University), Suzhou, Jiangsu 215006); 3(Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012); 4(School of Computer Science and Engineering, Changshu Institute of Technology, Changshu, Jiangsu 215500) (zhufei@suda.edu.cn)
  • Online: 2018-08-01


Abstract: Recently, deep reinforcement learning (DRL), which combines deep learning (DL) with reinforcement learning (RL), has become a hot topic in the field of artificial intelligence. Deep reinforcement learning has achieved major breakthroughs in solving optimal-policy tasks with high-dimensional inputs. To reduce the temporal correlations among observed transitions, the conventional deep Q-network (DQN) uses a sampling mechanism called experience replay, which replays transitions drawn at random from a memory buffer and thereby breaks the dependence among samples. However, uniform random sampling does not take into account the priority of each transition in the memory buffer. As a result, during network training it tends to oversample transitions that carry little information while ignoring highly informative ones, which both lengthens training time and degrades training performance. To address this problem, we introduce the notion of priority into the traditional deep Q-network and propose a prioritized sampling algorithm based on the upper confidence bound (UCB). The priority of a transition in the memory buffer, and hence its probability of being selected, is jointly determined by its reward, its time step, and the number of times it has been sampled. The proposed approach assigns higher selection probabilities to transitions that have not yet been chosen, transitions that are more informative, and transitions that led to good outcomes, which guarantees the diversity of samples and enables the agent to select actions more effectively. Finally, simulation experiments on several Atari 2600 games verify the effectiveness of the approach.
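The abstract describes the sampling rule only at a high level. As a rough illustration, the sketch below shows one way a UCB-style prioritized replay buffer could be written in Python; the class name UCBReplayBuffer, the exploration coefficient c, and the exact scoring formula (a reward-plus-recency exploitation term plus a UCB bonus that shrinks with the number of times a transition has been replayed) are illustrative assumptions, not the formula used in the paper.

```python
import math
import random
from collections import namedtuple

# A stored transition; 'step' is the time step at which it was collected.
Transition = namedtuple("Transition",
                        ["state", "action", "reward", "next_state", "done", "step"])

class UCBReplayBuffer:
    """Replay buffer whose sampling priority follows a UCB-style score.

    The score combines the transition's reward and recency (exploitation)
    with a bonus that shrinks the more often the transition has been
    replayed (exploration), so unseen and informative transitions are
    favored. NOTE: the scoring rule is an illustrative assumption, not the
    paper's exact formula.
    """

    def __init__(self, capacity, c=2.0):
        self.capacity = capacity
        self.c = c                 # exploration coefficient (assumed value)
        self.buffer = []           # stored transitions
        self.visits = []           # how many times each transition was sampled
        self.total_draws = 0       # total number of sampling events

    def push(self, *args):
        # Drop the oldest transition once the buffer is full.
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.visits.pop(0)
        self.buffer.append(Transition(*args))
        self.visits.append(0)

    def _score(self, i, latest_step):
        t = self.buffer[i]
        recency = t.step / (latest_step + 1)   # newer transitions score higher
        exploit = t.reward + recency
        explore = self.c * math.sqrt(
            math.log(self.total_draws + 1) / (self.visits[i] + 1))
        return exploit + explore

    def sample(self, batch_size):
        assert self.buffer, "replay buffer is empty"
        latest_step = max(t.step for t in self.buffer)
        self.total_draws += 1
        scores = [self._score(i, latest_step) for i in range(len(self.buffer))]
        # Turn scores into selection probabilities (softmax keeps them positive).
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        idx = random.choices(range(len(self.buffer)), weights=weights, k=batch_size)
        for i in idx:
            self.visits[i] += 1
        return [self.buffer[i] for i in idx]
```

In a DQN training loop, push() would be called after every environment step and sample(batch_size) would replace uniform random sampling when drawing the minibatch for the Q-learning update.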

Key words: reinforcement learning (RL), deep reinforcement learning (DRL), upper confidence bound, experience replay, deep Q-network (DQN)
