• A China Fine-Quality Sci-Tech Journal
  • CCF-recommended Class A Chinese journal
  • T1-class high-quality sci-tech journal in computing

A Deep Q-Network Method Based on Upper Confidence Bound Experience Sampling

Zhu Fei, Wu Wen, Liu Quan, Fu Yuchen

Zhu Fei, Wu Wen, Liu Quan, Fu Yuchen. A Deep Q-Network Method Based on Upper Confidence Bound Experience Sampling[J]. Journal of Computer Research and Development, 2018, 55(8): 1694-1705. DOI: 10.7544/issn1000-1239.2018.20180148


Funds: This work was supported by the National Natural Science Foundation of China (61303108, 61373094, 61772355), the Jiangsu College Natural Science Research Key Program (17KJA520004), the Program of the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education (Jilin University) (93K172014K04), the Suzhou Industrial Application of Basic Research Program (SYG201422), the Program of the Provincial Key Laboratory for Computer Information Processing Technology (Soochow University) (KJS1524), and a China Scholarship Council Project (201606920013).
Details
  • CLC number: TP18


  • Abstract: Deep reinforcement learning (DRL), which combines deep learning (DL) with reinforcement learning (RL), has become a hot topic in artificial intelligence. DRL has achieved major breakthroughs in solving optimal policies for tasks with high-dimensional inputs. To reduce the temporal correlation among observed transitions, the traditional deep Q-network (DQN) uses an experience-replay mechanism that samples transitions uniformly at random from a memory buffer. However, uniform random sampling ignores the priority of each transition in the buffer: during training, the network may repeatedly draw samples carrying little information while overlooking highly informative ones, which both lengthens training time and degrades the final result. To address this problem, we introduce the notion of priority into the traditional DQN and propose a prioritized sampling algorithm based on the upper confidence bound (UCB). The priority of each sample in the replay buffer is determined jointly by its reward, its time step, and the number of times it has been sampled, raising the selection probability of samples that have not yet been chosen, samples that are more informative, and samples that performed well. This guarantees the diversity of the sampled transitions and enables the agent to select actions more effectively. Simulation experiments on several Atari 2600 games verify the effectiveness of the algorithm.
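The abstract describes a replay buffer whose sampling priority combines a transition's reward with a UCB-style bonus that favors rarely sampled transitions. The sketch below illustrates the general idea only: the class name `UCBReplayBuffer`, the exploration constant `c`, and the exact priority formula (reward plus a UCB exploration bonus) are assumptions for illustration, not the paper's precise definitions.

```python
import math

class UCBReplayBuffer:
    """Illustrative sketch of a UCB-prioritized experience replay buffer.

    Priority = reward + c * sqrt(ln(total draws) / (times sampled + 1)).
    The bonus term raises the score of transitions that have rarely
    (or never) been sampled, preserving diversity in each batch.
    This formula is an assumption, not the paper's exact definition.
    """

    def __init__(self, capacity=10000, c=2.0):
        self.capacity = capacity
        self.c = c                # exploration weight (hypothetical)
        self.buffer = []          # stored (s, a, r, s', done) tuples
        self.counts = []          # how often each transition was sampled
        self.total_samples = 1    # total draws so far (>=1 so ln is defined)

    def add(self, state, action, reward, next_state, done):
        # Evict the oldest transition when the buffer is full.
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.counts.pop(0)
        self.buffer.append((state, action, reward, next_state, done))
        self.counts.append(0)

    def _ucb_score(self, i):
        reward = self.buffer[i][2]
        bonus = self.c * math.sqrt(
            math.log(self.total_samples) / (self.counts[i] + 1))
        return reward + bonus

    def sample(self, batch_size):
        # Pick the batch_size transitions with the highest UCB scores.
        scores = [self._ucb_score(i) for i in range(len(self.buffer))]
        idx = sorted(range(len(self.buffer)),
                     key=lambda i: scores[i], reverse=True)[:batch_size]
        for i in idx:
            self.counts[i] += 1
            self.total_samples += 1
        return [self.buffer[i] for i in idx]
```

On the first draw the bonus term is zero (ln 1 = 0), so selection is driven purely by reward; afterwards, transitions that have been sampled often see their scores decay relative to untouched ones.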

Metrics
  • Article views:  1745
  • HTML full-text views:  3
  • PDF downloads:  638
  • Cited by: 28
Publication history
  • Published online:  2018-07-31
