Bai Chenjia, Liu Peng, Zhao Wei, Tang Xianglong. Active Sampling for Deep Q-Learning Based on TD-error Adaptive Correction[J]. Journal of Computer Research and Development, 2019, 56(2): 262-280. DOI: 10.7544/issn1000-1239.2019.20170812

Active Sampling for Deep Q-Learning Based on TD-error Adaptive Correction

More Information
  • Published Date: January 31, 2019
  • Abstract: Deep reinforcement learning (DRL) is one of the research hotspots in artificial intelligence, and deep Q-learning is one of its representative achievements; in some domains its performance has met or exceeded the level of human experts. Training a deep Q-network requires a large number of samples, which are obtained through interaction between the agent and the environment. However, this interaction is usually computationally expensive, and the risk it entails cannot always be avoided. To address the sample-efficiency problem of deep Q-learning, we propose an active sampling method based on TD-error adaptive correction. In existing deep Q-learning methods, the update of the storage priorities in the experience memory lags behind the update of the Q-network parameters, so the stored priorities no longer reflect the true distribution of TD-errors in the experience memory and many samples are never selected for Q-network training. The proposed method uses the replay period of each sample and the state of the Q-network to build a priority bias model that estimates the real priority of every sample in the experience memory during Q-network iteration. Samples are then drawn from the experience memory according to the corrected priorities, and the bias-model parameters are adaptively updated in a piecewise manner. We analyze the complexity of the algorithm and the relationship between learning performance and both the order of the polynomial features and the update period of the model parameters. The method is verified on the Atari 2600 platform. Experimental results show that it speeds up learning, reduces the number of interactions between the agent and the environment, and improves the quality of the learned optimal policy. (A minimal illustrative code sketch of this sampling scheme appears after the reference lists below.)
  • Related Articles

    [1]Zhang Jing, Wang Ziming, Ren Yonggong. A3C Deep Reinforcement Learning Model Compression and Knowledge Extraction[J]. Journal of Computer Research and Development, 2023, 60(6): 1373-1384. DOI: 10.7544/issn1000-1239.202111186
    [2]Ma Ang, Yu Yanhua, Yang Shengli, Shi Chuan, Li Jie, Cai Xiuxiu. Survey of Knowledge Graph Based on Reinforcement Learning[J]. Journal of Computer Research and Development, 2022, 59(8): 1694-1722. DOI: 10.7544/issn1000-1239.20211264
    [3]Yu Xian, Li Zhenyu, Sun Sheng, Zhang Guangxing, Diao Zulong, Xie Gaogang. Adaptive Virtual Machine Consolidation Method Based on Deep Reinforcement Learning[J]. Journal of Computer Research and Development, 2021, 58(12): 2783-2797. DOI: 10.7544/issn1000-1239.2021.20200366
    [4]Qi Faxin, Tong Xiangrong, Yu Lei. Agent Trust Boost via Reinforcement Learning DQN[J]. Journal of Computer Research and Development, 2020, 57(6): 1227-1238. DOI: 10.7544/issn1000-1239.2020.20190403
    [5]Fan Hao, Xu Guangping, Xue Yanbing, Gao Zan, Zhang Hua. An Energy Consumption Optimization and Evaluation for Hybrid Cache Based on Reinforcement Learning[J]. Journal of Computer Research and Development, 2020, 57(6): 1125-1139. DOI: 10.7544/issn1000-1239.2020.20200010
    [6]Zhang Wentao, Wang Lu, Cheng Yaodong. Performance Optimization of Lustre File System Based on Reinforcement Learning[J]. Journal of Computer Research and Development, 2019, 56(7): 1578-1586. DOI: 10.7544/issn1000-1239.2019.20180797
    [7]Zhang Kaifeng, Yu Yang. Methodologies for Imitation Learning via Inverse Reinforcement Learning: A Review[J]. Journal of Computer Research and Development, 2019, 56(2): 254-261. DOI: 10.7544/issn1000-1239.2019.20170578
    [8]Zhao Fengfei and Qin Zheng. A Multi-Motive Reinforcement Learning Framework[J]. Journal of Computer Research and Development, 2013, 50(2): 240-247.
    [9]Lin Fen, Shi Chuan, Luo Jiewen, Shi Zhongzhi. Dual Reinforcement Learning Based on Bias Learning[J]. Journal of Computer Research and Development, 2008, 45(9): 1455-1462.
    [10]Shi Chuan, Shi Zhongzhi, Wang Maoguang. Online Hierarchical Reinforcement Learning Based on Path-matching[J]. Journal of Computer Research and Development, 2008, 45(9).
  • Cited by

    Periodical citations (12)

    1. 龚雪, 彭鹏菲, 荣里, 郑雅莲, 姜俊. Task analysis method based on deep reinforcement learning. Journal of System Simulation. 2024(07): 1670-1681.
    2. 朱永红, 余英剑, 李蔓华. Intelligent temperature control of a ceramic shuttle kiln based on an improved DQN algorithm. China Ceramic Industry. 2024(05): 33-38.
    3. 唐香蕉, 赵奕凡. Research on energy management of power battery-supercapacitor hybrid vehicles. Machinery Design & Manufacture. 2024(10): 198-202.
    4. 刘森, 李玺, 黄运. Research on NPC route planning based on an improved DQN algorithm. Radio Engineering. 2022(08): 1441-1446.
    5. 曾熠, 刘丽华, 李璇, 杜溢墨, 陈丽娜. Cooperative trajectory planning for multiple UAVs based on decision knowledge learning. Computer Systems & Applications. 2022(08): 125-132.
    6. 张建行, 刘全. Deep deterministic policy gradient method based on episodic experience replay. Computer Science. 2021(10): 37-43.
    7. 李菲, 梁振宇. A method for eliminating data-stream redundancy in multi-threaded electronic communication networks. Computer Simulation. 2021(11): 158-161+167.
    8. 杨惟轶, 白辰甲, 蔡超, 赵英男, 刘鹏. A survey of the sparse reward problem in deep reinforcement learning. Computer Science. 2020(03): 182-191.
    9. 陈建平, 周鑫, 傅启明, 高振, 付保川, 吴宏杰. A dual-network DQN algorithm based on second-order temporal-difference error. Computer Engineering. 2020(05): 78-85+93.
    10. 何金, 丁勇, 高振龙. A covert enemy-approach strategy for UAVs based on Double Deep Q Network. Electronics Optics & Control. 2020(07): 52-57.
    11. 何金, 丁勇, 杨勇, 黄鑫城. UAV path planning based on PF-DQN in unknown environments. Ordnance Industry Automation. 2020(09): 15-21.
    12. 惠庆琳. A multi-cell power allocation algorithm based on deep reinforcement learning. Technology and Market. 2020(10): 11-14.

    Other citation types (29)
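
To make the sampling scheme described in the abstract concrete, the sketch below shows a minimal prioritized replay buffer in which the stored priority of each transition is corrected, before sampling, by a polynomial model of its replay age (how many update rounds have passed since its priority was last refreshed). This is only a sketch under stated assumptions: the class and parameter names are hypothetical, and the least-squares fit of the bias-model parameters stands in for the paper's piecewise adaptive update rather than reproducing it.

    import numpy as np

    class CorrectedReplayBuffer:
        """Illustrative prioritized replay buffer whose stored priorities are
        corrected by a polynomial model of 'replay age' (update rounds since a
        priority was last refreshed). Names and parameters are hypothetical."""

        def __init__(self, capacity, poly_order=2, alpha=0.6):
            self.capacity = capacity
            self.poly_order = poly_order        # order of the polynomial bias features
            self.alpha = alpha                  # priority exponent, as in standard PER
            self.data, self.priority, self.last_update = [], [], []
            self.w = np.zeros(poly_order + 1)   # bias-model parameters, adapted online
            self.round = 0                      # counts priority-update rounds

        def add(self, transition, td_error):
            if len(self.data) >= self.capacity:                 # drop the oldest transition
                self.data.pop(0); self.priority.pop(0); self.last_update.pop(0)
            self.data.append(transition)
            self.priority.append(abs(td_error))
            self.last_update.append(self.round)

        def _corrected(self):
            # Corrected priority = stored priority + polynomial bias in replay age.
            age = self.round - np.asarray(self.last_update, dtype=float)
            feats = np.vstack([age ** k for k in range(self.poly_order + 1)])
            return np.maximum(np.asarray(self.priority) + self.w @ feats, 1e-6)

        def sample(self, batch_size):
            # Draw transitions with probability proportional to corrected priority.
            p = self._corrected() ** self.alpha
            p /= p.sum()
            idx = np.random.choice(len(self.data), batch_size, p=p)
            return idx, [self.data[i] for i in idx]

        def update(self, idx, new_td_errors):
            # Refresh stored priorities with fresh TD-errors and refit the bias
            # model to the observed drift (least squares on the age features);
            # the paper instead updates these parameters in a piecewise manner.
            ages, drifts = [], []
            for i, e in zip(idx, new_td_errors):
                ages.append(float(self.round - self.last_update[i]))
                drifts.append(abs(e) - self.priority[i])
                self.priority[i] = abs(e)
                self.last_update[i] = self.round
            A = np.vstack([np.asarray(ages) ** k for k in range(self.poly_order + 1)]).T
            self.w, *_ = np.linalg.lstsq(A, np.asarray(drifts), rcond=None)
            self.round += 1

In use, sample would provide each training batch and update would be called afterwards with the freshly computed TD-errors of the sampled transitions, so that both the stored priorities and the bias model track the current Q-network.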
