ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2018, Vol. 55 ›› Issue (8): 1694-1705. doi: 10.7544/issn1000-1239.2018.20180148

Special Issue: 2018 Special Issue on Advances in Data Mining


A Deep Q-Network Method Based on Upper Confidence Bound Experience Sampling

Zhu Fei1,2,3, Wu Wen1, Liu Quan1,3, Fu Yuchen1,4

  1 (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006); 2 (Provincial Key Laboratory for Computer Information Processing Technology (Soochow University), Suzhou, Jiangsu 215006); 3 (Key Laboratory of Symbolic Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012); 4 (School of Computer Science and Engineering, Changshu Institute of Technology, Changshu, Jiangsu 215500)
  • Online: 2018-08-01

Abstract: Recently, deep reinforcement learning (DRL), which combines deep learning (DL) with reinforcement learning (RL), has become a hot topic in the field of artificial intelligence. Deep reinforcement learning has achieved great breakthroughs in solving optimal policies for tasks with high-dimensional inputs. To remove the temporal correlation among observed transitions, deep Q-network (DQN) uses a sampling mechanism called experience replay, which replays transitions drawn at random from a memory buffer and thereby breaks the correlation among samples. However, uniform random sampling ignores the priority of the transitions stored in the memory buffer. As a result, during network training the agent is likely to sample uninformative data excessively while overlooking informative samples, which leads to longer training time and unsatisfactory training results. To solve this problem, we introduce the idea of priority into the traditional deep Q-network and put forward a prioritized sampling algorithm based on the upper confidence bound (UCB). It determines each sample's probability of being selected from the memory buffer according to its reward, time step, and sampling count. The proposed approach assigns higher selection probability to samples that have not yet been chosen, samples that are more valuable, and samples that have produced good results, which guarantees the diversity of samples and enables the agent to select actions more effectively. Finally, simulation experiments on Atari 2600 games verify the approach.
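The abstract describes replacing DQN's uniform replay sampling with a UCB-style priority computed from a transition's reward, time step, and sampling count. The sketch below illustrates the general idea only; the exact priority formula, the softmax selection, and all constants are assumptions for illustration, not the paper's actual method.

```python
import math
import random

class UCBReplayBuffer:
    """Replay buffer that samples transitions by a UCB-style score:
    an exploitation term (the transition's reward) plus an exploration
    bonus that grows for rarely sampled transitions. A hypothetical
    sketch; the paper's weighting of reward, time step, and sampling
    count is not reproduced here."""

    def __init__(self, capacity=10000, c=2.0):
        self.capacity = capacity
        self.c = c              # exploration coefficient (assumed value)
        self.buffer = []        # stored (s, a, r, s', done) transitions
        self.counts = []        # how many times each transition was sampled
        self.total_samples = 0  # total draws so far

    def add(self, state, action, reward, next_state, done):
        # Evict the oldest transition when the buffer is full.
        if len(self.buffer) >= self.capacity:
            self.buffer.pop(0)
            self.counts.pop(0)
        self.buffer.append((state, action, reward, next_state, done))
        self.counts.append(0)

    def _score(self, i):
        # UCB score: reward + c * sqrt(ln(N) / (n_i + 1)).
        # Never-sampled transitions (n_i = 0) get the largest bonus.
        reward = self.buffer[i][2]
        bonus = self.c * math.sqrt(
            math.log(self.total_samples + 1) / (self.counts[i] + 1))
        return reward + bonus

    def sample(self, batch_size):
        # Turn scores into selection probabilities via a softmax
        # (one plausible choice; the paper may use another mapping).
        scores = [self._score(i) for i in range(len(self.buffer))]
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        idx = random.choices(range(len(self.buffer)),
                             weights=weights, k=batch_size)
        for i in idx:
            self.counts[i] += 1
            self.total_samples += 1
        return [self.buffer[i] for i in idx]
```

In a DQN training loop, `sample(batch_size)` would replace the uniform draw from the replay memory; everything else (target network, TD-error loss) stays unchanged.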

Key words: reinforcement learning (RL), deep reinforcement learning (DRL), upper confidence bound, experience replay, deep Q-network (DQN)
