ISSN 1000-1239 CN 11-1777/TP

Journal of Computer Research and Development ›› 2015, Vol. 52 ›› Issue (12): 2764-2775. doi: 10.7544/issn1000-1239.2015.20148160


A Heuristic Dyna Optimizing Algorithm Using Approximate Model Representation

Zhong Shan1,2, Liu Quan1,3,5, Fu Qiming1,4, Zhang Zongzhang1, Zhu Fei1, Gong Shengrong1,2   

  1. School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; 2. School of Computer Science and Engineering, Changshu Institute of Technology, Changshu, Jiangsu 215500; 3. Collaborative Innovation Center of Novel Software Technology and Industrialization, Nanjing 210000; 4. College of Electronic & Information Engineering, Suzhou University of Science and Technology, Suzhou, Jiangsu 215006; 5. Key Laboratory of Symbol Computation and Knowledge Engineering (Jilin University), Ministry of Education, Changchun 130012
  • Online: 2015-12-01

Abstract: To address the problems of reinforcement learning within the Dyna framework, such as slow convergence, inadequate representation of the environment model, and delayed adaptation to a changed environment, this paper proposes a novel heuristic Dyna optimization algorithm based on an approximate model, HDyna-AMR, which approximates the Q value function with a linear function and solves for the optimal value function by gradient descent. HDyna-AMR consists of two phases: a learning phase and a planning phase. In the former, the algorithm approximately models the environment by interacting with it and records the frequency with which each feature appears; in the latter, the approximated environment model is used for planning, with extra rewards assigned according to the recorded feature frequencies. Additionally, the paper proves the convergence of the proposed algorithm theoretically. Experimentally, we apply HDyna-AMR to the extended Boyan Chain problem and the Mountain Car problem, and the results show that HDyna-AMR obtains an approximately optimal policy in both discrete and continuous state spaces. Furthermore, compared with Dyna-LAPS (Dyna-style planning with linear approximation and prioritized sweeping) and Sarsa(λ), HDyna-AMR converges faster and is more robust to changes in the environment.
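The two-phase scheme the abstract describes can be sketched in Python. Everything below is an illustrative assumption rather than the paper's implementation: the toy 5-state chain task, the one-hot features, the deterministic model table, and the novelty bonus of the form kappa/(1 + count) are hypothetical stand-ins for the feature-frequency heuristic and the approximate model representation that HDyna-AMR actually uses.

```python
import random

# Illustrative sketch of a two-phase Dyna-style agent with a linear Q function.
# The bonus form kappa/(1 + count), the hyperparameters, and the toy chain task
# below are assumptions for demonstration, not the paper's HDyna-AMR.

class DynaLinearQ:
    def __init__(self, n_features, n_actions, alpha=0.1, gamma=0.95, kappa=0.1):
        self.w = [[0.0] * n_features for _ in range(n_actions)]  # one weight vector per action
        self.counts = [0] * n_features   # how often each feature has appeared
        self.model = {}                  # learned deterministic model: (s, a) -> (r, s')
        self.alpha, self.gamma, self.kappa = alpha, gamma, kappa

    def q(self, phi, a):
        # linear approximation: Q(s, a) = w_a . phi(s)
        return sum(w_i * x_i for w_i, x_i in zip(self.w[a], phi))

    def update(self, phi, a, target):
        # one gradient-descent step on the squared TD error
        err = target - self.q(phi, a)
        self.w[a] = [w_i + self.alpha * err * x_i for w_i, x_i in zip(self.w[a], phi)]

    def learn(self, s, a, r, s2, featurize, n_actions):
        # learning phase: interact, record the model and feature frequencies, update Q
        phi, phi2 = featurize(s), featurize(s2)
        self.model[(s, a)] = (r, s2)
        for i, x in enumerate(phi):
            if x:
                self.counts[i] += 1
        self.update(phi, a, r + self.gamma * max(self.q(phi2, b) for b in range(n_actions)))

    def plan(self, featurize, n_actions, n_steps=15):
        # planning phase: replay modeled transitions, adding a small extra reward
        # for rarely seen features (a stand-in for the frequency heuristic)
        for _ in range(n_steps):
            (s, a), (r, s2) = random.choice(list(self.model.items()))
            phi, phi2 = featurize(s), featurize(s2)
            bonus = self.kappa / (1 + min(c for c, x in zip(self.counts, phi) if x))
            self.update(phi, a, r + bonus + self.gamma * max(self.q(phi2, b) for b in range(n_actions)))

# Toy 5-state chain: action 1 moves right, action 0 moves left, reward 1 at the right end.
random.seed(0)
N, A = 5, 2
featurize = lambda s: [1.0 if i == s else 0.0 for i in range(N)]  # one-hot features
agent = DynaLinearQ(N, A)
for _ in range(200):
    s = 0
    for _ in range(30):
        if random.random() < 0.5:  # epsilon-greedy exploration
            a = random.randrange(A)
        else:
            a = max(range(A), key=lambda b: agent.q(featurize(s), b))
        s2 = min(s + 1, N - 1) if a == 1 else max(s - 1, 0)
        r = 1.0 if s2 == N - 1 else 0.0
        agent.learn(s, a, r, s2, featurize, A)
        agent.plan(featurize, A)
        s = s2
        if r:
            break
```

After training on this toy task, moving right at state 3 should be valued above moving left, since only the right action reaches the reward; the planning phase propagates that reward back through the learned model without further interaction.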

Key words: reinforcement learning (RL), model learning, planning, function approximation, machine learning
