    Citation: Sun Hongkun, Liu Quan, Fu Qiming, Xiao Fei, Gao Long. An Optimized Dyna Architecture Algorithm with Prioritized Sweeping[J]. Journal of Computer Research and Development, 2013, 50(10): 2176-2184.

    An Optimized Dyna Architecture Algorithm with Prioritized Sweeping

    • Abstract: Sequential decision making in uncertain environments is one of the main topics in reinforcement learning; the agent's goal is to maximize the cumulative reward it obtains while interacting with its environment. Direct learning methods converge to the optimal policy inefficiently, whereas the Dyna architecture, which integrates learning and planning in parallel, improves convergence efficiency. To further improve the convergence speed and precision of the traditional Dyna architecture, the Dyna-PS algorithm is proposed and its convergence is proved theoretically. In the planning part of the Dyna architecture, the algorithm applies the idea of prioritized sweeping: states with high priority-function values are updated first, pruning the irrelevant and uninformative state updates performed in traditional value iteration and policy iteration. This raises the convergence efficiency of planning and thereby the overall performance of Dyna-architecture algorithms. Applying the algorithm to a series of classical planning problems, the experimental results show that Dyna-PS converges faster and more precisely, and is strongly robust to growth of the state space.

      Abstract: Reinforcement learning addresses sequential decision making in model-free environments, where the agent's aim is to maximize the reward accumulated while acting in its environment over an extended period of time. Direct reinforcement learning can be very slow to find the optimal policy; a commonly used remedy is to integrate learning with planning. To further improve the convergence speed and precision of Dyna-architecture algorithms, an optimized Dyna architecture algorithm with prioritized sweeping, named Dyna-PS, is proposed, together with a theoretical proof of its convergence. The key idea of Dyna-PS is to integrate the prioritized sweeping method into the Dyna architecture so that, in the planning part, states are updated in order of their priority-function values, omitting the insignificant and unrelated state updates that traditional value iteration and policy iteration perform. Experimental results on a maze scenario and a series of classical AI planning problems show that Dyna-PS converges faster and more precisely, and remains robust as the state space grows.
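    As a concrete illustration of the planning loop the abstracts describe, the sketch below implements classic Dyna-style prioritized sweeping (the Moore-Atkeson formulation, as presented by Sutton and Barto) for a deterministic tabular MDP. It is not the paper's Dyna-PS: the priority function (one-step TD-error magnitude), the hyperparameters (GAMMA, ALPHA, EPSILON, THETA, N_PLANNING), and the toy chain environment are all illustrative assumptions.

```python
import heapq
import random
from collections import defaultdict

# Illustrative hyperparameters, not taken from the paper.
GAMMA, ALPHA, EPSILON, THETA, N_PLANNING = 0.95, 0.5, 0.1, 1e-4, 10

def prioritized_sweeping(step, start, actions, episodes=50):
    """Dyna-style prioritized sweeping for a deterministic tabular MDP.

    step(s, a) -> (reward, next_state, done) is the real environment.
    """
    Q = defaultdict(float)            # action values Q[(s, a)]
    model = {}                        # learned model: (s, a) -> (r, s')
    predecessors = defaultdict(set)   # s' -> {(s, a) observed to reach s'}
    pqueue = []                       # max-heap via negated priorities

    def push(s, a, priority):
        # Queue a backup only if its TD error exceeds the threshold; stale
        # duplicates of (s, a) are harmless (they cause extra backups).
        if priority > THETA:
            heapq.heappush(pqueue, (-priority, (s, a)))

    def best(s):
        return max(Q[(s, a)] for a in actions)

    def epsilon_greedy(s):
        if random.random() < EPSILON:
            return random.choice(actions)
        top = best(s)
        return random.choice([a for a in actions if Q[(s, a)] == top])

    for _ in range(episodes):
        s, done = start, False
        while not done:
            a = epsilon_greedy(s)
            r, s2, done = step(s, a)
            model[(s, a)] = (r, s2)
            predecessors[s2].add((s, a))
            # priority = magnitude of the one-step TD error at (s, a)
            push(s, a, abs(r + GAMMA * best(s2) - Q[(s, a)]))
            # planning: back up the highest-priority pairs first
            for _ in range(N_PLANNING):
                if not pqueue:
                    break
                _, (ps, pa) = heapq.heappop(pqueue)
                pr, ps2 = model[(ps, pa)]
                Q[(ps, pa)] += ALPHA * (pr + GAMMA * best(ps2) - Q[(ps, pa)])
                # propagate the change to pairs predicted to lead into ps
                for (bs, ba) in predecessors[ps]:
                    br, _ = model[(bs, ba)]
                    push(bs, ba, abs(br + GAMMA * best(ps) - Q[(bs, ba)]))
            s = s2
    return Q

# Toy usage: a 5-state deterministic chain; reward 1 on reaching state 4.
def chain_step(s, a):
    s2 = min(s + 1, 4) if a == 1 else max(s - 1, 0)
    return (1.0 if s2 == 4 else 0.0), s2, s2 == 4

Q = prioritized_sweeping(chain_step, start=0, actions=[0, 1])
print(Q[(0, 1)], Q[(0, 0)])  # moving right should score higher than left
```

    The heap here uses lazy deletion: stale duplicates of a state-action pair stay queued and merely trigger extra harmless backups. An implementation closer to the paper would keep one queue entry per pair and use Dyna-PS's specific priority function.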

