    Shi Chuan, Shi Zhongzhi, Wang Maoguang. Online Hierarchical Reinforcement Learning Based on Path-matching[J]. Journal of Computer Research and Development, 2008, 45(9).


    Online Hierarchical Reinforcement Learning Based on Path-matching

    • 摘要 (Abstract): How to find correct subgoals online is the key problem for option-based hierarchical reinforcement learning. By analyzing the learning agent's actions at subgoals, it is found that the effective actions at a subgoal are restricted, so the problem of finding subgoals can be recast as finding the most matching action-restricted states along the agent's paths. For grid learning environments, the unique-direction value is proposed to represent this action-restriction property of subgoals, together with an automatic option-discovery algorithm based on it. Experiments show that the options generated by the unique-direction value method can significantly speed up the Q-learning algorithm; the experiments also analyze how the generating time and size of options affect the performance of Q learning.

       

      Abstract: Although reinforcement learning (RL) is an effective approach for building autonomous agents that improve their performance with experience, a fundamental problem of standard RL algorithms is that in practice they cannot solve realistic tasks in reasonable time. Hierarchical reinforcement learning (HRL) is a successful solution that decomposes the learning task into simpler subtasks and learns each of them independently. As a promising form of HRL, options are introduced as closed-loop policies over sequences of actions. A key problem for option-based HRL is to discover the correct subgoals online. By analyzing the actions of agents at subgoals, two useful properties are found: (1) subgoals are more likely to be passed through, and (2) the effective actions at subgoals are restricted. Consequently, subgoals can be regarded as the most matching action-restricted states in the agent's paths. For grid environments, the concept of the unique-direction value is proposed to denote this action-restriction property, and an option-discovery algorithm based on the unique-direction value is introduced. Experiments show that the options discovered by the unique-direction value method can speed up primitive Q learning significantly. Moreover, the experiments further analyze how the size and generating time of options affect the performance of Q learning.
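    The abstract names the unique-direction value but does not define it, so the following Python sketch only illustrates the underlying idea: along a recorded path in a grid world, the states whose effective (unblocked) actions are most restricted, such as doorways between rooms, are natural subgoal candidates. The grid encoding, helper names, and the effective-action count used as a stand-in for the unique-direction value are assumptions made for illustration, not the paper's definition.

```python
# Illustrative sketch only: a simple effective-action count stands in for the
# action-restriction property of candidate subgoals described in the abstract.

ACTIONS = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right

def effective_actions(grid, state):
    """Count the moves from `state` that are not blocked by a wall ('#')."""
    r, c = state
    count = 0
    for dr, dc in ACTIONS:
        nr, nc = r + dr, c + dc
        if 0 <= nr < len(grid) and 0 <= nc < len(grid[0]) and grid[nr][nc] != '#':
            count += 1
    return count

def candidate_subgoals(grid, path):
    """Return the states on a recorded path whose actions are most restricted,
    i.e. the 'most matching action-restricted states' of the abstract."""
    scores = {s: effective_actions(grid, s) for s in path}
    best = min(scores.values())
    return [s for s, v in scores.items() if v == best]

# A two-room grid whose doorway (2, 4) is the natural subgoal.
grid = ["#########",
        "#   #   #",
        "#       #",
        "#   #   #",
        "#########"]
path = [(2, 1), (2, 2), (2, 3), (2, 4), (2, 5), (2, 6), (2, 7)]
print(candidate_subgoals(grid, path))  # -> [(2, 4)], the doorway state
```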

       
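    The abstract reports that the discovered options speed up primitive Q learning. As a rough illustration of how a discovered option is commonly combined with Q learning, the sketch below uses a multi-step (SMDP-style) update; the paper's actual learning setup is not given in the abstract, so the environment interface and all names here are assumed.

```python
# Hedged sketch: using a discovered option (a closed-loop policy that drives
# the agent to a subgoal) alongside primitive actions in Q learning.
# The environment interface (env.step -> (state, reward, done)) and all names
# are assumptions for illustration, not the paper's implementation.
from collections import defaultdict

GAMMA, ALPHA = 0.9, 0.1

def execute_option(env, state, option_policy, is_subgoal):
    """Follow the option's internal policy until the subgoal is reached,
    accumulating the discounted reward and the number of elapsed steps."""
    total, discount, steps = 0.0, 1.0, 0
    done = False
    while not is_subgoal(state) and not done:
        state, reward, done = env.step(option_policy(state))
        total += discount * reward
        discount *= GAMMA
        steps += 1
    return state, total, steps

def smdp_q_update(Q, s, o, s_next, cum_reward, k, choices):
    """Multi-step Q update for an option that ran for k primitive steps:
    Q(s, o) += alpha * [R + gamma^k * max_a Q(s', a) - Q(s, o)]."""
    target = cum_reward + (GAMMA ** k) * max(Q[(s_next, a)] for a in choices)
    Q[(s, o)] += ALPHA * (target - Q[(s, o)])

# Usage outline: Q = defaultdict(float); after executing option `o` from state
# `s`, call smdp_q_update(Q, s, o, s_next, cum_reward, k, primitives_plus_options).
```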

