    Zhu Fei, Liu Quan, Fu Qiming, Fu Yuchen. A Least Square Actor-Critic Approach for Continuous Action Space[J]. Journal of Computer Research and Development, 2014, 51(3): 548-558.


    A Least Square Actor-Critic Approach for Continuous Action Space

      Abstract: Reinforcement learning with continuous action spaces is currently one of the most challenging and difficult research topics in the field. Conventional reinforcement learning algorithms are usually designed for problems with small-scale, discrete action spaces. For problems with continuous action spaces, most approaches discretize the continuous space using prior information and then search for the optimal policy. In many practical applications, however, action spaces are continuous and little prior information is available for discretizing them appropriately. To address this problem, we propose a least square actor-critic algorithm (LSAC) for continuous action spaces, which uses function approximators to represent the value function and the policy, and applies an online least squares method to solve for the parameters of the approximate value function and the approximate policy, with the approximate value function acting as the critic that guides the solution of the approximate policy parameters. We applied LSAC to the cart-pole balancing problem and the mountain car problem, both of which have continuous action spaces, and compared the results with those of the Cacla (continuous actor-critic learning automaton) algorithm and the eNAC (episodic natural actor-critic) algorithm. The experimental results show that LSAC solves continuous-action-space problems effectively and achieves better execution performance.
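
      The abstract describes the approach only at a high level; the full update rules are given in the paper itself. As a rough, self-contained sketch of the general idea (not the authors' exact LSAC updates), the Python snippet below combines linear function approximation over radial basis features, a recursive-least-squares (LSTD-style) critic, and a Cacla-flavored least-squares actor update on a toy one-dimensional task. Every name, constant, and the toy dynamics here (rbf_features, toy_step, SIGMA, and so on) is an illustrative assumption, not taken from the paper.

      # Minimal sketch of a least-squares actor-critic with a Gaussian policy.
      # Assumptions throughout: RBF features, a toy 1-D task, Cacla-style actor rule.
      import numpy as np

      rng = np.random.default_rng(0)

      # Feature map: radial basis functions over a 1-D state space (assumed).
      CENTERS = np.linspace(-1.0, 1.0, 10)
      WIDTH = 0.2

      def rbf_features(s):
          return np.exp(-((s - CENTERS) ** 2) / (2 * WIDTH ** 2))

      N = len(CENTERS)
      GAMMA = 0.95     # discount factor (assumed)
      SIGMA = 0.3      # fixed exploration noise of the Gaussian policy (assumed)

      # Critic: recursive least squares (RLS) estimate of V(s) = w . phi(s).
      w = np.zeros(N)
      P_c = np.eye(N) * 10.0      # inverse-covariance matrix for the critic RLS

      # Actor: least-squares fit of the policy mean mu(s) = theta . phi(s).
      theta = np.zeros(N)
      P_a = np.eye(N) * 10.0      # inverse-covariance matrix for the actor RLS

      def toy_step(s, a):
          """Toy 1-D dynamics: the action nudges the state; reward is -|s'|."""
          a = np.clip(a, -1.0, 1.0)
          s_next = np.clip(s + 0.1 * a, -1.0, 1.0)
          return s_next, -abs(s_next)

      s = rng.uniform(-1.0, 1.0)
      for t in range(5000):
          phi = rbf_features(s)
          a = theta @ phi + SIGMA * rng.standard_normal()   # sample Gaussian policy
          s_next, r = toy_step(s, a)
          phi_next = rbf_features(s_next)

          # Critic: one RLS step toward the TD fixed point, using the feature
          # difference phi - gamma * phi_next as the regressor (LSTD-style).
          x = phi - GAMMA * phi_next
          k = P_c @ x / (1.0 + x @ P_c @ x)
          delta = r + GAMMA * (w @ phi_next) - (w @ phi)    # TD error
          w = w + k * delta
          P_c = P_c - np.outer(k, x @ P_c)

          # Actor: if the TD error is positive, the sampled action did better
          # than expected, so regress the policy mean toward it by RLS
          # (a Cacla-flavored rule, used here as a stand-in for the paper's).
          if delta > 0:
              k = P_a @ phi / (1.0 + phi @ P_a @ phi)
              theta = theta + k * (a - theta @ phi)
              P_a = P_a - np.outer(k, phi @ P_a)

          s = s_next

      # Inspect the learned policy mean: it should push the state toward 0.
      for s_test in (-0.8, 0.0, 0.8):
          print(f"mu({s_test:+.1f}) = {theta @ rbf_features(s_test):+.3f}")

      Two details of this sketch mirror the ideas in the abstract: both the critic and the actor are solved by (recursive) least squares rather than stochastic gradient steps, and the critic's TD error is what gates the actor update, so the value-function estimate acts as the critic guiding the policy parameters. The positive-TD-error gating is borrowed from Cacla, one of the baselines cited above; the paper's own actor update may differ.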

       
