Reinforcement learning with continuous action spaces is one of the most challenging open problems in the field. Conventional reinforcement learning algorithms are typically designed for small-scale problems with discrete action spaces. For problems with continuous action spaces, most approaches discretize the space using prior information and then search for an optimal solution. In many practical applications, however, action spaces are continuous and little prior information is available for discretizing them appropriately. To address this problem, we propose a least-squares actor-critic algorithm (LSAC) for continuous action spaces, which uses function approximation to represent the value function and the policy, and applies an online least-squares method to estimate the parameters of both; the approximate value function serves as the critic that guides the estimation of the policy parameters. We applied LSAC to the cart-pole balancing problem and the mountain-car problem, both of which have continuous action spaces, and compared the results with those of two classic algorithms, Cacla (continuous actor-critic learning automaton) and eNAC (episodic natural actor-critic). The experimental results show that LSAC solves continuous-action-space problems well and achieves better performance than both baselines.
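The general scheme described above can be sketched in a few lines: a linear critic fitted by accumulating least-squares normal equations, whose TD error guides a least-squares fit of a continuous-action policy. Everything concrete below is an illustrative assumption rather than the paper's formulation: the polynomial feature map `phi`, the one-dimensional toy task, the constants, and the positive-TD acceptance rule for actor updates (borrowed from Cacla-style learning, since the abstract does not spell out LSAC's actor update).

```python
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9  # discount factor (illustrative choice)
d = 3        # number of features

def phi(s):
    # Illustrative polynomial basis; the paper's basis functions are not specified here.
    return np.array([1.0, s, s * s])

# Critic: linear value function V(s) = w @ phi(s), fitted by accumulating
# least-squares normal equations (LSTD-style).
A = np.eye(d) * 1e-3  # small ridge term keeps A invertible early on
b = np.zeros(d)
# Actor: deterministic mean theta @ phi(s) plus Gaussian exploration; theta is
# fitted by least squares on actions the critic judged better than expected.
P = np.eye(d) * 1e-3
q = np.zeros(d)
theta = np.zeros(d)

for episode in range(300):
    s = rng.uniform(-1.0, 1.0)
    for step in range(10):
        f = phi(s)
        a = float(theta @ f) + rng.normal(0.0, 0.3)  # exploratory continuous action
        r = -(a - 0.5 * s) ** 2                      # toy reward: best action is 0.5*s
        s_next = rng.uniform(-1.0, 1.0)              # toy random transition
        f_next = phi(s_next)

        # Critic least-squares update: accumulate normal equations, then solve.
        A += np.outer(f, f - gamma * f_next)
        b += f * r
        w = np.linalg.solve(A, b)

        # TD error from the critic guides the actor update.
        delta = r + gamma * float(w @ f_next) - float(w @ f)
        if delta > 0:  # action outperformed the value estimate: regress toward it
            P += np.outer(f, f)
            q += f * a
            theta = np.linalg.solve(P, q)
        s = s_next

# After training, the policy mean should track the target action 0.5*s.
for s in (-0.8, 0.0, 0.8):
    print(f"s={s:+.1f}  policy mean={float(theta @ phi(s)):+.3f}  target={0.5 * s:+.3f}")
```

On this toy task the learned policy mean approaches the target 0.5*s, showing the division of labor: the critic is solved in closed form from accumulated statistics rather than by gradient steps, and the actor's parameters are likewise obtained by a least-squares solve over the critic-approved actions.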