    Liu Quan, Fu Qiming, Yang Xudong, Jing Ling, Li Jin, Li Jiao. A Scalable Parallel Reinforcement Learning Method Based on Intelligent Scheduling[J]. Journal of Computer Research and Development, 2013, 50(4): 843-851.

    A Scalable Parallel Reinforcement Learning Method Based on Intelligent Scheduling


    Abstract: To address the “curse of dimensionality” problem of reinforcement learning in large or continuous state spaces, a scalable reinforcement learning method based on intelligent scheduling, IS-SRL, is proposed on the basis of a divide-and-conquer strategy, and its convergence is proved. In this method, a learning problem with a large or continuous state space is divided into smaller subproblems so that each subproblem can be loaded into memory and learned independently. After one cycle of learning, the current subproblem is swapped out and the next is swapped in to continue the learning process. Subproblems exchange information during swapping so that the overall learning process eventually converges to the optimal solution. Because the order in which subproblems are executed significantly affects learning efficiency, an efficient scheduling algorithm is proposed that exploits the distribution of value-function backups in reinforcement learning and weights the priorities of multiple scheduling strategies. This scheduling algorithm ensures that computation is focused on the regions of the problem space expected to be maximally productive. To further expedite learning, a parallel scheduling architecture is proposed that flexibly allocates learning tasks among learning agents; blending this architecture into IS-SRL yields a parallel variant, IS-SPRL. Experimental results show that learning based on this scheduling architecture converges faster and scales well.
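    The abstract gives no pseudocode, so the following minimal Python sketch only illustrates the kind of weighted-priority block scheduling it describes: state-space blocks are swapped in by priority, learned for one cycle, and re-ranked after exchanging boundary information. Every name here (Block, learn, exchange_boundary, the toy strategies) is hypothetical rather than taken from the paper.

        import heapq
        import random

        class Block:
            """Hypothetical stand-in for one partition of the state space."""
            def __init__(self, name):
                self.name = name
                self.error = 1.0  # mock Bellman error; shrinks as the block is learned

            def learn(self):
                # One learning cycle while the block is in memory; return the largest
                # value-function change (Bellman error) observed during the cycle.
                self.error *= random.uniform(0.5, 0.9)
                return self.error

            def exchange_boundary(self):
                # Placeholder for passing updated boundary values to neighbor
                # blocks when the block is swapped back out.
                pass

        def schedule(blocks, strategies, weights, eps=1e-3, max_sweeps=10000):
            """Swap blocks in and out by a weighted combination of priorities."""
            def priority(b):
                # Weighted sum over several scheduling strategies, mirroring the
                # abstract's "weighting the priorities of multiple scheduling
                # strategies".
                return sum(w * s(b) for w, s in zip(weights, strategies))

            last_delta = {i: float("inf") for i in range(len(blocks))}
            heap = [(-priority(b), i) for i, b in enumerate(blocks)]
            heapq.heapify(heap)

            for _ in range(max_sweeps):
                _, i = heapq.heappop(heap)           # swap most promising block in
                last_delta[i] = blocks[i].learn()    # one learning cycle
                blocks[i].exchange_boundary()        # share values, then swap out
                heapq.heappush(heap, (-priority(blocks[i]), i))
                if max(last_delta.values()) < eps:   # every block is nearly stable
                    break

        if __name__ == "__main__":
            random.seed(0)
            blocks = [Block("b%d" % k) for k in range(4)]
            # Two toy strategies: largest recent error first, plus a little
            # noise standing in for any secondary strategy.
            strategies = [lambda b: b.error, lambda b: random.random()]
            schedule(blocks, strategies, weights=[0.8, 0.2])
            print([round(b.error, 4) for b in blocks])

    In the actual method the priority would presumably reflect the magnitude of pending value-function backups in each block, so that learning concentrates where it is expected to be most productive; the parallel IS-SPRL variant would dispatch the top-ranked blocks to several agents at once.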

       

