Li Junwei, Liu Quan, Huang Zhigang, Xu Yapeng. A Diversity-Enriched Option-Critic Algorithm with Interest Functions[J]. Journal of Computer Research and Development.

## A Diversity-Enriched Option-Critic Algorithm with Interest Functions

Abstract: As a common temporal-abstraction method in hierarchical reinforcement learning, the Option framework allows agents to learn policies at different time scales and can effectively address sparse-reward problems. To ensure that options guide the agent to visit more of the state space, some methods improve the diversity of intra-option policies by introducing mutual-information-based intrinsic rewards and termination functions. However, this leads to slow learning and poor knowledge transfer of the intra-option policies, seriously degrading algorithm performance. To address these problems, a diversity-enriched option-critic algorithm with interest functions (DEOC-IF) is proposed. Building on the diversity-enriched option-critic (DEOC) algorithm, DEOC-IF constrains the upper-level policy's selection among intra-option policies through an interest function. This both preserves the diversity of the option set and lets the learned intra-option policies focus on different regions of the state space, which improves the algorithm's knowledge-transfer ability and accelerates learning. In addition, DEOC-IF introduces a new update gradient for the interest function, which improves the algorithm's exploration ability. To verify the effectiveness and option reusability of the algorithm, DEOC-IF is compared with other recent algorithms in the Four-Rooms navigation task, MuJoCo, and MiniWorld environments. Experimental results show that DEOC-IF achieves better performance and option reusability than the compared algorithms.
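The core mechanism described above, an interest function gating which intra-option policies the upper-level policy may select in a given state, can be sketched as follows. This is a minimal illustration under assumed conventions (the function name, the per-option interest values in [0, 1], and the multiply-then-renormalize weighting are assumptions for exposition, not the paper's implementation):

```python
import numpy as np

def option_selection_probs(base_probs, interests):
    """Weight the policy-over-options by per-option interest values.

    base_probs: probabilities the upper-level policy assigns to each
        option in the current state (sums to 1).
    interests: interest-function outputs in [0, 1], one per option;
        an interest near 0 effectively removes that option from
        consideration in this state, focusing it on other regions.
    Returns the renormalized selection distribution.
    """
    weighted = base_probs * interests
    return weighted / weighted.sum()
```

For example, if two options are equally likely under the base policy but the interest function assigns the second option zero interest in the current state, the agent selects the first option with probability 1, so each option specializes in the states where its interest is high.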
