Li Junwei, Liu Quan, Huang Zhigang, Xu Yapeng. A Diversity-Enriched Option-Critic Algorithm with Interest Functions[J]. Journal of Computer Research and Development, 2024, 61(12): 3108-3120. DOI: 10.7544/issn1000-1239.202220970

A Diversity-Enriched Option-Critic Algorithm with Interest Functions

Funds: This work was supported by the National Natural Science Foundation of China (62376179, 61772355, 61702055, 61876217, 62176175) and the Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions.
More Information
  • Author Bio:

    Li Junwei: born in 1998. Master candidate. His main research interests include reinforcement learning and hierarchical reinforcement learning

    Liu Quan: born in 1969. PhD, professor, PhD supervisor. Senior member of CCF. His main research interests include intelligent information processing, automated reasoning, and machine learning

    Huang Zhigang: born in 1993. PhD candidate. His main research interests include reinforcement learning, deep reinforcement learning, and hierarchical reinforcement learning

    Xu Yapeng: born in 1996. Master candidate. His main research interests include deep reinforcement learning and hierarchical reinforcement learning

  • Received Date: November 22, 2022
  • Revised Date: January 02, 2024
  • Accepted Date: March 05, 2024
  • Available Online: March 06, 2024
  • As a common temporal-abstraction method in hierarchical reinforcement learning, the option framework allows agents to learn policies at different time scales and can effectively address sparse-reward problems. To ensure that options guide the agent toward a larger portion of the state space, some methods improve option diversity by introducing mutual-information terms into the intrinsic reward and the termination function. However, this slows learning and weakens the transferability of the learned intra-option policies, which seriously degrades overall performance. To address these problems, a diversity-enriched option-critic algorithm with interest functions (DEOC-IF) is proposed. Building on the diversity-enriched option-critic (DEOC) algorithm, DEOC-IF introduces interest functions to constrain how the higher-level policy selects among the intra-option policies. This not only preserves the diversity of the option set but also makes the learned intra-option policies specialize in different regions of the state space, which improves the algorithm's knowledge-transfer ability and accelerates learning. In addition, DEOC-IF adopts a new update gradient for the interest functions, which improves exploration. To verify the effectiveness and option reusability of the algorithm, comparison experiments are carried out on the four-room navigation task, MuJoCo, and MiniWorld. Experimental results show that DEOC-IF achieves better performance and option reusability than competing algorithms.
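As a rough illustration of the interest-function mechanism summarized in the abstract, the sketch below shows how a policy over options can be reweighted by per-option interest values so that each option is selected mainly in the states it declares interest in. This follows the interest-weighted option selection of the interest-option-critic framework (reference [26]) that DEOC-IF builds on; the function and variable names are illustrative rather than taken from the paper's implementation, and DEOC's mutual-information diversity term and the paper's new interest-function gradient are not shown.

```python
import numpy as np

def interest_weighted_option_policy(option_logits, interest_values):
    """Reweight a base policy over options by per-option interest values:
    pi_I(omega | s) is proportional to I(omega, s) * pi_Omega(omega | s)."""
    # Softmax over logits: the base policy over options pi_Omega(. | s).
    base_probs = np.exp(option_logits - option_logits.max())
    base_probs /= base_probs.sum()
    # Multiply by the interest values I(omega, s) in (0, 1) and renormalize,
    # so options with high interest in state s are selected more often.
    weighted = interest_values * base_probs
    return weighted / weighted.sum()

# Illustrative values: 4 options; options 0 and 1 declare high interest in this state.
logits = np.array([0.5, -0.2, 0.1, 0.3])
interest = np.array([0.9, 0.8, 0.1, 0.05])
print(interest_weighted_option_policy(logits, interest))
```

Because the interest values concentrate each option's selection on a subset of states, each intra-option policy only needs to be accurate in its own region, which is what the paper argues makes the learned options easier to reuse and transfer.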

  • [1]
    Sutton R S, Barto A G. Reinforcement Learning: An Introduction[M]. Cambridge, MA: MIT, 2018
    [2]
    刘全,翟建伟,章宗长,等. 深度强化学习综述[J]. 计算机学报,2018,41(1):1−27 doi: 10.11897/SP.J.1016.2019.00001

    Liu Quan, Zhai Jianwei, Zhang Zongzhang, et al. A survey on deep reinforcement learning[J]. Chinese Journal of Computers, 2018, 41(1): 1−27 (in Chinese) doi: 10.11897/SP.J.1016.2019.00001
    [3]
    刘建伟,高峰,罗雄麟. 基于值函数和策略梯度的深度强化学习综述[J]. 计算机学报,2019,42(6):1406−1438 doi: 10.11897/SP.J.1016.2019.01406

    Liu Jianwei, Gao Feng, Luo Xionglin. Survey of deep reinforcement learning based on value function and policy gradient[J]. Chinese Journal of Computers, 2019, 42(6): 1406−1438 (in Chinese) doi: 10.11897/SP.J.1016.2019.01406
    [4]
    Mnih V, Kavukcuoglu K, Silver D, et al. Human-level control through deep reinforcement learning[J]. Nature, 2015, 518(7540): 529−533 doi: 10.1038/nature14236
    [5]
    Lillicrap T P, Hunt J J, Pritzel A, et al. Continuous control with deep reinforcement learning[J]. arXiv preprint, arXiv: 1509.02971, 2015
    [6]
    Fujimoto S, van Hoof H, Meger D. Addressing function approximation error in actor-critic methods[C]//Proc of the 35th Int Conf on Machine Learning. New York: ACM, 2018: 1582−1591
    [7]
    Schulman J, Levine S, Abbeel P, et al. Trust region policy optimization[C]//Proc of the 32nd Int Conf on Machine Learning. New York: ACM, 2015: 1889−1897
    [8]
    Haarnoja T, Zhou A, Abbeel P, et al. Soft actor-critic: Off-policy maximum entropy deep reinforcement learning with a stochastic actor[C]//Proc of the 35th Int Conf on Machine Learning. New York: ACM, 2018: 1861−1870
    [9]
    赖俊,魏竞毅,陈希亮. 分层强化学习综述[J]. 计算机工程与应用,2021,57(3):72−79 doi: 10.3778/j.issn.1002-8331.2010-0038

    Lai Jun, Wei Jingyi, Chen Xiliang. Overview of hierarchical reinforcement learning[J]. Computer Engineering and Applications, 2021, 57(3): 72−79 (in Chinese) doi: 10.3778/j.issn.1002-8331.2010-0038
    [10]
    Sutton R S, Precup D, Singh S. Between MDPs and semi-MDPs: A framework for temporal abstraction in reinforcement learning[J]. Artificial Intelligence, 1999, 112(1/2): 181−211
    [11]
    Kulkarni T D, Narasimhan K, Saeedi A, et al. Hierarchical deep reinforcement learning: Integrating temporal abstraction and intrinsic motivation[C]//Proc of the 29th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2016: 3675−3683
    [12]
    Zhao Dongyang, Zhang Liang, Zhang Bo, et al. Mahrl: Multi-goals abstraction based deep hierarchical reinforcement learning for recommendations[C]//Proc of the 43rd Int ACM SIGIR Conf on Research and Development in Information Retrieval. New York: ACM, 2020: 871−880
    [13]
    Duan Jingliang, Li Shengbo, Guan Yang, et al. Hierarchical reinforcement learning for self-driving decision-making without reliance on labelled driving data[J]. IET Intelligent Transport Systems, 2020, 14(5): 297−305 doi: 10.1049/iet-its.2019.0317
    [14]
    Liu Jianfeng, Pan Feiyang, Luo Ling. Gochat: Goal-oriented chatbots with hierarchical reinforcement learning[C]//Proc of the 43rd Int ACM SIGIR Conf on Research and Development in Information Retrieval. New York: ACM, 2020: 1793−1796
    [15]
    Levy A, Konidaris G, Platt R, et al. Learning multi-level hierarchies with hindsight[J]. arXiv preprint, arXiv: 1712.00948, 2017
    [16]
    Bacon P L, Harb J, Precup D. The option-critic architecture[C]//Proc of the 31st AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2017: 1726−1734
    [17]
    Harb J, Bacon P L, Klissarov M, et al. When waiting is not an option: Learning options with a deliberation cost[C]//Proc of the 32nd AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2018: 3165−3172
    [18]
    Klissarov M, Bacon P L, Harb J, et al. Learnings options end-to-end for continuous action tasks[J]. arXiv preprint, arXiv: 1712.00004, 2017
    [19]
    Schulman J, Wolski F, Dhariwal P, et al. Proximal policy optimization algorithms[J]. arXiv preprint, arXiv: 1707.06347, 2017
    [20]
    Li Chenghao, Ma Xiaoteng, Zhang Chongjie, et al. Soac: The soft option actor-critic architecture[J]. arXiv preprint, arXiv: 2006.14363, 2020
    [21]
    Kanagawa Y, Kaneko T. Diverse exploration via infomax options[J]. arXiv preprint, arXiv: 2010.02756, 2020
    [22]
    Eysenbach B, Gupta A, Ibarz J, et al. Diversity is all you need: Learning skills without a reward function[J]. arXiv preprint, arXiv: 1802.06070, 2018
    [23]
    Gregor K, Rezende D J, Wierstra D. Variational intrinsic control[J]. arXiv preprint, arXiv: 1611.07507, 2016
    [24]
    Harutyunyan A, Dabney W, Borsa D, et al. The termination critic[C]//Proc of the 22nd Int Conf on Artificial Intelligence and Statistics. New York: PMLR, 2019: 2231−2240
    [25]
    Kamat A, Precup D. Diversity-enriched option-critic[J]. arXiv preprint, arXiv: 2011.02565, 2020
    [26]
    Khetarpal K, Klissarov M, Chevalier-Boisvert M, et al. Options of interest: Temporal abstraction with interest functions[C]//Proc of the 34th AAAI Conf on Artificial Intelligence. Palo Alto, CA: AAAI, 2020: 4444−4451
    [27]
    Sutton R S, McAllester D A, Singh S P, et al. Policy gradient methods for reinforcement learning with function approximation[C]//Proc of the 14th Advances in Neural Information Processing Systems. Cambridge, MA: MIT, 2000: 1057−1063
    [28]
    Brockman G, Cheung V, Pettersson L, et al. OpenAI gym[J]. arXiv preprint, arXiv: 1606.01540, 2016
    [29]
    Finn C, Abbeel P, Levine S. Model-agnostic meta-learning for fast adaptation of deep networks[C]//Proc of the 34th Int Conf on Machine Learning. New York: ACM, 2017: 1126−1135
    [30]
    Henderson P, Chang W D, Shkurti F, et al. Benchmark environments for multitask learning in continuous domains[J]. arXiv preprint, arXiv: 1708.04352, 2017
    [31]
    Chevalier-Boisvert M, Dai B, Towers M, et al. Gym-miniworld environment for OpenAI Gym[EB/OL]. 2018[2022-11-22]. https://github.com/maximecb/gym-miniworld
