Citation: Ding Shifei, Du Wei, Guo Lili, Zhang Jian, Xu Xiao. Multi-Agent Deep Deterministic Policy Gradient Method Based on Double Critics[J]. Journal of Computer Research and Development, 2023, 60(10): 2394-2404. DOI: 10.7544/issn1000-1239.202220399
In the complex multi-agent environments of the real world, completing a task usually requires cooperation among agents, which has driven the emergence of a variety of multi-agent reinforcement learning methods. Estimation bias of the Q-value is a well-studied problem in single-agent reinforcement learning, but it has rarely been examined in multi-agent settings. To address this problem, the multi-agent deep deterministic policy gradient (MADDPG) method, which is widely used in multi-agent reinforcement learning, is first shown, both theoretically and experimentally, to overestimate the value function. A multi-agent deep deterministic policy gradient method with a double-critic (MADDPG-DC) network structure is then proposed to avoid Q-value overestimation and thereby promote the agents' policy learning. In addition, delayed updating of the actor network is introduced to keep policy updates efficient and stable and to improve the quality of the learned policies. To demonstrate the effectiveness and generality of the proposed method, experiments are conducted on several tasks in the multi-agent particle environment. The results show that the proposed method effectively avoids overestimation of the value function. Furthermore, experiments in a traffic signal control environment verify the feasibility and superiority of the proposed method in a practical application.
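The two mechanisms named above, a double-critic structure and a delayed actor update, can be made concrete with a short sketch. The following PyTorch code is a minimal illustration under stated assumptions, not the paper's implementation: it assumes the standard clipped double-Q target (the minimum of two centralized target critics, as in TD3), Polyak-averaged target networks, and that the caller forms next_joint_act from all agents' target actors. All class names, network sizes, and hyperparameter values (hidden=64, gamma=0.95, tau=0.01, policy_delay=2) are illustrative assumptions.

```python
# Minimal sketch of a double-critic update with delayed actor updates,
# in the spirit of MADDPG-DC as described in the abstract. The clipped
# double-Q target (min over two target critics) and all names, shapes,
# and hyperparameters are assumptions, not taken from the paper.
import copy

import torch
import torch.nn as nn
import torch.nn.functional as F


class Critic(nn.Module):
    """Centralized critic: scores the joint observations and actions of all agents."""

    def __init__(self, joint_obs_dim, joint_act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, joint_obs, joint_act):
        return self.net(torch.cat([joint_obs, joint_act], dim=-1))


class Actor(nn.Module):
    """Decentralized actor: maps one agent's local observation to its action."""

    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, act_dim), nn.Tanh(),
        )

    def forward(self, obs):
        return self.net(obs)


class DoubleCriticAgent:
    """One agent's networks and update rule (illustrative, not the paper's code)."""

    def __init__(self, obs_dim, act_dim, joint_obs_dim, joint_act_dim, lr=1e-3):
        self.actor = Actor(obs_dim, act_dim)
        self.critic1 = Critic(joint_obs_dim, joint_act_dim)
        self.critic2 = Critic(joint_obs_dim, joint_act_dim)
        # Target networks start as copies and track the online networks slowly.
        self.actor_t = copy.deepcopy(self.actor)
        self.critic1_t = copy.deepcopy(self.critic1)
        self.critic2_t = copy.deepcopy(self.critic2)
        self.actor_opt = torch.optim.Adam(self.actor.parameters(), lr=lr)
        self.critic_opt = torch.optim.Adam(
            list(self.critic1.parameters()) + list(self.critic2.parameters()), lr=lr)

    def update(self, joint_obs, joint_act, reward, next_joint_obs, next_joint_act,
               done, my_obs, my_act_slice, step,
               gamma=0.95, tau=0.01, policy_delay=2):
        # next_joint_act is assumed to be produced by all agents' target actors.
        # --- Critic update: the TD target takes the MINIMUM of the two target
        # critics, so a positive error in one critic cannot inflate the target.
        with torch.no_grad():
            q_next = torch.min(self.critic1_t(next_joint_obs, next_joint_act),
                               self.critic2_t(next_joint_obs, next_joint_act))
            target = reward + gamma * (1.0 - done) * q_next
        critic_loss = (F.mse_loss(self.critic1(joint_obs, joint_act), target) +
                       F.mse_loss(self.critic2(joint_obs, joint_act), target))
        self.critic_opt.zero_grad()
        critic_loss.backward()
        self.critic_opt.step()

        # --- Delayed actor update: the policy and the target networks move only
        # every `policy_delay` critic updates, so the actor always follows a
        # critic whose estimates have had time to settle.
        if step % policy_delay == 0:
            joint_act_pi = joint_act.clone()
            joint_act_pi[:, my_act_slice] = self.actor(my_obs)  # re-insert own action
            actor_loss = -self.critic1(joint_obs, joint_act_pi).mean()
            self.actor_opt.zero_grad()
            actor_loss.backward()
            self.actor_opt.step()
            # Soft (Polyak) update of all target networks.
            pairs = [(self.actor_t, self.actor), (self.critic1_t, self.critic1),
                     (self.critic2_t, self.critic2)]
            for target_net, online_net in pairs:
                for p_t, p in zip(target_net.parameters(), online_net.parameters()):
                    p_t.data.mul_(1.0 - tau).add_(tau * p.data)


# Example wiring for two agents with 4-dim observations and 2-dim actions
# (purely illustrative shapes):
#   agent0 = DoubleCriticAgent(obs_dim=4, act_dim=2, joint_obs_dim=8, joint_act_dim=4)
#   agent0.update(..., my_act_slice=slice(0, 2), step=t)
```

The minimum over the two critics is what removes the positive bias: a single bootstrapped target inherits the overestimation of its own maximization step, whereas min(Q1, Q2) errs, if at all, on the low side. Delaying the actor means each policy gradient is taken against a critic that has been refit several times since the last policy change, which matches the efficiency and stability argument the abstract makes for the delayed update.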