Citation: Ding Shifei, Du Wei, Guo Lili, Zhang Jian, Xu Xiao. Multi-Agent Deep Deterministic Policy Gradient Method Based on Double Critics[J]. Journal of Computer Research and Development, 2023, 60(10): 2394-2404. DOI: 10.7544/issn1000-1239.202220399

Multi-Agent Deep Deterministic Policy Gradient Method Based on Double Critics

Funds: This work was supported by the National Natural Science Foundation of China (62276265, 61976216, 62206297, 62206296).
  • Author Bio:

    Ding Shifei: born in 1963. PhD, professor. Member of CCF. His main research interests include pattern recognition and artificial intelligence

    Du Wei: born in 1994. PhD candidate. His main research interests include machine learning and reinforcement learning

    Guo Lili: born in 1990. PhD. Her main research interests include deep learning and emotion recognition

    Zhang Jian: born in 1990. PhD. His main research interests include machine learning and pattern recognition

    Xu Xiao: born in 1992. PhD. Her main research interests include machine learning and clustering analysis

  • Received Date: May 16, 2022
  • Revised Date: November 14, 2022
  • Available Online: April 17, 2023
  • In the complex multi-agent environments of the real world, completing a task usually requires cooperation among agents, which has driven the development of a variety of multi-agent reinforcement learning methods. Estimation bias of the Q-value is an important problem in single-agent reinforcement learning, but it has rarely been studied in multi-agent environments. To address this problem, the multi-agent deep deterministic policy gradient (MADDPG) method commonly used in multi-agent reinforcement learning is first shown, both theoretically and experimentally, to overestimate the value function. A multi-agent deep deterministic policy gradient method based on double critics (MADDPG-DC) is then proposed, whose double-critic network structure avoids overestimation of the Q-value and thereby further promotes the agents' policy learning. In addition, delayed updating of the actor network is introduced to keep policy updates efficient and stable and to improve the quality of policy learning. To demonstrate the effectiveness and generality of the proposed method, experiments are conducted on different tasks in the multi-agent particle environment. The experimental results show that the proposed method effectively avoids overestimation of the value function. Moreover, experiments in a traffic signal control environment verify the feasibility and superiority of the proposed method in practical applications.
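The page reproduces only the abstract, so the following is a minimal, self-contained sketch of the two mechanisms it describes, assuming they work analogously to TD3-style clipped double Q-learning applied to MADDPG's centralized critics: the TD target takes the minimum of two target critics to curb Q-value overestimation, and the actors are updated less frequently than the critics. All names, network sizes, and hyperparameters below are illustrative assumptions, not code from the paper.

```python
# Sketch of a double-critic MADDPG update with a delayed actor step.
# Everything here (network sizes, hyperparameters, the TD3-style clipped
# target, variable names) is an illustrative assumption, not the paper's code.
import torch
import torch.nn as nn

N_AGENTS, OBS_DIM, ACT_DIM = 2, 4, 2
JOINT_OBS, JOINT_ACT = N_AGENTS * OBS_DIM, N_AGENTS * ACT_DIM

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

# Decentralized actors; two centralized critics seeing all observations/actions.
actors = [mlp(OBS_DIM, ACT_DIM) for _ in range(N_AGENTS)]
critics = [mlp(JOINT_OBS + JOINT_ACT, 1) for _ in range(2)]
target_actors = [mlp(OBS_DIM, ACT_DIM) for _ in range(N_AGENTS)]
target_critics = [mlp(JOINT_OBS + JOINT_ACT, 1) for _ in range(2)]
for src, dst in zip(actors + critics, target_actors + target_critics):
    dst.load_state_dict(src.state_dict())

critic_opt = torch.optim.Adam([p for c in critics for p in c.parameters()], lr=1e-3)
actor_opts = [torch.optim.Adam(a.parameters(), lr=1e-3) for a in actors]
GAMMA, POLICY_DELAY, TAU = 0.95, 2, 0.01

def update(step, obs, acts, rews, next_obs, done):
    """One training step on a batch of joint transitions."""
    with torch.no_grad():
        # Each target actor picks its next action from its own observation.
        next_acts = torch.cat(
            [pi(next_obs[:, i]) for i, pi in enumerate(target_actors)], dim=-1)
        x_next = torch.cat([next_obs.flatten(1), next_acts], dim=-1)
        # Double critics: the elementwise minimum curbs Q-value overestimation.
        q_next = torch.min(target_critics[0](x_next), target_critics[1](x_next))
        y = rews + GAMMA * (1.0 - done) * q_next
    x = torch.cat([obs.flatten(1), acts], dim=-1)
    critic_loss = sum(nn.functional.mse_loss(c(x), y) for c in critics)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed actor update: actors and target networks move only every
    # POLICY_DELAY critic steps, so policy updates track settled Q-estimates.
    if step % POLICY_DELAY == 0:
        for i, (pi, opt) in enumerate(zip(actors, actor_opts)):
            joint = list(acts.split(ACT_DIM, dim=-1))
            joint[i] = pi(obs[:, i])               # differentiate through agent i only
            x_pi = torch.cat([obs.flatten(1)] + joint, dim=-1)
            actor_loss = -critics[0](x_pi).mean()  # ascend the first critic's Q
            opt.zero_grad()
            actor_loss.backward()
            opt.step()
        for src, dst in zip(actors + critics, target_actors + target_critics):
            for p, tp in zip(src.parameters(), dst.parameters()):
                tp.data.mul_(1 - TAU).add_(TAU * p.data)

# Illustrative one-off run on a random batch of 32 joint transitions.
B = 32
batch = (torch.randn(B, N_AGENTS, OBS_DIM), torch.randn(B, JOINT_ACT),
         torch.randn(B, 1), torch.randn(B, N_AGENTS, OBS_DIM), torch.zeros(B, 1))
for step in range(4):
    update(step, *batch)
```

A design note on the sketch: each critic is centralized (it conditions on all agents' observations and actions, as in MADDPG), while each actor remains decentralized and acts on its own observation only; the delayed actor update lets the critics' value estimates settle before each policy step.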

