    Ding Shifei, Du Wei, Guo Lili, Zhang Jian, Xu Xiao. Multi-Agent Deep Deterministic Policy Gradient Method Based on Double Critics[J]. Journal of Computer Research and Development, 2023, 60(10): 2394-2404. DOI: 10.7544/issn1000-1239.202220399


    Multi-Agent Deep Deterministic Policy Gradient Method Based on Double Critics

      Abstract: In the complex multi-agent environments of the real world, completing a task usually requires cooperation among agents, which has driven the emergence of a variety of multi-agent reinforcement learning methods. Estimation bias in the action-value function is a widely studied problem in single-agent reinforcement learning, but it has rarely been examined in multi-agent settings. To address this problem, we show, both theoretically and empirically, that the commonly used multi-agent deep deterministic policy gradient method overestimates the value function. We propose a multi-agent deep deterministic policy gradient method based on double critics (MADDPG-DC), which takes the minimum over a pair of critic networks to avoid overestimation of the value and thereby helps agents learn better policies. In addition, the actor network is updated with a delay relative to the critics, which keeps policy updates efficient and stable and improves the quality of policy learning. Experiments on different tasks in the multi-agent particle environment show that the proposed method effectively avoids overestimation of the value function, and experiments in a traffic signal control environment verify its feasibility and superiority in practical applications.
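      To make the two ideas in the abstract concrete, the following is a minimal PyTorch-style sketch of a double-critic update with a delayed actor update for a single agent. It is an illustration under assumptions rather than the authors' implementation: the Critic, Actor, and update_agent names, the network sizes, and the batch layout (precomputed joint observations/actions, with next-step joint actions already produced by the target policies) are all hypothetical.

      # Minimal sketch (illustrative names, not the paper's code): each agent keeps two
      # centralized critics; the learning target takes the minimum over the two target
      # critics, and the actor is updated less often than the critics (delayed update).
      import torch
      import torch.nn as nn
      import torch.nn.functional as F

      class Critic(nn.Module):
          """Centralized critic: scores a joint observation together with a joint action."""
          def __init__(self, joint_obs_dim, joint_act_dim, hidden=64):
              super().__init__()
              self.net = nn.Sequential(
                  nn.Linear(joint_obs_dim + joint_act_dim, hidden), nn.ReLU(),
                  nn.Linear(hidden, hidden), nn.ReLU(),
                  nn.Linear(hidden, 1),
              )

          def forward(self, joint_obs, joint_act):
              return self.net(torch.cat([joint_obs, joint_act], dim=-1))

      class Actor(nn.Module):
          """Decentralized deterministic policy: local observation -> action."""
          def __init__(self, obs_dim, act_dim, hidden=64):
              super().__init__()
              self.net = nn.Sequential(
                  nn.Linear(obs_dim, hidden), nn.ReLU(),
                  nn.Linear(hidden, act_dim), nn.Tanh(),
              )

          def forward(self, obs):
              return self.net(obs)

      def update_agent(batch, own_slice, actor, critic1, critic2,
                       target_critic1, target_critic2,
                       actor_opt, critic_opt, step, gamma=0.95, policy_delay=2):
          """One training step for one agent.

          `batch` is assumed to hold tensors: joint_obs, joint_act, reward, done,
          next_joint_obs, next_joint_act (next joint actions already produced by the
          target policies of all agents), and own_obs; `own_slice` selects this
          agent's columns inside the joint action.
          """
          # Double-critic target: the min over the two target critics curbs overestimation.
          with torch.no_grad():
              q1 = target_critic1(batch["next_joint_obs"], batch["next_joint_act"])
              q2 = target_critic2(batch["next_joint_obs"], batch["next_joint_act"])
              y = batch["reward"] + gamma * (1.0 - batch["done"]) * torch.min(q1, q2)

          # Both critics regress toward the same clipped target.
          critic_loss = (F.mse_loss(critic1(batch["joint_obs"], batch["joint_act"]), y)
                         + F.mse_loss(critic2(batch["joint_obs"], batch["joint_act"]), y))
          critic_opt.zero_grad()
          critic_loss.backward()
          critic_opt.step()

          # Delayed actor update: the policy changes only every `policy_delay` critic updates.
          if step % policy_delay == 0:
              joint_act = batch["joint_act"].clone()
              joint_act[:, own_slice] = actor(batch["own_obs"])  # replace own action
              actor_loss = -critic1(batch["joint_obs"], joint_act).mean()
              actor_opt.zero_grad()
              actor_loss.backward()
              actor_opt.step()
          # (Target networks would be soft-updated elsewhere, as in MADDPG/TD3.)

      The delayed update simply means the critics see several gradient steps for every actor step, so the policy is improved against a less noisy value estimate.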

       
