Abstract:
In the complex multi-agent environments of the real world, completing a task usually requires cooperation among agents, which has promoted the emergence of a variety of multi-agent reinforcement learning (MARL) methods. Estimation bias in the Q-value is a well-known problem in single-agent reinforcement learning, but it has rarely been studied in multi-agent environments. Addressing this problem, the multi-agent deep deterministic policy gradient (MADDPG) method, which is widely used in MARL, is shown both theoretically and experimentally to overestimate the value function. A multi-agent deep deterministic policy gradient method based on a double-critic (MADDPG-DC) network structure is then proposed to avoid overestimation of the Q-value and thereby improve the agents' policy learning. In addition, delayed updating of the actor network is introduced to ensure efficient and stable policy updates and to improve the quality of policy learning. To demonstrate the effectiveness and generality of the proposed method, experiments are conducted on different tasks in the multi-agent particle environment; the results show that the proposed method effectively avoids overestimation of the value function. Experiments in a traffic signal control environment further verify the feasibility and superiority of the proposed method in practical applications.
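To make the two mechanisms summarized above concrete, the following is a minimal PyTorch-style sketch of a double-critic update combined with delayed actor updates, in the spirit of clipped double Q-learning (TD3). It is written for a single agent for brevity; in a MADDPG-style setting each agent's centralized critic would additionally take all agents' observations and actions. All module names, signatures, and hyperparameters (critic1, critic2, policy_delay, tau, ...) are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

def update(critic1, critic2, actor,
           target_critic1, target_critic2, target_actor,
           critic_opt, actor_opt,
           batch, step, gamma=0.99, tau=0.005, policy_delay=2):
    """One training step: assumes critic_opt optimizes the parameters
    of both critics, and batch holds float tensors from a replay buffer."""
    obs, act, rew, next_obs, done = batch

    with torch.no_grad():
        next_act = target_actor(next_obs)
        # Double critic: take the minimum of the two target critics,
        # which curbs the overestimation a single critic accumulates.
        target_q = torch.min(target_critic1(next_obs, next_act),
                             target_critic2(next_obs, next_act))
        y = rew + gamma * (1.0 - done) * target_q

    # Both critics regress toward the same pessimistic target.
    critic_loss = (F.mse_loss(critic1(obs, act), y)
                   + F.mse_loss(critic2(obs, act), y))
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Delayed actor update: refresh the policy (and target networks)
    # only every `policy_delay` critic updates, so the actor is trained
    # against a more settled value estimate.
    if step % policy_delay == 0:
        actor_loss = -critic1(obs, actor(obs)).mean()
        actor_opt.zero_grad()
        actor_loss.backward()
        actor_opt.step()

        # Soft (Polyak) updates of all target networks.
        for net, target in ((critic1, target_critic1),
                            (critic2, target_critic2),
                            (actor, target_actor)):
            for p, tp in zip(net.parameters(), target.parameters()):
                tp.data.mul_(1.0 - tau).add_(tau * p.data)
```

The min over the two critics makes the bootstrapped target a lower bound of the two estimates, which is the standard mechanism for suppressing Q-value overestimation; the delayed, less frequent actor update is the standard mechanism for stabilizing policy learning against a still-noisy critic.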