Abstract:
Asynchronous advantage actor-critic (A3C) builds a parallel deep reinforcement learning framework composed of one Learner and multiple Workers. However, A3C produces high-variance solutions, and the Learner does not necessarily obtain the globally optimal policy. Moreover, the large-scale parallel network is difficult to transfer and deploy on low-power end platforms. To address these problems, we propose a compression and knowledge extraction model based on supervised exploration, called Compact_A3C. In the proposed model, we freeze the Workers of a pre-trained A3C, measure their performance on a common set of states, and map these performances to probabilities via softmax. We then update the Learner according to these probabilities, which yields the globally optimal sub-model (Worker) and improves resource utilization. Furthermore, the updated Learner serves as a Teacher Network that supervises the Student Network in the early exploration stage, and a linear factor gradually reduces the Teacher Network's guidance to encourage free exploration by the Student Network. We build two types of Student Network to demonstrate the effectiveness of the proposed model. On popular benchmarks, including Gym Classic Control and Atari 2600, the Student Network reaches the level of the Teacher Network. The code of the proposed model is published at
https://github.com/meadewaking/Compact_A3C.
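To make the two mechanisms in the abstract concrete, the following is a minimal sketch, not the authors' implementation: it assumes hypothetical names (`update_learner`, `teacher_weight`, `student_loss`) and treats Worker/Learner parameters as flat NumPy arrays; it illustrates one plausible reading of the softmax-based Learner update and the linearly decaying teacher guidance.

```python
# Sketch only; function names and the weighted-mixture update are assumptions,
# not the published Compact_A3C code.
import numpy as np

def softmax(x, temperature=1.0):
    # Numerically stable softmax over Worker returns.
    z = (np.asarray(x, dtype=np.float64) - np.max(x)) / temperature
    e = np.exp(z)
    return e / e.sum()

def update_learner(worker_params, worker_returns):
    """Map frozen-Worker returns on common states to softmax probabilities and
    form the Learner as a probability-weighted mixture of Worker weights
    (one plausible reading of 'update Learner according to such probability')."""
    probs = softmax(worker_returns)
    stacked = np.stack(worker_params)   # shape: (num_workers, num_params)
    return probs @ stacked              # probability-weighted combination

def teacher_weight(step, total_steps, beta0=1.0):
    """Linear factor that shrinks the Teacher Network's guidance over training,
    encouraging free exploration by the Student Network."""
    return beta0 * max(0.0, 1.0 - step / total_steps)

def student_loss(rl_loss, distill_loss, step, total_steps):
    """Student objective: its own RL loss plus a linearly decaying
    imitation (distillation) term toward the Teacher Network."""
    return rl_loss + teacher_weight(step, total_steps) * distill_loss

if __name__ == "__main__":
    # Toy usage: three frozen Workers with different evaluation returns.
    params = [np.random.randn(8) for _ in range(3)]
    returns = [120.0, 95.0, 140.0]
    learner = update_learner(params, returns)
    print("learner params:", learner)
    print("teacher weight at step 500/1000:", teacher_weight(500, 1000))
```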