    Zhang Jing, Wang Ziming, Ren Yonggong. A3C Deep Reinforcement Learning Model Compression and Knowledge Extraction[J]. Journal of Computer Research and Development, 2023, 60(6): 1373-1384. DOI: 10.7544/issn1000-1239.202111186


    A3C Deep Reinforcement Learning Model Compression and Knowledge Extraction

    • Abstract: Asynchronous advantage actor-critic (A3C) builds an asynchronous parallel deep reinforcement learning framework with one master agent (Learner) and multiple sub-agents (Workers). Its search for the optimal policy suffers from high solution variance, so the Learner cannot guarantee globally optimal parameter updates or learning of the best policy. Moreover, the large-scale parallel networks built with massive computing resources are hard to deploy on low-power edge platforms. To address these problems, we propose the compact asynchronous advantage actor-critic (Compact_A3C) model, which achieves model compression and knowledge extraction. The model freezes and evaluates the learning performance of every sub-agent in the A3C framework, and converts the evaluation results into update probabilities for the Learner, guaranteeing acquisition of the globally optimal policy and improving the resource utilization of the large-scale network. Further, the model takes the optimized Learner as a "teacher network" that supervises the early exploration and policy guidance of a small-scale "student network", and constructs a linearly decaying loss function that encourages the student network to explore complex environments freely, strengthening its autonomous learning ability and realizing knowledge extraction and network compression for the large-scale A3C model. Student networks with different compression ratios achieve learning performance consistent with the large-scale teacher network in the popular Gym Classic Control and Atari 2600 environments. The model code is published at https://github.com/meadewaking/Compact_A3C.
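The Worker-selection step described above (freeze each sub-agent, score it on common states, map scores to update probabilities) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names (`select_worker`, `softmax`) and the use of raw evaluation returns as softmax inputs are assumptions for exposition.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax: subtract the max before exponentiating."""
    z = np.exp(x - np.max(x))
    return z / z.sum()

def select_worker(worker_scores, rng=None):
    """Map frozen-Worker evaluation scores to probabilities via softmax,
    then sample the index of the Worker whose parameters update the Learner."""
    rng = rng or np.random.default_rng()
    probs = softmax(np.asarray(worker_scores, dtype=float))
    idx = int(rng.choice(len(probs), p=probs))
    return idx, probs

# Example: three frozen Workers evaluated on a common set of states
idx, probs = select_worker([120.0, 95.0, 140.0])
```

Sampling (rather than always taking the argmax) keeps every sub-model's experience usable while still favoring the best-performing Worker, which is how the abstract's "update probability" reads.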

       

      Abstract: Asynchronous advantage actor-critic (A3C) constructs a parallel deep reinforcement learning framework composed of one Learner and multiple Workers. However, A3C produces high-variance solutions, and the Learner does not obtain the globally optimal policy. Moreover, it is difficult to transfer and deploy the large-scale parallel network to a low-power end platform. To address these problems, we propose a compression and knowledge extraction model based on supervised exploration, called Compact_A3C. In the proposed model, we freeze the Workers of the pre-trained A3C, measure their performance on common states, and map the performance scores to probabilities with softmax. We update the Learner according to these probabilities, so as to obtain the globally optimal sub-model (Worker) and enhance resource utilization. Furthermore, the updated Learner serves as a Teacher Network that supervises the Student Network during its early exploration stage. We exploit a linearly decaying factor to reduce the guidance of the Teacher Network and encourage free exploration by the Student Network, and we build two types of Student Network to demonstrate the effectiveness of the proposed model. In the popular environments of Gym Classic Control and Atari 2600, the Student Networks achieve the level of the Teacher Network. The code of the proposed model is published at https://github.com/meadewaking/Compact_A3C.
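The linearly decaying teacher guidance described in the abstract can be sketched as a simple loss schedule. This is a hedged illustration, assuming the schedule interpolates a mixing weight from full teacher guidance to none over a fixed number of steps; the names `distill_weight` and `student_loss`, and the convex combination of the two loss terms, are assumptions, not the paper's exact formulation.

```python
def distill_weight(step, total_steps, start=1.0, end=0.0):
    """Linearly decay the teacher-guidance coefficient from `start` to `end`
    over `total_steps`, so the Student Network gradually explores on its own."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

def student_loss(policy_loss, distill_loss, step, total_steps):
    """Blend the student's own RL loss with the teacher-imitation loss,
    weighted by the current (decaying) guidance coefficient."""
    w = distill_weight(step, total_steps)
    return (1 - w) * policy_loss + w * distill_loss
```

Early in training the imitation term dominates (cheap supervised exploration from the teacher); by the end of the schedule the student optimizes only its own reinforcement-learning objective.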

       
