高级检索

    给定预算下基于相对熵置信区间的蒙特卡洛树搜索最优动作识别算法

    Best Action Identification Algorithm in Monte Carlo Tree Search Based on Relative Entropy Confidence Interval with Given Budget

    • 摘要: 蒙特卡洛树搜索(Monte Carlo tree search, MCTS)将强化学习的反馈优化与生长树的动态规划相结合,在输出当前状态的最佳动作的同时极大地减少了计算量,因此成为开放环境下众多领域智能系统的关键通用方法. 但由于计算资源匮乏或者计算成本昂贵等原因,完全充分地对树结构进行搜索是难以实现的,因此在有限的预算下高效合理地分配计算资源从而获得当前状态下的最优动作是目前研究的一个重要问题. 现有大多数算法仅以识别准确率作为性能指标,通过实验对比验证算法性能,缺少对算法的识别误差和影响因素的分析,从而降低了算法的可信性和可解释性. 针对该问题,选择基础核心的2名玩家、完全信息、零和博弈场景,提出了固定预算设定下MCTS抽象模型的最优行动识别算法DLU——基于相对熵置信区间的纯探索(relative entropy confidence interval based pure exploration). 首先提出了基于相对熵置信区间的估值方法对叶子节点胜率进行估计,其可以从底层提高树节点估值准确性;其次给出了第1层节点值估计、最优节点选择策略以形成完整算法流程;然后推导了DLU算法的识别误差上界,并分析了算法性能的影响因素;最后在人造树模型和井字棋2种场景下验证算法性能. 实验结果表明,在人造树模型上基于相对熵的算法类具有更高的准确度,且模型越复杂识别难度越高时,该算法类的性能优势越显著. 在井字棋场景下,DLU算法能有效地识别最优动作.

       

      Abstract: Monte Carlo tree search (MCTS) integrates the feedback optimization of reinforcement learning with the dynamic planning of growing tree to output the best action for the current state while greatly reducing computational effort, making it a key general method for intelligent systems of many domains in open environment. However, a sufficient and complete search for tree structure is difficult to implement due to resource limitation or computation cost, so an efficient and reasonable allocation of computational resources to obtain the best action in the current state is an important issue of current research. Most of the existing algorithms only use identification accuracy as a performance indicator to verify algorithm performance by experimental comparison, lacking analysis for the algorithm’s identification error and influencing factors, thus reducing the credibility and interpretability. In order to solve the problem, the basic core two players, complete information and zero-sum game scenario is chosen and the best action identification (BAI) algorithm DLU (relative entropy confidence interval based pure exploration) is proposed for the MCTS abstraction model with fixed budget setting. Firstly, a relative entropy-based confidence interval estimation method is constructed to estimate the win rate of leaf, which can essentially improve the valuation accuracy of tree nodes. Secondly, value estimation and node selection strategy for depth-one nodes are proposed to form the complete algorithm flow. Then the upper bound for output error of DLU algorithm is derived and the influencing factors on the algorithm performance are analysed. Finally, the algorithm performance is verified in two scenarios: artificial tree model and tic-tac-toe. The experimental results show that the algorithm class based on relative entropy has higher accuracy on the artificial tree model, and the performance advantage is more significant when the model is more complex and the recognition difficulty is higher. In the tic-tac-toe scenario, the DLU algorithm can effectively identify the best move.

       

    /

    返回文章
    返回