Best Action Identification Algorithm in Monte Carlo Tree Search Based on Relative Entropy Confidence Interval with Given Budget
Abstract
Monte Carlo tree search (MCTS) integrates the feedback-driven optimization of reinforcement learning with dynamic planning over a growing tree to output the best action for the current state while greatly reducing computational effort, making it a key general-purpose method for intelligent systems in open environments across many domains. However, a sufficient and complete search of the tree is difficult to carry out due to resource limitations or computational cost, so efficiently and soundly allocating a computational budget to identify the best action in the current state is an important open research problem. Most existing algorithms use identification accuracy as the sole performance indicator, verified only by experimental comparison, and lack an analysis of the algorithm's identification error and its influencing factors, which reduces their credibility and interpretability. To address this problem, the fundamental two-player, complete-information, zero-sum game setting is considered, and the best action identification (BAI) algorithm DLU (relative-entropy-confidence-interval-based pure exploration) is proposed for the MCTS abstraction model under a fixed-budget setting. First, a relative-entropy-based confidence interval estimation method is constructed to estimate the win rates of leaf nodes, which substantially improves the value-estimation accuracy of tree nodes. Second, a value estimation and node selection strategy for depth-one nodes is proposed, yielding the complete algorithm. Then an upper bound on the output error of the DLU algorithm is derived, and the factors influencing its performance are analysed. Finally, the algorithm's performance is verified in two scenarios: an artificial tree model and tic-tac-toe.
The experimental results show that the relative-entropy-based family of algorithms achieves higher accuracy on the artificial tree model, and their performance advantage becomes more pronounced as the model grows more complex and identification becomes more difficult. In the tic-tac-toe scenario, the DLU algorithm effectively identifies the best move.
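The relative-entropy confidence interval at the heart of the abstract can be illustrated with a standard KL-based Bernoulli bound: given an empirical win rate p̂ from n simulations, the interval contains all means q whose KL divergence from p̂, scaled by n, stays below an exploration threshold β. The sketch below is a minimal illustration of that construction under stated assumptions, not the paper's exact DLU procedure; the function names and the choice of β are assumptions.

```python
import math


def bernoulli_kl(p: float, q: float) -> float:
    """KL divergence KL(Bern(p) || Bern(q)), with clamping for numerical safety."""
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))


def kl_upper_bound(p_hat: float, n: int, beta: float, tol: float = 1e-9) -> float:
    """Largest q >= p_hat with n * KL(p_hat || q) <= beta, found by bisection."""
    lo, hi = p_hat, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if n * bernoulli_kl(p_hat, mid) <= beta:
            lo = mid  # mid is still inside the confidence region
        else:
            hi = mid
    return lo


def kl_lower_bound(p_hat: float, n: int, beta: float, tol: float = 1e-9) -> float:
    """Smallest q <= p_hat with n * KL(p_hat || q) <= beta, found by bisection."""
    lo, hi = 0.0, p_hat
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if n * bernoulli_kl(p_hat, mid) <= beta:
            hi = mid  # mid is still inside the confidence region
        else:
            lo = mid
    return hi


# Example: leaf observed 50 wins in 100 simulations, threshold beta = log(100).
u = kl_upper_bound(0.5, 100, math.log(100))
l = kl_lower_bound(0.5, 100, math.log(100))
```

As n grows with β fixed, the interval [l, u] shrinks around p̂, which is the mechanism the abstract relies on to sharpen leaf-node value estimates within a fixed simulation budget.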