    Citation: Sun Yong, Wu Bo, and Feng Yanpeng. A Policy- and Value-Iteration Algorithm for POMDP[J]. Journal of Computer Research and Development, 2008, 45(10): 1763-1768.


    A Policy- and Value-Iteration Algorithm for POMDP

    • Abstract: A partially observable Markov decision process (POMDP) transforms a non-Markovian problem into a Markovian one by introducing a belief state space, and its ability to describe the real world has made it an important branch of the study of stochastic decision processes. The traditional algorithms for solving POMDPs are value iteration and policy iteration; however, exact solution algorithms are computationally intractable for all but the smallest problems. This paper first describes the principles and decision process of POMDPs, and then presents a policy- and value-iteration algorithm (PVIA) for them. The algorithm combines the advantages of the two methods, using policy iteration when planning and value iteration when computing values. By means of linear programming and dynamic programming, it addresses the curse of dimensionality that arises when the belief state space is large and obtains an approximately optimal solution. A key contribution of this paper is showing how the basic operations of both algorithms can be performed efficiently together. The algorithm was implemented in the SZPT_Roc team, which took second place in the simulation league of the RoboCup 2006 Chinese Open Championship. Finally, in comparison with several typical algorithms, experimental results show that the algorithm is feasible and effective.
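    For reference, the belief-state transformation the abstract relies on is standard POMDP theory (not a contribution of this paper): after taking action a in belief state b and receiving observation o, the new belief follows from Bayes' rule,

    b'(s') = \frac{P(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, b(s)}{P(o \mid b, a)}, \qquad P(o \mid b, a) = \sum_{s' \in S} P(o \mid s', a) \sum_{s \in S} P(s' \mid s, a)\, b(s).

    Because the belief b is a sufficient statistic for the entire action-observation history, the process over beliefs is Markovian, which is exactly the conversion from a non-Markov chain problem to a Markov chain problem described above.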
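    To make the machinery concrete, the following is a minimal, self-contained sketch of belief updating and an approximate value-iteration backup over a fixed grid of belief points. It illustrates standard POMDP computation only; it is not the paper's PVIA, and the model arrays T, Z, R, the belief grid B, and the nearest-neighbor value lookup are all assumptions made for the sketch.

    import numpy as np

    def belief_update(b, a, o, T, Z):
        """Bayes update of belief b after action a and observation o.
        Conventions: T[a][s, s2] = P(s2 | s, a); Z[a][s2, o] = P(o | s2, a)."""
        b_next = Z[a][:, o] * (T[a].T @ b)   # P(o|s',a) * sum_s P(s'|s,a) b(s)
        norm = b_next.sum()                  # equals P(o | b, a)
        return b_next / norm if norm > 0 else None

    def value_iteration_on_beliefs(B, T, Z, R, gamma=0.95, iters=50):
        """Approximate value iteration over a fixed grid of belief points B.
        Values at updated beliefs are read off at the nearest grid point,
        a crude stand-in for an exact alpha-vector backup."""
        V = np.zeros(len(B))
        for _ in range(iters):
            V_new = np.empty_like(V)
            for i, b in enumerate(B):
                q_values = []
                for a in range(len(T)):
                    expected_r = b @ R[:, a]  # immediate reward under belief b
                    future = 0.0
                    for o in range(Z[a].shape[1]):
                        p_o = float(Z[a][:, o] @ (T[a].T @ b))  # P(o | b, a)
                        if p_o > 0:
                            b2 = belief_update(b, a, o, T, Z)
                            # value of the updated belief, via nearest grid point
                            j = min(range(len(B)),
                                    key=lambda k: np.abs(B[k] - b2).sum())
                            future += p_o * V[j]
                    q_values.append(expected_r + gamma * future)
                V_new[i] = max(q_values)
            V = V_new
        return V

    Grid-based approximation of this kind keeps each backup polynomial in the number of grid points, which is one simple way to sidestep the exponential growth of exact value iteration that the abstract calls the curse of dimensionality.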
