## Bayesian Policy Gradient and Actor-Critic Algorithms

*Mohammad Ghavamzadeh, Yaakov Engel, Michal Valko*; 17(66):1−53, 2016.

### Abstract

Policy gradient methods are reinforcement learning algorithms
that adapt a parameterized policy by following a performance
gradient estimate. Many conventional policy gradient methods use
Monte-Carlo techniques to estimate this gradient. The policy is
improved by adjusting the parameters in the direction of the
gradient estimate. Since Monte-Carlo methods tend to have high
variance, a large number of samples is required to attain
accurate estimates, resulting in slow convergence. In this
paper, we first propose a Bayesian framework for policy
gradient, based on modeling the policy gradient as a Gaussian
process. This reduces the number of samples needed to obtain
accurate gradient estimates. Moreover, estimates of the natural
gradient as well as a measure of the uncertainty in the gradient
estimates, namely, the gradient covariance, are provided at
little extra cost. Since the proposed Bayesian framework
considers system trajectories as its basic observable unit, it
does not require the dynamics within trajectories to be of any
particular form, and thus, can be easily extended to partially
observable problems. On the downside, it cannot take advantage
of the Markov property when the system is Markovian. To address
this issue, we proceed to supplement our Bayesian policy
gradient framework with a new actor-critic learning model in
which a Bayesian class of non-parametric critics, based on
Gaussian process temporal difference learning, is used. Such
critics model the action-value function as a Gaussian process,
allowing Bayes' rule to be used in computing the posterior
distribution over action-value functions, conditioned on the
observed data. Appropriate choices of the policy
parameterization and of the prior covariance (kernel) between
action-values allow us to obtain closed-form expressions for the
posterior distribution of the gradient of the expected return
with respect to the policy parameters. We perform detailed
experimental comparisons of the proposed Bayesian policy
gradient and actor-critic algorithms with classic Monte-Carlo
based policy gradient methods, as well as with each other, on a
number of reinforcement learning problems.
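
To illustrate the high-variance issue the abstract attributes to conventional Monte-Carlo policy gradient estimation, here is a minimal sketch of a score-function (REINFORCE-style) estimator. It is not the Bayesian method proposed in the paper. The one-step problem, the Gaussian policy with mean `theta` and fixed `sigma`, the quadratic return, and the function names are all assumptions chosen so that the true gradient, `-2 * (theta - 2)`, is known in closed form.

```python
import numpy as np


def mc_policy_gradient(theta, n_samples, rng, sigma=1.0):
    """Monte-Carlo score-function estimate of dJ/dtheta (illustrative toy problem)."""
    # Assumed policy: a ~ Normal(theta, sigma^2); assumed return: R(a) = -(a - 2)^2.
    actions = rng.normal(theta, sigma, size=n_samples)
    returns = -(actions - 2.0) ** 2
    # Score function: d/dtheta log N(a; theta, sigma^2) = (a - theta) / sigma^2.
    score = (actions - theta) / sigma**2
    # The gradient estimate is the sample mean of return times score.
    return float(np.mean(returns * score))


def estimator_std(theta, n_samples, n_repeats=200, seed=0):
    """Empirical standard deviation of the estimator over repeated runs."""
    rng = np.random.default_rng(seed)
    estimates = [mc_policy_gradient(theta, n_samples, rng) for _ in range(n_repeats)]
    return float(np.std(estimates))


theta0 = 0.0
true_gradient = -2.0 * (theta0 - 2.0)  # closed form for this toy problem
std_small = estimator_std(theta0, n_samples=10)
std_large = estimator_std(theta0, n_samples=1000)
print(f"true gradient: {true_gradient}")
print(f"estimator std with 10 samples:   {std_small:.3f}")
print(f"estimator std with 1000 samples: {std_large:.3f}")
```

In this toy setting the spread of the estimate shrinks only at the usual Monte-Carlo rate of one over the square root of the sample count, which is the sample-efficiency problem the Bayesian framework is meant to address.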
