Using Trajectory Data to Improve Bayesian Optimization for Reinforcement Learning
Aaron Wilson, Alan Fern, Prasad Tadepalli; 15(Jan):253−282, 2014.
AbstractRecently, Bayesian Optimization (BO) has been used to successfully optimize parametric policies in several challenging Reinforcement Learning (RL) applications. BO is attractive for this problem because it exploits Bayesian prior information about the expected return and exploits this knowledge to select new policies to execute. Effectively, the BO framework for policy search addresses the exploration-exploitation tradeoff. In this work, we show how to more effectively apply BO to RL by exploiting the sequential trajectory information generated by RL agents. Our contributions can be broken into two distinct, but mutually beneficial, parts. The first is a new Gaussian process (GP) kernel for measuring the similarity between policies using trajectory data generated from policy executions. This kernel can be used in order to improve posterior estimates of the expected return thereby improving the quality of exploration. The second contribution, is a new GP mean function which uses learned transition and reward functions to approximate the surface of the objective. We show that the model-based approach we develop can recover from model inaccuracies when good transition and reward models cannot be learned. We give empirical results in a standard set of RL benchmarks showing that both our model-based and model-free approaches can speed up learning compared to competing methods. Further, we show that our contributions can be combined to yield synergistic improvement in some domains.