Episodic Policy Search Algorithms: a sample efficiency perspective

Olivier Sigaud, Sorbonne Universités, UPMC Univ Paris 06, CNRS UMR 7222, Institut des Systèmes Intelligents et de Robotique, F-75005 Paris, France, olivier.sigaud@isir.upmc.fr, +33 (0) 1 44 27 88 53

Freek Stulp, German Aerospace Center (DLR), Institute of Robotics and Mechatronics, Wessling, Germany, freek.stulp@dlr.de

Episodic policy search is currently the focus of intensive research driven by the recent success of deep reinforcement learning (RL) algorithms. In this paper we present a broad survey of episodic policy search methods, from optimization without a utility model and Bayesian Optimisation to derivative-based optimization and deep RL algorithms. We build a unified and didactical perspective to explain why deep RL algorithms are generally more sample efficient than previous methods, and we provide a conceptual survey of the most recent algorithms, without going into the details of mathematical derivations.

episodic policy search, sample efficiency, deep reinforcement learning

1 Introduction

Autonomous systems are systems which know what to do in their domain without external control. Generally, their behavior is specified through a policy. The policy of a robot, for instance, is defined through a controller which determines actions to take or signals to send to the actuators in any state of the robot and its environment.

A lot of robot policies are designed by hand, but this manual design is only viable for systems acting in well-structured environments and to achieve well-specified tasks. When those conditions are not met, one can let the system find its own policy by exploring various behaviors and exploiting those that perform well with respect to some predefined utility function. This is called policy search, a particular case of reinforcement learning (RL) ( , ) where the action is continuous. More precisely, the goal of policy search is to optimize a policy when the utility of the resulting behaviors is not known in advance. In practice, a policy search algorithm runs the system with the current policy to generate trajectories made of several state and action steps and gets the utility as a return. This approach is called black box optimization (BBO). BBO algorithms receive the outcome of running the system as a set of samples. Actually, there are two possible kinds of samples: samples corresponding to a single step of the system, that we call step-samples later on, and samples corresponding to a complete trajectory, that we call episodic-samples. The observed utility of these samples can then be used to choose a better policy, and the process is repeated until some satisfactory set of behaviors is found. In general, policies are represented with a parametrized function, and policy search explores the space of policy parameters.

The main limitation of policy search is that, if policies are executed on a real robot, evaluating the policy is costly, mainly because it takes time and leads to wear and tear for the robot. For this reason, policy search methods for robotics should optimize the policy whilst minimizing the number of policy executions required to do so. A policy search method that achieves the same improvement with fewer policy executions in comparison to another method is more sample efficient.

The cost of processing the samples is often negligible with respect to the cost of running the system. If this is so, one may collect some samples from a few experiments and then process them off-line (i.e. without running the system again) for a potentially long duration (up to hours, days, or even weeks), so as to improve its behavior. So the processing cost of an off-line policy search method may not matter much, whereas its scalability to large spaces does, because one cannot afford a method that would require months or years to process a high-dimensional data set.

1.1 Scope and Contributions

This paper provides a review of policy search methods, under the specific constraints of sample efficiency outlined above. More precisely, we scrutinize algorithms under the perspective of the use they make of collected samples. We consider two aspects: 1) data efficiency: extracting more information from available data (definition taken from ( , )); 2) sample reuse: being able to improve a policy several times by using the same samples more than once, which is also known as experience replay.

Furthermore, we focus on the case where the behavior expected from a system has a well-defined end point or duration, called the episodic policy search problem, and where the system is learning to solve a single task. That is, we do not cover the broader domain of lifelong learning, where a robot must learn how to perform various tasks over a potentially infinite horizon ( , ).

Additionally, though a subset of policy search methods are based on RL, we do not cover recent work on RL with discrete actions such as dqn and some of its successors ( , ). Finally, we restrict ourselves to the case where samples are the unique source of information for improving the policy. That is, we do not consider the interactive context where a human user can provide external guidance ( , ), either through feedback, shaping or demonstration ( , ).

Three surveys about policy search for robotics were published a few years ago ( , ). With respect to these previous surveys, we make the following contributions:

1. We focus on sample efficiency aspects, with the general ambition to explain which classes of algorithms are the most sample efficient and why.

2. We cover a broader range of episodic policy search algorithms, including optimization without a utility model, Bayesian optimization (BO) and deep RL, which are currently the subject of intensive research, and we bring them into a unifying perspective that gives rise to a more didactical understanding of the various families of algorithms. In particular, we cover more than 15 additional algorithms, most of which are more recent than ( , ), as summarized in Tables , page and , page . The counterpart of this breadth is that we cannot give a detailed account of these algorithms or of their mathematical derivation. We rather investigate the elementary optimization, exploration and model learning processes at the roots of these methods, and refer the reader to ( , ) for the mathematical derivation and description of many algorithms, to ( , ) for a survey of regression, and to ( , ) for a detailed presentation of natural gradient concepts playing an important role in the domain.

1.2 Perspective of the Review

We consider the distinction between episodic-samples and step-samples as crucial for our perspective. This distinction exactly matches the phylogenetic RL versus ontogenetic RL distinction in ( , ).

As a consequence of our focus on sample efficiency, this survey is structured as depicted in Figure .

Figure 1: Simplified classification of the algorithms covered in the paper. Algorithms not covered in ( , ) are in blue. From left to right, algorithms are classified in increasing order of sample efficiency.

As stated above, policy search is an instance of BBO. The most efficient approaches to optimization, based on convexity or closed-form computation of the optimum, generally require assumptions that are too restrictive to be applied to BBO ( , ). The methods compatible with the BBO context are derivative-based optimization, which needs the analytical form of a differentiable utility function; optimization without a utility model, which only requires smoothness from that function; and random search, which does not require anything but is generally inefficient.

The most natural optimization approach to implement policy search would consist in performing derivative-based optimization on the utility function in the policy parameter space (see e.g. ( , )). However, in BBO, the analytical form of this function is generally not available.

Given this difficulty, we consider four solutions:

1. using optimization without a utility model (Section ),

2. learning a surrogate model of the utility function in the space of policy parameters and performing analytical or derivative-based optimization using this model (Section ),

3. learning a surrogate model of the utility function in the state-action space, called a critic, and doing the same as Solution (Section ),

4. learning a forward model of the system-environment interaction that predicts the next state given the current state and action, to generate samples without using the system, and then applying one of the above solutions based on the generated samples (Section ).

The first two approaches are based on episodic-samples, the third is based on step-samples and the fourth can be applied to both.


1.3 Messages of the Review

From this perspective, our main messages are the following:

1. A sample can only be reused to improve a model. As a consequence, optimization without a utility model does not generally give rise to sample reuse.

2. The relative sample efficiency of learning a surrogate model of utility in the policy parameter space versus learning a critic mostly depends on the size and structure of the corresponding spaces, but the latter uses step-samples, which is inherently more sample efficient and offers more opportunities for sample reuse.

3. Learning a critic offers the opportunity to learn useful intermediate representations.

4. On-line learning is generally faster than off-line learning, but it is also more unstable.

5. Learning a forward model is not enough to improve sample efficiency when using episodic-samples, because it does not provide the estimated utility of the generated samples. A more complete simulator providing this estimated utility over episodes is required.

6. In contrast with using episodic-samples, learning a forward model can improve sample efficiency when using step-samples, because the immediate utility of step-samples is easily accessible.

From these messages, it appears that the higher sample efficiency of deep RL methods results from several mechanisms: they use derivative-based optimization updates, they model the utility function in the state-action space, and they can be combined with a forward model. In addition, they benefit from massive sample reuse using a replay buffer and they can perform on-line learning. However, research is still very active in finding a way to manage the trade-off between bias and variance so as to efficiently control their intrinsic instability.

1.4 Structure of the Review

To explain these various points, the paper is organised as follows. In Section , we formally define the episodic policy search problem, the notations and the main related concepts. In Section , we establish the taxonomy of methods depicted in Figure , based on elementary processes that play an important role in many BBO methods, namely regression and optimization, without going down to the level of algorithms. In Sections , and , we show how these methods are implemented in various episodic policy search algorithms. Then, in Section , we discuss the different elementary design choices that matter in terms of sample efficiency and reuse. Finally, Section summarizes the paper and provides some perspectives about current trends in the domain.

2 Episodic policy search

This section introduces the general episodic policy search framework and various ways to compute utility with formal notations. These notations are standard; readers familiar with RL ( , ) or episodic policy search ( , ) can skip this section.

2.1 System and environment interaction

We consider a system, such as a robot or a software agent, in interaction with its environment. Since the agent is learning with a computer, we consider an interaction along discrete time steps, as suggested in ( , ). This interaction is fully characterized by its current state $\mathbf{x}_k \in X$. [Footnote 1: Throughout the paper, we denote scalars as lowercase symbols ($x$), vectors as bold lowercase symbols ($\mathbf{x}$) and matrices as bold uppercase symbols ($\mathbf{X}$). The time index is always $k$ and the iteration index is always $i$. $\langle \cdot, \cdot \rangle$ denotes a pair.]

At each time step $k$, the system receives state information $\mathbf{x}_k \in X$ and performs an action $\mathbf{u}_k \in U$ specified by a stochastic policy $\pi_{\boldsymbol{\theta}}$, where $\boldsymbol{\theta}$ is a set of policy parameters taken from the policy parameter space $\Theta$. In a closed loop policy, $\pi_{\boldsymbol{\theta}}$ is a stochastic function of the states $\mathbf{x}_k$, that is $\mathbf{u}_k \sim \pi_{\boldsymbol{\theta}}(\mathbf{u}_k | \mathbf{x}_k)$, which gives the probability of choosing action $\mathbf{u}_k$ given state $\mathbf{x}_k$. In the open loop case, it is rather a function of the time steps $k$, that is $\mathbf{u}_k \sim \pi_{\boldsymbol{\theta}}(\mathbf{u}_k | k)$. Finally, a deterministic policy is a particular case where the probability is 1 for a specific action and 0 for all other actions.

The outcome of the action depends on the state and consists of two pieces of information: the next state $\mathbf{x}_{k+1} \sim p(\mathbf{x}_{k+1} | \mathbf{x}_k, \mathbf{u}_k)$, which is also a stochastic function of the current state and action, and an immediate utility $j_k = j(\mathbf{x}_k, \mathbf{u}_k)$.

The aim of policy search is to optimize the policy parameters $\boldsymbol{\theta}$ with respect to some aggregation of the immediate utilities $j_k$ over trajectories of the system. The immediate utility function can be a scalar, or a vector $\mathbf{j}_k = \mathbf{j}(\mathbf{x}_k, \mathbf{u}_k)$ in the multi-objective case. In the latter case, secondary objectives may play a role in improving the sample efficiency of some policy search methods (we do not cover this topic here, see ( , ) for a survey).

To summarise, a step of the system-environment interaction from some state $\mathbf{x}_k$ can be specified as in Algorithm , where “(E) $\rightarrow$” denotes information provided by the environment and “(S) $\rightarrow$” denotes information provided by the system.

Algorithm 1: perform_step($\mathbf{x}_k, \boldsymbol{\theta}$)

Require: $\mathbf{x}_k$: current state, $\boldsymbol{\theta}$: policy parameters
(S) $\rightarrow$ $\mathbf{u}_k \sim \pi_{\boldsymbol{\theta}}(\mathbf{u}_k | \mathbf{x}_k)$ or $\mathbf{u}_k \sim \pi_{\boldsymbol{\theta}}(\mathbf{u}_k | k)$
(E) $\rightarrow$ $j_k = j(\mathbf{x}_k, \mathbf{u}_k)$
(E) $\rightarrow$ $\mathbf{x}_{k+1} \sim p(\mathbf{x}_{k+1} | \mathbf{x}_k, \mathbf{u}_k)$
return $s_k = \langle \mathbf{x}_k, \mathbf{u}_k, j_k, \mathbf{x}_{k+1} \rangle$

The steps of the system in its environment generate a set of samples $s_k = \langle \mathbf{x}_k, \mathbf{u}_k, j_k, \mathbf{x}_{k+1} \rangle$ that we call step-samples throughout the paper.
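To make the structure of a step-sample concrete, the following minimal Python sketch mimics one interaction step on a one-dimensional toy system. The Gaussian closed-loop policy, the quadratic immediate utility and the additive transition noise are all hypothetical choices made for illustration; they are not taken from any algorithm surveyed here.

```python
import random
from typing import NamedTuple


class StepSample(NamedTuple):
    """One step-sample <x_k, u_k, j_k, x_{k+1}>."""
    x: float
    u: float
    j: float
    x_next: float


def perform_step(x_k: float, theta: float, rng: random.Random) -> StepSample:
    # System: draw the action from a closed-loop Gaussian policy pi_theta(u | x)
    # (hypothetical policy: mean -theta * x, fixed standard deviation).
    u_k = rng.gauss(-theta * x_k, 0.1)
    # Environment: immediate utility j(x_k, u_k) (hypothetical quadratic cost).
    j_k = -(x_k ** 2) - 0.1 * u_k ** 2
    # Environment: stochastic transition p(x_{k+1} | x_k, u_k) (hypothetical).
    x_next = x_k + u_k + rng.gauss(0.0, 0.01)
    return StepSample(x_k, u_k, j_k, x_next)


sample = perform_step(x_k=1.0, theta=0.5, rng=random.Random(0))
print(sample)
```

The system contributes only the action, while the environment contributes both the immediate utility and the next state, which is the division of roles expressed by the (S) and (E) markers.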

2.2 Episodic policy search

In the episodic context, the system-environment interaction is initialized in some starting state $\mathbf{x}_0$ and the current policy is applied until the system reaches one of $m$ predefined final states $\mathbf{x}_d^f$, $d \in \{1, \ldots, m\}$, $m \geq 1$, or some amount of time $k_{max}$ has elapsed. Predefined final states can be either goal states to be attained or destructive states that should be avoided. When $k_{max}$ has elapsed or a final state is reached, the system stops in state $\mathbf{x}_f$ at step $k_f$. The system-environment interaction between $\mathbf{x}_0$ and $\mathbf{x}_f$ is called an episode.

We call trajectory the set of step-samples obtained along episode $e$ and we denote it with $\tau_e$.

2.3 Utilities

Three notions of utility are useful in episodic policy search: the utility over a single trajectory, the utility over a set of trajectories from the same state, which corresponds to the Monte Carlo return from that state, and the utility over all states.

2.3.1 Utility over a trajectory

In the language of episodic policy search algorithms, running a policy to get the utility over a trajectory $J(\tau)$ is often called performing a rollout. A rollout together with its utility is what we call an episodic-sample later on.
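A rollout can be sketched as a loop that stops when a final state is reached or when the step budget is exhausted. In the sketch below, the function names, the toy dynamics and the choice of a plain sum of immediate utilities as the aggregation $J(\tau)$ are assumptions made for illustration; other aggregations are possible.

```python
import random
from typing import Callable, List, Tuple

Step = Tuple[float, float, float, float]  # <x_k, u_k, j_k, x_{k+1}>


def rollout(step: Callable[[float], Step], x_0: float,
            is_final: Callable[[float], bool], k_max: int) -> List[Step]:
    """Run one episode and return the trajectory tau as a list of step-samples."""
    tau: List[Step] = []
    x = x_0
    for _ in range(k_max):
        sample = step(x)
        tau.append(sample)
        x = sample[3]
        if is_final(x):  # a predefined final state has been reached
            break
    return tau


def trajectory_utility(tau: List[Step]) -> float:
    """One possible aggregation J(tau): the plain sum of immediate utilities."""
    return sum(j for _, _, j, _ in tau)


# Hypothetical 1-D system: a noisy policy pushes the state towards 0.
rng = random.Random(0)


def toy_step(x: float) -> Step:
    u = -0.5 * x + rng.gauss(0.0, 0.05)
    return (x, u, -x * x, x + u)


tau = rollout(toy_step, x_0=1.0, is_final=lambda x: abs(x) < 0.05, k_max=50)
print(len(tau), trajectory_utility(tau))
```

The pair made of `tau` and `trajectory_utility(tau)` corresponds to one episodic-sample, whereas each element of `tau` is one step-sample.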

In some settings, the utility is considered null all along the trajectory and evaluated only at the final state. In particular, this is the case in many evolutionary algorithms, where the utility provided by the environment over an episode is called the fitness function and is not necessarily a function of states and actions along the episode ( , ).

2.3.2 Monte Carlo return

The system-environment interaction being stochastic, several episodes starting from the same initial state $\mathbf{x}_0$ using the same policy can give rise to different trajectories and utilities. As a consequence, one should consider the expected utility over all possible trajectories $\tau$ starting from $\mathbf{x}_0$, denoted $\tau_{\mathbf{x}_0}$. We write the corresponding expected utility as $\bar{J}(\tau_{\mathbf{x}_0}) = \mathbb{E}_{\tau}\{J(\tau_{\mathbf{x}_0})\}$. This expected utility being defined over an infinite set of episodes, in practice it needs to be approximated.

Methods that evaluate a policy just by sampling the utility over a large enough set of rollouts are called Monte Carlo methods. They can also be used to estimate the utility at any state along a trajectory to learn a critic, as covered in Section . The corresponding measure of utility comes with some variance. Averaging over enough trajectories is required to provide an accurate estimate of the return.
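The effect of averaging can be illustrated with a short Python sketch. The noisy rollout below is a hypothetical stand-in for running the system from a fixed initial state (true expected utility $-1$, Gaussian noise); only the averaging logic is the point of the example.

```python
import random
import statistics
from typing import Callable, Tuple


def monte_carlo_return(sample_utility: Callable[[], float],
                       n_rollouts: int) -> Tuple[float, float]:
    """Estimate the expected utility and the standard error of that estimate."""
    utilities = [sample_utility() for _ in range(n_rollouts)]
    mean = statistics.fmean(utilities)
    std_err = statistics.stdev(utilities) / n_rollouts ** 0.5
    return mean, std_err


# Hypothetical noisy rollout: true expected utility -1.0, noise std 0.5.
rng = random.Random(0)


def noisy_rollout() -> float:
    return -1.0 + rng.gauss(0.0, 0.5)


for n in (10, 1000):
    mean, std_err = monte_carlo_return(noisy_rollout, n)
    print(n, round(mean, 3), round(std_err, 3))
```

The standard error shrinks roughly as the inverse square root of the number of rollouts, which is why accurate Monte Carlo estimates are expensive in terms of samples.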

2.3.3 Global utility

The most general goal of policy search is to find an optimal policy starting from any initial state $\mathbf{x}_0$. This can be defined as optimizing

$$J(\boldsymbol{\theta}) = \int_{\mathbf{x}_0 \in X} \bar{J}(\tau_{\mathbf{x}_0}) \, d\mathbf{x}_0 \tag{1}$$

Here again, computing this utility would require generating trajectories starting from an infinite number of initial states $\mathbf{x}_0$. In practice, one can only approximate it using a finite set of initial states. Thus an estimate of the global utility of some policy parameters $\boldsymbol{\theta}$ has two sources of variance, one resulting from the sampling of initial states $\mathbf{x}_0$ and one resulting from the sampling of trajectories for a given $\mathbf{x}_0$. For some problems, one can train the system from episodes that all start from the same $\mathbf{x}_0$. With this option, it is not guaranteed that the policy will improve for initial states other than $\mathbf{x}_0$, thus it can only be applied if the system is later exploited from the same $\mathbf{x}_0$. In the other option, one starts from a finite set of initial states. This has a wider range of applications but results in larger variance.

As shown in ( , ), the global utility can also be written

$$J(\boldsymbol{\theta}) = \int_{\tau \in T} p_{\boldsymbol{\theta}}(\tau) \, J(\tau) \, d\tau \tag{2}$$

where $T$ is the space of all possible trajectories and $p_{\boldsymbol{\theta}}(\tau)$ defines the probability of performing trajectory $\tau$ under policy $\pi_{\boldsymbol{\theta}}$. This formulation is used in episodic policy search methods using step-samples, as covered in Section .

Furthermore, an infinite horizon context can even be covered, provided that the system can be run for a number of steps and that some utility can be obtained from running these steps.

2.4 Model-based versus model-free episodic policy search

A forward model is a model of the system-environment interaction which, given the current state and the current action, outputs a distribution of probabilities over potential next states, or just the most likely one. Using a forward model to accelerate learning of a critic is called model-based RL ( , ).

The forward model can be learned from $(\mathbf{x}_k, \mathbf{u}_k, \mathbf{x}_{k+1})$ samples as a function $\hat{g}$ using any regression algorithm described in Section , such that $\hat{\mathbf{x}}_{k+1} \sim \hat{g}(\mathbf{x}_{k+1} | \mathbf{x}_k, \mathbf{u}_k)$ or $\hat{\mathbf{x}}_{k+1} = \hat{g}(\mathbf{x}_k, \mathbf{u}_k)$. This model can then be used to generate synthetic trajectories $\tau$.

However, in order to use such synthetic samples in episodic policy search, the corresponding utility is also required. This can be obtained by learning through regression either a model of the immediate utility function $\hat{j}_k = \hat{j}(\mathbf{x}_k, \mathbf{u}_k)$ using $(\mathbf{x}_k, \mathbf{u}_k, j_k)$ samples, or a model of utility over an episode $\hat{J}(\tau)$. The former suffers from less variance than the latter, resulting in more use of model-based updates in methods using step-samples than in those using episodic-samples.
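As a rough illustration of the deterministic case, the sketch below fits a linear forward model by least squares and uses it to generate a short synthetic trajectory. The linear dynamics, the noise level and the fixed policy are hypothetical, and no utility model is learned here, so the synthetic samples would still lack the utility information discussed above.

```python
import numpy as np

rng = np.random.default_rng(0)


# Hypothetical true dynamics, unknown to the learner:
# x_{k+1} = 0.9 x_k + 0.5 u_k + noise.
def true_step(x, u):
    return 0.9 * x + 0.5 * u + rng.normal(0.0, 0.01)


# Collect (x_k, u_k, x_{k+1}) samples from the "real" system.
xs = rng.uniform(-1.0, 1.0, size=200)
us = rng.uniform(-1.0, 1.0, size=200)
x_next = np.array([true_step(x, u) for x, u in zip(xs, us)])

# Fit a deterministic linear forward model g_hat(x, u) = a x + b u.
features = np.column_stack([xs, us])
(a_hat, b_hat), *_ = np.linalg.lstsq(features, x_next, rcond=None)


def g_hat(x, u):
    return a_hat * x + b_hat * u


# Generate a synthetic trajectory without running the real system.
x, synthetic = 1.0, []
for k in range(5):
    u = -0.5 * x  # hypothetical fixed policy
    x_new = g_hat(x, u)
    synthetic.append((x, u, x_new))
    x = x_new
print(a_hat, b_hat)
```

Any regression method could replace the least-squares fit; the linear form is chosen only because it keeps the example short.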

In principle, using synthetic samples can drastically improve the sample efficiency of episodic policy search methods. However, model-based episodic policy search suffers from inaccuracies in the models $\hat{g}$, $\hat{j}$ and/or $\hat{J}$, the inaccuracies themselves resulting from incomplete exploration and the intrinsic variance of the estimation process. As a consequence, these methods must be used with care. This topic is well covered in ( , ); we do not expand further on it here.

Message 1: Learning a forward model can drastically improve the sample efficiency of episodic policy search methods, but it must be manipulated with care. It works better with methods using step-samples than methods using episodic-samples.


A recent state-of-the-art presentation of model-based episodic policy search methods can be found in ( , ), where the black-DROPS algorithm is shown to be more data efficient than the standard pilco algorithm ( , ).

In deep RL, learning a forward model through regression may not be necessary, as a replay buffer seen as a “sample generator” can be seen as a specific kind of forward model ( , ). This insight is used in svg, which offers a continuum between model-free and model-based methods ( , ). More standard model-based acceleration is also used in naf on top of its model-free mechanisms ( , ).

3 Sample efficiency factors in BBO

The role of this section is to establish the taxonomy of methods depicted in Figure . To do so, we present elementary processes that play an important role in many BBO methods: regression and optimization. We do this independently from the episodic policy search context itself, thus we consider a general unknown function $f$, often called the latent function, and some input samples $\boldsymbol{\phi}$ taken from a space $\Phi$, without specifying the nature of $f$ or $\boldsymbol{\phi}$ at this point. At this level of analysis, it is already possible to put forward some messages about sample efficiency in BBO. The corresponding policy search algorithms are then presented in the next sections.

The goal of optimization is to find the optimum of $f$ over $Φ$ , that is to find

$$\boldsymbol{\phi}^* = \operatorname*{argmax}_{\boldsymbol{\phi} \in \Phi} f(\boldsymbol{\phi}).$$

[Footnote 2: Throughout the paper, we consider maximization; it would be $\operatorname{argmin}$ in the case of minimization.]

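In BBO, some estimate $\hat{f}(\boldsymbol{\phi})$ of $f$ at $\boldsymbol{\phi}$ must be obtained by sampling, that is by choosing a value for $\boldsymbol{\phi}$ and asking the system to return $\hat{f}(\boldsymbol{\phi})$. A sample is the corresponding $\langle \boldsymbol{\phi}, \hat{f}(\boldsymbol{\phi}) \rangle$ pair.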

Since $f$ is not accessible in closed form, finding the optimum over $f$ cannot be performed analytically and the algorithms generally run iterations until some termination criterion is met. In this context, being sample efficient in BBO means going as close as possible to an optimum using as few samples as possible. The key question is “In what area does the optimum lie?”, so that an optimum can be found quickly by sampling that area.

Various optimization and regression processes can be used to answer this question. Regression and optimization are often intertwined since, on the one hand, regression is the minimization of a loss function and, on the other hand, the sample efficiency of optimization can be improved by learning a surrogate model with regression. To limit cross-references, we start with optimization without a utility model and optimization using an analytical model without regression, then we present regression and finally we describe optimization with a surrogate utility model, which corresponds to the richest family and the most sample efficient methods.

3.1 Exploration policies in BBO

To take a higher level perspective about the above methods, we introduce a useful notion of exploration policy which is called “upper-level policy” in ( , ).

Interestingly, when the search space is the space of policy parameters $Θ$ , the exploration policy is the policy search method itself.

Under the same perspective, derivative-based policy search is performing greedy policy improvement from a model (thus greedy moves in the $\Theta$ space), which makes it more straightforward and potentially more sample efficient, but also more prone to premature convergence to local optima. This sensitivity can result in instability when greedy exploration is performed on a poor surrogate model.

Finally, exploration in BO methods is a richer and more expensive Bayesian inference process which can simultaneously optimize the policy and explore regions of large uncertainty, giving rise to a specific type of active learning.

3.2 Optimization without a utility model

Black box optimization consists in guessing where the optimum lies. In the absence of a model of utility, the guess can be purely random, as in random search. It can also be based on the assumption that the utility function shows some regularity which can be exploited through an implicit model given a memory of the previous samples. This is the case of population-based optimization, where it is assumed that the optimum should be close to the currently found best sample. Search is then performed by getting random samples around the current best samples. Random search, population-based optimization and a third method called finite differences are presented in Section .
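The idea of sampling around the current best sample can be sketched in a few lines of Python. This is a deliberately simplified elitist scheme written for illustration, not one of the population-based algorithms surveyed later; the function name, the fixed Gaussian perturbation and the test utility are all hypothetical.

```python
import random
from typing import Callable, List


def population_search(f: Callable[[List[float]], float], phi_0: List[float],
                      n_iterations: int = 50, population: int = 10,
                      sigma: float = 0.3, seed: int = 0) -> List[float]:
    """Sample a population around the current best point and keep the best."""
    rng = random.Random(seed)
    best, best_value = list(phi_0), f(phi_0)
    for _ in range(n_iterations):
        candidates = [[p + rng.gauss(0.0, sigma) for p in best]
                      for _ in range(population)]
        # Each evaluation of f yields one sample <phi, f(phi)>.
        values = [f(c) for c in candidates]
        i_best = max(range(population), key=values.__getitem__)
        if values[i_best] > best_value:
            best, best_value = candidates[i_best], values[i_best]
    return best


# Hypothetical black-box utility with its maximum at (1, -2).
def utility(phi: List[float]) -> float:
    return -((phi[0] - 1.0) ** 2 + (phi[1] + 2.0) ** 2)


print(population_search(utility, [0.0, 0.0]))
```

Only the best sample is remembered from one iteration to the next: all other evaluations are discarded, which illustrates why such methods offer little opportunity for sample reuse.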

3.3 Optimization with a model

Though the latent function is not analytically available in BBO, model-based optimization can be applied to a surrogate model that is available in analytical form (see Section ).

Below we distinguish two approaches: analytical optimization, and derivative-based optimization. In the former, the optimum is obtained through formal calculus whereas in the latter, it is found through numerical iterations. In contrast to optimization without a utility model, derivative-based optimization methods include no random search component, thus they perform greedy optimization. We distinguish two such methods: vanilla gradient optimization and natural gradient optimization.

3.3.1 Analytical optimization

Analytical optimization may be possible when the function to be optimized is available in closed form. It generally relies on the fact that the derivative of a function is null at its optima. Finding these optima can be done analytically for some functions (e.g. quadratic and Gaussian functions).
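As a worked example of the quadratic case, assume a concave quadratic function defined by a symmetric positive definite matrix $\mathbf{A}$ and a vector $\mathbf{b}$ (both hypothetical symbols introduced only for this illustration). Setting the gradient to zero gives the optimum in a single step, without any iteration:

```latex
f(\boldsymbol{\phi}) = -\tfrac{1}{2}\,\boldsymbol{\phi}^\top \mathbf{A}\,\boldsymbol{\phi} + \mathbf{b}^\top \boldsymbol{\phi},
\qquad
\nabla_{\boldsymbol{\phi}} f = -\mathbf{A}\,\boldsymbol{\phi} + \mathbf{b} = \mathbf{0}
\;\Longrightarrow\;
\boldsymbol{\phi}^* = \mathbf{A}^{-1}\mathbf{b}.
```

Since $\mathbf{A}$ is positive definite, the Hessian $-\mathbf{A}$ is negative definite and $\boldsymbol{\phi}^*$ is the unique maximum.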

Message 2: Derivative-based optimization is iterative whereas analytical optimization is not.

3.3.2 Vanilla gradient optimization

Vanilla gradient optimization looks for a local optimum of $f$ using its analytical derivative.

Figure 2: In derivative-based optimization, search starts from an initial parameter vector $\boldsymbol{\phi}_0$ and converges to a local optimum $\boldsymbol{\phi}^*$ by following the gradient of the function to be optimized.

It computes the gradient of $f$ at $\boldsymbol{\phi}$ as the vector of its partial derivatives in all dimensions. This vector is tangent to the function at this point (see Figure ) and its length is controlled by a parameter called step size.

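We denote by $\nabla_{\boldsymbol{\phi}} f = \frac{\partial f(\boldsymbol{\phi})}{\partial \boldsymbol{\phi}}$ the gradient of $f$ with respect to $\boldsymbol{\phi}$ and by $\nabla_{\boldsymbol{\phi}} f |_{\boldsymbol{\phi} = \boldsymbol{\phi}_i}$ the value of this gradient at $\boldsymbol{\phi}_i$.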

Given the previous sample $\boldsymbol{\phi}_i$, vanilla gradient optimization chooses the next sample $\boldsymbol{\phi}_{i+1}$ according to Algorithm , where $\alpha_i$ is the step size at iteration $i$.

Algorithm 2: vgo($\boldsymbol{\phi}_i, f$): One iteration of vanilla gradient optimization

Require: $\boldsymbol{\phi}_i$: current best guess, $f$: latent function
$\boldsymbol{\phi}_{i+1} = \boldsymbol{\phi}_i + \alpha_i \, \nabla_{\boldsymbol{\phi}} f |_{\boldsymbol{\phi} = \boldsymbol{\phi}_i}$
return $\boldsymbol{\phi}_{i+1}$

The value of $\alpha_i$ is critical: if it is taken too small, the algorithm converges too slowly. If it is taken too large, the algorithm may jump out of the local hill or bounce around the optimum. In practice, $\alpha_i$ should be large in the beginning of the optimization process and get smaller as the current best guess gets closer to the optimum.
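The update rule and the role of a decreasing step size can be sketched as follows. The test function, its gradient and the particular decay schedule for $\alpha_i$ are hypothetical choices made for illustration; many other schedules are used in practice.

```python
from typing import Callable, List


def vgo(phi_i: List[float], grad_f: Callable[[List[float]], List[float]],
        alpha_i: float) -> List[float]:
    """One iteration: phi_{i+1} = phi_i + alpha_i * grad f(phi_i) (maximization)."""
    return [p + alpha_i * g for p, g in zip(phi_i, grad_f(phi_i))]


# Hypothetical differentiable function
# f(phi) = -(phi_0 - 1)^2 - 4 (phi_1 + 2)^2, with its maximum at (1, -2).
def grad_f(phi: List[float]) -> List[float]:
    return [-2.0 * (phi[0] - 1.0), -8.0 * (phi[1] + 2.0)]


phi = [0.0, 0.0]
for i in range(100):
    alpha_i = 0.2 / (1.0 + 0.05 * i)  # assumed schedule: large first, then smaller
    phi = vgo(phi, grad_f, alpha_i)
print(phi)
```

With this function, a constant step size above 0.25 would make the second coordinate diverge, which is the "jumping out of the local hill" behaviour mentioned above.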

3.3.3 Natural gradient optimization

Vanilla gradient optimization is fine as long as the input samples $\boldsymbol{\phi}$ are defined in a Euclidean space $\Phi$. When $\boldsymbol{\phi}$ is projected to $\tilde{\boldsymbol{\phi}}$ in a non-Euclidean space $\tilde{\Phi}$, the natural gradient is defined as

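$$\tilde{\nabla}_{\tilde{\boldsymbol{\phi}}} \tilde{f}(\tilde{\boldsymbol{\phi}}) = \mathbf{G}^{-1}(\tilde{\boldsymbol{\phi}}) \, \nabla_{\tilde{\boldsymbol{\phi}}} \tilde{f}(\tilde{\boldsymbol{\phi}}) \tag{3}$$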

where $\mathbf{G}$ is a positive definite matrix characterizing the local curvature of $\tilde{\Phi}$. In the context of policy search methods, $\mathbf{G}$ is known as the Fisher Information Matrix and denoted $\mathbf{F}$. Natural gradient optimization has several beneficial properties that make it more data efficient than vanilla gradient optimization. The counterpart is that computing $\mathbf{F}^{-1}$ can be demanding. However, in the context of policy search, there are several ways to approximate the natural gradient without computing $\mathbf{F}^{-1}$. All these aspects are covered in detail in ( , ).
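The effect of preconditioning the gradient with $\mathbf{G}^{-1}$ can be illustrated on a badly scaled quadratic. In the sketch below the metric is simply set to the matrix $\mathbf{A}$ defining the quadratic, an assumption chosen to make the effect visible; it is not the Fisher Information Matrix of any policy, and the step sizes are hypothetical.

```python
import numpy as np


def natural_gradient_step(phi, grad, G, alpha):
    """Return phi + alpha * G^{-1} grad, using a linear solve, not an explicit inverse."""
    return phi + alpha * np.linalg.solve(G, grad)


# Hypothetical badly scaled function f(phi) = -0.5 * phi^T A phi, maximum at the origin.
A = np.array([[100.0, 0.0], [0.0, 1.0]])


def grad_f(phi):
    return -A @ phi


phi_vanilla = np.array([1.0, 1.0])
phi_natural = np.array([1.0, 1.0])
for _ in range(20):
    # Vanilla step: the step size is limited by the steepest direction.
    phi_vanilla = phi_vanilla + 0.01 * grad_f(phi_vanilla)
    # Preconditioned step with the assumed metric G = A.
    phi_natural = natural_gradient_step(phi_natural, grad_f(phi_natural), A, 0.5)
print(np.linalg.norm(phi_vanilla), np.linalg.norm(phi_natural))
```

The vanilla update makes slow progress along the flat direction, whereas the preconditioned update contracts both directions at the same rate, at the price of solving a linear system at every iteration.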

Message 3: Natural gradient optimization is computationally more intensive, but more data efficient than vanilla gradient optimization.

3.4 Regression

Parametric regression is covered in detail in ( , ); here we just present the necessary concepts for our paper to be self-contained. Given a model $\hat{f}_{\boldsymbol{\omega}}$ parameterized by $\boldsymbol{\omega} \in \Omega$, the goal of parametric regression is to find the value of $\boldsymbol{\omega}$ that optimizes a loss function, that is

$$\boldsymbol{\omega}^* = \operatorname*{argmin}_{\boldsymbol{\omega} \in \Omega} \mathit{Loss}(f, \hat{f}_{\boldsymbol{\omega}}).$$

The simplest case for regression is when the model $\hat{f}_\omega$ is a linear function of the input. When it is not, one often defines a set of fixed feature functions and the model is a linear combination of these features, where $\omega$ are the weights. This defines a linear architecture. In some cases, the feature functions also contain parameters which are tuned by the regression process. This is the case of deep neural networks, for instance. This is also the case when the model is a single Gaussian function, as used in EDAs (see Section ), where the updated parameters are the mean and the covariance of the feature function.

Given a set of $n_{ts}$ training samples $\langle \phi_j, f(\phi_j) \rangle$, $j \in \{1, \dots, n_{ts}\}$, the regression error is generally defined as

$$\epsilon(\omega) = \sum_{j=1}^{n_{ts}} \mathrm{Loss}\big(f(\phi_j), \hat{f}_\omega(\phi_j)\big), \tag{4}$$

where $\mathrm{Loss}$ can for instance be the squared error $\big(f(\phi_j) - \hat{f}_\omega(\phi_j)\big)^2$.

The values of $f(\phi_j)$ in () being known constants, the regression error $\epsilon(\omega)$ is an analytical function of $\omega$. In that context, minimizing the regression error can be performed analytically or through gradient descent, giving rise to batch and incremental regression respectively.

3.4.1 Batch regression

Batch regression is the analytical minimization of the regression error. Though the optimum is given in closed form, it is a function of all the training samples, and computing this function can be computationally intensive, despite being analytical. This is the case in batch regression over a linear architecture when minimizing a squared error, for instance, where analytical resolution consists of a matrix inversion (see ( , ) for details). By contrast, this is not the case for the single Gaussian feature function, where determining the mean and covariance from training samples is straightforward.
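The two cases can be sketched as follows, under simplifying assumptions: a one-dimensional input, a model $\hat{f}_\omega(\phi) = \omega_0 + \omega_1 \phi$ whose 2x2 normal equations are solved by hand, and a one-dimensional Gaussian. The function names are hypothetical.

```python
def batch_linear_regression(samples):
    """Closed-form least squares for the model f_hat(phi) = w0 + w1 * phi.

    The 2x2 normal equations are solved explicitly, which stands in for the
    matrix inversion needed with a general linear architecture.
    """
    n = len(samples)
    s_p = sum(p for p, _ in samples)
    s_y = sum(y for _, y in samples)
    s_pp = sum(p * p for p, _ in samples)
    s_py = sum(p * y for p, y in samples)
    det = n * s_pp - s_p * s_p  # determinant of the normal matrix
    w1 = (n * s_py - s_p * s_y) / det
    w0 = (s_y - w1 * s_p) / n
    return w0, w1


def fit_gaussian(points):
    """Mean and variance of a single 1-D Gaussian fitted to the points."""
    n = len(points)
    mean = sum(points) / n
    var = sum((x - mean) ** 2 for x in points) / n
    return mean, var


# Noise-free samples of the assumed target f(phi) = 3 + 2 * phi
samples = [(float(p), 3.0 + 2.0 * p) for p in range(-3, 4)]
w0, w1 = batch_linear_regression(samples)
mean, var = fit_gaussian([p for p, _ in samples])
```

Both functions use all samples at once and return the same answer whenever they are called on the same data.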

3.4.2 Incremental regression

Incremental regression is the minimization of the regression error by updating $\hat{f}_\omega$ over iterations. It can be implemented by performing derivative-based optimization over $\epsilon(\omega)$, for instance. It can benefit from the work done at previous iterations to improve the model $\hat{f}_\omega$ at a lower cost than batch regression. It is used for instance in the “back-propagation of the gradient” algorithm to train deep neural networks.

Message 4: Incremental regression is generally less computationally intensive than batch regression.

Message 5: Back-propagation of the gradient used to train deep neural networks is incremental regression by performing derivative-based optimization over the network.

3.4.3 Sample reuse in regression

Batch and incremental regression can be called over several iterations in a loop and can use a different set of samples each time. When doing so, they can use as input either samples which were never used before, or samples already used in a previous iteration. The latter case defines sample reuse.

But using batch regression several times with the same samples provides an identical model each time. By contrast, when doing the same with incremental regression, the model improves at each iteration until it eventually starts to overfit. Thus sample reuse makes more sense in combination with incremental regression.
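The contrast can be checked on a toy problem. In the sketch below, the linear model, the learning rate `lr` and the noise-free samples are assumptions made for illustration. With noise-free data the overfitting phase mentioned above does not appear.

```python
def batch_fit(samples):
    """Closed-form least squares for f_hat(phi) = w0 + w1 * phi."""
    n = len(samples)
    s_p = sum(p for p, _ in samples)
    s_y = sum(y for _, y in samples)
    s_pp = sum(p * p for p, _ in samples)
    s_py = sum(p * y for p, y in samples)
    w1 = (n * s_py - s_p * s_y) / (n * s_pp - s_p * s_p)
    return ((s_y - w1 * s_p) / n, w1)


def regression_error(w, samples):
    """Sum of squared errors of the model over the samples."""
    return sum((y - (w[0] + w[1] * p)) ** 2 for p, y in samples)


def incremental_pass(w, samples, lr=0.05):
    """One pass of per-sample gradient steps on the squared error."""
    w0, w1 = w
    for p, y in samples:
        err = y - (w0 + w1 * p)
        w0 += lr * err
        w1 += lr * err * p
    return (w0, w1)


samples = [(p / 4.0, 1.0 + 2.0 * (p / 4.0)) for p in range(-4, 5)]

# Batch regression: a second call on the same samples changes nothing.
w_batch_1 = batch_fit(samples)
w_batch_2 = batch_fit(samples)

# Incremental regression: each extra pass over the same samples still helps.
w = (0.0, 0.0)
errors = []
for _ in range(5):
    w = incremental_pass(w, samples)
    errors.append(regression_error(w, samples))
```

The list `errors` decreases from one pass to the next, while `w_batch_1` and `w_batch_2` are identical.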

Message 6: In contrast with batch regression, incremental regression methods benefit from sample reuse.

Message 7: Deep RL benefits from sample reuse because it uses incremental regression methods.

With incremental regression, there is no guarantee to get $\hat{f}_\omega(\phi_i) = f(\phi_i)$ for all samples $\langle \phi_i, f(\phi_i) \rangle$. The same is true of batch regression when using some regularization. However, the purpose of reusing a set of samples $\langle \phi_i, f(\phi_i) \rangle$ is not so much to improve the accuracy of $\hat{f}_\omega$ for the corresponding input, but to improve the accuracy of $\hat{f}_\omega$ for unseen input $\phi_j \neq \phi_i$.


3.5 Optimization with a surrogate utility model

In BBO, the analytical form of the utility function over $\Theta$ is unknown. One can improve sample efficiency over optimization without a utility model by combining two processes: 1) learning a surrogate model of the utility function and 2) finding the optimum over this surrogate function by performing analytical or derivative-based optimization on this model.

The key feature of using a surrogate model is that it provides an estimate of the utility of an unseen sample. However, the purpose of getting this estimated utility “for free” is not to remove sampling. Rather, it helps determine where to sample next, by looking for an optimum $\hat{\theta}^*$ over $\hat{J}(\theta)$ without the need for any additional sample. In addition, using $\hat{\theta}^*$ as a new sample for model learning improves the model preferentially in the area where the optimum may be.

Message 8: Using a surrogate model in the context of BBO provides an estimate of the value of a sample without having to generate that sample.

Besides, choosing where to sample next is the basis of active learning.

Message 9: Optimization using a surrogate model improves sample efficiency by implementing a basic form of active learning.

3.5.1 Optimization loops

There are two main ways to temporally organize regression and optimization into a loop: the iterative and the incremental loops.

1. In the iterative loop, a new surrogate model is computed at each iteration. Then optimization determines where to sample next, new samples are generated, a new surrogate model is computed using these new samples, and so on. This approach is more often used with the analytical resolution of regression.

2. The incremental loop is similar but, at each iteration, the surrogate model is incrementally updated rather than recomputed.

The sequential loop is a particular case of the iterative loop in which a single iteration is performed: a first model is learned, generally out of many samples in a batch way, the optimum is determined based on this model, and the process stops. Besides, one may possibly reuse samples in the iterative loop but, as stated in Message , this brings no benefit and is not computationally efficient.

In the incremental loop, a single new sample is frequently used at each iteration. In that case, there is no sample reuse. The mini-batch approach often used in deep learning methods is a counterexample: samples are reused only when such mini-batches are used.


Table 1: Classification of episodic policy search methods
Algo	Model space	Model Improv.	Policy Improv.
EDAs	$\theta$	local search	Analytical
Policy-gradient	$\theta$ or $(x, u)$	iterative recomputation	Incremental, gradient-based
Critic-only	$(x, u)$	incremental	look for max
Bayes Opt.	$\theta$	incremental	incremental
Actor-critic	$(x, u)$	incremental	incremental

In an actor-critic architecture, two derivative-based optimization processes are used, one for improving the model and one for finding the optimum policy guess on the resulting model.

3.5.2 Summary

Message 10: Optimization with a surrogate model can combine sample reuse brought by incremental regression and data efficiency brought by analytical optimization. This is the case of deep RL.

Message 11: Being greedy, vanilla and natural gradient optimization are generally more sample efficient than optimization without a utility model, but they are also less robust to surrogate model inaccuracies.

4 Policy search without a utility model


In the context of episodic policy search, the latent function to be optimized is the utility of the policy. When using episodic-samples, a sample is a $\langle \theta, \hat{J}(\theta) \rangle$ pair, where $\hat{J}(\theta)$ is obtained by running a number of episodes with the system. Episodes can be run from the same state in the MC approach or from various states in the more general case.

Random search, population-based and finite difference methods can be directly instantiated by taking $\phi = \theta$ and $f = J$.

4.1 Random search

The simplest BBO method randomly searches $\Phi$ until it stumbles on a good enough $\hat{f}(\phi)$. Its distinguishing feature is that the previous value $\hat{f}(\phi)$ has no impact on the choice of the next $\phi$.

Quite obviously, this family of methods is not sample efficient, but it requires no assumption at all on the function to be optimized. Therefore, it is an option only if the optimized function does not show any regularity that can be exploited. All other methods rely on the implicit assumption that the latent function presents some smoothness around optima.
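A minimal sketch of random search is given below. The toy latent function, the uniform sampling range and the names `random_search` and `draw` are assumptions made for illustration.

```python
import random


def random_search(f, draw, n_samples, rng):
    """Evaluate independently drawn inputs and keep the best one seen so far."""
    best_phi, best_value = None, float("-inf")
    for _ in range(n_samples):
        phi = draw(rng)  # drawn without looking at any previous evaluation
        value = f(phi)
        if value > best_value:
            best_phi, best_value = phi, value
    return best_phi, best_value


def f(phi):
    """Toy latent function with its optimum at phi = 0.3."""
    return -(phi - 0.3) ** 2


rng = random.Random(0)
best_phi, best_value = random_search(f, lambda r: r.uniform(-1.0, 1.0), 2000, rng)
```

Previous evaluations only serve to remember the best input seen so far: they never influence where the next input is drawn, which is why many evaluations are needed even in one dimension.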

4.2 Population-based optimization

Population-based BBO methods manage a limited population of individuals, and generate new individuals randomly in the vicinity of the previous elite individuals. They are based on the assumption that the optimum is close to these individuals but, in contrast with Estimation of Distribution Algorithms (EDAs), the smoothness of the latent function around the optimum is not exploited with an explicit model (see Section ). See ( , ) for further reading. The parameter $\phi$ corresponding to an individual is often called its genotype and $\hat{f}(\phi)$ is called its fitness.

Since these methods use a random exploration component, they are not very data efficient. Furthermore, since the value of an individual is known once and for all, they do not give rise to sample reuse.

4.3 Estimation of Distribution Algorithms

The standard perspective about EDAs is that they are a specific family of evolutionary strategies using a covariance matrix to control exploration ( , ). Under this perspective, EDAs are very similar to population-based optimization methods, where samples at iteration $i$ form a population and samples at iteration $i+1$ form the next generation of this population. However, EDAs use an explicit Gaussian model of the latent function where population-based methods use the population as an implicit model.

Thus, EDAs are in fact a specific case of BBO with a surrogate model, where the surrogate model is a Gaussian function parametrized by the previous optimum guess and a covariance matrix. EDAs are iterative: samples of iteration $i$ are used to build the Gaussian model, then samples of iteration $i+1$ are drawn with a probability proportional to this Gaussian model value.

A particularity of EDAs is that the new optimum guess is not obtained by using derivative-based optimization, but by analytically finding the maximum of the Gaussian surrogate model.

Figure 3: One iteration of EDAs. Red dot: current guess $\phi_i$. Blue ellipsoid: current sampling domain. Full blue dots: samples with a good evaluation. Empty blue dots: samples with a poor evaluation. Green dot: new guess $\phi_{i+1}$. Green dotted ellipsoid: new sampling domain.

In Estimation of Distribution Algorithms (EDAs), $n_s$ samples $\phi_j$ are drawn from a Gaussian distribution centered on the current guess $\phi_i$ and with covariance $\Sigma_i$. The evaluation of these samples determines the new optimum guess $\phi_{i+1}$ and the new covariance matrix $\Sigma_{i+1}$ (Algorithm ). Along iterations, the ellipsoid defined by $\Sigma_{i+1}$ is progressively adjusted to the top part of the hill corresponding to the local optimum $\phi^*$.

Algorithm 3: eda($\phi_i$, $\Sigma_i$, $n_s$): one iteration of EDAs
Require: $\phi_i$: current best guess, $\Sigma_i$: current covariance matrix, $n_s$: number of samples
// generate samples and evaluate them
$S = \emptyset$
for $j$ in $\{1, \dots, n_s\}$ do
    $\phi_j \sim \mathcal{N}(\phi_i, \Sigma_i)$
    $f(\phi_j) = \mathit{evaluate}(\phi_j)$
    $S \leftarrow S \cup \{\langle \phi_j, f(\phi_j) \rangle\}$
end for
$\phi_{i+1} = \mathit{compute\_best\_guess}(S)$
$\Sigma_{i+1} = \mathit{update}(S)$
return $\phi_{i+1}$, $\Sigma_{i+1}$

The $\mathit{compute\_best\_guess}$ and $\mathit{update}$ methods depend on the algorithm.
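As one possible instantiation, the sketch below implements an EDA iteration with a cross-entropy-style choice for these two methods: the new guess is the mean of the best samples and the new covariance is their empirical variance. This choice, the diagonal covariance, and the parameters `elite_frac` and `min_var` are assumptions made for this example, not the definition of any specific algorithm covered here.

```python
import random


def eda_iteration(phi_i, var_i, n_s, evaluate, rng, elite_frac=0.2, min_var=1e-8):
    """One EDA iteration with a diagonal covariance (list of variances)."""
    # generate samples and evaluate them
    S = []
    for _ in range(n_s):
        phi_j = [rng.gauss(m, v ** 0.5) for m, v in zip(phi_i, var_i)]
        S.append((phi_j, evaluate(phi_j)))
    # assumed compute_best_guess: mean of the elite samples
    S.sort(key=lambda s: s[1], reverse=True)
    elites = [phi for phi, _ in S[:max(2, int(elite_frac * n_s))]]
    dim = len(phi_i)
    phi_next = [sum(e[d] for e in elites) / len(elites) for d in range(dim)]
    # assumed update: per-dimension variance of the elites, floored at min_var
    var_next = [
        max(min_var, sum((e[d] - phi_next[d]) ** 2 for e in elites) / len(elites))
        for d in range(dim)
    ]
    return phi_next, var_next


def evaluate(phi):
    """Toy latent function with its optimum at (1, -2)."""
    return -(phi[0] - 1.0) ** 2 - (phi[1] + 2.0) ** 2


rng = random.Random(0)
phi, var = [0.0, 0.0], [4.0, 4.0]
for _ in range(30):
    phi, var = eda_iteration(phi, var, 50, evaluate, rng)
```

Each iteration discards the previous samples and recomputes the Gaussian from the new ones only.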

Message 12: EDAs do not give rise to sample reuse because their surrogate model regression mechanism is analytical.

Sampling in EDAs corresponds to local stochastic exploration where the exploration process is driven by the update of the covariance matrix. In this process, the importance of spatial proximity between good samples is particularly obvious. This exploration policy can be characterized as uncorrelated when it only updates the diagonal of the covariance matrix and correlated when it updates the full covariance matrix. The latter is more efficient in small parameter spaces but computationally more demanding and potentially inaccurate in larger spaces where more samples are required. See ( , ) for a discussion.

EDAs can be simply instantiated into a policy search algorithm by iterating Algorithm , using $\theta$ as input $\phi$ and $J$ as latent function $f$. In that case, the $\mathit{evaluate}$ function of Algorithm gets the MC return, either from a single initial state or over the state space. The implications of this choice are discussed in Section .

Various instantiations of EDAs, such as cem, cma-es, pi$^{BB}$, pi$^2$-cma, nes, xnes, are covered in ( , ).

4.4 EM-based algorithms

The key and somewhat confusing feature of EDAs is that the same Gaussian function is both used as a utility model to determine the current best guess and as an exploration policy to generate new samples.

Interestingly, the rwr and crkr algorithms covered in ( , ) as EM-based methods are quite similar to EDAs, apart from the fact that the utility model they learn can be any function, learned through a locally weighted regression algorithm (see ( , ) for details).

Both rwr and crkr use Monte Carlo evaluation which corresponds to generating new samples and evaluating them in the first half of Algorithm .


The algorithm uses Gaussian exploration as in Algorithm to generate these samples, whereas uses a more sophisticated kernel-based representation to drive the exploration process.


4.5 Finite difference methods

In finite difference methods, instead of being analytically computed, the gradient is estimated as the first order approximation of the Taylor expansion:

$$\nabla_{\phi_i} f|_{\phi = \phi_i} \sim \big(\delta\phi_i^T \, \delta\phi_i\big)^{-1} \, \delta\phi_i^T \, \big(f(\phi_i + \delta\phi_i) - f(\phi_i)\big) \tag{5}$$

where $\delta\phi_i$ is a small variation of $\phi_i$. Solving () using a set of samples that relate $\delta\phi_i$ to $f(\phi_i + \delta\phi_i) - f(\phi_i)$ is a standard regression problem, but perturbations along each dimension of the input can be treated separately, which results in a very simple algorithm ( , ). The price of this simplicity is a high variance. Besides, this algorithm is derivative-free and we classify it as using no surrogate model, even if it is based on a local approximation with a linear model of the gradient.

Moreover, finite difference methods are limited to deterministic policies. In simulation, they can be applied to stochastic policies by using identical noise along all trajectories ( , ).
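When each dimension is perturbed separately by a scalar amount, the finite difference estimate reduces, for that dimension, to the change in $f$ divided by the perturbation size. The sketch below illustrates this on a deterministic toy function. The function, the perturbation size `delta` and the name `finite_difference_gradient` are assumptions for this example.

```python
def finite_difference_gradient(f, phi, delta=1e-4):
    """Estimate the gradient of f at phi by perturbing one dimension at a time."""
    f_0 = f(phi)
    grad = []
    for d in range(len(phi)):
        perturbed = list(phi)
        perturbed[d] += delta
        # change in f divided by the size of the perturbation along dimension d
        grad.append((f(perturbed) - f_0) / delta)
    return grad


def f(phi):
    """Toy latent function with its optimum at (1, -2)."""
    return -(phi[0] - 1.0) ** 2 - (phi[1] + 2.0) ** 2


grad = finite_difference_gradient(f, [0.0, 0.0])
```

Here the exact gradient at the origin is (2, -4) and the estimate is off by about `delta` in each dimension. If `f` were a noisy return, the difference of two evaluations divided by a small `delta` would amplify that noise, which is the source of the variance discussed above.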

4.6 Summary

Message 13: Policy search without a utility model is not very data efficient and does not generally give rise to sample reuse.

5 Policy search with a surrogate model in the space of policy parameters

Surprisingly, there is no algorithm which performs simple policy search with a surrogate model in the space of policy parameters. The only algorithms based on this idea are Bayesian optimization and rock$^*$, but they use a distribution over surrogate models, which endows them with active learning capabilities.

Table 2: List of episodic policy search algorithms using a surrogate model in the policy parameter space. Above the line, they were studied in ( , ), below they were not. Bayes. Opt.: Bayesian optimization. Model improvement: Iter. = iterative, Incr. = incremental. Policy improvement: A. = analytical, GB. = gradient-based, RWA = reward-weighted averaging, VG = vanilla gradient, NG = natural gradient, LS = local search, GS = global search. Policy update methods (following ( , )): MC-EM = Expectation-Maximization with Monte Carlo, IT = Information-theoretic method, SO = stochastic optimization, NSO = stochastic optimization with the natural gradient, PG = policy gradient.
Algorithm	Model Improv.	Policy Improvement
cma-es ( , )	Iter.	A.	RWA	LS	NSO
pi$^2$-cma ( , )	Iter.	A.	RWA	LS	NSO
nes ( , )	Iter.	A.	NG	LS	NSO
rwr ( , )	Iter.	GB.	VG	LS	MC-EM
crkr ( , )	Iter.	GB.	VG	LS	MC-EM
EB-reps ( , )	Iter.	GB.	NG	LS	IT
cem ( , )	Iter.	GB.	VG	LS	SO
pi$^{BB}$ ( , )	Iter.	GB.	RWA	LS	SO
xnes ( , )	Iter.	GB.	RWA	LS	NSO
Bayes. Opt. ( , )	Incr.	A.	VG	GS	PG
rock$^*$ ( , )	Incr.	A.	RWA	GS	PG

5.1 Bayesian Optimization

Bayesian optimization (BO) is an instance of optimization with a surrogate model in $\theta$ where, instead of learning one surrogate model, a probability distribution over such surrogate models is updated through Bayesian inference. The distribution over surrogate models is initialized with some prior, and each new sample, considered as new evidence, helps adjust the model distribution towards a peak at the true value while keeping some information about uncertainty as the variance of the distribution.

A BO algorithm comes with some covariance function that determines how the information provided by a new sample influences the model distribution around this sample. That is, where EDAs necessarily assume a Gaussian relationship between the value of two samples, BO can assume more varied functions. BO also comes with an acquisition function used to choose the next sample given the current model distribution. A good acquisition function should take into account the expected value of the chosen sample as well as the uncertainty around this sample. Furthermore, it should be computationally as cheap as possible to find the optimum over the acquisition function, so as to control the computational cost of choosing the next sample.

By quickly reducing uncertainty, BO implements a form of active learning. As a consequence, it is very sample efficient when the parameter space is small enough, and it converges to a global optimum rather than a local one. However, given the necessity to optimize globally over the acquisition function, it scales poorly with the size of the parameter space. We do not expand the presentation of BO further here; see ( , ) for more details.

Message 14: Bayesian Optimization is BBO with a surrogate model, where a distribution over surrogate models is managed. It performs active learning by trying to reduce uncertainty, and can perform global search, at the price of poorer scalability.

Besides, finding the optimum can use derivative-based optimization, but it does not need to be local. This allows the algorithm to consider several local optima and find the global optimum.

The rock$^*$ algorithm is one such instance ( , ). However, it uses cma-es to find the optimum over the model function. By doing so, it performs natural gradient optimization rather than vanilla gradient optimization. But it uses derivative-free optimization where it could use a derivative-based optimization approach, the model function being known.

5.2 Summary


Table gives the most important distinctions between episodic policy search algorithms using a surrogate model in the policy parameter space. All these algorithms share the same architecture. Their model is linear in the features (the Gaussian utility model is a special case with just one feature), they use a deterministic policy, and they do not use a forward model. Quite obviously, they perform exploration in the policy parameter space. The on-policy versus off-policy and multi-step versus single-step update distinctions do not make sense in their case.

6 Policy search with a critic

Methods presented in Section learn a surrogate model in the space of policy parameters. The family of methods we now present also learns a model of the utility function, but does so in the space of states and possibly actions. This model is called a critic. We first give a quick overview of the mathematical justification of this approach before presenting the methods themselves.

6.1 From a utility model in the policy parameter space to a critic

Starting from (), page and after simple mathematical transformations explained in ( , ) and ( , ), the gradient over $\theta$ of the global utility function $J(\theta)$ can be rewritten

$$\nabla_\theta J(\theta) = \mathbb{E}_\tau \big\{ \nabla_\theta \log p_\theta(\tau) \, J(\tau) \big\}. \tag{6}$$

Furthermore, one can show that

$$\nabla_\theta \log p_\theta(\tau) = \sum_{k=0}^{k_f} \nabla_\theta \log \pi_\theta(u_k | x_k). \tag{7}$$

Thus, by introducing the right-hand side part of () in (), we get

$$\nabla_\theta J(\theta) = \mathbb{E}_\tau \Big\{ \sum_{k=0}^{k_f} \nabla_\theta \log \pi_\theta(u_k | x_k) \, J(\tau) \Big\}. \tag{8}$$

In (), the gradient of $J(\theta)$ over $\theta$ is expressed through the gradient of the policy $\pi_\theta(u | x)$. This is advantageous because in general the former is not analytically known whereas the latter is, since the parametric policy is a function given by the user.

By using the policy gradient theorem ( , ) in the stochastic case, one can further show that () can be reformulated as:

$$\nabla_\theta J(\theta) = \int_X d^{\pi_\theta}(x) \int_U \nabla_\theta \pi_\theta(u | x) \big( Q^{\pi_\theta}(x, u) - b^{\pi_\theta}(x) \big) \, dx \, du \tag{9}$$

where $d^{\pi_\theta}(x)$ is the probability density of being in state $x$ given the current policy $\pi_\theta$, $Q^{\pi_\theta}(x, u)$ is the action value function and $b^{\pi_\theta}(x)$ is a baseline function.

Various choices for $b^{\pi_\theta}(x)$ give rise to various well-studied ways to compute the policy gradient. To remain as general as possible, we define a utility function $U(x, u)$ which can represent any function of the form $\big(Q^{\pi_\theta}(x, u) - b^{\pi_\theta}(x)\big)$.

Using $U$, () becomes

$$\nabla_\theta J(\theta) = \int_X d^{\pi_\theta}(x) \int_U \nabla_\theta \pi_\theta(u | x) \, U(x, u) \, dx \, du. \tag{10}$$

In (), $\pi_\theta(u | x)$ is given. Therefore, its gradient can be computed analytically, but this is not the case of $U(x, u)$. To turn these equations into a practical algorithm, one thus needs to learn a model $\hat{U}_\eta$ of $U$.

The $\hat{U}_\eta(x, u)$ function is called a critic and methods combining an approximation $\hat{U}_\eta$ of $U$ and gradient descent on $\pi_\theta$ are called actor-critic methods, the policy $\pi_\theta$ being the actor.

Finally, in the case of a deterministic policy, instead of using (), one can compute the deterministic policy gradient ( , ) using

$$\nabla_\theta J(\theta) = \nabla_u Q^{\pi_\theta}(x, u) \, \nabla_\theta \pi_\theta(u | x). \tag{11}$$

This can be advantageous because, the space of deterministic policies being smaller than the space of stochastic policies, searching the former is faster than searching the latter. However, a stochastic policy might be more appropriate when the Markov property does not hold ( , ) or in adversarial contexts ( , ).

Message 15: Learning a surrogate model of utility in the parameter space and ascending its gradient can be turned into learning a critic and ascending the utility gradient based on that critic.


6.2 Trading bias against variance

The insight in () is used in the reinforce algorithm ( , ). However, using () to compute the policy gradient suffers from a variance which grows with the length of the episodes.

One way to reduce this variance consists in adequately choosing the baseline function in (). In particular, the optimal baseline, that is, the baseline that minimizes the variance without introducing bias, is the value function $V^{\pi_\theta}(x)$ (see e.g. ( , )). Thus, the $U$ function which optimally reduces the variance is $U = A^{\pi_\theta}(x, u) = Q^{\pi_\theta}(x, u) - V^{\pi_\theta}(x)$, which is known as the advantage function.
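The effect of a baseline can be observed on a deliberately simple case. The sketch below assumes a one-step episode, a unit-variance Gaussian policy whose mean is the single parameter `theta`, a toy utility, and a constant baseline estimated as the average return of the batch. All of these are assumptions for illustration. For such a policy, the gradient of the log-probability with respect to `theta` is `u - theta`.

```python
import random


def utility(u):
    """Toy return of a one-step episode, maximal for u = 3."""
    return 20.0 - (u - 3.0) ** 2


def gradient_samples(theta, n, rng, use_baseline):
    """Per-episode likelihood-ratio gradient estimates, with or without baseline."""
    actions = [rng.gauss(theta, 1.0) for _ in range(n)]
    returns = [utility(u) for u in actions]
    baseline = sum(returns) / n if use_baseline else 0.0
    # d/dtheta log pi_theta(u) = u - theta for a unit-variance Gaussian policy
    return [(u - theta) * (j - baseline) for u, j in zip(actions, returns)]


def mean_and_variance(values):
    m = sum(values) / len(values)
    return m, sum((v - m) ** 2 for v in values) / len(values)


theta = 0.0
# The same seed is used twice so that both estimators see the same actions.
plain = gradient_samples(theta, 20000, random.Random(0), use_baseline=False)
with_baseline = gradient_samples(theta, 20000, random.Random(0), use_baseline=True)
mean_plain, var_plain = mean_and_variance(plain)
mean_base, var_base = mean_and_variance(with_baseline)
```

For this toy problem the exact gradient at `theta = 0` is 6. Both estimators average to roughly that value, but the per-episode estimates are less spread out once the baseline is subtracted.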

Using () helps reduce the variance inherent to (), but it may introduce some bias, which means that the obtained policy may not be adequately optimized or that learning may even diverge. In order to reduce the variance without introducing bias, there exists a compatibility condition between the features of $\pi_\theta$ and those of $\hat{U}_\eta$ ( , ). In the case where the critic is represented with a linear architecture (see Section ), using compatible features and estimating the advantage function as a critic results in performing natural gradient optimization on the actor ( , ).

Message 16: Transforming derivative-based optimization on $\hat{J}(\theta)$ into derivative-based optimization on $\pi_\theta$, using the optimal baseline, introducing an actor-critic approach and finally optimizing the natural gradient are four steps that all improve the sample efficiency of episodic policy search, mainly by reducing the variance, potentially at the price of some bias.

Message 17: Most deep RL approaches build on these concepts to benefit from a high sample efficiency.

6.3 Learning a critic

There are two ways to learn a critic: using a bootstrap method or using batch regression methods. Besides, the former can be combined with using regression towards a target critic.

6.3.1 Bootstrap approximation of a critic

The temporal difference way to estimate $\hat{U}_\eta$ is known as a bootstrap method ( , ). It can be explained as follows. Suppose we get a new step-sample $s_k = \langle x_k, u_k, j_k, x_{k+1} \rangle$. In the discounted reward case, by using Bellman’s principle, one can show that, if the critic $\hat{U}$ were accurate, we should have

$$\hat{U}(x_k, u_k) = j(x_k, u_k) + \gamma \max_u \hat{U}(x_{k+1}, u).$$

If the equality does not hold, $\hat{U}$ is inaccurate. To correct it at $x_k$, one can use the temporal difference error or reward prediction error (RPE) $\delta$ defined as

$$\delta = j(x_k, u_k) + \gamma \hat{U}_\eta(x_{k+1}, w) - \hat{U}_\eta(x_k, u_k) \tag{12}$$

with either $w = \operatorname{argmax}_u \hat{U}_\eta(x_{k+1}, u)$, as in q-learning, or $w = u_{k+1}$, as in Sarsa.

If the temporal difference error $\delta$ is positive, $j(x_k, u_k)$ was greater than expected and $\hat{U}$ should be increased. If it is negative, it was smaller and $\hat{U}$ should be decreased. Thus one can improve $\hat{U}$ by applying

$$\hat{U}_\eta(x_k, u_k) \leftarrow \hat{U}_\eta(x_k, u_k) + \alpha \delta \tag{13}$$

where $\alpha$ is a learning rate.

Using a learning rate $\alpha$ in () implies that the same sample can be reused many times. If () is used repeatedly with the same sample, the corresponding value of $\delta$ will converge more or less quickly to 0 depending on $\alpha$. But, more importantly, $\hat{U}_\eta(x_i, u_i)$ should change depending on the other $\hat{U}_\eta(x_j, u_j)$ it is connected to. This phenomenon is known as value propagation and is at the heart of the capability of bootstrap methods to reuse the same samples several times.
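Value propagation can be observed on a two-step chain. The sketch below uses a tabular critic stored in a dictionary, the q-learning choice for the next action, and step-samples extended with a terminal flag. The flag, the single action `"a"` and the function name `td_update` are assumptions made for this example.

```python
def td_update(U, sample, actions, alpha=0.5, gamma=0.9):
    """Apply one temporal-difference update to the tabular critic U, in place."""
    x, u, j, x_next, terminal = sample
    if terminal:
        target = j
    else:
        # q-learning choice: the greedy action in the next state
        target = j + gamma * max(U.get((x_next, a), 0.0) for a in actions)
    delta = target - U.get((x, u), 0.0)  # reward prediction error
    U[(x, u)] = U.get((x, u), 0.0) + alpha * delta
    return delta


actions = ["a"]
# Two step-samples: state 0 leads to state 1 (reward 0), state 1 ends with reward 1.
samples = [(0, "a", 0.0, 1, False), (1, "a", 1.0, 2, True)]

U = {}
for s in samples:
    td_update(U, s, actions)
after_one_sweep = dict(U)

# Reusing the very same two samples lets the reward propagate back to state 0.
for _ in range(49):
    for s in samples:
        td_update(U, s, actions)
```

After the first sweep the value at state 0 is still 0 because the value at state 1 was not yet known when it was updated. Only by replaying the same samples does it approach $\gamma$ times the value at state 1.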

Message 18: Bootstrap methods generally give rise to more sample reuse than standard regression methods.

6.3.2 Using a shuffled replay buffer

Given that samples can be reused many times in bootstrap methods, one can collect a large set of samples into a replay buffer and process them any number of times to learn a critic. However, using them in the order in which they are collected is detrimental to learning performance. Indeed, learning a model can be shown to perform better if the samples are independent and identically distributed (i.i.d.), which is not the case of the successive samples obtained along an episode. The correlation between successive samples is one of the sources of the instability of RL algorithms in continuous domains ( , ). The correlation can be removed by drawing the samples at random from the replay buffer rather than in their order of collection. This idea played a key role in the success of the dqn algorithm ( , ) and is now used in most deep RL algorithms.

Finally, compared with drawing the samples uniformly at random, the sample efficiency of bootstrap methods can be further improved using prioritized experience replay ( , ).
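A uniform replay buffer can be sketched in a few lines. The class below is a minimal illustration with hypothetical names (`ReplayBuffer`, `capacity`, `seed`). It does not implement prioritized replay, and the fixed capacity that discards the oldest samples is an assumption rather than something required by the text.

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-size store of step-samples with uniform random mini-batches."""

    def __init__(self, capacity, seed=None):
        self._storage = deque(maxlen=capacity)  # oldest samples are dropped
        self._rng = random.Random(seed)

    def add(self, sample):
        self._storage.append(sample)

    def sample(self, batch_size):
        # Drawing without regard to collection order breaks the correlation
        # between successive samples of an episode.
        return self._rng.sample(list(self._storage), batch_size)

    def __len__(self):
        return len(self._storage)


buffer = ReplayBuffer(capacity=50, seed=0)
for k in range(100):
    buffer.add((k, "u", 0.0, k + 1))  # step-sample <x_k, u_k, j_k, x_{k+1}>
batch = buffer.sample(10)
```

The same stored sample can appear in many successive mini-batches, which is how the buffer increases sample reuse.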

Message 19: Using a replay buffer dramatically improves sample reuse.

6.3.3 Regression towards a target critic

In addition to a replay buffer, deep RL methods introduced incremental regression towards a target critic which is periodically or smoothly updated. To understand this approach and its relationship to standard bootstrap learning, one should reconsider incremental regression (Section ).

In (), the goal is to minimize a positive loss $\epsilon(\omega_i)$ defined as a function of the current model $\hat{f}_{\omega_i}$ at known points $\phi_j$. In (), which defines bootstrap learning, the goal is to drive the reward prediction error $\delta$ to 0.

Equation () can be made equivalent to () by applying the following transformations:

$$\begin{cases} \mathrm{loss}(a, b) = a - b \\ \hat{f}_{\omega_i}(\phi_j) = \hat{U}_\eta(x_k, u_k) \\ f(\phi_j) = j(x_k, u_k) + \gamma \hat{U}_\eta(x_{k+1}, w) \\ \epsilon(\omega_i) = \delta \end{cases}$$

and considering either a single sample or a collection of samples on both sides, with $w$ chosen as in the definition of $\delta$.

Under this perspective, () can be seen as a way to perform regression over $\hat{U}_\eta(x_k, u_k)$ toward a target value $j(x_k, u_k) + \gamma \hat{U}_\eta(x_{k+1}, w)$. However, this target value is itself a function of $\hat{U}_\eta$, thus it is modified each time $\hat{U}_\eta$ is modified, whereas in standard regression, the target function is a constant function to be approximated. Along with the dependencies between samples, this phenomenon is the other main source of instability of RL algorithms in continuous domains, resulting in potential divergence of $\hat{U}_\eta$ ( , ).

To mitigate this instability, one can replace the term $\hat{U}_\eta$ in the target function by another function $\hat{U}_{\eta'}$. If $\hat{U}_{\eta'}$ is held constant, then the bootstrap learning problem is turned into a standard regression problem. But since in theory the target function should be a function of $\hat{U}_\eta$, one should rather periodically reset $\hat{U}_{\eta'}$ to the current $\hat{U}_\eta$, switching from one regression problem to another. This idea was first introduced in dqn ( , ) and then modified so that $\hat{U}_{\eta'}$ tracks $\hat{U}_\eta$ with smoother variations in ddpg ( , ). Both mechanisms, in addition to shuffling the samples, improve the stability of learning the critic. Furthermore, the opportunity for sample reuse arising from bootstrap methods is transferred to solving successive regression problems with changing target networks.
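The two ways of updating the target critic described above can be sketched as follows, treating the critic parameters as a flat NumPy vector. The function names and the `period` and `tau` parameters are hypothetical names chosen for this illustration.

```python
import numpy as np


def periodic_target_update(eta, eta_target, step, period):
    """Periodic reset: copy the critic parameters every `period` steps."""
    return eta.copy() if step % period == 0 else eta_target


def smooth_target_update(eta, eta_target, tau):
    """Smooth tracking: the target moves a fraction tau towards the critic."""
    return tau * eta + (1.0 - tau) * eta_target
```

Between two resets, or for a small `tau`, the regression target is almost constant, which is what brings the problem closer to standard regression.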

Message 20: Replay buffer shuffling and using a target critic improve the stability of incremental improvement.

6.3.4 Batch learning of a critic

Another way to learn a critic is through batch methods.

When the critic is represented as a linear architecture, finding the optimal critic parameters given a batch of samples can be cast as a standard regression problem, as used in nac and enac. The EM-based methods such as PoWER and pi$^2$ rather rely on Monte Carlo sampling.

The details of the corresponding algorithms are well described in ( , ).

Batch methods are less sample efficient than bootstrap methods, as the former have to recompute the utility-to-go of each state from scratch at each iteration whereas the latter store these intermediate values in a memory and update them incrementally. Furthermore, they give rise to no sample reuse (see Message ). However, they are used in most iterative episodic policy search methods listed in Section .
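To make the batch setting concrete, the sketch below computes Monte Carlo utilities-to-go along one episode and fits a linear critic to them by least squares. It is a generic illustration under our own assumptions (feature matrix `features`, discount `gamma`), not the specific regression used in nac or enac.

```python
import numpy as np


def utility_to_go(rewards, gamma):
    """Discounted sum of future rewards for each step of one episode."""
    returns = np.zeros(len(rewards))
    running = 0.0
    # Walk the episode backwards so each value reuses the next one.
    for k in reversed(range(len(rewards))):
        running = rewards[k] + gamma * running
        returns[k] = running
    return returns


def fit_linear_critic(features, targets):
    """Least-squares fit of eta such that features @ eta approximates targets."""
    eta, *_ = np.linalg.lstsq(features, targets, rcond=None)
    return eta
```

Both functions have to be called again on fresh samples after each policy update, which is the recomputation from scratch mentioned above.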

6.4 Exploration policies in step-sample-based methods

When using step-samples, exploration can be performed in the policy parameter space, as in pepg, PoWER and pi$^2$, or in the state-action space, as in most other algorithms.

Besides, most exploration policies are specified as stochastic Gaussian exploration through a covariance matrix. When exploration is performed in the state-action space, the connection with policy parameter optimization is weaker. As a result, the covariance matrix can be kept fixed, as in nac and enac, or updated. When it is updated, a principled way to tune the exploration rate must be found.
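The difference between the two exploration spaces can be sketched with a toy linear controller. The controller and all function names are assumptions made for this example; isotropic Gaussian noise of standard deviation `sigma` stands in for a full covariance matrix.

```python
import numpy as np


def linear_policy(theta, x):
    """Deterministic linear controller u = theta . x (an illustrative choice)."""
    return float(theta @ x)


def act_with_action_noise(theta, x, sigma, rng):
    # Exploration in the state-action space: fresh noise at every step.
    return linear_policy(theta, x) + sigma * rng.normal()


def perturb_parameters(theta, sigma, rng):
    # Exploration in the policy parameter space: one draw, then the
    # perturbed controller is run unchanged for a whole episode.
    return theta + sigma * rng.normal(size=theta.shape)
```

With parameter perturbation the controller stays deterministic within an episode, whereas with action perturbation two visits to the same state generally yield different actions.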

Letting it decrease too fast may result in premature convergence. A well-founded alternative consists in applying a large policy update, while constraining it by an upper bound on the Kullback-Leibler divergence between the previous trajectory distribution and the updated one. Performing large steps prevents premature convergence. This idea is at the heart of the Relative Entropy Policy Search (reps) algorithm ( , ). When used in combination with a policy gradient method, it has been shown to ensure natural gradient updates ( , ).

The same method is also at the heart of the Trust Region Policy Optimization (trpo) algorithm ( , ). Indeed, the upper-bound on the Kullback-Leibler divergence prevents the new policy from moving too far away from the current policy, hence staying in the “trust region”. This is safer, particularly in the context of robotics where large jumps in the policy parameter space might be dangerous. See ( , ) for a mathematical derivation and for a discussion of the relationship to reps.
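The effect of an upper bound on the Kullback-Leibler divergence can be illustrated on a diagonal Gaussian search distribution. The sketch below simply halves a proposed mean update until the bound `epsilon` is satisfied; this backtracking scheme is an assumption chosen for simplicity and is not the constrained optimization actually solved by reps or trpo.

```python
import numpy as np


def kl_gaussians(mu_old, sigma_old, mu_new, sigma_new):
    """KL(old || new) between two diagonal Gaussians."""
    var_old, var_new = sigma_old ** 2, sigma_new ** 2
    return float(np.sum(np.log(sigma_new / sigma_old)
                        + (var_old + (mu_old - mu_new) ** 2) / (2.0 * var_new)
                        - 0.5))


def bounded_mean_update(mu, sigma, direction, epsilon, shrink=0.5, max_tries=50):
    """Shrink the step along `direction` until the KL bound epsilon is met."""
    step = 1.0
    for _ in range(max_tries):
        mu_new = mu + step * direction
        if kl_gaussians(mu, sigma, mu_new, sigma) <= epsilon:
            return mu_new
        step *= shrink
    return mu  # no admissible step found: keep the current distribution
```

The accepted step is as large as the bound allows among the tried step sizes, so the update never leaves the region where the new distribution stays close to the old one.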

6.5 Policy search methods using a critic

In Section , we have shown that one can turn derivative-based optimization over $J(\theta)$ into learning a critic $\hat{U}_\eta$ in the $(x, u)$ space and using it to perform policy gradient ascent. In Section , we have listed various ways to learn such a critic. Then in Section , we have studied various exploration policies in various spaces. We are now ready to explain how one can implement policy search methods by performing derivative-based optimization over $\vec{\theta}$ based on the gradient of $\hat{U}_\eta$ using () or ().

All the corresponding approaches are particular instantiations of episodic policy search with a surrogate model, with $\phi = (x, u)$ and $f = \hat{U}_\eta(x, u)$. As in general BBO methods (Section ), we distinguish the iterative and the incremental instantiations.

6.5.1 Iterative approach

In Section we mentioned the possibility of a sequential approach where a surrogate model of the utility function is learned first, then derivative-based optimization is performed on this model. Actually, a key difference between using episodic-samples and step-samples appears in that case. Learning a model $\hat{J}(\theta)$ over $\theta$ is a regression problem that can be performed easily by directly sampling the $\Theta$ space. By contrast, sampling the $X \times U$ space is indirect: it requires using some adequate policy. As a consequence, a sequential approach to episodic policy search with a surrogate model in $(x, u)$ would not work. In practice, many policy search algorithms alternate between collecting new step-samples from the current policy to get a new critic $\hat{U}_{\eta_{i+1}}(x, u)$ and using this new critic to improve the current policy.

Methods following the iterative approach can be characterized as policy iteration methods: they learn a new critic at each iteration ( , ). As outlined in Figure 2 of ( , ), algorithms from this family are similar to EDAs, apart from the fact that they model the utility function in the state-action space instead of the policy parameter space.

Among these methods, one must distinguish three families: likelihood ratio methods like reinforce and pepg, actor-critic methods like nac and enac and EM-based methods like PoWER and the variants of reps.

Though they derive from a different mathematical framework, likelihood ratio methods and EM-based methods are similar: they both use unbiased estimation of the gradient through Monte Carlo sampling and they are both mathematically designed so that the most rewarding trajectories get the highest probability. Besides, pepg is a likelihood ratio method that uses policy parameter perturbation as exploration policy, while PoWER is an EM-based method that does the same, so both methods are strongly related.

In likelihood ratio methods, the expectation over $p_\theta(\tau)$ is approximated without bias as a sum over trajectories generated using $\vec{\theta}$. In nac and enac, due to the compatibility condition, the features of the critic depend on the gradient of the current policy with respect to policy parameters, thus each time the policy is updated, new features must be computed for the critic. As a consequence, the critic must be learned again at each iteration with a batch method. In PoWER and the variants of reps, instead of storing a critic between iterations, a Monte Carlo estimation of the utility is used, which is dependent on the current policy. Interestingly, pi$^2$ is also an iterative method, though it could in principle follow the incremental approach described in Section . According to ( , ), this is just because batch updates make it more stable.
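The likelihood ratio estimate mentioned at the start of the paragraph above can be sketched on a deliberately tiny problem: one-step "trajectories" where the action is drawn from a Gaussian of mean `theta`. The setting and all names are assumptions made for illustration; real methods sum the score over all steps of each trajectory and usually subtract a baseline.

```python
import numpy as np


def likelihood_ratio_gradient(theta, sigma, utility, n_samples, rng):
    """Monte Carlo estimate of dJ/dtheta for u ~ N(theta, sigma^2)."""
    u = rng.normal(theta, sigma, size=n_samples)
    # Score function: derivative of log N(u; theta, sigma^2) w.r.t. theta.
    score = (u - theta) / sigma ** 2
    # Unbiased estimate: average of score times observed utility.
    return float(np.mean(score * utility(u)))
```

For the utility $-(u - 3)^2$ with `theta = 0` and `sigma = 1`, the exact gradient is $6$, and the estimate approaches it only with a large number of samples, which illustrates the variance issue of Monte Carlo estimation.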

The trpo algorithm also follows an iterative approach and can use a deep neural network representation, thus it can be classified as a deep RL method. Its key component is the upper-bounded exploration policy outlined in Section . Compared to reps, it also uses a conjugate gradient mechanism to improve the data efficiency of policy optimization, and it seems to perform better than previous algorithms of the same family ( , ).

Finally, the Guided Policy Search (gps) algorithm ( , ) is another iterative deep RL method inspired by reinforce, which adds guiding trajectories and is able to learn policies represented by large deep neural networks. It first learns a set of local open-loop policies using ilqg, a variant of the model-based Differential Dynamic Programming algorithm ( , ), then it uses importance sampling based on samples generated by these policies to learn a more global closed-loop policy. Importance sampling is a well-known mechanism to reduce the variance of sample-based estimation by attributing weights to the samples depending on their effect on the next estimate ( , ). For a discussion on the link between importance sampling methods and likelihood ratio methods such as reinforce, see ( , ).
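As a reminder of how importance sampling reweights samples, the sketch below estimates an expectation under a target density from samples drawn under a different behavior density. Gaussian densities and all function names are assumptions for the example; this is the generic estimator, not the specific scheme used in gps.

```python
import numpy as np


def gaussian_pdf(x, mean, std):
    return np.exp(-0.5 * ((x - mean) / std) ** 2) / (std * np.sqrt(2.0 * np.pi))


def importance_sampling_mean(f, samples, target_pdf, behavior_pdf):
    """Estimate E_target[f] from samples drawn under the behavior density."""
    # Each sample is weighted by how much more likely it is under the target.
    weights = target_pdf(samples) / behavior_pdf(samples)
    return float(np.mean(weights * f(samples)))
```

The same weighting is what allows samples generated by one policy to be used when evaluating another.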

A key aspect of all the above methods is that they can use off-line processing of a batch of data collected in the previous iteration, but they require a significant amount of such data. All these methods can be characterized as on-policy, which is detrimental to their sample efficiency, but they do not suffer from bias. By contrast, the incremental approaches covered in the next section update a critic over iterations, providing further data efficiency and further opportunities for sample reuse at the price of some bias.

6.5.2 Incremental approach

The incremental approach corresponds to on-line learning, where the current step-sample is used at each step to improve the model of utility and the policy parameters. In contrast with the iterative approach, it updates a version of the critic at each time step instead of throwing it away and computing it anew each time a new policy is generated. This approach favors data efficiency because the policy is improved as soon as possible, which in turn helps generate better samples. The four inac algorithms proposed in ( , ) belong to this family.
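A minimal sketch of one incremental critic update from a single step-sample is given below, assuming a linear critic over features `phi` and taking the reward prediction error as target minus estimate. The names `alpha` (step size) and `td0_step` are hypothetical, and the sign convention is an assumption.

```python
import numpy as np


def td0_step(eta, phi, phi_next, reward, gamma, alpha):
    """One incremental critic update from a single step-sample."""
    # Reward prediction error: bootstrap target minus current estimate.
    delta = reward + gamma * (phi_next @ eta) - (phi @ eta)
    # Move the estimate at phi towards the target; the critic is kept
    # and refined from one step to the next instead of being rebuilt.
    return eta + alpha * delta * phi, float(delta)
```

In an incremental actor-critic scheme, a policy update would follow each such critic update.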

However, on-line learning does not provide opportunities for sample reuse as long as the samples are used immediately rather than stored. The full sample efficiency of incremental approaches can be obtained by drawing from a replay buffer the samples used to improve the policy (see Section ).

This latter approach is the common structure of several deep RL algorithms: ddpg, naf, acer, Q-prop and pgql. Describing these algorithms in detail would require a paper in itself. Here, we just give a brief overview of these algorithms and refer the reader to the corresponding papers for details, and to Table for a summary of the differences.

The ddpg algorithm ( , ) is based on the deterministic policy gradient theorem ( , ) calling upon (). The algorithm directly approximates a model $\hat{Q}(x, u)$ as a deep neural network using a bootstrap method, and uses backpropagation of the gradient as an iterative method to perform the resulting derivative-based optimization over the weights of the network. In addition to the replay buffer shuffling and target network tricks described in Section , it also uses batch-normalization to stabilize gradient backpropagation in the networks ( , ). Finally, no care is taken in ddpg about the compatibility condition and the algorithm performs vanilla gradient descent.

The a3c algorithm is another actor-critic algorithm which brings several improvements over ddpg, such as natural gradient optimization by estimating the advantage function as a critic, the propagation of value over $n$ steps to limit the growth of the variance (called “$n$-step return”, see ( , )), and the replacement of the replay buffer by the use of several parallel agents. Since it does not use a replay buffer, a3c is on-policy, in contrast with most other deep RL algorithms.
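The $n$-step return mentioned above can be sketched as follows: the first $n$ observed rewards are summed with discounting, and the critic is only used to bootstrap after the $n$-th step. The function name and arguments are assumptions made for this illustration.

```python
import numpy as np


def n_step_target(rewards, bootstrap_value, gamma):
    """r_0 + gamma r_1 + ... + gamma^(n-1) r_(n-1) + gamma^n V(x_n)."""
    n = len(rewards)
    discounts = gamma ** np.arange(n)
    # Observed rewards for n steps, then a single bootstrap on the critic.
    return float(np.dot(discounts, rewards) + gamma ** n * bootstrap_value)
```

With a single reward this reduces to the usual one-step bootstrap target, and with a whole episode and a zero bootstrap value it becomes a Monte Carlo return.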

By contrast, the naf algorithm is a value iteration method like q-learning or Sarsa. In naf, the critic is a model of the advantage function $A_{\vec{\theta}}(x, u)$, which guarantees a form of natural gradient optimization (see Section ). Furthermore, the model of this function is structured in such a way that the policy parameters $\theta$ are a subset of the more global set of parameters $\eta$ of $\hat{U}_\eta$. By letting $\hat{U}_\eta$ converge over iterations, the policy parameters $\theta$ are themselves optimized. This more direct way to implement $\delta\theta = \mathit{improvement\_using}(\hat{U}_{\eta_{i+1}})$ comes at the price of a constraint on the structure of the critic, which must be quadratic in the features of the state. Besides, the sample efficiency of naf is further improved in ( , ) by learning a forward model and switching to model-based RL.

While ddpg and naf learn a deterministic policy, the svg algorithm learns a stochastic one as a deterministic function of exogenous noise ( , ). A key feature of svg is that by adjusting a single parameter, it can switch from a model-free to a model-based approach.

Deep off-policy actor-critic algorithms such as ddpg and naf are quite unstable because reducing the variance of the utility function estimation is obtained at the cost of some bias, due to the use of off-policy samples. A new family of algorithms addresses the bias-variance trade-off in order to get more sample efficient and more stable deep episodic policy search. These algorithms have been unified in a common interpolated policy gradient (ipg) framework ( , ).

The acer algorithm ( , ) builds on the a3c algorithm and adds three tricks to improve sample efficiency. First, it introduces a truncated importance sampling with bias correction mechanism. Compared to importance sampling, truncated importance sampling further reduces variance by truncating the largest weights, while bias correction reduces the bias inherent to off-policy actor-critic methods. Second, acer uses a stochastic dueling network architecture inspired by ( , ) to efficiently approximate the advantage function. Third, it proposes an efficient variant of the “trust region” exploration mechanism of the trpo algorithm described in Section . Finally, decorrelating the samples by shuffling them in the replay buffer prevents using an $n$-step return from samples stored in that buffer, since temporal succession is lost. As a consequence, using an $n$-step return and being off-policy may appear incompatible at first glance. However, the retrace algorithm ( , ) manages to perform off-policy $n$-step return updates of a critic, and is used in the acer algorithm.
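The truncation of importance weights can be sketched in a few lines. Here `pi_probs` and `mu_probs` denote the probabilities of the sampled actions under the current and the behavior policy, and `c` is the truncation threshold; these names are ours. The correction coefficient shown is the form usually given for the bias correction term, stated here as an assumption rather than as the exact acer implementation.

```python
import numpy as np


def truncated_importance_weights(pi_probs, mu_probs, c):
    """Clip importance weights at c and return the correction coefficients."""
    rho = pi_probs / mu_probs
    # Truncation bounds the weights, hence the variance of the estimate.
    truncated = np.minimum(rho, c)
    # Weight of the correction term: non-zero only where rho was clipped.
    correction = np.maximum(0.0, 1.0 - c / rho)
    return truncated, correction
```

Samples whose weight is below `c` are left untouched and need no correction.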

While acer incorporates several mechanisms to control the bias on top of a3c and the offpac algorithm ( , ), the Q-prop algorithm rather integrates the deterministic policy gradient equation () together with the stochastic policy gradient equation from offpac into a single gradient equation using a control variate formalization ( , ). More importantly, the gradient equation of Q-prop integrates (on-policy) MC policy gradient methods and (off-policy) actor-critic methods into a single framework and can be seen as performing one or the other depending on some hyperparameters, or taking the best of both worlds. The result is a more stable algorithm that reduces the variance while controlling the bias, and that can incorporate the most recent advances of both MC policy gradient methods and actor-critic methods.
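The general idea of a control variate, independently of how Q-prop instantiates it, can be sketched as follows: a quantity `g` that is correlated with `f` and whose expectation is known exactly is subtracted to reduce the variance of the estimate of the mean of `f`. All names are assumptions made for this illustration.

```python
import numpy as np


def control_variate_estimate(f_vals, g_vals, g_mean):
    """Estimate E[f] using g, whose expectation g_mean is known exactly."""
    cov = np.cov(f_vals, g_vals)        # 2x2 sample covariance matrix
    coeff = cov[0, 1] / cov[1, 1]       # variance-minimizing coefficient
    # Subtracting the centered control variate leaves the mean unchanged
    # (up to the error in coeff) but removes the correlated fluctuations.
    adjusted = f_vals - coeff * (g_vals - g_mean)
    return float(np.mean(adjusted)), float(np.var(adjusted))
```

In this sketch the coefficient is estimated from the same samples, which is a common simplification.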


Message 21: Incremental improvement is generally more sample efficient than iterative improvement, but it can be unstable.

Message 22: By using a replay buffer, all the deep RL algorithms above combine the advantages of incremental and iterative learning, as discussed in Section .

6.6 Summary

In Table 3, more black dots generally mean more sample efficiency.

All episodic policy search algorithms using a critic perform local search.

Table 3: List of episodic policy search algorithms using a critic. Above the line, they were studied in ( , ), below they were not. The classification criteria are explained and discussed in Section . For the policy update methods, the labels are as follows (following ( , )). LR: likelihood ratio. MC-EM: Expectation-Maximization with Monte Carlo. Var.: Variational method. PI: path integral. IT: Information-theoretic method. SO: stochastic optimization. NSO: stochastic optimization with the natural gradient. PG: policy gradient. NPG: natural policy gradient.
	Architecture		Explo.		Model Improv.		Policy Improv.
	Utility model: non-linear ( $∙$ ), linear ( $∘$ )	Forward model: yes ( $∙$ ), no ( $∘$ ), both ( $⋆$ )	In $(o, u)$ ( $∙$ ), in $θ$ ( $∘$ )	Greedy ( $∙$ ), stochastic search ( $∘$ ), advanced ( $⋆$ )	Incremental ( $∙$ ), iterative ( $∘$ ), both ( $⋆$ )	Off-policy ( $∙$ ), on-policy ( $∘$ ), both ( $⋆$ )	Gradient-based ( $∙$ ), analytical ( $∘$ )	Gradient: natural ( $∙$ ), vanilla ( $∘$ )	Policy update method
PoWER ( , )	$∘$	$∘$	$∘$	$∘$	$∘$	$∘$	$∙$	$∘$	MC-EM
vips ( , )	$∘$	$∘$	$∘$	$∘$	$∘$	$∘$	$∙$	$∘$	Var.
pi $2$ ( , )	$∘$	$∘$	$∙$	$∘$	.	$∘$	$∙$	$∘$	PI
reinforce ( , )	$∘$	$∘$	$∙$	$∙$	$∘$	$∘$	$∙$	$∘$	LR
g(po)mdp ( , )	$∘$	$∘$	$∙$	$∙$	$∘$	$∘$	$∙$	$∘$	LR
pepg ( , )	$∘$	$∘$	$∘$	$∘$	$∘$	.	$∘$	$∘$	LR
nac ( , )	$∘$	$∘$	$∙$	$∙$	.	$∘$	$∙$	$∙$	NPG
enac ( , )	$∘$	$∘$	$∙$	$∙$	.	$∘$	$∙$	$∙$	NPG
SB-reps ( , )	$∘$	$∘$	$∙$	$⋆$	$∘$	$∘$	$∙$	$∘$	IT
hireps ( , )	$∘$	$∘$	$∙$	$⋆$	$∘$	$∘$	$∙$	$∘$	IT
inac ( , )	$∘$	$∘$	$∙$	$∙$	$∙$	$∙$	$∙$	$∙$	NPG
gps ( , )	$∙$	$∘$	$∙$	$⋆$	$∘$	$∙$	$∙$	$∘$	PG
trpo ( , )	$∘$	$∘$	$∙$	$∙$	$∘$	$∘$	$∙$	$∙$	NPG
ddpg ( , )	$∙$	$∘$	$∙$	$∙$	$∘$	$∙$	$∙$	$∘$	PG
a3c ( , )	$∙$	$∘$	$∙$	$∙$	$∙$	$∘$	$∙$	$∙$	NPG
naf ( , )	$∙$	$∙$	$∙$	$∙$	$∙$	$∙$	$∙$	$∙$	NPG
svg ( , )	$∙$	$⋆$	$∙$	$∙$	$∙$	$∘$	$∙$	$∙$	NPG
acer ( , )	$∙$	$∘$	$∙$	$∙$	$∙$	$∙$	$∙$	$∙$	NPG
Q-prop ( , )	$∙$	$∘$	$∙$	$∙$	$⋆$	$⋆$	$∙$	$∙$	NPG
pgql ( , )	$∙$	$∘$	$∙$	$∙$	$⋆$	$⋆$	$∙$	$∙$	NPG


7 Discussion

In the previous sections we have presented methods learning a utility model in the policy parameter space and methods doing so in the state and action space separately. In this final discussion, we do two things: 1) we discuss the implications of choosing one space rather than the other and 2) we highlight some properties which make deep RL methods more sample efficient than the previous generation of policy search methods.

7.1 Learning the utility function in the policy parameter space versus a critic

Learning the utility function in the policy parameter space versus a critic is a fundamental distinction in episodic policy search methods. These two approaches correspond respectively to the episode-based and the step-based evaluation strategies outlined in ( , ). Several elements speak in favor of the higher sample efficiency of the latter approach.

1. Learning a critic with a bootstrap method can give rise to more sample reuse than learning a model of the utility function in $\Theta$.

2. Learning from each step-sample separately makes better use of the information available from a trajectory than learning from episodic-samples.

3. The critic estimates the utility-to-go from the current state to the final state. Thus it provides an estimate of the utility of the whole trajectory without having to run it. Therefore, caching such values and using bootstrap updates should in principle be more sample efficient than using MC updates.

4. Under well specified constraints corresponding to the Markov property ( , ), improving the policy locally is guaranteed to improve it globally, which facilitates optimization over trajectories.

However, other aspects must be considered.

7.1.1 Size and structure of the corresponding spaces

A key factor of sample efficiency is the size of the $X \times U$ space with respect to $\Theta$. Indeed, whatever the model, learning a model is more expensive when the model domain is larger. This insight is at the heart of quality-diversity methods which sample policies in a hand-designed space that is smaller than the $\Theta$ space ( , ). Another important factor is the structure of the relationships between points in the $X \times U$ space and in $\Theta$, which depends on how the policy is parametrized. Policies are often parametrized so that the relationship is smooth enough, and deep neural networks seem to generally induce a smooth structure too, but some policy parametrizations may induce a large jump in the $X \times U$ space for a small variation in $\Theta$. In the latter case, searching directly in $\Theta$ may prove more efficient. The two above factors might be dominant over all others.

In robotics, the size of $\Theta$ is usually kept small so as to keep optimization fast enough, using dmps or other open-loop policies ( , ). Moreover, with open-loop policies, the immediate utility of an action at each state is not useful information since the controller does not take the state as input. As a consequence, using methods based on step-samples for dmps does not make much sense. However, open-loop policies and dmps suffer from some drawbacks, such as a limited robustness to perturbations ( , ).

The emergence of deep RL methods using large neural networks as policy representation may change the picture as they make it possible to learn large closed-loop policies. Furthermore, in that context, the size of $\Theta$ can become larger than that of the $X \times U$ space, which speaks in favor of learning a critic. Besides, a utility function modelled in a larger space may suffer from fewer local minima, which may be true both of $\hat{J}(\theta)$ and of a critic.

7.1.2 Learning a hierarchical representation

Finally, the state-action spaces may naturally exhibit a hierarchical structure, which is less obvious for the policy parameter space. As a consequence, methods modelling utility in the $X \times U$ space may benefit from learning intermediate representations at several levels in the hierarchy so as to reduce the dimensionality of the policy search problem. Learning such intermediate and more compact representations can be performed off-line, which corresponds to the perspective of the DREAM project and is illustrated in ( , ).

7.1.3 Policy parameter perturbation versus action perturbation

Policy parameter perturbation and action perturbation correspond respectively to the episode-based and the step-based exploration strategies outlined in ( , ). Interestingly, Table shows that pepg, PoWER and pi$^2$ use an episode-based exploration strategy although they use a step-based evaluation strategy.

In several surveys about episodic policy search for robotics, policy parameter perturbation methods are considered superior to action perturbation methods ( , ). However, although this analysis is backed up by a few mathematical arguments, we now believe this is true mostly when $\Theta$ is smaller than the $X \times U$ space, with the same implications as already discussed in Section .

Incidentally, all the recent deep RL methods use action perturbation, sometimes together with a Kullback-Leibler divergence constraint to ensure natural gradient updates, as is the case in trpo and acer.


7.1.4 Single starting state versus multiple starting states

Methods based on step-samples are conceptually designed to handle sampling from various initial states. Nevertheless, they can still be applied when there is a single initial state.

In the context of deep RL, using multiple starting states has been shown to be highly beneficial to the quality of the obtained policy in ( , ), because it favors exploration and finds more general solutions.

7.2 Tuning a step size versus updating a covariance matrix


In derivative-based optimization methods, the step size has to be adjusted so that the derivative-based optimization process does not jump outside the current hill. In that respect, EDAs using covariance matrix adaptation remove the necessity for step-size tuning because they adapt the search to the shape of the hill.

In the case of methods using step-samples, the picture is more complex. Some derivative-based optimization methods such as enac use a constant step-size, which is not efficient ( , ).


Other derivative-based optimization methods like EM-based optimization and deep RL methods remove the need for tuning a step-size by calling upon the same kind of stochastic search as derivative-free optimization methods, or by using a bound on the Kullback-Leibler divergence. This latter approach combines the sample efficiency of greedy optimization using natural gradient optimization with a principled way to tune the step size. The most representative algorithms in that respect are trpo and acer.

In some algorithms, following the terminology of ( , ), the exploration policy is truly an “upper-level policy” that takes some local context as input. The corresponding algorithms have been identified as “advanced” with a star symbol in Table , but the topic is not covered here. An exception is the gps algorithm whose advanced exploration comes from the integration of guiding samples.

7.3 Incremental versus iterative improvement

The distinction between iterative methods, which collect a batch of samples for computing a new policy, and incremental methods, which update a critic in parallel to the policy, is central to our survey. Importantly, all methods using episodic-samples are iterative. In that respect, we have outlined that they share many similarities with iterative methods based on step-samples. Furthermore, most methods covered in the previous episodic policy search surveys were iterative, whereas most deep RL methods are incremental.

On the one hand, iterative methods are more stable. Indeed, in the incremental context, if one does not wait for convergence of the critic before applying derivative-based optimization on the policy, then utility estimation depends on policy improvement and vice versa, leading to a potential divergence of the policy search process. Thus iterative methods are suited to the context of off-line processing of a new policy, when the agent alternates between periods of activity and periods of improvement, as in the DREAM project (see Section ).

On the other hand, in the context of on-line learning, incremental updates of the critic and the policy can lead to a much better sample reuse than iterative updates. Indeed, updating a policy as soon as possible can be more sample efficient because a better policy generates better step-samples, which may result in further improvement in the current policy.

Thus, from our perspective, by providing more accurate non-linear estimation methods and several tricks to stabilize the update of the critic, deep RL methods have greatly contributed to the emergence of incremental methods, resulting in a better sample efficiency of episodic policy search methods.

A side message from our survey is that the formerly well established actor-critic versus direct policy gradient distinction is not the most adequate when one considers sample efficiency questions. Indeed, nac and enac are classified as actor-critic but they are iterative methods which recompute a new critic at each iteration, in sharp contrast with the more recent incremental deep RL methods.

7.4 Off-policy versus on-policy updates

In on-policy methods like Sarsa, the critic estimates the utility of the current policy, whereas in off-policy methods like q-learning, it estimates the utility of an optimal policy. As a consequence, in the on-policy case the samples used to learn the critic must come from the current policy itself whereas in the off-policy case, they can come from any policy.
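The difference between the two kinds of updates shows up directly in the bootstrap target. The sketch below assumes a discrete set of actions in the next state, with `q_next` the vector of critic values for these actions; the function names are ours.

```python
import numpy as np


def sarsa_target(r, q_next, u_next, gamma):
    """On-policy: bootstrap on the action actually taken in the next state."""
    return r + gamma * q_next[u_next]


def q_learning_target(r, q_next, gamma):
    """Off-policy: bootstrap on the greedy action, whatever was actually taken."""
    return r + gamma * np.max(q_next)
```

The Sarsa target needs the next action chosen by the current policy, whereas the q-learning target does not depend on the policy that generated the sample.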

In most iterative policy gradient methods, the samples are discarded from one iteration to the next and these methods are generally on-policy. By contrast, incremental methods using a replay buffer are off-policy, while those which do not use one are generally on-policy, as is the case of a3c. Besides, using importance sampling is a well known method to turn an on-policy method into an off-policy one, see e.g. ( , ).

When learning a critic incrementally, using off-policy updates is more flexible because the samples can come from any policy, but these off-policy updates introduce bias in the estimation of the critic. As a result, off-policy methods like ddpg and naf are more sample efficient because they use a replay buffer, but they are also more prone to divergence. In that respect, a key contribution of acer and Q-prop is that they provide an off-policy, sample efficient update method which strongly controls the bias, resulting in more stability.

7.5 Higher sample efficiency of recent deep RL methods

In former actor-critic methods like nac and enac, a linear architecture was used for the critic, which facilitates some calculations but can also result in a poor approximation of utility in the state-action space. This can itself prevent optimization of the policy. In that respect, as shown in the first column of Table , a key contribution of deep RL algorithms is that they brought efficient techniques to perform non-linear approximation of the critic.

7.5.1 Natural gradient versus vanilla gradient

As noted in Section , natural gradient optimization is generally more sample efficient than vanilla gradient optimization ( , ).

The counterpart is that, in principle, it requires computing or estimating $F$ and inverting it, or directly estimating the inverse. Thus, it is a way to improve sample efficiency through more expensive computations. However, many episodic policy search methods manage to perform exact or approximate natural gradient optimization without calling upon the computation of $F^{-1}$.
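The explicit route, estimating $F$ and solving with it, can be sketched as follows, assuming that $F$ denotes the Fisher information matrix of the policy and that it is estimated as the mean outer product of the score vectors. The `damping` term and the function name are assumptions added for numerical safety and illustration.

```python
import numpy as np


def natural_gradient(scores, vanilla_grad, damping=1e-6):
    """Solve F x = g, with F estimated as the mean outer product of scores."""
    n, d = scores.shape
    fisher = scores.T @ scores / n + damping * np.eye(d)
    # Solving the linear system avoids forming the inverse explicitly,
    # but still costs on the order of d^3 operations.
    return np.linalg.solve(fisher, vanilla_grad)
```

For a Gaussian policy with mean parameter and standard deviation $\sigma$, this rescales the vanilla gradient by approximately $\sigma^2$.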

In the family of derivative-free optimization methods, this is the case of cma-es, pi$^{BB}$ and pi$^2$-cma, which perform approximate natural gradient optimization by using reward-weighted averaging (see ( , ) for more details). By contrast, nes and xnes explicitly approximate $F^{-1}$.

In the family of linear actor-critic methods, this is the case of nac, enac and three inac algorithms, where it has been shown that using compatible features and estimating the advantage function as a critic directly results in performing natural gradient optimization on the actor without the need for computing $F^{-1}$ ( , ). By contrast, the fourth inac algorithm estimates $F$ and inverts it.

In the family of deep RL methods, this is also the case of naf and a3c, which learn a model of the advantage function, of trpo and acer, which constrain the exploration with a Kullback-Leibler divergence constraint, and of Q-prop, which uses compatible features. By contrast, ddpg still relies on the vanilla gradient.

7.5.2 Global search versus local search

Policy search is generally local, but learning a surrogate model of the utility function opens the way to global improvement, at the cost of a more expensive search for the optimum. Thus global search can only outperform local search if there are several local optima. Actually, the number of local optima may strongly depend on the size of the parameter space, and one may hypothesize that, the larger the parameter space, the fewer local optima ( , ).

A method like trpo is designed so that the performance always increases, and it is therefore unable to switch from one local optimum to another. The fact that trpo performs well ( , ) on several classical control problems suggests that, for a large enough $Θ$ space, the utility functions in such problems only present one local optimum in $Θ$ , or several equivalent local optima. Whether this is true of more complex problems needs to be further investigated.

Message 23: Deep RL methods, which learn a surrogate model and make massive use of derivative-based optimization of their deep neural network representations of the critic and the policy, are very sample efficient.

8 section 8 8 §8 <tag close=" ">8</tag>Conclusion

In ( , ), the authors showed that episodic policy search applied to robotics was shifting from derivative-based actor-critic methods to EDAs. Part of this shift was due to the use of open-loop dmps ( , ) as a policy representation, but another part resulted from the higher efficiency of EDA methods at that time.

The most salient drawback of the former actor-critic methods is that they could not estimate the critic accurately due to their linear architecture. In addition, enac suffered from its constant step size, while PoWER and pi$^2$ removed the step size but at the price of the higher computational cost and variance of Monte Carlo estimation.

The emergence of deep RL methods has completely changed this picture. The better approximation capability of non-linear critics and the incorporation of an adapted step size in derivative-based optimization have renewed the interest in actor-critic architectures, which reduce the variance of policy gradient estimation at the price of some controlled bias and which can operate incrementally. The target network trick has also mitigated the intrinsic instability of approximating a critic. Finally, using a replay buffer has dramatically improved the sample efficiency of the methods.
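The last two mechanisms can be sketched in a few lines. This is a minimal illustration, not the implementation of any particular algorithm: the class and function names are hypothetical, weights are plain lists of floats rather than network parameters, and the slow tracking update with rate `tau` is only one common variant of the target network trick (a periodic hard copy is another).

```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-size store of step-samples that can be reused many times."""

    def __init__(self, capacity, seed=0):
        self.buffer = deque(maxlen=capacity)  # oldest samples are dropped
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.buffer.append((state, action, reward, next_state))

    def sample(self, batch_size):
        """Draw a random mini-batch, which decorrelates successive samples."""
        size = min(batch_size, len(self.buffer))
        return self.rng.sample(list(self.buffer), size)

def soft_update(target_weights, critic_weights, tau=0.01):
    """Move the target network slowly towards the current critic."""
    return [(1.0 - tau) * t + tau * c
            for t, c in zip(target_weights, critic_weights)]

buffer = ReplayBuffer(capacity=3)
for step in range(5):
    buffer.add(step, 0.0, 1.0, step + 1)
batch = buffer.sample(2)
target = soft_update([0.0, 0.0], [1.0, -1.0], tau=0.1)
```

Because each stored step-sample can appear in many mini-batches, the critic extracts more information from every interaction with the system, while the slowly moving target keeps the regression targets of the critic from changing too quickly.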

So far, the focus of empirical comparisons has been more on the final performance of the learned controller than on the sample efficiency of the learning process. For instance, an empirical comparison of some of the algorithms studied here is presented in ( , ), but only final performance is compared and the reasons why one algorithm performs better than another are not analyzed in detail.

More recently, the higher sample efficiency of deep RL methods with respect to EDAs has started to be empirically studied in ( , ). Another work focuses on the fact that derivative-free optimization methods are still competitive in terms of performance reached, but shows that they are less sample efficient by about one order of magnitude ( , ). Interestingly, this recent work started to investigate in which application contexts EDAs are a competitive alternative to deep RL methods, an effort that should be pursued in the future.

In most of these previous empirical studies, a broad conceptual analysis centered on the algorithms was missing. The goal of this paper was to lay the foundations for such an analysis.

Beyond the algorithms covered here, the current effort in the domain is on improving the stability of the policy search process while better controlling the bias-variance trade-off ( , ). Another line of thought uses dedicated off-line processes to build higher level discrete state and action representations from the policy improvement process, so as to shift from episodic policy search to solving discrete RL problems ( , ).

In this paper, we have given a broad overview of episodic policy search methods. In the future it would be useful to build on this general perspective to write a survey focusing more specifically on the most recent deep RL methods and their mathematical derivation in connection to previous policy search work. Such a more technical survey is currently missing.

In addition, we have restricted our study to single-task episodic policy search methods. Beyond episodic policy search, several research lines are making fast progress on the more general lifelong learning context, where the agent must learn how to achieve a growing number of tasks throughout its lifetime. Among other things, addressing the lifelong learning context would require specific studies of active learning, transfer learning and multi-task learning mechanisms, which are left for future work.

Acknowledgments

This work was supported by the European Commission, within the DREAM project, and has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 640891. We thank David Filliat for valuable feedback about an earlier version of this article.

References