<?xml version="1.0" encoding="UTF-8"?>
<?latexml searchpaths="/home/japhy/scienceReplication.artiswrong.com/paper_files/arxiv/2105.11977/latex_extracted"?>
<?latexml class="elsarticle" options="preprint,twocolumn,3p,authoryear"?>
<?latexml package="amssymb"?>
<?latexml package="hyperref"?>
<?latexml package="ucs"?>
<?latexml package="comment"?>
<?latexml package="amsmath"?>
<?latexml package="pbox"?>
<?latexml package="tcolorbox"?>
<?latexml package="caption"?>
<?latexml package="capt-of"?>
<?latexml package="bm"?>
<?latexml package="graphicx"?>
<?latexml package="array"?>
<?latexml package="multirow"?>
<?latexml package="colortbl"?>
<?latexml package="tikz"?>
<?latexml package="xspace"?>
<?latexml package="subfig" options="caption=false,font=footnotesize"?>
<?latexml package="algorithmic"?>
<?latexml package="algorithm"?>
<?latexml graphicspath="/home/sigaud/Bureau/Images/"?>
<?latexml graphicspath="/home/japhy/scienceReplication.artiswrong.com/paper_files/arxiv/2105.11977/latex_extracted/images"?>
<?latexml RelaxNGSchema="LaTeXML"?>
<document xmlns="http://dlmf.nist.gov/LaTeXML">
  <resource src="LaTeXML.css" type="text/css"/>
  <resource src="ltx-article.css" type="text/css"/>
  <title>Episodic Policy Search Algorithms: a sample efficiency perspective</title>
  <creator role="author">
    <personname>Olivier Sigaud</personname>
    <contact role="address">Sorbonne Universités, UPMC Univ Paris 06, CNRS UMR 7222,<break/>Institut des Systèmes Intelligents et de Robotique, F-75005 Paris, France<break/><text font="typewriter">olivier.sigaud@isir.upmc.fr</text>    +33 (0) 1 44 27 88 53
</contact>
  </creator>
  <creator role="author">
    <personname>Freek Stulp</personname>
    <contact role="address">German Aerospace Center (DLR), Institute of Robotics and Mechatronics, Wessling, Germany<break/><text font="typewriter">freek.stulp@dlr.de</text>
</contact>
  </creator>
  <abstract name="Abstract">
    <p>Episodic policy search is currently the focus of intensive research driven by the recent success of deep reinforcement learning (RL) algorithms.
In this paper we present a broad survey of episodic policy search methods, from optimization without a utility model and Bayesian optimization to derivative-based optimization and deep RL algorithms.
We build a unified and didactic perspective to explain why deep RL algorithms are generally more sample efficient than previous methods, and we provide a conceptual survey of the most recent algorithms, without going into the details of mathematical derivations.</p>
  </abstract>
  <classification scheme="keywords">
episodic policy search, sample efficiency, deep reinforcement learning
</classification>
  <section inlist="toc" labels="LABEL:sec:intro" xml:id="S1">
    <tags>
      <tag>1</tag>
      <tag role="autoref">section 1</tag>
      <tag role="refnum">1</tag>
      <tag role="typerefnum">§1</tag>
    </tags>
    <title><tag close=" ">1</tag>Introduction</title>
    <para xml:id="S1.p1">
      <p>Autonomous systems are systems which know what to do in their domain without external control.
Generally, their behavior is specified through a <text font="italic">policy</text>. The policy of a robot, for instance, is defined through a controller which determines actions to take or signals to send to the actuators in any state of the robot and its environment.</p>
    </para>
    <para xml:id="S1.p2">
      <p>Many robot policies are designed by hand, but manual design is only viable for systems that act in well-structured environments and perform well-specified tasks. When those conditions are not met, one can let the system find its own policy by exploring various behaviors and exploiting those that perform well with respect to some predefined <text font="italic">utility</text> function. This is called <text font="italic">policy search</text>, a particular case of <text font="italic">reinforcement learning</text> (RL) <cite class="ltx_citemacro_citep">(<bibref bibrefs="sutton98" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> where the action is continuous.
More precisely, the goal of policy search is to optimize a policy when the utility of the resulting behaviors is not known in advance.
In practice, a policy search algorithm runs the system with the current policy to generate <text font="italic">trajectories</text> made of several state and action <text font="italic">steps</text> and gets the utility as a return. This approach is called <text font="italic">black box</text> optimization (BBO). BBO algorithms receive the outcome of running the system as a set of <text font="italic">samples</text>.
There are two possible kinds of samples: samples corresponding to a single step of the system, which we call <text font="italic">step-samples</text> from now on, and samples corresponding to a complete trajectory, which we call <text font="italic">episodic-samples</text>.
The observed utility of these samples can then be used to choose a better policy, and the process is repeated until some satisfactory set of behaviors is found.
In general, policies are represented with a parametrized function, and policy search explores the space of <text font="italic">policy parameters</text>.
</p>
    </para>
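    <para>
      <p>The loop described above can be illustrated with a minimal sketch. It is not taken from any of the surveyed algorithms: the one-dimensional system, the linear policy <text font="typewriter">u = -theta * x</text>, the quadratic step utility and the hill-climbing update are all assumptions chosen for brevity. The hypothetical function <text font="typewriter">run_episode</text> returns both an episodic-sample (the summed utility of the trajectory) and the step-samples collected along the way.</p>
      <verbatim><![CDATA[
```python
import random


def run_episode(theta, x0=1.0, horizon=10):
    """Run one episode of a toy 1-D system with the policy u = -theta * x.

    Returns the episodic utility (sum of the step utilities) and the list
    of step-samples (x, u, r, x_next) observed along the trajectory.
    """
    x, utility, steps = x0, 0.0, []
    for _ in range(horizon):
        u = -theta * x        # action chosen by the policy
        x_next = x + u        # toy dynamics (an assumption of this sketch)
        r = -(x_next ** 2)    # step utility: stay close to the origin
        steps.append((x, u, r, x_next))
        utility += r
        x = x_next
    return utility, steps


def policy_search(n_iterations=200, sigma=0.1, seed=0):
    """Black-box hill climbing in the space of policy parameters."""
    rng = random.Random(seed)
    theta = 0.0
    best_utility, _ = run_episode(theta)
    for _ in range(n_iterations):
        candidate = theta + rng.gauss(0.0, sigma)  # explore a nearby policy
        utility, _ = run_episode(candidate)        # one episodic-sample
        if utility > best_utility:                 # exploit what performs well
            theta, best_utility = candidate, utility
    return theta, best_utility


theta, best_utility = policy_search()
```
]]></verbatim>
      <p>In this sketch every call to <text font="typewriter">run_episode</text> stands for one policy execution, which is the quantity a sample-efficient method tries to keep small. The hill-climbing update only looks at the episodic utility and discards the step-samples.</p>
    </para>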
    <para xml:id="S1.p3">
      <p>The main limitation of policy search is that, if policies are executed on a real robot, evaluating the policy is costly, mainly because it takes time and leads to wear and tear for the robot. For this reason, policy search methods for robotics should optimize the policy whilst minimizing the number of policy executions required to do so. A policy search method that achieves the same improvement with fewer policy executions in comparison to another method is more <text font="italic">sample efficient</text>.
</p>
    </para>
    <para xml:id="S1.p4">
      <p>The cost of processing the samples is often negligible compared with the cost of running the system. In that case, one may collect samples from a few experiments and then process them <text font="italic">off-line</text>, i.e. without running the system again, for a potentially long duration (up to hours, days, or even weeks), so as to improve its behavior. The processing cost of an off-line policy search method may therefore not matter much, whereas its scalability to large spaces does, because one cannot afford a method that would require months or years to process a high-dimensional data set.</p>
    </para>
    <subsection inlist="toc" xml:id="S1.SS1">
      <tags>
        <tag>1.1</tag>
        <tag role="autoref">subsection 1.1</tag>
        <tag role="refnum">1.1</tag>
        <tag role="typerefnum">§1.1</tag>
      </tags>
      <title><tag close=" ">1.1</tag>Scope and Contributions</title>
      <para xml:id="S1.SS1.p1">
        <p>This paper provides a review of policy search methods, under the specific constraints of sample efficiency outlined above.
More precisely, we scrutinize algorithms from the perspective of the use they make of collected samples.
We consider two aspects:
1) data efficiency: extracting more information from available data (definition taken from <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2011pilco" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>);
2) sample reuse: being able to improve a policy several times by using the same samples more than once, which is also known as <text font="italic">experience replay</text>.</p>
      </para>
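      <para>
        <p>Sample reuse can be sketched with a small replay buffer. The class name <text font="typewriter">ReplayBuffer</text>, its methods and the dummy step-samples below are hypothetical and only serve the illustration. The point is that a few stored samples can feed many update batches.</p>
        <verbatim><![CDATA[
```python
import random
from collections import deque


class ReplayBuffer:
    """Bounded store of samples that can be drawn from repeatedly."""

    def __init__(self, capacity, seed=0):
        # When the buffer is full, the oldest sample is dropped first.
        self.samples = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, sample):
        self.samples.append(sample)

    def sample_batch(self, batch_size):
        """Draw a batch without replacement; stored samples stay in the buffer."""
        size = min(batch_size, len(self.samples))
        return self.rng.sample(list(self.samples), size)


# Collect five dummy step-samples (x, u, r, x_next) from the system ...
buffer = ReplayBuffer(capacity=100)
for k in range(5):
    buffer.add((float(k), 0.0, -float(k), float(k + 1)))

# ... then reuse them in twenty update batches without running the system.
n_sample_uses = 0
for _ in range(20):
    batch = buffer.sample_batch(3)
    n_sample_uses += len(batch)  # a model update would consume `batch` here
```
]]></verbatim>
        <p>Here five collected samples are used sixty times in total. Whether such reuse actually improves a policy depends on the model being updated, which this sketch leaves out.</p>
      </para>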
      <para xml:id="S1.SS1.p2">
        <p>Furthermore, we focus on the case where the behavior expected from a system has a well-defined end point or duration, called the <text font="italic">episodic</text> policy search problem, and where the system is learning to solve a <text font="italic">single task</text>. That is, we do not cover the broader domain of <text font="italic">lifelong learning</text>, where a robot must learn how to perform various tasks over a potentially infinite horizon <cite class="ltx_citemacro_citep">(<bibref bibrefs="thrun1995lifelong" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S1.SS1.p3">
        <p>Additionally, though a subset of policy search methods are based on RL, we do not cover recent work on RL with discrete actions such as <text font="smallcaps">dqn</text> and some of its successors <cite class="ltx_citemacro_citep">(<bibref bibrefs="mnih2015human,van2015deep" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.
Finally, we restrict ourselves to the case where samples are the unique source of information for improving the policy.
That is, we do not consider the interactive context where a human user can provide external guidance <cite class="ltx_citemacro_citep">(<bibref bibrefs="najar2016training" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>,
either through feedback, shaping or demonstration <cite class="ltx_citemacro_citep">(<bibref bibrefs="argall09" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S1.SS1.p4">
        <p>Three surveys about policy search for robotics were published a few years ago <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey,stulp13paladyn,kober2013reinforcement" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.
With respect to these previous surveys, we make the following contributions:</p>
        <itemize xml:id="S1.I1">
          <item xml:id="S1.I1.i1">
            <tags>
              <tag>1.</tag>
              <tag role="autoref">item 1</tag>
              <tag role="refnum">1</tag>
              <tag role="typerefnum">item 1</tag>
            </tags>
            <para xml:id="S1.I1.i1.p1">
              <p>we focus on sample efficiency aspects, with the general ambition to explain which classes of algorithms are the most sample efficient and why.</p>
            </para>
          </item>
          <item xml:id="S1.I1.i2">
            <tags>
              <tag>2.</tag>
              <tag role="autoref">item 2</tag>
              <tag role="refnum">2</tag>
              <tag role="typerefnum">item 2</tag>
            </tags>
            <para xml:id="S1.I1.i2.p1">
              <p>we bring a broader range of episodic policy search algorithms, including optimization without a utility model, Bayesian optimization (BO) and deep RL, which are currently the subject of intensive research,
into a unifying perspective, giving rise to a more didactic understanding of the various families of algorithms. In particular, we cover more than 15 additional algorithms, most of which are more recent than <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite>, as summarized in Tables <ref labelref="LABEL:tab:classif1"/>, page <ref labelref="LABEL:tab:classif1"/> and <ref labelref="LABEL:tab:classif2"/>, page <ref labelref="LABEL:tab:classif2"/>. The counterpart of this breadth is that we cannot give a detailed account of these algorithms nor their mathematical derivation. We rather investigate the elementary optimization, exploration and model learning processes at the roots of these methods, and refer the reader to <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite> for the mathematical derivation and description of many algorithms, to <cite class="ltx_citemacro_citep">(<bibref bibrefs="stulp15NN" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite> for a survey of regression, and to <cite class="ltx_citemacro_citep">(<bibref bibrefs="grondman2012survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite> for a detailed presentation of <text font="italic">natural gradient</text> concepts playing an important role in the domain.</p>
            </para>
          </item>
        </itemize>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S1.SS2">
      <tags>
        <tag>1.2</tag>
        <tag role="autoref">subsection 1.2</tag>
        <tag role="refnum">1.2</tag>
        <tag role="typerefnum">§1.2</tag>
      </tags>
      <title><tag close=" ">1.2</tag>Perspective of the Review</title>
      <para xml:id="S1.SS2.p1">
        <p>We consider the distinction between <text font="italic">episodic-samples</text> and <text font="italic">step-samples</text> as crucial for our perspective.
This distinction exactly matches the phylogenetic RL versus ontogenetic RL distinction in <cite class="ltx_citemacro_citep">(<bibref bibrefs="togelius2009ontogenetic" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S1.SS2.p2">
        <p>As a consequence of our focus on sample efficiency, this survey is structured as depicted in Figure <ref labelref="LABEL:fig:orga_paper"/>.</p>
      </para>
      <figure inlist="lof" labels="LABEL:fig:orga_paper" placement="hbt!" xml:id="S1.F1">
        <tags>
          <tag>Figure 1</tag>
          <tag role="autoref">Figure 1</tag>
          <tag role="refnum">1</tag>
          <tag role="typerefnum">Figure 1</tag>
        </tags>
        <graphics class="ltx_centering" graphic="tree2-svg" options="width=780.516pt" xml:id="S1.F1.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">1</tag>Simplified classification of the algorithms covered in the paper. Algorithms not covered in <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> are in blue.
From the left to the right, algorithms are classified in increasing order of sample efficiency.
</toccaption>
        <caption class="ltx_centering"><tag close=": ">Figure 1</tag>Simplified classification of the algorithms covered in the paper. Algorithms not covered in <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> are in blue.
From the left to the right, algorithms are classified in increasing order of sample efficiency.
</caption>
      </figure>
      <para xml:id="S1.SS2.p3">
        <p>As stated above, policy search is an instance of BBO.
The most efficient approaches to optimization, based on convexity or closed-form computation of the optimum, generally require assumptions that are too restrictive to be applied to BBO <cite class="ltx_citemacro_citep">(<bibref bibrefs="gill1981practical" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.
The methods compatible with the BBO context are: derivative-based optimization, which needs the analytical form of a differentiable utility function; optimization without a utility model, which only requires smoothness from that function; and random search, which does not require anything but is generally inefficient.</p>
      </para>
      <para xml:id="S1.SS2.p4">
        <p>The most natural optimization approach to implement policy search would consist in performing derivative-based optimization on the utility function in the policy parameter space (see e.g. <cite class="ltx_citemacro_citep">(<bibref bibrefs="sutton88" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>).
However, in BBO, the analytical form of this function is generally not available.</p>
      </para>
      <para class="ltx_noindent" xml:id="S1.SS2.p5">
        <p>Given this difficulty, we consider four solutions:</p>
        <enumerate xml:id="S1.I2">
          <item labels="LABEL:sol1" xml:id="S1.I2.i1">
            <tags>
              <tag>1.</tag>
              <tag role="autoref">item 1</tag>
              <tag role="refnum">1</tag>
              <tag role="typerefnum">item 1</tag>
            </tags>
            <para xml:id="S1.I2.i1.p1">
              <p>using optimization without a utility model (Section <ref labelref="LABEL:sec:mfo"/>),</p>
            </para>
          </item>
          <item labels="LABEL:sol2" xml:id="S1.I2.i2">
            <tags>
              <tag>2.</tag>
              <tag role="autoref">item 2</tag>
              <tag role="refnum">2</tag>
              <tag role="typerefnum">item 2</tag>
            </tags>
            <para xml:id="S1.I2.i2.p1">
              <p>learning a surrogate model of the utility function in the space of policy parameters and performing analytical or derivative-based optimization using this model (Section <ref labelref="LABEL:sec:theta"/>),</p>
            </para>
          </item>
          <item labels="LABEL:sol3" xml:id="S1.I2.i3">
            <tags>
              <tag>3.</tag>
              <tag role="autoref">item 3</tag>
              <tag role="refnum">3</tag>
              <tag role="typerefnum">item 3</tag>
            </tags>
            <para xml:id="S1.I2.i3.p1">
              <p>learning a surrogate model of the utility function in the state-action space, called a <text font="italic">critic</text>, and then proceeding as in Solution <ref labelref="LABEL:sol2"/> (Section <ref labelref="LABEL:sec:psss"/>),</p>
            </para>
          </item>
          <item labels="LABEL:sol4" xml:id="S1.I2.i4">
            <tags>
              <tag>4.</tag>
              <tag role="autoref">item 4</tag>
              <tag role="refnum">4</tag>
              <tag role="typerefnum">item 4</tag>
            </tags>
            <para xml:id="S1.I2.i4.p1">
              <p>learning a <text font="italic">forward model</text> of the system-environment interaction that predicts the next state given the current state and action, to generate samples without using the system, and then applying one of the above solutions based on the generated samples (Section <ref labelref="LABEL:sec:MB"/>).</p>
            </para>
          </item>
        </enumerate>
        <p>The first two approaches are based on episodic-samples, the third is based on step-samples and the fourth can be applied to both.</p>
      </para>
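      <para>
        <p>As a rough illustration of the second solution, the sketch below fits a surrogate model of the utility in the policy parameter space and optimizes it analytically. Everything in it is an assumption made for the example: the one-dimensional parameter, the hypothetical function <text font="typewriter">toy_utility</text> standing for the outcome of an episode, and the choice of a parabola through three episodic-samples as the surrogate.</p>
        <verbatim><![CDATA[
```python
def toy_utility(theta):
    """Hypothetical episodic utility, unknown to the optimizer, maximal at theta = 1."""
    return -(theta - 1.0) ** 2


def fit_parabola_vertex(samples):
    """Fit a parabola through three (theta, utility) samples and return its vertex.

    The vertex is the maximizer of the surrogate only if the fitted
    parabola is concave, which is assumed here.
    """
    (x1, y1), (x2, y2), (x3, y3) = samples
    num = (x2 - x1) ** 2 * (y2 - y3) - (x2 - x3) ** 2 * (y2 - y1)
    den = (x2 - x1) * (y2 - y3) - (x2 - x3) * (y2 - y1)
    if den == 0.0:
        raise ValueError("samples are collinear: the surrogate has no optimum")
    return x2 - 0.5 * num / den


# Three policy executions give three episodic-samples ...
thetas = [0.0, 0.5, 2.0]
samples = [(t, toy_utility(t)) for t in thetas]

# ... and the optimum of the surrogate is obtained without further executions.
theta_star = fit_parabola_vertex(samples)
```
]]></verbatim>
        <p>Because the toy utility is itself quadratic, the surrogate is exact and <text font="typewriter">theta_star</text> coincides with the true optimum. With a real system the surrogate would only be approximate, and the new parameter would have to be evaluated by running the system again.</p>
      </para>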
    </subsection>
    <subsection inlist="toc" xml:id="S1.SS3">
      <tags>
        <tag>1.3</tag>
        <tag role="autoref">subsection 1.3</tag>
        <tag role="refnum">1.3</tag>
        <tag role="typerefnum">§1.3</tag>
      </tags>
      <title><tag close=" ">1.3</tag>Messages of the Review</title>
      <para xml:id="S1.SS3.p1">
        <p>From this perspective, our main messages are the following:</p>
        <enumerate xml:id="S1.I4">
          <item xml:id="S1.I4.i1">
            <tags>
              <tag>1.</tag>
              <tag role="autoref">item 1</tag>
              <tag role="refnum">1</tag>
              <tag role="typerefnum">item 1</tag>
            </tags>
            <para xml:id="S1.I4.i1.p1">
              <p>A sample can only be reused to improve a model. As a consequence, optimization without a utility model does not generally give rise to sample reuse.</p>
            </para>
          </item>
          <item xml:id="S1.I4.i2">
            <tags>
              <tag>2.</tag>
              <tag role="autoref">item 2</tag>
              <tag role="refnum">2</tag>
              <tag role="typerefnum">item 2</tag>
            </tags>
            <para xml:id="S1.I4.i2.p1">
              <p>The relative sample efficiency of learning a surrogate model of utility in the policy parameter space versus learning a critic mostly depends on the size and structure of the corresponding spaces, but the latter uses step-samples, which is inherently more sample efficient, and offers more opportunities for sample reuse.</p>
            </para>
          </item>
          <item xml:id="S1.I4.i3">
            <tags>
              <tag>3.</tag>
              <tag role="autoref">item 3</tag>
              <tag role="refnum">3</tag>
              <tag role="typerefnum">item 3</tag>
            </tags>
            <para xml:id="S1.I4.i3.p1">
              <p>Learning a critic offers the opportunity to learn useful intermediate representations.</p>
            </para>
          </item>
          <item xml:id="S1.I4.i4">
            <tags>
              <tag>4.</tag>
              <tag role="autoref">item 4</tag>
              <tag role="refnum">4</tag>
              <tag role="typerefnum">item 4</tag>
            </tags>
            <para xml:id="S1.I4.i4.p1">
              <p>On-line learning is generally faster than off-line learning, but it is also more unstable.</p>
            </para>
          </item>
          <item xml:id="S1.I4.i5">
            <tags>
              <tag>5.</tag>
              <tag role="autoref">item 5</tag>
              <tag role="refnum">5</tag>
              <tag role="typerefnum">item 5</tag>
            </tags>
            <para xml:id="S1.I4.i5.p1">
              <p>Learning a forward model is not enough to improve sample efficiency when using episodic-samples, because it does not provide the estimated utility of the generated samples. A more complete simulator providing this estimated utility over episodes is required.
</p>
            </para>
          </item>
          <item xml:id="S1.I4.i6">
            <tags>
              <tag>6.</tag>
              <tag role="autoref">item 6</tag>
              <tag role="refnum">6</tag>
              <tag role="typerefnum">item 6</tag>
            </tags>
            <para xml:id="S1.I4.i6.p1">
              <p>In contrast with using episodic-samples, learning a forward model can improve sample efficiency when using step-samples, because the immediate utility of step-samples is easily accessible.</p>
            </para>
          </item>
        </enumerate>
      </para>
      <para xml:id="S1.SS3.p2">
        <p>From these messages, it appears that the higher sample efficiency of deep RL methods results from several mechanisms:
they use derivative-based optimization updates, they model the utility function in the state-action space, and they can be combined with a forward model.
In addition, they benefit from massive sample reuse using a replay buffer and they can perform on-line learning.
However, research is still very active on how to manage the trade-off between bias and variance so as to control their intrinsic instability efficiently.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S1.SS4">
      <tags>
        <tag>1.4</tag>
        <tag role="autoref">subsection 1.4</tag>
        <tag role="refnum">1.4</tag>
        <tag role="typerefnum">§1.4</tag>
      </tags>
      <title><tag close=" ">1.4</tag>Structure of the Review</title>
      <para xml:id="S1.SS4.p1">
        <p>To explain these various points, the paper is organised as follows.
In Section <ref labelref="LABEL:sec:notations"/>, we formally define the episodic policy search problem, the notations and the main related concepts.
In Section <ref labelref="LABEL:sec:bbo"/>, we establish the taxonomy of methods depicted in Figure <ref labelref="LABEL:fig:orga_paper"/>, based on elementary processes that play an important role in many BBO methods, namely regression and optimization, without going down to the level of algorithms.
In Sections <ref labelref="LABEL:sec:mfo"/>, <ref labelref="LABEL:sec:theta"/> and <ref labelref="LABEL:sec:psss"/>, we show how these methods are implemented in various episodic policy search algorithms.
Then, in Section <ref labelref="LABEL:sec:choices"/>, we discuss the different elementary design choices that matter in terms of sample efficiency and reuse.
Finally, Section <ref labelref="LABEL:sec:conclu"/> summarizes the paper and provides some perspectives about current trends in the domain.</p>
      </para>
    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:notations" xml:id="S2">
    <tags>
      <tag>2</tag>
      <tag role="autoref">section 2</tag>
      <tag role="refnum">2</tag>
      <tag role="typerefnum">§2</tag>
    </tags>
    <title><tag close=" ">2</tag>Episodic policy search</title>
    <para xml:id="S2.p1">
      <p>This section introduces the general episodic policy search framework and various ways to compute utility with formal notations.
These notations are standard; readers familiar with RL <cite class="ltx_citemacro_citep">(<bibref bibrefs="sutton98" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> or episodic policy search <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey,stulp13paladyn,kober2013reinforcement" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> can skip this section.</p>
    </para>
    <subsection inlist="toc" xml:id="S2.SS1">
      <tags>
        <tag>2.1</tag>
        <tag role="autoref">subsection 2.1</tag>
        <tag role="refnum">2.1</tag>
        <tag role="typerefnum">§2.1</tag>
      </tags>
      <title><tag close=" ">2.1</tag>System and environment interaction</title>
      <para xml:id="S2.SS1.p1">
        <p>We consider a system, such as a robot or a software agent, in interaction with its environment.
Since the agent is learning with a computer, we consider an interaction along discrete time steps, as suggested in <cite class="ltx_citemacro_citep">(<bibref bibrefs="peters2008reinforcement" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.
This interaction is fully characterized by its current state <Math mode="inline" tex="\mathbf{\bm{x}}_{k}~{}\in~{}\mathcal{X}" text="x _ k element-of X" xml:id="S2.SS1.p1.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="element-of" name="in" role="RELOP" rpadding="3.3pt">∈</XMTok>
                <XMApp rpadding="3.3pt">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMTok font="caligraphic" role="UNKNOWN">X</XMTok>
              </XMApp>
            </XMath>
          </Math><note mark="1" role="footnote" xml:id="footnote1"><tags>
              <tag>1</tag>
              <tag role="autoref">footnote 1</tag>
              <tag role="refnum">1</tag>
              <tag role="typerefnum">footnote 1</tag>
            </tags>Throughout the paper, we denote scalars as lowercase symbols (<Math mode="inline" tex="x" text="x" xml:id="footnote1.m1">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">x</XMTok>
              </XMath>
            </Math>), vectors as bold lowercase symbols (<Math mode="inline" tex="\mathbf{\bm{x}}" text="x" xml:id="footnote1.m2">
              <XMath>
                <XMTok font="bold" role="UNKNOWN">x</XMTok>
              </XMath>
            </Math>) and matrices as bold uppercase symbols (<Math mode="inline" tex="\mathbf{\bm{X}}" text="X" xml:id="footnote1.m3">
              <XMath>
                <XMTok font="bold" role="UNKNOWN">X</XMTok>
              </XMath>
            </Math>). The time index is always <Math mode="inline" tex="k" text="k" xml:id="footnote1.m4">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">k</XMTok>
              </XMath>
            </Math> and the iteration index is always <Math mode="inline" tex="i" text="i" xml:id="footnote1.m5">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">i</XMTok>
              </XMath>
            </Math>. <Math mode="inline" tex="&lt;.,.&gt;" xml:id="footnote1.m6">
              <XMath>
                <XMTok meaning="less-than" role="RELOP">&lt;</XMTok>
                <XMTok role="PERIOD">.</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok role="PERIOD">.</XMTok>
                <XMTok meaning="greater-than" role="RELOP">&gt;</XMTok>
              </XMath>
            </Math> denotes a pair.
</note>.</p>
      </para>
      <para xml:id="S2.SS1.p2">
        <p>At each time step <Math mode="inline" tex="k" text="k" xml:id="S2.SS1.p2.m1">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">k</XMTok>
            </XMath>
          </Math>, the system receives state information <Math mode="inline" tex="\mathbf{\bm{x}}_{k}~{}\in~{}\mathcal{X}" text="x _ k element-of X" xml:id="S2.SS1.p2.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="element-of" name="in" role="RELOP" rpadding="3.3pt">∈</XMTok>
                <XMApp rpadding="3.3pt">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMTok font="caligraphic" role="UNKNOWN">X</XMTok>
              </XMApp>
            </XMath>
          </Math> and performs an action <Math mode="inline" tex="\mathbf{\bm{u}}_{k}\in\mathcal{U}" text="u _ k element-of U" xml:id="S2.SS1.p2.m3">
            <XMath>
              <XMApp>
                <XMTok meaning="element-of" name="in" role="RELOP">∈</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">u</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMTok font="caligraphic" role="UNKNOWN">U</XMTok>
              </XMApp>
            </XMath>
          </Math> specified by a stochastic policy <Math mode="inline" tex="\pol{{}_{\mathbf{\bm{\theta}}}}" text="pol@(_theta)" xml:id="S2.SS1.p2.m4">
            <XMath>
              <XMApp>
                <XMTok name="pol" role="OVERACCENT">→</XMTok>
                <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                  <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>, where <Math mode="inline" tex="{\mathbf{\bm{\theta}}}" text="theta" xml:id="S2.SS1.p2.m5">
            <XMath>
              <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
            </XMath>
          </Math> is a set of policy parameters taken from the policy parameter space <Math mode="inline" tex="\Theta" text="Theta" xml:id="S2.SS1.p2.m6">
            <XMath>
              <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
            </XMath>
          </Math>. In a closed-loop policy, <Math mode="inline" tex="\pol{{}_{\mathbf{\bm{\theta}}}}" text="pol@(_theta)" xml:id="S2.SS1.p2.m7">
            <XMath>
              <XMApp>
                <XMTok name="pol" role="OVERACCENT">→</XMTok>
                <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                  <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> is a stochastic function of the state <Math mode="inline" tex="\mathbf{\bm{x}}_{k}" text="x _ k" xml:id="S2.SS1.p2.m8">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold" role="UNKNOWN">x</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
              </XMApp>
            </XMath>
          </Math>, that is <Math mode="inline" tex="\mathbf{\bm{u}}_{k}\sim\pol{_{\mathbf{\bm{\theta}}}}(\mathbf{\bm{u}}_{k}|%&#10;\mathbf{\bm{x}}_{k})" text="u _ k similar-to pol@(theta@()) * conditional@(u _ k, x _ k)" xml:id="S2.SS1.p2.m9">
            <XMath>
              <XMApp>
                <XMTok meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">u</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok name="pol" role="OVERACCENT">→</XMTok>
                    <XMApp role="POSTSUBSCRIPT" scriptpos="2">
                      <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMDual>
                    <XMRef idref="S2.SS1.p2.m9.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S2.SS1.p2.m9.1">
                        <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">u</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                        </XMApp>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">x</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                        </XMApp>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> which gives the probability of choosing action <Math mode="inline" tex="\mathbf{\bm{u}}_{k}" text="u _ k" xml:id="S2.SS1.p2.m10">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold" role="UNKNOWN">u</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
              </XMApp>
            </XMath>
          </Math> given state <Math mode="inline" tex="\mathbf{\bm{x}}_{k}" text="x _ k" xml:id="S2.SS1.p2.m11">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold" role="UNKNOWN">x</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
              </XMApp>
            </XMath>
          </Math>. In the open-loop case, it is instead a function of the time step <Math mode="inline" tex="k" text="k" xml:id="S2.SS1.p2.m12">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">k</XMTok>
            </XMath>
          </Math>, that is <Math mode="inline" tex="\mathbf{\bm{u}}_{k}\sim\pol{_{\mathbf{\bm{\theta}}}}(\mathbf{\bm{u}}_{k}|k)" text="u _ k similar-to pol@(theta@()) * conditional@(u _ k, k)" xml:id="S2.SS1.p2.m13">
            <XMath>
              <XMApp>
                <XMTok meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">u</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok name="pol" role="OVERACCENT">→</XMTok>
                    <XMApp role="POSTSUBSCRIPT" scriptpos="2">
                      <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMDual>
                    <XMRef idref="S2.SS1.p2.m13.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S2.SS1.p2.m13.1">
                        <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">u</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                        </XMApp>
                        <XMTok font="italic" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>.
Finally, a deterministic policy is the special case in which one specific action has probability 1 and all other actions have probability 0.</p>
      </para>
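      <para>
        <p>As an illustration only, the three kinds of policy described above can be sketched in Python. The linear-Gaussian form, the helper names <text font="typewriter">closed_loop_policy</text> and <text font="typewriter">open_loop_policy</text>, the noise level <text font="typewriter">sigma</text> and all numerical values are assumptions made for this sketch; the text does not prescribe any particular parameterisation.</p>
        <verbatim font="typewriter">
```python
import random
from typing import List, Sequence

Vector = List[float]


def closed_loop_policy(x_k: Sequence[float], theta: Sequence[float],
                       sigma: float, rng: random.Random) -> Vector:
    """Sample u_k ~ pi_theta(u_k | x_k): Gaussian noise around the feedback theta . x_k."""
    mean = sum(t * x for t, x in zip(theta, x_k))
    return [rng.gauss(mean, sigma)]


def open_loop_policy(k: int, theta: Sequence[float],
                     sigma: float, rng: random.Random) -> Vector:
    """Sample u_k ~ pi_theta(u_k | k): theta[k] is the mean action at time step k."""
    return [rng.gauss(theta[k], sigma)]


rng = random.Random(0)
theta_feedback = [0.5, -1.0]      # assumed feedback gains (closed loop)
theta_schedule = [0.0, 0.2, 0.4]  # assumed mean action per time step (open loop)
x_k = [1.0, 2.0]

u_closed = closed_loop_policy(x_k, theta_feedback, sigma=0.1, rng=rng)
u_open = open_loop_policy(1, theta_schedule, sigma=0.1, rng=rng)
# With sigma = 0 all probability mass sits on one action: a deterministic policy.
u_det = closed_loop_policy(x_k, theta_feedback, sigma=0.0, rng=rng)
```
        </verbatim>
        <p>In this sketch, the closed-loop policy needs the state at every step, whereas the open-loop policy only needs the time index, so its parameter vector has one entry per time step.</p>
      </para>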
      <para xml:id="S2.SS1.p3">
        <p>The outcome of the action depends on the state and consists of two pieces of information: the next state <Math mode="inline" tex="\mathbf{\bm{x}}_{k+1}\sim~{}p(\mathbf{\bm{x}}_{k+1}|\mathbf{\bm{x}}_{k},%&#10;\mathbf{\bm{u}}_{k})" text="x _ (k + 1) similar-to p * conditional@(x _ (k + 1), list@(x _ k, u _ k))" xml:id="S2.SS1.p3.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="similar-to" name="sim" role="RELOP" rpadding="3.3pt">∼</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMApp>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">p</XMTok>
                  <XMDual>
                    <XMRef idref="S2.SS1.p3.m1.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S2.SS1.p3.m1.1">
                        <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">x</XMTok>
                          <XMApp>
                            <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                            <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="list"/>
                            <XMRef idref="S2.SS1.p3.m1.1.1"/>
                            <XMRef idref="S2.SS1.p3.m1.1.2"/>
                          </XMApp>
                          <XMWrap>
                            <XMApp xml:id="S2.SS1.p3.m1.1.1">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="bold" role="UNKNOWN">x</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                            </XMApp>
                            <XMTok role="PUNCT">,</XMTok>
                            <XMApp xml:id="S2.SS1.p3.m1.1.2">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="bold" role="UNKNOWN">u</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                            </XMApp>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>, which is itself a stochastic function of the current state and action, and an immediate utility <Math mode="inline" tex="j_{k}=~{}j(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})" text="j _ k = j * open-interval@(x _ k, u _ k)" xml:id="S2.SS1.p3.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP" rpadding="3.3pt">=</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">j</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">j</XMTok>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S2.SS1.p3.m2.1"/>
                      <XMRef idref="S2.SS1.p3.m2.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S2.SS1.p3.m2.1">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">x</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMApp xml:id="S2.SS1.p3.m2.2">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">u</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>.</p>
      </para>
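      <para>
        <p>The following Python sketch puts these pieces together for a single interaction step: the system samples an action, then the environment returns the immediate utility and the next state. The function signature and the toy one-dimensional problem (linear feedback, quadratic utility, Gaussian transition noise) are hypothetical choices made for illustration and are not taken from the paper.</p>
        <verbatim font="typewriter">
```python
import random
from typing import Callable, List, Tuple

State = List[float]
Action = List[float]


def perform_step(x_k: State,
                 policy: Callable[[State], Action],
                 utility: Callable[[State, Action], float],
                 transition: Callable[[State, Action], State]
                 ) -> Tuple[Action, float, State]:
    """One step: the system picks u_k, the environment returns j_k and x_{k+1}."""
    u_k = policy(x_k)               # system: u_k ~ pi_theta(u_k | x_k)
    j_k = utility(x_k, u_k)         # environment: j_k = j(x_k, u_k)
    x_next = transition(x_k, u_k)   # environment: x_{k+1} ~ p(x_{k+1} | x_k, u_k)
    return u_k, j_k, x_next


# Hypothetical toy problem: a 1-D point driven towards the origin.
rng = random.Random(0)
theta = [-0.5]


def policy(x: State) -> Action:
    return [theta[0] * x[0]]        # deterministic special case of pi_theta


def utility(x: State, u: Action) -> float:
    return -(x[0] ** 2) - 0.1 * (u[0] ** 2)


def transition(x: State, u: Action) -> State:
    return [x[0] + u[0] + rng.gauss(0.0, 0.01)]   # stochastic dynamics


u_0, j_0, x_1 = perform_step([2.0], policy, utility, transition)
```
        </verbatim>
        <p>Passing the policy, utility and transition as callables keeps the step independent of any particular parameterisation; a closed-loop or open-loop policy can be substituted without changing the step itself.</p>
      </para>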
      <para xml:id="S2.SS1.p4">
        <p>The aim of policy search is to optimize the policy parameters <Math mode="inline" tex="{\mathbf{\bm{\theta}}}" text="theta" xml:id="S2.SS1.p4.m1">
            <XMath>
              <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
            </XMath>
          </Math> with respect to some aggregation of the immediate utilities <Math mode="inline" tex="j_{k}" text="j _ k" xml:id="S2.SS1.p4.m2">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">j</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
              </XMApp>
            </XMath>
          </Math> over trajectories of the system.
The immediate utility can be a scalar, or a vector <Math mode="inline" tex="\mathbf{\bm{j}}_{k}=\mathbf{\bm{j}}(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})" text="j _ k = j * open-interval@(x _ k, u _ k)" xml:id="S2.SS1.p4.m3">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">j</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="bold" role="UNKNOWN">j</XMTok>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S2.SS1.p4.m3.1"/>
                      <XMRef idref="S2.SS1.p4.m3.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S2.SS1.p4.m3.1">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">x</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMApp xml:id="S2.SS1.p4.m3.2">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">u</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> in the multi-objective case.
In that case, secondary objectives may help improve the sample efficiency of some policy search methods (we do not cover this topic here; see <cite class="ltx_citemacro_citep">(<bibref bibrefs="doncieux2014beyond" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> for a survey).</p>
      </para>
      <para xml:id="S2.SS1.p5">
        <p>To summarise, a step of the system-environment interaction from some state <Math mode="inline" tex="\mathbf{\bm{x}}_{k}" text="x _ k" xml:id="S2.SS1.p5.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold" role="UNKNOWN">x</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
              </XMApp>
            </XMath>
          </Math> can be specified as in Algorithm <ref labelref="LABEL:alg:perform_step"/>, where “Ⓔ <Math mode="inline" tex="\rightarrow" text="rightarrow" xml:id="S2.SS1.p5.m2">
            <XMath>
              <XMTok name="rightarrow" role="ARROW">→</XMTok>
            </XMath>
          </Math>” denotes information provided by the environment and “Ⓢ <Math mode="inline" tex="\rightarrow" text="rightarrow" xml:id="S2.SS1.p5.m3">
            <XMath>
              <XMTok name="rightarrow" role="ARROW">→</XMTok>
            </XMath>
          </Math>” denotes information provided by the system.</p>
      </para>
      <theorem class="ltx_theorem_algorithm" labels="LABEL:alg:perform_step" xml:id="alg1">
        <tags>
          <tag>Algorithm 1</tag>
          <tag role="autoref">1</tag>
          <tag role="refnum">1</tag>
          <tag role="typerefnum">Algorithm 1</tag>
        </tags>
        <para xml:id="alg1.p1">
          <p>
<toccaption><tag close=" ">1</tag>perform<Math mode="inline" tex="\_" text="_" xml:id="alg1.p1.m1">
                <XMath>
                  <XMTok role="UNKNOWN">_</XMTok>
                </XMath>
              </Math>step(<Math mode="inline" tex="\mathbf{\bm{x}}_{k},{\mathbf{\bm{\theta}}}" text="list@(x _ k, theta)" xml:id="alg1.p1.m2">
                <XMath>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="list"/>
                      <XMRef idref="alg1.p1.m2.2"/>
                      <XMRef idref="alg1.p1.m2.1"/>
                    </XMApp>
                    <XMWrap>
                      <XMApp xml:id="alg1.p1.m2.2">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">x</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="alg1.p1.m2.1">θ</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMath>
              </Math>)</toccaption><caption><tag close=" ">Algorithm 1</tag>perform<Math mode="inline" tex="\_" text="_" xml:id="alg1.p1.m3">
                <XMath>
                  <XMTok role="UNKNOWN">_</XMTok>
                </XMath>
              </Math>step(<Math mode="inline" tex="\mathbf{\bm{x}}_{k},{\mathbf{\bm{\theta}}}" text="list@(x _ k, theta)" xml:id="alg1.p1.m4">
                <XMath>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="list"/>
                      <XMRef idref="alg1.p1.m4.2"/>
                      <XMRef idref="alg1.p1.m4.1"/>
                    </XMApp>
                    <XMWrap>
                      <XMApp xml:id="alg1.p1.m4.2">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">x</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="alg1.p1.m4.1">θ</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMath>
              </Math>)</caption>
<text font="bold">Require:</text> <Math mode="inline" tex="\mathbf{\bm{x}}_{k}" text="x _ k" xml:id="alg1.p1.m5">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
              </XMath>
            </Math>: current state, <Math mode="inline" tex="{\mathbf{\bm{\theta}}}" text="theta" xml:id="alg1.p1.m6">
              <XMath>
                <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
              </XMath>
            </Math>: policy parameters
<break/>Ⓢ <Math mode="inline" tex="\rightarrow" text="rightarrow" xml:id="alg1.p1.m7">
              <XMath>
                <XMTok name="rightarrow" role="ARROW">→</XMTok>
              </XMath>
            </Math> <Math mode="inline" tex="\mathbf{\bm{u}}_{k}\sim\pol{_{\mathbf{\bm{\theta}}}}(\mathbf{\bm{u}}_{k}|%&#10;\mathbf{\bm{x}}_{k})" text="u _ k similar-to pol@(theta@()) * conditional@(u _ k, x _ k)" xml:id="alg1.p1.m8">
              <XMath>
                <XMApp>
                  <XMTok meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold" role="UNKNOWN">u</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMApp>
                      <XMTok name="pol" role="OVERACCENT">→</XMTok>
                      <XMApp role="POSTSUBSCRIPT" scriptpos="2">
                        <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMDual>
                      <XMRef idref="alg1.p1.m8.1"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="alg1.p1.m8.1">
                          <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="bold" role="UNKNOWN">u</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                          </XMApp>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="bold" role="UNKNOWN">x</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math> or <Math mode="inline" tex="\mathbf{\bm{u}}_{k}\sim\pol{_{\mathbf{\bm{\theta}}}}(\mathbf{\bm{u}}_{k}|k)" text="u _ k similar-to pol@(theta@()) * conditional@(u _ k, k)" xml:id="alg1.p1.m9">
              <XMath>
                <XMApp>
                  <XMTok meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold" role="UNKNOWN">u</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMApp>
                      <XMTok name="pol" role="OVERACCENT">→</XMTok>
                      <XMApp role="POSTSUBSCRIPT" scriptpos="2">
                        <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMDual>
                      <XMRef idref="alg1.p1.m9.1"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="alg1.p1.m9.1">
                          <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="bold" role="UNKNOWN">u</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                          </XMApp>
                          <XMTok font="italic" role="UNKNOWN">k</XMTok>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>
<break/>Ⓔ <Math mode="inline" tex="\rightarrow" text="rightarrow" xml:id="alg1.p1.m10">
              <XMath>
                <XMTok name="rightarrow" role="ARROW">→</XMTok>
              </XMath>
            </Math> <Math mode="inline" tex="j_{k}=j(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})" text="j _ k = j * open-interval@(x _ k, u _ k)" xml:id="alg1.p1.m11">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">j</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">j</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="open-interval"/>
                        <XMRef idref="alg1.p1.m11.1"/>
                        <XMRef idref="alg1.p1.m11.2"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="alg1.p1.m11.1">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">x</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                        </XMApp>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMApp xml:id="alg1.p1.m11.2">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">u</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>
<break/>Ⓔ <Math mode="inline" tex="\rightarrow" text="rightarrow" xml:id="alg1.p1.m12">
              <XMath>
                <XMTok name="rightarrow" role="ARROW">→</XMTok>
              </XMath>
            </Math> <Math mode="inline" tex="\mathbf{\bm{x}}_{k+1}\sim p(\mathbf{\bm{x}}_{k+1}|\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})" text="x _ (k + 1) similar-to p * conditional@(x _ (k + 1), list@(x _ k, u _ k))" xml:id="alg1.p1.m13">
              <XMath>
                <XMApp>
                  <XMTok meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold" role="UNKNOWN">x</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">p</XMTok>
                    <XMDual>
                      <XMRef idref="alg1.p1.m13.1"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="alg1.p1.m13.1">
                          <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="bold" role="UNKNOWN">x</XMTok>
                            <XMApp>
                              <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                              <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                            </XMApp>
                          </XMApp>
                          <XMDual>
                            <XMApp>
                              <XMTok meaning="list"/>
                              <XMRef idref="alg1.p1.m13.1.1"/>
                              <XMRef idref="alg1.p1.m13.1.2"/>
                            </XMApp>
                            <XMWrap>
                              <XMApp xml:id="alg1.p1.m13.1.1">
                                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                <XMTok font="bold" role="UNKNOWN">x</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                              </XMApp>
                              <XMTok role="PUNCT">,</XMTok>
                              <XMApp xml:id="alg1.p1.m13.1.2">
                                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                <XMTok font="bold" role="UNKNOWN">u</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                              </XMApp>
                            </XMWrap>
                          </XMDual>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>
<break/><text font="bold">return</text> <Math mode="inline" tex="s_{k}=\langle\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k},j_{k},\mathbf{\bm{x}}_{k+1}\rangle" xml:id="alg1.p1.m14">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">s</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok name="langle" role="OPEN" stretchy="false">⟨</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">u</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">j</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMApp>
                <XMTok name="rangle" role="CLOSE" stretchy="false">⟩</XMTok>
              </XMath>
            </Math></p>
        </para>
      </theorem>
      <para xml:id="S2.SS1.p6">
        <p>Each step of the system in its environment generates a sample <Math mode="inline" tex="s_{k}=\langle\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k},j_{k},\mathbf{\bm{x}}_{k+1}\rangle" xml:id="S2.SS1.p6.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">s</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
              </XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMTok name="langle" role="OPEN" stretchy="false">⟨</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold" role="UNKNOWN">x</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
              </XMApp>
              <XMTok role="PUNCT">,</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold" role="UNKNOWN">u</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
              </XMApp>
              <XMTok role="PUNCT">,</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">j</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
              </XMApp>
              <XMTok role="PUNCT">,</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold" role="UNKNOWN">x</XMTok>
                <XMApp>
                  <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                </XMApp>
              </XMApp>
              <XMTok name="rangle" role="CLOSE" stretchy="false">⟩</XMTok>
            </XMath>
          </Math>, which we call a <text font="italic">step-sample</text> throughout the paper.</p>
      </para>
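      <para>
        <p>As a minimal illustrative sketch (not the implementation used in this paper), a step-sample can be stored as a four-field record and produced by one interaction step. The names <text font="typewriter">StepSample</text> and <text font="typewriter">perform_step</text>, as well as the one-dimensional toy policy, utility and transition functions, are hypothetical and chosen only for illustration.</p>
        <verbatim font="typewriter"><![CDATA[
```python
import random
from typing import NamedTuple, Tuple

State = Tuple[float, ...]
Action = Tuple[float, ...]


class StepSample(NamedTuple):
    """One step-sample s_k = (x_k, u_k, j_k, x_{k+1})."""
    x: State       # state x_k
    u: Action      # action u_k
    j: float       # immediate utility j_k = j(x_k, u_k)
    x_next: State  # next state x_{k+1}, drawn from p(. | x_k, u_k)


def perform_step(x, policy, utility, transition):
    """Apply the policy once in state x and return the resulting step-sample."""
    u = policy(x)              # action chosen by the current policy
    j = utility(x, u)          # immediate utility of this state-action pair
    x_next = transition(x, u)  # the environment samples the next state
    return StepSample(x, u, j, x_next)


# Hypothetical 1-D toy system (an assumption, for illustration only).
rng = random.Random(0)
policy = lambda x: (-0.5 * x[0],)
utility = lambda x, u: -(x[0] ** 2) - 0.1 * (u[0] ** 2)
transition = lambda x, u: (x[0] + u[0] + rng.gauss(0.0, 0.01),)

s0 = perform_step((1.0,), policy, utility, transition)
print(s0)
```
]]></verbatim>
        <p>Keeping the four fields together in one immutable record makes it straightforward to store step-samples in a memory and to reuse them later.</p>
      </para>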
    </subsection>
    <subsection inlist="toc" labels="LABEL:sec:eps" xml:id="S2.SS2">
      <tags>
        <tag>2.2</tag>
        <tag role="autoref">subsection 2.2</tag>
        <tag role="refnum">2.2</tag>
        <tag role="typerefnum">§2.2</tag>
      </tags>
      <title><tag close=" ">2.2</tag>Episodic policy search</title>
      <para xml:id="S2.SS2.p1">
        <p>In the episodic context, the system-environment interaction is initialized in some starting state <Math mode="inline" tex="\mathbf{\bm{x}}_{0}" text="x _ 0" xml:id="S2.SS2.p1.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold" role="UNKNOWN">x</XMTok>
                <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
              </XMApp>
            </XMath>
          </Math> and the current policy is applied until the system reaches one of
<Math mode="inline" tex="m" text="m" xml:id="S2.SS2.p1.m2">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">m</XMTok>
            </XMath>
          </Math> predefined final states <Math mode="inline" tex="\mathbf{\bm{x}}^{d}_{f},~{}d\in\{1,\ldots,m\},m~{}\geq~{}1" text="formulae@(list@((x ^ d) _ f, d) element-of set@(1, ldots, m), m &gt;= 1)" xml:id="S2.SS2.p1.m3">
            <XMath>
              <XMDual>
                <XMApp>
                  <XMTok meaning="formulae"/>
                  <XMRef idref="S2.SS2.p1.m3.5"/>
                  <XMRef idref="S2.SS2.p1.m3.6"/>
                </XMApp>
                <XMWrap>
                  <XMApp xml:id="S2.SS2.p1.m3.5">
                    <XMTok meaning="element-of" name="in" role="RELOP">∈</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="list"/>
                        <XMRef idref="S2.SS2.p1.m3.5.1"/>
                        <XMRef idref="S2.SS2.p1.m3.4"/>
                      </XMApp>
                      <XMWrap>
                        <XMApp xml:id="S2.SS2.p1.m3.5.1">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMApp>
                            <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="bold" role="UNKNOWN">x</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">d</XMTok>
                          </XMApp>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">f</XMTok>
                        </XMApp>
                        <XMTok role="PUNCT" rpadding="3.3pt">,</XMTok>
                        <XMTok font="italic" role="UNKNOWN" xml:id="S2.SS2.p1.m3.4">d</XMTok>
                      </XMWrap>
                    </XMDual>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="set"/>
                        <XMRef idref="S2.SS2.p1.m3.1"/>
                        <XMRef idref="S2.SS2.p1.m3.2"/>
                        <XMRef idref="S2.SS2.p1.m3.3"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">{</XMTok>
                        <XMTok meaning="1" role="NUMBER" xml:id="S2.SS2.p1.m3.1">1</XMTok>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMTok name="ldots" role="ID" xml:id="S2.SS2.p1.m3.2">…</XMTok>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMTok font="italic" role="UNKNOWN" xml:id="S2.SS2.p1.m3.3">m</XMTok>
                        <XMTok role="CLOSE" stretchy="false">}</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S2.SS2.p1.m3.6">
                    <XMTok meaning="greater-than-or-equals" name="geq" role="RELOP" rpadding="3.3pt">≥</XMTok>
                    <XMTok font="italic" role="UNKNOWN" rpadding="3.3pt">m</XMTok>
                    <XMTok meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math> or a maximum number of steps <Math mode="inline" tex="k_{\mathrm{max}}" text="k _ max" xml:id="S2.SS2.p1.m4">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">k</XMTok>
                <XMTok fontsize="70%" role="UNKNOWN">max</XMTok>
              </XMApp>
            </XMath>
          </Math> has elapsed.
Predefined final states can be either goal states to be attained or destructive states that should be avoided.
When <Math mode="inline" tex="k_{\mathrm{max}}" text="k _ max" xml:id="S2.SS2.p1.m5">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">k</XMTok>
                <XMTok fontsize="70%" role="UNKNOWN">max</XMTok>
              </XMApp>
            </XMath>
          </Math> steps have elapsed or a final state is reached, the system stops in state <Math mode="inline" tex="\mathbf{\bm{x}}_{f}" text="x _ f" xml:id="S2.SS2.p1.m6">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold" role="UNKNOWN">x</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">f</XMTok>
              </XMApp>
            </XMath>
          </Math> at step <Math mode="inline" tex="k_{f}" text="k _ f" xml:id="S2.SS2.p1.m7">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">k</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">f</XMTok>
              </XMApp>
            </XMath>
          </Math>.
The system-environment interaction between <Math mode="inline" tex="\mathbf{\bm{x}}_{0}" text="x _ 0" xml:id="S2.SS2.p1.m8">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold" role="UNKNOWN">x</XMTok>
                <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
              </XMApp>
            </XMath>
          </Math> and <Math mode="inline" tex="\mathbf{\bm{x}}_{f}" text="x _ f" xml:id="S2.SS2.p1.m9">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold" role="UNKNOWN">x</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">f</XMTok>
              </XMApp>
            </XMath>
          </Math> is called an <text font="italic">episode</text>.</p>
      </para>
      <para xml:id="S2.SS2.p2">
        <p>We define a <text font="italic">trajectory</text> as the set of step-samples obtained along episode <Math mode="inline" tex="e" text="e" xml:id="S2.SS2.p2.m1">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">e</XMTok>
            </XMath>
          </Math> and we denote it with <Math mode="inline" tex="\tau_{e}" text="tau _ e" xml:id="S2.SS2.p2.m2">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">e</XMTok>
              </XMApp>
            </XMath>
          </Math>.</p>
      </para>
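      <para>
        <p>The episode loop described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the procedure of any specific algorithm: the function <text font="typewriter">run_episode</text>, the predicate <text font="typewriter">is_final</text> and the integer-walk toy problem are hypothetical names and choices introduced here.</p>
        <verbatim font="typewriter"><![CDATA[
```python
import random


def run_episode(x0, policy, utility, transition, is_final, k_max):
    """Run one episode from x0 and return the trajectory as a list of step-samples."""
    x, trajectory = x0, []
    # Stop when a predefined final state is reached or k_max steps have elapsed.
    while len(trajectory) < k_max and not is_final(x):
        u = policy(x)
        j = utility(x, u)
        x_next = transition(x, u)
        trajectory.append((x, u, j, x_next))  # step-sample s_k
        x = x_next
    return trajectory


# Hypothetical toy problem (an assumption): walk on the integers from 0 to goal 3.
rng = random.Random(0)
policy = lambda x: 1                                          # always try to move right
utility = lambda x, u: -1.0                                   # each step costs one unit
transition = lambda x, u: x + u if rng.random() < 0.8 else x  # the move fails 20% of the time
is_final = lambda x: x in (3, -1)                             # 3: goal state, -1: destructive state

tau = run_episode(0, policy, utility, transition, is_final, k_max=50)
x_f, k_f = tau[-1][3], len(tau)  # final state and number of steps taken
print(k_f, x_f)
```
]]></verbatim>
        <p>Both termination conditions are checked before each step, so the returned trajectory never contains more than <text font="typewriter">k_max</text> step-samples.</p>
      </para>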
    </subsection>
    <subsection inlist="toc" xml:id="S2.SS3">
      <tags>
        <tag>2.3</tag>
        <tag role="autoref">subsection 2.3</tag>
        <tag role="refnum">2.3</tag>
        <tag role="typerefnum">§2.3</tag>
      </tags>
      <title><tag close=" ">2.3</tag>Utilities</title>
      <para xml:id="S2.SS3.p1">
        <p>Three notions of utility are useful in episodic policy search: the utility over a single trajectory, the utility over a set of trajectories from the same state, which corresponds to the <text font="italic">Monte Carlo return</text> from that state, and the utility over all states.</p>
      </para>
      <subsubsection inlist="toc" labels="LABEL:sec:uti_eps" xml:id="S2.SS3.SSS1">
        <tags>
          <tag>2.3.1</tag>
          <tag role="autoref">subsubsection 2.3.1</tag>
          <tag role="refnum">2.3.1</tag>
          <tag role="typerefnum">§2.3.1</tag>
        </tags>
        <title><tag close=" ">2.3.1</tag>Utility over a trajectory</title>
        <para xml:id="S2.SS3.SSS1.p1">
          <p>In the language of episodic policy search algorithms, running a policy to get the utility over a trajectory <Math mode="inline" tex="J(\tau)" text="J * tau" xml:id="S2.SS3.SSS1.p1.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">J</XMTok>
                  <XMDual>
                    <XMRef idref="S2.SS3.SSS1.p1.m1.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="italic" name="tau" role="UNKNOWN" xml:id="S2.SS3.SSS1.p1.m1.1">τ</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math> is often called <text font="italic">performing a rollout</text>.
We later refer to a rollout together with its utility as an <text font="italic">episodic-sample</text>.</p>
        </para>
        <para xml:id="S2.SS3.SSS1.p2">
          <p>In some settings, the utility is taken to be zero all along the trajectory and is evaluated only at the final state.
In particular, this is the case in many evolutionary algorithms, where the utility provided by the environment over an episode
is called the <text font="italic">fitness function</text> and is not necessarily a function of states and actions along the episode <cite class="ltx_citemacro_citep">(<bibref bibrefs="doncieux2015evolutionary" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.</p>
        </para>
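        <para>
          <p>The text above does not fix how the utility over a trajectory aggregates the immediate utilities of its step-samples. As a hedged illustration, the sketch below assumes two common choices, a discounted sum with a discount factor in [0, 1] and an average over the episode, and shows that the final-state-only setting is the special case in which all earlier utilities are zero. The function names are hypothetical.</p>
          <verbatim font="typewriter"><![CDATA[
```python
def discounted_utility(utilities, gamma):
    """Discounted sum of the immediate utilities collected along a trajectory."""
    return sum(gamma ** k * j for k, j in enumerate(utilities))


def average_utility(utilities):
    """Average of the immediate utilities over the episode."""
    return sum(utilities) / len(utilities)


dense = [1.0, 1.0, 1.0]   # utility received at every step
sparse = [0.0, 0.0, 4.0]  # utility evaluated only at the final state

print(discounted_utility(dense, 0.5))   # 1 + 0.5 + 0.25
print(average_utility(dense))
print(discounted_utility(sparse, 1.0))  # only the final utility contributes
```
]]></verbatim>
        </para>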
      </subsubsection>
      <subsubsection inlist="toc" labels="LABEL:sec:mc" xml:id="S2.SS3.SSS2">
        <tags>
          <tag>2.3.2</tag>
          <tag role="autoref">subsubsection 2.3.2</tag>
          <tag role="refnum">2.3.2</tag>
          <tag role="typerefnum">§2.3.2</tag>
        </tags>
        <title><tag close=" ">2.3.2</tag>Monte Carlo return</title>
        <para xml:id="S2.SS3.SSS2.p1">
          <p>Because the system-environment interaction is stochastic, several episodes starting from the same initial state <Math mode="inline" tex="\mathbf{\bm{x}}_{0}" text="x _ 0" xml:id="S2.SS3.SSS2.p1.m1">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                </XMApp>
              </XMath>
            </Math> using the same policy can give rise to different trajectories and utilities. As a consequence, one should consider the expected utility over all possible trajectories <Math mode="inline" tex="\tau" text="tau" xml:id="S2.SS3.SSS2.p1.m2">
              <XMath>
                <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
              </XMath>
            </Math> starting from <Math mode="inline" tex="\mathbf{\bm{x}}_{0}" text="x _ 0" xml:id="S2.SS3.SSS2.p1.m3">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                </XMApp>
              </XMath>
            </Math>, denoted <Math mode="inline" tex="\tau_{\mathbf{\bm{x}}_{0}}" text="tau _ x _ 0" xml:id="S2.SS3.SSS2.p1.m4">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                    <XMTok font="bold" fontsize="70%" role="UNKNOWN">x</XMTok>
                    <XMTok fontsize="50%" meaning="0" role="NUMBER">0</XMTok>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>.
We write the corresponding expected utility as <Math mode="inline" tex="\bar{J}(\tau_{\mathbf{\bm{x}}_{0}})=\mathbb{E}_{\tau}\{J(\tau_{\mathbf{\bm{x}}_{0}})\}" xml:id="S2.SS3.SSS2.p1.m5">
              <XMath>
                <XMApp>
                  <XMTok name="bar" role="OVERACCENT" stretchy="false">¯</XMTok>
                  <XMTok font="italic" role="UNKNOWN">J</XMTok>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                      <XMTok font="bold" fontsize="70%" role="UNKNOWN">x</XMTok>
                      <XMTok fontsize="50%" meaning="0" role="NUMBER">0</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="blackboard" role="UNKNOWN">𝔼</XMTok>
                  <XMTok font="italic" fontsize="70%" name="tau" role="UNKNOWN">τ</XMTok>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">{</XMTok>
                  <XMTok font="italic" role="UNKNOWN">J</XMTok>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                        <XMTok font="bold" fontsize="70%" role="UNKNOWN">x</XMTok>
                        <XMTok fontsize="50%" meaning="0" role="NUMBER">0</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                  <XMTok role="CLOSE" stretchy="false">}</XMTok>
                </XMWrap>
              </XMath>
            </Math>.
Because this expected utility is defined over an infinite set of episodes, it must be approximated in practice.
</p>
        </para>
        <para xml:id="S2.SS3.SSS2.p2">
          <p>Methods that evaluate a policy just by sampling the utility over a large enough set of rollouts are called <text font="italic">Monte Carlo</text> methods.
They can also be used to estimate the utility at any state along a trajectory to learn a critic, as covered in Section <ref labelref="LABEL:sec:psss"/>.
The resulting utility estimate comes with some variance, so it must be averaged over enough trajectories to provide an accurate estimate of the return.</p>
        </para>
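        <para>
          <p>A Monte Carlo estimate of the expected utility from a given initial state can be sketched as below. This is a minimal illustration assuming a hypothetical function <text font="typewriter">rollout_utility</text> that performs one rollout from the given state and returns its utility; here it is replaced by a noisy toy function whose true expected utility is known. The sketch also reports the standard error of the mean, which shrinks as more rollouts are averaged.</p>
          <verbatim font="typewriter"><![CDATA[
```python
import random
import statistics


def monte_carlo_return(rollout_utility, x0, n_episodes):
    """Average the utility of n_episodes rollouts from x0; also return the standard error."""
    samples = [rollout_utility(x0) for _ in range(n_episodes)]
    mean = statistics.fmean(samples)
    stderr = statistics.stdev(samples) / n_episodes ** 0.5
    return mean, stderr


# Hypothetical noisy rollout (an assumption): the true expected utility is 2 * x0
# and each rollout adds Gaussian noise with unit standard deviation.
rng = random.Random(0)
noisy_rollout = lambda x0: 2.0 * x0 + rng.gauss(0.0, 1.0)

mean_small, err_small = monte_carlo_return(noisy_rollout, 1.0, 10)
mean_large, err_large = monte_carlo_return(noisy_rollout, 1.0, 10000)
print(mean_small, err_small)
print(mean_large, err_large)
```
]]></verbatim>
          <p>The standard error decreases only with the square root of the number of rollouts, which is why a large number of episodes may be needed for an accurate estimate.</p>
        </para>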
      </subsubsection>
      <subsubsection inlist="toc" labels="LABEL:sec:glob_uti_eps" xml:id="S2.SS3.SSS3">
        <tags>
          <tag>2.3.3</tag>
          <tag role="autoref">subsubsection 2.3.3</tag>
          <tag role="refnum">2.3.3</tag>
          <tag role="typerefnum">§2.3.3</tag>
        </tags>
        <title><tag close=" ">2.3.3</tag>Global utility</title>
        <para xml:id="S2.SS3.SSS3.p1">
          <p>The most general goal of policy search is to find an optimal policy starting from any initial state <Math mode="inline" tex="\mathbf{\bm{x}}_{0}" text="x _ 0" xml:id="S2.SS3.SSS3.p1.m1">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                </XMApp>
              </XMath>
            </Math>.
This goal can be formalized as optimizing</p>
          <equation labels="LABEL:eq:glob_return" xml:id="S2.E1">
            <tags>
              <tag>(1)</tag>
              <tag role="autoref">Equation 1</tag>
              <tag role="refnum">1</tag>
            </tags>
            <Math mode="display" tex="J({\mathbf{\bm{\theta}}})=\int_{\mathbf{\bm{x}}_{0}\in\mathcal{X}}\bar{J}(\tau_{\mathbf{\bm{x}}_{0}})d\mathbf{\bm{x}}_{0}" text="J * theta = (integral _ (x _ 0 element-of X))@(bar@(J) * tau _ x _ 0 * differential-d@(x _ 0))" xml:id="S2.E1.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">J</XMTok>
                    <XMDual>
                      <XMRef idref="S2.E1.m1.1"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S2.E1.m1.1">θ</XMTok>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                  <XMApp>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok mathstyle="display" meaning="integral" name="int" role="INTOP">∫</XMTok>
                      <XMApp>
                        <XMTok fontsize="70%" meaning="element-of" name="in" role="RELOP">∈</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                          <XMTok font="bold" fontsize="70%" role="UNKNOWN">x</XMTok>
                          <XMTok fontsize="50%" meaning="0" role="NUMBER">0</XMTok>
                        </XMApp>
                        <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">X</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok name="bar" role="OVERACCENT" stretchy="false">¯</XMTok>
                        <XMTok font="italic" role="UNKNOWN">J</XMTok>
                      </XMApp>
                      <XMDual>
                        <XMRef idref="S2.E1.m1.2"/>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMApp xml:id="S2.E1.m1.2">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                              <XMTok font="bold" fontsize="70%" role="UNKNOWN">x</XMTok>
                              <XMTok fontsize="50%" meaning="0" role="NUMBER">0</XMTok>
                            </XMApp>
                          </XMApp>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                      <XMApp>
                        <XMTok font="italic" meaning="differential-d" role="DIFFOP">d</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">x</XMTok>
                          <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                        </XMApp>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>
          </equation>
        </para>
        <para xml:id="S2.SS3.SSS3.p2">
          <p>Here again, computing this utility would require generating trajectories starting from an infinite number of initial states <Math mode="inline" tex="\mathbf{\bm{x}}_{0}" text="x _ 0" xml:id="S2.SS3.SSS3.p2.m1">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                </XMApp>
              </XMath>
            </Math>.
In practice, one can only approximate it using a finite set of initial states.
Thus an estimate of the global utility of some policy parameters <Math mode="inline" tex="{\mathbf{\bm{\theta}}}" text="theta" xml:id="S2.SS3.SSS3.p2.m2">
              <XMath>
                <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
              </XMath>
            </Math> has two sources of variance,
one resulting from the sampling of initial states <Math mode="inline" tex="\mathbf{\bm{x}}_{0}" text="x _ 0" xml:id="S2.SS3.SSS3.p2.m3">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                </XMApp>
              </XMath>
            </Math> and one resulting from the sampling of trajectories for a given <Math mode="inline" tex="\mathbf{\bm{x}}_{0}" text="x _ 0" xml:id="S2.SS3.SSS3.p2.m4">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                </XMApp>
              </XMath>
            </Math>.
For some problems, one can train the system from episodes that all start from the same <Math mode="inline" tex="\mathbf{\bm{x}}_{0}" text="x _ 0" xml:id="S2.SS3.SSS3.p2.m5">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                </XMApp>
              </XMath>
            </Math>.
With this option, the policy is not guaranteed to improve for initial states other than <Math mode="inline" tex="\mathbf{\bm{x}}_{0}" text="x _ 0" xml:id="S2.SS3.SSS3.p2.m6">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                </XMApp>
              </XMath>
            </Math>,
thus it can only be applied if the system is later exploited from the same <Math mode="inline" tex="\mathbf{\bm{x}}_{0}" text="x _ 0" xml:id="S2.SS3.SSS3.p2.m7">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                </XMApp>
              </XMath>
            </Math>.
The other option is to start from a finite set of initial states.
This applies to a wider range of problems, but results in larger variance.</p>
        </para>
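        <para>
          <p>The two sources of variance can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions, not the procedure of any specific method: the scalar toy system, the linear policy, the quadratic utility and all function names are hypothetical and were chosen only to keep the example self-contained.</p>
          <verbatim font="typewriter">
```python
import random
from typing import Callable, Sequence

# Hypothetical signatures, introduced only for this sketch.
InitialStateSampler = Callable[[random.Random], float]
RolloutUtility = Callable[[float, Sequence[float], random.Random], float]


def estimate_global_utility(
    theta: Sequence[float],
    draw_initial_state: InitialStateSampler,
    rollout_utility: RolloutUtility,
    n_initial_states: int,
    n_rollouts_per_state: int,
    rng: random.Random,
) -> float:
    """Average the episode utility over sampled initial states and rollouts."""
    total = 0.0
    for _ in range(n_initial_states):
        x0 = draw_initial_state(rng)  # variance source 1: which x0 is drawn
        per_state = 0.0
        for _ in range(n_rollouts_per_state):
            # variance source 2: the stochastic trajectory generated from x0
            per_state += rollout_utility(x0, theta, rng)
        total += per_state / n_rollouts_per_state
    return total / n_initial_states


def toy_rollout_utility(
    x0: float,
    theta: Sequence[float],
    rng: random.Random,
    horizon: int = 5,
    noise_std: float = 0.1,
) -> float:
    """Toy scalar system x' = x + u + noise with policy u = -theta[0] * x.

    The utility is minus the sum of squared states (an assumption).
    """
    x, utility = x0, 0.0
    for _ in range(horizon):
        u = -theta[0] * x
        x = x + u + rng.gauss(0.0, noise_std)
        utility -= x * x
    return utility


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.5]
    # Option 1: every episode starts from the same initial state.
    same_start = estimate_global_utility(
        theta, lambda r: 2.0, toy_rollout_utility, 1, 20, rng)
    # Option 2: the initial state is drawn from a finite set.
    starts = [-2.0, -1.0, 1.0, 2.0]
    varied = estimate_global_utility(
        theta, lambda r: r.choice(starts), toy_rollout_utility, 20, 1, rng)
    print(same_start, varied)
```
          </verbatim>
          <p>In this sketch, fixing a single initial state removes the first source of variance, at the price of evaluating the policy only from that state.</p>
        </para>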
        <para xml:id="S2.SS3.SSS3.p3">
          <p>As shown in <cite class="ltx_citemacro_citep">(<bibref bibrefs="williams1992" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>, (<ref labelref="LABEL:eq:glob_return"/>) can also be written</p>
          <equation labels="LABEL:eq:glob_return_tau" xml:id="S2.E2">
            <tags>
              <tag>(2)</tag>
              <tag role="autoref">Equation 2</tag>
              <tag role="refnum">2</tag>
            </tags>
            <Math mode="display" tex="J({\mathbf{\bm{\theta}}})=\int_{\tau\in\mathcal{T}}p_{\mathbf{\bm{\theta}}}(%&#10;\tau)J(\tau)d\tau" text="J * theta = (integral _ (tau element-of T))@(p _ theta * tau * J * tau * differential-d@(tau))" xml:id="S2.E2.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">J</XMTok>
                    <XMDual>
                      <XMRef idref="S2.E2.m1.1"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S2.E2.m1.1">θ</XMTok>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                  <XMApp>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok mathstyle="display" meaning="integral" name="int" role="INTOP">∫</XMTok>
                      <XMApp>
                        <XMTok fontsize="70%" meaning="element-of" name="in" role="RELOP">∈</XMTok>
                        <XMTok font="italic" fontsize="70%" name="tau" role="UNKNOWN">τ</XMTok>
                        <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">T</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" role="UNKNOWN">p</XMTok>
                        <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                      </XMApp>
                      <XMDual>
                        <XMRef idref="S2.E2.m1.2"/>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="italic" name="tau" role="UNKNOWN" xml:id="S2.E2.m1.2">τ</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                      <XMTok font="italic" role="UNKNOWN">J</XMTok>
                      <XMDual>
                        <XMRef idref="S2.E2.m1.3"/>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="italic" name="tau" role="UNKNOWN" xml:id="S2.E2.m1.3">τ</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                      <XMApp>
                        <XMTok font="italic" meaning="differential-d" role="DIFFOP">d</XMTok>
                        <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>
          </equation>
        </para>
        <para xml:id="S2.SS3.SSS3.p4">
          <p>where <Math mode="inline" tex="\mathcal{T}" text="T" xml:id="S2.SS3.SSS3.p4.m1">
              <XMath>
                <XMTok font="caligraphic" role="UNKNOWN">T</XMTok>
              </XMath>
            </Math> is the space of all possible trajectories and <Math mode="inline" tex="p_{\mathbf{\bm{\theta}}}(\tau)" text="p _ theta * tau" xml:id="S2.SS3.SSS3.p4.m2">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">p</XMTok>
                    <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                  </XMApp>
                  <XMDual>
                    <XMRef idref="S2.SS3.SSS3.p4.m2.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="italic" name="tau" role="UNKNOWN" xml:id="S2.SS3.SSS3.p4.m2.1">τ</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math> defines the probability of performing trajectory <Math mode="inline" tex="\tau" text="tau" xml:id="S2.SS3.SSS3.p4.m3">
              <XMath>
                <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
              </XMath>
            </Math> under policy <Math mode="inline" tex="\pol{{}_{\mathbf{\bm{\theta}}}}" text="pol@(_theta)" xml:id="S2.SS3.SSS3.p4.m4">
              <XMath>
                <XMApp>
                  <XMTok name="pol" role="OVERACCENT">→</XMTok>
                  <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                    <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>.
This formulation is used in episodic policy search methods using step-samples, as covered in Section <ref labelref="LABEL:sec:psss"/>.</p>
        </para>
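        <para>
          <p>Because (<ref labelref="LABEL:eq:glob_return_tau"/>) is an expectation under the trajectory distribution, it can be approximated by a sample average over a finite number of trajectories. The following is a sketch in which the number of trajectories N and the indexed trajectories are notation introduced here for illustration; it assumes that the trajectories are drawn independently under the current policy.</p>
          <verbatim font="typewriter">
```latex
J(\bm{\theta})
  = \mathbb{E}_{\tau \sim p_{\bm{\theta}}}\left[ J(\tau) \right]
  \approx \frac{1}{N} \sum_{i=1}^{N} J(\tau_i),
  \qquad \tau_i \sim p_{\bm{\theta}}.
```
          </verbatim>
          <p>Under these assumptions the estimate is unbiased, and its variance decreases as 1/N whenever the utility of a trajectory has finite variance.</p>
        </para>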
        <para xml:id="S2.SS3.SSS3.p5">
          <p>Furthermore, an infinite-horizon context can also be covered, provided that the system can be run for a number of steps and that some utility can be obtained from running these steps.</p>
        </para>
      </subsubsection>
    </subsection>
    <subsection inlist="toc" labels="LABEL:mes:mb1 LABEL:sec:MB" xml:id="S2.SS4">
      <tags>
        <tag>2.4</tag>
        <tag role="autoref">subsection 2.4</tag>
        <tag role="refnum">2.4</tag>
        <tag role="typerefnum">§2.4</tag>
      </tags>
      <title><tag close=" ">2.4</tag>Model-based versus model-free episodic policy search</title>
      <para xml:id="S2.SS4.p1">
        <p>A <text font="italic">forward model</text> is a model of the system-environment interaction which, given the current state and the current action,
outputs a probability distribution over possible next states, or just the most likely one.
Using a forward model to accelerate learning of a critic is called <text font="italic">model-based RL</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="sutton91dyna" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S2.SS4.p2">
        <p>The forward model can be learned from <Math mode="inline" tex="(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k},\mathbf{\bm{x}}_{k+1})" text="vector@(x _ k, u _ k, x _ (k + 1))" xml:id="S2.SS4.p2.m1">
            <XMath>
              <XMDual>
                <XMApp>
                  <XMTok meaning="vector"/>
                  <XMRef idref="S2.SS4.p2.m1.1"/>
                  <XMRef idref="S2.SS4.p2.m1.2"/>
                  <XMRef idref="S2.SS4.p2.m1.3"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp xml:id="S2.SS4.p2.m1.1">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold" role="UNKNOWN">x</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S2.SS4.p2.m1.2">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold" role="UNKNOWN">u</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S2.SS4.p2.m1.3">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold" role="UNKNOWN">x</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math> samples as a function <Math mode="inline" tex="\hat{g}" text="hat@(g)" xml:id="S2.SS4.p2.m2">
            <XMath>
              <XMApp>
                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                <XMTok font="italic" role="UNKNOWN">g</XMTok>
              </XMApp>
            </XMath>
          </Math> using any regression algorithm described in Section <ref labelref="LABEL:sec:regression"/>, such that <Math mode="inline" tex="\hat{\mathbf{\bm{x}}}_{k+1}\sim\hat{g}({\mathbf{\bm{x}}}_{k+1}|\mathbf{\bm{x}}%&#10;_{k},\mathbf{\bm{u}}_{k})" text="(hat@(x)) _ (k + 1) similar-to hat@(g) * conditional@(x _ (k + 1), list@(x _ k, u _ k))" xml:id="S2.SS4.p2.m3">
            <XMath>
              <XMApp>
                <XMTok meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMApp>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">g</XMTok>
                  </XMApp>
                  <XMDual>
                    <XMRef idref="S2.SS4.p2.m3.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S2.SS4.p2.m3.1">
                        <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">x</XMTok>
                          <XMApp>
                            <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                            <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="list"/>
                            <XMRef idref="S2.SS4.p2.m3.1.1"/>
                            <XMRef idref="S2.SS4.p2.m3.1.2"/>
                          </XMApp>
                          <XMWrap>
                            <XMApp xml:id="S2.SS4.p2.m3.1.1">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="bold" role="UNKNOWN">x</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                            </XMApp>
                            <XMTok role="PUNCT">,</XMTok>
                            <XMApp xml:id="S2.SS4.p2.m3.1.2">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="bold" role="UNKNOWN">u</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                            </XMApp>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> or <Math mode="inline" tex="\hat{\mathbf{\bm{x}}}_{k+1}=\hat{g}(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})" text="(hat@(x)) _ (k + 1) = hat@(g) * open-interval@(x _ k, u _ k)" xml:id="S2.SS4.p2.m4">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMApp>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">g</XMTok>
                  </XMApp>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S2.SS4.p2.m4.1"/>
                      <XMRef idref="S2.SS4.p2.m4.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S2.SS4.p2.m4.1">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">x</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMApp xml:id="S2.SS4.p2.m4.2">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">u</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>.
This model can then be used to generate synthetic trajectories <Math mode="inline" tex="\tau" text="tau" xml:id="S2.SS4.p2.m5">
            <XMath>
              <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
            </XMath>
          </Math>.</p>
      </para>
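      <para>
        <p>As an illustration of the deterministic variant, the following Python sketch fits a forward model by least squares and rolls it forward to produce a synthetic trajectory. Everything in it is an assumption made for the example: the state and action are scalars, the model is linear without an intercept, and the function names and sample values are hypothetical. Any regression algorithm could take the place of the least-squares fit.</p>
        <verbatim font="typewriter">
```python
from typing import Callable, List, Sequence, Tuple

Transition = Tuple[float, float, float]  # (x_k, u_k, x_{k+1})
ForwardModel = Callable[[float, float], float]


def fit_linear_forward_model(samples: Sequence[Transition]) -> ForwardModel:
    """Least-squares fit of x_next ~ a * x + b * u from transition samples."""
    sxx = sum(x * x for x, _, _ in samples)
    sxu = sum(x * u for x, u, _ in samples)
    suu = sum(u * u for _, u, _ in samples)
    sxy = sum(x * y for x, _, y in samples)
    suy = sum(u * y for _, u, y in samples)
    # Solve the 2x2 normal equations [[sxx, sxu], [sxu, suu]] [a, b] = [sxy, suy].
    det = sxx * suu - sxu * sxu
    if not abs(det) > 1e-12:
        raise ValueError("samples do not vary x and u independently")
    a = (sxy * suu - suy * sxu) / det
    b = (suy * sxx - sxy * sxu) / det
    return lambda x, u: a * x + b * u


def synthetic_trajectory(
    g_hat: ForwardModel,
    policy: Callable[[float], float],
    x0: float,
    horizon: int,
) -> List[Tuple[float, float]]:
    """Roll the learned model forward under a policy; returns (x_k, u_k) pairs."""
    trajectory = []
    x = x0
    for _ in range(horizon):
        u = policy(x)
        trajectory.append((x, u))
        x = g_hat(x, u)  # the model replaces the real system here
    return trajectory


if __name__ == "__main__":
    # Hypothetical samples generated by x' = 0.9 * x + 0.5 * u.
    samples = [(1.0, 0.0, 0.9), (0.0, 1.0, 0.5), (1.0, 1.0, 1.4), (2.0, -1.0, 1.3)]
    g_hat = fit_linear_forward_model(samples)
    for x, u in synthetic_trajectory(g_hat, lambda x: -x, 1.0, 3):
        print(x, u)
```
        </verbatim>
        <p>No interaction with the real system is needed once the model is fitted, which is where the potential gain in sample efficiency comes from; any error in the fitted model is, however, compounded along the synthetic trajectory.</p>
      </para>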
      <para xml:id="S2.SS4.p3">
        <p>However, to use such synthetic samples in episodic policy search, the corresponding utility is also required. It can be obtained by learning, through regression, either
a model of the immediate utility function <Math mode="inline" tex="\hat{j}_{k}=\hat{j}(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})" text="(hat@(j)) _ k = hat@(j) * open-interval@(x _ k, u _ k)" xml:id="S2.SS4.p3.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">j</XMTok>
                  </XMApp>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">j</XMTok>
                  </XMApp>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S2.SS4.p3.m1.1"/>
                      <XMRef idref="S2.SS4.p3.m1.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S2.SS4.p3.m1.1">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">x</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMApp xml:id="S2.SS4.p3.m1.2">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">u</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> using <Math mode="inline" tex="(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k},j_{k})" text="vector@(x _ k, u _ k, j _ k)" xml:id="S2.SS4.p3.m2">
            <XMath>
              <XMDual>
                <XMApp>
                  <XMTok meaning="vector"/>
                  <XMRef idref="S2.SS4.p3.m2.1"/>
                  <XMRef idref="S2.SS4.p3.m2.2"/>
                  <XMRef idref="S2.SS4.p3.m2.3"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp xml:id="S2.SS4.p3.m2.1">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold" role="UNKNOWN">x</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S2.SS4.p3.m2.2">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold" role="UNKNOWN">u</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S2.SS4.p3.m2.3">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">j</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math> samples,
or a model of utility over an episode <Math mode="inline" tex="\hat{J}(\tau)" text="hat@(J) * tau" xml:id="S2.SS4.p3.m3">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">J</XMTok>
                </XMApp>
                <XMDual>
                  <XMRef idref="S2.SS4.p3.m3.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" name="tau" role="UNKNOWN" xml:id="S2.SS4.p3.m3.1">τ</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math>.
The former suffers from less variance than the latter, which is why model-based updates are used more in methods using step-samples than in those using episodic-samples.
</p>
      </para>
      <para xml:id="S2.SS4.p4">
        <p>In principle, using synthetic samples can drastically improve the sample efficiency of episodic policy search methods.
However, model-based episodic policy search suffers from inaccuracies in the models <Math mode="inline" tex="\hat{g}" text="hat@(g)" xml:id="S2.SS4.p4.m1">
            <XMath>
              <XMApp>
                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                <XMTok font="italic" role="UNKNOWN">g</XMTok>
              </XMApp>
            </XMath>
          </Math>, <Math mode="inline" tex="\hat{j}" text="hat@(j)" xml:id="S2.SS4.p4.m2">
            <XMath>
              <XMApp>
                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                <XMTok font="italic" role="UNKNOWN">j</XMTok>
              </XMApp>
            </XMath>
          </Math> and/or <Math mode="inline" tex="\hat{J}" text="hat@(J)" xml:id="S2.SS4.p4.m3">
            <XMath>
              <XMApp>
                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                <XMTok font="italic" role="UNKNOWN">J</XMTok>
              </XMApp>
            </XMath>
          </Math>,
the inaccuracies themselves resulting from incomplete exploration and the intrinsic variance of the estimation process.
As a consequence, these methods must be used with care.
This topic is well-covered in <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, so we do not expand on it further here.</p>
      </para>
      <para xml:id="S2.SS4.p5">
        <p><text font="bold">Message 1:</text>
Learning a forward model can drastically improve the sample efficiency of episodic policy search methods, but it must be used with care.
It works better with methods using step-samples than with methods using episodic-samples.</p>
      </para>
      <para xml:id="S2.SS4.p7">
        <p>A recent state-of-the-art presentation of model-based episodic policy search methods can be found in <cite class="ltx_citemacro_citep">(<bibref bibrefs="chatzilygeroudis2017black" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, where the <text font="smallcaps">black-DROPS</text> algorithm is shown to be more data-efficient than the standard <text font="smallcaps">pilco</text> algorithm <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2011pilco" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S2.SS4.p8">
        <p>In deep RL, learning a forward model through regression may not be necessary, as a replay buffer used as a “sample generator” can be seen as a specific kind of forward model <cite class="ltx_citemacro_citep">(<bibref bibrefs="vanseijen2015deeper" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. This insight is used in <text font="smallcaps">svg</text>, which offers a continuum between model-free and model-based methods <cite class="ltx_citemacro_citep">(<bibref bibrefs="heess2015learning" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.
More standard model-based acceleration is also used in <text font="smallcaps">naf</text> on top of its model-free mechanisms <cite class="ltx_citemacro_citep">(<bibref bibrefs="gu2016continuous" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:bbo" xml:id="S3">
    <tags>
      <tag>3</tag>
      <tag role="autoref">section 3</tag>
      <tag role="refnum">3</tag>
      <tag role="typerefnum">§3</tag>
    </tags>
    <title><tag close=" ">3</tag>Sample efficiency factors in BBO</title>
    <para xml:id="S3.p1">
      <p>The role of this section is to establish the taxonomy of methods depicted in Figure <ref labelref="LABEL:fig:orga_paper"/>.
To do so, we present two elementary processes that play an important role in many BBO methods: regression and optimization.
We do this independently from the episodic policy search context itself, thus we consider a general unknown function <Math mode="inline" tex="f" text="f" xml:id="S3.p1.m1">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">f</XMTok>
          </XMath>
        </Math>, often called the <text font="italic">latent</text> function
and some input samples <Math mode="inline" tex="\mathbf{\bm{\phi}}" text="phi" xml:id="S3.p1.m2">
          <XMath>
            <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
          </XMath>
        </Math> taken from a space <Math mode="inline" tex="\Phi" text="Phi" xml:id="S3.p1.m3">
          <XMath>
            <XMTok name="Phi" role="UNKNOWN">Φ</XMTok>
          </XMath>
        </Math>, without specifying the nature of <Math mode="inline" tex="f" text="f" xml:id="S3.p1.m4">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">f</XMTok>
          </XMath>
        </Math> or <Math mode="inline" tex="\mathbf{\bm{\phi}}" text="phi" xml:id="S3.p1.m5">
          <XMath>
            <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
          </XMath>
        </Math> at this point.
At this level of analysis, it is already possible to put forward some messages about sample efficiency in BBO.
The corresponding policy search algorithms are then presented in the next sections.</p>
    </para>
    <para xml:id="S3.p2">
      <p>The goal of optimization is to find the optimum of <Math mode="inline" tex="f" text="f" xml:id="S3.p2.m1">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">f</XMTok>
          </XMath>
        </Math> over <Math mode="inline" tex="\Phi" text="Phi" xml:id="S3.p2.m2">
          <XMath>
            <XMTok name="Phi" role="UNKNOWN">Φ</XMTok>
          </XMath>
        </Math>, that is to find</p>
    </para>
    <para xml:id="S3.p3">
      <equation xml:id="S3.Ex1">
        <Math mode="display" tex="\mathbf{\bm{\phi}}^{*}=\textmd{argmax}_{\mathbf{\bm{\phi}}\in\Phi}~{}f(\mathbf%&#10;{\bm{\phi}})\lx@note{footnote}{Throughout the paper, we consider maximization,%&#10; it would be $argmin$ in the case of minimization.}." text="phi ^ * = [argmax] _ (phi element-of Phi) * f * phi * [2footnote 22footnote 2Throughout the paper, we consider maximization, it would be ⁢argmin in the case of minimization.]" xml:id="S3.Ex1.m1">
          <XMath>
            <XMDual>
              <XMRef idref="S3.Ex1.m1.2"/>
              <XMWrap>
                <XMApp xml:id="S3.Ex1.m1.2">
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                    <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMApp rpadding="3.3pt">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMText>argmax</XMText>
                      <XMApp>
                        <XMTok fontsize="70%" meaning="element-of" name="in" role="RELOP">∈</XMTok>
                        <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                        <XMTok fontsize="70%" name="Phi" role="UNKNOWN">Φ</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMTok font="italic" role="UNKNOWN">f</XMTok>
                    <XMDual>
                      <XMRef idref="S3.Ex1.m1.1"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMTok font="bold italic" name="phi" role="UNKNOWN" xml:id="S3.Ex1.m1.1">ϕ</XMTok>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                    <XMText><note mark="2" role="footnote" xml:id="footnote2"><tags>
                          <tag>2</tag>
                          <tag role="autoref">footnote 2</tag>
                          <tag role="refnum">2</tag>
                          <tag role="typerefnum">footnote 2</tag>
                        </tags>Throughout the paper, we consider maximization, it would be <Math mode="inline" tex="argmin" text="a * r * g * m * i * n" xml:id="footnote2.m1">
                          <XMath>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMTok font="italic" role="UNKNOWN">a</XMTok>
                              <XMTok font="italic" role="UNKNOWN">r</XMTok>
                              <XMTok font="italic" role="UNKNOWN">g</XMTok>
                              <XMTok font="italic" role="UNKNOWN">m</XMTok>
                              <XMTok font="italic" role="UNKNOWN">i</XMTok>
                              <XMTok font="italic" role="UNKNOWN">n</XMTok>
                            </XMApp>
                          </XMath>
                        </Math> in the case of minimization.</note></XMText>
                  </XMApp>
                </XMApp>
                <XMTok role="PERIOD">.</XMTok>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math>
      </equation>
    </para>
    <para xml:id="S3.p4">
      <p>In BBO, some estimate <Math mode="inline" tex="\hat{f}(\mathbf{\bm{\phi}})" text="hat@(f) * phi" xml:id="S3.p4.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMApp>
                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                <XMTok font="italic" role="UNKNOWN">f</XMTok>
              </XMApp>
              <XMDual>
                <XMRef idref="S3.p4.m1.1"/>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN" xml:id="S3.p4.m1.1">ϕ</XMTok>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math> of <Math mode="inline" tex="f" text="f" xml:id="S3.p4.m2">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">f</XMTok>
          </XMath>
        </Math> at <Math mode="inline" tex="\mathbf{\bm{\phi}}" text="phi" xml:id="S3.p4.m3">
          <XMath>
            <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
          </XMath>
        </Math> must be obtained by sampling, that is, by choosing a value for <Math mode="inline" tex="\mathbf{\bm{\phi}}" text="phi" xml:id="S3.p4.m4">
          <XMath>
            <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
          </XMath>
        </Math> and asking the system to return <Math mode="inline" tex="\hat{f}(\mathbf{\bm{\phi}})" text="hat@(f) * phi" xml:id="S3.p4.m5">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMApp>
                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                <XMTok font="italic" role="UNKNOWN">f</XMTok>
              </XMApp>
              <XMDual>
                <XMRef idref="S3.p4.m5.1"/>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN" xml:id="S3.p4.m5.1">ϕ</XMTok>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math>.
A sample is the corresponding <Math mode="inline" tex="&lt;\mathbf{\bm{\phi}},\hat{f}(\mathbf{\bm{\phi}})&gt;" xml:id="S3.p4.m6">
          <XMath>
            <XMTok meaning="less-than" role="RELOP">&lt;</XMTok>
            <XMTok font="bold italic" name="phi" role="UNKNOWN" xml:id="S3.p4.m6.2">ϕ</XMTok>
            <XMTok role="PUNCT">,</XMTok>
            <XMApp>
              <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
              <XMTok font="italic" role="UNKNOWN">f</XMTok>
            </XMApp>
            <XMWrap>
              <XMTok role="OPEN" stretchy="false">(</XMTok>
              <XMTok font="bold italic" name="phi" role="UNKNOWN" xml:id="S3.p4.m6.1">ϕ</XMTok>
              <XMTok role="CLOSE" stretchy="false">)</XMTok>
            </XMWrap>
            <XMTok meaning="greater-than" role="RELOP">&gt;</XMTok>
          </XMath>
        </Math> pair.</p>
    </para>
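    <para>
      <p>The sampling process described above can be sketched as follows. This is only an illustrative sketch: the names <text font="typewriter">latent_f</text>, <text font="typewriter">query</text> and <text font="typewriter">draw_sample</text>, the quadratic latent function and the Gaussian noise model are all assumptions made for the example, not part of the original text.</p>
```python
import random


def latent_f(phi):
    # Hypothetical latent function; in BBO it is hidden from the optimizer.
    return -sum((x - 1.0) ** 2 for x in phi)


def query(phi, noise_std=0.1, rng=random):
    # Ask the system for an estimate f_hat(phi).
    # The additive Gaussian noise is an assumption of this sketch.
    return latent_f(phi) + rng.gauss(0.0, noise_std)


def draw_sample(phi, rng=random):
    # A sample is the pair (phi, f_hat(phi)).
    return (tuple(phi), query(phi, rng=rng))


rng = random.Random(0)
samples = [draw_sample([0.0, 0.0], rng), draw_sample([1.0, 1.0], rng)]
```
      <p>Each call to <text font="typewriter">draw_sample</text> costs one evaluation of the system, which is the quantity a sample efficient method tries to keep small.</p>
    </para>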
<!--  %**** sample˙efficiency.tex Line 800 **** -->    <para xml:id="S3.p5">
      <p>Since <Math mode="inline" tex="f" text="f" xml:id="S3.p5.m1">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">f</XMTok>
          </XMath>
        </Math> is not accessible in closed form, finding the optimum of <Math mode="inline" tex="f" text="f" xml:id="S3.p5.m2">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">f</XMTok>
          </XMath>
        </Math> cannot be performed analytically and the algorithms generally run iterations until some termination criterion is met.
In this context, being sample efficient in BBO means getting as close as possible to an optimum using as few samples as possible.
The key question is “In what area does the optimum lie?”, so that an optimum can be found quickly by sampling that area.</p>
    </para>
    <para xml:id="S3.p6">
      <p>Various optimization and regression processes can be used to answer this question.
Regression and optimization are often intertwined: on the one hand, regression is the minimization of a loss function and, on the other hand, the sample efficiency of optimization can be improved by learning a <text font="italic">surrogate</text> model with regression. To limit cross-references, we start with optimization without a utility model and optimization using an analytical model without regression, then we present regression, and finally we describe optimization with a surrogate utility model, which corresponds to the richest family and the most sample efficient methods.</p>
    </para>
    <subsection inlist="toc" labels="LABEL:sec:sumIV" xml:id="S3.SS1">
      <tags>
        <tag>3.1</tag>
        <tag role="autoref">subsection 3.1</tag>
        <tag role="refnum">3.1</tag>
        <tag role="typerefnum">§3.1</tag>
      </tags>
      <title><tag close=" ">3.1</tag>Exploration policies in BBO</title>
      <para xml:id="S3.SS1.p1">
        <p>To take a higher-level perspective on the above methods, we introduce the useful notion of <text font="italic">exploration policy</text>, which is called “upper-level policy” in <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S3.SS1.p2">
        <p>Interestingly, when the search space is the space of policy parameters <Math mode="inline" tex="\Theta" text="Theta" xml:id="S3.SS1.p2.m1">
            <XMath>
              <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
            </XMath>
          </Math>, the exploration policy is the policy search method itself.</p>
      </para>
      <para xml:id="S3.SS1.p3">
        <p>Under the same perspective, derivative-based policy search is performing <text font="italic">greedy</text> policy improvement from a model (thus greedy moves in the <Math mode="inline" tex="\Theta" text="Theta" xml:id="S3.SS1.p3.m1">
            <XMath>
              <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
            </XMath>
          </Math> space), which makes it more straightforward and potentially more sample efficient, but also more prone to premature convergence to local optima. This sensitivity can result in instability when greedy exploration is performed on a poor surrogate model.</p>
      </para>
      <para xml:id="S3.SS1.p4">
        <p>Finally, exploration in BO methods is a richer and more expensive Bayesian inference process which can simultaneously optimize the policy and explore regions of large uncertainty, giving rise to a specific type of active learning.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S3.SS2">
      <tags>
        <tag>3.2</tag>
        <tag role="autoref">subsection 3.2</tag>
        <tag role="refnum">3.2</tag>
        <tag role="typerefnum">§3.2</tag>
      </tags>
      <title><tag close=" ">3.2</tag>Optimization without a utility model</title>
      <para xml:id="S3.SS2.p1">
        <p>Black box optimization consists in guessing where the optimum lies.
In the absence of a model of utility, the guess can be purely random, as in <text font="italic">random search</text>.
The guess can also be based on the assumption that the utility function shows some regularity, which can be exploited through an <text font="italic">implicit model</text> given a memory of the previous samples.
This is the case of <text font="italic">population-based optimization</text>, where it is assumed that the optimum should be close to the currently found best sample. Search is then performed by drawing random samples around the current best samples.
Random search, population-based optimization and a third method called <text font="italic">finite differences</text> are presented in Section <ref labelref="LABEL:sec:mfo"/>.
<!--  %**** sample˙efficiency.tex Line 825 **** --></p>
      </para>
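      <para>
        <p>The difference between purely random guesses and guesses drawn around the current best sample can be sketched as follows. This is a simplified illustration rather than any specific published algorithm: the function names, the quadratic <text font="typewriter">utility</text>, the uniform search box and the Gaussian perturbation with standard deviation <text font="typewriter">sigma</text> are all assumptions of the example.</p>
```python
import random


def utility(phi):
    # Hypothetical utility, treated as a black box by both searches.
    return -sum((x - 0.5) ** 2 for x in phi)


def random_search(n_samples, dim, rng):
    # Purely random guesses: keep the best of n_samples uniform draws.
    best_phi, best_val = None, float("-inf")
    for _ in range(n_samples):
        phi = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
        val = utility(phi)
        if val > best_val:
            best_phi, best_val = phi, val
    return best_phi, best_val


def search_around_best(n_iterations, pop_size, dim, sigma, rng):
    # Implicit model: assume the optimum is close to the best sample so far,
    # so each new population is drawn around that sample.
    best_phi = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
    best_val = utility(best_phi)
    for _ in range(n_iterations):
        population = [
            [x + rng.gauss(0.0, sigma) for x in best_phi]
            for _ in range(pop_size)
        ]
        candidate = max(population, key=utility)
        candidate_val = utility(candidate)
        if candidate_val > best_val:
            best_phi, best_val = candidate, candidate_val
    return best_phi, best_val
```
        <p>Both functions use the same black-box evaluations. Only the second one reuses the memory of previous samples to decide where to sample next.</p>
      </para>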
    </subsection>
    <subsection inlist="toc" xml:id="S3.SS3">
      <tags>
        <tag>3.3</tag>
        <tag role="autoref">subsection 3.3</tag>
        <tag role="refnum">3.3</tag>
        <tag role="typerefnum">§3.3</tag>
      </tags>
      <title><tag close=" ">3.3</tag>Optimization with a model</title>
      <para xml:id="S3.SS3.p1">
        <p>Though the latent function is not analytically available in BBO, model-based optimization can be applied to an analytically available surrogate model (see Section <ref labelref="LABEL:sec:surrogate"/>).</p>
      </para>
      <para xml:id="S3.SS3.p2">
        <p>Below we distinguish two approaches: analytical optimization and derivative-based optimization.
In the former, the optimum is obtained through formal calculus, whereas in the latter it is found through numerical iterations.
In contrast to optimization without a utility model, derivative-based optimization methods include no random search component, thus they perform <text font="italic">greedy</text> optimization.
We distinguish two such methods: vanilla gradient optimization and natural gradient optimization.</p>
      </para>
      <subsubsection inlist="toc" labels="LABEL:mes:an_it" xml:id="S3.SS3.SSS1">
        <tags>
          <tag>3.3.1</tag>
          <tag role="autoref">subsubsection 3.3.1</tag>
          <tag role="refnum">3.3.1</tag>
          <tag role="typerefnum">§3.3.1</tag>
        </tags>
        <title><tag close=" ">3.3.1</tag>Analytical optimization</title>
        <para xml:id="S3.SS3.SSS1.p1">
          <p>Analytical optimization may be possible when the function to be optimized is available in closed form.
It generally relies on the fact that the derivative of a function is zero at its optima.<!--  %, and that the function is smooth around these optima. --> These optima can be found analytically for some functions (e.g. quadratic and Gaussian functions).</p>
        </para>
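        <para>
          <p>As a worked illustration of the quadratic case, assume a function defined by a symmetric positive definite matrix <text font="italic">A</text> and a vector <text font="bold">b</text>. Both symbols are introduced here for the example only. Setting the gradient to zero gives the optimum in a single step, without iterations:</p>
```latex
\[
f(\bm{\phi}) = -\tfrac{1}{2}\,\bm{\phi}^{\top} A\,\bm{\phi} + \mathbf{b}^{\top}\bm{\phi}
\]
\[
\nabla_{\bm{\phi}} f = -A\,\bm{\phi} + \mathbf{b} = \mathbf{0}
\quad\Longrightarrow\quad
\bm{\phi}^{*} = A^{-1}\mathbf{b}
\]
```
          <p>Under the positive definiteness assumption, the Hessian of this function is negative definite everywhere, so this stationary point is the unique maximum.</p>
        </para>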
        <para xml:id="S3.SS3.SSS1.p2">
          <p><text font="bold">Message 2:</text>
Derivative-based optimization is iterative whereas analytical optimization is not.</p>
        </para>
      </subsubsection>
      <subsubsection inlist="toc" xml:id="S3.SS3.SSS2">
        <tags>
          <tag>3.3.2</tag>
          <tag role="autoref">subsubsection 3.3.2</tag>
          <tag role="refnum">3.3.2</tag>
          <tag role="typerefnum">§3.3.2</tag>
        </tags>
        <title><tag close=" ">3.3.2</tag>Vanilla gradient optimization</title>
        <para xml:id="S3.SS3.SSS2.p1">
          <p>Vanilla gradient optimization follows the analytical derivative to look for a local optimum of <Math mode="inline" tex="f" text="f" xml:id="S3.SS3.SSS2.p1.m1">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">f</XMTok>
              </XMath>
            </Math>.<!--  %or a surrogate model of “f. --></p>
        </para>
<!--  %**** sample˙efficiency.tex Line 850 **** -->        <figure inlist="lof" labels="LABEL:fig:vgo" placement="hbtp" xml:id="S3.F2">
          <tags>
            <tag>Figure 2</tag>
            <tag role="autoref">Figure 2</tag>
            <tag role="refnum">2</tag>
            <tag role="typerefnum">Figure 2</tag>
          </tags>
          <graphics class="ltx_centering" graphic="gradient-above-svg" options="width=303.534pt" xml:id="S3.F2.g1"/>
          <toccaption class="ltx_centering"><tag close=" ">2</tag>In derivative-based optimization, search starts from an initial parameter vector <Math mode="inline" tex="\mathbf{\bm{\phi}}_{0}" text="phi _ 0" xml:id="S3.F2.m1">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                </XMApp>
              </XMath>
            </Math> and converges to a local optimum <Math mode="inline" tex="\mathbf{\bm{\phi}}^{*}" text="phi ^ *" xml:id="S3.F2.m2">
              <XMath>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                </XMApp>
              </XMath>
            </Math> by following the gradient of the function to be optimized.</toccaption>
          <caption class="ltx_centering"><tag close=": ">Figure 2</tag>In derivative-based optimization, search starts from an initial parameter vector <Math mode="inline" tex="\mathbf{\bm{\phi}}_{0}" text="phi _ 0" xml:id="S3.F2.m3">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                </XMApp>
              </XMath>
            </Math> and converges to a local optimum <Math mode="inline" tex="\mathbf{\bm{\phi}}^{*}" text="phi ^ *" xml:id="S3.F2.m4">
              <XMath>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                </XMApp>
              </XMath>
            </Math> by following the gradient of the function to be optimized.</caption>
        </figure>
        <para xml:id="S3.SS3.SSS2.p2">
          <p>It computes the gradient of <Math mode="inline" tex="f" text="f" xml:id="S3.SS3.SSS2.p2.m1">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">f</XMTok>
              </XMath>
            </Math> at <Math mode="inline" tex="\mathbf{\bm{\phi}}" text="phi" xml:id="S3.SS3.SSS2.p2.m2">
              <XMath>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
              </XMath>
            </Math> as the vector of its partial derivatives in all dimensions.
This vector points in the direction of steepest ascent of the function at this point (see Figure <ref labelref="LABEL:fig:vgo"/>), and the length of the update step taken along it is controlled by a parameter called <text font="italic">step size</text>.</p>
        </para>
        <para xml:id="S3.SS3.SSS2.p3">
          <p>We denote by <Math mode="inline" tex="\nabla_{\mathbf{\bm{\phi}}}f=\frac{\partial f(\mathbf{\bm{\phi}})}{\partial%&#10;\mathbf{\bm{\phi}}}" text="(nabla _ phi)@(f) = (partial-differential@(f) * phi) / partial-differential@(phi)" xml:id="S3.SS3.SSS2.p3.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                      <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                    </XMApp>
                    <XMTok font="italic" role="UNKNOWN">f</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok mathstyle="text" meaning="divide" role="FRACOP"/>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok fontsize="70%" meaning="partial-differential" name="partial" role="OPERATOR">∂</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">f</XMTok>
                      </XMApp>
                      <XMDual>
                        <XMRef idref="S3.SS3.SSS2.p3.m1.1"/>
                        <XMWrap>
                          <XMTok fontsize="70%" role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN" xml:id="S3.SS3.SSS2.p3.m1.1">ϕ</XMTok>
                          <XMTok fontsize="70%" role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="partial-differential" name="partial" role="OPERATOR">∂</XMTok>
                      <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                    </XMApp>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math> the gradient of <Math mode="inline" tex="f" text="f" xml:id="S3.SS3.SSS2.p3.m2">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">f</XMTok>
              </XMath>
            </Math> with respect to <Math mode="inline" tex="\mathbf{\bm{\phi}}" text="phi" xml:id="S3.SS3.SSS2.p3.m3">
              <XMath>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
              </XMath>
            </Math> and <Math mode="inline" tex="\nabla_{\mathbf{\bm{\phi}}}f|_{\mathbf{\bm{\phi}}=\mathbf{\bm{\phi}}_{i}}" text="evaluated-at@((nabla _ phi)@(f), phi = phi _ i)" xml:id="S3.SS3.SSS2.p3.m4">
              <XMath>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="evaluated-at"/>
                    <XMRef idref="S3.SS3.SSS2.p3.m4.2"/>
                    <XMRef idref="S3.SS3.SSS2.p3.m4.1"/>
                  </XMApp>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMWrap>
                      <XMApp xml:id="S3.SS3.SSS2.p3.m4.2">
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                          <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                        </XMApp>
                        <XMTok font="italic" role="UNKNOWN">f</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">|</XMTok>
                    </XMWrap>
                    <XMApp xml:id="S3.SS3.SSS2.p3.m4.1">
                      <XMTok fontsize="70%" meaning="equals" role="RELOP">=</XMTok>
                      <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                        <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                        <XMTok font="italic" fontsize="50%" role="UNKNOWN">i</XMTok>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                </XMDual>
              </XMath>
            </Math> the value of this gradient at <Math mode="inline" tex="\mathbf{\bm{\phi}}_{i}" text="phi _ i" xml:id="S3.SS3.SSS2.p3.m5">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                </XMApp>
              </XMath>
            </Math>.</p>
        </para>
        <para xml:id="S3.SS3.SSS2.p4">
          <p>Given the previous sample <Math mode="inline" tex="\mathbf{\bm{\phi}}_{i}" text="phi _ i" xml:id="S3.SS3.SSS2.p4.m1">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                </XMApp>
              </XMath>
            </Math>, vanilla gradient optimization chooses the next sample <Math mode="inline" tex="\mathbf{\bm{\phi}}_{i+1}" text="phi _ (i + 1)" xml:id="S3.SS3.SSS2.p4.m2">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math> according to Algorithm <ref labelref="LABEL:alg:vgo"/>, where <Math mode="inline" tex="\alpha_{i}" text="alpha _ i" xml:id="S3.SS3.SSS2.p4.m3">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" name="alpha" role="UNKNOWN">α</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                </XMApp>
              </XMath>
            </Math> is the step size at iteration <Math mode="inline" tex="i" text="i" xml:id="S3.SS3.SSS2.p4.m4">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">i</XMTok>
              </XMath>
            </Math>.</p>
        </para>
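        <para>
          <p>The update rule can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the function <text font="typewriter">vgo</text> here takes the gradient function and the step size as explicit arguments, and the concave test function behind <text font="typewriter">grad_f</text> is hypothetical. Since we consider maximization, the step is taken along the gradient (gradient ascent).</p>
```python
def vgo(phi_i, grad_f, alpha_i):
    # One iteration: phi_{i+1} = phi_i + alpha_i * grad f(phi_i).
    gradient = grad_f(phi_i)
    return [p + alpha_i * g for p, g in zip(phi_i, gradient)]


def grad_f(phi):
    # Gradient of the hypothetical function f(phi) = -sum((phi_j - 1)^2),
    # whose unique maximum is at phi = (1, ..., 1).
    return [-2.0 * (p - 1.0) for p in phi]


# Iterate from an arbitrary starting point with a constant step size.
phi = [0.0, 3.0]
for i in range(100):
    phi = vgo(phi, grad_f, alpha_i=0.1)
```
          <p>With this step size, each iteration shrinks the distance to the optimum by a constant factor. A step size that is too large would instead make the iterates oscillate or diverge.</p>
        </para>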
        <theorem class="ltx_theorem_algorithm" labels="LABEL:alg:vgo" xml:id="alg2">
          <tags>
            <tag>Algorithm 2</tag>
            <tag role="autoref">2</tag>
            <tag role="refnum">2</tag>
            <tag role="typerefnum">Algorithm 2</tag>
          </tags>
          <para xml:id="alg2.p1">
            <p>
<toccaption><tag close=" ">2</tag>vgo(<Math mode="inline" tex="\mathbf{\bm{\phi}}_{i},f" text="list@(phi _ i, f)" xml:id="alg2.p1.m1">
                  <XMath>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="list"/>
                        <XMRef idref="alg2.p1.m1.2"/>
                        <XMRef idref="alg2.p1.m1.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMApp xml:id="alg2.p1.m1.2">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                        </XMApp>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMTok font="italic" role="UNKNOWN" xml:id="alg2.p1.m1.1">f</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMath>
                </Math>): One iteration of vanilla gradient optimization </toccaption><caption><tag close=" ">Algorithm 2</tag>vgo(<Math mode="inline" tex="\mathbf{\bm{\phi}}_{i},f" text="list@(phi _ i, f)" xml:id="alg2.p1.m2">
                  <XMath>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="list"/>
                        <XMRef idref="alg2.p1.m2.2"/>
                        <XMRef idref="alg2.p1.m2.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMApp xml:id="alg2.p1.m2.2">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                        </XMApp>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMTok font="italic" role="UNKNOWN" xml:id="alg2.p1.m2.1">f</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMath>
                </Math>): One iteration of vanilla gradient optimization </caption>
<text font="bold">Require:</text> <Math mode="inline" tex="\mathbf{\bm{\phi}}_{i}" text="phi _ i" xml:id="alg2.p1.m3">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  </XMApp>
                </XMath>
              </Math>: current best guess, <Math mode="inline" tex="f" text="f" xml:id="alg2.p1.m4">
                <XMath>
                  <XMTok font="italic" role="UNKNOWN">f</XMTok>
                </XMath>
              </Math>: latent function<break/>
<Math mode="inline" tex="\mathbf{\bm{\phi}}_{i+1}=\mathbf{\bm{\phi}}_{i}+\alpha_{i}\,\nabla_{\mathbf{\bm{\phi}}}f|_{\mathbf{\bm{\phi}}=\mathbf{\bm{\phi}}_{i}}" text="phi _ (i + 1) = phi _ i + alpha _ i * evaluated-at@((nabla _ phi)@(f), phi = phi _ i)" xml:id="alg2.p1.m5">
                <XMath>
                  <XMApp>
                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                      <XMApp>
                        <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                        <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="plus" role="ADDOP">+</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="italic" name="alpha" role="UNKNOWN">α</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                        </XMApp>
                        <XMDual xml:id="alg2.p1.m5.3">
                          <XMApp>
                            <XMTok meaning="evaluated-at"/>
                            <XMRef idref="alg2.p1.m5.3.1"/>
                            <XMRef idref="alg2.p1.m5.1"/>
                          </XMApp>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMWrap>
                              <XMApp xml:id="alg2.p1.m5.3.1">
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                  <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                                  <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                                </XMApp>
                                <XMTok font="italic" role="UNKNOWN">f</XMTok>
                              </XMApp>
                              <XMTok role="CLOSE" stretchy="false">|</XMTok>
                            </XMWrap>
                            <XMApp xml:id="alg2.p1.m5.1">
                              <XMTok fontsize="70%" meaning="equals" role="RELOP">=</XMTok>
                              <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                                <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                                <XMTok font="italic" fontsize="50%" role="UNKNOWN">i</XMTok>
                              </XMApp>
                            </XMApp>
                          </XMApp>
                        </XMDual>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                </XMath>
              </Math>
<text font="bold">return</text> <Math mode="inline" tex="\mathbf{\bm{\phi}}_{i+1}" text="phi _ (i + 1)" xml:id="alg2.p1.m6">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                </XMath>
              </Math></p>
          </para>
        </theorem>
        <para class="ltx_noindent" xml:id="S3.SS3.SSS2.p5">
<!--  %When solving a minimization (resp. maximization) problem, $“alpha˙i$ should be negative (resp. positive). -->          <p>The value of <Math mode="inline" tex="\alpha_{i}" text="alpha _ i" xml:id="S3.SS3.SSS2.p5.m1">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" name="alpha" role="UNKNOWN">α</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                </XMApp>
              </XMath>
            </Math> is critical: if it is too small, the algorithm converges too slowly.
If it is too large, the algorithm may jump off the local hill or oscillate around the optimum.
In practice, <Math mode="inline" tex="\alpha_{i}" text="alpha _ i" xml:id="S3.SS3.SSS2.p5.m2">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" name="alpha" role="UNKNOWN">α</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                </XMApp>
              </XMath>
            </Math> should be large at the beginning of the optimization process and decrease as the current best guess approaches the optimum.</p>
        </para>
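        <para class="ltx_noindent">
          <p>As an illustration only, the update rule and the step-size advice above can be sketched in Python as follows. The decay schedule alpha_i = alpha0 / (1 + decay * i), the toy objective and all function names are assumptions made for this sketch; they are not taken from the paper.</p>
          <verbatim><![CDATA[
```python
def gradient_ascent(grad_f, phi0, alpha0=0.4, decay=0.1, n_iters=200):
    """Iterate phi_{i+1} = phi_i + alpha_i * grad_f(phi_i) with a decaying step size."""
    phi = list(phi0)
    for i in range(n_iters):
        # Hypothetical schedule: large steps early, smaller steps near the optimum.
        alpha_i = alpha0 / (1.0 + decay * i)
        g = grad_f(phi)
        phi = [p + alpha_i * g_k for p, g_k in zip(phi, g)]
    return phi


def grad_toy(phi):
    """Gradient of the toy concave objective -(phi_0 - 3)^2 - (phi_1 + 1)^2."""
    return [-2.0 * (phi[0] - 3.0), -2.0 * (phi[1] + 1.0)]


# The toy objective has its maximum at (3, -1).
phi_star = gradient_ascent(grad_toy, [0.0, 0.0])
```
]]></verbatim>
        </para>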
      </subsubsection>
      <subsubsection inlist="toc" labels="LABEL:mes:vg_ng LABEL:sec:ngo" xml:id="S3.SS3.SSS3">
        <tags>
          <tag>3.3.3</tag>
          <tag role="autoref">subsubsection 3.3.3</tag>
          <tag role="refnum">3.3.3</tag>
          <tag role="typerefnum">§3.3.3</tag>
        </tags>
        <title><tag close=" ">3.3.3</tag>Natural gradient optimization</title>
        <para xml:id="S3.SS3.SSS3.p1">
          <p>Vanilla gradient optimization is fine as long as the input samples <Math mode="inline" tex="\mathbf{\bm{\phi}}" text="phi" xml:id="S3.SS3.SSS3.p1.m1">
              <XMath>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
              </XMath>
            </Math> are defined in a Euclidean space <Math mode="inline" tex="\Phi" text="Phi" xml:id="S3.SS3.SSS3.p1.m2">
              <XMath>
                <XMTok name="Phi" role="UNKNOWN">Φ</XMTok>
              </XMath>
            </Math>.
When <Math mode="inline" tex="\mathbf{\bm{\phi}}" text="phi" xml:id="S3.SS3.SSS3.p1.m3">
              <XMath>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
              </XMath>
            </Math> is projected to <Math mode="inline" tex="\tilde{\mathbf{\bm{\phi}}}" text="tilde@(phi)" xml:id="S3.SS3.SSS3.p1.m4">
              <XMath>
                <XMApp>
                  <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                </XMApp>
              </XMath>
            </Math> in a non-Euclidean space <Math mode="inline" tex="\tilde{\Phi}" text="tilde@(Phi)" xml:id="S3.SS3.SSS3.p1.m5">
              <XMath>
                <XMApp>
                  <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                  <XMTok name="Phi" role="UNKNOWN">Φ</XMTok>
                </XMApp>
              </XMath>
            </Math>, the <text font="italic">natural gradient</text> is defined as</p>
        </para>
        <para class="ltx_noindent" xml:id="S3.SS3.SSS3.p2">
          <equation labels="LABEL:eq:ng" xml:id="S3.E3">
            <tags>
              <tag>(3)</tag>
              <tag role="autoref">Equation 3</tag>
              <tag role="refnum">3</tag>
            </tags>
            <Math mode="display" tex="\tilde{\nabla}_{\tilde{\mathbf{\bm{\phi}}}}\tilde{f}(\tilde{\mathbf{\bm{\phi}}%&#10;})=\mathbf{\bm{G}}^{-1}(\tilde{\mathbf{\bm{\phi}}})\nabla_{\tilde{\mathbf{\bm{%&#10;\phi}}}}\tilde{f}(\tilde{\mathbf{\bm{\phi}}})" text="(tilde@(nabla)) _ (tilde@(phi)) * tilde@(f) * tilde@(phi) = G ^ (- 1) * tilde@(phi) * (nabla _ (tilde@(phi)))@(tilde@(f)) * tilde@(phi)" xml:id="S3.E3.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMApp>
                        <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                        <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                      </XMApp>
                      <XMApp>
                        <XMTok fontsize="70%" name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                        <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMApp>
                      <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                      <XMTok font="italic" role="UNKNOWN">f</XMTok>
                    </XMApp>
                    <XMDual>
                      <XMRef idref="S3.E3.m1.1"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="S3.E3.m1.1">
                          <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                          <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="bold" role="UNKNOWN">G</XMTok>
                      <XMApp>
                        <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                        <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMDual>
                      <XMRef idref="S3.E3.m1.2"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="S3.E3.m1.2">
                          <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                          <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                    <XMApp>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                        <XMApp>
                          <XMTok fontsize="70%" name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                          <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                        </XMApp>
                      </XMApp>
                      <XMApp>
                        <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                        <XMTok font="italic" role="UNKNOWN">f</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMDual>
                      <XMRef idref="S3.E3.m1.3"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="S3.E3.m1.3">
                          <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                          <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>
          </equation>
          <p>where <Math mode="inline" tex="\mathbf{\bm{G}}" text="G" xml:id="S3.SS3.SSS3.p2.m1">
              <XMath>
                <XMTok font="bold" role="UNKNOWN">G</XMTok>
              </XMath>
            </Math> is a positive definite matrix characterizing the local curvature of <Math mode="inline" tex="\tilde{\Phi}" text="tilde@(Phi)" xml:id="S3.SS3.SSS3.p2.m2">
              <XMath>
                <XMApp>
                  <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                  <XMTok name="Phi" role="UNKNOWN">Φ</XMTok>
                </XMApp>
              </XMath>
            </Math>.
In the context of policy search methods, <Math mode="inline" tex="\mathbf{\bm{G}}" text="G" xml:id="S3.SS3.SSS3.p2.m3">
              <XMath>
                <XMTok font="bold" role="UNKNOWN">G</XMTok>
              </XMath>
            </Math> is known as the <text font="italic">Fisher Information Matrix</text> and denoted <Math mode="inline" tex="\mathbf{\bm{F}}" text="F" xml:id="S3.SS3.SSS3.p2.m4">
              <XMath>
                <XMTok font="bold" role="UNKNOWN">F</XMTok>
              </XMath>
            </Math>.
Natural gradient optimization has several beneficial properties that make it more data-efficient than vanilla gradient optimization. The drawback is that computing <Math mode="inline" tex="\mathbf{\bm{F}}^{-1}" text="F ^ (- 1)" xml:id="S3.SS3.SSS3.p2.m5">
              <XMath>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">F</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math> can be computationally demanding.
However, in the context of policy search, there are several ways to approximate the natural gradient without computing <Math mode="inline" tex="\mathbf{\bm{F}}^{-1}" text="F ^ (- 1)" xml:id="S3.SS3.SSS3.p2.m6">
              <XMath>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">F</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>.
All these aspects are covered in detail in <cite class="ltx_citemacro_citep">(<bibref bibrefs="grondman2012survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.</p>
        </para>
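        <para class="ltx_noindent">
          <p>To make Equation 3 concrete, the following Python sketch premultiplies a vanilla gradient by the inverse of a metric matrix by solving a linear system, so that the inverse is never formed explicitly. The matrix G, the gradient values and the helper names are hypothetical and chosen for illustration; in particular, G is not a Fisher Information Matrix estimated from a policy, and this direct solve is not one of the approximation schemes mentioned above.</p>
          <verbatim><![CDATA[
```python
def solve_linear(G, g):
    """Solve G x = g by Gaussian elimination with partial pivoting (small dense systems)."""
    n = len(g)
    # Augmented matrix [G | g], copied so that the inputs are left untouched.
    A = [[float(v) for v in G[r]] + [float(g[r])] for r in range(n)]
    for col in range(n):
        # Pick the row with the largest pivot to limit round-off error.
        pivot = max(range(col, n), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(col + 1, n):
            factor = A[r][col] / A[col][col]
            for c in range(col, n + 1):
                A[r][c] -= factor * A[col][c]
    # Back substitution on the upper-triangular system.
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        s = sum(A[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (A[r][n] - s) / A[r][r]
    return x


def natural_gradient(G, vanilla_grad):
    """Return G^{-1} times the vanilla gradient, as in Equation 3."""
    return solve_linear(G, vanilla_grad)


# Hypothetical positive definite metric and vanilla gradient.
G = [[2.0, 1.0], [1.0, 2.0]]
vanilla = [3.0, 3.0]
nat = natural_gradient(G, vanilla)
```
]]></verbatim>
          <p>When G is the identity matrix, the natural gradient reduces to the vanilla gradient, which matches the Euclidean case described at the start of this section.</p>
        </para>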
        <para xml:id="S3.SS3.SSS3.p3">
          <p><text font="bold">Message 3:</text>
Natural gradient optimization is more computationally intensive, but more data-efficient, than vanilla gradient optimization.</p>
        </para>
      </subsubsection>
    </subsection>
    <subsection inlist="toc" labels="LABEL:sec:regression" xml:id="S3.SS4">
      <tags>
        <tag>3.4</tag>
        <tag role="autoref">subsection 3.4</tag>
        <tag role="refnum">3.4</tag>
        <tag role="typerefnum">§3.4</tag>
      </tags>
      <title><tag close=" ">3.4</tag>Regression</title>
      <para xml:id="S3.SS4.p1">
        <p>Parametric regression is covered in detail in <cite class="ltx_citemacro_citep">(<bibref bibrefs="stulp15NN" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>; here we present only the concepts needed to keep this paper self-contained.
Given a model <Math mode="inline" tex="\hat{f}_{\mathbf{\bm{\omega}}}" text="(hat@(f)) _ omega" xml:id="S3.SS4.p1.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">f</XMTok>
                </XMApp>
                <XMTok font="bold italic" fontsize="70%" name="omega" role="UNKNOWN">ω</XMTok>
              </XMApp>
            </XMath>
          </Math> parameterized by <Math mode="inline" tex="\mathbf{\bm{\omega}}\in\mathbf{\bm{\Omega}}" text="omega element-of Omega" xml:id="S3.SS4.p1.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="element-of" name="in" role="RELOP">∈</XMTok>
                <XMTok font="bold italic" name="omega" role="UNKNOWN">ω</XMTok>
                <XMTok font="bold" name="Omega" role="UNKNOWN">Ω</XMTok>
              </XMApp>
            </XMath>
          </Math>, the goal of parametric regression is to find the value of <Math mode="inline" tex="\mathbf{\bm{\omega}}" text="omega" xml:id="S3.SS4.p1.m3">
            <XMath>
              <XMTok font="bold italic" name="omega" role="UNKNOWN">ω</XMTok>
            </XMath>
          </Math> that minimizes a <text font="italic">loss function</text>, that is</p>
      </para>
      <para xml:id="S3.SS4.p2">
        <equation xml:id="S3.Ex2">
          <Math mode="display" tex="\mathbf{\bm{\omega}}^{*}=\textmd{argmin}_{\mathbf{\bm{\omega}}\in\mathbf{\bm{%&#10;\Omega}}}~{}Loss(f,\hat{f}_{\mathbf{\bm{\omega}}})." text="omega ^ * = [argmin] _ (omega element-of Omega) * L * o * s * s * open-interval@(f, (hat@(f)) _ omega)" xml:id="S3.Ex2.m1">
            <XMath>
              <XMDual>
                <XMRef idref="S3.Ex2.m1.2"/>
                <XMWrap>
                  <XMApp xml:id="S3.Ex2.m1.2">
                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="bold italic" name="omega" role="UNKNOWN">ω</XMTok>
                      <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp rpadding="3.3pt">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMText>argmin</XMText>
                        <XMApp>
                          <XMTok fontsize="70%" meaning="element-of" name="in" role="RELOP">∈</XMTok>
                          <XMTok font="bold italic" fontsize="70%" name="omega" role="UNKNOWN">ω</XMTok>
                          <XMTok font="bold" fontsize="70%" name="Omega" role="UNKNOWN">Ω</XMTok>
                        </XMApp>
                      </XMApp>
                      <XMTok font="italic" role="UNKNOWN">L</XMTok>
                      <XMTok font="italic" role="UNKNOWN">o</XMTok>
                      <XMTok font="italic" role="UNKNOWN">s</XMTok>
                      <XMTok font="italic" role="UNKNOWN">s</XMTok>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S3.Ex2.m1.1"/>
                          <XMRef idref="S3.Ex2.m1.2.1"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="italic" role="UNKNOWN" xml:id="S3.Ex2.m1.1">f</XMTok>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMApp xml:id="S3.Ex2.m1.2.1">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMApp>
                              <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                              <XMTok font="italic" role="UNKNOWN">f</XMTok>
                            </XMApp>
                            <XMTok font="bold italic" fontsize="70%" name="omega" role="UNKNOWN">ω</XMTok>
                          </XMApp>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                  </XMApp>
                  <XMTok role="PERIOD">.</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>
        </equation>
      </para>
      <para xml:id="S3.SS4.p3">
        <p>The simplest case for regression is when the model <Math mode="inline" tex="\hat{f}_{\mathbf{\bm{\omega}}}" text="(hat@(f)) _ omega" xml:id="S3.SS4.p3.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">f</XMTok>
                </XMApp>
                <XMTok font="bold italic" fontsize="70%" name="omega" role="UNKNOWN">ω</XMTok>
              </XMApp>
            </XMath>
          </Math> is a linear function of the input.
When it is not, one often defines a set of fixed <text font="italic">feature</text> functions and the model is a linear combination of these features, where <Math mode="inline" tex="\mathbf{\bm{\omega}}" text="omega" xml:id="S3.SS4.p3.m2">
            <XMath>
              <XMTok font="bold italic" name="omega" role="UNKNOWN">ω</XMTok>
            </XMath>
          </Math> are the weights.
This defines a <text font="italic">linear architecture</text>.
In some cases, the feature functions also contain parameters which are tuned by the regression process. This is the case of deep neural networks, for instance.
This is also the case when the model is a single Gaussian function, as used in <text font="smallcaps">EDA</text>s (see Section <ref labelref="LABEL:sec:edas"/>), where the updated parameters are the mean and the covariance of this Gaussian.</p>
      </para>
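      <para>
        <p>As a minimal sketch of a linear architecture, the Python code below builds a model as a weighted sum of fixed feature functions and evaluates its error on a few training samples. The polynomial features, the target function, the choice of a summed squared error and all names are assumptions made for this illustration; they are not taken from the paper.</p>
        <verbatim><![CDATA[
```python
# Hypothetical fixed feature functions of a scalar input.
FEATURES = [lambda x: 1.0, lambda x: x, lambda x: x * x]


def model(omega, x):
    """Linear architecture: nonlinear in the input x, but linear in the weights omega."""
    return sum(w * feat(x) for w, feat in zip(omega, FEATURES))


def regression_error(omega, samples):
    """Sum of squared errors over (input, target) training samples."""
    return sum((target - model(omega, x)) ** 2 for x, target in samples)


def f(x):
    """Toy target function, chosen so that it lies in the span of FEATURES."""
    return 1.0 + 2.0 * x - 0.5 * x * x


samples = [(x, f(x)) for x in (-1.0, 0.0, 1.0, 2.0)]
err_exact = regression_error([1.0, 2.0, -0.5], samples)  # weights matching f
err_zero = regression_error([0.0, 0.0, 0.0], samples)    # all-zero weights
```
]]></verbatim>
        <p>Because the model is linear in the weights, minimizing a squared error of this form over the weights is a linear least-squares problem, whereas tuning parameters inside the feature functions generally is not.</p>
      </para>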
      <para xml:id="S3.SS4.p4">
        <p>Given a set of <Math mode="inline" tex="n_{ts}" text="n _ (t * s)" xml:id="S3.SS4.p4.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">n</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> training samples <Math mode="inline" tex="&lt;\mathbf{\bm{\phi}}_{j},f(\mathbf{\bm{\phi}}_{j})&gt;" xml:id="S3.SS4.p4.m2">
            <XMath>
              <XMTok meaning="less-than" role="RELOP">&lt;</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
              </XMApp>
              <XMTok role="PUNCT">,</XMTok>
              <XMTok font="italic" role="UNKNOWN">f</XMTok>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">(</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                </XMApp>
                <XMTok role="CLOSE" stretchy="false">)</XMTok>
              </XMWrap>
              <XMTok meaning="greater-than" role="RELOP">&gt;</XMTok>
            </XMath>
          </Math>, <Math mode="inline" tex="j\in\{1,n_{ts}\}" text="j element-of set@(1, n _ (t * s))" xml:id="S3.SS4.p4.m3">
            <XMath>
              <XMApp>
                <XMTok meaning="element-of" name="in" role="RELOP">∈</XMTok>
                <XMTok font="italic" role="UNKNOWN">j</XMTok>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="set"/>
                    <XMRef idref="S3.SS4.p4.m3.1"/>
                    <XMRef idref="S3.SS4.p4.m3.2"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">{</XMTok>
                    <XMTok meaning="1" role="NUMBER" xml:id="S3.SS4.p4.m3.1">1</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMApp xml:id="S3.SS4.p4.m3.2">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">n</XMTok>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">}</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math>, the regression error is generally defined as</p>
        <equation labels="LABEL:eq:loss" xml:id="S3.E4">
          <tags>
            <tag>(4)</tag>
            <tag role="autoref">Equation 4</tag>
            <tag role="refnum">4</tag>
          </tags>
          <Math mode="display" tex="\epsilon({\mathbf{\bm{\omega}}})=\sum_{j=1}^{n_{ts}}Loss(f(\mathbf{\bm{\phi}}_%&#10;{j}),\hat{f}_{\mathbf{\bm{\omega}}}(\mathbf{\bm{\phi}}_{j}))," text="epsilon * omega = ((sum _ (j = 1)) ^ (n _ (t * s)))@(L * o * s * s * open-interval@(f * phi _ j, (hat@(f)) _ omega * phi _ j))" xml:id="S3.E4.m1">
            <XMath>
              <XMDual>
                <XMRef idref="S3.E4.m1.2"/>
                <XMWrap>
                  <XMApp xml:id="S3.E4.m1.2">
                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                      <XMDual>
                        <XMRef idref="S3.E4.m1.1"/>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="bold italic" name="omega" role="UNKNOWN" xml:id="S3.E4.m1.1">ω</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                    <XMApp>
                      <XMApp scriptpos="mid">
                        <XMTok role="SUPERSCRIPTOP" scriptpos="mid1"/>
                        <XMApp scriptpos="mid">
                          <XMTok role="SUBSCRIPTOP" scriptpos="mid1"/>
                          <XMTok mathstyle="display" meaning="sum" role="SUMOP" scriptpos="mid">∑</XMTok>
                          <XMApp>
                            <XMTok fontsize="70%" meaning="equals" role="RELOP">=</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                            <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMTok font="italic" fontsize="50%" role="UNKNOWN">t</XMTok>
                            <XMTok font="italic" fontsize="50%" role="UNKNOWN">s</XMTok>
                          </XMApp>
                        </XMApp>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMTok font="italic" role="UNKNOWN">L</XMTok>
                        <XMTok font="italic" role="UNKNOWN">o</XMTok>
                        <XMTok font="italic" role="UNKNOWN">s</XMTok>
                        <XMTok font="italic" role="UNKNOWN">s</XMTok>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="open-interval"/>
                            <XMRef idref="S3.E4.m1.2.1"/>
                            <XMRef idref="S3.E4.m1.2.2"/>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false">(</XMTok>
                            <XMApp xml:id="S3.E4.m1.2.1">
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMTok font="italic" role="UNKNOWN">f</XMTok>
                              <XMDual>
                                <XMRef idref="S3.E4.m1.2.1.1"/>
                                <XMWrap>
                                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                                  <XMApp xml:id="S3.E4.m1.2.1.1">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                                  </XMApp>
                                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                </XMWrap>
                              </XMDual>
                            </XMApp>
                            <XMTok role="PUNCT">,</XMTok>
                            <XMApp xml:id="S3.E4.m1.2.2">
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                <XMApp>
                                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">f</XMTok>
                                </XMApp>
                                <XMTok font="bold italic" fontsize="70%" name="omega" role="UNKNOWN">ω</XMTok>
                              </XMApp>
                              <XMDual>
                                <XMRef idref="S3.E4.m1.2.2.1"/>
                                <XMWrap>
                                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                                  <XMApp xml:id="S3.E4.m1.2.2.1">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                                  </XMApp>
                                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                </XMWrap>
                              </XMDual>
                            </XMApp>
                            <XMTok role="CLOSE" stretchy="false">)</XMTok>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>
        </equation>
      </para>
      <para xml:id="S3.SS4.p5">
        <p>where <Math mode="inline" tex="Loss" text="L * o * s * s" xml:id="S3.SS4.p5.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">L</XMTok>
                <XMTok font="italic" role="UNKNOWN">o</XMTok>
                <XMTok font="italic" role="UNKNOWN">s</XMTok>
                <XMTok font="italic" role="UNKNOWN">s</XMTok>
              </XMApp>
            </XMath>
          </Math> can for instance be the squared error <Math mode="inline" tex="(f(\mathbf{\bm{\phi}}_{j})-\hat{f}_{\mathbf{\bm{\omega}}}(\mathbf{\bm{\phi}}_{%&#10;j}))^{2}" text="(f * phi _ j - (hat@(f)) _ omega * phi _ j) ^ 2" xml:id="S3.SS4.p5.m2">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMDual>
                  <XMRef idref="S3.SS4.p5.m2.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp xml:id="S3.SS4.p5.m2.1">
                      <XMTok meaning="minus" role="ADDOP">-</XMTok>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMTok font="italic" role="UNKNOWN">f</XMTok>
                        <XMDual>
                          <XMRef idref="S3.SS4.p5.m2.1.1"/>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false">(</XMTok>
                            <XMApp xml:id="S3.SS4.p5.m2.1.1">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                            </XMApp>
                            <XMTok role="CLOSE" stretchy="false">)</XMTok>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMApp>
                            <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                            <XMTok font="italic" role="UNKNOWN">f</XMTok>
                          </XMApp>
                          <XMTok font="bold italic" fontsize="70%" name="omega" role="UNKNOWN">ω</XMTok>
                        </XMApp>
                        <XMDual>
                          <XMRef idref="S3.SS4.p5.m2.1.2"/>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false">(</XMTok>
                            <XMApp xml:id="S3.SS4.p5.m2.1.2">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                            </XMApp>
                            <XMTok role="CLOSE" stretchy="false">)</XMTok>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
                <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
              </XMApp>
            </XMath>
          </Math>.</p>
      </para>
      <para xml:id="S3.SS4.p6">
        <p>The values of <Math mode="inline" tex="f(\mathbf{\bm{\phi}}_{j})" text="f * phi _ j" xml:id="S3.SS4.p6.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">f</XMTok>
                <XMDual>
                  <XMRef idref="S3.SS4.p6.m1.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp xml:id="S3.SS4.p6.m1.1">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> in (<ref labelref="LABEL:eq:loss"/>) being known constants, the regression error <Math mode="inline" tex="\epsilon({\mathbf{\bm{\omega}}})" text="epsilon * omega" xml:id="S3.SS4.p6.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                <XMDual>
                  <XMRef idref="S3.SS4.p6.m2.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold italic" name="omega" role="UNKNOWN" xml:id="S3.SS4.p6.m2.1">ω</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> is an analytical function of <Math mode="inline" tex="\mathbf{\bm{\omega}}" text="omega" xml:id="S3.SS4.p6.m3">
            <XMath>
              <XMTok font="bold italic" name="omega" role="UNKNOWN">ω</XMTok>
            </XMath>
          </Math>.
In that context, minimizing the regression error can be performed analytically or through gradient descent, giving rise to batch and incremental regression respectively.</p>
      </para>
      <subsubsection inlist="toc" xml:id="S3.SS4.SSS1">
        <tags>
          <tag>3.4.1</tag>
          <tag role="autoref">subsubsection 3.4.1</tag>
          <tag role="refnum">3.4.1</tag>
          <tag role="typerefnum">§3.4.1</tag>
        </tags>
        <title><tag close=" ">3.4.1</tag>Batch regression</title>
<!--  %**** sample˙efficiency.tex Line 925 **** -->        <para xml:id="S3.SS4.SSS1.p1">
          <p>Batch regression is the analytical minimization of the regression error. Although the optimum is available in closed form, it is a function of all the training samples, and evaluating it can be computationally intensive.
This is the case, for instance, in batch regression over a linear architecture when minimizing a squared error, where the analytical solution requires a matrix inversion (see <cite class="ltx_citemacro_citep">(<bibref bibrefs="stulp15NN" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite> for details). By contrast, this is not the case for a single Gaussian feature function, where estimating the mean and covariance from the training samples is straightforward.</p>
        </para>
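        <para>
          <p>As a minimal illustrative sketch of the linear, squared-error case described above, the Python code below fits a model of the form w0 + w1 * phi to scalar inputs by solving the 2x2 normal equations, which is the scalar analogue of the matrix inversion mentioned in the text. The function names, the scalar feature and the sample values are assumptions introduced for illustration only; they are not taken from the cited work.</p>
          <verbatim font="typewriter">
```python
def batch_linear_regression(samples):
    """Closed-form least squares for f_hat(phi) = w0 + w1 * phi.

    samples: list of (phi_j, f(phi_j)) pairs with scalar phi_j (assumed).
    The optimum depends on sums over all training samples at once.
    """
    n = len(samples)
    s_phi = sum(phi for phi, _ in samples)
    s_f = sum(f for _, f in samples)
    s_phi2 = sum(phi * phi for phi, _ in samples)
    s_phif = sum(phi * f for phi, f in samples)
    # Determinant of the 2x2 normal-equation matrix.
    det = n * s_phi2 - s_phi * s_phi
    if det == 0:
        raise ValueError("need at least two distinct inputs")
    w1 = (n * s_phif - s_phi * s_f) / det
    w0 = (s_f - w1 * s_phi) / n
    return w0, w1


def regression_error(w, samples):
    """Sum of squared errors of the model w = (w0, w1) over the samples."""
    w0, w1 = w
    return sum((f - (w0 + w1 * phi)) ** 2 for phi, f in samples)


# Hypothetical training samples drawn from f(phi) = 1 + 2 * phi.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w = batch_linear_regression(samples)
```
          </verbatim>
          <p>With a feature vector of higher dimension, the same computation would require solving a larger linear system, which is where the computational cost of batch regression comes from.</p>
        </para>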
      </subsubsection>
      <subsubsection inlist="toc" labels="LABEL:mes:deepreg LABEL:mes:incrregression" xml:id="S3.SS4.SSS2">
        <tags>
          <tag>3.4.2</tag>
          <tag role="autoref">subsubsection 3.4.2</tag>
          <tag role="refnum">3.4.2</tag>
          <tag role="typerefnum">§3.4.2</tag>
        </tags>
        <title><tag close=" ">3.4.2</tag>Incremental regression</title>
        <para xml:id="S3.SS4.SSS2.p1">
          <p>Incremental regression is the minimization of the regression error by updating <Math mode="inline" tex="\hat{f}_{\mathbf{\bm{\omega}}}" text="(hat@(f)) _ omega" xml:id="S3.SS4.SSS2.p1.m1">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">f</XMTok>
                  </XMApp>
                  <XMTok font="bold italic" fontsize="70%" name="omega" role="UNKNOWN">ω</XMTok>
                </XMApp>
              </XMath>
            </Math> over iterations.
It can be implemented by performing derivative-based optimization over <Math mode="inline" tex="\epsilon({\mathbf{\bm{\omega}}})" text="epsilon * omega" xml:id="S3.SS4.SSS2.p1.m2">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                  <XMDual>
                    <XMRef idref="S3.SS4.SSS2.p1.m2.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="bold italic" name="omega" role="UNKNOWN" xml:id="S3.SS4.SSS2.p1.m2.1">ω</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math>, for instance.
It can build on the work done at previous iterations to improve the <Math mode="inline" tex="\hat{f}_{\mathbf{\bm{\omega}}}" text="(hat@(f)) _ omega" xml:id="S3.SS4.SSS2.p1.m3">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">f</XMTok>
                  </XMApp>
                  <XMTok font="bold italic" fontsize="70%" name="omega" role="UNKNOWN">ω</XMTok>
                </XMApp>
              </XMath>
            </Math> model at a lower cost than batch regression.
It is used, for instance, when training deep neural networks with the “back-propagation of the gradient” algorithm.</p>
        </para>
        <para xml:id="S3.SS4.SSS2.p2">
          <p><text font="bold">Message 4:</text>
Incremental regression is generally less computationally intensive than batch regression.</p>
        </para>
        <para xml:id="S3.SS4.SSS2.p3">
          <p><text font="bold">Message 5:</text>
Back-propagation of the gradient, as used to train deep neural networks, is a form of incremental regression that performs derivative-based optimization over the network.</p>
        </para>
<!--  %%   using 
     %%   “begin–algorithm˝[hbt]
     %%     “caption–mi“˙it($“hatf˙–“vomega˙i˝$,$S$): Iterative model improvement“label–alg:miit˝˝
     %%     “begin–algorithmic˝
     %%       “REQUIRE $“hatf˙–“vomega˙i˝$: current model, $S$: samples $¡~“inpu˙j,“f(“inpu˙j)¿$
     %**** sample˙efficiency.tex Line 950 ****
     %%       “STATE $“epsilon(–“vomega˙i˝) = “sum˙–j=1˝^–n˙–ts˝˝ Loss(“f(“inpu˙j),“hatf˙–“vomega˙i˝(“inpu˙j))$
     %%       “STATE $“vomega˙–i+1˝ = dbo(“vomega˙i,“epsilon)$ // Algorithm~“ref–alg:vgo˝
     %%       “RETURN $“hatf˙–“vomega˙–i+1˝˝$
     %%     “end–algorithmic˝
     %%   “end–algorithm˝
     %% “noindent-->      </subsubsection>
      <subsubsection inlist="toc" labels="LABEL:mes:deep_reuse LABEL:mes:incrreg_reuse" xml:id="S3.SS4.SSS3">
        <tags>
          <tag>3.4.3</tag>
          <tag role="autoref">subsubsection 3.4.3</tag>
          <tag role="refnum">3.4.3</tag>
          <tag role="typerefnum">§3.4.3</tag>
        </tags>
        <title><tag close=" ">3.4.3</tag>Sample reuse in regression</title>
        <para xml:id="S3.SS4.SSS3.p1">
          <p>Batch and incremental regression can be called over several iterations in a loop and can use a different set of samples each time.
When doing so, they can use as input either samples which were never used before, or samples already used in a previous iteration.
The latter case defines <text font="italic">sample reuse</text>.</p>
        </para>
        <para xml:id="S3.SS4.SSS3.p2">
          <p>However, applying batch regression several times to the same samples yields an identical model each time.
By contrast, when doing the same with incremental regression, the model improves at each iteration until it eventually starts to overfit.
Thus, sample reuse makes more sense in combination with incremental regression.</p>
        </para>
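        <para>
          <p>The effect of reusing samples with incremental regression can be sketched as follows. The code assumes a linear model w0 + w1 * phi, a summed squared error, plain gradient descent and a hand-picked learning rate; all names and values are hypothetical and chosen only to make the example small. The same four samples are presented at every iteration, and the regression error on them keeps decreasing. Overfitting does not show up here because the assumed samples are noise-free and the model is linear.</p>
          <verbatim font="typewriter">
```python
def squared_error(w, samples):
    """Regression error: sum of squared errors of w = (w0, w1)."""
    w0, w1 = w
    return sum((f - (w0 + w1 * phi)) ** 2 for phi, f in samples)


def incremental_step(w, samples, lr=0.02):
    """One gradient-descent update of (w0, w1) on the summed squared error."""
    w0, w1 = w
    g0 = sum(-2.0 * (f - (w0 + w1 * phi)) for phi, f in samples)
    g1 = sum(-2.0 * (f - (w0 + w1 * phi)) * phi for phi, f in samples)
    return (w0 - lr * g0, w1 - lr * g1)


# Hypothetical samples from f(phi) = 1 + 2 * phi, reused at every iteration.
samples = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w = (0.0, 0.0)
errors = [squared_error(w, samples)]
for _ in range(50):
    w = incremental_step(w, samples)
    errors.append(squared_error(w, samples))
```
          </verbatim>
          <p>Each pass starts from the parameters left by the previous one, which is what makes repeated use of the same samples worthwhile in this setting.</p>
        </para>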
        <para xml:id="S3.SS4.SSS3.p3">
          <p><text font="bold">Message 6:</text>
In contrast with batch regression, incremental regression methods benefit from sample reuse.</p>
        </para>
        <para xml:id="S3.SS4.SSS3.p4">
          <p><text font="bold">Message 7:</text>
Deep RL benefits from sample reuse because it uses incremental regression methods.</p>
        </para>
<!--  %**** sample˙efficiency.tex Line 975 **** -->        <para xml:id="S3.SS4.SSS3.p5">
          <p>With incremental regression, there is no guarantee of obtaining <Math mode="inline" tex="\hat{f}_{\mathbf{\bm{\omega}}}(\mathbf{\bm{\phi}}_{i})=f(\mathbf{\bm{\phi}}_{i})" text="(hat@(f)) _ omega * phi _ i = f * phi _ i" xml:id="S3.SS4.SSS3.p5.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMApp>
                        <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                        <XMTok font="italic" role="UNKNOWN">f</XMTok>
                      </XMApp>
                      <XMTok font="bold italic" fontsize="70%" name="omega" role="UNKNOWN">ω</XMTok>
                    </XMApp>
                    <XMDual>
                      <XMRef idref="S3.SS4.SSS3.p5.m1.1"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="S3.SS4.SSS3.p5.m1.1">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">f</XMTok>
                    <XMDual>
                      <XMRef idref="S3.SS4.SSS3.p5.m1.2"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="S3.SS4.SSS3.p5.m1.2">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math> for all samples <Math mode="inline" tex="&lt;~{}\mathbf{\bm{\phi}}_{i},f(\mathbf{\bm{\phi}}_{i})&gt;" xml:id="S3.SS4.SSS3.p5.m2">
              <XMath>
                <XMTok meaning="less-than" role="RELOP" rpadding="3.3pt">&lt;</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok font="italic" role="UNKNOWN">f</XMTok>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
                <XMTok meaning="greater-than" role="RELOP">&gt;</XMTok>
              </XMath>
            </Math>.
The same holds for batch regression when some regularization is used.
However, the purpose of reusing a set of samples <Math mode="inline" tex="&lt;\mathbf{\bm{\phi}}_{i},f(\mathbf{\bm{\phi}}_{i})&gt;" xml:id="S3.SS4.SSS3.p5.m3">
              <XMath>
                <XMTok meaning="less-than" role="RELOP">&lt;</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok font="italic" role="UNKNOWN">f</XMTok>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
                <XMTok meaning="greater-than" role="RELOP">&gt;</XMTok>
              </XMath>
            </Math> is not so much to improve the accuracy of <Math mode="inline" tex="\hat{f}_{\mathbf{\bm{\omega}}}" text="(hat@(f)) _ omega" xml:id="S3.SS4.SSS3.p5.m4">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">f</XMTok>
                  </XMApp>
                  <XMTok font="bold italic" fontsize="70%" name="omega" role="UNKNOWN">ω</XMTok>
                </XMApp>
              </XMath>
            </Math> for the corresponding input, but to improve the accuracy of <Math mode="inline" tex="\hat{f}_{\mathbf{\bm{\omega}}}" text="(hat@(f)) _ omega" xml:id="S3.SS4.SSS3.p5.m5">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">f</XMTok>
                  </XMApp>
                  <XMTok font="bold italic" fontsize="70%" name="omega" role="UNKNOWN">ω</XMTok>
                </XMApp>
              </XMath>
            </Math> for <text font="italic">unseen</text> input <Math mode="inline" tex="\mathbf{\bm{\phi}}_{j}\neq\mathbf{\bm{\phi}}_{i}" text="phi _ j not-equals phi _ i" xml:id="S3.SS4.SSS3.p5.m6">
              <XMath>
                <XMApp>
                  <XMTok meaning="not-equals" name="neq" role="RELOP">≠</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>.</p>
        </para>
        <para xml:id="S3.SS4.SSS3.p6">
          <p><text color="#CC0000" font="bold" framecolor="#CC0000" framed="rectangle">Todo:</text><text color="#CC0000" font="italic">why do I need the above insight?</text></p>
        </para>
      </subsubsection>
    </subsection>
    <subsection inlist="toc" labels="LABEL:mes:al LABEL:mes:mb2 LABEL:sec:surrogate" xml:id="S3.SS5">
      <tags>
        <tag>3.5</tag>
        <tag role="autoref">subsection 3.5</tag>
        <tag role="refnum">3.5</tag>
        <tag role="typerefnum">§3.5</tag>
      </tags>
      <title><tag close=" ">3.5</tag>Optimization with a surrogate utility model</title>
      <para xml:id="S3.SS5.p1">
        <p>In BBO, the analytical form of the utility function over <Math mode="inline" tex="\Theta" text="Theta" xml:id="S3.SS5.p1.m1">
            <XMath>
              <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
            </XMath>
          </Math> is unknown.
Compared with optimization without a utility model, sample efficiency can be improved by combining two processes:
1) learning a <text font="italic">surrogate</text> model of the utility function and
2) finding the optimum of this surrogate model by performing analytical or derivative-based optimization on it.</p>
      </para>
      <para xml:id="S3.SS5.p2">
        <p>The key feature of a surrogate model is that it provides an estimate of the utility of an unseen sample.
However, the purpose of getting this estimated utility “for free” is not to eliminate sampling.
Rather, it helps determine where to sample next, by looking for an optimum <Math mode="inline" tex="\hat{{\mathbf{\bm{\theta}}}}^{*}" text="(hat@(theta)) ^ *" xml:id="S3.SS5.p2.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
                </XMApp>
                <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
              </XMApp>
            </XMath>
          </Math> over <Math mode="inline" tex="\hat{J}({\mathbf{\bm{\theta}}})" text="hat@(J) * theta" xml:id="S3.SS5.p2.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">J</XMTok>
                </XMApp>
                <XMDual>
                  <XMRef idref="S3.SS5.p2.m2.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S3.SS5.p2.m2.1">θ</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> without the need for any additional sample.
In addition, using <Math mode="inline" tex="\hat{{\mathbf{\bm{\theta}}}}^{*}" text="(hat@(theta)) ^ *" xml:id="S3.SS5.p2.m3">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
                </XMApp>
                <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
              </XMApp>
            </XMath>
          </Math> as a new sample for model learning improves the model preferentially in the region where the optimum may lie.</p>
      </para>
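      <para>
        <p>The interplay between learning a surrogate and choosing where to sample next can be sketched as follows. This is only an illustration under strong assumptions: a scalar parameter theta, a hypothetical quadratic black-box utility, and a quadratic surrogate refitted by least squares at each iteration. None of the function names or values below come from a specific BBO method.</p>
        <verbatim font="typewriter">
```python
def utility(theta):
    """Hypothetical black-box utility J(theta); the optimizer only queries it."""
    return 3.0 - (theta - 2.0) ** 2


def solve3(a, b):
    """Solve the 3x3 linear system a x = b by Gauss-Jordan elimination."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(3):
        # Partial pivoting: bring the largest remaining entry to the diagonal.
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][3] / m[i][i] for i in range(3)]


def fit_surrogate(samples):
    """Least-squares fit of J_hat(theta) = c0 + c1 * theta + c2 * theta ** 2."""
    feats = [[1.0, t, t * t] for t, _ in samples]
    gram = [[sum(f[i] * f[k] for f in feats) for k in range(3)]
            for i in range(3)]
    rhs = [sum(f[i] * j for f, (_, j) in zip(feats, samples))
           for i in range(3)]
    return solve3(gram, rhs)


def surrogate_optimum(coeffs):
    """Analytical maximizer of the quadratic surrogate (assumes c2 is negative)."""
    _, c1, c2 = coeffs
    return -c1 / (2.0 * c2)


# Refit the surrogate, then query the true utility at the surrogate optimum.
samples = [(t, utility(t)) for t in (-1.0, 0.0, 1.0)]
for _ in range(3):
    theta_hat = surrogate_optimum(fit_surrogate(samples))
    samples.append((theta_hat, utility(theta_hat)))
```
        </verbatim>
        <p>Finding the surrogate optimum costs no additional sample; the only new call to the utility function is the evaluation at that estimated optimum, which is then added to the training set of the surrogate.</p>
      </para>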
      <para xml:id="S3.SS5.p3">
        <p><text font="bold">Message 8:</text>
Using a surrogate model in the context of BBO provides an estimate of the value of a sample without having to generate that sample.</p>
      </para>
      <para xml:id="S3.SS5.p4">
        <p>Moreover, choosing where to sample next is the basis of <text font="italic">active learning</text>.</p>
      </para>
      <para xml:id="S3.SS5.p5">
        <p><text font="bold">Message 9:</text>
Optimization using a surrogate model improves sample efficiency by implementing a basic form of active learning.</p>
      </para>
      <subsubsection inlist="toc" labels="LABEL:sec:loops" xml:id="S3.SS5.SSS1">
        <tags>
          <tag>3.5.1</tag>
          <tag role="autoref">subsubsection 3.5.1</tag>
          <tag role="refnum">3.5.1</tag>
          <tag role="typerefnum">§3.5.1</tag>
        </tags>
        <title><tag close=" ">3.5.1</tag>Optimization loops</title>
        <para xml:id="S3.SS5.SSS1.p1">
          <p>There are two main ways to temporally organize regression and optimization into a loop: the iterative and the incremental loops.</p>
          <itemize xml:id="S3.I1">
            <item xml:id="S3.I1.i1">
              <tags>
                <tag>1.</tag>
                <tag role="autoref">item 1</tag>
                <tag role="refnum">1</tag>
                <tag role="typerefnum">item 1</tag>
              </tags>
              <para xml:id="S3.I1.i1.p1">
                <p>In the <text font="italic">iterative loop</text>, a new surrogate model is computed at each iteration.
Optimization then determines where to sample next, new samples are generated, a new surrogate model is computed from these new samples, and so on.
This approach is more commonly combined with the analytical resolution of regression.
<!--  %Instances: “edas, “cmaes... --></p>
              </para>
            </item>
            <item xml:id="S3.I1.i2">
              <tags>
                <tag>2.</tag>
                <tag role="autoref">item 2</tag>
                <tag role="refnum">2</tag>
                <tag role="typerefnum">item 2</tag>
              </tags>
              <para xml:id="S3.I1.i2.p1">
                <p>The <text font="italic">incremental loop</text> is similar but, at each iteration, the surrogate model is incrementally updated rather than recomputed.
<!--  %This approach is necessarily combined with the incremental resolution of regression. 
     %Instances: gradient backprop, stochastic gradient descent--></p>
              </para>
            </item>
          </itemize>
        </para>
        <para xml:id="S3.SS5.SSS1.p2">
          <p>The <text font="italic">sequential loop</text> is a particular case of the <text font="italic">iterative loop</text> in which a single iteration is performed: a first model is learned, generally from many samples in a <text font="italic">batch</text> way, the optimum is determined based on this model, and the process stops.
One may also reuse samples in the <text font="italic">iterative loop</text> but, as stated in Message <ref labelref="LABEL:mes:incrreg_reuse"/>, this brings no benefit and is not computationally efficient.</p>
        </para>
        <para xml:id="S3.SS5.SSS1.p3">
          <p>In the incremental loop, a single new sample is frequently used at each iteration.
In that case, there is no sample reuse.
The mini-batch approach often used in <text font="italic">deep learning</text> methods is a counterexample: it is only when using such mini-batches that samples are reused.
<!--  %**** sample˙efficiency.tex Line 1025 **** --></p>
        </para>
        <para xml:id="S3.SS5.SSS1.p4">
          <p><text color="#CC0000" font="bold" framecolor="#CC0000" framed="rectangle">Todo:</text><text color="#CC0000" font="italic">continue with model in <Math mode="inline" tex="\mathbf{\bm{\theta}}" text="theta" xml:id="S3.SS5.SSS1.p4.m1">
                <XMath>
                  <XMTok font="bold" name="theta" role="UNKNOWN">θ</XMTok>
                </XMath>
              </Math> versus in <Math mode="inline" tex="(\mathbf{\bm{x}},\mathbf{\bm{u}})" text="open-interval@(x, u)" xml:id="S3.SS5.SSS1.p4.m2">
                <XMath>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S3.SS5.SSS1.p4.m2.1"/>
                      <XMRef idref="S3.SS5.SSS1.p4.m2.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok font="upright" role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="bold upright" role="UNKNOWN" xml:id="S3.SS5.SSS1.p4.m2.1">x</XMTok>
                      <XMTok font="upright" role="PUNCT">,</XMTok>
                      <XMTok font="bold upright" role="UNKNOWN" xml:id="S3.SS5.SSS1.p4.m2.2">u</XMTok>
                      <XMTok font="upright" role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMath>
              </Math>.</text></p>
        </para>
        <para xml:id="S3.SS5.SSS1.p5">
          <p><text color="#CC0000" font="bold" framecolor="#CC0000" framed="rectangle">Todo:</text><text color="#CC0000" font="italic">Explain Table <ref labelref="LABEL:tab:methods"/>.</text></p>
        </para>
        <table inlist="lot" labels="LABEL:tab:methods" placement="hbtp" xml:id="S3.T1">
          <tags>
            <tag>Table 1</tag>
            <tag role="autoref">Table 1</tag>
            <tag role="refnum">1</tag>
            <tag role="typerefnum">Table 1</tag>
          </tags>
          <toccaption class="ltx_centering"><tag close=" ">1</tag>Classification of episodic policy search methods</toccaption>
          <caption class="ltx_centering"><tag close=": ">Table 1</tag>Classification of episodic policy search methods</caption>
          <tabular class="ltx_centering ltx_guessed_headers" vattach="middle">
            <thead>
              <tr>
                <td align="center" border="l r t" thead="column row">Algorithm</td>
                <td align="center" border="r t" thead="column row">Model space</td>
                <td align="center" border="r t" thead="column">Model improvement</td>
                <td align="center" border="r t" thead="column">Policy improvement</td>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td align="center" border="l r t" thead="row"><text font="smallcaps">EDA</text>s</td>
                <td align="center" border="r t" thead="row"><Math mode="inline" tex="\mathbf{\bm{\theta}}" text="theta" xml:id="S3.T1.m1">
                    <XMath>
                      <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">local search</td>
                <td align="center" border="r t">Analytical</td>
              </tr>
              <tr>
                <td align="center" border="l r" thead="row">Policy-gradient</td>
                <td align="center" border="r" thead="row"><Math mode="inline" tex="\mathbf{\bm{\theta}}" text="theta" xml:id="S3.T1.m2">
                    <XMath>
                      <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
                    </XMath>
                  </Math> or <Math mode="inline" tex="(\mathbf{\bm{x}},\mathbf{\bm{u}})" text="open-interval@(x, u)" xml:id="S3.T1.m3">
                    <XMath>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S3.T1.m3.1"/>
                          <XMRef idref="S3.T1.m3.2"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="bold" role="UNKNOWN" xml:id="S3.T1.m3.1">x</XMTok>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMTok font="bold" role="UNKNOWN" xml:id="S3.T1.m3.2">u</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMath>
                  </Math></td>
                <td align="center" border="r">iterative recomputation</td>
                <td align="center" border="r">Incremental, gradient-based</td>
              </tr>
              <tr>
                <td align="center" border="l r" thead="row">Critic-only</td>
                <td align="center" border="r" thead="row"><Math mode="inline" tex="(\mathbf{\bm{x}},\mathbf{\bm{u}})" text="open-interval@(x, u)" xml:id="S3.T1.m4">
                    <XMath>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S3.T1.m4.1"/>
                          <XMRef idref="S3.T1.m4.2"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="bold" role="UNKNOWN" xml:id="S3.T1.m4.1">x</XMTok>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMTok font="bold" role="UNKNOWN" xml:id="S3.T1.m4.2">u</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMath>
                  </Math></td>
                <td align="center" border="r">incremental</td>
                <td align="center" border="r">look for max</td>
              </tr>
              <tr>
                <td align="center" border="l r" thead="row">Bayes Opt.</td>
                <td align="center" border="r" thead="row"><Math mode="inline" tex="\mathbf{\bm{\theta}}" text="theta" xml:id="S3.T1.m5">
                    <XMath>
                      <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
                    </XMath>
                  </Math></td>
                <td align="center" border="r">incremental</td>
                <td align="center" border="r">incremental</td>
              </tr>
              <tr>
                <td align="center" border="b l r" thead="row">Actor-critic</td>
                <td align="center" border="b r" thead="row"><Math mode="inline" tex="(\mathbf{\bm{x}},vu)" text="open-interval@(x, v * u)" xml:id="S3.T1.m6">
                    <XMath>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S3.T1.m6.1"/>
                          <XMRef idref="S3.T1.m6.2"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="bold" role="UNKNOWN" xml:id="S3.T1.m6.1">x</XMTok>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMApp xml:id="S3.T1.m6.2">
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMTok font="italic" role="UNKNOWN">v</XMTok>
                            <XMTok font="italic" role="UNKNOWN">u</XMTok>
                          </XMApp>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMath>
                  </Math></td>
                <td align="center" border="b r">incremental</td>
                <td align="center" border="b r">incremental</td>
              </tr>
            </tbody>
          </tabular>
        </table>
        <para xml:id="S3.SS5.SSS1.p6">
          <p>In an actor-critic architecture, two derivative-based optimization processes are used, one for improving the model and one for finding the optimum policy guess on the resulting model.</p>
        </para>
      </subsubsection>
      <subsubsection inlist="toc" labels="LABEL:mes:db_df LABEL:mes:mb3" xml:id="S3.SS5.SSS2">
        <tags>
          <tag>3.5.2</tag>
          <tag role="autoref">subsubsection 3.5.2</tag>
          <tag role="refnum">3.5.2</tag>
          <tag role="typerefnum">§3.5.2</tag>
        </tags>
        <title><tag close=" ">3.5.2</tag>Summary</title>
        <para xml:id="S3.SS5.SSS2.p1">
          <p><text font="bold">Message 10:</text>
Optimization with a surrogate model can combine sample reuse brought by incremental regression and data efficiency brought by analytical optimization.
This is the case of deep RL.</p>
        </para>
        <para xml:id="S3.SS5.SSS2.p2">
          <p><text font="bold">Message 11:</text>
Being greedy, vanilla and natural gradient optimization are generally more sample efficient than optimization without a utility model, but they are also less robust to surrogate model inaccuracies.</p>
        </para>
<!--  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->      </subsubsection>
    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:mfo" xml:id="S4">
    <tags>
      <tag>4</tag>
      <tag role="autoref">section 4</tag>
      <tag role="refnum">4</tag>
      <tag role="typerefnum">§4</tag>
    </tags>
    <title><tag close=" ">4</tag>Policy search without a utility model</title>
    <para xml:id="S4.p2">
      <p>In the context of episodic policy search, the latent function to be optimized is the utility of the policy.
When using episodic samples, a sample is a <Math mode="inline" tex="&lt;{\mathbf{\bm{\theta}}},\hat{J}({\mathbf{\bm{\theta}}})&gt;" xml:id="S4.p2.m1">
          <XMath>
            <XMTok meaning="less-than" role="RELOP">&lt;</XMTok>
            <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S4.p2.m1.2">θ</XMTok>
            <XMTok role="PUNCT">,</XMTok>
            <XMApp>
              <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
              <XMTok font="italic" role="UNKNOWN">J</XMTok>
            </XMApp>
            <XMWrap>
              <XMTok role="OPEN" stretchy="false">(</XMTok>
              <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S4.p2.m1.1">θ</XMTok>
              <XMTok role="CLOSE" stretchy="false">)</XMTok>
            </XMWrap>
            <XMTok meaning="greater-than" role="RELOP">&gt;</XMTok>
          </XMath>
        </Math> pair, where <Math mode="inline" tex="\hat{J}({\mathbf{\bm{\theta}}})" text="hat@(J) * theta" xml:id="S4.p2.m2">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMApp>
                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                <XMTok font="italic" role="UNKNOWN">J</XMTok>
              </XMApp>
              <XMDual>
                <XMRef idref="S4.p2.m2.1"/>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S4.p2.m2.1">θ</XMTok>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math> is obtained by running a number of episodes with the system.
Episodes can be run from the same state in the MC approach or from various states in the more general case.</p>
    </para>
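    <para>
      <p>As an illustration of how such a pair might be produced, the following Python sketch averages the returns of a few episodes. The names <text font="typewriter">estimate_utility</text>, <text font="typewriter">run_episode</text> and <text font="typewriter">toy_episode</text> are hypothetical, and the toy episode is a noisy quadratic that merely stands in for the real system.</p>
      <verbatim font="typewriter">
```python
import random


def estimate_utility(theta, run_episode, n_episodes=10, rng=None):
    """Return the pair (theta, J_hat), where J_hat averages n_episodes returns."""
    if rng is None:
        rng = random.Random(0)
    returns = [run_episode(theta, rng) for _ in range(n_episodes)]
    return theta, sum(returns) / len(returns)


def toy_episode(theta, rng):
    # Hypothetical stand-in for the system: noisy return peaked at theta = (1, -2).
    target = (1.0, -2.0)
    cost = sum((t - g) ** 2 for t, g in zip(theta, target))
    return -cost + rng.gauss(0.0, 0.1)


# One episodic sample for the policy parameters theta = (0, 0).
sample = estimate_utility((0.0, 0.0), toy_episode, n_episodes=50)
```
</verbatim>
      <p>Here every episode starts from the same implicit initial condition, which corresponds to the MC case; drawing a start state inside <text font="typewriter">run_episode</text> would cover the more general case.</p>
    </para>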
    <para xml:id="S4.p3">
      <p>Random search, population-based and finite difference methods can be directly instantiated by taking <Math mode="inline" tex="\mathbf{\bm{\phi}}" text="phi" xml:id="S4.p3.m1">
          <XMath>
            <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
          </XMath>
        </Math> =<Math mode="inline" tex="\mathbf{\bm{\theta}}" text="theta" xml:id="S4.p3.m2">
          <XMath>
            <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
          </XMath>
        </Math> and <Math mode="inline" tex="f" text="f" xml:id="S4.p3.m3">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">f</XMTok>
          </XMath>
        </Math> =<Math mode="inline" tex="J" text="J" xml:id="S4.p3.m4">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">J</XMTok>
          </XMath>
        </Math>.</p>
    </para>
    <subsection inlist="toc" xml:id="S4.SS1">
      <tags>
        <tag>4.1</tag>
        <tag role="autoref">subsection 4.1</tag>
        <tag role="refnum">4.1</tag>
        <tag role="typerefnum">§4.1</tag>
      </tags>
      <title><tag close=" ">4.1</tag>Random search</title>
      <para xml:id="S4.SS1.p1">
        <p>The simplest BBO method randomly searches <Math mode="inline" tex="\Phi" text="Phi" xml:id="S4.SS1.p1.m1">
            <XMath>
              <XMTok name="Phi" role="UNKNOWN">Φ</XMTok>
            </XMath>
          </Math> until it stumbles on a good enough <Math mode="inline" tex="\hat{f}(\mathbf{\bm{\phi}})" text="hat@(f) * phi" xml:id="S4.SS1.p1.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">f</XMTok>
                </XMApp>
                <XMDual>
                  <XMRef idref="S4.SS1.p1.m2.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold italic" name="phi" role="UNKNOWN" xml:id="S4.SS1.p1.m2.1">ϕ</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math>.
<!--  %**** sample˙efficiency.tex Line 1075 **** -->Its distinguishing feature is that the previous value <Math mode="inline" tex="\hat{f}(\mathbf{\bm{\phi}})" text="hat@(f) * phi" xml:id="S4.SS1.p1.m3">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">f</XMTok>
                </XMApp>
                <XMDual>
                  <XMRef idref="S4.SS1.p1.m3.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold italic" name="phi" role="UNKNOWN" xml:id="S4.SS1.p1.m3.1">ϕ</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> has no impact on the choice of the next <Math mode="inline" tex="\mathbf{\bm{\phi}}" text="phi" xml:id="S4.SS1.p1.m4">
            <XMath>
              <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
            </XMath>
          </Math>.</p>
      </para>
      <para xml:id="S4.SS1.p2">
        <p>This family of methods is clearly not sample efficient, but it makes no assumption at all about the function to be optimized.
It is therefore worth considering only when the optimized function shows no <text font="italic">regularity</text> that can be exploited.
All other methods rely on the implicit assumption that the latent function is somewhat smooth around its optima.</p>
      </para>
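      <para>
        <p>A minimal Python sketch of random search is given here for illustration. The function names, the uniform sampling domain, the stopping threshold and the evaluation budget are assumptions made for the example; the point is only that each candidate is drawn independently of all previous evaluations.</p>
        <verbatim font="typewriter">
```python
import random


def random_search(f_hat, sample_phi, good_enough, max_evals=10000, rng=None):
    """Draw candidates independently until one evaluates to at least good_enough."""
    if rng is None:
        rng = random.Random(0)
    best_phi, best_value = None, float("-inf")
    n_evals = 0
    for n_evals in range(1, max_evals + 1):
        phi = sample_phi(rng)  # does not depend on any previous evaluation
        value = f_hat(phi)
        if value > best_value:
            best_phi, best_value = phi, value
        if value >= good_enough:
            break
    return best_phi, best_value, n_evals


def toy_f_hat(phi):
    # Hypothetical objective with its maximum (value 0) at (0.3, -0.5).
    return -((phi[0] - 0.3) ** 2 + (phi[1] + 0.5) ** 2)


def uniform_square(rng):
    # Assumed search domain: the square [-1, 1] x [-1, 1].
    return (rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0))


best_phi, best_value, n_evals = random_search(toy_f_hat, uniform_square, good_enough=-0.01)
```
</verbatim>
      </para>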
<!--  %Interestingly, multiple shooting generally uses random search for a good initial guess, turning the local search processes described in Section~“ref–sec:GFPS˝ and “ref–sec:DBPS˝ into a global search process “cite–bock1984multiple˝. -->    </subsection>
    <subsection inlist="toc" xml:id="S4.SS2">
      <tags>
        <tag>4.2</tag>
        <tag role="autoref">subsection 4.2</tag>
        <tag role="refnum">4.2</tag>
        <tag role="typerefnum">§4.2</tag>
      </tags>
      <title><tag close=" ">4.2</tag>Population-based optimization</title>
      <para xml:id="S4.SS2.p1">
        <p>Population-based BBO methods manage a limited population of individuals, and generate new individuals randomly in the vicinity of the previous <text font="italic">elite</text> individuals.
They rely on the assumption that the optimum is close to these individuals but, in contrast with Estimation of Distribution Algorithms (<text font="smallcaps">EDA</text>s), they do not exploit the smoothness of the latent function around the optimum through an explicit model (see Section <ref labelref="LABEL:sec:edas"/>). See <cite class="ltx_citemacro_citep">(<bibref bibrefs="back1996evolutionary" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> for further reading.
The parameter <Math mode="inline" tex="\mathbf{\bm{\phi}}" text="phi" xml:id="S4.SS2.p1.m1">
            <XMath>
              <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
            </XMath>
          </Math> corresponding to an individual is often called its <text font="italic">genotype</text> and <Math mode="inline" tex="\hat{f}(\mathbf{\bm{\phi}})" text="hat@(f) * phi" xml:id="S4.SS2.p1.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">f</XMTok>
                </XMApp>
                <XMDual>
                  <XMRef idref="S4.SS2.p1.m2.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold italic" name="phi" role="UNKNOWN" xml:id="S4.SS2.p1.m2.1">ϕ</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> is called its <text font="italic">fitness</text>.</p>
      </para>
      <para xml:id="S4.SS2.p2">
        <p>Since these methods use a random exploration component, they are not very data efficient. Furthermore, since the value of an individual is evaluated once and for all, they offer no opportunity for sample reuse.</p>
      </para>
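      <para>
        <p>The scheme described in this subsection can be sketched in Python as follows. This is one simple instance among many population-based methods, not a specific published algorithm: the Gaussian perturbation, the parameters <text font="typewriter">n_elite</text> and <text font="typewriter">sigma</text>, and the toy fitness function are all assumptions made for the example.</p>
        <verbatim font="typewriter">
```python
import random


def evolve(f_hat, population, n_generations=50, n_elite=5, sigma=0.1, rng=None):
    """Elite-based population search; returns the best (fitness, genotype) pair."""
    if rng is None:
        rng = random.Random(0)
    pop_size = len(population)
    # Each individual is evaluated exactly once; its fitness is stored with it.
    evaluated = [(f_hat(phi), phi) for phi in population]
    for _ in range(n_generations):
        evaluated.sort(key=lambda pair: pair[0], reverse=True)
        elites = evaluated[:n_elite]
        children = []
        for _k in range(pop_size - n_elite):
            parent = rng.choice(elites)[1]
            # New genotype: random perturbation in the vicinity of an elite.
            child = [gene + rng.gauss(0.0, sigma) for gene in parent]
            children.append((f_hat(child), child))
        evaluated = elites + children
    return max(evaluated, key=lambda pair: pair[0])


def toy_fitness(phi):
    # Hypothetical fitness with its maximum (value 0) at (0.5, 0.5).
    return -sum((gene - 0.5) ** 2 for gene in phi)


init_rng = random.Random(1)
initial = [[init_rng.uniform(-1.0, 1.0) for _ in range(2)] for _ in range(20)]
best_fitness, best_genotype = evolve(toy_fitness, initial)
```
</verbatim>
        <p>Because the elites are carried over with their stored fitness, the best fitness never decreases from one generation to the next, but no evaluation is ever reused to inform a model of the latent function.</p>
      </para>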
    </subsection>
    <subsection inlist="toc" labels="LABEL:mes:mb LABEL:sec:edas" xml:id="S4.SS3">
      <tags>
        <tag>4.3</tag>
        <tag role="autoref">subsection 4.3</tag>
        <tag role="refnum">4.3</tag>
        <tag role="typerefnum">§4.3</tag>
      </tags>
      <title><tag close=" ">4.3</tag>Estimation of Distribution Algorithms</title>
      <para xml:id="S4.SS3.p1">
        <p>The standard perspective on <text font="smallcaps">EDA</text>s is that they are a specific family of <text font="italic">evolutionary strategies</text> using a covariance matrix to control exploration <cite class="ltx_citemacro_citep">(<bibref bibrefs="larranaga2001estimation" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. Under this perspective, <text font="smallcaps">EDA</text>s are very similar to population-based optimization methods, where samples at iteration <Math mode="inline" tex="i" text="i" xml:id="S4.SS3.p1.m1">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">i</XMTok>
            </XMath>
          </Math> form a population and samples at iteration <Math mode="inline" tex="i+1" text="i + 1" xml:id="S4.SS3.p1.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="plus" role="ADDOP">+</XMTok>
                <XMTok font="italic" role="UNKNOWN">i</XMTok>
                <XMTok meaning="1" role="NUMBER">1</XMTok>
              </XMApp>
            </XMath>
          </Math> form the next generation of this population. However, <text font="smallcaps">EDA</text>s use an explicit Gaussian model of the latent function, whereas population-based methods use the population as an implicit model.</p>
      </para>
      <para xml:id="S4.SS3.p2">
        <p>Thus, <text font="smallcaps">EDA</text>s are in fact a specific case of BBO with a surrogate model, where the surrogate model is a Gaussian function parametrized by the previous optimum guess and a covariance matrix. <text font="smallcaps">EDA</text>s are iterative: samples of iteration <Math mode="inline" tex="i" text="i" xml:id="S4.SS3.p2.m1">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">i</XMTok>
            </XMath>
          </Math> are used to build the Gaussian model, then samples of iteration <Math mode="inline" tex="i+1" text="i + 1" xml:id="S4.SS3.p2.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="plus" role="ADDOP">+</XMTok>
                <XMTok font="italic" role="UNKNOWN">i</XMTok>
                <XMTok meaning="1" role="NUMBER">1</XMTok>
              </XMApp>
            </XMath>
          </Math> are drawn with a probability proportional to this Gaussian model value.</p>
      </para>
      <para xml:id="S4.SS3.p3">
        <p>A particularity of <text font="smallcaps">EDA</text>s is that the new optimum guess is not obtained by using derivative-based optimization, but by analytically finding the maximum of the Gaussian surrogate model.</p>
      </para>
<!--  %**** sample˙efficiency.tex Line 1100 **** -->      <figure inlist="lof" labels="LABEL:fig:rwa_grad" placement="hbtp" xml:id="S4.F3">
        <tags>
          <tag>Figure 3</tag>
          <tag role="autoref">Figure 3</tag>
          <tag role="refnum">3</tag>
          <tag role="typerefnum">Figure 3</tag>
        </tags>
        <graphics class="ltx_centering" graphic="rwa-svg" options="width=303.534pt" xml:id="S4.F3.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">3</tag>One iteration of <text font="smallcaps">EDA</text>s. Red dot: current guess <Math mode="inline" tex="\mathbf{\bm{\phi}}_{i}" text="phi _ i" xml:id="S4.F3.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
              </XMApp>
            </XMath>
          </Math>. Blue ellipsoid: current sampling domain. Full blue dots: samples with a good evaluation. Empty blue dots: samples with a poor evaluation. Green dot: new guess <Math mode="inline" tex="\mathbf{\bm{\phi}}_{i+1}" text="phi _ (i + 1)" xml:id="S4.F3.m2">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                <XMApp>
                  <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>. Green dotted ellipsoid: new sampling domain.</toccaption>
        <caption class="ltx_centering"><tag close=": ">Figure 3</tag>One iteration of <text font="smallcaps">EDA</text>s. Red dot: current guess <Math mode="inline" tex="\mathbf{\bm{\phi}}_{i}" text="phi _ i" xml:id="S4.F3.m3">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
              </XMApp>
            </XMath>
          </Math>. Blue ellipsoid: current sampling domain. Full blue dots: samples with a good evaluation. Empty blue dots: samples with a poor evaluation. Green dot: new guess <Math mode="inline" tex="\mathbf{\bm{\phi}}_{i+1}" text="phi _ (i + 1)" xml:id="S4.F3.m4">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                <XMApp>
                  <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>. Green dotted ellipsoid: new sampling domain.</caption>
      </figure>
      <para xml:id="S4.SS3.p4">
        <p>In Estimation of Distribution Algorithms (<text font="smallcaps">EDA</text>s), <Math mode="inline" tex="n_{s}" text="n _ s" xml:id="S4.SS3.p4.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">n</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
              </XMApp>
            </XMath>
          </Math> samples <Math mode="inline" tex="\mathbf{\bm{\phi}}_{j}" text="phi _ j" xml:id="S4.SS3.p4.m2">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
              </XMApp>
            </XMath>
          </Math> are drawn from a Gaussian distribution centered on the current guess <Math mode="inline" tex="\mathbf{\bm{\phi}}_{i}" text="phi _ i" xml:id="S4.SS3.p4.m3">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
              </XMApp>
            </XMath>
          </Math> and with covariance <Math mode="inline" tex="\mathbf{\bm{\Sigma}}_{i}" text="Sigma _ i" xml:id="S4.SS3.p4.m4">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold" name="Sigma" role="UNKNOWN">Σ</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
              </XMApp>
            </XMath>
          </Math>.
The evaluation of these samples determines the new optimum guess <Math mode="inline" tex="\mathbf{\bm{\phi}}_{i+1}" text="phi _ (i + 1)" xml:id="S4.SS3.p4.m5">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                <XMApp>
                  <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> and the new covariance matrix <Math mode="inline" tex="\mathbf{\bm{\Sigma}}_{i+1}" text="Sigma _ (i + 1)" xml:id="S4.SS3.p4.m6">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold" name="Sigma" role="UNKNOWN">Σ</XMTok>
                <XMApp>
                  <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> (Algorithm <ref labelref="LABEL:alg:eda"/>).
Over the iterations, the ellipsoid defined by <Math mode="inline" tex="\mathbf{\bm{\Sigma}}_{i+1}" text="Sigma _ (i + 1)" xml:id="S4.SS3.p4.m7">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold" name="Sigma" role="UNKNOWN">Σ</XMTok>
                <XMApp>
                  <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> is progressively adjusted to the top part of the hill corresponding to the local optimum <Math mode="inline" tex="\mathbf{\bm{\phi}}^{*}" text="phi ^ *" xml:id="S4.SS3.p4.m8">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
              </XMApp>
            </XMath>
          </Math>.</p>
      </para>
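      <para>
        <p>For readers who prefer code, one iteration of this scheme can be sketched in Python as follows. The sampling and evaluation steps follow the description above. The update step is an assumption: it uses a cross-entropy-method style rule in which the new guess and covariance are the mean and empirical covariance of the <text font="typewriter">n_elite</text> best samples, which is only one of several possible update rules. The parameter <text font="typewriter">n_elite</text>, the small diagonal <text font="typewriter">jitter</text> and the toy objective are hypothetical.</p>
        <verbatim font="typewriter">
```python
import math
import random


def cholesky(cov):
    """Lower-triangular factor low such that low times its transpose equals cov."""
    n = len(cov)
    low = [[0.0] * n for _ in range(n)]
    for r in range(n):
        for c in range(r + 1):
            acc = cov[r][c] - sum(low[r][k] * low[c][k] for k in range(c))
            low[r][c] = math.sqrt(acc) if r == c else acc / low[c][c]
    return low


def eda_iteration(phi_i, cov_i, n_s, evaluate, n_elite, rng, jitter=1e-9):
    """One iteration: sample around phi_i, evaluate, then refit mean and covariance."""
    dim = len(phi_i)
    low = cholesky(cov_i)
    samples = []
    for _ in range(n_s):
        # Draw phi_j from the Gaussian with mean phi_i and covariance cov_i.
        z = [rng.gauss(0.0, 1.0) for _k in range(dim)]
        phi_j = [phi_i[r] + sum(low[r][k] * z[k] for k in range(r + 1))
                 for r in range(dim)]
        samples.append((evaluate(phi_j), phi_j))
    # Assumed update rule: keep the n_elite best samples and fit a Gaussian to them.
    samples.sort(key=lambda pair: pair[0], reverse=True)
    elites = [phi for _value, phi in samples[:n_elite]]
    phi_next = [sum(e[d] for e in elites) / n_elite for d in range(dim)]
    cov_next = [[sum((e[r] - phi_next[r]) * (e[c] - phi_next[c]) for e in elites) / n_elite
                 + (jitter if r == c else 0.0)
                 for c in range(dim)]
                for r in range(dim)]
    return phi_next, cov_next


def toy_evaluate(phi):
    # Hypothetical objective with its maximum at (1, -2).
    return -((phi[0] - 1.0) ** 2 + (phi[1] + 2.0) ** 2)


rng = random.Random(0)
phi, cov = [0.0, 0.0], [[4.0, 0.0], [0.0, 4.0]]
for _ in range(30):
    phi, cov = eda_iteration(phi, cov, n_s=50, evaluate=toy_evaluate, n_elite=10, rng=rng)
```
</verbatim>
        <p>In this sketch <text font="typewriter">n_elite</text> must exceed the dimension of the parameter vector so that the refitted covariance has full rank; the jitter only guards against numerical collapse once the ellipsoid has shrunk around the optimum.</p>
      </para>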
      <theorem class="ltx_theorem_algorithm" labels="LABEL:alg:eda" xml:id="alg3">
        <tags>
          <tag>Algorithm 3</tag>
          <tag role="autoref">3</tag>
          <tag role="refnum">3</tag>
          <tag role="typerefnum">Algorithm 3</tag>
        </tags>
        <para xml:id="alg3.p1">
          <p>
<toccaption><tag close=" ">3</tag>eda(<Math mode="inline" tex="\mathbf{\bm{\phi}}_{i}" text="phi _ i" xml:id="alg3.p1.m1">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  </XMApp>
                </XMath>
              </Math>,<Math mode="inline" tex="\mathbf{\bm{\Sigma}}_{i}" text="Sigma _ i" xml:id="alg3.p1.m2">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold" name="Sigma" role="UNKNOWN">Σ</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  </XMApp>
                </XMath>
              </Math>,<Math mode="inline" tex="n_{s}" text="n _ s" xml:id="alg3.p1.m3">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">n</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                  </XMApp>
                </XMath>
              </Math>): one iteration of <text font="smallcaps">EDA</text>s </toccaption><caption><tag close=" ">Algorithm 3</tag>eda(<Math mode="inline" tex="\mathbf{\bm{\phi}}_{i}" text="phi _ i" xml:id="alg3.p1.m4">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  </XMApp>
                </XMath>
              </Math>,<Math mode="inline" tex="\mathbf{\bm{\Sigma}}_{i}" text="Sigma _ i" xml:id="alg3.p1.m5">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold" name="Sigma" role="UNKNOWN">Σ</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  </XMApp>
                </XMath>
              </Math>,<Math mode="inline" tex="n_{s}" text="n _ s" xml:id="alg3.p1.m6">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">n</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                  </XMApp>
                </XMath>
              </Math>): one iteration of <text font="smallcaps">EDA</text>s</caption>
<ERROR class="undefined">\REQUIRE</ERROR><Math mode="inline" tex="\mathbf{\bm{\phi}}_{i}" text="phi _ i" xml:id="alg3.p1.m7">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                </XMApp>
              </XMath>
            </Math>: current best guess, <Math mode="inline" tex="\mathbf{\bm{\Sigma}}_{i}" text="Sigma _ i" xml:id="alg3.p1.m8">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" name="Sigma" role="UNKNOWN">Σ</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                </XMApp>
              </XMath>
            </Math>: current covariance matrix, <Math mode="inline" tex="n_{s}" text="n _ s" xml:id="alg3.p1.m9">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">n</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                </XMApp>
              </XMath>
            </Math>: number of samples
<ERROR class="undefined">\STATE</ERROR>// generate samples and evaluate them
<ERROR class="undefined">\STATE</ERROR><Math mode="inline" tex="S=\emptyset" text="S = empty-set" xml:id="alg3.p1.m10">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMTok font="italic" role="UNKNOWN">S</XMTok>
                  <XMTok meaning="empty-set" name="emptyset" role="ID">∅</XMTok>
                </XMApp>
              </XMath>
            </Math>
<ERROR class="undefined">\FOR</ERROR><Math mode="inline" tex="j" text="j" xml:id="alg3.p1.m11">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">j</XMTok>
              </XMath>
            </Math> in <Math mode="inline" tex="\{1,\ldots,n_{s}\}" text="set@(1, ldots, n _ s)" xml:id="alg3.p1.m12">
              <XMath>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="set"/>
                    <XMRef idref="alg3.p1.m12.1"/>
                    <XMRef idref="alg3.p1.m12.2"/>
                    <XMRef idref="alg3.p1.m12.3"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">{</XMTok>
                    <XMTok meaning="1" role="NUMBER" xml:id="alg3.p1.m12.1">1</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok name="ldots" role="ID" xml:id="alg3.p1.m12.2">…</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMApp xml:id="alg3.p1.m12.3">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">n</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">}</XMTok>
                  </XMWrap>
                </XMDual>
              </XMath>
            </Math>
<ERROR class="undefined">\STATE</ERROR><Math mode="inline" tex="\mathbf{\bm{\phi}}_{j}=\mathcal{N}(\mathbf{\bm{\phi}}_{i},\mathbf{\bm{\Sigma}}%&#10;_{i})" text="phi _ j = N * open-interval@(phi _ i, Sigma _ i)" xml:id="alg3.p1.m13">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN">N</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="open-interval"/>
                        <XMRef idref="alg3.p1.m13.1"/>
                        <XMRef idref="alg3.p1.m13.2"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="alg3.p1.m13.1">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                        </XMApp>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMApp xml:id="alg3.p1.m13.2">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" name="Sigma" role="UNKNOWN">Σ</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>
<break/><Math mode="inline" tex="f(\mathbf{\bm{\phi}}_{j})=\mathit{evaluate}(\mathbf{\bm{\phi}}_{j})" text="f * phi _ j = evaluate * phi _ j" xml:id="alg3.p1.m14">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">f</XMTok>
                    <XMDual>
                      <XMRef idref="alg3.p1.m14.1"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="alg3.p1.m14.1">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">evaluate</XMTok>
                    <XMDual>
                      <XMRef idref="alg3.p1.m14.2"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="alg3.p1.m14.2">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>
<break/><Math mode="inline" tex="S\leftarrow S\cup\{\langle\mathbf{\bm{\phi}}_{j},f(\mathbf{\bm{\phi}}_{j})\rangle\}" text="S leftarrow S union set@(list@(phi _ j, f * phi _ j))" xml:id="alg3.p1.m15">
              <XMath>
                <XMApp>
                  <XMTok name="leftarrow" role="ARROW">←</XMTok>
                  <XMTok font="italic" role="UNKNOWN">S</XMTok>
                  <XMApp>
                    <XMTok meaning="union" name="cup" role="ADDOP">∪</XMTok>
                    <XMTok font="italic" role="UNKNOWN">S</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="set"/>
                        <XMRef idref="alg3.p1.m15.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">{</XMTok>
                        <XMDual xml:id="alg3.p1.m15.1">
                          <XMApp>
                            <XMTok meaning="list"/>
                            <XMRef idref="alg3.p1.m15.1.1"/>
                            <XMRef idref="alg3.p1.m15.1.2"/>
                          </XMApp>
                          <XMWrap>
                            <XMTok name="langle" role="OPEN" stretchy="false">⟨</XMTok>
                            <XMApp xml:id="alg3.p1.m15.1.1">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                            </XMApp>
                            <XMTok role="PUNCT">,</XMTok>
                            <XMApp xml:id="alg3.p1.m15.1.2">
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMTok font="italic" role="UNKNOWN">f</XMTok>
                              <XMDual>
                                <XMRef idref="alg3.p1.m15.1.2.1"/>
                                <XMWrap>
                                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                                  <XMApp xml:id="alg3.p1.m15.1.2.1">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                                  </XMApp>
                                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                </XMWrap>
                              </XMDual>
                            </XMApp>
                            <XMTok name="rangle" role="CLOSE" stretchy="false">⟩</XMTok>
                          </XMWrap>
                        </XMDual>
                        <XMTok role="CLOSE" stretchy="false">}</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>
<break/><text font="bold">end for</text><!-- compute the new guess and covariance --><break/><Math mode="inline" tex="\mathbf{\bm{\phi}}_{i+1}=\mathit{compute\_best\_guess}(S)" text="phi _ (i + 1) = compute_best_guess * S" xml:id="alg3.p1.m16">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">compute_best_guess</XMTok>
                    <XMDual>
                      <XMRef idref="alg3.p1.m16.1"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMTok font="italic" role="UNKNOWN" xml:id="alg3.p1.m16.1">S</XMTok>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>
<!--  %**** sample˙efficiency.tex Line 1125 **** --><break/><Math mode="inline" tex="\mathbf{\bm{\Sigma}}_{i+1}=\mathit{update}(S)" text="Sigma _ (i + 1) = update * S" xml:id="alg3.p1.m17">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold" name="Sigma" role="UNKNOWN">Σ</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">update</XMTok>
                    <XMDual>
                      <XMRef idref="alg3.p1.m17.1"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMTok font="italic" role="UNKNOWN" xml:id="alg3.p1.m17.1">S</XMTok>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>
<break/><text font="bold">return</text> <Math mode="inline" tex="\mathbf{\bm{\phi}}_{i+1}" text="phi _ (i + 1)" xml:id="alg3.p1.m18">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>, <Math mode="inline" tex="\mathbf{\bm{\Sigma}}_{i+1}" text="Sigma _ (i + 1)" xml:id="alg3.p1.m19">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" name="Sigma" role="UNKNOWN">Σ</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math></p>
        </para>
      </theorem>
      <para xml:id="S4.SS3.p5">
        <p>The <Math mode="inline" tex="\mathit{compute\_best\_guess}" text="compute_best_guess" xml:id="S4.SS3.p5.m1">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">compute_best_guess</XMTok>
            </XMath>
          </Math> and <Math mode="inline" tex="\mathit{update}" text="update" xml:id="S4.SS3.p5.m2">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">update</XMTok>
            </XMath>
          </Math> methods differ from one <text font="smallcaps">EDA</text> to another.</p>
      </para>
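      <para>
        <p>As an illustration, one iteration of Algorithm <ref labelref="LABEL:alg:eda"/> can be sketched in Python. This is a minimal sketch under stated assumptions, not the update rule of any specific method cited here: the two algorithm-dependent methods are filled in with a CEM-style choice (mean and empirical covariance of the best samples), and the names <text font="typewriter">eda_iteration</text> and <text font="typewriter">n_elite</text>, the small jitter term and the toy objective <text font="typewriter">sphere</text> are hypothetical.</p>
        <verbatim font="typewriter">
```python
import numpy as np


def eda_iteration(phi, sigma, evaluate, n_samples, n_elite, rng):
    """One EDA iteration; the two update rules are CEM-style assumptions."""
    # Draw n_samples candidates phi_j from N(phi_i, Sigma_i).
    candidates = rng.multivariate_normal(phi, sigma, size=n_samples)
    # Evaluate each candidate: f(phi_j) = evaluate(phi_j).
    scores = np.array([evaluate(c) for c in candidates])
    # Keep the n_elite candidates with the highest scores.
    elite = candidates[np.argsort(scores)[-n_elite:]]
    # compute_best_guess (assumed rule): mean of the elite samples.
    new_phi = elite.mean(axis=0)
    # update (assumed rule): empirical covariance of the elite samples,
    # plus a small jitter that keeps the matrix positive definite.
    centered = elite - new_phi
    new_sigma = centered.T @ centered / n_elite + 1e-8 * np.eye(phi.size)
    return new_phi, new_sigma


def sphere(phi):
    # Toy objective (assumption): maximal at the origin.
    return -float(np.sum(phi ** 2))


rng = np.random.default_rng(0)
phi, sigma = np.array([2.0, -1.5]), 4.0 * np.eye(2)
for _ in range(30):
    phi, sigma = eda_iteration(phi, sigma, sphere, n_samples=50, n_elite=10, rng=rng)
print(phi)  # current best guess after 30 iterations
```
        </verbatim>
        <p>In this sketch, switching to another EDA would only change the last two assignments; the sampling and evaluation steps stay the same.</p>
      </para>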
      <para xml:id="S4.SS3.p6">
        <p><text font="bold">Message 12:</text>
<text font="smallcaps">EDA</text>s do not reuse samples because their surrogate model regression mechanism is analytical.</p>
      </para>
      <para xml:id="S4.SS3.p7">
        <p>Sampling in <text font="smallcaps">EDA</text>s corresponds to local stochastic exploration, in which the exploration process is driven by the update of the covariance matrix.
In this process, the importance of spatial proximity between good samples is particularly obvious.
This exploration policy is called <text font="italic">uncorrelated</text> when only the diagonal of the covariance matrix is updated and <text font="italic">correlated</text> when the full covariance matrix is updated. The latter is more efficient in small parameter spaces, but it is computationally more demanding and potentially inaccurate in larger spaces, where more samples are required. See <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> for a discussion.</p>
      </para>
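      <para>
        <p>The difference between the two exploration policies can be made concrete with a short Python sketch. It assumes that the covariance is re-estimated from a batch of selected samples; the function names and the toy correlated distribution are hypothetical and only serve the illustration.</p>
        <verbatim font="typewriter">
```python
import numpy as np


def full_covariance(samples):
    # Correlated exploration: estimate all d * (d + 1) / 2 free entries.
    centered = samples - samples.mean(axis=0)
    return centered.T @ centered / len(samples)


def diagonal_covariance(samples):
    # Uncorrelated exploration: estimate only the d per-dimension variances.
    return np.diag(samples.var(axis=0))


rng = np.random.default_rng(0)
# Toy data (assumption): two strongly correlated parameters.
true_cov = np.array([[1.0, 0.8], [0.8, 1.0]])
samples = rng.multivariate_normal(np.zeros(2), true_cov, size=500)
print(full_covariance(samples))      # off-diagonal terms are retained
print(diagonal_covariance(samples))  # off-diagonal terms are exactly zero
```
        </verbatim>
        <p>The full estimate has a number of free entries that grows quadratically with the dimension of the parameter space, which is consistent with the larger sample requirement noted above.</p>
      </para>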
      <para xml:id="S4.SS3.p8">
        <p><text font="smallcaps">EDA</text>s can be instantiated as a policy search algorithm by iterating Algorithm <ref labelref="LABEL:alg:eda"/>, using <Math mode="inline" tex="{\mathbf{\bm{\theta}}}" text="theta" xml:id="S4.SS3.p8.m1">
            <XMath>
              <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
            </XMath>
          </Math> as input <Math mode="inline" tex="\mathbf{\bm{\phi}}" text="phi" xml:id="S4.SS3.p8.m2">
            <XMath>
              <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
            </XMath>
          </Math> and <Math mode="inline" tex="J" text="J" xml:id="S4.SS3.p8.m3">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">J</XMTok>
            </XMath>
          </Math> as latent function <Math mode="inline" tex="f" text="f" xml:id="S4.SS3.p8.m4">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">f</XMTok>
            </XMath>
          </Math>.
In that case, the <Math mode="inline" tex="\mathit{evaluate}" text="evaluate" xml:id="S4.SS3.p8.m5">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">evaluate</XMTok>
            </XMath>
          </Math> function of Algorithm <ref labelref="LABEL:alg:eda"/> returns the Monte Carlo (MC) return, computed either from a single initial state or over the state space.
The implications of this choice are discussed in Section <ref labelref="LABEL:sec:single"/>.</p>
      </para>
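      <para>
        <p>The two evaluation options can be sketched as follows. This is only an illustration under loud assumptions: the one-dimensional system, the linear policy, the quadratic cost, the horizon and the discount factor are all hypothetical, and "over the state space" is read here as averaging over uniformly sampled initial states.</p>
        <verbatim font="typewriter">
```python
import numpy as np


def rollout_return(theta, s0, horizon=50, gamma=0.95):
    # Toy deterministic system (assumption): next state is s + a,
    # with a linear policy a = theta * s and a quadratic cost.
    s, ret = s0, 0.0
    for t in range(horizon):
        a = theta * s
        ret -= gamma ** t * (s ** 2 + 0.1 * a ** 2)
        s = s + a
    return ret


def evaluate_single_state(theta, s0=1.0):
    # MC return of the policy from one fixed initial state.
    return rollout_return(theta, s0)


def evaluate_over_states(theta, rng, n_starts=20):
    # MC return averaged over initial states sampled from the state space.
    starts = rng.uniform(-1.0, 1.0, size=n_starts)
    return float(np.mean([rollout_return(theta, s0) for s0 in starts]))


rng = np.random.default_rng(0)
print(evaluate_single_state(-1.0), evaluate_single_state(0.0))
print(evaluate_over_states(-1.0, rng))
```
        </verbatim>
        <p>Either function can play the role of the evaluation step in the algorithm; the averaged variant costs <text font="typewriter">n_starts</text> rollouts per candidate instead of one.</p>
      </para>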
      <para xml:id="S4.SS3.p9">
        <p>Various instantiations of <text font="smallcaps">EDA</text>s, such as <text font="smallcaps">cem</text>, <text font="smallcaps">cma-es</text>, <text font="smallcaps">pi<Math mode="inline" tex="{}^{\mbox{\tiny BB}}" text="^[BB]" xml:id="S4.SS3.p9.m1">
              <XMath>
                <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                  <XMText><text fontsize="50%">BB</text></XMText>
                </XMApp>
              </XMath>
            </Math></text>, <text font="smallcaps">pi<Math mode="inline" tex="{}^{2}" text="^2" xml:id="S4.SS3.p9.m2">
              <XMath>
                <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                  <XMTok font="upright" fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                </XMApp>
              </XMath>
            </Math>-cma</text>, <text font="smallcaps">nes</text>, x<text font="smallcaps">nes</text>, are covered in <cite class="ltx_citemacro_citep">(<bibref bibrefs="stulp12icml,stulp2012policy,stulp13paladyn" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
    </subsection>
    <subsection inlist="toc" labels="LABEL:sec:emb" xml:id="S4.SS4">
      <tags>
        <tag>4.4</tag>
        <tag role="autoref">subsection 4.4</tag>
        <tag role="refnum">4.4</tag>
        <tag role="typerefnum">§4.4</tag>
      </tags>
      <title><tag close=" ">4.4</tag>EM-based algorithms</title>
<!--  %**** sample˙efficiency.tex Line 1150 **** -->      <para xml:id="S4.SS4.p1">
        <p>The key and somewhat confusing feature of <text font="smallcaps">EDA</text>s is that the same Gaussian function is used both as a utility model to determine the current best guess and as an exploration policy to generate new samples.</p>
      </para>
      <para xml:id="S4.SS4.p2">
        <p>Interestingly, the <text font="smallcaps">rwr</text> and <text font="smallcaps">crkr</text> algorithms covered in <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> as EM-based methods are quite similar to <text font="smallcaps">EDA</text>s, apart from the fact that
the utility model they learn can be any function, learned through a locally weighted regression algorithm (see <cite class="ltx_citemacro_citep">(<bibref bibrefs="stulp15NN" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> for details).</p>
      </para>
      <para xml:id="S4.SS4.p3">
        <p>Both <text font="smallcaps">rwr</text> and <text font="smallcaps">crkr</text> use Monte Carlo evaluation, which corresponds to the first half of Algorithm <ref labelref="LABEL:alg:eda"/>: generating new samples and evaluating them.</p>
      </para>
      <para xml:id="S4.SS4.p5">
        <p>The <text font="smallcaps">rwr</text> algorithm uses Gaussian exploration as in Algorithm <ref labelref="LABEL:alg:eda"/> to generate these samples,
whereas <text font="smallcaps">crkr</text> uses a more sophisticated kernel-based representation to drive the exploration process.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S4.SS5">
      <tags>
        <tag>4.5</tag>
        <tag role="autoref">subsection 4.5</tag>
        <tag role="refnum">4.5</tag>
        <tag role="typerefnum">§4.5</tag>
      </tags>
      <title><tag close=" ">4.5</tag>Finite difference methods</title>
      <para xml:id="S4.SS5.p1">
        <p>In <text font="italic">finite difference</text> methods, rather than being computed analytically, the gradient is estimated from the first-order Taylor expansion:</p>
      </para>
      <para class="ltx_noindent" xml:id="S4.SS5.p2">
        <equation labels="LABEL:eq:fd" xml:id="S4.E5">
          <tags>
            <tag>(5)</tag>
            <tag role="autoref">Equation 5</tag>
            <tag role="refnum">5</tag>
          </tags>
          <Math mode="display" tex="\nabla_{\mathbf{\bm{\phi}}_{i}}f|_{\mathbf{\bm{\phi}}=\mathbf{\bm{\phi}}_{i}}%&#10;\approx(\delta\mathbf{\bm{\phi}}_{i}^{T}\delta\mathbf{\bm{\phi}}_{i})^{-1}(\delta%&#10;\mathbf{\bm{\phi}}_{i}^{T})(f(\mathbf{\bm{\phi}}_{i}+\delta\mathbf{\bm{\phi}}_%&#10;{i})-f(\mathbf{\bm{\phi}}_{i}))" text="evaluated-at@((nabla _ phi _ i)@(f), phi = phi _ i) approximately-equals (delta * (phi _ i) ^ T * delta * phi _ i) ^ (- 1) * delta * (phi _ i) ^ T * (f * (phi _ i + delta * phi _ i) - f * phi _ i)" xml:id="S4.E5.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="approximately-equals" name="approx" role="RELOP">≈</XMTok>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="evaluated-at"/>
                    <XMRef idref="S4.E5.m1.2"/>
                    <XMRef idref="S4.E5.m1.1"/>
                  </XMApp>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMWrap>
                      <XMApp xml:id="S4.E5.m1.2">
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                            <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                            <XMTok font="italic" fontsize="50%" role="UNKNOWN">i</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMTok font="italic" role="UNKNOWN">f</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">|</XMTok>
                    </XMWrap>
                    <XMApp xml:id="S4.E5.m1.1">
                      <XMTok fontsize="70%" meaning="equals" role="RELOP">=</XMTok>
                      <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                        <XMTok font="bold italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                        <XMTok font="italic" fontsize="50%" role="UNKNOWN">i</XMTok>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                </XMDual>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                    <XMDual>
                      <XMRef idref="S4.E5.m1.3"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="S4.E5.m1.3">
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
                          <XMApp>
                            <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                            </XMApp>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">T</XMTok>
                          </XMApp>
                          <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMDual>
                    <XMRef idref="S4.E5.m1.4"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S4.E5.m1.4">
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
                        <XMApp>
                          <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                          </XMApp>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">T</XMTok>
                        </XMApp>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                  <XMDual>
                    <XMRef idref="S4.E5.m1.5"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S4.E5.m1.5">
                        <XMTok meaning="minus" role="ADDOP">-</XMTok>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMTok font="italic" role="UNKNOWN">f</XMTok>
                          <XMDual>
                            <XMRef idref="S4.E5.m1.5.1"/>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false">(</XMTok>
                              <XMApp xml:id="S4.E5.m1.5.1">
                                <XMTok meaning="plus" role="ADDOP">+</XMTok>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                </XMApp>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                    <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                  </XMApp>
                                </XMApp>
                              </XMApp>
                              <XMTok role="CLOSE" stretchy="false">)</XMTok>
                            </XMWrap>
                          </XMDual>
                        </XMApp>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMTok font="italic" role="UNKNOWN">f</XMTok>
                          <XMDual>
                            <XMRef idref="S4.E5.m1.5.2"/>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false">(</XMTok>
                              <XMApp xml:id="S4.E5.m1.5.2">
                                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                              </XMApp>
                              <XMTok role="CLOSE" stretchy="false">)</XMTok>
                            </XMWrap>
                          </XMDual>
                        </XMApp>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>
        </equation>
        <p>where <Math mode="inline" tex="\delta\mathbf{\bm{\phi}}_{i}" text="delta * phi _ i" xml:id="S4.SS5.p2.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> is a small variation of <Math mode="inline" tex="\mathbf{\bm{\phi}}_{i}" text="phi _ i" xml:id="S4.SS5.p2.m2">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
              </XMApp>
            </XMath>
          </Math>.
Solving (<ref labelref="LABEL:eq:fd"/>) using a set of samples that relate <Math mode="inline" tex="\delta\mathbf{\bm{\phi}}_{i}" text="delta * phi _ i" xml:id="S4.SS5.p2.m3">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> to <Math mode="inline" tex="f(\mathbf{\bm{\phi}}_{i}+\delta\mathbf{\bm{\phi}}_{i})-f(\mathbf{\bm{\phi}}_{i})" text="f * (phi _ i + delta * phi _ i) - f * phi _ i" xml:id="S4.SS5.p2.m4">
            <XMath>
              <XMApp>
                <XMTok meaning="minus" role="ADDOP">-</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">f</XMTok>
                  <XMDual>
                    <XMRef idref="S4.SS5.p2.m4.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S4.SS5.p2.m4.1">
                        <XMTok meaning="plus" role="ADDOP">+</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                        </XMApp>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                          </XMApp>
                        </XMApp>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">f</XMTok>
                  <XMDual>
                    <XMRef idref="S4.SS5.p2.m4.2"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S4.SS5.p2.m4.2">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> is a standard regression problem, but perturbations along each dimension of the input can be treated separately, which results in a very simple algorithm <cite class="ltx_citemacro_citep">(<bibref bibrefs="riedmiller08evaluation" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. The price of this simplicity is a high variance of the gradient estimate. Besides, this algorithm is derivative-free and we classify it as using no surrogate model, even though it relies on a local linear approximation to estimate the gradient.
<!--  %**** sample˙efficiency.tex Line 1175 **** --></p>
      </para>
      <para xml:id="S4.SS5.p3">
        <p>However, finite difference methods are limited to deterministic policies and suffer from high variance. In simulation,
they can be applied to stochastic policies by using identical noise along all trajectories <cite class="ltx_citemacro_citep">(<bibref bibrefs="ng2000pegasus" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
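      <para>
        <p>The regression view of finite differences can be illustrated with a minimal Python sketch. It is only an illustration under stated assumptions: the names <text font="typewriter">fd_gradient</text> and <text font="typewriter">quadratic_utility</text>, the Gaussian perturbations and the toy utility are hypothetical choices that do not come from the cited work, and the utility is assumed to be deterministic.</p>
        <verbatim font="typewriter">
```python
import numpy as np


def fd_gradient(f, phi, n_samples=50, scale=1e-2, rng=None):
    """Least-squares finite-difference estimate of the gradient of f at phi."""
    rng = np.random.default_rng(0) if rng is None else rng
    phi = np.asarray(phi, dtype=float)
    # Each row of d_phi is one small random perturbation of phi (assumption:
    # isotropic Gaussian perturbations of standard deviation `scale`).
    d_phi = scale * rng.standard_normal((n_samples, phi.size))
    # Utility differences f(phi + d_phi) - f(phi), one per perturbation.
    f0 = f(phi)
    d_f = np.array([f(phi + d) - f0 for d in d_phi])
    # Linear regression d_f ~ d_phi @ g, solved in the least-squares sense.
    g, *_ = np.linalg.lstsq(d_phi, d_f, rcond=None)
    return g


def quadratic_utility(phi):
    # Hypothetical deterministic utility with its maximum at (1, -2).
    target = np.array([1.0, -2.0])
    return -np.sum((phi - target) ** 2)


grad = fd_gradient(quadratic_utility, np.zeros(2))
```
        </verbatim>
        <p>On this noise-free quadratic utility, the estimate is close to the true gradient (2, -4). With noisy rollouts, many more samples would be needed to reach the same accuracy, which is the variance issue mentioned above.</p>
      </para>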
    </subsection>
    <subsection inlist="toc" labels="LABEL:mes:mf" xml:id="S4.SS6">
      <tags>
        <tag>4.6</tag>
        <tag role="autoref">subsection 4.6</tag>
        <tag role="refnum">4.6</tag>
        <tag role="typerefnum">§4.6</tag>
      </tags>
      <title><tag close=" ">4.6</tag>Summary</title>
      <para xml:id="S4.SS6.p1">
        <p><text font="bold">Message 13:</text>
Policy search without a utility model is not very data efficient and generally does not lend itself to sample reuse.</p>
      </para>
<!--  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:theta" xml:id="S5">
    <tags>
      <tag>5</tag>
      <tag role="autoref">section 5</tag>
      <tag role="refnum">5</tag>
      <tag role="typerefnum">§5</tag>
    </tags>
    <title><tag close=" ">5</tag>Policy search with a surrogate model in the space of policy parameters</title>
    <para xml:id="S5.p1">
      <p>Surprisingly, no algorithm performs simple policy search with a surrogate model in the space of policy parameters.
The only algorithms based on this idea are Bayesian optimization and <text font="smallcaps">rock<Math mode="inline" tex="{}^{*}" text="^*" xml:id="S5.p1.m1">
            <XMath>
              <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                <XMTok font="upright" fontsize="70%" meaning="times" role="MULOP">*</XMTok>
              </XMApp>
            </XMath>
          </Math></text>, but they use a distribution over surrogate models,
which endows them with active learning capabilities.</p>
    </para>
<!--  %“begin–comment˝ -->    <table inlist="lot" labels="LABEL:tab:classif1" placement="htbp" xml:id="S5.T2">
      <tags>
        <tag>Table 2</tag>
        <tag role="autoref">Table 2</tag>
        <tag role="refnum">2</tag>
        <tag role="typerefnum">Table 2</tag>
      </tags>
<!--  %**** sample˙efficiency.tex Line 1200 **** -->      <tabular class="ltx_centering ltx_guessed_headers" vattach="middle">
        <thead>
          <tr>
            <td border="l r t" thead="column row"/>
            <td align="center" border="r t" thead="column row">Model Improv.</td>
            <td align="left" border="r t" colspan="4" thead="column">Policy Improvement</td>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td align="left" border="l r tt" thead="row"><text font="smallcaps">cma-es</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="hansen01completely" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                  <bibrefphrase>, </bibrefphrase>
                </bibref>)</cite></td>
            <td align="center" border="r tt" thead="row">Iter.</td>
            <td align="center" border="tt">A.</td>
            <td align="center" border="tt">RWA</td>
            <td align="center" border="tt">LS</td>
            <td align="center" border="r tt">NSO</td>
          </tr>
          <tr>
            <td align="left" border="l r" thead="row"><text font="smallcaps">pi<Math mode="inline" tex="{}^{2}" text="^2" xml:id="S5.T2.m1">
                  <XMath>
                    <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                      <XMTok font="upright" fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                    </XMApp>
                  </XMath>
                </Math>-cma</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="stulp12icml" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                  <bibrefphrase>, </bibrefphrase>
                </bibref>)</cite></td>
            <td align="center" border="r" thead="row">Iter.</td>
            <td align="center">A.</td>
            <td align="center">RWA</td>
            <td align="center">LS</td>
            <td align="center" border="r">NSO</td>
          </tr>
          <tr>
            <td align="left" border="l r" thead="row"><text font="smallcaps">nes</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="wierstra2008natural" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                  <bibrefphrase>, </bibrefphrase>
                </bibref>)</cite></td>
            <td align="center" border="r" thead="row">Iter.</td>
            <td align="center">A.</td>
            <td align="center">NG</td>
            <td align="center">LS</td>
            <td align="center" border="r">NSO</td>
          </tr>
          <tr>
            <td align="left" border="l r" thead="row"><text font="smallcaps">rwr</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="peters2007reinforcement" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                  <bibrefphrase>, </bibrefphrase>
                </bibref>)</cite></td>
            <td align="center" border="r" thead="row">Iter.</td>
            <td align="center">GB.</td>
            <td align="center">VG</td>
            <td align="center">LS</td>
            <td align="center" border="r">MC-EM</td>
          </tr>
          <tr>
            <td align="left" border="l r" thead="row"><text font="smallcaps">crkr</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="kober2012reinforcement" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                  <bibrefphrase>, </bibrefphrase>
                </bibref>)</cite></td>
            <td align="center" border="r" thead="row">Iter.</td>
            <td align="center">GB.</td>
            <td align="center">VG</td>
            <td align="center">LS</td>
            <td align="center" border="r">MC-EM</td>
          </tr>
          <tr>
            <td align="left" border="l r" thead="row">EB-<text font="smallcaps">reps</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="peters2010relative" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                  <bibrefphrase>, </bibrefphrase>
                </bibref>)</cite></td>
            <td align="center" border="r" thead="row">Iter.</td>
            <td align="center">GB.</td>
            <td align="center">NG</td>
            <td align="center">LS</td>
            <td align="center" border="r">IT</td>
          </tr>
          <tr>
            <td align="left" border="l r t" thead="row"><text font="smallcaps">cem</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="rubinstein04crossentropy" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                  <bibrefphrase>, </bibrefphrase>
                </bibref>)</cite></td>
            <td align="center" border="r t" thead="row">Iter.</td>
            <td align="center" border="t">GB.</td>
            <td align="center" border="t">VG</td>
            <td align="center" border="t">LS</td>
            <td align="center" border="r t">SO</td>
          </tr>
          <tr>
            <td align="left" border="l r" thead="row"><text font="smallcaps">pi<Math mode="inline" tex="{}^{\mbox{\tiny BB}}" text="^[BB]" xml:id="S5.T2.m2">
                  <XMath>
                    <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                      <XMText><text fontsize="50%">BB</text></XMText>
                    </XMApp>
                  </XMath>
                </Math></text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="stulp2012policy" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                  <bibrefphrase>, </bibrefphrase>
                </bibref>)</cite></td>
            <td align="center" border="r" thead="row">Iter.</td>
            <td align="center">GB.</td>
            <td align="center">RWA</td>
            <td align="center">LS</td>
            <td align="center" border="r">SO</td>
          </tr>
          <tr>
            <td align="left" border="l r" thead="row">x<text font="smallcaps">nes</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="glasmachers2010exponential" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                  <bibrefphrase>, </bibrefphrase>
                </bibref>)</cite></td>
            <td align="center" border="r" thead="row">Iter.</td>
            <td align="center">GB.</td>
            <td align="center">RWA</td>
            <td align="center">LS</td>
            <td align="center" border="r">NSO</td>
          </tr>
          <tr>
            <td align="left" border="l r" thead="row">Bayes. Opt. <cite class="ltx_citemacro_citep">(<bibref bibrefs="pelikan1999boa" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                  <bibrefphrase>, </bibrefphrase>
                </bibref>)</cite></td>
            <td align="center" border="r" thead="row">Incr.</td>
            <td align="center">A.</td>
            <td align="center">VG</td>
            <td align="center">GS</td>
            <td align="center" border="r">PG</td>
          </tr>
          <tr>
            <td align="left" border="b l r" thead="row"><text font="smallcaps">rock<Math mode="inline" tex="{}^{*}" text="^*" xml:id="S5.T2.m3">
                  <XMath>
                    <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                      <XMTok font="upright" fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                    </XMApp>
                  </XMath>
                </Math></text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="hwangbo2014rock" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                  <bibrefphrase>, </bibrefphrase>
                </bibref>)</cite></td>
            <td align="center" border="b r" thead="row">Incr.</td>
            <td align="center" border="b">A.</td>
            <td align="center" border="b">RWA</td>
            <td align="center" border="b">GS</td>
            <td align="center" border="b r">PG</td>
          </tr>
        </tbody>
      </tabular>
      <toccaption class="ltx_centering"><tag close=" ">2</tag>List of episodic policy search algorithms using a surrogate model in the policy parameter space.
Above the line, they were studied in <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, below they were not.
Bayes. Opt.: Bayesian optimization.
Model improvement: Iter. = iterative, Incr. = incremental.
Policy improvement: A. = analytical, GB. = gradient-based, RWA = reward-weighted averaging, VG = vanilla gradient, NG = natural gradient, LS = local search, GS = global search.
Policy update methods (following <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>):
MC-EM = Expectation-Maximization with Monte Carlo,
IT = Information-theoretic method,
SO = stochastic optimization,
NSO = stochastic optimization with the natural gradient,
PG = policy gradient.
</toccaption>
      <caption class="ltx_centering"><tag close=": ">Table 2</tag>List of episodic policy search algorithms using a surrogate model in the policy parameter space.
Above the line, they were studied in <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, below they were not.
Bayes. Opt.: Bayesian optimization.
Model improvement: Iter. = iterative, Incr. = incremental.
Policy improvement: A. = analytical, GB. = gradient-based, RWA = reward-weighted averaging, VG = vanilla gradient, NG = natural gradient, LS = local search, GS = global search.
Policy update methods (following <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>):
MC-EM = Expectation-Maximization with Monte Carlo,
IT = Information-theoretic method,
SO = stochastic optimization,
NSO = stochastic optimization with the natural gradient,
PG = policy gradient.
</caption>
    </table>
    <subsection inlist="toc" labels="LABEL:mes:BO2" xml:id="S5.SS1">
      <tags>
        <tag>5.1</tag>
        <tag role="autoref">subsection 5.1</tag>
        <tag role="refnum">5.1</tag>
        <tag role="typerefnum">§5.1</tag>
      </tags>
      <title><tag close=" ">5.1</tag>Bayesian Optimization</title>
      <para xml:id="S5.SS1.p1">
        <p>Bayesian optimization (BO) is an instance of optimization with a surrogate model in <Math mode="inline" tex="\mathbf{\bm{\theta}}" text="theta" xml:id="S5.SS1.p1.m1">
            <XMath>
              <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
            </XMath>
          </Math> where, instead of learning one surrogate model, a <text font="italic">distribution</text> over such surrogate models is updated through Bayesian inference. The distribution over surrogate models is initialized with some <text font="italic">prior</text>, and each new sample, considered as new <text font="italic">evidence</text>, helps shift the model distribution towards a peak at the true value, while the variance of the distribution keeps track of the remaining uncertainty.</p>
      </para>
      <para xml:id="S5.SS1.p2">
        <p>A BO algorithm comes with some <text font="italic">covariance function</text> that determines how the information provided by a new sample influences the model distribution around this sample.
That is, where <text font="smallcaps">EDA</text>s necessarily assume a Gaussian relationship between the values of two samples, BO can assume more varied functions.
BO also comes with an <text font="italic">acquisition function</text> used to choose the next sample given the current model distribution. A good acquisition function should take into account the expected value of the chosen sample as well as the uncertainty around this sample.
Furthermore, finding the optimum of the acquisition function should be computationally as cheap as possible, so as to control the cost of choosing the next sample.</p>
      </para>
      <para xml:id="S5.SS1.p3">
        <p>By quickly reducing uncertainty, BO implements a form of active learning.
As a consequence, it is very sample efficient when the parameter space is small enough, and it converges to a global optimum rather than a local one.
However, given the necessity to optimize globally over the acquisition function, it scales poorly with the size of the parameter space.
We do not expand further on BO here; see <cite class="ltx_citemacro_citep">(<bibref bibrefs="brochu2010tutorial" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> for more details.</p>
      </para>
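      <para>
        <p>The interplay between the model distribution, the covariance function and the acquisition function can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: we assume a zero-mean Gaussian process with a squared-exponential covariance function, an upper confidence bound as acquisition function, and a one-dimensional parameter space small enough to maximize the acquisition function over a fixed grid. All names (<text font="typewriter">rbf_kernel</text>, <text font="typewriter">gp_posterior</text>, <text font="typewriter">bayes_opt</text>, <text font="typewriter">toy_utility</text>) and numerical settings are hypothetical.</p>
        <verbatim font="typewriter">
```python
import numpy as np


def rbf_kernel(a, b, length=0.2):
    # Squared-exponential covariance between two sets of 1-D points.
    return np.exp(-0.5 * ((a[:, None] - b[None, :]) / length) ** 2)


def gp_posterior(x_obs, y_obs, x_new, noise=1e-6):
    # Posterior mean and standard deviation of a zero-mean GP at x_new.
    k_oo = rbf_kernel(x_obs, x_obs) + noise * np.eye(x_obs.size)
    k_no = rbf_kernel(x_new, x_obs)
    mean = k_no @ np.linalg.solve(k_oo, y_obs)
    var = 1.0 - np.sum(k_no * np.linalg.solve(k_oo, k_no.T).T, axis=1)
    return mean, np.sqrt(np.maximum(var, 0.0))


def bayes_opt(utility, n_iter=15, kappa=2.0):
    candidates = np.linspace(0.0, 1.0, 201)  # small 1-D parameter space
    x_obs = np.array([0.0, 1.0])             # two initial evaluations
    y_obs = np.array([utility(x) for x in x_obs])
    for _ in range(n_iter):
        mean, std = gp_posterior(x_obs, y_obs, candidates)
        # Upper confidence bound: expected value plus an uncertainty bonus.
        x_next = candidates[np.argmax(mean + kappa * std)]
        x_obs = np.append(x_obs, x_next)
        y_obs = np.append(y_obs, utility(x_next))
    # Return the best evaluated parameter and its utility.
    return x_obs[np.argmax(y_obs)], np.max(y_obs)


def toy_utility(theta):
    # Hypothetical utility: local optimum near 0.2, global optimum near 0.75.
    return (np.exp(-((theta - 0.75) / 0.1) ** 2)
            + 0.5 * np.exp(-((theta - 0.2) / 0.1) ** 2))


best_theta, best_value = bayes_opt(toy_utility)
```
        </verbatim>
        <p>The exhaustive grid used here to maximize the acquisition function is only viable in very low dimension, which reflects the scalability issue mentioned above.</p>
      </para>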
      <para xml:id="S5.SS1.p4">
        <p><text font="bold">Message 14:</text>
Bayesian Optimization is BBO with a surrogate model, where a distribution over surrogate models is maintained.
It performs active learning by trying to reduce uncertainty, and can perform global search, at the price of poorer scalability.</p>
      </para>
<!--  %“end–comment˝ 
     %In contrast with model-free methods and derivative-based optimisation methods, the new optimum guess can be far away from the current best samples.
     %Previous samples are thrown away, but they could be reused.
     %In “edas, parameter of the surrogate model are the mean and the covariance of the Gaussian
     %“section–Derivative-based “pse with a model˝
     %“label–sec:DBPSMO˝-->      <para xml:id="S5.SS1.p5">
        <p>Besides, the search for the optimum can rely on derivative-based optimization, but it does not need to be local.
This allows the algorithm to consider several local optima and find the <text font="italic">global</text> optimum.</p>
      </para>
      <para xml:id="S5.SS1.p6">
        <p>The <text font="smallcaps">rock<Math mode="inline" tex="{}^{*}" text="^*" xml:id="S5.SS1.p6.m1">
              <XMath>
                <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                  <XMTok font="upright" fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                </XMApp>
              </XMath>
            </Math></text> algorithm is one such instance <cite class="ltx_citemacro_citep">(<bibref bibrefs="hwangbo2014rock" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.
However, it uses <text font="smallcaps">cma-es</text> to find the optimum over the model function.
By doing so, it performs natural gradient optimization rather than vanilla gradient optimization.
But it thereby uses derivative-free optimization where a derivative-based approach would be possible, since the model function is known.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S5.SS2">
      <tags>
        <tag>5.2</tag>
        <tag role="autoref">subsection 5.2</tag>
        <tag role="refnum">5.2</tag>
        <tag role="typerefnum">§5.2</tag>
      </tags>
      <title><tag close=" ">5.2</tag>Summary</title>
<!--  %**** sample˙efficiency.tex Line 1275 **** -->      <para xml:id="S5.SS2.p2">
        <p>Table <ref labelref="LABEL:tab:classif1"/> gives the most important distinctions between episodic policy search algorithms using a surrogate model in the policy parameter space.
All these algorithms share the same architecture. Their model is linear in the features (the Gaussian utility model is a special case with just one feature), they use a deterministic policy, and they do not use a forward model. Quite obviously, they perform exploration in the policy parameter space. The distinctions between on-policy and off-policy updates and between multi-step and single-step updates do not make sense in their case.</p>
      </para>
<!--  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:psss" xml:id="S6">
    <tags>
      <tag>6</tag>
      <tag role="autoref">section 6</tag>
      <tag role="refnum">6</tag>
      <tag role="typerefnum">§6</tag>
    </tags>
    <title><tag close=" ">6</tag>Policy search with a critic</title>
    <para xml:id="S6.p1">
      <p>Methods presented in Section <ref labelref="LABEL:sec:theta"/> learn a surrogate model in the space of policy parameters.
The family of methods we now present also learns a model of the utility function, but does so in the space of states
and possibly actions. This model is called a <text font="italic">critic</text>.
We first give a quick overview of the mathematical justification of this approach before presenting the methods themselves.
<!--  %A deeper and more rigorous presentation of this line of thought is given in “cite–peters2008reinforcement˝ and “cite–deisenroth2013survey˝. --></p>
    </para>
    <subsection inlist="toc" labels="LABEL:mes:critic LABEL:sec:theta2x" xml:id="S6.SS1">
      <tags>
        <tag>6.1</tag>
        <tag role="autoref">subsection 6.1</tag>
        <tag role="refnum">6.1</tag>
        <tag role="typerefnum">§6.1</tag>
      </tags>
      <title><tag close=" ">6.1</tag>From a utility model in the policy parameter space to a critic</title>
      <para xml:id="S6.SS1.p1">
        <p>Starting from (<ref labelref="LABEL:eq:glob_return_tau"/>), page <ref labelref="LABEL:eq:glob_return_tau"/> and after simple mathematical transformations explained in <cite class="ltx_citemacro_citep">(<bibref bibrefs="peters2008reinforcement" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> and <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, the gradient over <Math mode="inline" tex="{\mathbf{\bm{\theta}}}" text="theta" xml:id="S6.SS1.p1.m1">
            <XMath>
              <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
            </XMath>
          </Math> of the global utility function <Math mode="inline" tex="J({\mathbf{\bm{\theta}}})" text="J * theta" xml:id="S6.SS1.p1.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">J</XMTok>
                <XMDual>
                  <XMRef idref="S6.SS1.p1.m2.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S6.SS1.p1.m2.1">θ</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> can be rewritten as</p>
      </para>
      <para xml:id="S6.SS1.p2">
        <equation labels="LABEL:eq:reinforce" xml:id="S6.E6">
          <tags>
            <tag>(6)</tag>
            <tag role="autoref">Equation 6</tag>
            <tag role="refnum">6</tag>
          </tags>
          <Math mode="display" tex="\nabla_{\mathbf{\bm{\theta}}}J({\mathbf{\bm{\theta}}})={{\rm I\!E}}{}_{\tau}\{%&#10;\nabla_{\mathbf{\bm{\theta}}}\log p_{\mathbf{\bm{\theta}}}(\tau)J(\tau)\}." xml:id="S6.E6.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
              </XMApp>
              <XMTok font="italic" role="UNKNOWN">J</XMTok>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">(</XMTok>
                <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S6.E6.m1.1">θ</XMTok>
                <XMTok role="CLOSE" stretchy="false">)</XMTok>
              </XMWrap>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMTok role="UNKNOWN" rpadding="-1.7pt">I</XMTok>
              <XMTok role="UNKNOWN">E</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="pre1"/>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">{</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                    <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                  </XMApp>
                  <XMTok meaning="logarithm" role="OPFUNCTION">log</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">p</XMTok>
                    <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                  <XMTok font="italic" role="UNKNOWN">J</XMTok>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                  <XMTok role="CLOSE" stretchy="false">}</XMTok>
                </XMWrap>
                <XMTok font="italic" fontsize="70%" name="tau" role="UNKNOWN">τ</XMTok>
              </XMApp>
              <XMTok role="PERIOD">.</XMTok>
            </XMath>
          </Math>
        </equation>
      </para>
      <para xml:id="S6.SS1.p3">
        <p>Furthermore, one can show that</p>
      </para>
<!--  %**** sample˙efficiency.tex Line 1300 **** -->      <para xml:id="S6.SS1.p4">
        <equation labels="LABEL:eq:params2xu" xml:id="S6.E7">
          <tags>
            <tag>(7)</tag>
            <tag role="autoref">Equation 7</tag>
            <tag role="refnum">7</tag>
          </tags>
          <Math mode="display" tex="\nabla_{\mathbf{\bm{\theta}}}logp_{\mathbf{\bm{\theta}}}(\tau)=\sum_{k=0}^{k_{%&#10;f}}\nabla_{\mathbf{\bm{\theta}}}log\pi_{\mathbf{\bm{\theta}}}(\mathbf{\bm{u}}_%&#10;{k}|\mathbf{\bm{x}}_{k})." text="(nabla _ theta)@(l * o * g * p _ theta) * tau = ((sum _ (k = 0)) ^ (k _ f))@((nabla _ theta)@(l * o * g * pi _ theta) * conditional@(u _ k, x _ k))" xml:id="S6.E7.m1">
            <XMath>
              <XMDual>
                <XMRef idref="S6.E7.m1.2"/>
                <XMWrap>
                  <XMApp xml:id="S6.E7.m1.2">
                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                          <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                        </XMApp>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMTok font="italic" role="UNKNOWN">l</XMTok>
                          <XMTok font="italic" role="UNKNOWN">o</XMTok>
                          <XMTok font="italic" role="UNKNOWN">g</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="italic" role="UNKNOWN">p</XMTok>
                            <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                          </XMApp>
                        </XMApp>
                      </XMApp>
                      <XMDual>
                        <XMRef idref="S6.E7.m1.1"/>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="italic" name="tau" role="UNKNOWN" xml:id="S6.E7.m1.1">τ</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                    <XMApp>
                      <XMApp scriptpos="mid">
                        <XMTok role="SUPERSCRIPTOP" scriptpos="mid1"/>
                        <XMApp scriptpos="mid">
                          <XMTok role="SUBSCRIPTOP" scriptpos="mid1"/>
                          <XMTok mathstyle="display" meaning="sum" role="SUMOP" scriptpos="mid">∑</XMTok>
                          <XMApp>
                            <XMTok fontsize="70%" meaning="equals" role="RELOP">=</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                            <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                          <XMTok font="italic" fontsize="50%" role="UNKNOWN">f</XMTok>
                        </XMApp>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMApp>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                            <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                          </XMApp>
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMTok meaning="logarithm" role="OPFUNCTION">log</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                              <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                            </XMApp>
                          </XMApp>
                        </XMApp>
                        <XMDual>
                          <XMRef idref="S6.E7.m1.2.1"/>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false">(</XMTok>
                            <XMApp xml:id="S6.E7.m1.2.1">
                              <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                <XMTok font="bold" role="UNKNOWN">u</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                              </XMApp>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                <XMTok font="bold" role="UNKNOWN">x</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                              </XMApp>
                            </XMApp>
                            <XMTok role="CLOSE" stretchy="false">)</XMTok>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMTok role="PERIOD">.</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>
        </equation>
      </para>
      <para xml:id="S6.SS1.p5">
        <p>Thus, substituting the right-hand side of (<ref labelref="LABEL:eq:params2xu"/>) into (<ref labelref="LABEL:eq:reinforce"/>), we get</p>
      </para>
      <para xml:id="S6.SS1.p6">
        <equation labels="LABEL:eq:params2xu2" xml:id="S6.E8">
          <tags>
            <tag>(8)</tag>
            <tag role="autoref">Equation 8</tag>
            <tag role="refnum">8</tag>
          </tags>
          <Math mode="display" tex="\nabla_{\mathbf{\bm{\theta}}}J({\mathbf{\bm{\theta}}})={{\rm I\!E}}{}_{\tau}\{%&#10;\sum_{k=0}^{k_{f}}\nabla_{\mathbf{\bm{\theta}}}\log\pi_{\mathbf{\bm{\theta}}}(%&#10;\mathbf{\bm{u}}_{k}|\mathbf{\bm{x}}_{k})J(\tau)\}." xml:id="S6.E8.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
              </XMApp>
              <XMTok font="italic" role="UNKNOWN">J</XMTok>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">(</XMTok>
                <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S6.E8.m1.1">θ</XMTok>
                <XMTok role="CLOSE" stretchy="false">)</XMTok>
              </XMWrap>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMTok role="UNKNOWN" rpadding="-1.7pt">I</XMTok>
              <XMTok role="UNKNOWN">E</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="pre1"/>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">{</XMTok>
                  <XMApp scriptpos="mid">
                    <XMTok role="SUPERSCRIPTOP" scriptpos="mid1"/>
                    <XMApp scriptpos="mid">
                      <XMTok role="SUBSCRIPTOP" scriptpos="mid1"/>
                      <XMTok mathstyle="display" meaning="sum" role="SUMOP" scriptpos="mid">∑</XMTok>
                      <XMApp>
                        <XMTok fontsize="70%" meaning="equals" role="RELOP">=</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                        <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      <XMTok font="italic" fontsize="50%" role="UNKNOWN">f</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                    <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                  </XMApp>
                  <XMTok meaning="logarithm" role="OPFUNCTION">log</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                    <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="bold" role="UNKNOWN">u</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                    </XMApp>
                    <XMTok role="VERTBAR" stretchy="false">|</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="bold" role="UNKNOWN">x</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                  <XMTok font="italic" role="UNKNOWN">J</XMTok>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                  <XMTok role="CLOSE" stretchy="false">}</XMTok>
                </XMWrap>
                <XMTok font="italic" fontsize="70%" name="tau" role="UNKNOWN">τ</XMTok>
              </XMApp>
              <XMTok role="PERIOD">.</XMTok>
            </XMath>
          </Math>
        </equation>
      </para>
      <para xml:id="S6.SS1.p7">
        <p>In (<ref labelref="LABEL:eq:params2xu2"/>), the gradient of <Math mode="inline" tex="J({\mathbf{\bm{\theta}}})" text="J * theta" xml:id="S6.SS1.p7.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">J</XMTok>
                <XMDual>
                  <XMRef idref="S6.SS1.p7.m1.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S6.SS1.p7.m1.1">θ</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> with respect to <Math mode="inline" tex="{\mathbf{\bm{\theta}}}" text="theta" xml:id="S6.SS1.p7.m2">
            <XMath>
              <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
            </XMath>
          </Math> is transformed into a gradient of <Math mode="inline" tex="\pol{{}_{\mathbf{\bm{\theta}}}}" text="pol@(_theta)" xml:id="S6.SS1.p7.m3">
            <XMath>
              <XMApp>
                <XMTok name="pol" role="OVERACCENT">→</XMTok>
                <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                  <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> over <Math mode="inline" tex="(\mathbf{\bm{u}}|\mathbf{\bm{x}})" text="conditional@(u, x)" xml:id="S6.SS1.p7.m4">
            <XMath>
              <XMDual>
                <XMRef idref="S6.SS1.p7.m4.1"/>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp xml:id="S6.SS1.p7.m4.1">
                    <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                    <XMTok font="bold" role="UNKNOWN">u</XMTok>
                    <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>. This is advantageous because, in general, the former is not known analytically whereas the latter is, since the parametric policy is a function chosen by the user.</p>
      </para>
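      <para>
        <p>As a hedged illustration of the estimator in (<ref labelref="LABEL:eq:params2xu2"/>), the following Python sketch approximates the expectation by an average over sampled trajectories. Everything in it is an assumption made for the example rather than part of any surveyed algorithm: the policy is a Gaussian with mean theta and fixed standard deviation sigma that ignores the state, the trajectory return is minus the sum of squared distances between each action and a hypothetical target, and the function names reinforce_gradient and exact_gradient are illustrative. For this toy problem the gradient is available in closed form, which gives a simple sanity check.</p>
      </para>
      <para>
        <verbatim font="typewriter">
```python
import random


def reinforce_gradient(theta: float, sigma: float = 1.0, target: float = 1.0,
                       horizon: int = 3, n_traj: int = 100_000,
                       seed: int = 0) -> float:
    """Monte Carlo estimate of E_tau{ sum_k grad log pi(u_k | x_k) * J(tau) }.

    Toy assumptions (not from the paper): pi(u | x) is N(theta, sigma^2)
    whatever the state, and J(tau) = -sum_k (u_k - target)^2.
    """
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_traj):
        score_sum = 0.0   # sum over k of grad_theta log pi(u_k | x_k)
        ret = 0.0         # trajectory return J(tau)
        for _ in range(horizon):
            u = rng.gauss(theta, sigma)
            # Score of a Gaussian with respect to its mean.
            score_sum += (u - theta) / sigma ** 2
            ret -= (u - target) ** 2
        total += score_sum * ret
    return total / n_traj


def exact_gradient(theta: float, target: float = 1.0, horizon: int = 3) -> float:
    """Closed-form gradient of J(theta) = -horizon * ((theta - target)^2 + sigma^2)."""
    return -2.0 * horizon * (theta - target)


if __name__ == "__main__":
    print("estimate:", reinforce_gradient(0.0))
    print("exact:   ", exact_gradient(0.0))
```
        </verbatim>
      </para>
      <para>
        <p>Only the score of the policy and sampled returns are needed; the transition model never appears. The estimate is noisy, so a large number of trajectories is used here to bring it close to the closed-form value.</p>
      </para>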
      <para xml:id="S6.SS1.p8">
        <p>By using the <text font="italic">policy gradient theorem</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="sutton00_NIPS" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> in the stochastic case, one can further show that (<ref labelref="LABEL:eq:reinforce"/>) can be reformulated as:</p>
      </para>
      <para xml:id="S6.SS1.p9">
        <equationgroup class="ltx_eqn_align" xml:id="Sx1.EGx1">
          <equation labels="LABEL:eq:pgt" xml:id="S6.Ex3">
            <MathFork>
              <Math tex="\displaystyle\nabla_{\mathbf{\bm{\theta}}}J({\mathbf{\bm{\theta}}})=\int_{%&#10;\mathcal{X}}d^{\pi_{\mathbf{\bm{\theta}}}}(\mathbf{\bm{x}})\int_{\mathcal{U}}%&#10;\nabla_{\mathbf{\bm{\theta}}}\pol{{}_{\mathbf{\bm{\theta}}}}(\mathbf{\bm{u}}|%&#10;\mathbf{\bm{x}})(Q^{\pol{{}_{\mathbf{\bm{\theta}}}}}(\mathbf{\bm{x}},\mathbf{%&#10;\bm{u}})" xml:id="S6.Ex3.m3">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                    <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                  </XMApp>
                  <XMTok font="italic" role="UNKNOWN">J</XMTok>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S6.Ex3.m3.1">θ</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok mathstyle="display" meaning="integral" name="int" role="INTOP">∫</XMTok>
                    <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">X</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">d</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                      <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
                      <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.Ex3.m3.2">x</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok mathstyle="display" meaning="integral" name="int" role="INTOP">∫</XMTok>
                    <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">U</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                    <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok name="pol" role="OVERACCENT">→</XMTok>
                    <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                      <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold" role="UNKNOWN">u</XMTok>
                    <XMTok role="VERTBAR" stretchy="false">|</XMTok>
                    <XMTok font="bold" role="UNKNOWN">x</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                      <XMApp>
                        <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                        <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                          <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                        </XMApp>
                      </XMApp>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="bold" role="UNKNOWN" xml:id="S6.Ex3.m3.3">x</XMTok>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMTok font="bold" role="UNKNOWN" xml:id="S6.Ex3.m3.4">u</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMWrap>
                </XMath>
              </Math>
              <MathBranch>
                <td align="right"><Math mode="inline" tex="\displaystyle\nabla_{\mathbf{\bm{\theta}}}J({\mathbf{\bm{\theta}}})=\int_{%&#10;\mathcal{X}}d^{\pi_{\mathbf{\bm{\theta}}}}(\mathbf{\bm{x}})" text="(nabla _ theta)@(J) * theta = (integral _ X)@(d ^ (pi _ theta) * x)" xml:id="S6.Ex3.m1">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="equals" role="RELOP">=</XMTok>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMApp>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                              <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                            </XMApp>
                            <XMTok font="italic" role="UNKNOWN">J</XMTok>
                          </XMApp>
                          <XMDual>
                            <XMRef idref="S6.Ex3.m1.1"/>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false">(</XMTok>
                              <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S6.Ex3.m1.1">θ</XMTok>
                              <XMTok role="CLOSE" stretchy="false">)</XMTok>
                            </XMWrap>
                          </XMDual>
                        </XMApp>
                        <XMApp>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok mathstyle="display" meaning="integral" name="int" role="INTOP">∫</XMTok>
                            <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">X</XMTok>
                          </XMApp>
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMApp>
                              <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="italic" role="UNKNOWN">d</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                                <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
                                <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                              </XMApp>
                            </XMApp>
                            <XMDual>
                              <XMRef idref="S6.Ex3.m1.2"/>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">(</XMTok>
                                <XMTok font="bold" role="UNKNOWN" xml:id="S6.Ex3.m1.2">x</XMTok>
                                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                              </XMWrap>
                            </XMDual>
                          </XMApp>
                        </XMApp>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="left"><Math mode="inline" tex="\displaystyle\int_{\mathcal{U}}\nabla_{\mathbf{\bm{\theta}}}\pol{{}_{\mathbf{%&#10;\bm{\theta}}}}(\mathbf{\bm{u}}|\mathbf{\bm{x}})(Q^{\pol{{}_{\mathbf{\bm{\theta%&#10;}}}}}(\mathbf{\bm{x}},\mathbf{\bm{u}})" xml:id="S6.Ex3.m2">
                    <XMath>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok mathstyle="display" meaning="integral" name="int" role="INTOP">∫</XMTok>
                        <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">U</XMTok>
                      </XMApp>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                        <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                      </XMApp>
                      <XMApp>
                        <XMTok name="pol" role="OVERACCENT">→</XMTok>
                        <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                          <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                        </XMApp>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMTok font="bold" role="UNKNOWN">u</XMTok>
                        <XMTok role="VERTBAR" stretchy="false">|</XMTok>
                        <XMTok font="bold" role="UNKNOWN">x</XMTok>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp>
                          <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                          <XMApp>
                            <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                            <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                              <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                            </XMApp>
                          </XMApp>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="bold" role="UNKNOWN" xml:id="S6.Ex3.m2.1">x</XMTok>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMTok font="bold" role="UNKNOWN" xml:id="S6.Ex3.m2.2">u</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMWrap>
                    </XMath>
                  </Math></td>
              </MathBranch>
            </MathFork>
          </equation>
          <equation xml:id="S6.E9">
            <tags>
              <tag>(9)</tag>
              <tag role="autoref">Equation 9</tag>
              <tag role="refnum">9</tag>
            </tags>
            <MathFork>
              <Math tex="\displaystyle-b^{\pol{{}_{\mathbf{\bm{\theta}}}}}(\mathbf{\bm{x}}))d\mathbf{%&#10;\bm{x}}d\mathbf{\bm{u}}" xml:id="S6.E9.m3">
                <XMath>
                  <XMTok meaning="minus" role="ADDOP">-</XMTok>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">b</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                      <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                        <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.E9.m3.1">x</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  <XMTok font="italic" role="UNKNOWN">d</XMTok>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok font="italic" role="UNKNOWN">d</XMTok>
                  <XMTok font="bold" role="UNKNOWN">u</XMTok>
                </XMath>
              </Math>
              <MathBranch>
                <td align="right"/>
                <td align="left"><Math mode="inline" tex="\displaystyle-b^{\pol{{}_{\mathbf{\bm{\theta}}}}}(\mathbf{\bm{x}}))d\mathbf{%&#10;\bm{x}}d\mathbf{\bm{u}}" xml:id="S6.E9.m2">
                    <XMath>
                      <XMTok meaning="minus" role="ADDOP">-</XMTok>
                      <XMApp>
                        <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" role="UNKNOWN">b</XMTok>
                        <XMApp>
                          <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                          <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                            <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                          </XMApp>
                        </XMApp>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMTok font="bold" role="UNKNOWN" xml:id="S6.E9.m2.1">x</XMTok>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      <XMTok font="italic" role="UNKNOWN">d</XMTok>
                      <XMTok font="bold" role="UNKNOWN">x</XMTok>
                      <XMTok font="italic" role="UNKNOWN">d</XMTok>
                      <XMTok font="bold" role="UNKNOWN">u</XMTok>
                    </XMath>
                  </Math></td>
              </MathBranch>
            </MathFork>
          </equation>
        </equationgroup>
      </para>
      <para xml:id="S6.SS1.p10">
        <p>where <Math mode="inline" tex="d^{\pi_{\mathbf{\bm{\theta}}}}(\mathbf{\bm{x}})" text="d ^ (pi _ theta) * x" xml:id="S6.SS1.p10.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">d</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                    <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
                    <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                  </XMApp>
                </XMApp>
                <XMDual>
                  <XMRef idref="S6.SS1.p10.m1.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS1.p10.m1.1">x</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> is the probability density of being in state <Math mode="inline" tex="\mathbf{\bm{x}}" text="x" xml:id="S6.SS1.p10.m2">
            <XMath>
              <XMTok font="bold" role="UNKNOWN">x</XMTok>
            </XMath>
          </Math> given the current policy <Math mode="inline" tex="\pol{{}_{\mathbf{\bm{\theta}}}}" text="pol@(_theta)" xml:id="S6.SS1.p10.m3">
            <XMath>
              <XMApp>
                <XMTok name="pol" role="OVERACCENT">→</XMTok>
                <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                  <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>,
<Math mode="inline" tex="Q^{\pol{{}_{\mathbf{\bm{\theta}}}}}(\mathbf{\bm{x}},\mathbf{\bm{u}})" text="Q ^ (pol@(_theta)) * open-interval@(x, u)" xml:id="S6.SS1.p10.m4">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                    <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                      <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                    </XMApp>
                  </XMApp>
                </XMApp>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S6.SS1.p10.m4.1"/>
                    <XMRef idref="S6.SS1.p10.m4.2"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS1.p10.m4.1">x</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS1.p10.m4.2">u</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> is the <text font="italic">action value</text> function and <Math mode="inline" tex="b^{\pol{{}_{\mathbf{\bm{\theta}}}}}(\mathbf{\bm{x}})" text="b ^ (pol@(_theta)) * x" xml:id="S6.SS1.p10.m5">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">b</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                    <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                      <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                    </XMApp>
                  </XMApp>
                </XMApp>
                <XMDual>
                  <XMRef idref="S6.SS1.p10.m5.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS1.p10.m5.1">x</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> is a baseline function.</p>
      </para>
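      <para>
        <p>The role of the baseline can be illustrated with a small Python sketch, given here under strong simplifying assumptions that are not taken from the paper: a one-step problem in which the action value of an action u is simply its immediate return, minus the squared distance to a hypothetical target, and a Gaussian policy with mean theta and fixed standard deviation sigma. The baseline is set to the expected return, which is known in closed form for this toy case; the names gradient_samples and mean_and_variance are illustrative.</p>
      </para>
      <para>
        <verbatim font="typewriter">
```python
import random


def gradient_samples(theta: float, baseline: float, sigma: float = 1.0,
                     target: float = 1.0, n: int = 50_000, seed: int = 0) -> list:
    """Per-sample terms grad log pi(u) * (Q(u) - b) for a one-step toy problem."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        u = rng.gauss(theta, sigma)
        score = (u - theta) / sigma ** 2   # grad_theta log N(u; theta, sigma^2)
        q_value = -(u - target) ** 2       # one-step case: Q is the immediate return
        samples.append(score * (q_value - baseline))
    return samples


def mean_and_variance(values: list) -> tuple:
    """Sample mean and (population) variance of a list of floats."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return mean, var


theta, sigma, target = 0.0, 1.0, 1.0
# Expected return under the policy, used as a constant baseline (toy assumption).
expected_return = -((theta - target) ** 2 + sigma ** 2)

mean_plain, var_plain = mean_and_variance(gradient_samples(theta, baseline=0.0))
mean_base, var_base = mean_and_variance(
    gradient_samples(theta, baseline=expected_return))

if __name__ == "__main__":
    print("no baseline:   mean", mean_plain, "variance", var_plain)
    print("with baseline: mean", mean_base, "variance", var_base)
```
        </verbatim>
      </para>
      <para>
        <p>In this toy setting both averages estimate the same gradient, whose exact value is -2 (theta - target), while the per-sample variance is lower with the baseline. This is the usual motivation for choosing a baseline, although the amount of variance reduction depends on the problem and on the baseline chosen.</p>
      </para>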
      <para xml:id="S6.SS1.p11">
        <p>Various choices for <Math mode="inline" tex="b^{\pol{{}_{\mathbf{\bm{\theta}}}}}(\mathbf{\bm{x}})" text="b ^ (pol@(_theta)) * x" xml:id="S6.SS1.p11.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">b</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                    <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                      <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                    </XMApp>
                  </XMApp>
                </XMApp>
                <XMDual>
                  <XMRef idref="S6.SS1.p11.m1.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS1.p11.m1.1">x</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> give rise to various well-studied ways to compute the policy gradient.
To remain as general as possible, we define a utility function <Math mode="inline" tex="U(\mathbf{\bm{x}},\mathbf{\bm{u}})" text="U * open-interval@(x, u)" xml:id="S6.SS1.p11.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">U</XMTok>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S6.SS1.p11.m2.1"/>
                    <XMRef idref="S6.SS1.p11.m2.2"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS1.p11.m2.1">x</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS1.p11.m2.2">u</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> which can represent any function of the form <Math mode="inline" tex="(Q^{\pol{{}_{\mathbf{\bm{\theta}}}}}(\mathbf{\bm{x}},\mathbf{\bm{u}})-b^{\pol{%&#10;{}_{\mathbf{\bm{\theta}}}}}(\mathbf{\bm{x}}))" text="Q ^ (pol@(_theta)) * open-interval@(x, u) - b ^ (pol@(_theta)) * x" xml:id="S6.SS1.p11.m3">
            <XMath>
              <XMDual>
                <XMRef idref="S6.SS1.p11.m3.4"/>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp xml:id="S6.SS1.p11.m3.4">
                    <XMTok meaning="minus" role="ADDOP">-</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                        <XMApp>
                          <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                          <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                            <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                          </XMApp>
                        </XMApp>
                      </XMApp>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S6.SS1.p11.m3.1"/>
                          <XMRef idref="S6.SS1.p11.m3.2"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS1.p11.m3.1">x</XMTok>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS1.p11.m3.2">u</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" role="UNKNOWN">b</XMTok>
                        <XMApp>
                          <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                          <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                            <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                          </XMApp>
                        </XMApp>
                      </XMApp>
                      <XMDual>
                        <XMRef idref="S6.SS1.p11.m3.3"/>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS1.p11.m3.3">x</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>.</p>
      </para>
      <para xml:id="S6.SS1.p12">
        <p>Using <Math mode="inline" tex="U" text="U" xml:id="S6.SS1.p12.m1">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">U</XMTok>
            </XMath>
          </Math>, (<ref labelref="LABEL:eq:pgt"/>) becomes</p>
      </para>
      <para xml:id="S6.SS1.p13">
        <equation labels="LABEL:eq:pgtw" xml:id="S6.E10">
          <tags>
            <tag>(10)</tag>
            <tag role="autoref">Equation 10</tag>
            <tag role="refnum">10</tag>
          </tags>
          <Math mode="display" tex="\nabla_{\mathbf{\bm{\theta}}}J({\mathbf{\bm{\theta}}})=\int_{\mathcal{X}}d^{%&#10;\pi_{\mathbf{\bm{\theta}}}}(\mathbf{\bm{x}})\int_{\mathcal{U}}\nabla_{\mathbf{%&#10;\bm{\theta}}}\pol{{}_{\mathbf{\bm{\theta}}}}(\mathbf{\bm{u}}|\mathbf{\bm{x}})U%&#10;(\mathbf{\bm{x}},\mathbf{\bm{u}})d\mathbf{\bm{x}}d\mathbf{\bm{u}}." text="(nabla _ theta)@(J) * theta = (integral _ X)@(d ^ (pi _ theta) * x * (integral _ U)@((nabla _ theta)@(pol@(_theta)) * conditional@(u, x) * U * open-interval@(x, u) * differential-d@(x) * differential-d@(u)))" xml:id="S6.E10.m1">
            <XMath>
              <XMDual>
                <XMRef idref="S6.E10.m1.5"/>
                <XMWrap>
                  <XMApp xml:id="S6.E10.m1.5">
                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                          <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                        </XMApp>
                        <XMTok font="italic" role="UNKNOWN">J</XMTok>
                      </XMApp>
                      <XMDual>
                        <XMRef idref="S6.E10.m1.1"/>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S6.E10.m1.1">θ</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                    <XMApp>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok mathstyle="display" meaning="integral" name="int" role="INTOP">∫</XMTok>
                        <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">X</XMTok>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMApp>
                          <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="italic" role="UNKNOWN">d</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                            <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
                            <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMDual>
                          <XMRef idref="S6.E10.m1.2"/>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false">(</XMTok>
                            <XMTok font="bold" role="UNKNOWN" xml:id="S6.E10.m1.2">x</XMTok>
                            <XMTok role="CLOSE" stretchy="false">)</XMTok>
                          </XMWrap>
                        </XMDual>
                        <XMApp>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok mathstyle="display" meaning="integral" name="int" role="INTOP">∫</XMTok>
                            <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">U</XMTok>
                          </XMApp>
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMApp>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                                <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                              </XMApp>
                              <XMApp>
                                <XMTok name="pol" role="OVERACCENT">→</XMTok>
                                <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                                  <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                                </XMApp>
                              </XMApp>
                            </XMApp>
                            <XMDual>
                              <XMRef idref="S6.E10.m1.5.1"/>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">(</XMTok>
                                <XMApp xml:id="S6.E10.m1.5.1">
                                  <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                                  <XMTok font="bold" role="UNKNOWN">u</XMTok>
                                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                              </XMWrap>
                            </XMDual>
                            <XMTok font="italic" role="UNKNOWN">U</XMTok>
                            <XMDual>
                              <XMApp>
                                <XMTok meaning="open-interval"/>
                                <XMRef idref="S6.E10.m1.3"/>
                                <XMRef idref="S6.E10.m1.4"/>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">(</XMTok>
                                <XMTok font="bold" role="UNKNOWN" xml:id="S6.E10.m1.3">x</XMTok>
                                <XMTok role="PUNCT">,</XMTok>
                                <XMTok font="bold" role="UNKNOWN" xml:id="S6.E10.m1.4">u</XMTok>
                                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                              </XMWrap>
                            </XMDual>
                            <XMApp>
                              <XMTok font="italic" meaning="differential-d" role="DIFFOP">d</XMTok>
                              <XMTok font="bold" role="UNKNOWN">x</XMTok>
                            </XMApp>
                            <XMApp>
                              <XMTok font="italic" meaning="differential-d" role="DIFFOP">d</XMTok>
                              <XMTok font="bold" role="UNKNOWN">u</XMTok>
                            </XMApp>
                          </XMApp>
                        </XMApp>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMTok role="PERIOD">.</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>
        </equation>
      </para>
      <para xml:id="S6.SS1.p14">
        <p>In (<ref labelref="LABEL:eq:pgtw"/>), <Math mode="inline" tex="\pol{{}_{\mathbf{\bm{\theta}}}}(\mathbf{\bm{u}}|\mathbf{\bm{x}})" text="pol@(_theta) * conditional@(u, x)" xml:id="S6.SS1.p14.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok name="pol" role="OVERACCENT">→</XMTok>
                  <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                    <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                  </XMApp>
                </XMApp>
                <XMDual>
                  <XMRef idref="S6.SS1.p14.m1.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp xml:id="S6.SS1.p14.m1.1">
                      <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                      <XMTok font="bold" role="UNKNOWN">u</XMTok>
                      <XMTok font="bold" role="UNKNOWN">x</XMTok>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> is given.
Its gradient can therefore be computed analytically, but this is not the case for <Math mode="inline" tex="U(\mathbf{\bm{x}},\mathbf{\bm{u}})" text="U * open-interval@(x, u)" xml:id="S6.SS1.p14.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">U</XMTok>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S6.SS1.p14.m2.1"/>
                    <XMRef idref="S6.SS1.p14.m2.2"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS1.p14.m2.1">x</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS1.p14.m2.2">u</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math>.
To turn these equations into a practical algorithm, one thus needs to learn a model <Math mode="inline" tex="\hat{U}_{\eta}" text="(hat@(U)) _ eta" xml:id="S6.SS1.p14.m3">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">U</XMTok>
                </XMApp>
                <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
              </XMApp>
            </XMath>
          </Math> of <Math mode="inline" tex="U" text="U" xml:id="S6.SS1.p14.m4">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">U</XMTok>
            </XMath>
          </Math>.</p>
      </para>
      <para xml:id="S6.SS1.p15">
        <p>The <Math mode="inline" tex="\hat{U}_{\eta}(\mathbf{\bm{x}},\mathbf{\bm{u}})" text="(hat@(U)) _ eta * open-interval@(x, u)" xml:id="S6.SS1.p15.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">U</XMTok>
                  </XMApp>
                  <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                </XMApp>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S6.SS1.p15.m1.1"/>
                    <XMRef idref="S6.SS1.p15.m1.2"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS1.p15.m1.1">x</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS1.p15.m1.2">u</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> function is called a <text font="italic">critic</text>, and methods combining an approximation of <Math mode="inline" tex="U" text="U" xml:id="S6.SS1.p15.m2">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">U</XMTok>
            </XMath>
          </Math> and gradient ascent on <Math mode="inline" tex="\pol{{}_{\mathbf{\bm{\theta}}}}" text="pol@(_theta)" xml:id="S6.SS1.p15.m3">
            <XMath>
              <XMApp>
                <XMTok name="pol" role="OVERACCENT">→</XMTok>
                <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                  <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> are called <text font="italic">actor-critic</text> methods, the policy <Math mode="inline" tex="\pol{{}_{\mathbf{\bm{\theta}}}}" text="pol@(_theta)" xml:id="S6.SS1.p15.m4">
            <XMath>
              <XMApp>
                <XMTok name="pol" role="OVERACCENT">→</XMTok>
                <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                  <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> being the actor.</p>
      </para>
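      <para>
        <p>As a purely illustrative sketch (not the method of any cited paper), the actor-critic scheme can be instantiated on a one-state problem with three discrete actions, where the integral over actions in (<ref labelref="LABEL:eq:pgtw"/>) reduces to a finite sum. The softmax actor, the tabular critic <text font="typewriter">eta</text>, the zero baseline, the reward means, the noise level and the learning rates are all assumptions made for this example.</p>
        <p><verbatim font="typewriter">
```python
import math
import random


def softmax(prefs):
    """Map action preferences theta to probabilities pi_theta(u)."""
    m = max(prefs)
    exps = [math.exp(p - m) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def actor_critic_bandit(mean_rewards, steps=5000, lr_actor=0.1,
                        lr_critic=0.1, seed=0):
    """Toy actor-critic on a one-state problem (hypothetical setup)."""
    rng = random.Random(seed)
    n = len(mean_rewards)
    theta = [0.0] * n  # actor parameters
    eta = [0.0] * n    # critic parameters: eta[u] approximates U(u)
    for _ in range(steps):
        pi = softmax(theta)
        u = rng.choices(range(n), weights=pi)[0]
        reward = mean_rewards[u] + rng.gauss(0.0, 0.5)
        # Critic step: move eta[u] toward the sampled reward.
        eta[u] += lr_critic * (reward - eta[u])
        # Actor step: finite-sum analogue of the utility-weighted gradient,
        # sum_u d pi(u) / d theta_k * eta[u]
        #   = pi[k] * (eta[k] - sum_u pi[u] * eta[u]) for a softmax actor.
        expected = sum(p * q for p, q in zip(pi, eta))
        for k in range(n):
            theta[k] += lr_actor * pi[k] * (eta[k] - expected)
    return softmax(theta), eta


if __name__ == "__main__":
    probs, critic = actor_critic_bandit([0.0, 0.5, 2.0])
    print("action probabilities:", probs)
    print("critic estimates:", critic)
```
</verbatim></p>
        <p>In this sketch the actor never sees the rewards directly: it only ascends the gradient built from the critic, which is the division of labour described above.</p>
      </para>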
      <para xml:id="S6.SS1.p16">
        <p>Finally, in the case of a deterministic policy, instead of using (<ref labelref="LABEL:eq:pgt"/>), one can compute the <text font="italic">deterministic policy gradient</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="silver2014deterministic" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> using</p>
      </para>
      <para xml:id="S6.SS1.p17">
        <equation labels="LABEL:eq:dpg" xml:id="S6.E11">
          <tags>
            <tag>(11)</tag>
            <tag role="autoref">Equation 11</tag>
            <tag role="refnum">11</tag>
          </tags>
          <Math mode="display" tex="\nabla_{\mathbf{\bm{\theta}}}J({\mathbf{\bm{\theta}}})=\nabla_{\mathbf{\bm{u}}%&#10;}Q^{\pol{{}_{\mathbf{\bm{\theta}}}}}(\mathbf{\bm{x}},\mathbf{\bm{u}})\nabla_{%&#10;\mathbf{\bm{\theta}}}\pol{{}_{\mathbf{\bm{\theta}}}}(\mathbf{\bm{u}}|\mathbf{%&#10;\bm{x}})." text="(nabla _ theta)@(J) * theta = (nabla _ u)@(Q ^ (pol@(_theta))) * open-interval@(x, u) * (nabla _ theta)@(pol@(_theta)) * conditional@(u, x)" xml:id="S6.E11.m1">
            <XMath>
              <XMDual>
                <XMRef idref="S6.E11.m1.4"/>
                <XMWrap>
                  <XMApp xml:id="S6.E11.m1.4">
                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                          <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                        </XMApp>
                        <XMTok font="italic" role="UNKNOWN">J</XMTok>
                      </XMApp>
                      <XMDual>
                        <XMRef idref="S6.E11.m1.1"/>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S6.E11.m1.1">θ</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                          <XMTok font="bold" fontsize="70%" role="UNKNOWN">u</XMTok>
                        </XMApp>
                        <XMApp>
                          <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                          <XMApp>
                            <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                            <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                              <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                            </XMApp>
                          </XMApp>
                        </XMApp>
                      </XMApp>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S6.E11.m1.2"/>
                          <XMRef idref="S6.E11.m1.3"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="bold" role="UNKNOWN" xml:id="S6.E11.m1.2">x</XMTok>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMTok font="bold" role="UNKNOWN" xml:id="S6.E11.m1.3">u</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                      <XMApp>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok name="nabla" role="OPERATOR">∇</XMTok>
                          <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                        </XMApp>
                        <XMApp>
                          <XMTok name="pol" role="OVERACCENT">→</XMTok>
                          <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                            <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                          </XMApp>
                        </XMApp>
                      </XMApp>
                      <XMDual>
                        <XMRef idref="S6.E11.m1.4.1"/>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMApp xml:id="S6.E11.m1.4.1">
                            <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                            <XMTok font="bold" role="UNKNOWN">u</XMTok>
                            <XMTok font="bold" role="UNKNOWN">x</XMTok>
                          </XMApp>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                  </XMApp>
                  <XMTok role="PERIOD">.</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>
        </equation>
      </para>
      <para xml:id="S6.SS1.p18">
        <p>This can be advantageous because the space of deterministic policies is smaller than the space of stochastic policies, so searching the former is faster than searching the latter. However, a stochastic policy might be more appropriate when the Markov property does not hold <cite class="ltx_citemacro_citep">(<bibref bibrefs="williams1998experimental" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> or in adversarial contexts <cite class="ltx_citemacro_citep">(<bibref bibrefs="wang2016sample" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
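      <para>
        <p>The chain rule behind (<ref labelref="LABEL:eq:dpg"/>) can be checked numerically on a small example. The sketch below is only an illustration under stated assumptions: a one-step problem, a linear deterministic policy <text font="typewriter">policy</text>, a hand-written quadratic <text font="typewriter">q_value</text> standing in for a learned critic, and an average over a fixed, hypothetical set of states. It compares the product of the action gradient of Q and the parameter gradient of the policy with a finite-difference gradient of the objective.</p>
        <p><verbatim font="typewriter">
```python
def policy(theta, x):
    """Deterministic linear policy: u = theta[0] + theta[1] * x."""
    return theta[0] + theta[1] * x


def q_value(x, u):
    """Assumed critic: the best action in state x is u = 2 * x."""
    return -(u - 2.0 * x) ** 2


def dq_du(x, u):
    """Gradient of the assumed critic with respect to the action."""
    return -2.0 * (u - 2.0 * x)


def objective(theta, states):
    """Average value of the actions chosen by the policy."""
    return sum(q_value(x, policy(theta, x)) for x in states) / len(states)


def deterministic_policy_gradient(theta, states):
    """Average over states of dQ/du * d policy / d theta."""
    grad = [0.0, 0.0]
    for x in states:
        g = dq_du(x, policy(theta, x))
        grad[0] += g * 1.0 / len(states)  # d policy / d theta[0] = 1
        grad[1] += g * x / len(states)    # d policy / d theta[1] = x
    return grad


def finite_difference_gradient(theta, states, eps=1e-6):
    """Central finite differences of the objective, for comparison."""
    grad = []
    for k in range(len(theta)):
        up, down = list(theta), list(theta)
        up[k] += eps
        down[k] -= eps
        diff = objective(up, states) - objective(down, states)
        grad.append(diff / (2.0 * eps))
    return grad


def ascend(theta, states, lr=0.1, steps=500):
    """Plain gradient ascent using the deterministic policy gradient."""
    theta = list(theta)
    for _ in range(steps):
        grad = deterministic_policy_gradient(theta, states)
        theta = [t + lr * g for t, g in zip(theta, grad)]
    return theta


if __name__ == "__main__":
    states = [-1.0, -0.5, 0.0, 0.5, 1.0]
    print(deterministic_policy_gradient([0.5, -1.0], states))
    print(finite_difference_gradient([0.5, -1.0], states))
    print(ascend([0.5, -1.0], states))
```
</verbatim></p>
        <p>No sampling over actions is needed here, since the policy outputs a single action per state; in a full algorithm the closed-form <text font="typewriter">q_value</text> would be replaced by a learned critic.</p>
      </para>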
      <para xml:id="S6.SS1.p19">
        <p><text font="bold">Message 15:</text>
Learning a surrogate model of utility in the parameter space and ascending its gradient can be turned into learning a critic and ascending the utility gradient based on that critic.</p>
      </para>
    </subsection>
    <subsection inlist="toc" labels="LABEL:mes:ac_mes LABEL:mes:drl_bias LABEL:sec:bias_var" xml:id="S6.SS2">
      <tags>
        <tag>6.2</tag>
        <tag role="autoref">subsection 6.2</tag>
        <tag role="refnum">6.2</tag>
        <tag role="typerefnum">§6.2</tag>
      </tags>
      <title><tag close=" ">6.2</tag>Trading bias against variance</title>
      <para xml:id="S6.SS2.p1">
        <p>The insight in (<ref labelref="LABEL:eq:params2xu2"/>) is used in the <text font="smallcaps">reinforce</text> algorithm <cite class="ltx_citemacro_citep">(<bibref bibrefs="williams1992" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.
However, the policy gradient estimate computed from (<ref labelref="LABEL:eq:params2xu2"/>) suffers from a variance that grows with the length of the episodes.</p>
      </para>
      <para xml:id="S6.SS2.p2">
        <p>One way to reduce this variance is to choose the baseline function in (<ref labelref="LABEL:eq:pgt"/>) appropriately.
In particular, the optimal baseline, that is, the baseline that minimizes the variance without introducing bias, is the <text font="italic">value</text> function <Math mode="inline" tex="V^{\pol{{}_{\mathbf{\bm{\theta}}}}}(\mathbf{\bm{x}})" text="V ^ (pol@(_theta)) * x" xml:id="S6.SS2.p2.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">V</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                    <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                      <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                    </XMApp>
                  </XMApp>
                </XMApp>
                <XMDual>
                  <XMRef idref="S6.SS2.p2.m1.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS2.p2.m1.1">x</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math>
(see e.g. <cite class="ltx_citemacro_citep">(<bibref bibrefs="schulman2015high" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>). Thus, the <Math mode="inline" tex="U" text="U" xml:id="S6.SS2.p2.m2">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">U</XMTok>
            </XMath>
          </Math> function which optimally reduces the variance is <Math mode="inline" tex="U=A^{\pol{{}_{\mathbf{\bm{\theta}}}}}(\mathbf{\bm{x}},\mathbf{\bm{u}})=Q^{\pol%&#10;{{}_{\mathbf{\bm{\theta}}}}}(\mathbf{\bm{x}},\mathbf{\bm{u}})-V^{\pol{{}_{%&#10;\mathbf{\bm{\theta}}}}}(\mathbf{\bm{x}})" text="U = A ^ (pol@(_theta)) * open-interval@(x, u) = Q ^ (pol@(_theta)) * open-interval@(x, u) - V ^ (pol@(_theta)) * x" xml:id="S6.SS2.p2.m3">
            <XMath>
              <XMApp>
                <XMTok meaning="multirelation"/>
                <XMTok font="italic" role="UNKNOWN">U</XMTok>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">A</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                      <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                        <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S6.SS2.p2.m3.1"/>
                      <XMRef idref="S6.SS2.p2.m3.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS2.p2.m3.1">x</XMTok>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS2.p2.m3.2">u</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMApp>
                  <XMTok meaning="minus" role="ADDOP">-</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                      <XMApp>
                        <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                        <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                          <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                        </XMApp>
                      </XMApp>
                    </XMApp>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="open-interval"/>
                        <XMRef idref="S6.SS2.p2.m3.3"/>
                        <XMRef idref="S6.SS2.p2.m3.4"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS2.p2.m3.3">x</XMTok>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS2.p2.m3.4">u</XMTok>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">V</XMTok>
                      <XMApp>
                        <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                        <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                          <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                        </XMApp>
                      </XMApp>
                    </XMApp>
                    <XMDual>
                      <XMRef idref="S6.SS2.p2.m3.5"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS2.p2.m3.5">x</XMTok>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>, which is known as the <text font="italic">advantage</text> function.</p>
      </para>
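      <para>
        <p>As a minimal sketch of the advantage function defined above, assume a tabular critic for a single state and a stochastic policy, so that the state value is the policy-weighted average of the action values. The function name advantages and all numerical values are hypothetical and chosen only for illustration.</p>
        <verbatim font="typewriter">
```python
from typing import List


def advantages(q_row: List[float], pi_row: List[float]) -> List[float]:
    """Return A(x, u) = Q(x, u) - V(x) for every action u in one state x.

    Assumption: the policy is stochastic and V(x) is the expectation of
    Q(x, u) under pi(u | x); this tabular setting is illustrative only.
    """
    v = sum(p * q for p, q in zip(pi_row, q_row))  # V(x)
    return [q - v for q in q_row]


# Hypothetical values for one state with two actions.
q_values = [1.0, 3.0]
policy = [0.5, 0.5]
adv = advantages(q_values, policy)
# Under this assumption the policy-weighted mean of the advantages is zero.
expected_adv = sum(p * a for p, a in zip(policy, adv))
```
        </verbatim>
        <p>Under these assumptions the advantage measures how much better or worse an action is than the average behaviour of the policy in that state, which is why it acts as a centered learning signal.</p>
      </para>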
      <para xml:id="S6.SS2.p3">
        <p>Using (<ref labelref="LABEL:eq:pgtw"/>) helps reduce the variance inherent in (<ref labelref="LABEL:eq:params2xu2"/>), but it may introduce some bias, which means that the resulting policy may not be adequately optimized or may even diverge. To reduce the variance without introducing bias, one can rely on a <text font="italic">compatibility condition</text> between the features of <Math mode="inline" tex="\pol{{}_{\mathbf{\bm{\theta}}}}" text="pol@(_theta)" xml:id="S6.SS2.p3.m1">
            <XMath>
              <XMApp>
                <XMTok name="pol" role="OVERACCENT">→</XMTok>
                <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                  <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>, and those of <Math mode="inline" tex="\hat{U}_{\eta}" text="(hat@(U)) _ eta" xml:id="S6.SS2.p3.m2">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">U</XMTok>
                </XMApp>
                <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
              </XMApp>
            </XMath>
          </Math> <cite class="ltx_citemacro_citep">(<bibref bibrefs="sutton00_NIPS" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.
In the case where the critic is represented with a linear architecture (see Section <ref labelref="LABEL:sec:regression"/>), using compatible features and estimating the advantage function as a critic results in performing natural gradient optimization on the actor <cite class="ltx_citemacro_citep">(<bibref bibrefs="peters2008reinforcement" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S6.SS2.p4">
        <p><text font="bold">Message 16:</text>
Transforming derivative-based optimization on <Math mode="inline" tex="\hat{J}({\mathbf{\bm{\theta}}})" text="hat@(J) * theta" xml:id="S6.SS2.p4.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">J</XMTok>
                </XMApp>
                <XMDual>
                  <XMRef idref="S6.SS2.p4.m1.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S6.SS2.p4.m1.1">θ</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> into derivative-based optimization on <Math mode="inline" tex="\pol{{}_{\mathbf{\bm{\theta}}}}" text="pol@(_theta)" xml:id="S6.SS2.p4.m2">
            <XMath>
              <XMApp>
                <XMTok name="pol" role="OVERACCENT">→</XMTok>
                <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                  <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>, using the optimal baseline,
introducing an actor-critic approach and finally optimizing the natural gradient are four steps that all improve the sample
efficiency of episodic policy search, mainly by reducing the variance, potentially at the price of some bias.</p>
      </para>
      <para xml:id="S6.SS2.p5">
        <p><text font="bold">Message 17:</text>
Most deep RL approaches build on these concepts to achieve high sample efficiency.</p>
      </para>
    </subsection>
    <subsection inlist="toc" labels="LABEL:sec:learn_critic" xml:id="S6.SS3">
      <tags>
        <tag>6.3</tag>
        <tag role="autoref">subsection 6.3</tag>
        <tag role="refnum">6.3</tag>
        <tag role="typerefnum">§6.3</tag>
      </tags>
      <title><tag close=" ">6.3</tag>Learning a critic</title>
      <para xml:id="S6.SS3.p1">
        <p>A critic can be learned in two ways: with a bootstrap method or with batch regression methods.
In addition, the former can be combined with regression towards a target critic.</p>
      </para>
      <subsubsection inlist="toc" labels="LABEL:mes:bootstrap_regress LABEL:sec:bootstrap" xml:id="S6.SS3.SSS1">
        <tags>
          <tag>6.3.1</tag>
          <tag role="autoref">subsubsection 6.3.1</tag>
          <tag role="refnum">6.3.1</tag>
          <tag role="typerefnum">§6.3.1</tag>
        </tags>
        <title><tag close=" ">6.3.1</tag>Bootstrap approximation of a critic</title>
        <para xml:id="S6.SS3.SSS1.p1">
          <p>The temporal difference approach to estimating <Math mode="inline" tex="\hat{U}_{\eta}" text="(hat@(U)) _ eta" xml:id="S6.SS3.SSS1.p1.m1">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">U</XMTok>
                  </XMApp>
                  <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                </XMApp>
              </XMath>
            </Math> is known as a bootstrap method <cite class="ltx_citemacro_citep">(<bibref bibrefs="sutton88" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>. It can be explained as follows.
Suppose we obtain a new step-sample <Math mode="inline" tex="s_{k}=&lt;\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k},j_{k},\mathbf{\bm{x}}_{k+1}&gt;" xml:id="S6.SS3.SSS1.p1.m2">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">s</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok meaning="less-than" role="RELOP">&lt;</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">u</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">j</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMApp>
                <XMTok meaning="greater-than" role="RELOP">&gt;</XMTok>
              </XMath>
            </Math>.
In the discounted reward case, by using Bellman’s principle, one can show that, if the critic <Math mode="inline" tex="\hat{U}" text="hat@(U)" xml:id="S6.SS3.SSS1.p1.m3">
              <XMath>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">U</XMTok>
                </XMApp>
              </XMath>
            </Math> were accurate, we would have
</p>
        </para>
        <para xml:id="S6.SS3.SSS1.p2">
          <equation xml:id="S6.Ex4">
            <Math mode="display" tex="\hat{U}(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})=j(\mathbf{\bm{x}}_{k},\mathbf%&#10;{\bm{u}}_{k})+\gamma\textmd{max}_{\mathbf{\bm{u}}}~{}\hat{U}(\mathbf{\bm{x}}_{%&#10;k+1},\mathbf{\bm{u}})." text="hat@(U) * open-interval@(x _ k, u _ k) = j * open-interval@(x _ k, u _ k) + gamma * [max] _ u * hat@(U) * open-interval@(x _ (k + 1), u)" xml:id="S6.Ex4.m1">
              <XMath>
                <XMDual>
                  <XMRef idref="S6.Ex4.m1.2"/>
                  <XMWrap>
                    <XMApp xml:id="S6.Ex4.m1.2">
                      <XMTok meaning="equals" role="RELOP">=</XMTok>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMApp>
                          <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                          <XMTok font="italic" role="UNKNOWN">U</XMTok>
                        </XMApp>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="open-interval"/>
                            <XMRef idref="S6.Ex4.m1.2.1"/>
                            <XMRef idref="S6.Ex4.m1.2.2"/>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false">(</XMTok>
                            <XMApp xml:id="S6.Ex4.m1.2.1">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="bold" role="UNKNOWN">x</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                            </XMApp>
                            <XMTok role="PUNCT">,</XMTok>
                            <XMApp xml:id="S6.Ex4.m1.2.2">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="bold" role="UNKNOWN">u</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                            </XMApp>
                            <XMTok role="CLOSE" stretchy="false">)</XMTok>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="plus" role="ADDOP">+</XMTok>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMTok font="italic" role="UNKNOWN">j</XMTok>
                          <XMDual>
                            <XMApp>
                              <XMTok meaning="open-interval"/>
                              <XMRef idref="S6.Ex4.m1.2.3"/>
                              <XMRef idref="S6.Ex4.m1.2.4"/>
                            </XMApp>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false">(</XMTok>
                              <XMApp xml:id="S6.Ex4.m1.2.3">
                                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                <XMTok font="bold" role="UNKNOWN">x</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                              </XMApp>
                              <XMTok role="PUNCT">,</XMTok>
                              <XMApp xml:id="S6.Ex4.m1.2.4">
                                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                <XMTok font="bold" role="UNKNOWN">u</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                              </XMApp>
                              <XMTok role="CLOSE" stretchy="false">)</XMTok>
                            </XMWrap>
                          </XMDual>
                        </XMApp>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
                          <XMApp rpadding="3.3pt">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMText>max</XMText>
                            <XMTok font="bold" fontsize="70%" role="UNKNOWN">u</XMTok>
                          </XMApp>
                          <XMApp>
                            <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                            <XMTok font="italic" role="UNKNOWN">U</XMTok>
                          </XMApp>
                          <XMDual>
                            <XMApp>
                              <XMTok meaning="open-interval"/>
                              <XMRef idref="S6.Ex4.m1.2.5"/>
                              <XMRef idref="S6.Ex4.m1.1"/>
                            </XMApp>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false">(</XMTok>
                              <XMApp xml:id="S6.Ex4.m1.2.5">
                                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                <XMTok font="bold" role="UNKNOWN">x</XMTok>
                                <XMApp>
                                  <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                                </XMApp>
                              </XMApp>
                              <XMTok role="PUNCT">,</XMTok>
                              <XMTok font="bold" role="UNKNOWN" xml:id="S6.Ex4.m1.1">u</XMTok>
                              <XMTok role="CLOSE" stretchy="false">)</XMTok>
                            </XMWrap>
                          </XMDual>
                        </XMApp>
                      </XMApp>
                    </XMApp>
                    <XMTok role="PERIOD">.</XMTok>
                  </XMWrap>
                </XMDual>
              </XMath>
            </Math>
          </equation>
        </para>
        <para xml:id="S6.SS3.SSS1.p3">
          <p>If the equality does not hold, <Math mode="inline" tex="\hat{U}" text="hat@(U)" xml:id="S6.SS3.SSS1.p3.m1">
              <XMath>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">U</XMTok>
                </XMApp>
              </XMath>
            </Math> is inaccurate.
To correct it at <Math mode="inline" tex="\mathbf{\bm{x}}_{k}" text="x _ k" xml:id="S6.SS3.SSS1.p3.m2">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                </XMApp>
              </XMath>
            </Math>, one can use the <text font="italic">temporal difference error</text> or <text font="italic">reward prediction error</text> (RPE) <Math mode="inline" tex="\delta" text="delta" xml:id="S6.SS3.SSS1.p3.m3">
              <XMath>
                <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
              </XMath>
            </Math> defined as</p>
        </para>
        <para xml:id="S6.SS3.SSS1.p4">
          <equation labels="LABEL:eq:td_error_U" xml:id="S6.E12">
            <tags>
              <tag>(12)</tag>
              <tag role="autoref">Equation 12</tag>
              <tag role="refnum">12</tag>
            </tags>
            <Math mode="display" tex="\delta=j(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})+\gamma\hat{U}_{\eta}(\mathbf%&#10;{\bm{x}}_{k+1},\mathbf{\bm{w}})-\hat{U}_{\eta}(\mathbf{\bm{x}}_{k},\mathbf{\bm%&#10;{u}}_{k})" text="delta = (j * open-interval@(x _ k, u _ k) + gamma * (hat@(U)) _ eta * open-interval@(x _ (k + 1), w)) - (hat@(U)) _ eta * open-interval@(x _ k, u _ k)" xml:id="S6.E12.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
                  <XMApp>
                    <XMTok meaning="minus" role="ADDOP">-</XMTok>
                    <XMApp>
                      <XMTok meaning="plus" role="ADDOP">+</XMTok>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMTok font="italic" role="UNKNOWN">j</XMTok>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="open-interval"/>
                            <XMRef idref="S6.E12.m1.2"/>
                            <XMRef idref="S6.E12.m1.3"/>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false">(</XMTok>
                            <XMApp xml:id="S6.E12.m1.2">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="bold" role="UNKNOWN">x</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                            </XMApp>
                            <XMTok role="PUNCT">,</XMTok>
                            <XMApp xml:id="S6.E12.m1.3">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="bold" role="UNKNOWN">u</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                            </XMApp>
                            <XMTok role="CLOSE" stretchy="false">)</XMTok>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMApp>
                            <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                            <XMTok font="italic" role="UNKNOWN">U</XMTok>
                          </XMApp>
                          <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                        </XMApp>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="open-interval"/>
                            <XMRef idref="S6.E12.m1.4"/>
                            <XMRef idref="S6.E12.m1.1"/>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false">(</XMTok>
                            <XMApp xml:id="S6.E12.m1.4">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="bold" role="UNKNOWN">x</XMTok>
                              <XMApp>
                                <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                                <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                              </XMApp>
                            </XMApp>
                            <XMTok role="PUNCT">,</XMTok>
                            <XMTok font="bold" role="UNKNOWN" xml:id="S6.E12.m1.1">w</XMTok>
                            <XMTok role="CLOSE" stretchy="false">)</XMTok>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMApp>
                          <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                          <XMTok font="italic" role="UNKNOWN">U</XMTok>
                        </XMApp>
                        <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                      </XMApp>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S6.E12.m1.5"/>
                          <XMRef idref="S6.E12.m1.6"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMApp xml:id="S6.E12.m1.5">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="bold" role="UNKNOWN">x</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                          </XMApp>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMApp xml:id="S6.E12.m1.6">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="bold" role="UNKNOWN">u</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                          </XMApp>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>
          </equation>
        </para>
        <para xml:id="S6.SS3.SSS1.p5">
          <p>with either <Math mode="inline" tex="\mathbf{\bm{w}}=\textmd{argmax}_{\mathbf{\bm{u}}}~{}\hat{U}_{\eta}(\mathbf{\bm%&#10;{x}}_{k+1},\mathbf{\bm{u}})" text="w = [argmax] _ u * (hat@(U)) _ eta * open-interval@(x _ (k + 1), u)" xml:id="S6.SS3.SSS1.p5.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMTok font="bold" role="UNKNOWN">w</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMApp rpadding="3.3pt">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMText>argmax</XMText>
                      <XMTok font="bold" fontsize="70%" role="UNKNOWN">u</XMTok>
                    </XMApp>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMApp>
                        <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                        <XMTok font="italic" role="UNKNOWN">U</XMTok>
                      </XMApp>
                      <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                    </XMApp>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="open-interval"/>
                        <XMRef idref="S6.SS3.SSS1.p5.m1.2"/>
                        <XMRef idref="S6.SS3.SSS1.p5.m1.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="S6.SS3.SSS1.p5.m1.2">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">x</XMTok>
                          <XMApp>
                            <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                            <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS3.SSS1.p5.m1.1">u</XMTok>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>, as in <text font="smallcaps">q-learning</text>, or <Math mode="inline" tex="\mathbf{\bm{w}}=\mathbf{\bm{u}}_{k+1}" text="w = u _ (k + 1)" xml:id="S6.SS3.SSS1.p5.m2">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMTok font="bold" role="UNKNOWN">w</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="bold" role="UNKNOWN">u</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>, as in <text font="smallcaps">Sarsa</text>.</p>
        </para>
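        <para>
          <p>The temporal difference error above can be sketched as follows, assuming a tabular critic stored as a nested list indexed by integer states and actions. The function name td_error, the table q and all numerical values are hypothetical and chosen only for illustration. When no next action is supplied, w is taken as the greedy action at the next state, as in q-learning; otherwise w is the action actually taken, as in Sarsa.</p>
          <verbatim font="typewriter">
```python
from typing import Optional, Sequence


def td_error(
    q_table: Sequence[Sequence[float]],
    x_k: int,
    u_k: int,
    j_k: float,
    x_next: int,
    gamma: float,
    u_next: Optional[int] = None,
) -> float:
    """Temporal difference error for one step-sample (x_k, u_k, j_k, x_next).

    If u_next is None, w is the greedy action at x_next (q-learning);
    otherwise w = u_next, the action actually taken (Sarsa).
    """
    if u_next is None:
        # Value of the greedy action w = argmax_u U(x_next, u).
        bootstrap = max(q_table[x_next])
    else:
        # Value of the action actually taken, w = u_next.
        bootstrap = q_table[x_next][u_next]
    return j_k + gamma * bootstrap - q_table[x_k][u_k]


# Hypothetical critic with two states and two actions (illustration only).
q = [[0.0, 1.0], [2.0, 4.0]]
delta_q_learning = td_error(q, x_k=0, u_k=1, j_k=1.0, x_next=1, gamma=0.5)
delta_sarsa = td_error(q, x_k=0, u_k=1, j_k=1.0, x_next=1, gamma=0.5, u_next=0)
```
          </verbatim>
          <p>In this toy setting the two variants differ only in the bootstrap term: the greedy choice uses the largest estimated value at the next state, whereas the on-policy choice uses the value of the action that was actually executed.</p>
        </para>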
        <para xml:id="S6.SS3.SSS1.p6">
          <p>If the temporal difference error <Math mode="inline" tex="\delta" text="delta" xml:id="S6.SS3.SSS1.p6.m1">
              <XMath>
                <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
              </XMath>
            </Math> is positive, <Math mode="inline" tex="j(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})" text="j * open-interval@(x _ k, u _ k)" xml:id="S6.SS3.SSS1.p6.m2">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">j</XMTok>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S6.SS3.SSS1.p6.m2.1"/>
                      <XMRef idref="S6.SS3.SSS1.p6.m2.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S6.SS3.SSS1.p6.m2.1">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">x</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMApp xml:id="S6.SS3.SSS1.p6.m2.2">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">u</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math> was greater than expected and <Math mode="inline" tex="\hat{U}" text="hat@(U)" xml:id="S6.SS3.SSS1.p6.m3">
              <XMath>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">U</XMTok>
                </XMApp>
              </XMath>
            </Math> should be increased.
If it is negative, the reward was smaller than expected and <Math mode="inline" tex="\hat{U}" text="hat@(U)" xml:id="S6.SS3.SSS1.p6.m4">
              <XMath>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">U</XMTok>
                </XMApp>
              </XMath>
            </Math> should be decreased.
Thus one can improve <Math mode="inline" tex="\hat{U}" text="hat@(U)" xml:id="S6.SS3.SSS1.p6.m5">
              <XMath>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">U</XMTok>
                </XMApp>
              </XMath>
            </Math> by applying</p>
        </para>
        <para xml:id="S6.SS3.SSS1.p7">
          <equation labels="LABEL:eq:td_update" xml:id="S6.E13">
            <tags>
              <tag>(13)</tag>
              <tag role="autoref">Equation 13</tag>
              <tag role="refnum">13</tag>
            </tags>
            <Math mode="display" tex="\hat{U}_{\eta}(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})\leftarrow\hat{U}_{\eta%&#10;}(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})+\alpha\delta" text="(hat@(U)) _ eta * open-interval@(x _ k, u _ k) leftarrow (hat@(U)) _ eta * open-interval@(x _ k, u _ k) + alpha * delta" xml:id="S6.E13.m1">
              <XMath>
                <XMApp>
                  <XMTok name="leftarrow" role="ARROW">←</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMApp>
                        <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                        <XMTok font="italic" role="UNKNOWN">U</XMTok>
                      </XMApp>
                      <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                    </XMApp>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="open-interval"/>
                        <XMRef idref="S6.E13.m1.1"/>
                        <XMRef idref="S6.E13.m1.2"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="S6.E13.m1.1">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">x</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                        </XMApp>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMApp xml:id="S6.E13.m1.2">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">u</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="plus" role="ADDOP">+</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMApp>
                          <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                          <XMTok font="italic" role="UNKNOWN">U</XMTok>
                        </XMApp>
                        <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                      </XMApp>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S6.E13.m1.3"/>
                          <XMRef idref="S6.E13.m1.4"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMApp xml:id="S6.E13.m1.3">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="bold" role="UNKNOWN">x</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                          </XMApp>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMApp xml:id="S6.E13.m1.4">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="bold" role="UNKNOWN">u</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                          </XMApp>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" name="alpha" role="UNKNOWN">α</XMTok>
                      <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
                    </XMApp>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>
          </equation>
        </para>
        <para xml:id="S6.SS3.SSS1.p8">
          <p>where <Math mode="inline" tex="\alpha" text="alpha" xml:id="S6.SS3.SSS1.p8.m1">
              <XMath>
                <XMTok font="italic" name="alpha" role="UNKNOWN">α</XMTok>
              </XMath>
            </Math> is a learning rate.</p>
        </para>
        <para xml:id="S6.SS3.SSS1.p9">
          <p>Using a learning rate <Math mode="inline" tex="\alpha" text="alpha" xml:id="S6.SS3.SSS1.p9.m1">
              <XMath>
                <XMTok font="italic" name="alpha" role="UNKNOWN">α</XMTok>
              </XMath>
            </Math> in (<ref labelref="LABEL:eq:td_update"/>) implies that the same sample can be reused many times.
If (<ref labelref="LABEL:eq:td_update"/>) is used repeatedly with the same sample, the corresponding value of <Math mode="inline" tex="\delta" text="delta" xml:id="S6.SS3.SSS1.p9.m2">
              <XMath>
                <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
              </XMath>
            </Math> converges to 0 at a rate that depends on <Math mode="inline" tex="\alpha" text="alpha" xml:id="S6.SS3.SSS1.p9.m3">
              <XMath>
                <XMTok font="italic" name="alpha" role="UNKNOWN">α</XMTok>
              </XMath>
            </Math>.
More importantly, <Math mode="inline" tex="\hat{U}_{\eta}(\mathbf{\bm{x}}_{i},\mathbf{\bm{u}}_{i})" text="(hat@(U)) _ eta * open-interval@(x _ i, u _ i)" xml:id="S6.SS3.SSS1.p9.m4">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMApp>
                      <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                      <XMTok font="italic" role="UNKNOWN">U</XMTok>
                    </XMApp>
                    <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                  </XMApp>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S6.SS3.SSS1.p9.m4.1"/>
                      <XMRef idref="S6.SS3.SSS1.p9.m4.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S6.SS3.SSS1.p9.m4.1">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">x</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      </XMApp>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMApp xml:id="S6.SS3.SSS1.p9.m4.2">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">u</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math> should change depending on other <Math mode="inline" tex="\hat{U}_{\eta}(\mathbf{\bm{x}}_{j},\mathbf{\bm{u}}_{j})" text="(hat@(U)) _ eta * open-interval@(x _ j, u _ j)" xml:id="S6.SS3.SSS1.p9.m5">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMApp>
                      <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                      <XMTok font="italic" role="UNKNOWN">U</XMTok>
                    </XMApp>
                    <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                  </XMApp>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S6.SS3.SSS1.p9.m5.1"/>
                      <XMRef idref="S6.SS3.SSS1.p9.m5.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S6.SS3.SSS1.p9.m5.1">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">x</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                      </XMApp>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMApp xml:id="S6.SS3.SSS1.p9.m5.2">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">u</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math> values it is connected to.
This phenomenon is known as <text font="italic">value propagation</text> and is at the heart of the ability of bootstrap methods to reuse the same samples several times.</p>
        </para>
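        <para>
          <p>As an illustration of the update in (<ref labelref="LABEL:eq:td_update"/>), the following minimal Python sketch applies it several times to one sample and then once to the sample that precedes it. It is only a sketch under stated assumptions: a tabular critic stored in a dictionary, a hypothetical three-state chain, a reward-style signal, and a bootstrap term evaluated at the next state-action pair. The name td_update and the values of ALPHA and GAMMA are illustrative and do not come from the text.</p>
          <verbatim font="typewriter">
```python
# Hypothetical chain: s0 -> s1 -> s2, where s2 is terminal.
GAMMA = 0.9   # discount factor (assumed value)
ALPHA = 0.5   # learning rate (assumed value)

def td_update(U, sample, alpha=ALPHA, gamma=GAMMA):
    """Apply U(x, u) += alpha * delta once; return delta as computed before the update."""
    x, u, reward, x_next, u_next, terminal = sample
    # No bootstrap term after a terminal transition.
    bootstrap = 0.0 if terminal else U[(x_next, u_next)]
    delta = reward + gamma * bootstrap - U[(x, u)]
    U[(x, u)] += alpha * delta
    return delta

# Tabular critic, initialised to zero (assumption).
U = {("s0", "a"): 0.0, ("s1", "a"): 0.0}
first = ("s0", "a", 0.0, "s1", "a", False)  # no reward on the first step
last = ("s1", "a", 1.0, "s2", "a", True)    # reward when reaching the terminal state

# Reusing the same sample: delta shrinks by a factor (1 - alpha) at each update.
deltas = [td_update(U, last) for _ in range(5)]

# Value propagation: replaying `first` now changes U(s0, a),
# although the sample itself has not changed.
delta_first = td_update(U, first)
```
          </verbatim>
          <p>In this sketch, the successive values of delta for the reused sample are 1, 0.5, 0.25, 0.125 and 0.0625. Replaying the first sample afterwards yields a nonzero delta only because the estimated value of its successor has changed in the meantime.</p>
        </para>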
        <para xml:id="S6.SS3.SSS1.p10">
          <p><text font="bold">Message 18:</text>
Bootstrap methods generally give rise to more sample reuse than standard regression methods.</p>
        </para>
      </subsubsection>
      <subsubsection inlist="toc" labels="LABEL:mes:replay LABEL:sec:replay" xml:id="S6.SS3.SSS2">
        <tags>
          <tag>6.3.2</tag>
          <tag role="autoref">subsubsection 6.3.2</tag>
          <tag role="refnum">6.3.2</tag>
          <tag role="typerefnum">§6.3.2</tag>
        </tags>
        <title><tag close=" ">6.3.2</tag>Using a shuffled replay buffer</title>
        <para xml:id="S6.SS3.SSS2.p1">
          <p>Given that samples can be reused many times in bootstrap methods, one can collect a large set of samples into a <text font="italic">replay buffer</text> and process them any number of times to learn a critic.
However, using them in the order in which they were collected is detrimental to learning performance.
Indeed, model learning can be shown to perform better when the samples are independent and identically distributed (i.i.d.),
which is not the case for successive samples obtained along an episode.
The correlation between successive samples is one of the sources of the instability of RL algorithms in continuous domains <cite class="ltx_citemacro_citep">(<bibref bibrefs="baird94" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.
This correlation can be removed by shuffling the samples in the replay buffer, that is, by drawing them at random rather than in their order of collection.
This idea played a key role in the success of the <text font="smallcaps">dqn</text> algorithm <cite class="ltx_citemacro_citep">(<bibref bibrefs="mnih2015human" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite> and is now used in most deep RL algorithms.</p>
        </para>
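        <para>
          <p>The shuffled replay buffer described above can be sketched as follows. This is a minimal Python illustration, not the implementation used in any particular algorithm: the class name ReplayBuffer, the fixed capacity with oldest-first eviction, and the placeholder transitions are all assumptions made for the example.</p>
          <verbatim font="typewriter">
```python
import random
from collections import deque

class ReplayBuffer:
    """Fixed-capacity store of transitions, sampled uniformly at random."""

    def __init__(self, capacity, seed=None):
        # A bounded deque drops the oldest transition once capacity is reached.
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, transition):
        self.storage.append(transition)

    def sample(self, batch_size):
        # Drawing without replacement from the whole buffer breaks the
        # temporal ordering of transitions collected along an episode.
        return self.rng.sample(list(self.storage), batch_size)

buffer = ReplayBuffer(capacity=1000, seed=0)
for k in range(100):                    # one hypothetical episode of 100 steps
    buffer.add((k, "u", 0.0, k + 1))    # placeholder (x_k, u_k, reward, x_next)
batch = buffer.sample(8)                # 8 transitions, no longer consecutive in general
```
          </verbatim>
          <p>Each call to sample returns a minibatch whose elements are drawn from the whole buffer, so the critic can be trained on it any number of times without following the order of collection.</p>
        </para>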
        <para xml:id="S6.SS3.SSS2.p2">
          <p>Finally, compared to drawing the samples uniformly at random, the sample efficiency of bootstrap methods can be further improved using <text font="italic">prioritized experience replay</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="schaul2015prioritized" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.</p>
        </para>
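        <para>
          <p>One way to prioritize replay is to draw transitions with a probability that grows with the magnitude of their last prediction error. The Python sketch below illustrates this proportional scheme under simplifying assumptions: the function name prioritized_sample, the exponent and the small constant eps are illustrative choices, and the importance-sampling correction that usually accompanies such non-uniform sampling is omitted.</p>
          <verbatim font="typewriter">
```python
import random

def prioritized_sample(td_errors, batch_size, exponent=0.6, eps=1e-3, rng=random):
    """Draw indices with probability proportional to (|delta| + eps) ** exponent."""
    # eps keeps every transition reachable even when its error is zero;
    # exponent = 0 recovers uniform sampling.
    priorities = [(abs(d) + eps) ** exponent for d in td_errors]
    indices = rng.choices(range(len(td_errors)), weights=priorities, k=batch_size)
    return indices, priorities

# Hypothetical stored errors: the third transition is far more surprising.
td_errors = [0.01, 0.02, 5.0, 0.01]
indices, priorities = prioritized_sample(td_errors, batch_size=1000, rng=random.Random(0))
share_of_surprising = indices.count(2) / len(indices)
```
          </verbatim>
          <p>With these hypothetical errors, the transition with the largest error accounts for most of the draws, whereas uniform sampling would select it about one time in four.</p>
        </para>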
        <para xml:id="S6.SS3.SSS2.p3">
          <p><text font="bold">Message 19:</text>
Using a replay buffer dramatically improves sample reuse.</p>
        </para>
      </subsubsection>
      <subsubsection inlist="toc" labels="LABEL:mes:shuffle LABEL:sec:target_network" xml:id="S6.SS3.SSS3">
        <tags>
          <tag>6.3.3</tag>
          <tag role="autoref">subsubsection 6.3.3</tag>
          <tag role="refnum">6.3.3</tag>
          <tag role="typerefnum">§6.3.3</tag>
        </tags>
        <title><tag close=" ">6.3.3</tag>Regression towards a target critic</title>
        <para xml:id="S6.SS3.SSS3.p1">
          <p>In addition to a replay buffer, deep RL methods introduced incremental regression towards a target critic that is updated either periodically or smoothly.
To understand this approach and its relationship to standard bootstrap learning, one should reconsider incremental regression (Section <ref labelref="LABEL:sec:regression"/>).</p>
        </para>
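        <para>
          <p>The two ways of updating a target critic mentioned above can be sketched in Python as follows. The sketch assumes that the critic and the target critic are each represented by a flat list of parameters; the function names hard_update and soft_update and the mixing coefficient tau are hypothetical and serve only to illustrate the distinction.</p>
          <verbatim font="typewriter">
```python
def hard_update(target_params, critic_params):
    """Periodic update: overwrite the target critic with a copy of the critic."""
    target_params[:] = critic_params

def soft_update(target_params, critic_params, tau=0.01):
    """Smooth update: move each target parameter a fraction tau toward the critic."""
    for i, (t, c) in enumerate(zip(target_params, critic_params)):
        target_params[i] = (1.0 - tau) * t + tau * c

critic = [1.0, -2.0, 0.5]   # hypothetical critic parameters
target = [0.0, 0.0, 0.0]    # hypothetical target critic parameters

soft_update(target, critic, tau=0.1)  # target moves 10% of the way toward critic
hard_update(target, critic)           # target becomes an exact copy of critic
```
          </verbatim>
          <p>In the periodic variant, hard_update would be called once every fixed number of critic updates, whereas in the smooth variant soft_update would be called after every critic update with a small tau. In both cases the regression target changes more slowly than the critic itself.</p>
        </para>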
        <para xml:id="S6.SS3.SSS3.p2">
          <p>In (<ref labelref="LABEL:eq:loss"/>), the goal is to minimize a positive loss <Math mode="inline" tex="\epsilon(\mathbf{\bm{\omega}}_{i})" text="epsilon * omega _ i" xml:id="S6.SS3.SSS3.p2.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                  <XMDual>
                    <XMRef idref="S6.SS3.SSS3.p2.m1.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S6.SS3.SSS3.p2.m1.1">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold italic" name="omega" role="UNKNOWN">ω</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math> defined as a function of the current model <Math mode="inline" tex="\hat{f}_{\mathbf{\bm{\omega}}_{i}}" text="(hat@(f)) _ omega _ i" xml:id="S6.SS3.SSS3.p2.m2">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">f</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                    <XMTok font="bold italic" fontsize="70%" name="omega" role="UNKNOWN">ω</XMTok>
                    <XMTok font="italic" fontsize="50%" role="UNKNOWN">i</XMTok>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math> at known points <Math mode="inline" tex="\mathbf{\bm{\phi}}_{j}" text="phi _ j" xml:id="S6.SS3.SSS3.p2.m3">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                </XMApp>
              </XMath>
            </Math>.
In (<ref labelref="LABEL:eq:td_error_U"/>), which defines bootstrap learning, the goal is to drive the reward prediction error <Math mode="inline" tex="\delta" text="delta" xml:id="S6.SS3.SSS3.p2.m4">
              <XMath>
                <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
              </XMath>
            </Math> to 0.</p>
        </para>
        <para xml:id="S6.SS3.SSS3.p3">
          <p>Equation (<ref labelref="LABEL:eq:td_error_U"/>) can be made equivalent to (<ref labelref="LABEL:eq:loss"/>) by applying the following transformations:</p>
        </para>
        <para xml:id="S6.SS3.SSS3.p4">
          <equation xml:id="S6.Ex5">
            <Math mode="display" tex="\left\{\begin{array}[]{ccl}loss(a,b)&amp;=&amp;a-b\\&#10;\hat{f}_{\mathbf{\bm{\omega}}_{i}}(\mathbf{\bm{\phi}}_{j})&amp;=&amp;\hat{U}_{\eta}(%&#10;\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})\\&#10;f(\mathbf{\bm{\phi}}_{j})&amp;=&amp;j(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})+\gamma%&#10;\hat{U}_{\eta}(\mathbf{\bm{x}}_{k+1},\mathbf{\bm{u}}_{k})\\&#10;\epsilon(\mathbf{\bm{\omega}}_{i})&amp;=&amp;\delta\end{array}\right." text="cases@(Array[[l * o * s * s * open-interval@(a, b), =, a - b], [(hat@(f)) _ omega _ i * phi _ j, =, (hat@(U)) _ eta * open-interval@(x _ k, u _ k)], [f * phi _ j, =, j * open-interval@(x _ k, u _ k) + gamma * (hat@(U)) _ eta * open-interval@(x _ (k + 1), u _ k)], [epsilon * omega _ i, =, delta]])" xml:id="S6.Ex5.m1">
              <XMath>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="cases"/>
                    <XMRef idref="S6.Ex5.m1.12"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="true">{</XMTok>
                    <XMArray role="ARRAY" vattach="middle" xml:id="S6.Ex5.m1.12">
                      <XMRow>
                        <XMCell align="center">
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMTok font="italic" role="UNKNOWN">l</XMTok>
                            <XMTok font="italic" role="UNKNOWN">o</XMTok>
                            <XMTok font="italic" role="UNKNOWN">s</XMTok>
                            <XMTok font="italic" role="UNKNOWN">s</XMTok>
                            <XMDual>
                              <XMApp>
                                <XMTok meaning="open-interval"/>
                                <XMRef idref="S6.Ex5.m1.1"/>
                                <XMRef idref="S6.Ex5.m1.2"/>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">(</XMTok>
                                <XMTok font="italic" role="UNKNOWN" xml:id="S6.Ex5.m1.1">a</XMTok>
                                <XMTok role="PUNCT">,</XMTok>
                                <XMTok font="italic" role="UNKNOWN" xml:id="S6.Ex5.m1.2">b</XMTok>
                                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                              </XMWrap>
                            </XMDual>
                          </XMApp>
                        </XMCell>
                        <XMCell align="center">
                          <XMTok meaning="equals" role="RELOP">=</XMTok>
                        </XMCell>
                        <XMCell align="left">
                          <XMApp>
                            <XMTok meaning="minus" role="ADDOP">-</XMTok>
                            <XMTok font="italic" role="UNKNOWN">a</XMTok>
                            <XMTok font="italic" role="UNKNOWN">b</XMTok>
                          </XMApp>
                        </XMCell>
                      </XMRow>
                      <XMRow>
                        <XMCell align="center">
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                              <XMApp>
                                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                                <XMTok font="italic" role="UNKNOWN">f</XMTok>
                              </XMApp>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post8"/>
                                <XMTok font="bold italic" fontsize="70%" name="omega" role="UNKNOWN">ω</XMTok>
                                <XMTok font="italic" fontsize="50%" role="UNKNOWN">i</XMTok>
                              </XMApp>
                            </XMApp>
                            <XMDual>
                              <XMRef idref="S6.Ex5.m1.3"/>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">(</XMTok>
                                <XMApp xml:id="S6.Ex5.m1.3">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                              </XMWrap>
                            </XMDual>
                          </XMApp>
                        </XMCell>
                        <XMCell align="center">
                          <XMTok meaning="equals" role="RELOP">=</XMTok>
                        </XMCell>
                        <XMCell align="left">
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                              <XMApp>
                                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                                <XMTok font="italic" role="UNKNOWN">U</XMTok>
                              </XMApp>
                              <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                            </XMApp>
                            <XMDual>
                              <XMApp>
                                <XMTok meaning="open-interval"/>
                                <XMRef idref="S6.Ex5.m1.4"/>
                                <XMRef idref="S6.Ex5.m1.5"/>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">(</XMTok>
                                <XMApp xml:id="S6.Ex5.m1.4">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="bold" role="UNKNOWN">x</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                                </XMApp>
                                <XMTok role="PUNCT">,</XMTok>
                                <XMApp xml:id="S6.Ex5.m1.5">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="bold" role="UNKNOWN">u</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                              </XMWrap>
                            </XMDual>
                          </XMApp>
                        </XMCell>
                      </XMRow>
                      <XMRow>
                        <XMCell align="center">
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMTok font="italic" role="UNKNOWN">f</XMTok>
                            <XMDual>
                              <XMRef idref="S6.Ex5.m1.6"/>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">(</XMTok>
                                <XMApp xml:id="S6.Ex5.m1.6">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                              </XMWrap>
                            </XMDual>
                          </XMApp>
                        </XMCell>
                        <XMCell align="center">
                          <XMTok meaning="equals" role="RELOP">=</XMTok>
                        </XMCell>
                        <XMCell align="left">
                          <XMApp>
                            <XMTok meaning="plus" role="ADDOP">+</XMTok>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMTok font="italic" role="UNKNOWN">j</XMTok>
                              <XMDual>
                                <XMApp>
                                  <XMTok meaning="open-interval"/>
                                  <XMRef idref="S6.Ex5.m1.7"/>
                                  <XMRef idref="S6.Ex5.m1.8"/>
                                </XMApp>
                                <XMWrap>
                                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                                  <XMApp xml:id="S6.Ex5.m1.7">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="bold" role="UNKNOWN">x</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                                  </XMApp>
                                  <XMTok role="PUNCT">,</XMTok>
                                  <XMApp xml:id="S6.Ex5.m1.8">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="bold" role="UNKNOWN">u</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                                  </XMApp>
                                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                </XMWrap>
                              </XMDual>
                            </XMApp>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMApp>
                                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">U</XMTok>
                                </XMApp>
                                <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                              </XMApp>
                              <XMDual>
                                <XMApp>
                                  <XMTok meaning="open-interval"/>
                                  <XMRef idref="S6.Ex5.m1.9"/>
                                  <XMRef idref="S6.Ex5.m1.10"/>
                                </XMApp>
                                <XMWrap>
                                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                                  <XMApp xml:id="S6.Ex5.m1.9">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="bold" role="UNKNOWN">x</XMTok>
                                    <XMApp>
                                      <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                                    </XMApp>
                                  </XMApp>
                                  <XMTok role="PUNCT">,</XMTok>
                                  <XMApp xml:id="S6.Ex5.m1.10">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="bold" role="UNKNOWN">u</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                                  </XMApp>
                                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                </XMWrap>
                              </XMDual>
                            </XMApp>
                          </XMApp>
                        </XMCell>
                      </XMRow>
                      <XMRow>
                        <XMCell align="center">
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMTok font="italic" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                            <XMDual>
                              <XMRef idref="S6.Ex5.m1.11"/>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">(</XMTok>
                                <XMApp xml:id="S6.Ex5.m1.11">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="bold italic" name="omega" role="UNKNOWN">ω</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                              </XMWrap>
                            </XMDual>
                          </XMApp>
                        </XMCell>
                        <XMCell align="center">
                          <XMTok meaning="equals" role="RELOP">=</XMTok>
                        </XMCell>
                        <XMCell align="left">
                          <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
                        </XMCell>
                      </XMRow>
                    </XMArray>
                    <XMTok meaning="absent"/>
                  </XMWrap>
                </XMDual>
              </XMath>
            </Math>
          </equation>
        </para>
        <para xml:id="S6.SS3.SSS3.p5">
          <p>and considering either a single sample or a collection of samples on both sides.</p>
        </para>
        <para xml:id="S6.SS3.SSS3.p6">
          <p>Under this perspective, (<ref labelref="LABEL:eq:td_error_U"/>) can be seen as a way to perform regression over <Math mode="inline" tex="\hat{U}_{\eta}(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})" text="(hat@(U)) _ eta * open-interval@(x _ k, u _ k)" xml:id="S6.SS3.SSS3.p6.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMApp>
                      <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                      <XMTok font="italic" role="UNKNOWN">U</XMTok>
                    </XMApp>
                    <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                  </XMApp>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S6.SS3.SSS3.p6.m1.1"/>
                      <XMRef idref="S6.SS3.SSS3.p6.m1.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S6.SS3.SSS3.p6.m1.1">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">x</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMApp xml:id="S6.SS3.SSS3.p6.m1.2">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="bold" role="UNKNOWN">u</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math> toward a target value <Math mode="inline" tex="j(\mathbf{\bm{x}}_{k},\mathbf{\bm{u}}_{k})+\gamma\hat{U}_{\eta}(\mathbf{\bm{x}%&#10;}_{k+1},\mathbf{\bm{u}}_{k})" text="j * open-interval@(x _ k, u _ k) + gamma * (hat@(U)) _ eta * open-interval@(x _ (k + 1), u _ k)" xml:id="S6.SS3.SSS3.p6.m2">
              <XMath>
                <XMApp>
                  <XMTok meaning="plus" role="ADDOP">+</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">j</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="open-interval"/>
                        <XMRef idref="S6.SS3.SSS3.p6.m2.1"/>
                        <XMRef idref="S6.SS3.SSS3.p6.m2.2"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="S6.SS3.SSS3.p6.m2.1">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">x</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                        </XMApp>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMApp xml:id="S6.SS3.SSS3.p6.m2.2">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">u</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMApp>
                        <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                        <XMTok font="italic" role="UNKNOWN">U</XMTok>
                      </XMApp>
                      <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                    </XMApp>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="open-interval"/>
                        <XMRef idref="S6.SS3.SSS3.p6.m2.3"/>
                        <XMRef idref="S6.SS3.SSS3.p6.m2.4"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="S6.SS3.SSS3.p6.m2.3">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">x</XMTok>
                          <XMApp>
                            <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                            <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMApp xml:id="S6.SS3.SSS3.p6.m2.4">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="bold" role="UNKNOWN">u</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>. However, this target value is itself a function of <Math mode="inline" tex="\hat{U}_{\eta}" text="(hat@(U)) _ eta" xml:id="S6.SS3.SSS3.p6.m3">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">U</XMTok>
                  </XMApp>
                  <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                </XMApp>
              </XMath>
            </Math>, thus it is modified each time <Math mode="inline" tex="\hat{U}_{\eta}" text="(hat@(U)) _ eta" xml:id="S6.SS3.SSS3.p6.m4">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">U</XMTok>
                  </XMApp>
                  <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                </XMApp>
              </XMath>
            </Math> is modified, whereas
in standard regression, the target function is fixed while it is being approximated.
Alongside dependencies between samples, this moving-target phenomenon is the other main source of instability of RL algorithms in continuous domains, and it can cause the divergence of <Math mode="inline" tex="\hat{U}_{\eta}" text="(hat@(U)) _ eta" xml:id="S6.SS3.SSS3.p6.m5">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">U</XMTok>
                  </XMApp>
                  <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                </XMApp>
              </XMath>
            </Math> <cite class="ltx_citemacro_citep">(<bibref bibrefs="baird94" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.</p>
        </para>
        <para xml:id="S6.SS3.SSS3.p7">
          <p>To mitigate this instability, one can replace the term <Math mode="inline" tex="\hat{U}_{\eta}" text="(hat@(U)) _ eta" xml:id="S6.SS3.SSS3.p7.m1">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">U</XMTok>
                  </XMApp>
                  <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                </XMApp>
              </XMath>
            </Math> in the target function by another function <Math mode="inline" tex="\hat{U}_{\eta}^{\prime}" text="((hat@(U)) _ eta) ^ prime" xml:id="S6.SS3.SSS3.p7.m2">
              <XMath>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMApp>
                      <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                      <XMTok font="italic" role="UNKNOWN">U</XMTok>
                    </XMApp>
                    <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                  </XMApp>
                  <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                </XMApp>
              </XMath>
            </Math>.
If <Math mode="inline" tex="\hat{U}_{\eta}^{\prime}" text="((hat@(U)) _ eta) ^ prime" xml:id="S6.SS3.SSS3.p7.m3">
              <XMath>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMApp>
                      <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                      <XMTok font="italic" role="UNKNOWN">U</XMTok>
                    </XMApp>
                    <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                  </XMApp>
                  <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                </XMApp>
              </XMath>
            </Math> is held constant, then the bootstrap learning problem is turned into a standard regression problem.
But since in theory the target function should be a function of <Math mode="inline" tex="\hat{U}_{\eta}" text="(hat@(U)) _ eta" xml:id="S6.SS3.SSS3.p7.m4">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">U</XMTok>
                  </XMApp>
                  <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                </XMApp>
              </XMath>
            </Math>, one should rather periodically reset <Math mode="inline" tex="\hat{U}_{\eta}^{\prime}" text="((hat@(U)) _ eta) ^ prime" xml:id="S6.SS3.SSS3.p7.m5">
              <XMath>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMApp>
                      <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                      <XMTok font="italic" role="UNKNOWN">U</XMTok>
                    </XMApp>
                    <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                  </XMApp>
                  <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                </XMApp>
              </XMath>
            </Math> to the current <Math mode="inline" tex="\hat{U}_{\eta}" text="(hat@(U)) _ eta" xml:id="S6.SS3.SSS3.p7.m6">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">U</XMTok>
                  </XMApp>
                  <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                </XMApp>
              </XMath>
            </Math>,
switching from one regression problem to another. This idea was first introduced in <text font="smallcaps">dqn</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="mnih2015human" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite> and then modified so that <Math mode="inline" tex="\hat{U}_{\eta}^{\prime}" text="((hat@(U)) _ eta) ^ prime" xml:id="S6.SS3.SSS3.p7.m7">
              <XMath>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMApp>
                      <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                      <XMTok font="italic" role="UNKNOWN">U</XMTok>
                    </XMApp>
                    <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                  </XMApp>
                  <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                </XMApp>
              </XMath>
            </Math> tracks <Math mode="inline" tex="\hat{U}_{\eta}" text="(hat@(U)) _ eta" xml:id="S6.SS3.SSS3.p7.m8">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">U</XMTok>
                  </XMApp>
                  <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                </XMApp>
              </XMath>
            </Math> with smoother variations in <text font="smallcaps">ddpg</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="lillicrap2015continuous" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.
Both mechanisms, in addition to shuffling the samples, improve the stability of learning the critic.
Furthermore, the opportunity for sample reuse arising from bootstrap methods is transferred to solving successive regression problems with changing target networks.</p>
        </para>
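        <para>
          <p>The two target-critic mechanisms described above can be sketched as follows, under the assumption that the critic parameters are stored as a flat list of floats. The names <text font="typewriter">hard_update</text>, <text font="typewriter">soft_update</text> and <text font="typewriter">td_target</text>, as well as the mixing rate <text font="typewriter">tau</text>, are illustrative choices and not notation taken from the cited papers; this is a minimal sketch rather than the actual implementation of <text font="smallcaps">dqn</text> or <text font="smallcaps">ddpg</text>.</p>
          <verbatim font="typewriter"><![CDATA[
```python
def hard_update(target_params, online_params):
    """Periodic reset: the target critic becomes a copy of the current critic."""
    return list(online_params)


def soft_update(target_params, online_params, tau):
    """Smooth tracking: move each target parameter a fraction tau toward the current one."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


def td_target(utility, gamma, target_value_next):
    """Regression target built from the frozen target critic, not from the current critic."""
    return utility + gamma * target_value_next


# Hypothetical parameter vectors, for illustration only.
online = [1.0, -2.0, 0.5]
target = [0.0, 0.0, 0.0]

hard = hard_update(target, online)           # copy of the current critic parameters
soft = soft_update(target, online, tau=0.1)  # small step toward the current critic
```
]]></verbatim>
          <p>With a small <text font="typewriter">tau</text>, the target critic changes slowly between updates, so each regression problem stays close to the previous one; the periodic reset instead keeps the target fixed between resets and then changes it abruptly.</p>
        </para>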
        <para xml:id="S6.SS3.SSS3.p8">
          <p><text font="bold">Message 20:</text>
Replay buffer shuffling and using a target critic improve the stability of incremental improvement.</p>
        </para>
      </subsubsection>
      <subsubsection inlist="toc" labels="LABEL:sec:mc_critic" xml:id="S6.SS3.SSS4">
        <tags>
          <tag>6.3.4</tag>
          <tag role="autoref">subsubsection 6.3.4</tag>
          <tag role="refnum">6.3.4</tag>
          <tag role="typerefnum">§6.3.4</tag>
        </tags>
        <title><tag close=" ">6.3.4</tag>Batch learning of a critic</title>
        <para xml:id="S6.SS3.SSS4.p1">
          <p>Another way to learn a critic is through batch methods.</p>
        </para>
        <para xml:id="S6.SS3.SSS4.p2">
          <p>
When the critic is represented as a linear architecture,
finding the optimal critic parameters given a batch of samples can be cast as a standard regression problem, as used in <text font="smallcaps">nac</text> and e<text font="smallcaps">nac</text>.
The EM-based methods such as <text font="smallcaps">PoWER</text> and <text font="smallcaps">pi<Math mode="inline" tex="{}^{2}" text="^2" xml:id="S6.SS3.SSS4.p2.m1">
                <XMath>
                  <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                    <XMTok font="upright" fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                  </XMApp>
                </XMath>
              </Math></text> instead rely on Monte Carlo sampling.</p>
        </para>
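        <para>
          <p>As a rough illustration of batch critic learning with a linear architecture, the sketch below computes Monte Carlo utilities-to-go for one episode and then fits a linear critic to them by least squares. It is deliberately simplified: the critic takes a single scalar state feature rather than a state-action pair, and the function names, the toy episode and the discount value are all assumptions made for the example, not the procedure of <text font="smallcaps">nac</text>, e<text font="smallcaps">nac</text> or <text font="smallcaps">PoWER</text>.</p>
          <verbatim font="typewriter"><![CDATA[
```python
def returns_to_go(utilities, gamma):
    """Discounted utility-to-go of each step of one episode (Monte Carlo estimate)."""
    out = [0.0] * len(utilities)
    running = 0.0
    for k in reversed(range(len(utilities))):
        running = utilities[k] + gamma * running
        out[k] = running
    return out


def fit_linear_critic(features, targets):
    """Least-squares fit of targets ~ w0 + w1 * feature via the 2x2 normal equations."""
    n = len(features)
    sx = sum(features)
    sy = sum(targets)
    sxx = sum(f * f for f in features)
    sxy = sum(f * t for f, t in zip(features, targets))
    det = n * sxx - sx * sx
    if det == 0:
        raise ValueError("features are degenerate: cannot fit a slope")
    w1 = (n * sxy - sx * sy) / det
    w0 = (sy - w1 * sx) / n
    return w0, w1


# Hypothetical episode: three steps, one scalar state feature per step.
states = [0.0, 1.0, 2.0]
utilities = [1.0, 1.0, 1.0]

targets = returns_to_go(utilities, gamma=0.5)  # [1.75, 1.5, 1.0]
w0, w1 = fit_linear_critic(states, targets)
```
]]></verbatim>
          <p>Because the targets are complete returns rather than bootstrapped estimates, they do not depend on the critic being fitted, which is what makes this a standard regression problem.</p>
        </para>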
        <para xml:id="S6.SS3.SSS4.p3">
          <p>The details of the corresponding algorithms are well described in <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.
</p>
        </para>
        <para xml:id="S6.SS3.SSS4.p4">
          <p>Batch methods are less sample efficient than bootstrap methods, as the former must recompute the utility-to-go of each state from scratch at each iteration, whereas the latter store these intermediate values in memory and update them incrementally.
Furthermore, they do not allow any sample reuse (see Message <ref labelref="LABEL:mes:incrreg_reuse"/>). However, they are used in most iterative episodic policy search methods listed in Section <ref labelref="LABEL:sec:iter_ac"/>.</p>
        </para>
      </subsubsection>
    </subsection>
    <subsection inlist="toc" labels="LABEL:sec:explo_sb" xml:id="S6.SS4">
      <tags>
        <tag>6.4</tag>
        <tag role="autoref">subsection 6.4</tag>
        <tag role="refnum">6.4</tag>
        <tag role="typerefnum">§6.4</tag>
      </tags>
      <title><tag close=" ">6.4</tag>Exploration policies in step-sample-based methods</title>
      <para xml:id="S6.SS4.p1">
        <p>When using step-samples, exploration can be performed in the policy parameter space, as in <text font="smallcaps">pepg</text>, <text font="smallcaps">PoWER</text> and <text font="smallcaps">pi<Math mode="inline" tex="{}^{2}" text="^2" xml:id="S6.SS4.p1.m1">
              <XMath>
                <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                  <XMTok font="upright" fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                </XMApp>
              </XMath>
            </Math></text>, or in the state-action space, as in most other algorithms.</p>
      </para>
      <para xml:id="S6.SS4.p2">
        <p>Besides, most exploration policies take the form of Gaussian exploration noise specified through a covariance matrix.
When exploration is performed in the state-action space, the connection with policy parameter optimization is weaker.
As a result, the covariance matrix can either be kept fixed, as in <text font="smallcaps">nac</text> and e<text font="smallcaps">nac</text>, or be updated.
When it is updated, a principled way to tune the exploration rate must be found.</p>
      </para>
      <para xml:id="S6.SS4.p3">
        <p>Letting the exploration rate decrease too fast may result in premature convergence.
A well-founded alternative consists in applying a large policy update while constraining it with an upper bound on the Kullback-Leibler divergence between the previous trajectory distribution and the updated one. Performing large steps prevents premature convergence.
This constraint is at the heart of the Relative Entropy Policy Search (<text font="smallcaps">reps</text>) algorithm <cite class="ltx_citemacro_citep">(<bibref bibrefs="peters2010relative" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.
When used in combination with a policy gradient method, it has been shown to ensure natural gradient updates <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey,peters2008reinforcement" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S6.SS4.p4">
        <p>The same method is also at the heart of the Trust Region Policy Optimization (<text font="smallcaps">trpo</text>) algorithm <cite class="ltx_citemacro_citep">(<bibref bibrefs="schulman2015trust" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.
Indeed, the upper bound on the Kullback-Leibler divergence prevents the new policy from moving too far away from the current policy, so that it stays in the “trust region”.
This is safer, particularly in the context of robotics where large jumps in the policy parameter space might be dangerous.
See <cite class="ltx_citemacro_citep">(<bibref bibrefs="schulman2015trust" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> for a mathematical derivation and for a discussion of the relationship to <text font="smallcaps">reps</text>.</p>
      </para>
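      <para>
        <p>The effect of bounding the Kullback-Leibler divergence can be illustrated on a deliberately simple case. The sketch below assumes a univariate Gaussian with fixed standard deviation and shrinks a proposed update of its mean until the divergence between the old and new distributions falls under a bound. This backtracking rule, the function names and the parameters <text font="typewriter">max_kl</text> and <text font="typewriter">shrink</text> are assumptions made for illustration; <text font="smallcaps">reps</text> and <text font="smallcaps">trpo</text> constrain distributions over trajectories or policies and solve the constrained problem differently.</p>
        <verbatim font="typewriter"><![CDATA[
```python
import math


def gaussian_kl(mean_p, std_p, mean_q, std_q):
    """KL(p || q) for univariate Gaussians p = N(mean_p, std_p^2), q = N(mean_q, std_q^2)."""
    return (math.log(std_q / std_p)
            + (std_p ** 2 + (mean_p - mean_q) ** 2) / (2.0 * std_q ** 2)
            - 0.5)


def bounded_mean_update(mean, std, proposed_step, max_kl, shrink=0.5, max_tries=50):
    """Scale proposed_step down until KL(old || new) is at most max_kl (std kept fixed)."""
    step = proposed_step
    for _ in range(max_tries):
        if gaussian_kl(mean, std, mean + step, std) <= max_kl:
            return mean + step
        step *= shrink
    return mean  # no acceptable step found: keep the old mean


# Hypothetical values: a proposed step of 4.0 is far too large for max_kl = 0.6.
new_mean = bounded_mean_update(mean=0.0, std=1.0, proposed_step=4.0, max_kl=0.6)
```
]]></verbatim>
        <p>With a fixed standard deviation the divergence grows with the square of the step, so the bound translates directly into a maximum distance between the old and new means, which is the intuition behind the “trust region”.</p>
      </para>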
    </subsection>
    <subsection inlist="toc" labels="LABEL:sec:ac" xml:id="S6.SS5">
      <tags>
        <tag>6.5</tag>
        <tag role="autoref">subsection 6.5</tag>
        <tag role="refnum">6.5</tag>
        <tag role="typerefnum">§6.5</tag>
      </tags>
      <title><tag close=" ">6.5</tag>Policy search methods using a critic</title>
      <para xml:id="S6.SS5.p1">
        <p>In Section <ref labelref="LABEL:sec:theta2x"/>, we have shown that one can turn derivative-based optimization over <Math mode="inline" tex="J({\mathbf{\bm{\theta}}})" text="J * theta" xml:id="S6.SS5.p1.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">J</XMTok>
                <XMDual>
                  <XMRef idref="S6.SS5.p1.m1.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S6.SS5.p1.m1.1">θ</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> into learning a critic <Math mode="inline" tex="\hat{U}_{\eta}" text="(hat@(U)) _ eta" xml:id="S6.SS5.p1.m2">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">U</XMTok>
                </XMApp>
                <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
              </XMApp>
            </XMath>
          </Math> in the <Math mode="inline" tex="(\mathbf{\bm{x}},\mathbf{\bm{u}})" text="open-interval@(x, u)" xml:id="S6.SS5.p1.m3">
            <XMath>
              <XMDual>
                <XMApp>
                  <XMTok meaning="open-interval"/>
                  <XMRef idref="S6.SS5.p1.m3.1"/>
                  <XMRef idref="S6.SS5.p1.m3.2"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS5.p1.m3.1">x</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS5.p1.m3.2">u</XMTok>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math> space and using it to perform policy gradient ascent.
In Section <ref labelref="LABEL:sec:learn_critic"/>, we have listed various ways to learn such a critic.
Then in Section <ref labelref="LABEL:sec:explo_sb"/>, we have studied exploration policies in the policy parameter space and in the state-action space.
We are now ready to explain how one can implement policy search methods by performing derivative-based optimization over <Math mode="inline" tex="\pi_{\mathbf{\bm{\theta}}}" text="pi _ theta" xml:id="S6.SS5.p1.m4">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
              </XMApp>
            </XMath>
          </Math> based on the gradient of <Math mode="inline" tex="\hat{U}_{\eta}" text="(hat@(U)) _ eta" xml:id="S6.SS5.p1.m5">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">U</XMTok>
                </XMApp>
                <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
              </XMApp>
            </XMath>
          </Math> using (<ref labelref="LABEL:eq:pgtw"/>) or (<ref labelref="LABEL:eq:dpg"/>).</p>
      </para>
      <para xml:id="S6.SS5.p2">
        <p>All the corresponding approaches are particular instantiations of episodic policy search with a surrogate model, with <Math mode="inline" tex="\mathbf{\bm{\phi}}=(\mathbf{\bm{x}},\mathbf{\bm{u}})" text="phi = open-interval@(x, u)" xml:id="S6.SS5.p2.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="bold italic" name="phi" role="UNKNOWN">ϕ</XMTok>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S6.SS5.p2.m1.1"/>
                    <XMRef idref="S6.SS5.p2.m1.2"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS5.p2.m1.1">x</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS5.p2.m1.2">u</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> and <Math mode="inline" tex="f=\hat{U}_{\eta}(\mathbf{\bm{x}},\mathbf{\bm{u}})" text="f = (hat@(U)) _ eta * open-interval@(x, u)" xml:id="S6.SS5.p2.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" role="UNKNOWN">f</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMApp>
                      <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                      <XMTok font="italic" role="UNKNOWN">U</XMTok>
                    </XMApp>
                    <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                  </XMApp>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S6.SS5.p2.m2.1"/>
                      <XMRef idref="S6.SS5.p2.m2.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS5.p2.m2.1">x</XMTok>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS5.p2.m2.2">u</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>.
As in general BBO methods (Section <ref labelref="LABEL:sec:bbo"/>), we distinguish the iterative and the incremental instantiations.</p>
      </para>
      <subsubsection inlist="toc" labels="LABEL:sec:iter_ac" xml:id="S6.SS5.SSS1">
        <tags>
          <tag>6.5.1</tag>
          <tag role="autoref">subsubsection 6.5.1</tag>
          <tag role="refnum">6.5.1</tag>
          <tag role="typerefnum">§6.5.1</tag>
        </tags>
        <title><tag close=" ">6.5.1</tag>Iterative approach</title>
        <para xml:id="S6.SS5.SSS1.p1">
          <p>In Section <ref labelref="LABEL:sec:bbo"/> we mentioned the possibility of a sequential approach where a surrogate model of the utility function is learned first, then derivative-based optimization is performed on this model. Actually, a key difference between using episodic-samples and step-samples appears in that case.
Learning a model <Math mode="inline" tex="\hat{J}({\mathbf{\bm{\theta}}})" text="hat@(J) * theta" xml:id="S6.SS5.SSS1.p1.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">J</XMTok>
                  </XMApp>
                  <XMDual>
                    <XMRef idref="S6.SS5.SSS1.p1.m1.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S6.SS5.SSS1.p1.m1.1">θ</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math> over <Math mode="inline" tex="{\mathbf{\bm{\theta}}}" text="theta" xml:id="S6.SS5.SSS1.p1.m2">
              <XMath>
                <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
              </XMath>
            </Math> is a regression problem that can be solved easily by directly sampling the <Math mode="inline" tex="\Theta" text="Theta" xml:id="S6.SS5.SSS1.p1.m3">
              <XMath>
                <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
              </XMath>
            </Math> space.
By contrast, sampling the <Math mode="inline" tex="\mathcal{X}\times\mathcal{U}" text="X * U" xml:id="S6.SS5.SSS1.p1.m4">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">×</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">X</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">U</XMTok>
                </XMApp>
              </XMath>
            </Math> space is indirect: it requires using some adequate policy.
As a consequence, a sequential approach to episodic policy search with a surrogate model in <Math mode="inline" tex="(\mathbf{\bm{x}},\mathbf{\bm{u}})" text="open-interval@(x, u)" xml:id="S6.SS5.SSS1.p1.m5">
              <XMath>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S6.SS5.SSS1.p1.m5.1"/>
                    <XMRef idref="S6.SS5.SSS1.p1.m5.2"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS5.SSS1.p1.m5.1">x</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS5.SSS1.p1.m5.2">u</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMath>
            </Math> would not work.
In practice, many policy search algorithms alternate collecting new step-samples from the current policy to get a new critic <Math mode="inline" tex="\hat{U}_{\eta_{i+1}}(\mathbf{\bm{x}},\mathbf{\bm{u}})" text="(hat@(U)) _ eta _ (i + 1) * open-interval@(x, u)" xml:id="S6.SS5.SSS1.p1.m6">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMApp>
                      <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                      <XMTok font="italic" role="UNKNOWN">U</XMTok>
                    </XMApp>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                      <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                      <XMApp>
                        <XMTok fontsize="50%" meaning="plus" role="ADDOP">+</XMTok>
                        <XMTok font="italic" fontsize="50%" role="UNKNOWN">i</XMTok>
                        <XMTok fontsize="50%" meaning="1" role="NUMBER">1</XMTok>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S6.SS5.SSS1.p1.m6.1"/>
                      <XMRef idref="S6.SS5.SSS1.p1.m6.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS5.SSS1.p1.m6.1">x</XMTok>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS5.SSS1.p1.m6.2">u</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math> and using this new critic to improve the current policy.</p>
        </para>
        <para xml:id="S6.SS5.SSS1.p2">
          <p>Methods following the iterative approach can be characterized as <text font="italic">policy iteration</text> methods: they learn a new critic at each iteration <cite class="ltx_citemacro_citep">(<bibref bibrefs="schulman2015trust" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>. As outlined in Figure 2 of <cite class="ltx_citemacro_citep">(<bibref bibrefs="stulp13paladyn" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>, algorithms from this family are similar to <text font="smallcaps">EDA</text>s, apart from the fact that they model the utility function in the state-action space instead of the policy parameter space.</p>
        </para>
        <para xml:id="S6.SS5.SSS1.p3">
          <p>Among these methods, one can distinguish three families: <text font="italic">likelihood ratio</text> methods like <text font="smallcaps">reinforce</text> and <text font="smallcaps">pepg</text>, <text font="italic">actor-critic</text> methods like <text font="smallcaps">nac</text> and e<text font="smallcaps">nac</text>, and <text font="italic">EM-based</text> methods like <text font="smallcaps">PoWER</text> and the variants of <text font="smallcaps">reps</text>.</p>
        </para>
        <para xml:id="S6.SS5.SSS1.p4">
          <p>Though they derive from different mathematical frameworks, likelihood ratio methods and EM-based methods are similar: they both use unbiased estimation of the gradient through Monte Carlo sampling and they are both mathematically designed so that the most rewarding trajectories get the highest probability.
<!--  %**** sample˙efficiency.tex Line 1550 **** 
     %By contrast, actor-critic methods do not rely on Monte Carlo sampling, they rather store a critic at the price of some bias in the gradient estimation.-->Besides, <text font="smallcaps">pepg</text> is a likelihood ratio method that uses policy parameter perturbation as exploration policy, while <text font="smallcaps">PoWER</text> is an EM-based method that does the same, so both methods are strongly related.</p>
        </para>
<!--  %the latter family performing better because they do not use a step size (see “cite–deisenroth2013survey˝ for a detailed explanation). -->        <para xml:id="S6.SS5.SSS1.p5">
          <p>In likelihood ratio methods, the expectation over <Math mode="inline" tex="p_{\mathbf{\bm{\theta}}}(\tau)" text="p _ theta * tau" xml:id="S6.SS5.SSS1.p5.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">p</XMTok>
                    <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                  </XMApp>
                  <XMDual>
                    <XMRef idref="S6.SS5.SSS1.p5.m1.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="italic" name="tau" role="UNKNOWN" xml:id="S6.SS5.SSS1.p5.m1.1">τ</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math> is approximated without bias as a sum over trajectories generated using <Math mode="inline" tex="\pol{{}_{\mathbf{\bm{\theta}}}}" text="pol@(_theta)" xml:id="S6.SS5.SSS1.p5.m2">
              <XMath>
                <XMApp>
                  <XMTok name="pol" role="OVERACCENT">→</XMTok>
                  <XMApp role="FLOATSUBSCRIPT" scriptpos="2">
                    <XMTok font="bold italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>.
In <text font="smallcaps">nac</text> and e<text font="smallcaps">nac</text>, due to the <text font="italic">compatibility condition</text>, the features of the critic depend on the gradient of the current policy with respect to policy parameters,
thus each time the policy is updated, new features must be computed for the critic. As a consequence, the critic must be learned again at each iteration with a batch method.
In <text font="smallcaps">PoWER</text> and the variants of <text font="smallcaps">reps</text>, instead of storing a critic between iterations, a Monte Carlo estimation of the utility is used, which is dependent on the current policy.
Interestingly, <text font="smallcaps">pi<Math mode="inline" tex="{}^{2}" text="^2" xml:id="S6.SS5.SSS1.p5.m3">
                <XMath>
                  <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                    <XMTok font="upright" fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                  </XMApp>
                </XMath>
              </Math></text> is also an iterative method, though it could in principle follow the incremental approach described in Section <ref labelref="LABEL:sec:onl"/>.
According to <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>, this is just because batch updates make it more stable.</p>
        </para>
        <para xml:id="S6.SS5.SSS1.p6">
          <p>The <text font="smallcaps">trpo</text> algorithm also follows an iterative approach and can use a deep neural network representation, thus it can be classified as a deep RL method. Its key component is the upper-bounded exploration policy outlined in Section <ref labelref="LABEL:sec:explo_sb"/>.
With respect to <text font="smallcaps">reps</text>, it also uses a conjugate gradient mechanism to improve data efficiency of policy optimization, and it seems to perform better than previous algorithms of the same family <cite class="ltx_citemacro_citep">(<bibref bibrefs="duan2016benchmarking" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.</p>
        </para>
        <para xml:id="S6.SS5.SSS1.p7">
          <p>Finally, the Guided Policy Search (<text font="smallcaps">gps</text>) algorithm <cite class="ltx_citemacro_citep">(<bibref bibrefs="levine2013guided,montgomery2016guided" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite> is another iterative deep RL method inspired by <text font="smallcaps">reinforce</text>, which adds guiding trajectories and is able to learn policies represented by large deep neural networks.
It first learns a set of local open-loop policies using i<text font="smallcaps">lqg</text>, a variant of the model-based Differential Dynamic Programming algorithm <cite class="ltx_citemacro_citep">(<bibref bibrefs="tassa2012synthesis" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>, then it uses importance sampling based on samples generated by these policies to learn a more global closed-loop policy. Importance sampling is a well-known mechanism to reduce the variance of sample-based estimation by attributing weights to the samples depending on their effect on the next estimate <cite class="ltx_citemacro_citep">(<bibref bibrefs="glynn1989importance" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>. For a discussion on the link between importance sampling methods and likelihood ratio methods such as <text font="smallcaps">reinforce</text>, see <cite class="ltx_citemacro_citep">(<bibref bibrefs="jie2010connection" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.</p>
        </para>
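        <para>
          <p>As a rough illustration of the importance sampling mechanism mentioned above, the following sketch estimates an expectation under a target distribution from samples drawn from a different behavior distribution. It is a generic toy example, not the <text font="smallcaps">gps</text> implementation: the function name, the two-outcome distributions and the sample list are all assumptions made for the illustration.</p>
        </para>

```python
def importance_sampling_estimate(samples, f, target_prob, behavior_prob):
    """Estimate E_target[f(x)] from samples drawn from the behavior distribution."""
    total = 0.0
    for x in samples:
        # Importance weight: how much more (or less) likely x is under the target.
        weight = target_prob(x) / behavior_prob(x)
        total += weight * f(x)
    return total / len(samples)


# Hypothetical toy distributions over two outcomes, 0 and 1.
target = {0: 0.2, 1: 0.8}
behavior = {0: 0.5, 1: 0.5}
samples = [0, 1, 1, 0]  # assumed to have been drawn from `behavior`

estimate = importance_sampling_estimate(
    samples, lambda x: float(x), target.get, behavior.get
)
```

        <para>
          <p>Samples that are more likely under the target distribution than under the behavior distribution receive a weight larger than one, and conversely.</p>
        </para>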
        <para xml:id="S6.SS5.SSS1.p8">
          <p>A key aspect of all the above methods is that they can use off-line processing of a batch of data collected in the previous iteration, but they require a significant amount of such data. All these methods can be characterized as <text font="italic">on-policy</text>, which is detrimental to their sample efficiency, but they do not suffer from bias.
By contrast, the incremental approaches covered in the next section update a critic over iterations, providing further data efficiency and further opportunities for sample reuse at the price of some bias.</p>
        </para>
      </subsubsection>
      <subsubsection inlist="toc" labels="LABEL:mes:drl_summary LABEL:mes:sim LABEL:sec:onl" xml:id="S6.SS5.SSS2">
        <tags>
          <tag>6.5.2</tag>
          <tag role="autoref">subsubsection 6.5.2</tag>
          <tag role="refnum">6.5.2</tag>
          <tag role="typerefnum">§6.5.2</tag>
        </tags>
        <title><tag close=" ">6.5.2</tag>Incremental approach</title>
        <para xml:id="S6.SS5.SSS2.p1">
          <p>The incremental approach corresponds to on-line learning, where the current step-sample is used at each step to improve the model of utility and the policy parameters.
In contrast with the iterative approach, it updates a version of the critic at each time step instead of throwing it away and computing it anew each time a new policy is generated. This approach favors data efficiency because the policy is improved as soon as possible, which in turn helps generate better samples.
The four i<text font="smallcaps">nac</text> algorithms proposed in <cite class="ltx_citemacro_citep">(<bibref bibrefs="bhatnagar07_NIPS" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite> belong to this family.</p>
        </para>
        <para xml:id="S6.SS5.SSS2.p2">
          <p>However, on-line learning does not provide opportunities for sample reuse as long as the samples are used immediately rather than stored.
The full sample efficiency of incremental approaches can be obtained by drawing from a replay buffer the samples used to improve the policy (see Section <ref labelref="LABEL:sec:bootstrap"/>).</p>
        </para>
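        <para>
          <p>The replay buffer idea can be sketched as follows. This is a minimal illustration under our own assumptions (class name, tuple layout, uniform sampling and fixed capacity are hypothetical choices), not the buffer of any specific algorithm discussed here.</p>
        </para>

```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of step-samples with uniform random sampling."""

    def __init__(self, capacity, seed=None):
        # The oldest samples are discarded automatically once capacity is reached.
        self.storage = deque(maxlen=capacity)
        self.rng = random.Random(seed)

    def add(self, state, action, reward, next_state):
        self.storage.append((state, action, reward, next_state))

    def sample(self, batch_size):
        # Uniform sampling without replacement breaks temporal correlation.
        k = min(batch_size, len(self.storage))
        return self.rng.sample(list(self.storage), k)

    def __len__(self):
        return len(self.storage)


buffer = ReplayBuffer(capacity=3, seed=0)
for t in range(5):
    buffer.add(state=t, action=0, reward=1.0, next_state=t + 1)
batch = buffer.sample(2)  # each stored sample may be reused in many updates
```

        <para>
          <p>Because samples stay in the buffer after being used, the same step-sample can contribute to several policy or critic updates.</p>
        </para>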
        <para xml:id="S6.SS5.SSS2.p3">
          <p>This latter approach is the common structure of several deep RL algorithms: <text font="smallcaps">ddpg</text>, <text font="smallcaps">naf</text>, <text font="smallcaps">acer</text>, <text font="smallcaps">Q-prop</text> and <text font="smallcaps">pgql</text>.
Describing these algorithms in detail would require a paper in itself.
Here, we just give a brief overview of these algorithms and refer the reader to the corresponding papers for details, and to Table <ref labelref="LABEL:tab:classif2"/> for a summary of the differences.</p>
        </para>
        <para xml:id="S6.SS5.SSS2.p4">
          <p>The <text font="smallcaps">ddpg</text> algorithm <cite class="ltx_citemacro_citep">(<bibref bibrefs="lillicrap2015continuous" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite> is based on the deterministic policy gradient theorem <cite class="ltx_citemacro_citep">(<bibref bibrefs="silver2014deterministic" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite> calling upon (<ref labelref="LABEL:eq:dpg"/>).
The algorithm directly approximates a model <Math mode="inline" tex="\hat{Q}(\mathbf{\bm{x}},\mathbf{\bm{u}})" text="hat@(Q) * open-interval@(x, u)" xml:id="S6.SS5.SSS2.p4.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                  </XMApp>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S6.SS5.SSS2.p4.m1.1"/>
                      <XMRef idref="S6.SS5.SSS2.p4.m1.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS5.SSS2.p4.m1.1">x</XMTok>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS5.SSS2.p4.m1.2">u</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math> as a deep neural network using a bootstrap method,
and uses backpropagation of the gradient as an iterative method to perform the resulting derivative-based optimization over the weights of the network.
In addition to the replay buffer shuffling and target network tricks described in Section <ref labelref="LABEL:sec:target_network"/>, it also uses <text font="italic">batch-normalization</text> to stabilize gradient backpropagation in the networks <cite class="ltx_citemacro_citep">(<bibref bibrefs="ioffe2015batch" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.
Finally, <text font="smallcaps">ddpg</text> does not enforce the compatibility condition and performs vanilla gradient descent.</p>
        </para>
        <para xml:id="S6.SS5.SSS2.p5">
          <p>The <text font="smallcaps">a3c</text> algorithm is another actor-critic algorithm which brings several improvements over <text font="smallcaps">ddpg</text>, such as
natural gradient optimization by estimating the advantage function as a critic,
the propagation of value over <Math mode="inline" tex="n" text="n" xml:id="S6.SS5.SSS2.p5.m1">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">n</XMTok>
              </XMath>
            </Math> steps to limit the growth of the variance (called “<Math mode="inline" tex="n" text="n" xml:id="S6.SS5.SSS2.p5.m2">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">n</XMTok>
              </XMath>
            </Math>-step return”, see <cite class="ltx_citemacro_citep">(<bibref bibrefs="schulman2015high" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>),
and the replacement of the replay buffer by the use of several parallel agents.
Since it does not use a replay buffer, <text font="smallcaps">a3c</text> is on-policy, in contrast with most other deep RL algorithms.</p>
        </para>
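        <para>
          <p>The n-step return mentioned above can be sketched as follows: the first n rewards are accumulated with discounting, then the value estimate of the state reached after n steps is used as a bootstrap. The function and argument names are hypothetical, and the sketch is not taken from the <text font="smallcaps">a3c</text> implementation.</p>
        </para>

```python
def n_step_return(rewards, bootstrap_value, gamma):
    """Discounted sum of n rewards plus gamma**n times the bootstrap value.

    rewards         -- the n rewards observed from the current step onwards
    bootstrap_value -- critic estimate of the value of the state reached after n steps
    gamma           -- discount factor
    """
    ret = bootstrap_value
    # Work backwards so that each reward is discounted the right number of times.
    for r in reversed(rewards):
        ret = r + gamma * ret
    return ret


three_step = n_step_return([1.0, 1.0, 1.0], bootstrap_value=10.0, gamma=0.5)
```

        <para>
          <p>With an empty reward list the function reduces to the bootstrap value alone, and with a long list it approaches a Monte Carlo return.</p>
        </para>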
        <para xml:id="S6.SS5.SSS2.p6">
          <p>By contrast, the <text font="smallcaps">naf</text> algorithm is a value iteration method, like <text font="smallcaps">q-learning</text> or <text font="smallcaps">Sarsa</text>. In <text font="smallcaps">naf</text>, the critic is a model of the advantage function <Math mode="inline" tex="A^{\pol{{}_{\mathbf{\bm{\theta}}}}}(\mathbf{\bm{x}},\mathbf{\bm{u}})" text="A ^ (pol@(_theta)) * open-interval@(x, u)" xml:id="S6.SS5.SSS2.p6.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">A</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" name="pol" role="OVERACCENT">→</XMTok>
                      <XMApp role="FLOATSUBSCRIPT" scriptpos="3">
                        <XMTok font="bold italic" fontsize="50%" name="theta" role="UNKNOWN">θ</XMTok>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S6.SS5.SSS2.p6.m1.1"/>
                      <XMRef idref="S6.SS5.SSS2.p6.m1.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS5.SSS2.p6.m1.1">x</XMTok>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMTok font="bold" role="UNKNOWN" xml:id="S6.SS5.SSS2.p6.m1.2">u</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math>, which guarantees a form of natural gradient optimization (see Section <ref labelref="LABEL:sec:ngo_vg"/>).
Furthermore, the model of this function is structured in such a way that the policy parameters <Math mode="inline" tex="{\mathbf{\bm{\theta}}}" text="theta" xml:id="S6.SS5.SSS2.p6.m2">
              <XMath>
                <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
              </XMath>
            </Math> are a subset of the more global set of parameters <Math mode="inline" tex="\eta" text="eta" xml:id="S6.SS5.SSS2.p6.m3">
              <XMath>
                <XMTok font="italic" name="eta" role="UNKNOWN">η</XMTok>
              </XMath>
            </Math> of <Math mode="inline" tex="\hat{U}_{\eta}" text="(hat@(U)) _ eta" xml:id="S6.SS5.SSS2.p6.m4">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">U</XMTok>
                  </XMApp>
                  <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                </XMApp>
              </XMath>
            </Math>.
By letting <Math mode="inline" tex="\hat{U}_{\eta}" text="(hat@(U)) _ eta" xml:id="S6.SS5.SSS2.p6.m5">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">U</XMTok>
                  </XMApp>
                  <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                </XMApp>
              </XMath>
            </Math> converge over iterations, the policy parameters <Math mode="inline" tex="{\mathbf{\bm{\theta}}}" text="theta" xml:id="S6.SS5.SSS2.p6.m6">
              <XMath>
                <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
              </XMath>
            </Math> are themselves optimized.
This more direct way to implement <Math mode="inline" tex="\delta{\mathbf{\bm{\theta}}}=\mathit{improvement\_using}(\hat{U}_{\eta_{i+1}})" text="delta * theta = improvement_using * (hat@(U)) _ eta _ (i + 1)" xml:id="S6.SS5.SSS2.p6.m7">
              <XMath>
                <XMApp>
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" name="delta" role="UNKNOWN">δ</XMTok>
                    <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">improvement_using</XMTok>
                    <XMDual>
                      <XMRef idref="S6.SS5.SSS2.p6.m7.1"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="S6.SS5.SSS2.p6.m7.1">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMApp>
                            <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                            <XMTok font="italic" role="UNKNOWN">U</XMTok>
                          </XMApp>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                            <XMTok font="italic" fontsize="70%" name="eta" role="UNKNOWN">η</XMTok>
                            <XMApp>
                              <XMTok fontsize="50%" meaning="plus" role="ADDOP">+</XMTok>
                              <XMTok font="italic" fontsize="50%" role="UNKNOWN">i</XMTok>
                              <XMTok fontsize="50%" meaning="1" role="NUMBER">1</XMTok>
                            </XMApp>
                          </XMApp>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math> comes at the price of a constraint on the structure of the critic, which must be quadratic in the features of the state. Besides, the sample efficiency of <text font="smallcaps">naf</text> is further improved in <cite class="ltx_citemacro_citep">(<bibref bibrefs="gu2016continuous" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite> by learning a forward model and switching to model-based RL.</p>
        </para>
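        <para>
          <p>To make the structural constraint more concrete, here is a small sketch that assumes the quadratic parameterization of the <text font="smallcaps">naf</text> paper, where the advantage is a negative quadratic form in the action, centred on the output of the policy. The function name, the toy values and the use of plain Python lists are our own assumptions; in the actual algorithm, the mean and the triangular matrix are outputs of a neural network that depend on the state.</p>
        </para>

```python
def naf_advantage(u, mu, L):
    """Quadratic advantage -0.5 * (u - mu)^T P (u - mu), with P = L L^T.

    u  -- action whose advantage is evaluated
    mu -- action proposed by the policy for the current state
    L  -- lower-triangular matrix (list of rows) for the current state
    """
    d = [ui - mi for ui, mi in zip(u, mu)]
    n = len(d)
    # y = L^T d, so that d^T (L L^T) d equals the squared norm of y.
    y = [sum(L[i][j] * d[i] for i in range(n)) for j in range(n)]
    return -0.5 * sum(v * v for v in y)


# Hypothetical values for a two-dimensional action space.
mu = [1.0, 2.0]
L = [[1.0, 0.0],
     [1.0, 2.0]]

best = naf_advantage(mu, mu, L)           # advantage of the policy action itself
other = naf_advantage([2.0, 2.0], mu, L)  # any other action can only do worse
```

        <para>
          <p>Since the quadratic form is never positive, the action that maximizes the advantage is the policy output itself, which is why improving the critic also improves the policy.</p>
        </para>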
        <para xml:id="S6.SS5.SSS2.p7">
          <p>While <text font="smallcaps">ddpg</text> and <text font="smallcaps">naf</text> learn a deterministic policy, the <text font="smallcaps">svg</text> algorithm learns a stochastic one as a deterministic function of exogenous noise <cite class="ltx_citemacro_citep">(<bibref bibrefs="heess2015learning" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.
A key feature of <text font="smallcaps">svg</text> is that by adjusting a single parameter, it can switch from a model-free to a model-based approach.</p>
        </para>
        <para xml:id="S6.SS5.SSS2.p8">
          <p>Deep off-policy actor-critic algorithms such as <text font="smallcaps">ddpg</text> and <text font="smallcaps">a3c</text> are quite unstable because reducing the variance of the utility function estimation is obtained at the cost of some bias, due to the use of off-policy samples. A new family of algorithms addresses the bias-variance trade-off in order to obtain more sample-efficient and more stable deep episodic policy search.
These algorithms have been characterized into a common interpolated policy gradient (<text font="smallcaps">ipg</text>) framework <cite class="ltx_citemacro_citep">(<bibref bibrefs="gu2017interpolated" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.</p>
        </para>
        <para xml:id="S6.SS5.SSS2.p9">
          <p>The <text font="smallcaps">acer</text> algorithm <cite class="ltx_citemacro_citep">(<bibref bibrefs="wang2016sample" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite> builds on the <text font="smallcaps">a3c</text> algorithm and adds three tricks to improve sample efficiency.
First, it introduces a <text font="italic">truncated importance sampling with bias correction</text> mechanism.
With respect to importance sampling, truncated importance sampling further reduces variance by truncating the largest weights, while bias correction reduces the bias inherent to off-policy actor-critic methods.
Second, <text font="smallcaps">acer</text> uses a stochastic dueling network architecture inspired by <cite class="ltx_citemacro_citep">(<bibref bibrefs="wang2015dueling" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite> to efficiently approximate the advantage function.
Third, it proposes an efficient variant of the “trust region” exploration mechanism of the <text font="smallcaps">trpo</text> algorithm described in Section <ref labelref="LABEL:sec:explo_sb"/>.
Finally, decorrelating the samples by shuffling them into the replay buffer prevents using an <Math mode="inline" tex="n" text="n" xml:id="S6.SS5.SSS2.p9.m1">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">n</XMTok>
              </XMath>
            </Math>-step return from samples stored in that buffer, since temporal succession is lost. As a consequence, using an <Math mode="inline" tex="n" text="n" xml:id="S6.SS5.SSS2.p9.m2">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">n</XMTok>
              </XMath>
            </Math>-step return and being off-policy may appear incompatible at first glance. However, the <text font="smallcaps">retrace</text> algorithm <cite class="ltx_citemacro_citep">(<bibref bibrefs="munos2016safe" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite> manages to perform off-policy <Math mode="inline" tex="n" text="n" xml:id="S6.SS5.SSS2.p9.m3">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">n</XMTok>
              </XMath>
            </Math>-step return updates of a critic, and is used in the <text font="smallcaps">acer</text> algorithm.</p>
        </para>
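        <para>
          <p>The truncation step can be illustrated with the short sketch below. It only shows how importance ratios are capped, and how a Retrace-style trace coefficient caps them at one; the bias correction term of <text font="smallcaps">acer</text> is deliberately left out. Function names, default thresholds and probability values are assumptions made for the example.</p>
        </para>

```python
def truncated_importance_weights(target_probs, behavior_probs, clip=10.0):
    """Importance ratios pi(u|x) / mu(u|x), capped at `clip` to bound the variance."""
    return [min(clip, p / q) for p, q in zip(target_probs, behavior_probs)]


def retrace_style_coefficients(target_probs, behavior_probs, lam=1.0):
    """Per-step trace coefficients lam * min(1, ratio)."""
    capped = truncated_importance_weights(target_probs, behavior_probs, clip=1.0)
    return [lam * w for w in capped]


# Probabilities of the actions actually taken, under the current policy (pi)
# and under the older behavior policy (mu) that filled the replay buffer.
pi_probs = [0.9, 0.1, 0.4]
mu_probs = [0.3, 0.5, 0.4]

weights = truncated_importance_weights(pi_probs, mu_probs, clip=2.0)
traces = retrace_style_coefficients(pi_probs, mu_probs, lam=0.5)
```

        <para>
          <p>Only the first ratio exceeds the threshold and gets truncated; the other two are left unchanged.</p>
        </para>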
        <para xml:id="S6.SS5.SSS2.p10">
          <p>While <text font="smallcaps">acer</text> incorporates several mechanisms to control the bias on top of <text font="smallcaps">a3c</text> and the <text font="smallcaps">offpac</text> algorithm <cite class="ltx_citemacro_citep">(<bibref bibrefs="degris2012off" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>, the <text font="smallcaps">Q-prop</text> algorithm rather integrates the deterministic policy gradient equation (<ref labelref="LABEL:eq:dpg"/>) together with the stochastic policy gradient equation from <text font="smallcaps">offpac</text> into a single gradient equation using a <text font="italic">control variate</text> formalization <cite class="ltx_citemacro_citep">(<bibref bibrefs="paisley2012variational" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.
More importantly, the gradient equation of <text font="smallcaps">Q-prop</text> integrates (on-policy) MC policy gradient methods and (off-policy) actor-critic methods into a single framework and can be seen as performing one or the other depending on some hyperparameters, or taking the best of both worlds. The result is a more stable algorithm that reduces the variance while controlling the bias, and that can incorporate the most recent advances of both MC policy gradient methods and actor-critic methods.</p>
        </para>
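        <para>
          <p>The control variate idea itself can be sketched in a few lines. The example below is a generic Monte Carlo illustration, not the <text font="smallcaps">Q-prop</text> gradient equation: it corrects a sample mean with an auxiliary quantity whose expectation is known. All names and numbers are hypothetical.</p>
        </para>

```python
def control_variate_estimate(f_values, g_values, g_mean, coeff=1.0):
    """Monte Carlo estimate of E[f] corrected by a control variate g.

    f_values -- samples of the quantity of interest
    g_values -- samples of a correlated quantity, drawn jointly with f_values
    g_mean   -- known expectation of g
    coeff    -- scaling of the correction term
    """
    n = len(f_values)
    # Subtracting coeff * (g - E[g]) leaves the expectation unchanged,
    # but reduces the variance when f and g are positively correlated.
    return sum(f - coeff * (g - g_mean) for f, g in zip(f_values, g_values)) / n


f_samples = [3.0, 5.0]
g_samples = [1.0, 2.0]
corrected = control_variate_estimate(f_samples, g_samples, g_mean=2.0)
```

        <para>
          <p>Setting the coefficient to zero recovers the plain Monte Carlo estimate, which loosely mirrors how a hyperparameter can move an estimator between two regimes.</p>
        </para>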
        <para xml:id="S6.SS5.SSS2.p11">
          <p><text color="#CC0000" font="bold" framecolor="#CC0000" framed="rectangle">Todo:</text><text color="#CC0000" font="smallcaps">pgql<text font="italic"> <cite class="ltx_citemacro_citep">(<bibref bibrefs="o2016pgq" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></text></text></p>
        </para>
        <para xml:id="S6.SS5.SSS2.p12">
          <p><text font="bold">Message 21:</text>
Incremental improvement is generally more sample efficient than iterative improvement, but it can be unstable.</p>
        </para>
        <para xml:id="S6.SS5.SSS2.p13">
          <p><text font="bold">Message 22:</text>
By using a replay buffer, all the deep RL algorithms above combine the advantages of incremental and iterative learning, as discussed in Section <ref labelref="LABEL:sec:incr_seq"/>.</p>
        </para>
      </subsubsection>
    </subsection>
    <subsection inlist="toc" labels="LABEL:sec:sum_critic" xml:id="S6.SS6">
      <tags>
        <tag>6.6</tag>
        <tag role="autoref">subsection 6.6</tag>
        <tag role="refnum">6.6</tag>
        <tag role="typerefnum">§6.6</tag>
      </tags>
      <title><tag close=" ">6.6</tag>Summary</title>
      <para xml:id="S6.SS6.p1">
        <p><text color="#CC0000" font="bold" framecolor="#CC0000" framed="rectangle">Todo:</text><text color="#CC0000" font="italic">More black dots means more sample efficiency.</text></p>
      </para>
      <para xml:id="S6.SS6.p2">
        <p>All episodic policy search algorithms using a critic perform local search.</p>
      </para>
      <table inlist="lot" labels="LABEL:tab:classif2" placement="hbtp" xml:id="S6.T3">
        <tags>
          <tag>Table 3</tag>
          <tag role="autoref">Table 3</tag>
          <tag role="refnum">3</tag>
          <tag role="typerefnum">Table 3</tag>
        </tags>
        <tabular class="ltx_centering" vattach="middle">
          <tbody>
            <tr>
              <td border="l r t"/>
              <td align="left" border="r t" colspan="2">Architecture</td>
              <td align="left" border="r t" colspan="2">Explo.</td>
              <td align="left" border="r t" colspan="2">Model Improv.</td>
              <td align="left" border="r t" colspan="3">Policy Improv.</td>
            </tr>
            <tr>
              <td border="l r t"/>
              <td align="center" border="t"><inline-block angle="90" depth="0.0pt" height="307.5pt" innerdepth="2.0pt" innerheight="7.0pt" innerwidth="307.5pt" width="9.0pt" xtranslate="-149.25pt" ytranslate="-148.25pt">
                  <p>Utility model: non-linear (<Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m1">
                      <XMath>
                        <XMTok name="bullet" role="MULOP">∙</XMTok>
                      </XMath>
                    </Math>), linear (<Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m2">
                      <XMath>
                        <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                      </XMath>
                    </Math>)</p>
                </inline-block></td>
              <td align="center" border="r t"><inline-block angle="90" depth="0.0pt" height="300.0pt" innerdepth="2.0pt" innerheight="7.0pt" innerwidth="300.0pt" width="9.0pt" xtranslate="-145.5pt" ytranslate="-144.5pt">
                  <p>Forward model: yes (<Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m3">
                      <XMath>
                        <XMTok name="bullet" role="MULOP">∙</XMTok>
                      </XMath>
                    </Math>), no (<Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m4">
                      <XMath>
                        <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                      </XMath>
                    </Math>), both (<Math mode="inline" tex="\star" text="star" xml:id="S6.T3.m5">
                      <XMath>
                        <XMTok name="star" role="MULOP">⋆</XMTok>
                      </XMath>
                    </Math>)</p>
                </inline-block></td>
              <td align="center" border="t"><inline-block angle="90" depth="0.0pt" height="165.0pt" innerdepth="2.0pt" innerheight="7.0pt" innerwidth="165.0pt" width="9.0pt" xtranslate="-78pt" ytranslate="-77pt">
                  <p>In <Math mode="inline" tex="(\mathbf{\bm{o}},\mathbf{\bm{u}})" text="open-interval@(o, u)" xml:id="S6.T3.m6">
                      <XMath>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="open-interval"/>
                            <XMRef idref="S6.T3.m6.1"/>
                            <XMRef idref="S6.T3.m6.2"/>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false">(</XMTok>
                            <XMTok font="bold" role="UNKNOWN" xml:id="S6.T3.m6.1">o</XMTok>
                            <XMTok role="PUNCT">,</XMTok>
                            <XMTok font="bold" role="UNKNOWN" xml:id="S6.T3.m6.2">u</XMTok>
                            <XMTok role="CLOSE" stretchy="false">)</XMTok>
                          </XMWrap>
                        </XMDual>
                      </XMath>
                    </Math> (<Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m7">
                      <XMath>
                        <XMTok name="bullet" role="MULOP">∙</XMTok>
                      </XMath>
                    </Math>), in <Math mode="inline" tex="{\mathbf{\bm{\theta}}}" text="theta" xml:id="S6.T3.m8">
                      <XMath>
                        <XMTok font="bold italic" name="theta" role="UNKNOWN">θ</XMTok>
                      </XMath>
                    </Math> (<Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m9">
                      <XMath>
                        <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                      </XMath>
                    </Math>)</p>
                </inline-block></td>
              <td align="center" border="r t"><inline-block angle="90" depth="0.0pt" height="352.5pt" innerdepth="2.0pt" innerheight="7.0pt" innerwidth="352.5pt" width="9.0pt" xtranslate="-171.75pt" ytranslate="-170.75pt">
                  <p>Greedy (<Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m10">
                      <XMath>
                        <XMTok name="bullet" role="MULOP">∙</XMTok>
                      </XMath>
                    </Math>), stochastic search (<Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m11">
                      <XMath>
                        <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                      </XMath>
                    </Math>), advanced (<Math mode="inline" tex="\star" text="star" xml:id="S6.T3.m12">
                      <XMath>
                        <XMTok name="star" role="MULOP">⋆</XMTok>
                      </XMath>
                    </Math>)</p>
                </inline-block></td>
              <td align="center" border="t"><inline-block angle="90" depth="0.0pt" height="300.0pt" innerdepth="2.0pt" innerheight="7.0pt" innerwidth="300.0pt" width="9.0pt" xtranslate="-145.5pt" ytranslate="-144.5pt">
                  <p>Incremental (<Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m13">
                      <XMath>
                        <XMTok name="bullet" role="MULOP">∙</XMTok>
                      </XMath>
                    </Math>), iterative (<Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m14">
                      <XMath>
                        <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                      </XMath>
                    </Math>), both (<Math mode="inline" tex="\star" text="star" xml:id="S6.T3.m15">
                      <XMath>
                        <XMTok name="star" role="MULOP">⋆</XMTok>
                      </XMath>
                    </Math>)</p>
                </inline-block></td>
              <td align="center" border="r t"><inline-block angle="90" depth="0.0pt" height="292.5pt" innerdepth="2.0pt" innerheight="7.0pt" innerwidth="292.5pt" width="9.0pt" xtranslate="-141.75pt" ytranslate="-140.75pt">
                  <p>Off-policy (<Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m16">
                      <XMath>
                        <XMTok name="bullet" role="MULOP">∙</XMTok>
                      </XMath>
                    </Math>), on-policy (<Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m17">
                      <XMath>
                        <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                      </XMath>
                    </Math>), both (<Math mode="inline" tex="\star" text="star" xml:id="S6.T3.m18">
                      <XMath>
                        <XMTok name="star" role="MULOP">⋆</XMTok>
                      </XMath>
                    </Math>)</p>
                </inline-block></td>
              <td align="center" border="t"><inline-block angle="90" depth="0.0pt" height="255.0pt" innerdepth="2.0pt" innerheight="7.0pt" innerwidth="255.0pt" width="9.0pt" xtranslate="-123pt" ytranslate="-122pt">
                  <p>Gradient-based (<Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m19">
                      <XMath>
                        <XMTok name="bullet" role="MULOP">∙</XMTok>
                      </XMath>
                    </Math>), analytical (<Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m20">
                      <XMath>
                        <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                      </XMath>
                    </Math>)</p>
                </inline-block></td>
              <td align="center" border="t"><inline-block angle="90" depth="0.0pt" height="255.0pt" innerdepth="2.0pt" innerheight="7.0pt" innerwidth="255.0pt" width="9.0pt" xtranslate="-123pt" ytranslate="-122pt">
                  <p>Gradient: natural (<Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m21">
                      <XMath>
                        <XMTok name="bullet" role="MULOP">∙</XMTok>
                      </XMath>
                    </Math>), vanilla (<Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m22">
                      <XMath>
                        <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                      </XMath>
                    </Math>)</p>
                </inline-block></td>
              <td align="center" border="r t"><inline-block angle="90" depth="0.0pt" height="150.0pt" innerdepth="2.0pt" innerheight="7.0pt" innerwidth="150.0pt" width="9.0pt" xtranslate="-70.5pt" ytranslate="-69.5pt">
                  <p>Policy update method</p>
                </inline-block></td>
            </tr>
            <tr>
              <td align="left" border="l r tt"><text font="smallcaps">PoWER</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="kober2009learning" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center" border="tt"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m23">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r tt"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m24">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="tt"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m25">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r tt"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m26">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="tt"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m27">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r tt"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m28">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="tt"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m29">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="tt"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m30">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r tt">MC-EM</td>
            </tr>
            <tr>
              <td align="left" border="l r"><text font="smallcaps">vips</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="neumann2011variational" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m31">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m32">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m33">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m34">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m35">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m36">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m37">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m38">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">Var.</td>
            </tr>
            <tr>
              <td align="left" border="l r"><break/><text font="smallcaps">pi<Math mode="inline" tex="{}^{2}" text="^2" xml:id="S6.T3.m39">
                    <XMath>
                      <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                        <XMTok font="upright" fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="theodorou10generalized" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m40">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m41">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m42">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m43">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center">.</td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m44">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m45">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m46">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">PI</td>
            </tr>
            <tr>
              <td align="left" border="l r"><text font="smallcaps">reinforce</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="williams1992" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m47">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m48">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m49">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m50">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m51">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m52">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m53">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m54">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">LR</td>
            </tr>
            <tr>
              <td align="left" border="l r"><text font="smallcaps">g(po)mdp</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="baxter2001infinite" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m55">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m56">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m57">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m58">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m59">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m60">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m61">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m62">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">LR</td>
            </tr>
            <tr>
              <td align="left" border="l r"><text font="smallcaps">pepg</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="sehnke2010parameter" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m63">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m64">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m65">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m66">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m67">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">.</td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m68">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m69">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">LR</td>
            </tr>
            <tr>
              <td align="left" border="l r"><break/><text font="smallcaps">nac</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="peters08natural" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m70">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m71">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m72">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m73">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center">.</td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m74">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m75">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m76">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">NPG</td>
            </tr>
            <tr>
              <td align="left" border="l r">e<text font="smallcaps">nac</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="peters08natural" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m77">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m78">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m79">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m80">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center">.</td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m81">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m82">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m83">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">NPG</td>
            </tr>
            <tr>
              <td align="left" border="l r"><break/>SB-<text font="smallcaps">reps</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="peters2010relative" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m84">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m85">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m86">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\star" text="star" xml:id="S6.T3.m87">
                  <XMath>
                    <XMTok name="star" role="MULOP">⋆</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m88">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m89">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m90">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m91">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">IT</td>
            </tr>
            <tr>
              <td align="left" border="l r"><text font="smallcaps">hireps</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="daniel2012hierarchical" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m92">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m93">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m94">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\star" text="star" xml:id="S6.T3.m95">
                  <XMath>
                    <XMTok name="star" role="MULOP">⋆</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m96">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m97">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m98">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m99">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">IT</td>
            </tr>
            <tr>
              <td align="left" border="l r t"><break/><break/>i<text font="smallcaps">nac</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="bhatnagar07_NIPS" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center" border="t"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m100">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r t"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m101">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="t"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m102">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r t"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m103">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="t"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m104">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r t"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m105">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="t"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m106">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="t"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m107">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">NPG</td>
            </tr>
            <tr>
              <td align="left" border="l r"><text font="smallcaps">gps</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="levine2013guided" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m108">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m109">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m110">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\star" text="star" xml:id="S6.T3.m111">
                  <XMath>
                    <XMTok name="star" role="MULOP">⋆</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m112">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m113">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m114">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m115">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">PG</td>
            </tr>
            <tr>
              <td align="left" border="l r"><text font="smallcaps">trpo</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="schulman2015trust" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m116">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m117">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m118">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m119">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m120">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m121">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m122">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m123">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">NPG</td>
            </tr>
            <tr>
              <td align="left" border="l r"><break/><text font="smallcaps">ddpg</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="lillicrap2015continuous" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m124">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m125">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m126">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m127">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m128">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m129">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m130">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m131">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">PG</td>
            </tr>
            <tr>
              <td align="left" border="l r"><text font="smallcaps">a3c</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="mnih2016asynchronous" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m132">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m133">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m134">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m135">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m136">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m137">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m138">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m139">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">NPG</td>
            </tr>
            <tr>
              <td align="left" border="l r"><text font="smallcaps">naf</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="gu2016continuous" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m140">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m141">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m142">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m143">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m144">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m145">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m146">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m147">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">NPG</td>
            </tr>
            <tr>
              <td align="left" border="l r"><text font="smallcaps">svg</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="heess2015learning" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m148">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\star" text="star" xml:id="S6.T3.m149">
                  <XMath>
                    <XMTok name="star" role="MULOP">⋆</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m150">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m151">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m152">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m153">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m154">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m155">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">NPG</td>
            </tr>
            <tr>
              <td align="left" border="l r"><text font="smallcaps">acer</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="wang2016sample" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m156">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m157">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m158">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m159">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m160">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m161">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m162">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m163">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">NPG</td>
            </tr>
            <tr>
              <td align="left" border="l r"><text font="smallcaps">Q-prop</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="gu2016q" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m164">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m165">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m166">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m167">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\star" text="star" xml:id="S6.T3.m168">
                  <XMath>
                    <XMTok name="star" role="MULOP">⋆</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r"><Math mode="inline" tex="\star" text="star" xml:id="S6.T3.m169">
                  <XMath>
                    <XMTok name="star" role="MULOP">⋆</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m170">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m171">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="r">NPG</td>
            </tr>
            <tr>
              <td align="left" border="b l r"><text font="smallcaps">pgql</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="o2016pgq" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite></td>
              <td align="center" border="b"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m172">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="b r"><Math mode="inline" tex="\circ" text="compose" xml:id="S6.T3.m173">
                  <XMath>
                    <XMTok meaning="compose" name="circ" role="MULOP">∘</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="b"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m174">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="b r"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m175">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="b"><Math mode="inline" tex="\star" text="star" xml:id="S6.T3.m176">
                  <XMath>
                    <XMTok name="star" role="MULOP">⋆</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="b r"><Math mode="inline" tex="\star" text="star" xml:id="S6.T3.m177">
                  <XMath>
                    <XMTok name="star" role="MULOP">⋆</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="b"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m178">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="b"><Math mode="inline" tex="\bullet" text="bullet" xml:id="S6.T3.m179">
                  <XMath>
                    <XMTok name="bullet" role="MULOP">∙</XMTok>
                  </XMath>
                </Math></td>
              <td align="center" border="b r">NPG</td>
            </tr>
          </tbody>
        </tabular>
        <toccaption class="ltx_centering"><tag close=" ">3</tag>List of episodic policy search algorithms using a critic.
Algorithms above the line were studied in <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>; those below the line were not.
The classification criteria are explained and discussed in Section <ref labelref="LABEL:sec:choices"/>.
For the policy update methods, the labels are as follows (following <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>).
LR: likelihood ratio.
MC-EM: Expectation-Maximization with Monte Carlo.
Var.: Variational method.
PI: path integral.
IT: Information-theoretic method.
SO: stochastic optimization.
NSO: stochastic optimization with the natural gradient.
PG: policy gradient.
NPG: natural policy gradient.
</toccaption>
        <caption class="ltx_centering"><tag close=": ">Table 3</tag>List of episodic policy search algorithms using a critic.
Algorithms above the line were studied in <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>; those below the line were not.
The classification criteria are explained and discussed in Section <ref labelref="LABEL:sec:choices"/>.
For the policy update methods, the labels are as follows (following <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>).
LR: likelihood ratio.
MC-EM: Expectation-Maximization with Monte Carlo.
Var.: Variational method.
PI: path integral.
IT: Information-theoretic method.
SO: stochastic optimization.
NSO: stochastic optimization with the natural gradient.
PG: policy gradient.
NPG: natural policy gradient.
</caption>
      </table>
<!--  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -->    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:choices" xml:id="S7">
    <tags>
      <tag>7</tag>
      <tag role="autoref">section 7</tag>
      <tag role="refnum">7</tag>
      <tag role="typerefnum">§7</tag>
    </tags>
    <title><tag close=" ">7</tag>Discussion</title>
<!--  %“todo–One thing I miss here. Deep learning allows for hierarchical structures to be learned. Other representation are ‘flat’. This is probably important too, isn’t it.˝ -->    <para xml:id="S7.p1">
      <p>In the previous sections we have presented methods learning a utility model in the policy parameter space and methods doing so in the state and action space separately.
In this final discussion, we do two things:
1) we discuss the implications of choosing one space rather than the other and
2) we highlight some properties which make deep RL methods more sample efficient than the previous generation of policy search methods.</p>
    </para>
    <subsection inlist="toc" labels="LABEL:sec:spaces" xml:id="S7.SS1">
      <tags>
        <tag>7.1</tag>
        <tag role="autoref">subsection 7.1</tag>
        <tag role="refnum">7.1</tag>
        <tag role="typerefnum">§7.1</tag>
      </tags>
      <title><tag close=" ">7.1</tag>Learning the utility function in the policy parameter space versus a critic</title>
      <para xml:id="S7.SS1.p1">
        <p>Whether the utility function is learned in the policy parameter space or through a critic is a fundamental distinction among episodic policy search methods.
These two approaches correspond respectively to the episode-based and the step-based evaluation strategies outlined in <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.
Several arguments suggest that the latter, step-based approach is the more sample efficient of the two.
<!--  %**** sample˙efficiency.tex Line 1725 **** --></p>
        <itemize xml:id="S7.I1">
          <item xml:id="S7.I1.i1">
            <tags>
              <tag>1.</tag>
              <tag role="autoref">item 1</tag>
              <tag role="refnum">1</tag>
              <tag role="typerefnum">item 1</tag>
            </tags>
            <para xml:id="S7.I1.i1.p1">
              <p>Learning a critic with a bootstrap method can give rise to more sample reuse than learning a model of the utility function in <Math mode="inline" tex="\Theta" text="Theta" xml:id="S7.I1.i1.p1.m1">
                  <XMath>
                    <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
                  </XMath>
                </Math>.</p>
            </para>
          </item>
          <item xml:id="S7.I1.i2">
            <tags>
              <tag>2.</tag>
              <tag role="autoref">item 2</tag>
              <tag role="refnum">2</tag>
              <tag role="typerefnum">item 2</tag>
            </tags>
            <para xml:id="S7.I1.i2.p1">
              <p>Learning from each step sample separately makes better use of the information available in a trajectory than learning from episodic samples.</p>
            </para>
          </item>
          <item xml:id="S7.I1.i3">
            <tags>
              <tag>3.</tag>
              <tag role="autoref">item 3</tag>
              <tag role="refnum">3</tag>
              <tag role="typerefnum">item 3</tag>
            </tags>
            <para xml:id="S7.I1.i3.p1">
              <p>The critic estimates the utility-to-go from the current state to the final state.
Thus it provides an estimate of the utility of the whole trajectory without having to run it.
Therefore, caching such values and using bootstrap updates should in principle be more sample efficient than using MC updates.</p>
            </para>
          </item>
          <item xml:id="S7.I1.i4">
            <tags>
              <tag>4.</tag>
              <tag role="autoref">item 4</tag>
              <tag role="refnum">4</tag>
              <tag role="typerefnum">item 4</tag>
            </tags>
            <para xml:id="S7.I1.i4.p1">
              <p>Under well-specified constraints corresponding to the Markov property <cite class="ltx_citemacro_citep">(<bibref bibrefs="sigaud2010" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                    <bibrefphrase>, </bibrefphrase>
                  </bibref>)</cite>, improving the policy locally is guaranteed to improve it globally, which facilitates optimization over trajectories.</p>
            </para>
          </item>
        </itemize>
      </para>
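      <para>
        <p>The first and third arguments can be illustrated with a minimal sketch. It assumes a hypothetical three-state task with tabular values, a step size alpha and a discount factor gamma; none of these names or numbers come from the algorithms surveyed above. After one complete episode, a single additional transition is enough for a bootstrap (TD(0)) update to assign a value to a new state, whereas a Monte Carlo update has to wait for the full return.</p>
      </para>

```python
def mc_update(values, episode, alpha=0.5, gamma=1.0):
    """Monte Carlo: needs the complete episode to compute each return."""
    ret = 0.0
    # Walk backwards so that `ret` is the discounted return-to-go from each state.
    for state, reward, _next_state in reversed(episode):
        ret = reward + gamma * ret
        values[state] += alpha * (ret - values[state])
    return values


def td0_update(values, transition, alpha=0.5, gamma=1.0):
    """TD(0): one transition suffices; the critic supplies the rest of the return."""
    state, reward, next_state = transition
    # A terminal transition (next_state is None) has no value to bootstrap from.
    bootstrap = 0.0 if next_state is None else values[next_state]
    values[state] += alpha * (reward + gamma * bootstrap - values[state])
    return values


# Hypothetical task: "a" and "b" both lead to "c", which terminates with reward 1.
full_episode = [("a", 0.0, "c"), ("c", 1.0, None)]
partial = ("b", 0.0, "c")  # a single step observed from "b"; the episode is not finished

# Bootstrap learner: process the complete episode backwards, then the lone transition.
v_td = {"a": 0.0, "b": 0.0, "c": 0.0}
for transition in reversed(full_episode):
    td0_update(v_td, transition)
td0_update(v_td, partial)

# Monte Carlo learner: only the complete episode yields usable returns.
v_mc = {"a": 0.0, "b": 0.0, "c": 0.0}
mc_update(v_mc, full_episode)

print("TD(0):", v_td)
print("MC:   ", v_mc)
```

      <para>
        <p>In this sketch, the TD(0) learner obtains a nonzero estimate for state "b" from one observed step by reusing the cached value of "c", while the Monte Carlo learner leaves it at zero until an episode through "b" is completed. The sketch ignores the bias that bootstrapping introduces when the critic is inaccurate.</p>
      </para>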
      <para xml:id="S7.SS1.p2">
        <p>However, other aspects must be considered.</p>
      </para>
      <subsubsection inlist="toc" xml:id="S7.SS1.SSS1">
        <tags>
          <tag>7.1.1</tag>
          <tag role="autoref">subsubsection 7.1.1</tag>
          <tag role="refnum">7.1.1</tag>
          <tag role="typerefnum">§7.1.1</tag>
        </tags>
        <title><tag close=" ">7.1.1</tag>Size and structure of the corresponding spaces</title>
        <para xml:id="S7.SS1.SSS1.p1">
          <p>A key factor of sample efficiency is the size of the <Math mode="inline" tex="\mathcal{X}\times\mathcal{U}" text="X * U" xml:id="S7.SS1.SSS1.p1.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">×</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">X</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">U</XMTok>
                </XMApp>
              </XMath>
            </Math> space with respect to <Math mode="inline" tex="\Theta" text="Theta" xml:id="S7.SS1.SSS1.p1.m2">
              <XMath>
                <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
              </XMath>
            </Math>.
Indeed, whatever the model, learning it is more expensive when its domain is larger.
This insight is at the heart of <text font="italic">quality-diversity</text> methods, which sample policies in a hand-designed space that is smaller than the <Math mode="inline" tex="\Theta" text="Theta" xml:id="S7.SS1.SSS1.p1.m3">
              <XMath>
                <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
              </XMath>
            </Math> space <cite class="ltx_citemacro_citep">(<bibref bibrefs="cully17quality,pugh2015confronting" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.
Another important factor is the structure of the relationships between points in the <Math mode="inline" tex="\mathcal{X}\times\mathcal{U}" text="X * U" xml:id="S7.SS1.SSS1.p1.m4">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">×</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">X</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">U</XMTok>
                </XMApp>
              </XMath>
            </Math> space and in <Math mode="inline" tex="\Theta" text="Theta" xml:id="S7.SS1.SSS1.p1.m5">
              <XMath>
                <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
              </XMath>
            </Math>, which depends on how the policy is parametrized. Policies are often parametrized so that the relationship is smooth enough, and deep neural networks seem to generally induce a smooth structure too, but some policy parametrizations may induce a large jump in the <Math mode="inline" tex="\mathcal{X}\times\mathcal{U}" text="X * U" xml:id="S7.SS1.SSS1.p1.m6">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">×</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">X</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">U</XMTok>
                </XMApp>
              </XMath>
            </Math> space for a small variation in <Math mode="inline" tex="\Theta" text="Theta" xml:id="S7.SS1.SSS1.p1.m7">
              <XMath>
                <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
              </XMath>
            </Math>. In the latter case, searching directly in <Math mode="inline" tex="\Theta" text="Theta" xml:id="S7.SS1.SSS1.p1.m8">
              <XMath>
                <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
              </XMath>
            </Math> may prove more efficient.
These two factors might dominate all others.</p>
        </para>
        <para xml:id="S7.SS1.SSS1.p2">
          <p>In robotics, the size of <Math mode="inline" tex="\Theta" text="Theta" xml:id="S7.SS1.SSS1.p2.m1">
              <XMath>
                <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
              </XMath>
            </Math> is usually kept small so as to keep optimization fast enough, using <text font="smallcaps">dmp</text>s or other open-loop policies <cite class="ltx_citemacro_citep">(<bibref bibrefs="ijspeert2013dynamical" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.
Moreover, with open-loop policies, the immediate utility of an action at each state is not useful information since the controller does not take the state as input.
As a consequence, using methods based on step-samples for <text font="smallcaps">dmp</text>s does not make much sense.
However, open-loop policies and <text font="smallcaps">dmp</text>s suffer from some drawbacks, such as a limited robustness to perturbations <cite class="ltx_citemacro_citep">(<bibref bibrefs="stulp13paladyn" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.</p>
        </para>
        <para xml:id="S7.SS1.SSS1.p3">
          <p>The emergence of deep RL methods using large neural networks as policy representations may change the picture, as they make it possible to learn large closed-loop policies.
Furthermore, in that context, the size of <Math mode="inline" tex="\Theta" text="Theta" xml:id="S7.SS1.SSS1.p3.m1">
              <XMath>
                <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
              </XMath>
            </Math> can become larger than that of the <Math mode="inline" tex="\mathcal{X}\times\mathcal{U}" text="X * U" xml:id="S7.SS1.SSS1.p3.m2">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">×</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">X</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">U</XMTok>
                </XMApp>
              </XMath>
            </Math> space, which speaks in favor of learning a critic.
Besides, a utility function modelled in a larger space may suffer from fewer local minima, which may be true both of <Math mode="inline" tex="\hat{J}({\mathbf{\bm{\theta}}})" text="hat@(J) * theta" xml:id="S7.SS1.SSS1.p3.m3">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">J</XMTok>
                  </XMApp>
                  <XMDual>
                    <XMRef idref="S7.SS1.SSS1.p3.m3.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="bold italic" name="theta" role="UNKNOWN" xml:id="S7.SS1.SSS1.p3.m3.1">θ</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math> and of a critic.</p>
        </para>
      </subsubsection>
      <subsubsection inlist="toc" xml:id="S7.SS1.SSS2">
        <tags>
          <tag>7.1.2</tag>
          <tag role="autoref">subsubsection 7.1.2</tag>
          <tag role="refnum">7.1.2</tag>
          <tag role="typerefnum">§7.1.2</tag>
        </tags>
        <title><tag close=" ">7.1.2</tag>Learning a hierarchical representation</title>
        <para xml:id="S7.SS1.SSS2.p1">
          <p>Finally, the state-action space may naturally exhibit a hierarchical structure, which is less obviously the case for the policy parameter space. As a consequence, methods modelling utility in the <Math mode="inline" tex="\mathcal{X}\times\mathcal{U}" text="X * U" xml:id="S7.SS1.SSS2.p1.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">×</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">X</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">U</XMTok>
                </XMApp>
              </XMath>
            </Math> space may benefit from learning intermediate representations at several levels in the hierarchy so as to reduce the dimensionality of the policy search problem.
Learning such intermediate and more compact representations can be performed off-line, which corresponds to the perspective of the DREAM project and is illustrated in <cite class="ltx_citemacro_citep">(<bibref bibrefs="zimmer2017bootstrapping" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.</p>
        </para>
      </subsubsection>
      <subsubsection inlist="toc" xml:id="S7.SS1.SSS3">
        <tags>
          <tag>7.1.3</tag>
          <tag role="autoref">subsubsection 7.1.3</tag>
          <tag role="refnum">7.1.3</tag>
          <tag role="typerefnum">§7.1.3</tag>
        </tags>
        <title><tag close=" ">7.1.3</tag>Policy parameter perturbation versus action perturbation</title>
        <para xml:id="S7.SS1.SSS3.p1">
          <p>Policy parameter perturbation and action perturbation correspond respectively to the episode-based and the step-based exploration strategies outlined in <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>. Interestingly, Table <ref labelref="LABEL:tab:classif2"/> shows that <text font="smallcaps">pepg</text>, <text font="smallcaps">PoWER</text> and <text font="smallcaps">pi<Math mode="inline" tex="{}^{2}" text="^2" xml:id="S7.SS1.SSS3.p1.m1">
                <XMath>
                  <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                    <XMTok font="upright" fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                  </XMApp>
                </XMath>
              </Math></text> use an episode-based exploration strategy even though they use a step-based evaluation strategy.</p>
        </para>
        <para xml:id="S7.SS1.SSS3.p2">
          <p>In several surveys about episodic policy search for robotics, policy parameter perturbation methods are considered superior to action perturbation methods <cite class="ltx_citemacro_citep">(<bibref bibrefs="stulp13paladyn,deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>. However, although this analysis is backed up by a few mathematical arguments, we now believe this is true mostly when <Math mode="inline" tex="\Theta" text="Theta" xml:id="S7.SS1.SSS3.p2.m1">
              <XMath>
                <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
              </XMath>
            </Math> is smaller than the <Math mode="inline" tex="\mathcal{X}\times\mathcal{U}" text="X * U" xml:id="S7.SS1.SSS3.p2.m2">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">×</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">X</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">U</XMTok>
                </XMApp>
              </XMath>
            </Math> space, with the same implications as already discussed in Section <ref labelref="LABEL:sec:spaces"/>.</p>
        </para>
        <para xml:id="S7.SS1.SSS3.p3">
          <p>Incidentally, all the recent deep RL methods use action perturbation, sometimes together with a Kullback-Leibler divergence constraint to ensure natural gradient updates, as is the case in <text font="smallcaps">trpo</text> and <text font="smallcaps">acer</text>.</p>
        </para>
        <para xml:id="S7.SS1.SSS3.p4">
          <p>Parameter perturbation has nevertheless also been studied as an exploration mechanism for deep RL <cite class="ltx_citemacro_citep">(<bibref bibrefs="fortunato2017noisy,plappert2017parameter" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.</p>
        </para>
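        <para>
          <p>The contrast between the two exploration strategies can be illustrated with a minimal Python sketch. The linear controller, the fixed list of visited states and the function names are assumptions made for illustration only; in a real rollout the visited states would depend on the actions taken.</p>
        </para>
```python
import random


def linear_policy(theta, state):
    """Deterministic linear controller: action = theta . state."""
    return sum(t * s for t, s in zip(theta, state))


def explore_with_action_noise(theta, states, sigma, rng):
    """Step-based exploration: independent Gaussian noise on every action."""
    return [linear_policy(theta, s) + rng.gauss(0.0, sigma) for s in states]


def explore_with_parameter_noise(theta, states, sigma, rng):
    """Episode-based exploration: perturb theta once, then act
    deterministically with the perturbed parameters for the whole episode."""
    noisy_theta = [t + rng.gauss(0.0, sigma) for t in theta]
    actions = [linear_policy(noisy_theta, s) for s in states]
    return noisy_theta, actions


if __name__ == "__main__":
    rng = random.Random(0)
    theta = [0.5, -0.2]
    # Hypothetical episode in which the first state is visited twice.
    states = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(explore_with_action_noise(theta, states, 0.1, rng))
    print(explore_with_parameter_noise(theta, states, 0.1, rng))
```
        <para>
          <p>With action noise, the same state visited twice generally yields two different actions, whereas with parameter noise the perturbed policy remains consistent within an episode.</p>
        </para>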
      </subsubsection>
      <subsubsection inlist="toc" labels="LABEL:sec:single" xml:id="S7.SS1.SSS4">
        <tags>
          <tag>7.1.4</tag>
          <tag role="autoref">subsubsection 7.1.4</tag>
          <tag role="refnum">7.1.4</tag>
          <tag role="typerefnum">§7.1.4</tag>
        </tags>
        <title><tag close=" ">7.1.4</tag>Single starting state versus multiple starting states</title>
        <para xml:id="S7.SS1.SSS4.p1">
          <p>Methods based on step-samples are conceptually designed for settings where trajectories start from various initial states.
Nevertheless, they can still be applied when there is a single initial state.</p>
        </para>
        <para xml:id="S7.SS1.SSS4.p2">
          <p>In the context of deep RL, using multiple starting states has been shown to be highly beneficial to the quality of the obtained policy in <cite class="ltx_citemacro_citep">(<bibref bibrefs="rajeswaran2017towards" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>, because it favors exploration and yields more general solutions.</p>
        </para>
      </subsubsection>
    </subsection>
    <subsection inlist="toc" labels="LABEL:sec:greedy_stoc" xml:id="S7.SS2">
      <tags>
        <tag>7.2</tag>
        <tag role="autoref">subsection 7.2</tag>
        <tag role="refnum">7.2</tag>
        <tag role="typerefnum">§7.2</tag>
      </tags>
      <title><tag close=" ">7.2</tag>Tuning a step size versus updating a covariance matrix</title>
      <para xml:id="S7.SS2.p2">
        <p>In derivative-based optimization methods, the step size has to be adjusted so that the search does not jump outside the current hill. In that respect, <text font="smallcaps">EDA</text>s using covariance matrix adaptation remove the need for step-size tuning because they adapt the search to the shape of the hill.</p>
      </para>
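      <para>
        <p>The sensitivity to the step size can be seen on a toy problem. The sketch below is an illustration under assumptions of our own choosing: a one-dimensional utility with a single hill at zero, and hypothetical function names. On such a utility, a step size that is too large makes each update overshoot the top of the hill by more than the previous distance to it.</p>
      </para>
```python
def gradient_ascent(grad, theta, step_size, n_steps):
    """Plain gradient ascent with a constant step size."""
    for _ in range(n_steps):
        theta = theta + step_size * grad(theta)
    return theta


def grad_utility(theta):
    """Gradient of the toy utility J(theta) = -theta**2 (single hill at 0)."""
    return -2.0 * theta


if __name__ == "__main__":
    # Small step: the iterate contracts towards the optimum at 0.
    print(gradient_ascent(grad_utility, 1.0, 0.1, 50))
    # Large step: each update multiplies theta by -1.2, so the search diverges.
    print(gradient_ascent(grad_utility, 1.0, 1.1, 50))
```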
      <para xml:id="S7.SS2.p3">
        <p>In the case of methods using step-samples, the picture is more complex. Some derivative-based optimization methods such as e<text font="smallcaps">nac</text> use a constant step-size, which is not efficient <cite class="ltx_citemacro_citep">(<bibref bibrefs="riedmiller08evaluation" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S7.SS2.p5">
        <p>Other derivative-based optimization methods like EM-based optimization and deep RL methods remove the need for tuning a step-size by calling upon the same kind of stochastic search as derivative-free optimization methods, or by using a bound on the Kullback-Leibler divergence. This latter approach combines the sample efficiency of <text font="italic">greedy</text> optimization using natural gradient optimization with a principled way to tune the step size. The most representative algorithms in that respect are <text font="smallcaps">trpo</text> and <text font="smallcaps">acer</text>.</p>
      </para>
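      <para>
        <p>For reference, the bounded update can be written as follows, in the form commonly given for <text font="smallcaps">trpo</text>. The bound <Math mode="inline" tex="\delta" text="delta"><XMath><XMTok name="delta" role="UNKNOWN">δ</XMTok></XMath></Math> and the advantage estimate are notation assumed here rather than taken from the rest of this survey. Under a linear approximation of the objective with gradient <text font="bold">g</text> and a quadratic approximation of the constraint, the update is a natural gradient step whose length is set by the bound rather than by a hand-tuned step size.</p>
      </para>
```latex
\[
\max_{\bm{\theta}'} \;
\mathbb{E}_{x,u \sim \pi_{\bm{\theta}}}
\left[ \frac{\pi_{\bm{\theta}'}(u|x)}{\pi_{\bm{\theta}}(u|x)} \, \hat{A}(x,u) \right]
\quad \text{subject to} \quad
\mathbb{E}_{x}\left[ \mathrm{KL}\left( \pi_{\bm{\theta}}(\cdot|x) \,\|\, \pi_{\bm{\theta}'}(\cdot|x) \right) \right] \leq \delta
\]
\[
\mathrm{KL} \approx \tfrac{1}{2} (\bm{\theta}'-\bm{\theta})^{\top} \mathbf{F} \, (\bm{\theta}'-\bm{\theta})
\quad \Longrightarrow \quad
\bm{\theta}' = \bm{\theta} + \sqrt{\frac{2\delta}{\mathbf{g}^{\top} \mathbf{F}^{-1} \mathbf{g}}} \; \mathbf{F}^{-1} \mathbf{g}
\]
```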
      <para xml:id="S7.SS2.p6">
        <p>In some algorithms, following the terminology of <cite class="ltx_citemacro_citep">(<bibref bibrefs="deisenroth2013survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, the exploration policy is truly an “upper-level policy” that takes some local context as input. The corresponding algorithms have been identified as “advanced” with a star symbol in Table <ref labelref="LABEL:tab:classif1"/>, but the topic is not covered here. An exception is the <text font="smallcaps">gps</text> algorithm whose advanced exploration comes from the integration of guiding samples.</p>
      </para>
    </subsection>
    <subsection inlist="toc" labels="LABEL:sec:incr_seq" xml:id="S7.SS3">
      <tags>
        <tag>7.3</tag>
        <tag role="autoref">subsection 7.3</tag>
        <tag role="refnum">7.3</tag>
        <tag role="typerefnum">§7.3</tag>
      </tags>
      <title><tag close=" ">7.3</tag>Incremental versus iterative improvement</title>
      <para xml:id="S7.SS3.p1">
        <p>The distinction between iterative methods, which collect a batch of samples for computing a new policy,
and incremental methods, which update a critic in parallel to the policy, is central to our survey.
Importantly, all methods using episodic-samples are iterative. In that respect, we have pointed out that they share many similarities with iterative methods based on step-samples.
Furthermore, most methods covered in the previous episodic policy search surveys were iterative, whereas most deep RL methods are incremental.</p>
      </para>
      <para xml:id="S7.SS3.p2">
        <p>On the one hand, iterative methods are more stable. Indeed, in the incremental context, if one does not wait for convergence of the critic before applying derivative-based optimization on the policy,
then utility estimation depends on policy improvement and <text font="italic">vice versa</text>, leading to a potential divergence of the policy search process.
Thus iterative methods are adapted to the context of off-line processing of a new policy, when the agent alternates between periods of activity and periods of improvement, as in the DREAM project (see Section <ref labelref="LABEL:sec:spaces"/>).</p>
      </para>
      <para xml:id="S7.SS3.p3">
        <p>On the other hand, in the context of on-line learning, incremental updates of the critic and the policy can lead to a much better sample reuse than iterative updates.
Indeed, updating a policy as soon as possible can be more sample efficient because a better policy generates better step-samples, which may result in further improvement in the current policy.</p>
      </para>
      <para xml:id="S7.SS3.p4">
        <p>Thus, from our perspective, by providing more accurate non-linear estimation methods and several tricks to stabilize the update of the critic, deep RL methods have contributed greatly to the emergence of incremental methods, resulting in a better sample efficiency of episodic policy search methods.</p>
      </para>
      <para xml:id="S7.SS3.p5">
        <p>A side message from our survey is that the formerly well established actor-critic versus direct policy gradient distinction is not the most relevant one when one considers sample efficiency questions. Indeed, <text font="smallcaps">nac</text> and e<text font="smallcaps">nac</text> are classified as actor-critic but they are iterative methods which compute a new critic at each iteration, in sharp contrast with the more recent incremental deep RL methods.</p>
      </para>
    </subsection>
    <subsection inlist="toc" labels="LABEL:sec:onp_ofp" xml:id="S7.SS4">
      <tags>
        <tag>7.4</tag>
        <tag role="autoref">subsection 7.4</tag>
        <tag role="refnum">7.4</tag>
        <tag role="typerefnum">§7.4</tag>
      </tags>
      <title><tag close=" ">7.4</tag>Off-policy versus on-policy updates</title>
      <para xml:id="S7.SS4.p1">
        <p>In on-policy methods like <text font="smallcaps">Sarsa</text>, the critic estimates the utility of the <text font="italic">current</text> policy, whereas in off-policy methods like <text font="smallcaps">q-learning</text>, it estimates the utility of an <text font="italic">optimal</text> policy. As a consequence, in the on-policy case the samples used to learn the critic must come from the current policy itself whereas in the off-policy case, they can come from any policy.</p>
      </para>
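      <para>
        <p>The difference shows up directly in the bootstrap target used to update the critic. The following Python sketch assumes a tabular critic stored as a dictionary indexed by (state, action) pairs; the function and variable names are hypothetical.</p>
      </para>
```python
def sarsa_target(q, reward, next_state, next_action, gamma):
    """On-policy target: bootstraps on the action actually chosen
    by the current policy in the next state."""
    return reward + gamma * q[(next_state, next_action)]


def q_learning_target(q, reward, next_state, actions, gamma):
    """Off-policy target: bootstraps on the greedy action in the next
    state, whatever action the behavior policy takes there."""
    return reward + gamma * max(q[(next_state, a)] for a in actions)


if __name__ == "__main__":
    q = {("s1", "left"): 1.0, ("s1", "right"): 3.0}
    print(sarsa_target(q, 0.5, "s1", "left", 0.9))
    print(q_learning_target(q, 0.5, "s1", ["left", "right"], 0.9))
```
      <para>
        <p>Only the first target depends on the action sampled from the current policy, which is why its samples must come from that policy.</p>
      </para>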
      <para xml:id="S7.SS4.p2">
        <p>In most iterative policy gradient methods, the samples are discarded from one iteration to the next and these methods are generally on-policy.
By contrast, incremental methods using a replay buffer are off-policy, while those which do not use one are generally on-policy, as is the case of <text font="smallcaps">a3c</text>.
Besides, using importance sampling is a well known method to turn an on-policy method into an off-policy one, see e.g. <cite class="ltx_citemacro_citep">(<bibref bibrefs="jie2010connection" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
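      <para>
        <p>As a minimal sketch of the idea, the function below reweights returns collected under a behavior policy so that their average estimates the expected return under a different target policy. It covers the single-decision case only, with discrete actions and known action probabilities, all of which are simplifying assumptions; for whole trajectories the weight would be a product of per-step ratios.</p>
      </para>
```python
def importance_weighted_return(samples, target_prob, behavior_prob):
    """Ordinary importance sampling estimate of the expected return under
    the target policy, from (action, return) pairs drawn from the behavior
    policy. Both policies are given as action -> probability dictionaries."""
    total = 0.0
    for action, ret in samples:
        weight = target_prob[action] / behavior_prob[action]
        total += weight * ret
    return total / len(samples)


if __name__ == "__main__":
    # Hypothetical data: a uniform behavior policy tried each action equally often.
    samples = [("a", 1.0), ("b", 0.0)] * 50
    behavior = {"a": 0.5, "b": 0.5}
    target = {"a": 0.9, "b": 0.1}
    print(importance_weighted_return(samples, target, behavior))
```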
      <para xml:id="S7.SS4.p3">
        <p>When learning a critic incrementally, using off-policy updates is more flexible because the samples can come from any policy, but these off-policy updates introduce bias in the estimation of the critic.
As a result, off-policy methods like <text font="smallcaps">ddpg</text> and <text font="smallcaps">naf</text> are more sample efficient because they use a replay buffer, but they are also more prone to divergence.
In that respect, a key contribution of <text font="smallcaps">acer</text> and <text font="smallcaps">Q-prop</text> is that they provide an off-policy, sample efficient update method which strongly controls the bias, resulting in more stability.</p>
      </para>
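      <para>
        <p>The sample reuse enabled by a replay buffer can be sketched as follows. This is a generic illustration rather than the implementation of any of the cited algorithms: the class name, the uniform sampling scheme and the fixed capacity are assumptions. The soft target update shown alongside is one common way to stabilize an incrementally learned critic.</p>
      </para>
```python
import random
from collections import deque


class ReplayBuffer:
    """Fixed-capacity store of transitions; the oldest ones are dropped first."""

    def __init__(self, capacity, seed=None):
        self._storage = deque(maxlen=capacity)
        self._rng = random.Random(seed)

    def add(self, transition):
        self._storage.append(transition)

    def sample(self, batch_size):
        """Uniformly sample stored transitions, whichever policy produced them."""
        k = min(batch_size, len(self._storage))
        return self._rng.sample(list(self._storage), k)

    def __len__(self):
        return len(self._storage)


def soft_update(target_params, online_params, tau):
    """Move the target parameters a fraction tau towards the online ones."""
    return [(1.0 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


if __name__ == "__main__":
    buffer = ReplayBuffer(capacity=3, seed=0)
    for step in range(5):
        buffer.add(step)  # integers stand in for (x, u, r, x') tuples
    print(len(buffer), buffer.sample(2))
    print(soft_update([0.0, 0.0], [1.0, 2.0], 0.5))
```
      <para>
        <p>Because each stored transition can be drawn many times, every interaction with the system contributes to several critic updates, at the price of learning from samples generated by older policies.</p>
      </para>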
    </subsection>
    <subsection inlist="toc" labels="LABEL:sec:final_drl" xml:id="S7.SS5">
      <tags>
        <tag>7.5</tag>
        <tag role="autoref">subsection 7.5</tag>
        <tag role="refnum">7.5</tag>
        <tag role="typerefnum">§7.5</tag>
      </tags>
      <title><tag close=" ">7.5</tag>Higher sample efficiency of recent deep RL methods</title>
      <para xml:id="S7.SS5.p1">
        <p>In former actor-critic methods like <text font="smallcaps">nac</text> and e<text font="smallcaps">nac</text>, a linear architecture was used for the critic, which facilitates some calculations but can also result in a poor approximation of utility in the state-action space. This can in turn prevent optimization of the policy.
In that respect, as shown in the first column of Table <ref labelref="LABEL:tab:classif2"/>, a key contribution of deep RL algorithms is that they brought efficient techniques to perform non-linear approximation of the critic.</p>
      </para>
      <subsubsection inlist="toc" labels="LABEL:sec:ngo_vg" xml:id="S7.SS5.SSS1">
        <tags>
          <tag>7.5.1</tag>
          <tag role="autoref">subsubsection 7.5.1</tag>
          <tag role="refnum">7.5.1</tag>
          <tag role="typerefnum">§7.5.1</tag>
        </tags>
        <title><tag close=" ">7.5.1</tag>Natural gradient versus vanilla gradient</title>
        <para xml:id="S7.SS5.SSS1.p1">
          <p>As noted in Section <ref labelref="LABEL:sec:ngo"/>, natural gradient optimization is generally more sample efficient than vanilla gradient optimization <cite class="ltx_citemacro_citep">(<bibref bibrefs="grondman2012survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.</p>
        </para>
        <para xml:id="S7.SS5.SSS1.p2">
          <p>The counterpart is that, in principle, it requires computing or estimating <Math mode="inline" tex="\mathbf{\bm{F}}" text="F" xml:id="S7.SS5.SSS1.p2.m1">
              <XMath>
                <XMTok font="bold" role="UNKNOWN">F</XMTok>
              </XMath>
            </Math> and inverting it, or directly estimating the inverse.
Thus, it is a way to improve sample efficiency through more expensive computations.
However, many episodic policy search methods manage to perform exact or approximate natural gradient optimization without calling upon the computation of <Math mode="inline" tex="\mathbf{\bm{F}}^{-1}" text="F ^ (- 1)" xml:id="S7.SS5.SSS1.p2.m2">
              <XMath>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">F</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>.</p>
        </para>
        <para xml:id="S7.SS5.SSS1.p3">
          <p>In the family of derivative-free optimization methods, this is the case of <text font="smallcaps">cma-es</text>, <text font="smallcaps">pi<Math mode="inline" tex="{}^{\mbox{\tiny BB}}" text="^[BB]" xml:id="S7.SS5.SSS1.p3.m1">
                <XMath>
                  <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                    <XMText><text fontsize="50%">BB</text></XMText>
                  </XMApp>
                </XMath>
              </Math></text> and <text font="smallcaps">pi<Math mode="inline" tex="{}^{2}" text="^2" xml:id="S7.SS5.SSS1.p3.m2">
                <XMath>
                  <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                    <XMTok font="upright" fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                  </XMApp>
                </XMath>
              </Math>-cma</text>, which perform approximate natural gradient optimization by using reward-weighted averaging (see <cite class="ltx_citemacro_citep">(<bibref bibrefs="akimoto2010bidirectional,glasmachers2010exponential" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite> for more details). By contrast, <text font="smallcaps">nes</text> and x<text font="smallcaps">nes</text> explicitly approximate <Math mode="inline" tex="\mathbf{\bm{F}}^{-1}" text="F ^ (- 1)" xml:id="S7.SS5.SSS1.p3.m3">
              <XMath>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">F</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>.</p>
        </para>
        <para xml:id="S7.SS5.SSS1.p4">
          <p>In the family of linear actor-critic methods, this is the case of <text font="smallcaps">nac</text>, e<text font="smallcaps">nac</text> and three i<text font="smallcaps">nac</text> algorithms, where it has been shown that using compatible features and estimating the advantage function as a critic directly results in performing natural gradient optimization on the actor without the need for computing <Math mode="inline" tex="\mathbf{\bm{F}}^{-1}" text="F ^ (- 1)" xml:id="S7.SS5.SSS1.p4.m1">
              <XMath>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="bold" role="UNKNOWN">F</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math> <cite class="ltx_citemacro_citep">(<bibref bibrefs="peters2008reinforcement" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>. By contrast, the fourth i<text font="smallcaps">nac</text> algorithm estimates <Math mode="inline" tex="\mathbf{\bm{F}}" text="F" xml:id="S7.SS5.SSS1.p4.m2">
              <XMath>
                <XMTok font="bold" role="UNKNOWN">F</XMTok>
              </XMath>
            </Math> and inverts it.</p>
        </para>
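        <para>
          <p>The argument behind the compatible-features result can be summarized in three lines. The notation below (advantage function <Math mode="inline" tex="A" text="A"><XMath><XMTok font="italic" role="UNKNOWN">A</XMTok></XMath></Math>, critic weights <text font="bold">w</text>) is assumed for this sketch, and the last equality holds when <text font="bold">w</text> minimizes the squared error of the linear advantage approximation.</p>
        </para>
```latex
\[
\nabla_{\bm{\theta}} J(\bm{\theta})
= \mathbb{E}\left[ \nabla_{\bm{\theta}} \log \pi_{\bm{\theta}}(u|x) \, A(x,u) \right]
\]
\[
A(x,u) \approx \mathbf{w}^{\top} \nabla_{\bm{\theta}} \log \pi_{\bm{\theta}}(u|x)
\]
\[
\nabla_{\bm{\theta}} J(\bm{\theta})
= \mathbb{E}\left[ \nabla_{\bm{\theta}} \log \pi_{\bm{\theta}}(u|x) \,
  \nabla_{\bm{\theta}} \log \pi_{\bm{\theta}}(u|x)^{\top} \right] \mathbf{w}
= \mathbf{F} \mathbf{w}
\quad \Longrightarrow \quad
\mathbf{F}^{-1} \nabla_{\bm{\theta}} J(\bm{\theta}) = \mathbf{w}
\]
```
        <para>
          <p>The critic weights therefore directly give the natural gradient direction, so that neither the Fisher information matrix nor its inverse needs to be formed.</p>
        </para>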
        <para xml:id="S7.SS5.SSS1.p5">
          <p>In the family of deep RL methods, this is also the case of <text font="smallcaps">naf</text> and <text font="smallcaps">a3c</text>, which learn a model of the advantage function, of <text font="smallcaps">trpo</text> and <text font="smallcaps">acer</text>, which constrain the exploration with a bound on the Kullback-Leibler divergence, and of <text font="smallcaps">Q-prop</text>, which uses compatible features. By contrast, <text font="smallcaps">ddpg</text> still relies on the vanilla gradient.</p>
        </para>
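        <para>
          <p>For comparison, the explicit route that these methods avoid can be sketched in a few lines of Python. The example is restricted to two parameters so that the linear system can be solved by hand; the Fisher matrix estimate, the gradient and the function names are hypothetical inputs chosen for illustration.</p>
        </para>
```python
def solve_2x2(matrix, rhs):
    """Solve matrix * x = rhs for a 2x2 matrix using Cramer's rule."""
    (a, b), (c, d) = matrix
    det = a * d - b * c
    if det == 0:
        raise ValueError("Fisher matrix estimate is singular")
    return [(d * rhs[0] - b * rhs[1]) / det, (a * rhs[1] - c * rhs[0]) / det]


def vanilla_gradient_step(theta, grad, step_size):
    """Follow the gradient as it is."""
    return [t + step_size * g for t, g in zip(theta, grad)]


def natural_gradient_step(theta, grad, fisher, step_size):
    """Follow F^{-1} grad, obtained by solving F x = grad."""
    direction = solve_2x2(fisher, grad)
    return [t + step_size * d for t, d in zip(theta, direction)]


if __name__ == "__main__":
    fisher = [[4.0, 0.0], [0.0, 1.0]]  # the policy is more sensitive to theta[0]
    grad = [4.0, 1.0]
    print(vanilla_gradient_step([0.0, 0.0], grad, 0.5))
    print(natural_gradient_step([0.0, 0.0], grad, fisher, 0.5))
```
        <para>
          <p>In this toy case the natural gradient rescales the step so that both parameters move by the same amount, whereas the vanilla step is dominated by the more sensitive parameter. With many parameters, estimating and solving with the Fisher matrix becomes the dominant cost.</p>
        </para>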
      </subsubsection>
      <subsubsection inlist="toc" labels="LABEL:mes:drl LABEL:sec:local_global" xml:id="S7.SS5.SSS2">
        <tags>
          <tag>7.5.2</tag>
          <tag role="autoref">subsubsection 7.5.2</tag>
          <tag role="refnum">7.5.2</tag>
          <tag role="typerefnum">§7.5.2</tag>
        </tags>
        <title><tag close=" ">7.5.2</tag>Global search versus local search</title>
        <para xml:id="S7.SS5.SSS2.p1">
          <p>Policy search is generally local, but learning a surrogate model of the utility function opens the way to global improvement, at the cost of a more expensive search for the optimum. Thus global search can only outperform local search if there are several local optima.
Actually, the number of local optima may strongly depend on the size of the parameter space, and one may hypothesize that, the larger the parameter space, the fewer local optima <cite class="ltx_citemacro_citep">(<bibref bibrefs="kawaguchi2016deep" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.</p>
        </para>
        <para xml:id="S7.SS5.SSS2.p2">
          <p>A method like <text font="smallcaps">trpo</text> is designed so that the performance always increases, thus it is unable to switch from one local optimum to another.
The fact that <text font="smallcaps">trpo</text> performs well <cite class="ltx_citemacro_citep">(<bibref bibrefs="duan2016benchmarking" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite> on several classical control problems suggests that, for a large enough <Math mode="inline" tex="\Theta" text="Theta" xml:id="S7.SS5.SSS2.p2.m1">
              <XMath>
                <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
              </XMath>
            </Math> space, the utility functions in such problems only present one local optimum in <Math mode="inline" tex="\Theta" text="Theta" xml:id="S7.SS5.SSS2.p2.m2">
              <XMath>
                <XMTok name="Theta" role="UNKNOWN">Θ</XMTok>
              </XMath>
            </Math>, or several equivalent local optima. Whether this is true of more complex problems needs to be further investigated.</p>
        </para>
        <para xml:id="S7.SS5.SSS2.p3">
          <p><text font="bold">Message 23:</text>
Deep RL methods, which learn a surrogate model and make massive use of derivative-based optimization of their deep neural network representations of the critic and the policy, are very sample efficient.</p>
        </para>
      </subsubsection>
    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:conclu" xml:id="S8">
    <tags>
      <tag>8</tag>
      <tag role="autoref">section 8</tag>
      <tag role="refnum">8</tag>
      <tag role="typerefnum">§8</tag>
    </tags>
    <title><tag close=" ">8</tag>Conclusion</title>
    <para xml:id="S8.p1">
      <p>In <cite class="ltx_citemacro_citep">(<bibref bibrefs="stulp13paladyn" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, the authors have shown that episodic policy search applied to robotics was shifting from derivative-based actor-critic methods to <text font="smallcaps">EDA</text>s.
Part of this shift was due to the use of open-loop <text font="smallcaps">dmp</text>s <cite class="ltx_citemacro_citep">(<bibref bibrefs="ijspeert2013dynamical" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> as a policy representation, but another part resulted from the higher efficiency of <text font="smallcaps">EDA</text>s at that time.</p>
    </para>
    <para xml:id="S8.p2">
      <p>The most salient drawback of the former actor-critic methods was that they could not estimate the critic accurately due to its linear architecture. In addition, e<text font="smallcaps">nac</text> suffered from its constant step size, while <text font="smallcaps">PoWER</text> and <text font="smallcaps">pi<Math mode="inline" tex="{}^{2}" text="^2" xml:id="S8.p2.m1">
            <XMath>
              <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
                <XMTok font="upright" fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
              </XMApp>
            </XMath>
          </Math></text> removed the step size but at the price of the higher computational cost and variance of Monte Carlo estimation.</p>
    </para>
    <para xml:id="S8.p3">
      <p>The emergence of deep RL methods has completely changed this picture. The better approximation capability of non-linear critics and the incorporation of an adaptive step size in derivative-based optimization have renewed interest in actor-critic architectures, which reduce the variance of policy gradient estimation at the price of some controlled bias and which can learn incrementally.
The target network trick has also mitigated the intrinsic instability of approximating a critic.
Finally, using a replay buffer has dramatically improved the sample efficiency of the methods.</p>
    </para>
<!--  %All these steps completely change the general picture with respect to the situation at the time when the previous surveys about “epse for robotics were published. 
     %**** sample˙efficiency.tex Line 1875 ****-->    <para xml:id="S8.p4">
      <p>So far, the focus of empirical comparisons has been more on final performance of the learned controller than on sample efficiency of the learning process.
For instance, an empirical comparison of some of the algorithms studied here is presented in <cite class="ltx_citemacro_citep">(<bibref bibrefs="duan2016benchmarking" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, but only final performance is compared and the reasons why one algorithm performs better than another are not analyzed in detail.</p>
    </para>
    <para xml:id="S8.p5">
      <p>More recently, the higher sample efficiency of deep RL methods with respect to <text font="smallcaps">EDA</text>s has started to be empirically studied in <cite class="ltx_citemacro_citep">(<bibref bibrefs="broissia2016actor" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Another work shows that derivative-free optimization methods remain competitive in terms of final performance, but that they are less sample efficient by about one order of magnitude <cite class="ltx_citemacro_citep">(<bibref bibrefs="salimans2017evolution" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Interestingly, this recent work started to investigate in which application contexts <text font="smallcaps">EDA</text>s are a competitive alternative to deep RL methods, an effort that should be pursued in the future.</p>
    </para>
    <para xml:id="S8.p6">
      <p><cite class="ltx_citemacro_citep">(<bibref bibrefs="islam2017reproducibility" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite></p>
    </para>
    <para xml:id="S8.p7">
      <p>In most of these previous empirical studies, a broad conceptual analysis centered on the algorithms was missing. The goal of this paper was to lay the foundations for such an analysis.</p>
    </para>
    <para xml:id="S8.p8">
      <p>Beyond the algorithms covered here, current efforts in the domain focus on improving the stability of the policy search process while better controlling the bias-variance compromise <cite class="ltx_citemacro_citep">(<bibref bibrefs="gu2016q,wang2016sample" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>.
Another line of thought uses dedicated off-line processes to build higher level discrete state and action representations from the policy improvement process, so as to shift from episodic policy search to solving discrete RL problems <cite class="ltx_citemacro_citep">(<bibref bibrefs="zimmer2017bootstrapping" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>.</p>
    </para>
    <para xml:id="S8.p9">
      <p>In this paper, we have given a broad overview of episodic policy search methods. In the future it would be useful to build on this general perspective to write a survey focusing more specifically on the most recent deep RL methods and their mathematical derivation in connection to previous policy search work. Such a more technical survey is currently missing.</p>
    </para>
    <para xml:id="S8.p10">
      <p>Besides, we have restricted our study to single-task episodic policy search methods.
Beyond episodic policy search, several research lines are making fast progress on the more general lifelong learning context, where the agent must learn how to achieve a growing number of tasks throughout its lifetime. Among other things, addressing the lifelong learning context would require specific studies of <text font="italic">active learning</text>, <text font="italic">transfer learning</text> and <text font="italic">multi-task learning</text> mechanisms, which have been left for future work.</p>
    </para>
  </section>
  <section xml:id="Sx1">
    <title>Acknowledgments</title>
    <para xml:id="Sx1.p1">
      <p>This work was supported by the European Commission, within the DREAM project, and has received funding from the European Union’s Horizon 2020 research and innovation programme under grant agreement No 640891. We thank David Filliat for valuable feedback about an earlier version of this article.</p>
    </para>
<!--  %“bibliographystyle–IEEEtran˝ 
     %“bibliographystyle–arxiv˝
     %**** sample˙efficiency.tex Line 1900 ****-->  </section>
  <bibliography citestyle="authoryear" files="/home/sigaud/Bureau/sigaud/Latex/Biblio/motor_learning,/home/sigaud/Bureau/sigaud/Latex/Biblio/rl,/home/sigaud/Bureau/sigaud/Latex/Biblio/perso,/home/sigaud/Bureau/sigaud/Latex/Biblio/continuous_rl,/home/sigaud/Bureau/sigaud/Latex/Biblio/robot_learning,/home/sigaud/Bureau/sigaud/Latex/Biblio/deep,/home/sigaud/Bureau/sigaud/Latex/Biblio/philo,/home/sigaud/Bureau/sigaud/Latex/Biblio/mabiblio,/home/sigaud/Bureau/sigaud/Latex/Biblio/curiosity" xml:id="bib">
    <title>References</title>
  </bibliography>
</document>
