<?xml version="1.0" encoding="UTF-8"?>
<?latexml searchpaths="/home/japhy/scienceReplication.artiswrong.com/paper_files/arxiv/2006.12917/latex_extracted"?>
<?latexml class="aamas" options="sigconf"?>
<?latexml RelaxNGSchema="LaTeXML"?>
<?latexml package="balance"?>
<?latexml package="amsmath"?>
<?latexml package="wrapfig"?>
<?latexml package="mathtools"?>
<document xmlns="http://dlmf.nist.gov/LaTeXML" class="ltx_authors_1line">
  <resource src="LaTeXML.css" type="text/css"/>
  <resource src="ltx-article.css" type="text/css"/>
  <resource src="ltx-amsart.css" type="text/css"/>
  <title>Show me the Way:<break/>Intrinsic Motivation from Demonstrations</title>
  <toctitle>Show me the Way</toctitle>
  <creator role="author">
    <personname>Léonard Hussenot</personname>
  </creator>
  <creator before=", " role="author">
    <personname>Robert Dadashi</personname>
  </creator>
  <creator before=", " role="author">
    <personname>Matthieu Geist</personname>
  </creator>
  <creator before=" and " role="author">
    <personname>Olivier Pietquin</personname>
  </creator>
  <abstract name="Abstract.">
    <p>The study of exploration in decision making has a long history but remains actively debated. From the vast literature that has addressed this topic for decades from various points of view (<text font="italic">e.g.</text>, developmental psychology, experimental design, artificial intelligence), intrinsic motivation has emerged as a concept that can practically be transferred to artificial agents. In particular, in the recent field of Deep Reinforcement Learning (RL), agents implement this concept (mainly using a novelty argument) in the form of an exploration bonus, added to the task reward, that encourages visiting the whole environment. This approach is supported by the large body of RL theory in which convergence to optimality assumes exhaustive exploration. Yet humans and other mammals do not exhaustively explore the world, and their motivation is based not only on novelty but also on various other factors (<text font="italic">e.g.</text>, curiosity, fun, style, pleasure, safety, competition, <text font="italic">etc.</text>). They optimize for life-long learning and learn transferable skills in playgrounds without obvious goals. They also apply innate or learned priors to save time and stay safe. For these reasons, we propose to learn an exploration bonus from demonstrations, which could transfer these motivations to an artificial agent with few assumptions about their rationale. Using an inverse RL approach, we show that complex exploration behaviors, reflecting different motivations, can be learnt and efficiently used by RL agents to solve tasks for which exhaustive exploration is prohibitive.</p>
  </abstract>
  <para xml:id="p1">
    <p>Proc. of the 20th International Conference on Autonomous Agents and Multiagent Systems (AAMAS 2021), May 3–7, 2021, Online, U. Endriss, A. Nowé, F. Dignum, A. Lomuscio (eds.).<break/>
Affiliations:<break/>
Google Research, Brain Team; Univ. Lille, CNRS, Inria Scool, UMR 9189 CRIStAL<break/>
Google Research, Brain Team<break/>
Google Research, Brain Team<break/>
Google Research, Brain Team</p>
  </para>
  <section inlist="toc" xml:id="S1">
    <tags>
      <tag>1</tag>
      <tag role="autoref">section 1</tag>
      <tag role="refnum">1</tag>
      <tag role="typerefnum">§1</tag>
    </tags>
    <title><tag close=". ">1</tag>Introduction</title>
    <toctitle><tag close=" ">1</tag>Introduction</toctitle>
    <para xml:id="S1.p1">
      <p>Intrinsic motivation <cite class="ltx_citemacro_citep">(<bibref bibrefs="deci2010intrinsic" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> has emerged as one explanation for humans’ and animals’ impressive learning capabilities. Steered by the need to explore their environment (whether this need is satiable or not has been a fierce debate in the behavioral psychological community <cite class="ltx_citemacro_citep">(<bibref bibrefs="hunt1965intrinsic" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>), they are able to discover near-optimal strategies in very efficient ways. Designing artificial agents presenting such capabilities is a central goal of modern artificial intelligence and Reinforcement Learning (RL) is a popular candidate to do so.
RL has addressed a variety of sequential-decision-making problems whether in games <cite class="ltx_citemacro_citep">(<bibref bibrefs="tesauro1995temporal,mnih2015human,silver2016mastering" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> or robotics <cite class="ltx_citemacro_citep">(<bibref bibrefs="abbeel2004apprenticeship,andrychowicz2020learning" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Nevertheless, some simple problems remain unsolved. Current state-of-the-art methods struggle to find good policies in environments (1) where constant negative rewards may discourage the agent from exploring (<text font="italic">e.g.</text>, the <text font="italic">Pitfall!</text> game from Atari), (2) where the reward is so sparse that the agent never encounters any (<text font="italic">e.g.</text>, the <text font="italic">Montezuma’s Revenge</text> Atari game), or (3) where the state and action spaces are large (<text font="italic">e.g.</text>, text worlds). These tasks nevertheless remain fairly easy for humans.
To tackle these specific problems, reward bonuses inspired by animal curiosity were proposed to steer the agent’s exploration <cite class="ltx_citemacro_citep">(<bibref bibrefs="csimcsek2006intrinsic,strehl2008analysis" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Although many intrinsic bonuses have been proposed, the large majority rely on the same principle: rewarding <text font="italic">novelty</text>. These methods mostly differ in how they compute this notion of <text font="italic">newness</text>. Count-based methods do so by counting how often the agent has encountered a given state <cite class="ltx_citemacro_citep">(<bibref bibrefs="strehl2008analysis" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Pseudo-count methods <cite class="ltx_citemacro_citep">(<bibref bibrefs="bellemare2016unifying,ostrovski2017count" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> approximate counts in large state spaces. Prediction error is also used to measure novelty, either by computing the agent’s ability to predict the future <cite class="ltx_citemacro_citep">(<bibref bibrefs="pathak2017curiosity" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> or random statistics about the current state <cite class="ltx_citemacro_citep">(<bibref bibrefs="burda2018exploration" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Some restrict novelty to state-action pairs that have an impact on the agent <cite class="ltx_citemacro_citep">(<bibref bibrefs="raileanu2020ride" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> or derive empowerment metrics <cite class="ltx_citemacro_citep">(<bibref bibrefs="mohamed2015variational" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> using mutual information. All these methods naturally encourage the discovery of new states through exhaustive exploration. Yet, in most realistic environments, exhaustive exploration is (1) not feasible due to the size of the state-action space, and (2) not desirable, as most behaviors are unlikely to be relevant to the task at hand.</p>
    </para>
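    <para>
      <p>The count-based principle described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited works: the class name <text font="typewriter">CountBonus</text>, the scale <text font="typewriter">beta</text> and the inverse-square-root decay are assumptions chosen for illustration, and states are assumed to be hashable (for instance grid coordinates).</p>
      <verbatim>
```python
import math
from collections import defaultdict


class CountBonus:
    """Tabular count-based exploration bonus (illustrative sketch only)."""

    def __init__(self, beta: float = 0.1):
        self.beta = beta                  # bonus scale (hypothetical value)
        self.counts = defaultdict(int)    # N(s): number of visits per state

    def __call__(self, state) -> float:
        """Record one visit to `state` and return beta / sqrt(N(state))."""
        self.counts[state] += 1
        return self.beta / math.sqrt(self.counts[state])


bonus = CountBonus(beta=1.0)
first = bonus((0, 0))    # first visit to a state: full bonus
second = bonus((0, 0))   # same state again: the bonus shrinks
other = bonus((0, 1))    # a state never seen before: full bonus again
```
      </verbatim>
      <p>Because the bonus only shrinks for states that were already visited, an agent maximizing it is pushed towards states it has not seen yet, which is the exhaustive-exploration tendency discussed in the text.</p>
    </para>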
    <para xml:id="S1.p2">
      <p>In contrast, the exploration behaviors of humans, and of mammals more generally, are governed by various motivations and constraints. Intelligent beings do not have unlimited time and energy. They optimize these resources to survive and reproduce, but also to have fun <cite class="ltx_citemacro_citep">(<bibref bibrefs="holloway2004children" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, to help others <cite class="ltx_citemacro_citep">(<bibref bibrefs="byrne1990machiavellian" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> or to satisfy their curiosity. <cite class="ltx_citemacro_citet"><bibref bibrefs="oudeyer2009intrinsic" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> distinguish between homeostatic motivations (which encourage staying in the “comfort zone” and generally correspond to desires that can be satiated) and heterostatic motivations (which push organisms out of equilibrium and cannot be satiated). These many desires shape the way organisms interact with their environment, encouraging them to discover new things but also to protect themselves by avoiding overly surprising events through mechanisms such as fear <cite class="ltx_citemacro_citep">(<bibref bibrefs="lang2000fear" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. <cite class="ltx_citemacro_citet"><bibref bibrefs="berseth2019smirl" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> illustrated how to exploit such priors by implementing a “homeostasis” objective for RL, thereby showing how different these priors can be from “novelty seeking”. Ultimately, resource constraints prevent organisms from exhaustively exploring their environment and push them to transfer knowledge from past experience. <cite class="ltx_citemacro_citet"><bibref bibrefs="hunt1965intrinsic" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> developed the idea of optimal incongruity: high novelty is not rewarded as much as intermediate-level novelty, suggesting that curiosity is tightly connected to fear. More recently, <cite class="ltx_citemacro_citet"><bibref bibrefs="kidd2012goldilocks" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> supported this hypothesis with experiments on children’s curiosity. Overall, novelty-based methods fail to model human curiosity correctly, as they assume that “the newer, the better”. This failure calls for a new way of defining our agent’s intrinsic motivations.</p>
    </para>
    <para xml:id="S1.p3">
      <p>In an arbitrary environment, exhaustive exploration is desirable and leads to convergence with theoretical guarantees <cite class="ltx_citemacro_citep">(<bibref bibrefs="strehl2009reinforcement" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. But when the environment presents some structure, one can transfer skills and priors from similar environments.
<cite class="ltx_citemacro_citet"><bibref bibrefs="dubey2018investigating" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> illustrated, in the case of simple video games, how human priors help us solve new problems. The authors highlight how humans struggle to play the same underlying video game when object semantics are changed, the physics is modified (<text font="italic">e.g.</text>, gravity is rotated), or visual similarities are transformed. Overall, they show how much of humans’ ability to solve a new game in a zero-shot manner is due to their priors about the environment.</p>
    </para>
    <para xml:id="S1.p4">
      <p>In this paper, instead of hard-coding what we think an agent’s motivations should be (<text font="italic">e.g.</text>, novelty), we propose to learn a bonus that captures these sources of motivation from demonstrations.
By adopting this approach, we expect to learn a bonus that implicitly helps reproduce a structured exploration behavior (<text font="italic">i.e.</text>, using priors from the demonstrations to reduce the search space) in lieu of an exhaustive one. We also argue that, to a certain extent at least, this can happen without the need for extra modelling inspired by cognitive or behavioral research.
To do so, we cast this problem as an inverse RL problem in which only a fraction of the reward optimized by the observed agent is hidden: the intrinsic motivation bonus. The task-related reward is still provided by the environment. We then build upon <cite class="ltx_citemacro_citet"><bibref bibrefs="klein2013cascaded" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> to propose a method that allows us to recover the intrinsic motivation from demonstrations.</p>
    </para>
    <para xml:id="S1.p5">
      <p>Our contributions are the following:</p>
      <enumerate xml:id="S1.I1">
        <item xml:id="S1.I1.i1">
          <tags>
            <tag>(1)</tag>
            <tag role="autoref">item 1</tag>
            <tag role="refnum">1</tag>
            <tag role="typerefnum">item 1</tag>
          </tags>
          <para xml:id="S1.I1.i1.p1">
            <p>a formulation that allows disentangling the reward optimized by a demonstrator from its intrinsic motivation bonus;</p>
          </para>
        </item>
        <item xml:id="S1.I1.i2">
          <tags>
            <tag>(2)</tag>
            <tag role="autoref">item 2</tag>
            <tag role="refnum">2</tag>
            <tag role="typerefnum">item 2</tag>
          </tags>
          <para xml:id="S1.I1.i2.p1">
            <p>an architecture, which we call “<text font="italic">Show me the Way</text>” (SmtW), based on a cascade of supervised learning methods, that extracts this exploration bonus from demonstrations;</p>
          </para>
        </item>
        <item xml:id="S1.I1.i3">
          <tags>
            <tag>(3)</tag>
            <tag role="autoref">item 3</tag>
            <tag role="refnum">3</tag>
            <tag role="typerefnum">item 3</tag>
          </tags>
          <para xml:id="S1.I1.i3.p1">
            <p>an empirical evaluation showing that SmtW is able to capture different exploration priors explained by various types of motivations.</p>
          </para>
        </item>
      </enumerate>
      <p>To evaluate SmtW, we validate a set of hypotheses in a controlled environment. We notably find that our method can learn structures and styles, transfer useful priors, and encourage long-term planning.</p>
    </para>
  </section>
  <section inlist="toc" xml:id="S2">
    <tags>
      <tag>2</tag>
      <tag role="autoref">section 2</tag>
      <tag role="refnum">2</tag>
      <tag role="typerefnum">§2</tag>
    </tags>
    <title><tag close=". ">2</tag>Background</title>
    <toctitle><tag close=" ">2</tag>Background</toctitle>
    <para xml:id="S2.p1">
      <p><text font="bold">Markov Decision Processes.</text> In Reinforcement Learning (RL), an agent learns to behave optimally through interactions with an environment. This is usually formalized as a Markov Decision Process (MDP) <cite class="ltx_citemacro_citep">(<bibref bibrefs="sutton2018reinforcement,puterman2014markov" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, a tuple <Math content-tex="(\States,\Actions,\Kernel,\R,\gamma)" mode="inline" tex="(\operatorname{S},\operatorname{A},\operatorname{P},\operatorname{R},\gamma)" text="vector@(States, Actions, Kernel, R, gamma)" xml:id="S2.p1.m1">
          <XMath>
            <XMDual>
              <XMApp>
                <XMTok meaning="vector"/>
                <XMRef idref="S2.p1.m1.1"/>
                <XMRef idref="S2.p1.m1.2"/>
                <XMRef idref="S2.p1.m1.3"/>
                <XMRef idref="S2.p1.m1.4"/>
                <XMRef idref="S2.p1.m1.5"/>
              </XMApp>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">(</XMTok>
                <XMDual role="OPFUNCTION" xml:id="S2.p1.m1.1">
                  <XMTok name="States" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">S</XMTok>
                </XMDual>
                <XMTok role="PUNCT">,</XMTok>
                <XMDual role="OPFUNCTION" xml:id="S2.p1.m1.2">
                  <XMTok name="Actions" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">A</XMTok>
                </XMDual>
                <XMTok role="PUNCT">,</XMTok>
                <XMDual role="OPFUNCTION" xml:id="S2.p1.m1.3">
                  <XMTok name="Kernel" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">P</XMTok>
                </XMDual>
                <XMTok role="PUNCT">,</XMTok>
                <XMDual role="OPFUNCTION" xml:id="S2.p1.m1.4">
                  <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                </XMDual>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok font="italic" name="gamma" role="UNKNOWN" xml:id="S2.p1.m1.5">γ</XMTok>
                <XMTok role="CLOSE" stretchy="false">)</XMTok>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math> with <Math content-tex="\States" mode="inline" tex="\operatorname{S}" text="States" xml:id="S2.p1.m2">
          <XMath>
            <XMDual role="OPFUNCTION">
              <XMTok name="States" role="OPFUNCTION" scriptpos="post"/>
              <XMTok role="OPFUNCTION" scriptpos="post">S</XMTok>
            </XMDual>
          </XMath>
        </Math> the set of states, <Math content-tex="\Actions" mode="inline" tex="\operatorname{A}" text="Actions" xml:id="S2.p1.m3">
          <XMath>
            <XMDual role="OPFUNCTION">
              <XMTok name="Actions" role="OPFUNCTION" scriptpos="post"/>
              <XMTok role="OPFUNCTION" scriptpos="post">A</XMTok>
            </XMDual>
          </XMath>
        </Math> the set of actions (assumed discrete here), <Math content-tex="\Kernel:\States\times\Actions\to\textit{P}(\States)" mode="inline" tex="\operatorname{P}:\operatorname{S}\times\operatorname{A}\to\textit{P}(%&#10;\operatorname{S})" text="Kernel colon States * Actions to [P] * States" xml:id="S2.p1.m4">
          <XMath>
            <XMApp>
              <XMTok name="colon" role="METARELOP">:</XMTok>
              <XMDual role="OPFUNCTION">
                <XMTok name="Kernel" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">P</XMTok>
              </XMDual>
              <XMApp>
                <XMTok name="to" role="ARROW">→</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">×</XMTok>
                  <XMDual role="OPFUNCTION">
                    <XMTok name="States" role="OPFUNCTION" scriptpos="post"/>
                    <XMTok role="OPFUNCTION" scriptpos="post">S</XMTok>
                  </XMDual>
                  <XMDual role="OPFUNCTION">
                    <XMTok name="Actions" role="OPFUNCTION" scriptpos="post"/>
                    <XMTok role="OPFUNCTION" scriptpos="post">A</XMTok>
                  </XMDual>
                </XMApp>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMText><text font="italic">P</text></XMText>
                  <XMDual>
                    <XMRef idref="S2.p1.m4.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMDual role="OPFUNCTION" xml:id="S2.p1.m4.1">
                        <XMTok name="States" role="OPFUNCTION" scriptpos="post"/>
                        <XMTok role="OPFUNCTION" scriptpos="post">S</XMTok>
                      </XMDual>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> the Markovian transition kernel defining the dynamics of the environment, <Math content-tex="\R:\States\times\Actions\to\mathbb{R}" mode="inline" tex="\operatorname{R}:\operatorname{S}\times\operatorname{A}\to\mathbb{R}" text="R colon States * Actions to R" xml:id="S2.p1.m5">
          <XMath>
            <XMApp>
              <XMTok name="colon" role="METARELOP">:</XMTok>
              <XMDual role="OPFUNCTION">
                <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
              </XMDual>
              <XMApp>
                <XMTok name="to" role="ARROW">→</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">×</XMTok>
                  <XMDual role="OPFUNCTION">
                    <XMTok name="States" role="OPFUNCTION" scriptpos="post"/>
                    <XMTok role="OPFUNCTION" scriptpos="post">S</XMTok>
                  </XMDual>
                  <XMDual role="OPFUNCTION">
                    <XMTok name="Actions" role="OPFUNCTION" scriptpos="post"/>
                    <XMTok role="OPFUNCTION" scriptpos="post">A</XMTok>
                  </XMDual>
                </XMApp>
                <XMTok font="blackboard" role="UNKNOWN">R</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> a bounded reward function and <Math mode="inline" tex="\gamma\in\left[0,1\right[" text="gamma element-of list@(0, 1)" xml:id="S2.p1.m6">
          <XMath>
            <XMApp>
              <XMTok meaning="element-of" name="in" role="RELOP">∈</XMTok>
              <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
              <XMDual>
                <XMApp>
                  <XMTok meaning="list"/>
                  <XMRef idref="S2.p1.m6.1"/>
                  <XMRef idref="S2.p1.m6.2"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="true">[</XMTok>
                  <XMTok meaning="0" role="NUMBER" xml:id="S2.p1.m6.1">0</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok meaning="1" role="NUMBER" xml:id="S2.p1.m6.2">1</XMTok>
                  <XMTok role="CLOSE" stretchy="true">[</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math> a discount factor. The agent interacts with the environment through a (here deterministic) policy <Math content-tex="\pi:\States\to\Actions" mode="inline" tex="\pi:\operatorname{S}\to\operatorname{A}" text="pi colon States to Actions" xml:id="S2.p1.m7">
          <XMath>
            <XMApp>
              <XMTok name="colon" role="METARELOP">:</XMTok>
              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
              <XMApp>
                <XMTok name="to" role="ARROW">→</XMTok>
                <XMDual role="OPFUNCTION">
                  <XMTok name="States" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">S</XMTok>
                </XMDual>
                <XMDual role="OPFUNCTION">
                  <XMTok name="Actions" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">A</XMTok>
                </XMDual>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>. The quality of a given policy is quantified by the associated state-action value function, or <Math mode="inline" tex="Q" text="Q" xml:id="S2.p1.m8">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">Q</XMTok>
          </XMath>
        </Math>-function. It is the expected discounted cumulative reward for starting from <Math mode="inline" tex="s" text="s" xml:id="S2.p1.m9">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">s</XMTok>
          </XMath>
        </Math>, taking action <Math mode="inline" tex="a" text="a" xml:id="S2.p1.m10">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">a</XMTok>
          </XMath>
        </Math>, and following <Math mode="inline" tex="\pi" text="pi" xml:id="S2.p1.m11">
          <XMath>
            <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
          </XMath>
        </Math> afterward: <Math mode="inline" tex="Q^{\pi}(s,a)=\mathbb{E}_{\pi}[\sum_{t\geq 0}\gamma^{t}r_{t}|s_{0}=s,a_{0}=a]" xml:id="S2.p1.m12">
          <XMath>
            <XMApp>
              <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">Q</XMTok>
              <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
            </XMApp>
            <XMWrap>
              <XMTok role="OPEN" stretchy="false">(</XMTok>
              <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m12.1">s</XMTok>
              <XMTok role="PUNCT">,</XMTok>
              <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m12.2">a</XMTok>
              <XMTok role="CLOSE" stretchy="false">)</XMTok>
            </XMWrap>
            <XMTok meaning="equals" role="RELOP">=</XMTok>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="blackboard" role="UNKNOWN">E</XMTok>
              <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
            </XMApp>
            <XMWrap>
              <XMTok role="OPEN" stretchy="false">[</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok mathstyle="text" meaning="sum" role="SUMOP" scriptpos="post">∑</XMTok>
                <XMApp>
                  <XMTok fontsize="70%" meaning="greater-than-or-equals" name="geq" role="RELOP">≥</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                </XMApp>
              </XMApp>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
              </XMApp>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">r</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
              </XMApp>
              <XMTok role="VERTBAR" stretchy="false">|</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">s</XMTok>
                <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
              </XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMTok font="italic" role="UNKNOWN">s</XMTok>
              <XMTok role="PUNCT">,</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">a</XMTok>
                <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
              </XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMTok font="italic" role="UNKNOWN">a</XMTok>
              <XMTok role="CLOSE" stretchy="false">]</XMTok>
            </XMWrap>
          </XMath>
        </Math>,
with <Math mode="inline" tex="a_{t}=\pi(s_{t})" text="a _ t = pi * s _ t" xml:id="S2.p1.m13">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">a</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
              </XMApp>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                <XMDual>
                  <XMRef idref="S2.p1.m13.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp xml:id="S2.p1.m13.1">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">s</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>, <Math content-tex="r_{t}=\R(s_{t},a_{t})" mode="inline" tex="r_{t}=\operatorname{R}(s_{t},a_{t})" text="r _ t = R@(s _ t, a _ t)" xml:id="S2.p1.m14">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">r</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
              </XMApp>
              <XMDual>
                <XMApp>
                  <XMRef idref="S2.p1.m14.1"/>
                  <XMRef idref="S2.p1.m14.2"/>
                  <XMRef idref="S2.p1.m14.3"/>
                </XMApp>
                <XMApp>
                  <XMDual role="OPFUNCTION" xml:id="S2.p1.m14.1">
                    <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                    <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                  </XMDual>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp xml:id="S2.p1.m14.2">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">s</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    </XMApp>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMApp xml:id="S2.p1.m14.3">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">a</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMApp>
              </XMDual>
            </XMApp>
          </XMath>
        </Math> and <Math content-tex="s_{t+1}\sim\Kernel(.|s_{t},a_{t})" mode="inline" tex="s_{t+1}\sim\operatorname{P}(.|s_{t},a_{t})" xml:id="S2.p1.m15">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">s</XMTok>
              <XMApp>
                <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
              </XMApp>
            </XMApp>
            <XMTok meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
            <XMDual role="OPFUNCTION">
              <XMTok name="Kernel" role="OPFUNCTION" scriptpos="post"/>
              <XMTok role="OPFUNCTION" scriptpos="post">P</XMTok>
            </XMDual>
            <XMWrap>
              <XMTok role="OPEN" stretchy="false">(</XMTok>
              <XMTok role="PERIOD">.</XMTok>
              <XMTok role="VERTBAR" stretchy="false">|</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">s</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
              </XMApp>
              <XMTok role="PUNCT">,</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">a</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
              </XMApp>
              <XMTok role="CLOSE" stretchy="false">)</XMTok>
            </XMWrap>
          </XMath>
        </Math>.
By construction, it satisfies the Bellman equation: for any <Math mode="inline" tex="s,a" text="list@(s, a)" xml:id="S2.p1.m16">
          <XMath>
            <XMDual>
              <XMApp>
                <XMTok meaning="list"/>
                <XMRef idref="S2.p1.m16.1"/>
                <XMRef idref="S2.p1.m16.2"/>
              </XMApp>
              <XMWrap>
                <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m16.1">s</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m16.2">a</XMTok>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math>, <Math content-tex="Q^{\pi}(s,a)=\R(s,a)+\gamma\sum_{s^{\prime}}\Kernel(s^{\prime}|s,a)Q^{\pi}(s^{%&#10;\prime},\pi(s^{\prime}))" mode="inline" tex="Q^{\pi}(s,a)=\operatorname{R}(s,a)+\gamma\sum_{s^{\prime}}\operatorname{P}(s^{%&#10;\prime}|s,a)Q^{\pi}(s^{\prime},\pi(s^{\prime}))" text="Q ^ pi * open-interval@(s, a) = R@(s, a) + gamma * (sum _ (s ^ prime))@(Kernel@(conditional@(s ^ prime, list@(s, a))) * Q ^ pi * open-interval@(s ^ prime, pi * s ^ prime))" xml:id="S2.p1.m17">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                  <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
                </XMApp>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S2.p1.m17.1"/>
                    <XMRef idref="S2.p1.m17.2"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m17.1">s</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m17.2">a</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
              <XMApp>
                <XMTok meaning="plus" role="ADDOP">+</XMTok>
                <XMDual>
                  <XMApp>
                    <XMRef idref="S2.p1.m17.3"/>
                    <XMRef idref="S2.p1.m17.4"/>
                    <XMRef idref="S2.p1.m17.5"/>
                  </XMApp>
                  <XMApp>
                    <XMDual role="OPFUNCTION" xml:id="S2.p1.m17.3">
                      <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                      <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                    </XMDual>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m17.4">s</XMTok>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m17.5">a</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMApp>
                </XMDual>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
                  <XMApp>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok mathstyle="text" meaning="sum" role="SUMOP" scriptpos="post">∑</XMTok>
                      <XMApp>
                        <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                        <XMTok fontsize="50%" name="prime" role="SUPOP">′</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMDual>
                        <XMApp>
                          <XMRef idref="S2.p1.m17.8"/>
                          <XMRef idref="S2.p1.m17.9"/>
                        </XMApp>
                        <XMApp>
                          <XMDual role="OPFUNCTION" xml:id="S2.p1.m17.8">
                            <XMTok name="Kernel" role="OPFUNCTION" scriptpos="post"/>
                            <XMTok role="OPFUNCTION" scriptpos="post">P</XMTok>
                          </XMDual>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false">(</XMTok>
                            <XMApp xml:id="S2.p1.m17.9">
                              <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                              <XMApp>
                                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                                <XMTok font="italic" role="UNKNOWN">s</XMTok>
                                <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                              </XMApp>
                              <XMDual>
                                <XMApp>
                                  <XMTok meaning="list"/>
                                  <XMRef idref="S2.p1.m17.6"/>
                                  <XMRef idref="S2.p1.m17.7"/>
                                </XMApp>
                                <XMWrap>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m17.6">s</XMTok>
                                  <XMTok role="PUNCT">,</XMTok>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m17.7">a</XMTok>
                                </XMWrap>
                              </XMDual>
                            </XMApp>
                            <XMTok role="CLOSE" stretchy="false">)</XMTok>
                          </XMWrap>
                        </XMApp>
                      </XMDual>
                      <XMApp>
                        <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                        <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
                      </XMApp>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S2.p1.m17.10"/>
                          <XMRef idref="S2.p1.m17.11"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMApp xml:id="S2.p1.m17.10">
                            <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="italic" role="UNKNOWN">s</XMTok>
                            <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                          </XMApp>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMApp xml:id="S2.p1.m17.11">
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                            <XMDual>
                              <XMRef idref="S2.p1.m17.11.1"/>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">(</XMTok>
                                <XMApp xml:id="S2.p1.m17.11.1">
                                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                                  <XMTok font="italic" role="UNKNOWN">s</XMTok>
                                  <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                              </XMWrap>
                            </XMDual>
                          </XMApp>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                  </XMApp>
                </XMApp>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>.
An optimal policy <Math mode="inline" tex="\pi^{*}" text="pi ^ *" xml:id="S2.p1.m18">
          <XMath>
            <XMApp>
              <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
              <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
            </XMApp>
          </XMath>
        </Math> satisfies component-wise <Math mode="inline" tex="Q^{\pi_{*}}\geq Q^{\pi}" text="Q ^ (pi _ *) &gt;= Q ^ pi" xml:id="S2.p1.m19">
          <XMath>
            <XMApp>
              <XMTok meaning="greater-than-or-equals" name="geq" role="RELOP">≥</XMTok>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                  <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
                  <XMTok fontsize="50%" meaning="times" role="MULOP">*</XMTok>
                </XMApp>
              </XMApp>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>, for any policy <Math mode="inline" tex="\pi" text="pi" xml:id="S2.p1.m20">
          <XMath>
            <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
          </XMath>
        </Math>. Let <Math mode="inline" tex="Q^{*}=Q^{\pi_{*}}" text="Q ^ * = Q ^ (pi _ *)" xml:id="S2.p1.m21">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
              </XMApp>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                  <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
                  <XMTok fontsize="50%" meaning="times" role="MULOP">*</XMTok>
                </XMApp>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> be the associated (unique) optimal <Math mode="inline" tex="Q" text="Q" xml:id="S2.p1.m22">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">Q</XMTok>
          </XMath>
        </Math>-function. Any deterministic optimal policy is greedy with respect to it: <Math mode="inline" tex="\pi^{*}(s)\in\operatorname*{argmax}_{a}Q^{*}(s,a)" text="pi ^ * * s element-of (argmax _ a)@(Q ^ *) * open-interval@(s, a)" xml:id="S2.p1.m23">
          <XMath>
            <XMApp>
              <XMTok meaning="element-of" name="in" role="RELOP">∈</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                  <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                </XMApp>
                <XMDual>
                  <XMRef idref="S2.p1.m23.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m23.1">s</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMApp scriptpos="mid">
                    <XMTok role="SUBSCRIPTOP" scriptpos="mid1"/>
                    <XMTok role="OPERATOR" scriptpos="mid">argmax</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                    <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                  </XMApp>
                </XMApp>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S2.p1.m23.2"/>
                    <XMRef idref="S2.p1.m23.3"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m23.2">s</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m23.3">a</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>.</p>
    </para>
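    <para>
      <p>As an illustration of the Bellman equation and of greedy policies, the following minimal sketch evaluates a deterministic policy on a hypothetical two-state, two-action MDP by iterating the Bellman equation until it (numerically) converges. The toy MDP, the array layout (R indexed by state and action, P indexed by state, action and next state) and the names evaluate_policy, greedy_policy and n_iter are assumptions made for this example only; they do not come from the method described in this paper.</p>
      <verbatim>
```python
import numpy as np


def evaluate_policy(R, P, pi, gamma, n_iter=1000):
    """Iterate Q(s,a) = R(s,a) + gamma * sum_s' P(s'|s,a) Q(s', pi(s'))."""
    n_states, n_actions = R.shape
    Q = np.zeros((n_states, n_actions))
    for _ in range(n_iter):
        # Value of following pi from each next state: Q(s', pi(s')).
        next_value = Q[np.arange(n_states), pi]
        # P has shape (S, A, S'), so P @ next_value has shape (S, A).
        Q = R + gamma * (P @ next_value)
    return Q


def greedy_policy(Q):
    """Return a deterministic policy that is greedy with respect to Q."""
    return Q.argmax(axis=1)


# Hypothetical toy MDP: action 0 stays in place, action 1 switches state.
P = np.zeros((2, 2, 2))
P[0, 0, 0] = P[1, 0, 1] = 1.0  # stay
P[0, 1, 1] = P[1, 1, 0] = 1.0  # switch
# Reward 1 only for staying in state 1.
R = np.array([[0.0, 0.0], [1.0, 0.0]])
gamma = 0.9

pi = np.array([1, 0])  # switch out of state 0, stay in state 1
Q = evaluate_policy(R, P, pi, gamma)
greedy = greedy_policy(Q)
```
      </verbatim>
      <p>In this toy example the greedy policy computed from the evaluated Q-function coincides with the evaluated policy itself, which is the fixed-point property that characterises an optimal deterministic policy.</p>
    </para>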
    <para xml:id="S2.p2">
      <p><text font="bold">Exploration Bonus.</text> A common strategy to encourage exploration is to augment the reward function with a bonus. This bonus generally depends on past history. For example, a bonus rewarding novelty requires remembering what has been experienced so far. Write <Math mode="inline" tex="h_{t}=(s_{0},a_{0},\dots,s_{t-1},a_{t-1},s_{t})" text="h _ t = vector@(s _ 0, a _ 0, dots, s _ (t - 1), a _ (t - 1), s _ t)" xml:id="S2.p2.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">h</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
              </XMApp>
              <XMDual>
                <XMApp>
                  <XMTok meaning="vector"/>
                  <XMRef idref="S2.p2.m1.2"/>
                  <XMRef idref="S2.p2.m1.3"/>
                  <XMRef idref="S2.p2.m1.1"/>
                  <XMRef idref="S2.p2.m1.4"/>
                  <XMRef idref="S2.p2.m1.5"/>
                  <XMRef idref="S2.p2.m1.6"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp xml:id="S2.p2.m1.2">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S2.p2.m1.3">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">a</XMTok>
                    <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok name="dots" role="ID" xml:id="S2.p2.m1.1">…</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S2.p2.m1.4">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S2.p2.m1.5">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">a</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S2.p2.m1.6">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math> the history up to time <Math mode="inline" tex="t" text="t" xml:id="S2.p2.m2">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">t</XMTok>
          </XMath>
        </Math>, and <Math content-tex="\Memory" mode="inline" tex="\operatorname{H}" text="Memory" xml:id="S2.p2.m3">
          <XMath>
            <XMDual role="OPFUNCTION">
              <XMTok name="Memory" role="OPFUNCTION" scriptpos="post"/>
              <XMTok role="OPFUNCTION" scriptpos="post">H</XMTok>
            </XMDual>
          </XMath>
        </Math> the set of all histories. Generally speaking, we abstract a bonus as <Math content-tex="\B:\Memory\times\Actions\rightarrow\mathbb{R}" mode="inline" tex="\operatorname{B}:\operatorname{H}\times\operatorname{A}\rightarrow\mathbb{R}" text="B colon Memory * Actions rightarrow R" xml:id="S2.p2.m4">
          <XMath>
            <XMApp>
              <XMTok name="colon" role="METARELOP">:</XMTok>
              <XMDual role="OPFUNCTION">
                <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
              </XMDual>
              <XMApp>
                <XMTok name="rightarrow" role="ARROW">→</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">×</XMTok>
                  <XMDual role="OPFUNCTION">
                    <XMTok name="Memory" role="OPFUNCTION" scriptpos="post"/>
                    <XMTok role="OPFUNCTION" scriptpos="post">H</XMTok>
                  </XMDual>
                  <XMDual role="OPFUNCTION">
                    <XMTok name="Actions" role="OPFUNCTION" scriptpos="post"/>
                    <XMTok role="OPFUNCTION" scriptpos="post">A</XMTok>
                  </XMDual>
                </XMApp>
                <XMTok font="blackboard" role="UNKNOWN">R</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>, and use it to address the dilemma between exploration and exploitation: the agent then optimizes for <Math mode="inline" tex="R(s_{t},a_{t})+B(h_{t},a_{t})" text="R * open-interval@(s _ t, a _ t) + B * open-interval@(h _ t, a _ t)" xml:id="S2.p2.m5">
          <XMath>
            <XMApp>
              <XMTok meaning="plus" role="ADDOP">+</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">R</XMTok>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S2.p2.m5.1"/>
                    <XMRef idref="S2.p2.m5.2"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp xml:id="S2.p2.m5.1">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">s</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    </XMApp>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMApp xml:id="S2.p2.m5.2">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">a</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">B</XMTok>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S2.p2.m5.3"/>
                    <XMRef idref="S2.p2.m5.4"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp xml:id="S2.p2.m5.3">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">h</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    </XMApp>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMApp xml:id="S2.p2.m5.4">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">a</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> instead of simply <Math mode="inline" tex="R(s_{t},a_{t})" text="R * open-interval@(s _ t, a _ t)" xml:id="S2.p2.m6">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" role="UNKNOWN">R</XMTok>
              <XMDual>
                <XMApp>
                  <XMTok meaning="open-interval"/>
                  <XMRef idref="S2.p2.m6.1"/>
                  <XMRef idref="S2.p2.m6.2"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp xml:id="S2.p2.m6.1">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S2.p2.m6.2">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">a</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math>.
State-of-the-art exploration bonuses all rely on memory, <text font="italic">e.g.</text> by counting state visitations <cite class="ltx_citemacro_citep">(<bibref bibrefs="strehl2008analysis" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> or through updates of a neural network <cite class="ltx_citemacro_citep">(<bibref bibrefs="burda2018exploration,ostrovski2017count" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>.
These exploration bonuses are designed to express the prior that any source of novelty is good for exploration.</p>
    </para>
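    <para>
      <p>To make the history-dependent bonus concrete, here is a minimal sketch of a count-based novelty bonus computed directly from a history laid out as (s_0, a_0, ..., s_{t-1}, a_{t-1}, s_t). The functional form beta / sqrt(1 + n), the parameter beta and the function names are assumptions chosen for illustration, in the spirit of visitation-count approaches; this is not the exact bonus of any cited method, nor the bonus learned in this paper.</p>
      <verbatim>
```python
import math


def count_bonus(history, action, beta=1.0):
    """Novelty bonus B(h_t, a_t) that decays with past visits to (s_t, a_t).

    `history` is a flat tuple (s_0, a_0, ..., s_{t-1}, a_{t-1}, s_t).
    """
    past_states = history[0:-1:2]   # s_0, ..., s_{t-1}
    past_actions = history[1::2]    # a_0, ..., a_{t-1}
    current_state = history[-1]     # s_t
    # Number of times the pair (s_t, a_t) already occurred in the history.
    n_visits = sum(
        1
        for s, a in zip(past_states, past_actions)
        if (s, a) == (current_state, action)
    )
    # Illustrative (assumed) form: the bonus shrinks as the pair is revisited.
    return beta / math.sqrt(1 + n_visits)


def augmented_reward(task_reward, history, action, beta=1.0):
    """Bonus-augmented reward R(s_t, a_t) + B(h_t, a_t)."""
    return task_reward + count_bonus(history, action, beta)


# Hypothetical history: the agent is back in "s0" after taking "left" there twice.
h = ("s0", "left", "s1", "right", "s0", "left", "s0")
bonus_left = count_bonus(h, "left")    # pair seen twice before
bonus_right = count_bonus(h, "right")  # pair never seen in "s0"
```
      </verbatim>
      <p>With this history, the never-tried action receives the full bonus while the action already taken twice in the current state receives a smaller one, which is why such a bonus cannot be written as a function of the current state alone.</p>
    </para>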
    <para xml:id="S2.p3">
      <p>Such a prior on what is good for exploration is task-specific. <cite class="ltx_citemacro_citet"><bibref bibrefs="taiga2020bonus" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> showed that state-of-the-art bonuses degrade performance in most Atari games.</p>
    </para>
  </section>
  <section inlist="toc" xml:id="S3">
    <tags>
      <tag>3</tag>
      <tag role="autoref">section 3</tag>
      <tag role="refnum">3</tag>
      <tag role="typerefnum">§3</tag>
    </tags>
    <title><tag close=". ">3</tag>Show me the Way</title>
    <toctitle><tag close=" ">3</tag>Show me the Way</toctitle>
    <para xml:id="S3.p1">
      <p>Rather than handcrafting a bonus that encodes what we think intrinsic motivation should be (<text font="italic">e.g.</text> using novelty), we propose to learn it from demonstrations of exploratory behaviours.</p>
    </para>
    <para xml:id="S3.p2">
      <p>We thus assume that the demonstrator learns to solve a task by exploring its environment. A simple solution would be to perform behavioral cloning. Because the demonstrator is likely to use past interactions to make decisions (remembering what has already been tried), we could frame our problem as learning, in a supervised manner, a mapping from histories to actions. Yet, behavioral cloning suffers from behavioral drift <cite class="ltx_citemacro_citep">(<bibref bibrefs="ross2010efficient" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, which would be exacerbated in the case of history-dependent policies. Moreover, we would like to transfer this behavior to new environments and possibly to new tasks.</p>
    </para>
    <para xml:id="S3.p3">
      <p>Imitation learning classically assumes that experts are optimizing an MDP with an unknown reward function <Math content-tex="\R_{E}(s,a)" mode="inline" tex="\operatorname{R}_{E}(s,a)" text="(R _ E)@(s, a)" xml:id="S3.p3.m1">
          <XMath>
            <XMDual>
              <XMApp>
                <XMRef idref="S3.p3.m1.3"/>
                <XMRef idref="S3.p3.m1.1"/>
                <XMRef idref="S3.p3.m1.2"/>
              </XMApp>
              <XMApp>
                <XMApp xml:id="S3.p3.m1.3">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMDual role="OPFUNCTION">
                    <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                    <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                  </XMDual>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.p3.m1.1">s</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.p3.m1.2">a</XMTok>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMApp>
            </XMDual>
          </XMath>
        </Math>. Note that this introduces a modelling bias: a human performing a task is not necessarily explicitly solving an MDP. In this study, we do not tackle the standard problem of inferring an optimal behavior from demonstrations, but that of estimating an exploratory behavior. Yet, the latter can be reduced to the former. Taking inspiration from RL, and especially from the body of work on the exploration-exploitation dilemma, we assume that our demonstrator is optimizing for an unknown reward function <Math content-tex="\R_{E}(s,a)" mode="inline" tex="\operatorname{R}_{E}(s,a)" text="(R _ E)@(s, a)" xml:id="S3.p3.m2">
          <XMath>
            <XMDual>
              <XMApp>
                <XMRef idref="S3.p3.m2.3"/>
                <XMRef idref="S3.p3.m2.1"/>
                <XMRef idref="S3.p3.m2.2"/>
              </XMApp>
              <XMApp>
                <XMApp xml:id="S3.p3.m2.3">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMDual role="OPFUNCTION">
                    <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                    <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                  </XMDual>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.p3.m2.1">s</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.p3.m2.2">a</XMTok>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMApp>
            </XMDual>
          </XMath>
        </Math>, standing for the task objective, augmented with a trajectory-dependent intrinsic bonus <Math content-tex="\B_{E}(h,a)" mode="inline" tex="\operatorname{B}_{E}(h,a)" text="(B _ E)@(h, a)" xml:id="S3.p3.m3">
          <XMath>
            <XMDual>
              <XMApp>
                <XMRef idref="S3.p3.m3.3"/>
                <XMRef idref="S3.p3.m3.1"/>
                <XMRef idref="S3.p3.m3.2"/>
              </XMApp>
              <XMApp>
                <XMApp xml:id="S3.p3.m3.3">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMDual role="OPFUNCTION">
                    <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                    <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                  </XMDual>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.p3.m3.1">h</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.p3.m3.2">a</XMTok>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMApp>
            </XMDual>
          </XMath>
        </Math>, standing for how the environment is explored. Making this bonus depend on past interactions is important as we can reasonably assume that exploration is based on memory (one would not try to always reproduce situations that were already seen). Then, we assume that the expert is optimal for the bonus-augmented reward <Math content-tex="\R_{E}(s,a)+\B_{E}(h,a)" mode="inline" tex="\operatorname{R}_{E}(s,a)+\operatorname{B}_{E}(h,a)" text="(R _ E)@(s, a) + (B _ E)@(h, a)" xml:id="S3.p3.m4">
          <XMath>
            <XMApp>
              <XMTok meaning="plus" role="ADDOP">+</XMTok>
              <XMDual>
                <XMApp>
                  <XMRef idref="S3.p3.m4.5"/>
                  <XMRef idref="S3.p3.m4.1"/>
                  <XMRef idref="S3.p3.m4.2"/>
                </XMApp>
                <XMApp>
                  <XMApp xml:id="S3.p3.m4.5">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMDual role="OPFUNCTION">
                      <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                      <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                    </XMDual>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p3.m4.1">s</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p3.m4.2">a</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMApp>
              </XMDual>
              <XMDual>
                <XMApp>
                  <XMRef idref="S3.p3.m4.6"/>
                  <XMRef idref="S3.p3.m4.3"/>
                  <XMRef idref="S3.p3.m4.4"/>
                </XMApp>
                <XMApp>
                  <XMApp xml:id="S3.p3.m4.6">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMDual role="OPFUNCTION">
                      <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                      <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                    </XMDual>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p3.m4.3">h</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p3.m4.4">a</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMApp>
              </XMDual>
            </XMApp>
          </XMath>
        </Math>, in the augmented MDP <Math content-tex="\{\Memory,A,P,\R_{E}+\B_{E},\gamma\}" mode="inline" tex="\{\operatorname{H},A,P,\operatorname{R}_{E}+\operatorname{B}_{E},\gamma\}" text="set@(Memory, A, P, R _ E + B _ E, gamma)" xml:id="S3.p3.m5">
          <XMath>
            <XMDual>
              <XMApp>
                <XMTok meaning="set"/>
                <XMRef idref="S3.p3.m5.1"/>
                <XMRef idref="S3.p3.m5.2"/>
                <XMRef idref="S3.p3.m5.3"/>
                <XMRef idref="S3.p3.m5.5"/>
                <XMRef idref="S3.p3.m5.4"/>
              </XMApp>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">{</XMTok>
                <XMDual role="OPFUNCTION" xml:id="S3.p3.m5.1">
                  <XMTok name="Memory" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">H</XMTok>
                </XMDual>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok font="italic" role="UNKNOWN" xml:id="S3.p3.m5.2">A</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok font="italic" role="UNKNOWN" xml:id="S3.p3.m5.3">P</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp xml:id="S3.p3.m5.5">
                  <XMTok meaning="plus" role="ADDOP">+</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMDual role="OPFUNCTION">
                      <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                      <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                    </XMDual>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMDual role="OPFUNCTION">
                      <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                      <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                    </XMDual>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                  </XMApp>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok font="italic" name="gamma" role="UNKNOWN" xml:id="S3.p3.m5.4">γ</XMTok>
                <XMTok role="CLOSE" stretchy="false">}</XMTok>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math>. That is, the original MDP with the state space replaced by the set of all trajectories and the reward function augmented with a bonus function.
So, we reduced our original problem to Inverse Reinforcement Learning (IRL): learn a function <Math content-tex="\widehat{RB}:\Memory\times\Actions\rightarrow\mathbb{R}" mode="inline" tex="\widehat{RB}:\operatorname{H}\times\operatorname{A}\rightarrow\mathbb{R}" text="widehat@(R * B) colon Memory * Actions rightarrow R" xml:id="S3.p3.m6">
          <XMath>
            <XMApp>
              <XMTok name="colon" role="METARELOP">:</XMTok>
              <XMApp>
                <XMTok name="widehat" role="OVERACCENT">^</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">R</XMTok>
                  <XMTok font="italic" role="UNKNOWN">B</XMTok>
                </XMApp>
              </XMApp>
              <XMApp>
                <XMTok name="rightarrow" role="ARROW">→</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">×</XMTok>
                  <XMDual role="OPFUNCTION">
                    <XMTok name="Memory" role="OPFUNCTION" scriptpos="post"/>
                    <XMTok role="OPFUNCTION" scriptpos="post">H</XMTok>
                  </XMDual>
                  <XMDual role="OPFUNCTION">
                    <XMTok name="Actions" role="OPFUNCTION" scriptpos="post"/>
                    <XMTok role="OPFUNCTION" scriptpos="post">A</XMTok>
                  </XMDual>
                </XMApp>
                <XMTok font="blackboard" role="UNKNOWN">R</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> such that the demonstrator is the (unique) optimal policy. By design, <Math content-tex="\widehat{RB}=\R_{E}+\B_{E}" mode="inline" tex="\widehat{RB}=\operatorname{R}_{E}+\operatorname{B}_{E}" text="widehat@(R * B) = R _ E + B _ E" xml:id="S3.p3.m7">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok name="widehat" role="OVERACCENT">^</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">R</XMTok>
                  <XMTok font="italic" role="UNKNOWN">B</XMTok>
                </XMApp>
              </XMApp>
              <XMApp>
                <XMTok meaning="plus" role="ADDOP">+</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMDual role="OPFUNCTION">
                    <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                    <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                  </XMDual>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                </XMApp>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMDual role="OPFUNCTION">
                    <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                    <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                  </XMDual>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                </XMApp>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> is a solution to this problem (even if it is not learnable exactly, as the optimal policy is invariant to many reward transformations <cite class="ltx_citemacro_citep">(<bibref bibrefs="ng1999policy" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>). Yet, we also assume that we know the task’s reward <Math mode="inline" tex="R" text="R" xml:id="S3.p3.m8">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">R</XMTok>
          </XMath>
        </Math>, or at least that we observe it in the demonstrations. Formally, it may be different from the reward <Math mode="inline" tex="R_{E}" text="R _ E" xml:id="S3.p3.m9">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">R</XMTok>
              <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
            </XMApp>
          </XMath>
        </Math> (even if we assume that it leads to a similar optimal behavior), but we can leverage it to disentangle the task contribution from the exploration one. To do so, we propose to learn a bonus function <Math mode="inline" tex="\hat{B}:\mathcal{H}\times A\rightarrow\mathbb{R}" text="hat@(B) colon H * A rightarrow R" xml:id="S3.p3.m10">
          <XMath>
            <XMApp>
              <XMTok name="colon" role="METARELOP">:</XMTok>
              <XMApp>
                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                <XMTok font="italic" role="UNKNOWN">B</XMTok>
              </XMApp>
              <XMApp>
                <XMTok name="rightarrow" role="ARROW">→</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">×</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">H</XMTok>
                  <XMTok font="italic" role="UNKNOWN">A</XMTok>
                </XMApp>
                <XMTok font="blackboard" role="UNKNOWN">R</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> such that the demonstrator is optimal for <Math mode="inline" tex="R+\hat{B}" text="R + hat@(B)" xml:id="S3.p3.m11">
          <XMath>
            <XMApp>
              <XMTok meaning="plus" role="ADDOP">+</XMTok>
              <XMTok font="italic" role="UNKNOWN">R</XMTok>
              <XMApp>
                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                <XMTok font="italic" role="UNKNOWN">B</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>. Notice that it does not change the problem, as it is just a reparameterization of the previous one (by setting <Math content-tex="\hat{B}=\widehat{RB}-\R" mode="inline" tex="\hat{B}=\widehat{RB}-\operatorname{R}" text="hat@(B) = widehat@(R * B) - R" xml:id="S3.p3.m12">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                <XMTok font="italic" role="UNKNOWN">B</XMTok>
              </XMApp>
              <XMApp>
                <XMTok meaning="minus" role="ADDOP">-</XMTok>
                <XMApp>
                  <XMTok name="widehat" role="OVERACCENT">^</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">R</XMTok>
                    <XMTok font="italic" role="UNKNOWN">B</XMTok>
                  </XMApp>
                </XMApp>
                <XMDual role="OPFUNCTION">
                  <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                </XMDual>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>).</p>
    </para>
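    <para>
      <p>As a rough illustration of the augmented reward described above, the following Python sketch evaluates a task reward on the last state of a history and adds a history-dependent bonus. The count-based bonus, the encoding of a history as an alternating tuple of states and actions, and all function names are assumptions made for this example only; they are not the bonus learned in this work.</p>
    </para>

```python
from collections import Counter
from typing import Callable, Hashable, Tuple

State = Hashable
Action = Hashable
# Assumed encoding: a history h = (s_0, a_0, s_1, ..., s_t) alternates states
# and actions and ends with the current state.
History = Tuple[Hashable, ...]


def last_state(h: History) -> State:
    """Return s, the last state of the history h = (..., s)."""
    return h[-1]


def count_bonus(h: History, a: Action) -> float:
    """Hypothetical bonus: 1 / sqrt(number of visits to the current state in h)."""
    states = h[0::2]  # even positions hold states under the assumed encoding
    visits = Counter(states)[last_state(h)]
    return visits ** -0.5


def augmented_reward(
    task_reward: Callable[[State, Action], float],
    bonus: Callable[[History, Action], float],
) -> Callable[[History, Action], float]:
    """Build the reward of the augmented MDP: (h, a) -> R(s, a) + B(h, a)."""

    def reward(h: History, a: Action) -> float:
        return task_reward(last_state(h), a) + bonus(h, a)

    return reward
```

    <para>
      <p>With this toy bonus, revisiting a state yields a smaller reward than the first visit, which is one simple way a bonus can depend on memory rather than on the current state alone.</p>
    </para>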
    <para xml:id="S3.p4">
      <p>Thus, our IRL problem has additional constraints. First, we want to recover the bonus in order to transfer it to new environments or new tasks; this precludes using imitation learning methods that do not explicitly recover rewards (IRL is mandatory). Second, our specific parameterization (<Math content-tex="\R+\B" mode="inline" tex="\operatorname{R}+\operatorname{B}" text="R + B" xml:id="S3.p4.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="plus" role="ADDOP">+</XMTok>
              <XMDual role="OPFUNCTION">
                <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
              </XMDual>
              <XMDual role="OPFUNCTION">
                <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
              </XMDual>
            </XMApp>
          </XMath>
        </Math>) precludes using IRL methods that would not allow using the observation of the task reward <Math content-tex="\R" mode="inline" tex="\operatorname{R}" text="R" xml:id="S3.p4.m2">
          <XMath>
            <XMDual role="OPFUNCTION">
              <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
              <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
            </XMDual>
          </XMath>
        </Math> along the expert trajectories. Third, the function we want to estimate is history-dependent, which requires an IRL method able to use sequences as inputs.</p>
    </para>
    <para xml:id="S3.p5">
      <p><text font="bold">Formalism.</text>
We assume access to demonstrations that are optimal according to the (known) reward of the environment <emph font="italic">plus</emph> an (unknown) intrinsic bonus. Since the environment is assumed Markovian, knowing the current state is enough to act optimally according to the task (optimizing for the environment’s reward). Yet, the demonstrator also optimizes its exploration bonus, which depends on the past. To formalize things, we consider that the demonstrations are provided by a policy <Math content-tex="\pi_{E}:\Memory\rightarrow\Actions" mode="inline" tex="\pi_{E}:\operatorname{H}\rightarrow\operatorname{A}" text="pi _ E colon Memory rightarrow Actions" xml:id="S3.p5.m1">
          <XMath>
            <XMApp>
              <XMTok name="colon" role="METARELOP">:</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
              </XMApp>
              <XMApp>
                <XMTok name="rightarrow" role="ARROW">→</XMTok>
                <XMDual role="OPFUNCTION">
                  <XMTok name="Memory" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">H</XMTok>
                </XMDual>
                <XMDual role="OPFUNCTION">
                  <XMTok name="Actions" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">A</XMTok>
                </XMDual>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>, and that the policy is optimal for the augmented MDP <Math content-tex="(\Memory,\Actions,\Kernel,\R_{E}+\B_{E})" mode="inline" tex="(\operatorname{H},\operatorname{A},\operatorname{P},\operatorname{R}_{E}+%&#10;\operatorname{B}_{E})" text="vector@(Memory, Actions, Kernel, R _ E + B _ E)" xml:id="S3.p5.m2">
          <XMath>
            <XMDual>
              <XMApp>
                <XMTok meaning="vector"/>
                <XMRef idref="S3.p5.m2.1"/>
                <XMRef idref="S3.p5.m2.2"/>
                <XMRef idref="S3.p5.m2.3"/>
                <XMRef idref="S3.p5.m2.4"/>
              </XMApp>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">(</XMTok>
                <XMDual role="OPFUNCTION" xml:id="S3.p5.m2.1">
                  <XMTok name="Memory" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">H</XMTok>
                </XMDual>
                <XMTok role="PUNCT">,</XMTok>
                <XMDual role="OPFUNCTION" xml:id="S3.p5.m2.2">
                  <XMTok name="Actions" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">A</XMTok>
                </XMDual>
                <XMTok role="PUNCT">,</XMTok>
                <XMDual role="OPFUNCTION" xml:id="S3.p5.m2.3">
                  <XMTok name="Kernel" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">P</XMTok>
                </XMDual>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp xml:id="S3.p5.m2.4">
                  <XMTok meaning="plus" role="ADDOP">+</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMDual role="OPFUNCTION">
                      <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                      <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                    </XMDual>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMDual role="OPFUNCTION">
                      <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                      <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                    </XMDual>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                  </XMApp>
                </XMApp>
                <XMTok role="CLOSE" stretchy="false">)</XMTok>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math>, where <Math content-tex="\Memory" mode="inline" tex="\operatorname{H}" text="Memory" xml:id="S3.p5.m3">
          <XMath>
            <XMDual role="OPFUNCTION">
              <XMTok name="Memory" role="OPFUNCTION" scriptpos="post"/>
              <XMTok role="OPFUNCTION" scriptpos="post">H</XMTok>
            </XMDual>
          </XMath>
        </Math> replaces <Math content-tex="\States" mode="inline" tex="\operatorname{S}" text="States" xml:id="S3.p5.m4">
          <XMath>
            <XMDual role="OPFUNCTION">
              <XMTok name="States" role="OPFUNCTION" scriptpos="post"/>
              <XMTok role="OPFUNCTION" scriptpos="post">S</XMTok>
            </XMDual>
          </XMath>
        </Math> and <Math content-tex="\R_{E}+\B_{E}" mode="inline" tex="\operatorname{R}_{E}+\operatorname{B}_{E}" text="R _ E + B _ E" xml:id="S3.p5.m5">
          <XMath>
            <XMApp>
              <XMTok meaning="plus" role="ADDOP">+</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMDual role="OPFUNCTION">
                  <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                </XMDual>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
              </XMApp>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMDual role="OPFUNCTION">
                  <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                </XMDual>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> replaces <Math content-tex="\R" mode="inline" tex="\operatorname{R}" text="R" xml:id="S3.p5.m6">
          <XMath>
            <XMDual role="OPFUNCTION">
              <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
              <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
            </XMDual>
          </XMath>
        </Math>.</p>
    </para>
    <para xml:id="S3.p6">
      <p>We frame our problem as learning the bonus <Math content-tex="\B_{E}" mode="inline" tex="\operatorname{B}_{E}" text="B _ E" xml:id="S3.p6.m1">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMDual role="OPFUNCTION">
                <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
              </XMDual>
              <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
            </XMApp>
          </XMath>
        </Math> from trajectories sampled from <Math mode="inline" tex="\pi_{E}" text="pi _ E" xml:id="S3.p6.m2">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
              <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
            </XMApp>
          </XMath>
        </Math>.</p>
    </para>
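    <para>
      <p>To make the data side of this problem concrete, the sketch below turns one demonstration into (history, action, observed task reward) examples, where each history is the prefix of the trajectory up to the current state. The function name and the flat tuple encoding of histories are hypothetical choices for illustration; a sequence model would typically consume such prefixes, but the source does not prescribe this exact representation.</p>
    </para>

```python
from typing import Hashable, List, Sequence, Tuple

History = Tuple[Hashable, ...]
Example = Tuple[History, Hashable, float]


def history_examples(
    states: Sequence[Hashable],
    actions: Sequence[Hashable],
    rewards: Sequence[float],
) -> List[Example]:
    """Turn one demonstration into (h_t, a_t, r_t) triples.

    h_t = (s_0, a_0, ..., s_t) is the trajectory prefix ending in the current
    state, a_t is the demonstrator's action and r_t the observed task reward.
    """
    if not (len(states) == len(actions) == len(rewards)):
        raise ValueError("states, actions and rewards must have equal length")

    examples: List[Example] = []
    history: History = ()
    for s, a, r in zip(states, actions, rewards):
        history = history + (s,)          # h_t now ends with the current state
        examples.append((history, a, float(r)))
        history = history + (a,)          # extend with a_t before the next state
    return examples
```

    <para>
      <p>Each example pairs a full prefix with the action taken, so any estimator trained on these triples sees the memory on which the exploration bonus is assumed to depend.</p>
    </para>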
    <para xml:id="S3.p7">
      <p><text font="bold">Our approach.</text>
Although we cannot naively apply an existing IRL algorithm to our problem, existing algorithms can be a source of inspiration. One in particular suits our problem well: the set-policy framework <cite class="ltx_citemacro_citep">(<bibref bibrefs="piot2016bridging" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. It shows that a formal bijection between supervised learning and IRL exists. Among the covered algorithms, the <text font="italic">Cascaded Supervised approach to IRL</text> (CSI) <cite class="ltx_citemacro_citep">(<bibref bibrefs="klein2013cascaded" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> is of particular interest to us. We refer the reader to these papers for more details and explain here in detail how the CSI paradigm can be readily applied to our setting.</p>
    </para>
    <para xml:id="S3.p8">
      <p>The demonstrator’s policy, <Math mode="inline" tex="\pi_{E}" text="pi _ E" xml:id="S3.p8.m1">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
              <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
            </XMApp>
          </XMath>
        </Math>, is assumed optimal for <Math content-tex="\R_{E}+\B_{E}" mode="inline" tex="\operatorname{R}_{E}+\operatorname{B}_{E}" text="R _ E + B _ E" xml:id="S3.p8.m2">
          <XMath>
            <XMApp>
              <XMTok meaning="plus" role="ADDOP">+</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMDual role="OPFUNCTION">
                  <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                </XMDual>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
              </XMApp>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMDual role="OPFUNCTION">
                  <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                </XMDual>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> (which is unknown), so</p>
      <equation xml:id="S3.Ex1">
        <Math content-tex="\pi_{E}=\pi_{\R_{E}+\B_{E}}^{*}." mode="display" tex="\pi_{E}=\pi_{\operatorname{R}_{E}+\operatorname{B}_{E}}^{*}." text="pi _ E = (pi _ (R _ E + B _ E)) ^ *" xml:id="S3.Ex1.m1">
          <XMath>
            <XMDual>
              <XMRef idref="S3.Ex1.m1.1"/>
              <XMWrap>
                <XMApp xml:id="S3.Ex1.m1.1">
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                      <XMApp>
                        <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                          <XMDual role="OPFUNCTION">
                            <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                            <XMTok fontsize="70%" role="OPFUNCTION" scriptpos="post">R</XMTok>
                          </XMDual>
                          <XMTok font="italic" fontsize="50%" role="UNKNOWN">E</XMTok>
                        </XMApp>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                          <XMDual role="OPFUNCTION">
                            <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                            <XMTok fontsize="70%" role="OPFUNCTION" scriptpos="post">B</XMTok>
                          </XMDual>
                          <XMTok font="italic" fontsize="50%" role="UNKNOWN">E</XMTok>
                        </XMApp>
                      </XMApp>
                    </XMApp>
                    <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                  </XMApp>
                </XMApp>
                <XMTok role="PERIOD">.</XMTok>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math>
      </equation>
      <p>Denote by <Math content-tex="Q_{E}(h,a)=Q^{*}_{\R_{E}+\B_{E}}(h,a)" mode="inline" tex="Q_{E}(h,a)=Q^{*}_{\operatorname{R}_{E}+\operatorname{B}_{E}}(h,a)" text="Q _ E * open-interval@(h, a) = (Q ^ *) _ (R _ E + B _ E) * open-interval@(h, a)" xml:id="S3.p8.m3">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                </XMApp>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S3.p8.m3.1"/>
                    <XMRef idref="S3.p8.m3.2"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p8.m3.1">h</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p8.m3.2">a</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                    <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                      <XMDual role="OPFUNCTION">
                        <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                        <XMTok fontsize="70%" role="OPFUNCTION" scriptpos="post">R</XMTok>
                      </XMDual>
                      <XMTok font="italic" fontsize="50%" role="UNKNOWN">E</XMTok>
                    </XMApp>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                      <XMDual role="OPFUNCTION">
                        <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                        <XMTok fontsize="70%" role="OPFUNCTION" scriptpos="post">B</XMTok>
                      </XMDual>
                      <XMTok font="italic" fontsize="50%" role="UNKNOWN">E</XMTok>
                    </XMApp>
                  </XMApp>
                </XMApp>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S3.p8.m3.3"/>
                    <XMRef idref="S3.p8.m3.4"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p8.m3.3">h</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p8.m3.4">a</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> the associated optimal <Math mode="inline" tex="Q" text="Q" xml:id="S3.p8.m4">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">Q</XMTok>
          </XMath>
        </Math>-function. It satisfies the Bellman optimality equation (writing <Math mode="inline" tex="h=(\dots,s)" text="h = open-interval@(dots, s)" xml:id="S3.p8.m5">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMTok font="italic" role="UNKNOWN">h</XMTok>
              <XMDual>
                <XMApp>
                  <XMTok meaning="open-interval"/>
                  <XMRef idref="S3.p8.m5.1"/>
                  <XMRef idref="S3.p8.m5.2"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMTok name="dots" role="ID" xml:id="S3.p8.m5.1">…</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.p8.m5.2">s</XMTok>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math>, where <Math mode="inline" tex="s" text="s" xml:id="S3.p8.m6">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">s</XMTok>
          </XMath>
        </Math> is the last state of the trajectory <Math mode="inline" tex="h" text="h" xml:id="S3.p8.m7">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">h</XMTok>
          </XMath>
        </Math>, and <Math mode="inline" tex="h^{\prime}=(h,a,s^{\prime})" text="h ^ prime = vector@(h, a, s ^ prime)" xml:id="S3.p8.m8">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">h</XMTok>
                <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
              </XMApp>
              <XMDual>
                <XMApp>
                  <XMTok meaning="vector"/>
                  <XMRef idref="S3.p8.m8.1"/>
                  <XMRef idref="S3.p8.m8.2"/>
                  <XMRef idref="S3.p8.m8.3"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.p8.m8.1">h</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.p8.m8.2">a</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S3.p8.m8.3">
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math>):</p>
      <equation xml:id="S3.E1">
        <tags>
          <tag>(1)</tag>
          <tag role="autoref">Equation 1</tag>
          <tag role="refnum">1</tag>
        </tags>
        <Math content-tex="\begin{split}\displaystyle Q_{E}(h,a)&amp;\displaystyle=\R_{E}(s,a)+\B_{E}(h,a)+%&#10;\gamma\mathbb{E}_{s^{\prime}|s,a}[\max_{a^{\prime}}Q_{E}(h^{\prime},a^{\prime}%&#10;)]\\&#10;&amp;\displaystyle=\R_{E}(s,a)+\B_{E}(h,a)+\gamma\mathbb{E}_{s^{\prime}|s,a}[Q_{E}%&#10;(h^{\prime},\pi_{E}(h^{\prime}))].\end{split}" mode="display" tex="\begin{split}\displaystyle Q_{E}(h,a)&amp;\displaystyle=\operatorname{R}_{E}(s,a)+%&#10;\operatorname{B}_{E}(h,a)+\gamma\mathbb{E}_{s^{\prime}|s,a}[\max_{a^{\prime}}Q%&#10;_{E}(h^{\prime},a^{\prime})]\\&#10;&amp;\displaystyle=\operatorname{R}_{E}(s,a)+\operatorname{B}_{E}(h,a)+\gamma%&#10;\mathbb{E}_{s^{\prime}|s,a}[Q_{E}(h^{\prime},\pi_{E}(h^{\prime}))].\end{split}" text="Q _ E * open-interval@(h, a) = (R _ E)@(s, a) + (B _ E)@(h, a) + gamma * E _ (conditional@(s ^ prime, list@(s, a))) * delimited-[]@((maximum _ (a ^ prime))@(Q _ E) * open-interval@(h ^ prime, a ^ prime)) = (R _ E)@(s, a) + (B _ E)@(h, a) + gamma * E _ (conditional@(s ^ prime, list@(s, a))) * delimited-[]@(Q _ E * open-interval@(h ^ prime, pi _ E * h ^ prime))" xml:id="S3.E1.m1">
          <XMath>
            <XMDual>
              <XMDual>
                <XMRef idref="S3.E1.m1.77"/>
                <XMWrap>
                  <XMApp xml:id="S3.E1.m1.77">
                    <XMTok meaning="multirelation"/>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                        <XMRef idref="S3.E1.m1.1"/>
                        <XMRef idref="S3.E1.m1.2.1"/>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="open-interval"/>
                        <XMRef idref="S3.E1.m1.4"/>
                        <XMRef idref="S3.E1.m1.6"/>
                      </XMApp>
                    </XMApp>
                    <XMRef idref="S3.E1.m1.8"/>
                    <XMApp>
                      <XMRef idref="S3.E1.m1.16"/>
                      <XMDual>
                        <XMApp>
                          <XMRef idref="S3.E1.m1.77.1"/>
                          <XMRef idref="S3.E1.m1.12"/>
                          <XMRef idref="S3.E1.m1.14"/>
                        </XMApp>
                        <XMApp>
                          <XMApp xml:id="S3.E1.m1.77.1">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                            <XMRef idref="S3.E1.m1.9"/>
                            <XMRef idref="S3.E1.m1.10.1"/>
                          </XMApp>
                          <XMWrap>
                            <XMRef idref="S3.E1.m1.11"/>
                            <XMRef idref="S3.E1.m1.12"/>
                            <XMRef idref="S3.E1.m1.13"/>
                            <XMRef idref="S3.E1.m1.14"/>
                            <XMRef idref="S3.E1.m1.15"/>
                          </XMWrap>
                        </XMApp>
                      </XMDual>
                      <XMDual>
                        <XMApp>
                          <XMRef idref="S3.E1.m1.77.2"/>
                          <XMRef idref="S3.E1.m1.20"/>
                          <XMRef idref="S3.E1.m1.22"/>
                        </XMApp>
                        <XMApp>
                          <XMApp xml:id="S3.E1.m1.77.2">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                            <XMRef idref="S3.E1.m1.17"/>
                            <XMRef idref="S3.E1.m1.18.1"/>
                          </XMApp>
                          <XMWrap>
                            <XMRef idref="S3.E1.m1.19"/>
                            <XMRef idref="S3.E1.m1.20"/>
                            <XMRef idref="S3.E1.m1.21"/>
                            <XMRef idref="S3.E1.m1.22"/>
                            <XMRef idref="S3.E1.m1.23"/>
                          </XMWrap>
                        </XMApp>
                      </XMDual>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMRef idref="S3.E1.m1.25"/>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                          <XMRef idref="S3.E1.m1.26"/>
                          <XMRef idref="S3.E1.m1.27.1"/>
                        </XMApp>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="delimited-[]"/>
                            <XMRef idref="S3.E1.m1.77.3"/>
                          </XMApp>
                          <XMWrap>
                            <XMRef idref="S3.E1.m1.28"/>
                            <XMApp xml:id="S3.E1.m1.77.3">
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMApp>
                                <XMApp scriptpos="mid">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="mid7"/>
                                  <XMRef idref="S3.E1.m1.29"/>
                                  <XMRef idref="S3.E1.m1.30.1"/>
                                </XMApp>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMRef idref="S3.E1.m1.31"/>
                                  <XMRef idref="S3.E1.m1.32.1"/>
                                </XMApp>
                              </XMApp>
                              <XMDual>
                                <XMApp>
                                  <XMTok meaning="open-interval"/>
                                  <XMRef idref="S3.E1.m1.77.3.1"/>
                                  <XMRef idref="S3.E1.m1.77.3.2"/>
                                </XMApp>
                                <XMWrap>
                                  <XMRef idref="S3.E1.m1.33"/>
                                  <XMApp xml:id="S3.E1.m1.77.3.1">
                                    <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                    <XMRef idref="S3.E1.m1.34"/>
                                    <XMRef idref="S3.E1.m1.35.1"/>
                                  </XMApp>
                                  <XMRef idref="S3.E1.m1.36"/>
                                  <XMApp xml:id="S3.E1.m1.77.3.2">
                                    <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                    <XMRef idref="S3.E1.m1.37"/>
                                    <XMRef idref="S3.E1.m1.38.1"/>
                                  </XMApp>
                                  <XMRef idref="S3.E1.m1.39"/>
                                </XMWrap>
                              </XMDual>
                            </XMApp>
                            <XMRef idref="S3.E1.m1.40"/>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                    </XMApp>
                    <XMRef idref="S3.E1.m1.41"/>
                    <XMApp>
                      <XMRef idref="S3.E1.m1.49"/>
                      <XMDual>
                        <XMApp>
                          <XMRef idref="S3.E1.m1.77.4"/>
                          <XMRef idref="S3.E1.m1.45"/>
                          <XMRef idref="S3.E1.m1.47"/>
                        </XMApp>
                        <XMApp>
                          <XMApp xml:id="S3.E1.m1.77.4">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                            <XMRef idref="S3.E1.m1.42"/>
                            <XMRef idref="S3.E1.m1.43.1"/>
                          </XMApp>
                          <XMWrap>
                            <XMRef idref="S3.E1.m1.44"/>
                            <XMRef idref="S3.E1.m1.45"/>
                            <XMRef idref="S3.E1.m1.46"/>
                            <XMRef idref="S3.E1.m1.47"/>
                            <XMRef idref="S3.E1.m1.48"/>
                          </XMWrap>
                        </XMApp>
                      </XMDual>
                      <XMDual>
                        <XMApp>
                          <XMRef idref="S3.E1.m1.77.5"/>
                          <XMRef idref="S3.E1.m1.53"/>
                          <XMRef idref="S3.E1.m1.55"/>
                        </XMApp>
                        <XMApp>
                          <XMApp xml:id="S3.E1.m1.77.5">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                            <XMRef idref="S3.E1.m1.50"/>
                            <XMRef idref="S3.E1.m1.51.1"/>
                          </XMApp>
                          <XMWrap>
                            <XMRef idref="S3.E1.m1.52"/>
                            <XMRef idref="S3.E1.m1.53"/>
                            <XMRef idref="S3.E1.m1.54"/>
                            <XMRef idref="S3.E1.m1.55"/>
                            <XMRef idref="S3.E1.m1.56"/>
                          </XMWrap>
                        </XMApp>
                      </XMDual>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMRef idref="S3.E1.m1.58"/>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                          <XMRef idref="S3.E1.m1.59"/>
                          <XMRef idref="S3.E1.m1.60.1"/>
                        </XMApp>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="delimited-[]"/>
                            <XMRef idref="S3.E1.m1.77.6"/>
                          </XMApp>
                          <XMWrap>
                            <XMRef idref="S3.E1.m1.61"/>
                            <XMApp xml:id="S3.E1.m1.77.6">
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMRef idref="S3.E1.m1.62"/>
                                <XMRef idref="S3.E1.m1.63.1"/>
                              </XMApp>
                              <XMDual>
                                <XMApp>
                                  <XMTok meaning="open-interval"/>
                                  <XMRef idref="S3.E1.m1.77.6.1"/>
                                  <XMRef idref="S3.E1.m1.77.6.2"/>
                                </XMApp>
                                <XMWrap>
                                  <XMRef idref="S3.E1.m1.64"/>
                                  <XMApp xml:id="S3.E1.m1.77.6.1">
                                    <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                    <XMRef idref="S3.E1.m1.65"/>
                                    <XMRef idref="S3.E1.m1.66.1"/>
                                  </XMApp>
                                  <XMRef idref="S3.E1.m1.67"/>
                                  <XMApp xml:id="S3.E1.m1.77.6.2">
                                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                    <XMApp>
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMRef idref="S3.E1.m1.68"/>
                                      <XMRef idref="S3.E1.m1.69.1"/>
                                    </XMApp>
                                    <XMDual>
                                      <XMRef idref="S3.E1.m1.77.6.2.1"/>
                                      <XMWrap>
                                        <XMRef idref="S3.E1.m1.70"/>
                                        <XMApp xml:id="S3.E1.m1.77.6.2.1">
                                          <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                          <XMRef idref="S3.E1.m1.71"/>
                                          <XMRef idref="S3.E1.m1.72.1"/>
                                        </XMApp>
                                        <XMRef idref="S3.E1.m1.73"/>
                                      </XMWrap>
                                    </XMDual>
                                  </XMApp>
                                  <XMRef idref="S3.E1.m1.74"/>
                                </XMWrap>
                              </XMDual>
                            </XMApp>
                            <XMRef idref="S3.E1.m1.75"/>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMTok role="PERIOD"/>
                </XMWrap>
              </XMDual>
              <XMArray colsep="0pt" name="aligned">
                <XMRow>
                  <XMCell align="right">
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                        <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.1">Q</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E1.m1.2.1">E</XMTok>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false" xml:id="S3.E1.m1.3">(</XMTok>
                        <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.4">h</XMTok>
                        <XMTok role="PUNCT" xml:id="S3.E1.m1.5">,</XMTok>
                        <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.6">a</XMTok>
                        <XMTok role="CLOSE" stretchy="false" xml:id="S3.E1.m1.7">)</XMTok>
                      </XMWrap>
                    </XMApp>
                  </XMCell>
                  <XMCell align="left">
                    <XMApp>
                      <XMTok meaning="equals" role="RELOP" xml:id="S3.E1.m1.8">=</XMTok>
                      <XMTok meaning="absent"/>
                      <XMApp>
                        <XMTok meaning="plus" role="ADDOP" xml:id="S3.E1.m1.16">+</XMTok>
                        <XMApp>
                          <XMApp xml:id="S3.E1.m1.78">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                            <XMDual role="OPFUNCTION" xml:id="S3.E1.m1.9">
                              <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                              <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                            </XMDual>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E1.m1.10.1">E</XMTok>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false" xml:id="S3.E1.m1.11">(</XMTok>
                            <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.12">s</XMTok>
                            <XMTok role="PUNCT" xml:id="S3.E1.m1.13">,</XMTok>
                            <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.14">a</XMTok>
                            <XMTok role="CLOSE" stretchy="false" xml:id="S3.E1.m1.15">)</XMTok>
                          </XMWrap>
                        </XMApp>
                        <XMApp>
                          <XMApp xml:id="S3.E1.m1.79">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                            <XMDual role="OPFUNCTION" xml:id="S3.E1.m1.17">
                              <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                              <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                            </XMDual>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E1.m1.18.1">E</XMTok>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false" xml:id="S3.E1.m1.19">(</XMTok>
                            <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.20">h</XMTok>
                            <XMTok role="PUNCT" xml:id="S3.E1.m1.21">,</XMTok>
                            <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.22">a</XMTok>
                            <XMTok role="CLOSE" stretchy="false" xml:id="S3.E1.m1.23">)</XMTok>
                          </XMWrap>
                        </XMApp>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMTok font="italic" name="gamma" role="UNKNOWN" xml:id="S3.E1.m1.25">γ</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                            <XMTok font="blackboard" role="UNKNOWN" xml:id="S3.E1.m1.26">E</XMTok>
                            <XMApp xml:id="S3.E1.m1.27.1">
                              <XMTok fontsize="70%" meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                              <XMApp>
                                <XMTok role="SUPERSCRIPTOP" scriptpos="post8"/>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                                <XMTok fontsize="50%" name="prime" role="SUPOP">′</XMTok>
                              </XMApp>
                              <XMDual>
                                <XMApp>
                                  <XMTok meaning="list"/>
                                  <XMRef idref="S3.E1.m1.27.1.1"/>
                                  <XMRef idref="S3.E1.m1.27.1.2"/>
                                </XMApp>
                                <XMWrap>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E1.m1.27.1.1">s</XMTok>
                                  <XMTok fontsize="70%" role="PUNCT">,</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E1.m1.27.1.2">a</XMTok>
                                </XMWrap>
                              </XMDual>
                            </XMApp>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false" xml:id="S3.E1.m1.28">[</XMTok>
                            <XMApp xml:id="S3.E1.m1.80">
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMApp>
                                <XMApp scriptpos="mid">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="mid7"/>
                                  <XMTok meaning="maximum" role="OPFUNCTION" scriptpos="mid" xml:id="S3.E1.m1.29">max</XMTok>
                                  <XMApp xml:id="S3.E1.m1.30.1">
                                    <XMTok role="SUPERSCRIPTOP" scriptpos="post8"/>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                                    <XMTok fontsize="50%" name="prime" role="SUPOP">′</XMTok>
                                  </XMApp>
                                </XMApp>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.31">Q</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E1.m1.32.1">E</XMTok>
                                </XMApp>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false" xml:id="S3.E1.m1.33">(</XMTok>
                                <XMApp xml:id="S3.E1.m1.80.1">
                                  <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.34">h</XMTok>
                                  <XMTok fontsize="70%" name="prime" role="SUPOP" xml:id="S3.E1.m1.35.1">′</XMTok>
                                </XMApp>
                                <XMTok role="PUNCT" xml:id="S3.E1.m1.36">,</XMTok>
                                <XMApp xml:id="S3.E1.m1.80.2">
                                  <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.37">a</XMTok>
                                  <XMTok fontsize="70%" name="prime" role="SUPOP" xml:id="S3.E1.m1.38.1">′</XMTok>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false" xml:id="S3.E1.m1.39">)</XMTok>
                              </XMWrap>
                            </XMApp>
                            <XMTok role="CLOSE" stretchy="false" xml:id="S3.E1.m1.40">]</XMTok>
                          </XMWrap>
                        </XMApp>
                      </XMApp>
                    </XMApp>
                  </XMCell>
                </XMRow>
                <XMRow>
                  <XMCell/>
                  <XMCell align="left">
                    <XMWrap>
                      <XMApp xml:id="S3.E1.m1.81">
                        <XMTok meaning="equals" role="RELOP" xml:id="S3.E1.m1.41">=</XMTok>
                        <XMTok meaning="absent"/>
                        <XMApp>
                          <XMTok meaning="plus" role="ADDOP" xml:id="S3.E1.m1.49">+</XMTok>
                          <XMApp>
                            <XMApp xml:id="S3.E1.m1.81.1">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                              <XMDual role="OPFUNCTION" xml:id="S3.E1.m1.42">
                                <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                                <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                              </XMDual>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E1.m1.43.1">E</XMTok>
                            </XMApp>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false" xml:id="S3.E1.m1.44">(</XMTok>
                              <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.45">s</XMTok>
                              <XMTok role="PUNCT" xml:id="S3.E1.m1.46">,</XMTok>
                              <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.47">a</XMTok>
                              <XMTok role="CLOSE" stretchy="false" xml:id="S3.E1.m1.48">)</XMTok>
                            </XMWrap>
                          </XMApp>
                          <XMApp>
                            <XMApp xml:id="S3.E1.m1.81.2">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                              <XMDual role="OPFUNCTION" xml:id="S3.E1.m1.50">
                                <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                                <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                              </XMDual>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E1.m1.51.1">E</XMTok>
                            </XMApp>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false" xml:id="S3.E1.m1.52">(</XMTok>
                              <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.53">h</XMTok>
                              <XMTok role="PUNCT" xml:id="S3.E1.m1.54">,</XMTok>
                              <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.55">a</XMTok>
                              <XMTok role="CLOSE" stretchy="false" xml:id="S3.E1.m1.56">)</XMTok>
                            </XMWrap>
                          </XMApp>
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMTok font="italic" name="gamma" role="UNKNOWN" xml:id="S3.E1.m1.58">γ</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                              <XMTok font="blackboard" role="UNKNOWN" xml:id="S3.E1.m1.59">E</XMTok>
                              <XMApp xml:id="S3.E1.m1.60.1">
                                <XMTok fontsize="70%" meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                                <XMApp>
                                  <XMTok role="SUPERSCRIPTOP" scriptpos="post8"/>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                                  <XMTok fontsize="50%" name="prime" role="SUPOP">′</XMTok>
                                </XMApp>
                                <XMDual>
                                  <XMApp>
                                    <XMTok meaning="list"/>
                                    <XMRef idref="S3.E1.m1.60.1.1"/>
                                    <XMRef idref="S3.E1.m1.60.1.2"/>
                                  </XMApp>
                                  <XMWrap>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E1.m1.60.1.1">s</XMTok>
                                    <XMTok fontsize="70%" role="PUNCT">,</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E1.m1.60.1.2">a</XMTok>
                                  </XMWrap>
                                </XMDual>
                              </XMApp>
                            </XMApp>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false" xml:id="S3.E1.m1.61">[</XMTok>
                              <XMApp xml:id="S3.E1.m1.81.3">
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.62">Q</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E1.m1.63.1">E</XMTok>
                                </XMApp>
                                <XMWrap>
                                  <XMTok role="OPEN" stretchy="false" xml:id="S3.E1.m1.64">(</XMTok>
                                  <XMApp xml:id="S3.E1.m1.81.3.1">
                                    <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.65">h</XMTok>
                                    <XMTok fontsize="70%" name="prime" role="SUPOP" xml:id="S3.E1.m1.66.1">′</XMTok>
                                  </XMApp>
                                  <XMTok role="PUNCT" xml:id="S3.E1.m1.67">,</XMTok>
                                  <XMApp xml:id="S3.E1.m1.81.3.2">
                                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                    <XMApp>
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMTok font="italic" name="pi" role="UNKNOWN" xml:id="S3.E1.m1.68">π</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E1.m1.69.1">E</XMTok>
                                    </XMApp>
                                    <XMWrap>
                                      <XMTok role="OPEN" stretchy="false" xml:id="S3.E1.m1.70">(</XMTok>
                                      <XMApp xml:id="S3.E1.m1.81.3.2.1">
                                        <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                        <XMTok font="italic" role="UNKNOWN" xml:id="S3.E1.m1.71">h</XMTok>
                                        <XMTok fontsize="70%" name="prime" role="SUPOP" xml:id="S3.E1.m1.72.1">′</XMTok>
                                      </XMApp>
                                      <XMTok role="CLOSE" stretchy="false" xml:id="S3.E1.m1.73">)</XMTok>
                                    </XMWrap>
                                  </XMApp>
                                  <XMTok role="CLOSE" stretchy="false" xml:id="S3.E1.m1.74">)</XMTok>
                                </XMWrap>
                              </XMApp>
                              <XMTok role="CLOSE" stretchy="false" xml:id="S3.E1.m1.75">]</XMTok>
                            </XMWrap>
                          </XMApp>
                        </XMApp>
                      </XMApp>
                      <XMTok role="PERIOD" xml:id="S3.E1.m1.76">.</XMTok>
                    </XMWrap>
                  </XMCell>
                </XMRow>
              </XMArray>
            </XMDual>
          </XMath>
        </Math>
      </equation>
      <p>Were the optimal policy and <Math mode="inline" tex="Q" text="Q" xml:id="S3.p8.m9">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">Q</XMTok>
          </XMath>
        </Math>-function known, we could recover the optimized bonus-augmented reward from this Bellman equation:</p>
      <equation xml:id="S3.E2">
        <tags>
          <tag>(2)</tag>
          <tag role="autoref">Equation 2</tag>
          <tag role="refnum">2</tag>
        </tags>
        <Math content-tex="\begin{split}&amp;\displaystyle\R_{E}(s,a)+\B_{E}(h,a)=Q_{E}(h,a)-\gamma\mathbb{E}%&#10;_{s^{\prime}|s,a}[Q_{E}(h^{\prime},\pi_{E}(h^{\prime}))].\end{split}" mode="display" tex="\begin{split}&amp;\displaystyle\operatorname{R}_{E}(s,a)+\operatorname{B}_{E}(h,a)%&#10;=Q_{E}(h,a)-\gamma\mathbb{E}_{s^{\prime}|s,a}[Q_{E}(h^{\prime},\pi_{E}(h^{%&#10;\prime}))].\end{split}" text="(R _ E)@(s, a) + (B _ E)@(h, a) = Q _ E * open-interval@(h, a) - gamma * E _ (conditional@(s ^ prime, list@(s, a))) * delimited-[]@(Q _ E * open-interval@(h ^ prime, pi _ E * h ^ prime))" xml:id="S3.E2.m1">
          <XMath>
            <XMDual>
              <XMDual>
                <XMRef idref="S3.E2.m1.44"/>
                <XMWrap>
                  <XMApp xml:id="S3.E2.m1.44">
                    <XMRef idref="S3.E2.m1.16"/>
                    <XMApp>
                      <XMRef idref="S3.E2.m1.8"/>
                      <XMDual>
                        <XMApp>
                          <XMRef idref="S3.E2.m1.44.1"/>
                          <XMRef idref="S3.E2.m1.4"/>
                          <XMRef idref="S3.E2.m1.6"/>
                        </XMApp>
                        <XMApp>
                          <XMApp xml:id="S3.E2.m1.44.1">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                            <XMRef idref="S3.E2.m1.1"/>
                            <XMRef idref="S3.E2.m1.2.1"/>
                          </XMApp>
                          <XMWrap>
                            <XMRef idref="S3.E2.m1.3"/>
                            <XMRef idref="S3.E2.m1.4"/>
                            <XMRef idref="S3.E2.m1.5"/>
                            <XMRef idref="S3.E2.m1.6"/>
                            <XMRef idref="S3.E2.m1.7"/>
                          </XMWrap>
                        </XMApp>
                      </XMDual>
                      <XMDual>
                        <XMApp>
                          <XMRef idref="S3.E2.m1.44.2"/>
                          <XMRef idref="S3.E2.m1.12"/>
                          <XMRef idref="S3.E2.m1.14"/>
                        </XMApp>
                        <XMApp>
                          <XMApp xml:id="S3.E2.m1.44.2">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                            <XMRef idref="S3.E2.m1.9"/>
                            <XMRef idref="S3.E2.m1.10.1"/>
                          </XMApp>
                          <XMWrap>
                            <XMRef idref="S3.E2.m1.11"/>
                            <XMRef idref="S3.E2.m1.12"/>
                            <XMRef idref="S3.E2.m1.13"/>
                            <XMRef idref="S3.E2.m1.14"/>
                            <XMRef idref="S3.E2.m1.15"/>
                          </XMWrap>
                        </XMApp>
                      </XMDual>
                    </XMApp>
                    <XMApp>
                      <XMRef idref="S3.E2.m1.24"/>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                          <XMRef idref="S3.E2.m1.17"/>
                          <XMRef idref="S3.E2.m1.18.1"/>
                        </XMApp>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S3.E2.m1.20"/>
                          <XMRef idref="S3.E2.m1.22"/>
                        </XMApp>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMRef idref="S3.E2.m1.25"/>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                          <XMRef idref="S3.E2.m1.26"/>
                          <XMRef idref="S3.E2.m1.27.1"/>
                        </XMApp>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="delimited-[]"/>
                            <XMRef idref="S3.E2.m1.44.3"/>
                          </XMApp>
                          <XMWrap>
                            <XMRef idref="S3.E2.m1.28"/>
                            <XMApp xml:id="S3.E2.m1.44.3">
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMRef idref="S3.E2.m1.29"/>
                                <XMRef idref="S3.E2.m1.30.1"/>
                              </XMApp>
                              <XMDual>
                                <XMApp>
                                  <XMTok meaning="open-interval"/>
                                  <XMRef idref="S3.E2.m1.44.3.1"/>
                                  <XMRef idref="S3.E2.m1.44.3.2"/>
                                </XMApp>
                                <XMWrap>
                                  <XMRef idref="S3.E2.m1.31"/>
                                  <XMApp xml:id="S3.E2.m1.44.3.1">
                                    <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                    <XMRef idref="S3.E2.m1.32"/>
                                    <XMRef idref="S3.E2.m1.33.1"/>
                                  </XMApp>
                                  <XMRef idref="S3.E2.m1.34"/>
                                  <XMApp xml:id="S3.E2.m1.44.3.2">
                                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                    <XMApp>
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMRef idref="S3.E2.m1.35"/>
                                      <XMRef idref="S3.E2.m1.36.1"/>
                                    </XMApp>
                                    <XMDual>
                                      <XMRef idref="S3.E2.m1.44.3.2.1"/>
                                      <XMWrap>
                                        <XMRef idref="S3.E2.m1.37"/>
                                        <XMApp xml:id="S3.E2.m1.44.3.2.1">
                                          <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                          <XMRef idref="S3.E2.m1.38"/>
                                          <XMRef idref="S3.E2.m1.39.1"/>
                                        </XMApp>
                                        <XMRef idref="S3.E2.m1.40"/>
                                      </XMWrap>
                                    </XMDual>
                                  </XMApp>
                                  <XMRef idref="S3.E2.m1.41"/>
                                </XMWrap>
                              </XMDual>
                            </XMApp>
                            <XMRef idref="S3.E2.m1.42"/>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMTok role="PERIOD"/>
                </XMWrap>
              </XMDual>
              <XMArray colsep="0pt" name="aligned">
                <XMRow>
                  <XMCell/>
                  <XMCell align="left">
                    <XMWrap>
                      <XMApp xml:id="S3.E2.m1.45">
                        <XMTok meaning="equals" role="RELOP" xml:id="S3.E2.m1.16">=</XMTok>
                        <XMApp>
                          <XMTok meaning="plus" role="ADDOP" xml:id="S3.E2.m1.8">+</XMTok>
                          <XMApp>
                            <XMApp xml:id="S3.E2.m1.45.1">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                              <XMDual role="OPFUNCTION" xml:id="S3.E2.m1.1">
                                <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                                <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                              </XMDual>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.2.1">E</XMTok>
                            </XMApp>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.3">(</XMTok>
                              <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.4">s</XMTok>
                              <XMTok role="PUNCT" xml:id="S3.E2.m1.5">,</XMTok>
                              <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.6">a</XMTok>
                              <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.7">)</XMTok>
                            </XMWrap>
                          </XMApp>
                          <XMApp>
                            <XMApp xml:id="S3.E2.m1.45.2">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                              <XMDual role="OPFUNCTION" xml:id="S3.E2.m1.9">
                                <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                                <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                              </XMDual>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.10.1">E</XMTok>
                            </XMApp>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.11">(</XMTok>
                              <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.12">h</XMTok>
                              <XMTok role="PUNCT" xml:id="S3.E2.m1.13">,</XMTok>
                              <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.14">a</XMTok>
                              <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.15">)</XMTok>
                            </XMWrap>
                          </XMApp>
                        </XMApp>
                        <XMApp>
                          <XMTok meaning="minus" role="ADDOP" xml:id="S3.E2.m1.24">-</XMTok>
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                              <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.17">Q</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.18.1">E</XMTok>
                            </XMApp>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.19">(</XMTok>
                              <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.20">h</XMTok>
                              <XMTok role="PUNCT" xml:id="S3.E2.m1.21">,</XMTok>
                              <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.22">a</XMTok>
                              <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.23">)</XMTok>
                            </XMWrap>
                          </XMApp>
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMTok font="italic" name="gamma" role="UNKNOWN" xml:id="S3.E2.m1.25">γ</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                              <XMTok font="blackboard" role="UNKNOWN" xml:id="S3.E2.m1.26">E</XMTok>
                              <XMApp xml:id="S3.E2.m1.27.1">
                                <XMTok fontsize="70%" meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                                <XMApp>
                                  <XMTok role="SUPERSCRIPTOP" scriptpos="post8"/>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                                  <XMTok fontsize="50%" name="prime" role="SUPOP">′</XMTok>
                                </XMApp>
                                <XMDual>
                                  <XMApp>
                                    <XMTok meaning="list"/>
                                    <XMRef idref="S3.E2.m1.27.1.1"/>
                                    <XMRef idref="S3.E2.m1.27.1.2"/>
                                  </XMApp>
                                  <XMWrap>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.27.1.1">s</XMTok>
                                    <XMTok fontsize="70%" role="PUNCT">,</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.27.1.2">a</XMTok>
                                  </XMWrap>
                                </XMDual>
                              </XMApp>
                            </XMApp>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.28">[</XMTok>
                              <XMApp xml:id="S3.E2.m1.45.3">
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.29">Q</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.30.1">E</XMTok>
                                </XMApp>
                                <XMWrap>
                                  <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.31">(</XMTok>
                                  <XMApp xml:id="S3.E2.m1.45.3.1">
                                    <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.32">h</XMTok>
                                    <XMTok fontsize="70%" name="prime" role="SUPOP" xml:id="S3.E2.m1.33.1">′</XMTok>
                                  </XMApp>
                                  <XMTok role="PUNCT" xml:id="S3.E2.m1.34">,</XMTok>
                                  <XMApp xml:id="S3.E2.m1.45.3.2">
                                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                    <XMApp>
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMTok font="italic" name="pi" role="UNKNOWN" xml:id="S3.E2.m1.35">π</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.36.1">E</XMTok>
                                    </XMApp>
                                    <XMWrap>
                                      <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.37">(</XMTok>
                                      <XMApp xml:id="S3.E2.m1.45.3.2.1">
                                        <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                        <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.38">h</XMTok>
                                        <XMTok fontsize="70%" name="prime" role="SUPOP" xml:id="S3.E2.m1.39.1">′</XMTok>
                                      </XMApp>
                                      <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.40">)</XMTok>
                                    </XMWrap>
                                  </XMApp>
                                  <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.41">)</XMTok>
                                </XMWrap>
                              </XMApp>
                              <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.42">]</XMTok>
                            </XMWrap>
                          </XMApp>
                        </XMApp>
                      </XMApp>
                      <XMTok role="PERIOD" xml:id="S3.E2.m1.43">.</XMTok>
                    </XMWrap>
                  </XMCell>
                </XMRow>
              </XMArray>
            </XMDual>
          </XMath>
        </Math>
      </equation>
      <p>The quantities on the right-hand side are unknown, but they can be estimated indirectly.</p>
    </para>
    <para xml:id="S3.p9">
      <p>Assuming that the actions are discrete, we can learn the policy <Math mode="inline" tex="\pi_{E}" text="pi _ E" xml:id="S3.p9.m1">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
              <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
            </XMApp>
          </XMath>
        </Math> by mapping histories to actions (using, for example, an LSTM network). We denote by <Math content-tex="\hat{\pi}:\Memory\to\Actions" mode="inline" tex="\hat{\pi}:\operatorname{H}\to\operatorname{A}" text="hat@(pi) colon Memory to Actions" xml:id="S3.p9.m2">
          <XMath>
            <XMApp>
              <XMTok name="colon" role="METARELOP">:</XMTok>
              <XMApp>
                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
              </XMApp>
              <XMApp>
                <XMTok name="to" role="ARROW">→</XMTok>
                <XMDual role="OPFUNCTION">
                  <XMTok name="Memory" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">H</XMTok>
                </XMDual>
                <XMDual role="OPFUNCTION">
                  <XMTok name="Actions" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">A</XMTok>
                </XMDual>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> the resulting policy, or classifier. If we train it by minimizing a cross-entropy loss, what we actually learn are the logits <Math mode="inline" tex="\hat{Q}(h,a)" text="hat@(Q) * open-interval@(h, a)" xml:id="S3.p9.m3">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMApp>
                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                <XMTok font="italic" role="UNKNOWN">Q</XMTok>
              </XMApp>
              <XMDual>
                <XMApp>
                  <XMTok meaning="open-interval"/>
                  <XMRef idref="S3.p9.m3.1"/>
                  <XMRef idref="S3.p9.m3.2"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.p9.m3.1">h</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.p9.m3.2">a</XMTok>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math>, and the classifier is <Math mode="inline" tex="\hat{\pi}(h)=\operatorname*{argmax}_{a}\hat{Q}(h,a)" text="hat@(pi) * h = (argmax _ a)@(hat@(Q)) * open-interval@(h, a)" xml:id="S3.p9.m4">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                </XMApp>
                <XMDual>
                  <XMRef idref="S3.p9.m4.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p9.m4.1">h</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMApp scriptpos="mid">
                    <XMTok role="SUBSCRIPTOP" scriptpos="mid1"/>
                    <XMTok role="OPERATOR" scriptpos="mid">argmax</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                  </XMApp>
                </XMApp>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S3.p9.m4.2"/>
                    <XMRef idref="S3.p9.m4.3"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p9.m4.2">h</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p9.m4.3">a</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>. In other words, <Math mode="inline" tex="\hat{\pi}" text="hat@(pi)" xml:id="S3.p9.m5">
          <XMath>
            <XMApp>
              <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
            </XMApp>
          </XMath>
        </Math> is greedy with respect to <Math mode="inline" tex="\hat{Q}" text="hat@(Q)" xml:id="S3.p9.m6">
          <XMath>
            <XMApp>
              <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
              <XMTok font="italic" role="UNKNOWN">Q</XMTok>
            </XMApp>
          </XMath>
        </Math>, which can thus be interpreted as an optimal Q-function for some unknown reward. Using the Bellman equation, we can recover this reward:</p>
      <equation labels="LABEL:eq:recovered_reward" xml:id="S3.E3">
        <tags>
          <tag>(3)</tag>
          <tag role="autoref">Equation 3</tag>
          <tag role="refnum">3</tag>
        </tags>
        <Math content-tex="\R(s,a)+\hat{\B}(h,a)=\hat{Q}(h,a)-\gamma\mathbb{E}_{s^{\prime}|s,a}[\hat{Q}(h%&#10;^{\prime},\hat{\pi}(h^{\prime}))]." mode="display" tex="\operatorname{R}(s,a)+\hat{\operatorname{B}}(h,a)=\hat{Q}(h,a)-\gamma\mathbb{E%&#10;}_{s^{\prime}|s,a}[\hat{Q}(h^{\prime},\hat{\pi}(h^{\prime}))]." text="R@(s, a) + hat@(B) * open-interval@(h, a) = hat@(Q) * open-interval@(h, a) - gamma * E _ (conditional@(s ^ prime, list@(s, a))) * delimited-[]@(hat@(Q) * open-interval@(h ^ prime, hat@(pi) * h ^ prime))" xml:id="S3.E3.m1">
          <XMath>
            <XMDual>
              <XMRef idref="S3.E3.m1.10"/>
              <XMWrap>
                <XMApp xml:id="S3.E3.m1.10">
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok meaning="plus" role="ADDOP">+</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMRef idref="S3.E3.m1.3"/>
                        <XMRef idref="S3.E3.m1.4"/>
                        <XMRef idref="S3.E3.m1.5"/>
                      </XMApp>
                      <XMApp>
                        <XMDual role="OPFUNCTION" xml:id="S3.E3.m1.3">
                          <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                          <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                        </XMDual>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="italic" role="UNKNOWN" xml:id="S3.E3.m1.4">s</XMTok>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMTok font="italic" role="UNKNOWN" xml:id="S3.E3.m1.5">a</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMApp>
                    </XMDual>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                        <XMDual role="OPFUNCTION">
                          <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                          <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                        </XMDual>
                      </XMApp>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S3.E3.m1.6"/>
                          <XMRef idref="S3.E3.m1.7"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="italic" role="UNKNOWN" xml:id="S3.E3.m1.6">h</XMTok>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMTok font="italic" role="UNKNOWN" xml:id="S3.E3.m1.7">a</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="minus" role="ADDOP">-</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                        <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                      </XMApp>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S3.E3.m1.8"/>
                          <XMRef idref="S3.E3.m1.9"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="italic" role="UNKNOWN" xml:id="S3.E3.m1.8">h</XMTok>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMTok font="italic" role="UNKNOWN" xml:id="S3.E3.m1.9">a</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="blackboard" role="UNKNOWN">E</XMTok>
                        <XMApp>
                          <XMTok fontsize="70%" meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                          <XMApp>
                            <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                            <XMTok fontsize="50%" name="prime" role="SUPOP">′</XMTok>
                          </XMApp>
                          <XMDual>
                            <XMApp>
                              <XMTok meaning="list"/>
                              <XMRef idref="S3.E3.m1.1"/>
                              <XMRef idref="S3.E3.m1.2"/>
                            </XMApp>
                            <XMWrap>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E3.m1.1">s</XMTok>
                              <XMTok fontsize="70%" role="PUNCT">,</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E3.m1.2">a</XMTok>
                            </XMWrap>
                          </XMDual>
                        </XMApp>
                      </XMApp>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="delimited-[]"/>
                          <XMRef idref="S3.E3.m1.10.1"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">[</XMTok>
                          <XMApp xml:id="S3.E3.m1.10.1">
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMApp>
                              <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                              <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                            </XMApp>
                            <XMDual>
                              <XMApp>
                                <XMTok meaning="open-interval"/>
                                <XMRef idref="S3.E3.m1.10.1.1"/>
                                <XMRef idref="S3.E3.m1.10.1.2"/>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">(</XMTok>
                                <XMApp xml:id="S3.E3.m1.10.1.1">
                                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                                  <XMTok font="italic" role="UNKNOWN">h</XMTok>
                                  <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                                </XMApp>
                                <XMTok role="PUNCT">,</XMTok>
                                <XMApp xml:id="S3.E3.m1.10.1.2">
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMApp>
                                    <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                                    <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                                  </XMApp>
                                  <XMDual>
                                    <XMRef idref="S3.E3.m1.10.1.2.1"/>
                                    <XMWrap>
                                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                                      <XMApp xml:id="S3.E3.m1.10.1.2.1">
                                        <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                                        <XMTok font="italic" role="UNKNOWN">h</XMTok>
                                        <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                                      </XMApp>
                                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                    </XMWrap>
                                  </XMDual>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                              </XMWrap>
                            </XMDual>
                          </XMApp>
                          <XMTok role="CLOSE" stretchy="false">]</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                  </XMApp>
                </XMApp>
                <XMTok role="PERIOD">.</XMTok>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math>
      </equation>
      <p>By the Bellman optimality equation, since <Math mode="inline" tex="\hat{\pi}" text="hat@(pi)" xml:id="S3.p9.m7">
          <XMath>
            <XMApp>
              <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
            </XMApp>
          </XMath>
        </Math> is greedy w.r.t. <Math mode="inline" tex="\hat{Q}" text="hat@(Q)" xml:id="S3.p9.m8">
          <XMath>
            <XMApp>
              <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
              <XMTok font="italic" role="UNKNOWN">Q</XMTok>
            </XMApp>
          </XMath>
        </Math>, we have that <Math mode="inline" tex="\hat{\pi}" text="hat@(pi)" xml:id="S3.p9.m9">
          <XMath>
            <XMApp>
              <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
            </XMApp>
          </XMath>
        </Math> and <Math mode="inline" tex="\hat{Q}" text="hat@(Q)" xml:id="S3.p9.m10">
          <XMath>
            <XMApp>
              <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
              <XMTok font="italic" role="UNKNOWN">Q</XMTok>
            </XMApp>
          </XMath>
        </Math> are respectively the optimal policy and <Math mode="inline" tex="Q" text="Q" xml:id="S3.p9.m11">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">Q</XMTok>
          </XMath>
        </Math>-function for the reward <Math content-tex="\R+\hat{\B}" mode="inline" tex="\operatorname{R}+\hat{\operatorname{B}}" text="R + hat@(B)" xml:id="S3.p9.m12">
          <XMath>
            <XMApp>
              <XMTok meaning="plus" role="ADDOP">+</XMTok>
              <XMDual role="OPFUNCTION">
                <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
              </XMDual>
              <XMApp>
                <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                <XMDual role="OPFUNCTION">
                  <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                </XMDual>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>. Because the model is unknown, we cannot use Eq. (<ref labelref="LABEL:eq:recovered_reward"/>) directly, but we can sample its right-hand side and estimate <Math content-tex="\hat{\B}" mode="inline" tex="\hat{\operatorname{B}}" text="hat@(B)" xml:id="S3.p9.m13">
          <XMath>
            <XMApp>
              <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
              <XMDual role="OPFUNCTION">
                <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
              </XMDual>
            </XMApp>
          </XMath>
        </Math> by solving a regression problem.</p>
    </para>
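    <para>
      <p>As a minimal sketch of this regression step, assume that Eq. (3) has the form B̂(h,a) = Q̂(h,a) − R(h,a) − γ E[Q̂(h′, π̂(h′))]. One sampled transition (h, a, r, h′) then yields one target by replacing the expectation with the observed h′, and the greedy value at h′ is the maximum of Q̂(h′, ·). The names <text font="typewriter">bonus_targets</text>, <text font="typewriter">q_hat</text> and <text font="typewriter">gamma</text> are hypothetical, and using a zero bootstrap when there is no next history is an assumption of this sketch, not something stated in the text.</p>
      <verbatim>
```python
def greedy_value(q_row):
    """Value of the greedy policy at one history: the largest per-action value."""
    return max(q_row)


def bonus_targets(transitions, q_hat, gamma=0.99):
    """Build one regression target per sampled transition (h, a, r, h_next).

    q_hat maps a history key to a list of per-action values (the classifier
    logits read as Q-values). h_next is None when the episode ended
    (assumption of this sketch: no bootstrap term in that case).
    """
    targets = []
    for h, a, r, h_next in transitions:
        if h_next is None:
            bootstrap = 0.0
        else:
            bootstrap = gamma * greedy_value(q_hat[h_next])
        # Sampled right-hand side: Q(h, a) - R(h, a) - gamma * max Q(h_next, .)
        targets.append(q_hat[h][a] - r - bootstrap)
    return targets


# Toy example with two histories and two actions (made-up numbers).
q_hat = {"h0": [1.0, 3.0], "h1": [2.0, 0.5]}
transitions = [("h0", 1, 0.0, "h1"), ("h1", 0, 1.0, None)]
print(bonus_targets(transitions, q_hat, gamma=0.5))  # prints [2.0, 1.0]
```
      </verbatim>
      <p>A regressor fitted to these targets then plays the role of the estimated bonus; the choice of regressor is left open in this sketch.</p>
    </para>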
    <para xml:id="S3.p10">
      <p>We have therefore reduced our initial problem to a sequence of supervised learning problems. Our algorithm is indeed CSI, except that we consider trajectories instead of states and parameterize the bonus with the task reward. As such, the theoretical results of <cite class="ltx_citemacro_citet"><bibref bibrefs="klein2013cascaded" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> apply to our setting. Notably, we would have that</p>
      <equation xml:id="S3.E4">
        <tags>
          <tag>(4)</tag>
          <tag role="autoref">Equation 4</tag>
          <tag role="refnum">4</tag>
        </tags>
        <Math mode="display" tex="0\leq\mathbb{E}_{h\sim\pi_{E}}[\max_{a}Q^{*}_{R+\hat{B}}(h,a)-Q^{\pi_{E}}_{R+\hat{B}}(h,\pi_{E}(h))]\leq\mathcal{O}\left(\frac{\epsilon_{1}+\epsilon_{2}}{1-\gamma}\right)," xml:id="S3.E4.m1">
          <XMath>
            <XMTok meaning="0" role="NUMBER">0</XMTok>
            <XMTok meaning="less-than-or-equals" name="leq" role="RELOP">≤</XMTok>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="blackboard" role="UNKNOWN">E</XMTok>
              <XMApp>
                <XMTok fontsize="70%" meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">h</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                  <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
                  <XMTok font="italic" fontsize="50%" role="UNKNOWN">E</XMTok>
                </XMApp>
              </XMApp>
            </XMApp>
            <XMWrap>
              <XMTok role="OPEN" stretchy="false">[</XMTok>
              <XMApp scriptpos="mid">
                <XMTok role="SUBSCRIPTOP" scriptpos="mid1"/>
                <XMTok meaning="maximum" role="OPFUNCTION" scriptpos="mid">max</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
              </XMApp>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                  <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                </XMApp>
                <XMApp>
                  <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">R</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">B</XMTok>
                  </XMApp>
                </XMApp>
              </XMApp>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">(</XMTok>
                <XMTok font="italic" role="UNKNOWN" xml:id="S3.E4.m1.1">h</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok font="italic" role="UNKNOWN" xml:id="S3.E4.m1.2">a</XMTok>
                <XMTok role="CLOSE" stretchy="false">)</XMTok>
              </XMWrap>
              <XMTok meaning="minus" role="ADDOP">-</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                    <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
                    <XMTok font="italic" fontsize="50%" role="UNKNOWN">E</XMTok>
                  </XMApp>
                </XMApp>
                <XMApp>
                  <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">R</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">B</XMTok>
                  </XMApp>
                </XMApp>
              </XMApp>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">(</XMTok>
                <XMTok font="italic" role="UNKNOWN" xml:id="S3.E4.m1.4">h</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E4.m1.3">h</XMTok>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                <XMTok role="CLOSE" stretchy="false">]</XMTok>
              </XMWrap>
              <XMTok meaning="less-than-or-equals" name="leq" role="RELOP">≤</XMTok>
              <XMTok font="caligraphic" role="UNKNOWN">O</XMTok>
              <XMWrap>
                <XMTok role="OPEN" stretchy="true">(</XMTok>
                <XMApp>
                  <XMTok mathstyle="display" meaning="divide" role="FRACOP"/>
                  <XMApp>
                    <XMTok meaning="plus" role="ADDOP">+</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post3"/>
                      <XMTok font="italic" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post3"/>
                      <XMTok font="italic" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                      <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="minus" role="ADDOP">-</XMTok>
                    <XMTok meaning="1" role="NUMBER">1</XMTok>
                    <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
                  </XMApp>
                </XMApp>
                <XMTok role="CLOSE" stretchy="true">)</XMTok>
              </XMWrap>
              <XMTok role="PUNCT">,</XMTok>
            </XMWrap>
          </XMath>
        </Math>
      </equation>
      <p>with <Math mode="inline" tex="\epsilon_{1}" text="epsilon _ 1" xml:id="S3.p10.m1">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" name="epsilon" role="UNKNOWN">ϵ</XMTok>
              <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
            </XMApp>
          </XMath>
        </Math> the classification error and <Math mode="inline" tex="\epsilon_{2}" text="epsilon _ 2" xml:id="S3.p10.m2">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" name="epsilon" role="UNKNOWN">ϵ</XMTok>
              <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
            </XMApp>
          </XMath>
        </Math> the regression error. This means that, for the learnt bonus function, the demonstrator policy is close to optimal if these errors are small. One could argue that this bound trivially holds for <Math mode="inline" tex="R+\hat{B}=0" text="R + hat@(B) = 0" xml:id="S3.p10.m3">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok meaning="plus" role="ADDOP">+</XMTok>
                <XMTok font="italic" role="UNKNOWN">R</XMTok>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">B</XMTok>
                </XMApp>
              </XMApp>
              <XMTok meaning="0" role="NUMBER">0</XMTok>
            </XMApp>
          </XMath>
        </Math>, in which case all behaviors are optimal. Yet this is unlikely: to have the classification error <Math mode="inline" tex="\epsilon_{1}" text="epsilon _ 1" xml:id="S3.p10.m4">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" name="epsilon" role="UNKNOWN">ϵ</XMTok>
              <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
            </XMApp>
          </XMath>
        </Math> small, we must have <Math mode="inline" tex="\hat{Q}(h,\pi_{E}(h))&gt;\hat{Q}(h,a\neq\pi_{E}(h))" xml:id="S3.p10.m5">
          <XMath>
            <XMApp>
              <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
              <XMTok font="italic" role="UNKNOWN">Q</XMTok>
            </XMApp>
            <XMWrap>
              <XMTok role="OPEN" stretchy="false">(</XMTok>
              <XMTok font="italic" role="UNKNOWN" xml:id="S3.p10.m5.2">h</XMTok>
              <XMTok role="PUNCT">,</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
              </XMApp>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">(</XMTok>
                <XMTok font="italic" role="UNKNOWN" xml:id="S3.p10.m5.1">h</XMTok>
                <XMTok role="CLOSE" stretchy="false">)</XMTok>
              </XMWrap>
              <XMTok role="CLOSE" stretchy="false">)</XMTok>
            </XMWrap>
            <XMTok meaning="greater-than" role="RELOP">&gt;</XMTok>
            <XMApp>
              <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
              <XMTok font="italic" role="UNKNOWN">Q</XMTok>
            </XMApp>
            <XMWrap>
              <XMTok role="OPEN" stretchy="false">(</XMTok>
              <XMTok font="italic" role="UNKNOWN">h</XMTok>
              <XMTok role="PUNCT">,</XMTok>
              <XMTok font="italic" role="UNKNOWN">a</XMTok>
              <XMTok meaning="not-equals" name="neq" role="RELOP">≠</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
              </XMApp>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">(</XMTok>
                <XMTok font="italic" role="UNKNOWN">h</XMTok>
                <XMTok role="CLOSE" stretchy="false">)</XMTok>
              </XMWrap>
              <XMTok role="CLOSE" stretchy="false">)</XMTok>
            </XMWrap>
          </XMath>
        </Math> with high probability, and thus learn an informative bonus.</p>
    </para>
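    <para>
      <p>The argument about the trivial solution can be illustrated with a small sketch. It measures an empirical classification error of the greedy policy on demonstration pairs, and counts a tie between the demonstrator's action and another action as an error; this tie convention, the function name <text font="typewriter">classification_error</text> and the toy values are assumptions made for illustration only. With identically zero values, every action ties, so the error cannot be small.</p>
      <verbatim>
```python
def classification_error(q_hat, demos):
    """Fraction of demo pairs (h, a_E) where a_E is not the strict argmax of q_hat[h].

    Ties are counted as errors (assumption of this sketch).
    """
    errors = 0
    for h, a_e in demos:
        row = q_hat[h]
        best = max(row)
        is_strict_argmax = row[a_e] == best and row.count(best) == 1
        if not is_strict_argmax:
            errors += 1
    return errors / len(demos)


demos = [("h0", 1), ("h1", 0)]

# Values induced by a zero reward: all actions are equally good everywhere.
q_zero = {"h0": [0.0, 0.0], "h1": [0.0, 0.0]}

# Values that strictly prefer the demonstrator's action in each history.
q_informative = {"h0": [1.0, 3.0], "h1": [2.0, 0.5]}

print(classification_error(q_zero, demos))         # prints 1.0
print(classification_error(q_informative, demos))  # prints 0.0
```
      </verbatim>
    </para>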
    <figure inlist="lof" labels="LABEL:schema" placement="h" xml:id="S3.F1">
      <tags>
        <tag>Figure 1</tag>
        <tag role="autoref">Figure 1</tag>
        <tag role="refnum">1</tag>
        <tag role="typerefnum">Figure 1</tag>
      </tags>
      <inline-para align="center" class="ltx_minipage" vattach="middle" width="433.6pt">
        <para align="center" xml:id="S3.F1.p1">
          <graphics candidates="figures/schema.png" graphic="figures/schema.png" options="width=433.62pt" xml:id="S3.F1.p1.g1"/>
        </para>
      </inline-para>
      <toccaption class="ltx_centering"><tag close=" ">1</tag>Trajectories <Math mode="inline" tex="(s,a,\dots)" text="vector@(s, a, dots)" xml:id="S3.F1.m1">
          <XMath>
            <XMDual>
              <XMApp>
                <XMTok meaning="vector"/>
                <XMRef idref="S3.F1.m1.1"/>
                <XMRef idref="S3.F1.m1.2"/>
                <XMRef idref="S3.F1.m1.3"/>
              </XMApp>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">(</XMTok>
                <XMTok font="italic" role="UNKNOWN" xml:id="S3.F1.m1.1">s</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok font="italic" role="UNKNOWN" xml:id="S3.F1.m1.2">a</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok name="dots" role="ID" xml:id="S3.F1.m1.3">…</XMTok>
                <XMTok role="CLOSE" stretchy="false">)</XMTok>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math> are generated by a demonstrator exploring its environment. In order to recover a bonus that can explain its behavior, a BC policy parameterized with an LSTM is trained to predict the actions of the demonstrator from its trajectories of states, by minimizing <Math mode="inline" tex="\mathcal{L}^{\text{BC}}" text="L ^ [BC]" xml:id="S3.F1.m2">
          <XMath>
            <XMApp>
              <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
              <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
              <XMText><text fontsize="70%">BC</text></XMText>
            </XMApp>
          </XMath>
        </Math>. The policy’s logits <Math mode="inline" tex="Q_{\phi}" text="Q _ phi" xml:id="S3.F1.m3">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">Q</XMTok>
              <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
            </XMApp>
          </XMath>
        </Math> are interpreted as optimal <Math mode="inline" tex="Q" text="Q" xml:id="S3.F1.m4">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">Q</XMTok>
          </XMath>
        </Math>-values and used to compute a regression target. A bonus function <Math content-tex="\B_{\theta}" mode="inline" tex="\operatorname{B}_{\theta}" text="B _ theta" xml:id="S3.F1.m5">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMDual role="OPFUNCTION">
                <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
              </XMDual>
              <XMTok font="italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
            </XMApp>
          </XMath>
        </Math>, parameterized with an LSTM, is then trained to predict it, by minimizing <Math mode="inline" tex="\mathcal{L}^{\text{reg}}" text="L ^ [reg]" xml:id="S3.F1.m6">
          <XMath>
            <XMApp>
              <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
              <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
              <XMText><text fontsize="70%">reg</text></XMText>
            </XMApp>
          </XMath>
        </Math>.</toccaption>
      <caption class="ltx_centering"><tag close=". ">Figure 1</tag>Trajectories <Math mode="inline" tex="(s,a,\dots)" text="vector@(s, a, dots)" xml:id="S3.F1.m7">
          <XMath>
            <XMDual>
              <XMApp>
                <XMTok meaning="vector"/>
                <XMRef idref="S3.F1.m7.1"/>
                <XMRef idref="S3.F1.m7.2"/>
                <XMRef idref="S3.F1.m7.3"/>
              </XMApp>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">(</XMTok>
                <XMTok font="italic" role="UNKNOWN" xml:id="S3.F1.m7.1">s</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok font="italic" role="UNKNOWN" xml:id="S3.F1.m7.2">a</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok name="dots" role="ID" xml:id="S3.F1.m7.3">…</XMTok>
                <XMTok role="CLOSE" stretchy="false">)</XMTok>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math> are generated by a demonstrator exploring its environment. In order to recover a bonus that can explain its behavior, a BC policy parameterized with an LSTM is trained to predict the actions of the demonstrator from its trajectories of states, by minimizing <Math mode="inline" tex="\mathcal{L}^{\text{BC}}" text="L ^ [BC]" xml:id="S3.F1.m8">
          <XMath>
            <XMApp>
              <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
              <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
              <XMText><text fontsize="70%">BC</text></XMText>
            </XMApp>
          </XMath>
        </Math>. The policy’s logits <Math mode="inline" tex="Q_{\phi}" text="Q _ phi" xml:id="S3.F1.m9">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">Q</XMTok>
              <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
            </XMApp>
          </XMath>
        </Math> are interpreted as optimal <Math mode="inline" tex="Q" text="Q" xml:id="S3.F1.m10">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">Q</XMTok>
          </XMath>
        </Math>-values and used to compute a regression target. A bonus function <Math content-tex="\B_{\theta}" mode="inline" tex="\operatorname{B}_{\theta}" text="B _ theta" xml:id="S3.F1.m11">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMDual role="OPFUNCTION">
                <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
              </XMDual>
              <XMTok font="italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
            </XMApp>
          </XMath>
        </Math>, parameterized with an LSTM, is then trained to predict it, by minimizing <Math mode="inline" tex="\mathcal{L}^{\text{reg}}" text="L ^ [reg]" xml:id="S3.F1.m12">
          <XMath>
            <XMApp>
              <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
              <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
              <XMText><text fontsize="70%">reg</text></XMText>
            </XMApp>
          </XMath>
        </Math>.</caption>
    </figure>
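    <para>
      <p>The behavioral-cloning loss named in Figure 1 can be sketched for a single (history, action) pair as follows, assuming it is the usual softmax cross-entropy over the policy's logits. The LSTM that produces the logits is omitted, and the function names <text font="typewriter">log_softmax</text> and <text font="typewriter">bc_loss</text> are hypothetical; the logits are passed in as a plain list.</p>
      <verbatim>
```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def bc_loss(logits, expert_action):
    """Negative log-probability of the demonstrator's action under softmax(logits)."""
    return -log_softmax(logits)[expert_action]


# With equal logits the loss is ln(2) for two actions; raising the logit of
# the demonstrator's action lowers the loss.
print(bc_loss([0.0, 0.0], 0))
print(bc_loss([10.0, 0.0], 0))
```
      </verbatim>
      <p>Minimizing this quantity pushes the logit of the demonstrated action above the others, which is the ordering that the interpretation of logits as Q-values relies on.</p>
    </para>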
    <para xml:id="S3.p11">
      <p><text font="bold">Implementation.</text> More concretely, we consider <Math content-tex="\sm(Q_{\phi})" mode="inline" tex="\operatorname{softmax}(Q_{\phi})" text="sm@(Q _ phi)" xml:id="S3.p11.m1">
          <XMath>
            <XMDual>
              <XMApp>
                <XMRef idref="S3.p11.m1.1"/>
                <XMRef idref="S3.p11.m1.2"/>
              </XMApp>
              <XMApp>
                <XMDual role="OPFUNCTION" xml:id="S3.p11.m1.1">
                  <XMTok name="sm" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">softmax</XMTok>
                </XMDual>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp xml:id="S3.p11.m1.2">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                    <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMApp>
            </XMDual>
          </XMath>
        </Math> to be a neural network classifier with LSTM <cite class="ltx_citemacro_citep">(<bibref bibrefs="hochreiter1997long" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> units, <Math mode="inline" tex="\phi" text="phi" xml:id="S3.p11.m2">
          <XMath>
            <XMTok font="italic" name="phi" role="UNKNOWN">ϕ</XMTok>
          </XMath>
        </Math> being the set of parameters and <Math mode="inline" tex="Q_{\phi}" text="Q _ phi" xml:id="S3.p11.m3">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">Q</XMTok>
              <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
            </XMApp>
          </XMath>
        </Math> being the logits. We train <Math mode="inline" tex="\pi_{\phi}" text="pi _ phi" xml:id="S3.p11.m4">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
              <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
            </XMApp>
          </XMath>
        </Math> to perform behavioral cloning, that is, to predict the demonstrator’s actions <Math mode="inline" tex="a_{E}" text="a _ E" xml:id="S3.p11.m5">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">a</XMTok>
              <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
            </XMApp>
          </XMath>
        </Math> based on its past interactions <Math mode="inline" tex="h_{E}" text="h _ E" xml:id="S3.p11.m6">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">h</XMTok>
              <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
            </XMApp>
          </XMath>
        </Math>, by minimizing a cross-entropy loss:</p>
      <equation xml:id="S3.E5">
        <tags>
          <tag>(5)</tag>
          <tag role="autoref">Equation 5</tag>
          <tag role="refnum">5</tag>
        </tags>
        <Math content-tex="\mathcal{L}^{\text{BC}}=-\ln(\sm(Q_{\phi}(h_{E},a_{E})))," mode="display" tex="\mathcal{L}^{\text{BC}}=-\ln(\operatorname{softmax}(Q_{\phi}(h_{E},a_{E})))," xml:id="S3.E5.m1">
          <XMath>
            <XMApp>
              <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
              <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
              <XMText><text fontsize="70%">BC</text></XMText>
            </XMApp>
            <XMTok meaning="equals" role="RELOP">=</XMTok>
            <XMTok meaning="minus" role="ADDOP">-</XMTok>
            <XMTok meaning="natural-logarithm" role="OPFUNCTION">ln</XMTok>
            <XMWrap>
              <XMTok role="OPEN" stretchy="false">(</XMTok>
              <XMDual role="OPFUNCTION" xml:id="S3.E5.m1.1">
                <XMTok name="sm" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">softmax</XMTok>
              </XMDual>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">(</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                  <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">h</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">a</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
                <XMTok role="CLOSE" stretchy="false">)</XMTok>
              </XMWrap>
              <XMTok role="CLOSE" stretchy="false">)</XMTok>
              <XMTok role="PUNCT">,</XMTok>
            </XMWrap>
          </XMath>
        </Math>
      </equation>
      <p>with <Math mode="inline" tex="Q_{\phi}(h,a)" text="Q _ phi * open-interval@(h, a)" xml:id="S3.p11.m7">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
              </XMApp>
              <XMDual>
                <XMApp>
                  <XMTok meaning="open-interval"/>
                  <XMRef idref="S3.p11.m7.1"/>
                  <XMRef idref="S3.p11.m7.2"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.p11.m7.1">h</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.p11.m7.2">a</XMTok>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math> the <Math mode="inline" tex="a^{\text{th}}" text="a ^ [th]" xml:id="S3.p11.m8">
          <XMath>
            <XMApp>
              <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">a</XMTok>
              <XMText><text fontsize="70%">th</text></XMText>
            </XMApp>
          </XMath>
        </Math> logit for input <Math mode="inline" tex="h" text="h" xml:id="S3.p11.m9">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">h</XMTok>
          </XMath>
        </Math>. If the classifier learns correctly, the logits of the resulting network should satisfy <Math mode="inline" tex="Q_{\phi}(h_{E},a_{E})&gt;Q_{\phi}(h_{E},a)" text="Q _ phi * open-interval@(h _ E, a _ E) &gt; Q _ phi * open-interval@(h _ E, a)" xml:id="S3.p11.m10">
          <XMath>
            <XMApp>
              <XMTok meaning="greater-than" role="RELOP">&gt;</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                  <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                </XMApp>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S3.p11.m10.2"/>
                    <XMRef idref="S3.p11.m10.3"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp xml:id="S3.p11.m10.2">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">h</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                    </XMApp>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMApp xml:id="S3.p11.m10.3">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">a</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                  <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                </XMApp>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S3.p11.m10.4"/>
                    <XMRef idref="S3.p11.m10.1"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp xml:id="S3.p11.m10.4">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">h</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                    </XMApp>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p11.m10.1">a</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> for <Math mode="inline" tex="a\neq a_{E}" text="a not-equals a _ E" xml:id="S3.p11.m11">
          <XMath>
            <XMApp>
              <XMTok meaning="not-equals" name="neq" role="RELOP">≠</XMTok>
              <XMTok font="italic" role="UNKNOWN">a</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">a</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>, and the class predicted by the classifier will be <Math mode="inline" tex="\pi_{\phi}(h)=\operatorname*{argmax}_{a}Q_{\phi}(h,a)" text="pi _ phi * h = (argmax _ a)@(Q _ phi) * open-interval@(h, a)" xml:id="S3.p11.m12">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                  <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                </XMApp>
                <XMDual>
                  <XMRef idref="S3.p11.m12.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p11.m12.1">h</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMApp scriptpos="mid">
                    <XMTok role="SUBSCRIPTOP" scriptpos="mid1"/>
                    <XMTok role="OPERATOR" scriptpos="mid">argmax</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                    <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                  </XMApp>
                </XMApp>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S3.p11.m12.2"/>
                    <XMRef idref="S3.p11.m12.3"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p11.m12.2">h</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.p11.m12.3">a</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>. Hence, as explained above, one can interpret <Math mode="inline" tex="Q_{\phi}" text="Q _ phi" xml:id="S3.p11.m13">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">Q</XMTok>
              <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
            </XMApp>
          </XMath>
        </Math> as an optimal <Math mode="inline" tex="Q" text="Q" xml:id="S3.p11.m14">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">Q</XMTok>
          </XMath>
        </Math>-function (hence the notation), and <Math mode="inline" tex="\pi_{\phi}" text="pi _ phi" xml:id="S3.p11.m15">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
              <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
            </XMApp>
          </XMath>
        </Math> as the associated optimal policy. Recall that these quantities can be related to the bonus-augmented reward through the Bellman equation:</p>
      <equation xml:id="S3.E6">
        <tags>
          <tag>(6)</tag>
          <tag role="autoref">Equation 6</tag>
          <tag role="refnum">6</tag>
        </tags>
        <Math content-tex="Q_{\phi}(h,a)=\R(s,a)+\B(h,a)+\mathbb{E}_{s^{\prime}|s,a}[Q_{\phi}(h^{\prime},%&#10;\pi_{\phi}(h^{\prime}))]." mode="display" tex="Q_{\phi}(h,a)=\operatorname{R}(s,a)+\operatorname{B}(h,a)+\mathbb{E}_{s^{%&#10;\prime}|s,a}[Q_{\phi}(h^{\prime},\pi_{\phi}(h^{\prime}))]." text="Q _ phi * open-interval@(h, a) = R@(s, a) + B@(h, a) + E _ (conditional@(s ^ prime, list@(s, a))) * delimited-[]@(Q _ phi * open-interval@(h ^ prime, pi _ phi * h ^ prime))" xml:id="S3.E6.m1">
          <XMath>
            <XMDual>
              <XMRef idref="S3.E6.m1.11"/>
              <XMWrap>
                <XMApp xml:id="S3.E6.m1.11">
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                      <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                    </XMApp>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="open-interval"/>
                        <XMRef idref="S3.E6.m1.3"/>
                        <XMRef idref="S3.E6.m1.4"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMTok font="italic" role="UNKNOWN" xml:id="S3.E6.m1.3">h</XMTok>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMTok font="italic" role="UNKNOWN" xml:id="S3.E6.m1.4">a</XMTok>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="plus" role="ADDOP">+</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMRef idref="S3.E6.m1.5"/>
                        <XMRef idref="S3.E6.m1.6"/>
                        <XMRef idref="S3.E6.m1.7"/>
                      </XMApp>
                      <XMApp>
                        <XMDual role="OPFUNCTION" xml:id="S3.E6.m1.5">
                          <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                          <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
                        </XMDual>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="italic" role="UNKNOWN" xml:id="S3.E6.m1.6">s</XMTok>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMTok font="italic" role="UNKNOWN" xml:id="S3.E6.m1.7">a</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMApp>
                    </XMDual>
                    <XMDual>
                      <XMApp>
                        <XMRef idref="S3.E6.m1.8"/>
                        <XMRef idref="S3.E6.m1.9"/>
                        <XMRef idref="S3.E6.m1.10"/>
                      </XMApp>
                      <XMApp>
                        <XMDual role="OPFUNCTION" xml:id="S3.E6.m1.8">
                          <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                          <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                        </XMDual>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="italic" role="UNKNOWN" xml:id="S3.E6.m1.9">h</XMTok>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMTok font="italic" role="UNKNOWN" xml:id="S3.E6.m1.10">a</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMApp>
                    </XMDual>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="blackboard" role="UNKNOWN">E</XMTok>
                        <XMApp>
                          <XMTok fontsize="70%" meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                          <XMApp>
                            <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                            <XMTok fontsize="50%" name="prime" role="SUPOP">′</XMTok>
                          </XMApp>
                          <XMDual>
                            <XMApp>
                              <XMTok meaning="list"/>
                              <XMRef idref="S3.E6.m1.1"/>
                              <XMRef idref="S3.E6.m1.2"/>
                            </XMApp>
                            <XMWrap>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E6.m1.1">s</XMTok>
                              <XMTok fontsize="70%" role="PUNCT">,</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E6.m1.2">a</XMTok>
                            </XMWrap>
                          </XMDual>
                        </XMApp>
                      </XMApp>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="delimited-[]"/>
                          <XMRef idref="S3.E6.m1.11.1"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">[</XMTok>
                          <XMApp xml:id="S3.E6.m1.11.1">
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                              <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                            </XMApp>
                            <XMDual>
                              <XMApp>
                                <XMTok meaning="open-interval"/>
                                <XMRef idref="S3.E6.m1.11.1.1"/>
                                <XMRef idref="S3.E6.m1.11.1.2"/>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">(</XMTok>
                                <XMApp xml:id="S3.E6.m1.11.1.1">
                                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                                  <XMTok font="italic" role="UNKNOWN">h</XMTok>
                                  <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                                </XMApp>
                                <XMTok role="PUNCT">,</XMTok>
                                <XMApp xml:id="S3.E6.m1.11.1.2">
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                    <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                                    <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                                  </XMApp>
                                  <XMDual>
                                    <XMRef idref="S3.E6.m1.11.1.2.1"/>
                                    <XMWrap>
                                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                                      <XMApp xml:id="S3.E6.m1.11.1.2.1">
                                        <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                                        <XMTok font="italic" role="UNKNOWN">h</XMTok>
                                        <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                                      </XMApp>
                                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                    </XMWrap>
                                  </XMDual>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                              </XMWrap>
                            </XMDual>
                          </XMApp>
                          <XMTok role="CLOSE" stretchy="false">]</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                  </XMApp>
                </XMApp>
                <XMTok role="PERIOD">.</XMTok>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math>
      </equation>
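      <p>As a minimal sketch of how the relation above can be read on a single transition, the following Python snippet takes the greedy action from a list of logits and rearranges the relation for the bonus. The function names, the toy logits, and the explicit discount argument gamma are assumptions made for illustration and are not the authors' implementation; a single observed next history stands in for the expectation over next states.</p>
      <verbatim font="typewriter">
```python
def greedy_action(logits):
    """Index of the largest logit, i.e. pi_phi(h) = argmax_a Q_phi(h, a)."""
    return max(range(len(logits)), key=lambda a: logits[a])


def bonus_from_bellman(q_h, action, reward, q_h_next, gamma=1.0):
    """Rearrange the Bellman relation for the bonus on one transition.

    B(h, a) = Q(h, a) - R(s, a) - gamma * Q(h', pi(h')),
    where the single observed h' replaces the expectation over s'.
    gamma is an assumption of this sketch (default 1.0).
    """
    next_value = q_h_next[greedy_action(q_h_next)]
    return q_h[action] - reward - gamma * next_value


# Hypothetical logits for one transition with three actions.
q_h = [0.2, 1.5, -0.3]        # Q_phi(h, .)
q_h_next = [0.7, 0.1, 0.4]    # Q_phi(h', .)

a_expert = greedy_action(q_h)  # action with the largest logit in h
bonus = bonus_from_bellman(q_h, a_expert, reward=0.0,
                           q_h_next=q_h_next, gamma=0.5)
print(a_expert, bonus)
```
      </verbatim>
      <p>With gamma left at its default of 1.0, the computation matches the relation exactly as written above.</p>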
      <p>Then, we can learn a network <Math content-tex="\B_{\theta}" mode="inline" tex="\operatorname{B}_{\theta}" text="B _ theta" xml:id="S3.p11.m16">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMDual role="OPFUNCTION">
                <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
              </XMDual>
              <XMTok font="italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
            </XMApp>
          </XMath>
        </Math> (parameterized by <Math mode="inline" tex="\theta" text="theta" xml:id="S3.p11.m17">
          <XMath>
            <XMTok font="italic" name="theta" role="UNKNOWN">θ</XMTok>
          </XMath>
        </Math>, with LSTM units) by minimizing a squared loss, the regression target being <Math mode="inline" tex="Q_{\phi}(h_{E},a_{E})-\gamma Q_{\phi}({h_{E}}^{\prime},\pi_{\phi}({h_{E}}^{%&#10;\prime}))-R(s_{E},a_{E})" text="Q _ phi * open-interval@(h _ E, a _ E) - gamma * Q _ phi * open-interval@((h _ E) ^ prime, pi _ phi * (h _ E) ^ prime) - R * open-interval@(s _ E, a _ E)" xml:id="S3.p11.m18">
          <XMath>
            <XMApp>
              <XMTok meaning="minus" role="ADDOP">-</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                  <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                </XMApp>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S3.p11.m18.1"/>
                    <XMRef idref="S3.p11.m18.2"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp xml:id="S3.p11.m18.1">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">h</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                    </XMApp>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMApp xml:id="S3.p11.m18.2">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">a</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                  <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                </XMApp>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S3.p11.m18.3"/>
                    <XMRef idref="S3.p11.m18.4"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp xml:id="S3.p11.m18.3">
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                        <XMTok font="italic" role="UNKNOWN">h</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                      </XMApp>
                      <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                    </XMApp>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMApp xml:id="S3.p11.m18.4">
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                        <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                      </XMApp>
                      <XMDual>
                        <XMRef idref="S3.p11.m18.4.1"/>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMApp xml:id="S3.p11.m18.4.1">
                            <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                              <XMTok font="italic" role="UNKNOWN">h</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                            </XMApp>
                            <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                          </XMApp>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">R</XMTok>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S3.p11.m18.5"/>
                    <XMRef idref="S3.p11.m18.6"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp xml:id="S3.p11.m18.5">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">s</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                    </XMApp>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMApp xml:id="S3.p11.m18.6">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">a</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>, an unbiased sample of the bonus implied by the Bellman equation. However, we only observe optimal actions (according to <Math content-tex="\R+\B" mode="inline" tex="\operatorname{R}+\operatorname{B}" text="R + B" xml:id="S3.p11.m19">
          <XMath>
            <XMApp>
              <XMTok meaning="plus" role="ADDOP">+</XMTok>
              <XMDual role="OPFUNCTION">
                <XMTok name="R" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">R</XMTok>
              </XMDual>
              <XMDual role="OPFUNCTION">
                <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
              </XMDual>
            </XMApp>
          </XMath>
        </Math>), so this regression alone would hardly generalize to suboptimal ones. Therefore, we propose a heuristic that regresses the bonus of suboptimal actions towards <Math content-tex="\B_{\text{min}}" mode="inline" tex="\operatorname{B}_{\text{min}}" text="B _ [min]" xml:id="S3.p11.m20">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMDual role="OPFUNCTION">
                <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
              </XMDual>
              <XMText><text fontsize="70%">min</text></XMText>
            </XMApp>
          </XMath>
        </Math>, a hyperparameter of the algorithm. For example, it could be set to <Math mode="inline" tex="\min(Q_{\phi}(h_{E},a_{E})-\gamma Q_{\phi}({h_{E}}^{\prime},\pi_{\phi}({h_{E}}%&#10;^{\prime}))-R(s_{E},a_{E}))" text="minimum@(Q _ phi * open-interval@(h _ E, a _ E) - gamma * Q _ phi * open-interval@((h _ E) ^ prime, pi _ phi * (h _ E) ^ prime) - R * open-interval@(s _ E, a _ E))" xml:id="S3.p11.m21">
          <XMath>
            <XMDual>
              <XMApp>
                <XMRef idref="S3.p11.m21.1"/>
                <XMRef idref="S3.p11.m21.2"/>
              </XMApp>
              <XMApp>
                <XMTok meaning="minimum" role="OPFUNCTION" scriptpos="post" xml:id="S3.p11.m21.1">min</XMTok>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp xml:id="S3.p11.m21.2">
                    <XMTok meaning="minus" role="ADDOP">-</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                        <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                      </XMApp>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S3.p11.m21.2.1"/>
                          <XMRef idref="S3.p11.m21.2.2"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMApp xml:id="S3.p11.m21.2.1">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="italic" role="UNKNOWN">h</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                          </XMApp>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMApp xml:id="S3.p11.m21.2.2">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="italic" role="UNKNOWN">a</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                          </XMApp>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" role="UNKNOWN">Q</XMTok>
                        <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                      </XMApp>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S3.p11.m21.2.3"/>
                          <XMRef idref="S3.p11.m21.2.4"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMApp xml:id="S3.p11.m21.2.3">
                            <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                              <XMTok font="italic" role="UNKNOWN">h</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                            </XMApp>
                            <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                          </XMApp>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMApp xml:id="S3.p11.m21.2.4">
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                              <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
                            </XMApp>
                            <XMDual>
                              <XMRef idref="S3.p11.m21.2.4.1"/>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">(</XMTok>
                                <XMApp xml:id="S3.p11.m21.2.4.1">
                                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                                    <XMTok font="italic" role="UNKNOWN">h</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                                  </XMApp>
                                  <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                              </XMWrap>
                            </XMDual>
                          </XMApp>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" role="UNKNOWN">R</XMTok>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="open-interval"/>
                          <XMRef idref="S3.p11.m21.2.5"/>
                          <XMRef idref="S3.p11.m21.2.6"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMApp xml:id="S3.p11.m21.2.5">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="italic" role="UNKNOWN">s</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                          </XMApp>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMApp xml:id="S3.p11.m21.2.6">
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="italic" role="UNKNOWN">a</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                          </XMApp>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMApp>
            </XMDual>
          </XMath>
        </Math>, the minimum being taken over the transitions in the dataset. This gives the following loss, for a transition <Math mode="inline" tex="(h_{E},a_{E},{h_{E}}^{\prime})" text="vector@(h _ E, a _ E, (h _ E) ^ prime)" xml:id="S3.p11.m22">
          <XMath>
            <XMDual>
              <XMApp>
                <XMTok meaning="vector"/>
                <XMRef idref="S3.p11.m22.1"/>
                <XMRef idref="S3.p11.m22.2"/>
                <XMRef idref="S3.p11.m22.3"/>
              </XMApp>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">(</XMTok>
                <XMApp xml:id="S3.p11.m22.1">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">h</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp xml:id="S3.p11.m22.2">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">a</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp xml:id="S3.p11.m22.3">
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                    <XMTok font="italic" role="UNKNOWN">h</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                  </XMApp>
                  <XMTok fontsize="70%" name="prime" role="SUPOP">′</XMTok>
                </XMApp>
                <XMTok role="CLOSE" stretchy="false">)</XMTok>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math>, and for <Math mode="inline" tex="\mskip 1.5mu \overline{\mskip-1.5mu a_{E}\mskip-1.5mu }\mskip 1.5mu " text="overline@(a _ E)" xml:id="S3.p11.m23">
          <XMath>
            <XMApp lpadding="0.8pt" rpadding="0.8pt">
              <XMTok name="overline" role="OVERACCENT">¯</XMTok>
              <XMApp rpadding="-0.8pt">
                <XMTok role="SUBSCRIPTOP" scriptpos="post3"/>
                <XMTok font="italic" lpadding="-0.8pt" role="UNKNOWN">a</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> sampled randomly from <Math content-tex="\Actions\setminus\{a_{E}\}" mode="inline" tex="\operatorname{A}\setminus\{a_{E}\}" text="Actions set-minus set@(a _ E)" xml:id="S3.p11.m24">
          <XMath>
            <XMApp>
              <XMTok meaning="set-minus" name="setminus" role="ADDOP">∖</XMTok>
              <XMDual role="OPFUNCTION">
                <XMTok name="Actions" role="OPFUNCTION" scriptpos="post"/>
                <XMTok role="OPFUNCTION" scriptpos="post">A</XMTok>
              </XMDual>
              <XMDual>
                <XMApp>
                  <XMTok meaning="set"/>
                  <XMRef idref="S3.p11.m24.1"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">{</XMTok>
                  <XMApp xml:id="S3.p11.m24.1">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">a</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">}</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math>:</p>
      <equation xml:id="S3.E7">
        <tags>
          <tag>(7)</tag>
          <tag role="autoref">Equation 7</tag>
          <tag role="refnum">7</tag>
        </tags>
        <Math content-tex="\begin{split}\displaystyle\mathcal{L}^{\text{reg}}&amp;\displaystyle=\Bigl{(}Q_{%&#10;\phi}(h_{E},a_{E})-\gamma Q_{\phi}({h_{E}}^{\prime},\pi_{\phi}({h_{E}}^{\prime%&#10;}))-R(s_{E},a_{E})-\B_{\theta}(h_{E},a_{E})\Bigr{)}^{2}\\&#10;&amp;\displaystyle+\Bigl{(}\B_{\text{min}}-\B_{\theta}(h_{E},\mskip 1.5mu %&#10;\overline{\mskip-1.5mu a_{E}\mskip-1.5mu }\mskip 1.5mu )\Bigr{)}^{2}.\end{split}" mode="display" tex="\begin{split}\displaystyle\mathcal{L}^{\text{reg}}&amp;\displaystyle=\Bigl{(}Q_{%&#10;\phi}(h_{E},a_{E})-\gamma Q_{\phi}({h_{E}}^{\prime},\pi_{\phi}({h_{E}}^{\prime%&#10;}))-R(s_{E},a_{E})-\operatorname{B}_{\theta}(h_{E},a_{E})\Bigr{)}^{2}\\&#10;&amp;\displaystyle+\Bigl{(}\operatorname{B}_{\text{min}}-\operatorname{B}_{\theta}%&#10;(h_{E},\mskip 1.5mu \overline{\mskip-1.5mu a_{E}\mskip-1.5mu }\mskip 1.5mu )%&#10;\Bigr{)}^{2}.\end{split}" text="L ^ [reg] = (Q _ phi * open-interval@(h _ E, a _ E) - gamma * Q _ phi * open-interval@((h _ E) ^ prime, pi _ phi * (h _ E) ^ prime) - R * open-interval@(s _ E, a _ E) - (B _ theta)@(h _ E, a _ E)) ^ 2 + (B _ [min] - (B _ theta)@(h _ E, overline@(a _ E))) ^ 2" xml:id="S3.E7.m1">
          <XMath>
            <XMDual>
              <XMDual>
                <XMRef idref="S3.E7.m1.68"/>
                <XMWrap>
                  <XMApp xml:id="S3.E7.m1.68">
                    <XMRef idref="S3.E7.m1.3"/>
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                      <XMRef idref="S3.E7.m1.1"/>
                      <XMRef idref="S3.E7.m1.2.1"/>
                    </XMApp>
                    <XMApp>
                      <XMRef idref="S3.E7.m1.52"/>
                      <XMApp>
                        <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                        <XMDual>
                          <XMRef idref="S3.E7.m1.68.1"/>
                          <XMWrap>
                            <XMRef idref="S3.E7.m1.4"/>
                            <XMApp xml:id="S3.E7.m1.68.1">
                              <XMRef idref="S3.E7.m1.14"/>
                              <XMApp>
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMRef idref="S3.E7.m1.5"/>
                                  <XMRef idref="S3.E7.m1.6.1"/>
                                </XMApp>
                                <XMDual>
                                  <XMApp>
                                    <XMTok meaning="open-interval"/>
                                    <XMRef idref="S3.E7.m1.68.1.1"/>
                                    <XMRef idref="S3.E7.m1.68.1.2"/>
                                  </XMApp>
                                  <XMWrap>
                                    <XMRef idref="S3.E7.m1.7"/>
                                    <XMApp xml:id="S3.E7.m1.68.1.1">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMRef idref="S3.E7.m1.8"/>
                                      <XMRef idref="S3.E7.m1.9.1"/>
                                    </XMApp>
                                    <XMRef idref="S3.E7.m1.10"/>
                                    <XMApp xml:id="S3.E7.m1.68.1.2">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMRef idref="S3.E7.m1.11"/>
                                      <XMRef idref="S3.E7.m1.12.1"/>
                                    </XMApp>
                                    <XMRef idref="S3.E7.m1.13"/>
                                  </XMWrap>
                                </XMDual>
                              </XMApp>
                              <XMApp>
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMRef idref="S3.E7.m1.15"/>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMRef idref="S3.E7.m1.16"/>
                                  <XMRef idref="S3.E7.m1.17.1"/>
                                </XMApp>
                                <XMDual>
                                  <XMApp>
                                    <XMTok meaning="open-interval"/>
                                    <XMRef idref="S3.E7.m1.68.1.3"/>
                                    <XMRef idref="S3.E7.m1.68.1.4"/>
                                  </XMApp>
                                  <XMWrap>
                                    <XMRef idref="S3.E7.m1.18"/>
                                    <XMApp xml:id="S3.E7.m1.68.1.3">
                                      <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                      <XMApp>
                                        <XMTok role="SUBSCRIPTOP" scriptpos="post8"/>
                                        <XMRef idref="S3.E7.m1.19"/>
                                        <XMRef idref="S3.E7.m1.20.1"/>
                                      </XMApp>
                                      <XMRef idref="S3.E7.m1.21.1"/>
                                    </XMApp>
                                    <XMRef idref="S3.E7.m1.22"/>
                                    <XMApp xml:id="S3.E7.m1.68.1.4">
                                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                      <XMApp>
                                        <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                        <XMRef idref="S3.E7.m1.23"/>
                                        <XMRef idref="S3.E7.m1.24.1"/>
                                      </XMApp>
                                      <XMDual>
                                        <XMRef idref="S3.E7.m1.68.1.4.1"/>
                                        <XMWrap>
                                          <XMRef idref="S3.E7.m1.25"/>
                                          <XMApp xml:id="S3.E7.m1.68.1.4.1">
                                            <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                            <XMApp>
                                              <XMTok role="SUBSCRIPTOP" scriptpos="post8"/>
                                              <XMRef idref="S3.E7.m1.26"/>
                                              <XMRef idref="S3.E7.m1.27.1"/>
                                            </XMApp>
                                            <XMRef idref="S3.E7.m1.28.1"/>
                                          </XMApp>
                                          <XMRef idref="S3.E7.m1.29"/>
                                        </XMWrap>
                                      </XMDual>
                                    </XMApp>
                                    <XMRef idref="S3.E7.m1.30"/>
                                  </XMWrap>
                                </XMDual>
                              </XMApp>
                              <XMApp>
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMRef idref="S3.E7.m1.32"/>
                                <XMDual>
                                  <XMApp>
                                    <XMTok meaning="open-interval"/>
                                    <XMRef idref="S3.E7.m1.68.1.5"/>
                                    <XMRef idref="S3.E7.m1.68.1.6"/>
                                  </XMApp>
                                  <XMWrap>
                                    <XMRef idref="S3.E7.m1.33"/>
                                    <XMApp xml:id="S3.E7.m1.68.1.5">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMRef idref="S3.E7.m1.34"/>
                                      <XMRef idref="S3.E7.m1.35.1"/>
                                    </XMApp>
                                    <XMRef idref="S3.E7.m1.36"/>
                                    <XMApp xml:id="S3.E7.m1.68.1.6">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMRef idref="S3.E7.m1.37"/>
                                      <XMRef idref="S3.E7.m1.38.1"/>
                                    </XMApp>
                                    <XMRef idref="S3.E7.m1.39"/>
                                  </XMWrap>
                                </XMDual>
                              </XMApp>
                              <XMDual>
                                <XMApp>
                                  <XMRef idref="S3.E7.m1.68.1.7"/>
                                  <XMRef idref="S3.E7.m1.68.1.8"/>
                                  <XMRef idref="S3.E7.m1.68.1.9"/>
                                </XMApp>
                                <XMApp>
                                  <XMApp xml:id="S3.E7.m1.68.1.7">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMRef idref="S3.E7.m1.41"/>
                                    <XMRef idref="S3.E7.m1.42.1"/>
                                  </XMApp>
                                  <XMWrap>
                                    <XMRef idref="S3.E7.m1.43"/>
                                    <XMApp xml:id="S3.E7.m1.68.1.8">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMRef idref="S3.E7.m1.44"/>
                                      <XMRef idref="S3.E7.m1.45.1"/>
                                    </XMApp>
                                    <XMRef idref="S3.E7.m1.46"/>
                                    <XMApp xml:id="S3.E7.m1.68.1.9">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMRef idref="S3.E7.m1.47"/>
                                      <XMRef idref="S3.E7.m1.48.1"/>
                                    </XMApp>
                                    <XMRef idref="S3.E7.m1.49"/>
                                  </XMWrap>
                                </XMApp>
                              </XMDual>
                            </XMApp>
                            <XMRef idref="S3.E7.m1.50"/>
                          </XMWrap>
                        </XMDual>
                        <XMRef idref="S3.E7.m1.51.1"/>
                      </XMApp>
                      <XMApp>
                        <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                        <XMDual>
                          <XMRef idref="S3.E7.m1.68.2"/>
                          <XMWrap>
                            <XMRef idref="S3.E7.m1.53"/>
                            <XMApp xml:id="S3.E7.m1.68.2">
                              <XMRef idref="S3.E7.m1.56"/>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMRef idref="S3.E7.m1.54"/>
                                <XMRef idref="S3.E7.m1.55.1"/>
                              </XMApp>
                              <XMDual>
                                <XMApp>
                                  <XMRef idref="S3.E7.m1.68.2.1"/>
                                  <XMRef idref="S3.E7.m1.68.2.2"/>
                                  <XMRef idref="S3.E7.m1.63"/>
                                </XMApp>
                                <XMApp>
                                  <XMApp xml:id="S3.E7.m1.68.2.1">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMRef idref="S3.E7.m1.57"/>
                                    <XMRef idref="S3.E7.m1.58.1"/>
                                  </XMApp>
                                  <XMWrap>
                                    <XMRef idref="S3.E7.m1.59"/>
                                    <XMApp xml:id="S3.E7.m1.68.2.2">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMRef idref="S3.E7.m1.60"/>
                                      <XMRef idref="S3.E7.m1.61.1"/>
                                    </XMApp>
                                    <XMRef idref="S3.E7.m1.62" rpadding="0.8pt"/>
                                    <XMRef idref="S3.E7.m1.63" rpadding="0.8pt"/>
                                    <XMRef idref="S3.E7.m1.64"/>
                                  </XMWrap>
                                </XMApp>
                              </XMDual>
                            </XMApp>
                            <XMRef idref="S3.E7.m1.65"/>
                          </XMWrap>
                        </XMDual>
                        <XMRef idref="S3.E7.m1.66.1"/>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMTok role="PERIOD"/>
                </XMWrap>
              </XMDual>
              <XMArray colsep="0pt" name="aligned">
                <XMRow>
                  <XMCell align="right">
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                      <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.E7.m1.1">L</XMTok>
                      <XMText xml:id="S3.E7.m1.2.1"><text fontsize="70%">reg</text></XMText>
                    </XMApp>
                  </XMCell>
                  <XMCell align="left">
                    <XMApp>
                      <XMTok meaning="equals" role="RELOP" xml:id="S3.E7.m1.3">=</XMTok>
                      <XMTok meaning="absent"/>
                      <XMApp>
                        <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                        <XMWrap>
                          <XMTok fontsize="160%" role="OPEN" stretchy="false" xml:id="S3.E7.m1.4">(</XMTok>
                          <XMApp xml:id="S3.E7.m1.69">
                            <XMTok meaning="minus" role="ADDOP" xml:id="S3.E7.m1.14">-</XMTok>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMTok font="italic" role="UNKNOWN" xml:id="S3.E7.m1.5">Q</XMTok>
                                <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN" xml:id="S3.E7.m1.6.1">ϕ</XMTok>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false" xml:id="S3.E7.m1.7">(</XMTok>
                                <XMApp xml:id="S3.E7.m1.69.1">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E7.m1.8">h</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E7.m1.9.1">E</XMTok>
                                </XMApp>
                                <XMTok role="PUNCT" xml:id="S3.E7.m1.10">,</XMTok>
                                <XMApp xml:id="S3.E7.m1.69.2">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E7.m1.11">a</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E7.m1.12.1">E</XMTok>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false" xml:id="S3.E7.m1.13">)</XMTok>
                              </XMWrap>
                            </XMApp>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMTok font="italic" name="gamma" role="UNKNOWN" xml:id="S3.E7.m1.15">γ</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMTok font="italic" role="UNKNOWN" xml:id="S3.E7.m1.16">Q</XMTok>
                                <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN" xml:id="S3.E7.m1.17.1">ϕ</XMTok>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false" xml:id="S3.E7.m1.18">(</XMTok>
                                <XMApp xml:id="S3.E7.m1.69.3">
                                  <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post8"/>
                                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.E7.m1.19">h</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E7.m1.20.1">E</XMTok>
                                  </XMApp>
                                  <XMTok fontsize="70%" name="prime" role="SUPOP" xml:id="S3.E7.m1.21.1">′</XMTok>
                                </XMApp>
                                <XMTok role="PUNCT" xml:id="S3.E7.m1.22">,</XMTok>
                                <XMApp xml:id="S3.E7.m1.69.4">
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="italic" name="pi" role="UNKNOWN" xml:id="S3.E7.m1.23">π</XMTok>
                                    <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN" xml:id="S3.E7.m1.24.1">ϕ</XMTok>
                                  </XMApp>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false" xml:id="S3.E7.m1.25">(</XMTok>
                                    <XMApp xml:id="S3.E7.m1.69.4.1">
                                      <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                      <XMApp>
                                        <XMTok role="SUBSCRIPTOP" scriptpos="post8"/>
                                        <XMTok font="italic" role="UNKNOWN" xml:id="S3.E7.m1.26">h</XMTok>
                                        <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E7.m1.27.1">E</XMTok>
                                      </XMApp>
                                      <XMTok fontsize="70%" name="prime" role="SUPOP" xml:id="S3.E7.m1.28.1">′</XMTok>
                                    </XMApp>
                                    <XMTok role="CLOSE" stretchy="false" xml:id="S3.E7.m1.29">)</XMTok>
                                  </XMWrap>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false" xml:id="S3.E7.m1.30">)</XMTok>
                              </XMWrap>
                            </XMApp>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMTok font="italic" role="UNKNOWN" xml:id="S3.E7.m1.32">R</XMTok>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false" xml:id="S3.E7.m1.33">(</XMTok>
                                <XMApp xml:id="S3.E7.m1.69.5">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E7.m1.34">s</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E7.m1.35.1">E</XMTok>
                                </XMApp>
                                <XMTok role="PUNCT" xml:id="S3.E7.m1.36">,</XMTok>
                                <XMApp xml:id="S3.E7.m1.69.6">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E7.m1.37">a</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E7.m1.38.1">E</XMTok>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false" xml:id="S3.E7.m1.39">)</XMTok>
                              </XMWrap>
                            </XMApp>
                            <XMApp>
                              <XMApp xml:id="S3.E7.m1.69.7">
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMDual role="OPFUNCTION" xml:id="S3.E7.m1.41">
                                  <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                                  <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                                </XMDual>
                                <XMTok font="italic" fontsize="70%" name="theta" role="UNKNOWN" xml:id="S3.E7.m1.42.1">θ</XMTok>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false" xml:id="S3.E7.m1.43">(</XMTok>
                                <XMApp xml:id="S3.E7.m1.69.8">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E7.m1.44">h</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E7.m1.45.1">E</XMTok>
                                </XMApp>
                                <XMTok role="PUNCT" xml:id="S3.E7.m1.46">,</XMTok>
                                <XMApp xml:id="S3.E7.m1.69.9">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E7.m1.47">a</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E7.m1.48.1">E</XMTok>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false" xml:id="S3.E7.m1.49">)</XMTok>
                              </XMWrap>
                            </XMApp>
                          </XMApp>
                          <XMTok fontsize="160%" role="CLOSE" stretchy="false" xml:id="S3.E7.m1.50">)</XMTok>
                        </XMWrap>
                        <XMTok fontsize="70%" meaning="2" role="NUMBER" xml:id="S3.E7.m1.51.1">2</XMTok>
                      </XMApp>
                    </XMApp>
                  </XMCell>
                </XMRow>
                <XMRow>
                  <XMCell/>
                  <XMCell align="left">
                    <XMWrap>
                      <XMApp xml:id="S3.E7.m1.70">
                        <XMTok meaning="plus" role="ADDOP" xml:id="S3.E7.m1.52">+</XMTok>
                        <XMApp>
                          <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                          <XMWrap>
                            <XMTok fontsize="160%" role="OPEN" stretchy="false" xml:id="S3.E7.m1.53">(</XMTok>
                            <XMApp xml:id="S3.E7.m1.70.1">
                              <XMTok meaning="minus" role="ADDOP" xml:id="S3.E7.m1.56">-</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMDual role="OPFUNCTION" xml:id="S3.E7.m1.54">
                                  <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                                  <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                                </XMDual>
                                <XMText xml:id="S3.E7.m1.55.1"><text fontsize="70%">min</text></XMText>
                              </XMApp>
                              <XMApp>
                                <XMApp xml:id="S3.E7.m1.70.1.1">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMDual role="OPFUNCTION" xml:id="S3.E7.m1.57">
                                    <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                                    <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                                  </XMDual>
                                  <XMTok font="italic" fontsize="70%" name="theta" role="UNKNOWN" xml:id="S3.E7.m1.58.1">θ</XMTok>
                                </XMApp>
                                <XMWrap>
                                  <XMTok role="OPEN" stretchy="false" xml:id="S3.E7.m1.59">(</XMTok>
                                  <XMApp xml:id="S3.E7.m1.70.1.2">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.E7.m1.60">h</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E7.m1.61.1">E</XMTok>
                                  </XMApp>
                                  <XMTok role="PUNCT" rpadding="0.8pt" xml:id="S3.E7.m1.62">,</XMTok>
                                  <XMApp rpadding="0.8pt" xml:id="S3.E7.m1.63">
                                    <XMTok name="overline" role="OVERACCENT">¯</XMTok>
                                    <XMApp rpadding="-0.8pt">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post9"/>
                                      <XMTok font="italic" lpadding="-0.8pt" role="UNKNOWN">a</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">E</XMTok>
                                    </XMApp>
                                  </XMApp>
                                  <XMTok role="CLOSE" stretchy="false" xml:id="S3.E7.m1.64">)</XMTok>
                                </XMWrap>
                              </XMApp>
                            </XMApp>
                            <XMTok fontsize="160%" role="CLOSE" stretchy="false" xml:id="S3.E7.m1.65">)</XMTok>
                          </XMWrap>
                          <XMTok fontsize="70%" meaning="2" role="NUMBER" xml:id="S3.E7.m1.66.1">2</XMTok>
                        </XMApp>
                      </XMApp>
                      <XMTok role="PERIOD" xml:id="S3.E7.m1.67">.</XMTok>
                    </XMWrap>
                  </XMCell>
                </XMRow>
              </XMArray>
            </XMDual>
          </XMath>
        </Math>
      </equation>
      <p>To sum up, we train a BC policy by minimizing <Math mode="inline" tex="\mathcal{L}^{\text{BC}}" text="L ^ [BC]" xml:id="S3.p11.m25">
          <XMath>
            <XMApp>
              <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
              <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
              <XMText><text fontsize="70%">BC</text></XMText>
            </XMApp>
          </XMath>
        </Math>. The resulting logits are implicitly treated as optimal <Math mode="inline" tex="Q" text="Q" xml:id="S3.p11.m26">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">Q</XMTok>
          </XMath>
        </Math>-values, which are in turn used to learn the bonus <Math mode="inline" tex="B_{\theta}" text="B _ theta" xml:id="S3.p11.m27">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">B</XMTok>
              <XMTok font="italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
            </XMApp>
          </XMath>
        </Math> by minimizing the loss <Math mode="inline" tex="\mathcal{L}^{\text{reg}}" text="L ^ [reg]" xml:id="S3.p11.m28">
          <XMath>
            <XMApp>
              <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
              <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
              <XMText><text fontsize="70%">reg</text></XMText>
            </XMApp>
          </XMath>
        </Math> (Figure <ref labelref="LABEL:schema"/>).</p>
    </para>
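    <para>
      <p>As an illustration, the per-transition loss of Equation 7 can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions rather than a definitive implementation: the function names <text font="typewriter">regression_loss</text> and <text font="typewriter">sample_other_action</text> are hypothetical, the Q-values, reward and bonus values are assumed to be precomputed scalars, actions are assumed to be indexed by integers, and the non-expert action is assumed to be drawn uniformly.</p>
      <verbatim font="typewriter">
```python
import random


def regression_loss(q_expert, q_next, reward, bonus_expert, bonus_other,
                    bonus_min, gamma):
    """Per-transition loss of Equation 7, computed from precomputed scalars.

    q_expert     : Q(h_E, a_E)
    q_next       : Q(h_E', pi(h_E'))
    reward       : R(s_E, a_E)
    bonus_expert : B_theta(h_E, a_E)
    bonus_other  : B_theta(h_E, a_bar), a_bar being a non-expert action
    bonus_min    : B_min, the minimum over transitions in the dataset
    gamma        : discount factor
    """
    # First term: the bonus of the expert action is regressed towards
    # Q(h, a) - gamma * Q(h', pi(h')) - R(s, a).
    expert_residual = q_expert - gamma * q_next - reward - bonus_expert
    # Second term: the bonus of the sampled non-expert action is regressed
    # towards B_min.
    other_residual = bonus_min - bonus_other
    return expert_residual ** 2 + other_residual ** 2


def sample_other_action(num_actions, expert_action, rng=random):
    """Draw an action from A minus {a_E}.

    Assumption: actions are the integers 0..num_actions-1 and the draw is
    uniform; the text only states that the action is sampled randomly.
    """
    candidates = [a for a in range(num_actions) if a != expert_action]
    return rng.choice(candidates)
```
      </verbatim>
      <p>In this sketch, the first squared term fits the bonus of the expert action, while the second pushes the bonus of the sampled non-expert action towards the minimum bonus. In practice the loss would be averaged over a batch of expert transitions and minimized with respect to the bonus parameters only.</p>
    </para>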
  </section>
  <section inlist="toc" xml:id="S4">
    <tags>
      <tag>4</tag>
      <tag role="autoref">section 4</tag>
      <tag role="refnum">4</tag>
      <tag role="typerefnum">§4</tag>
    </tags>
    <title><tag close=". ">4</tag>Experiments</title>
    <toctitle><tag close=" ">4</tag>Experiments</toctitle>
    <para xml:id="S4.p1">
      <p>We aim to provide insight into which <text font="italic">priors</text> SmtW is able to extract from the demonstrations; specifically, we wish to verify that SmtW encourages a <text font="italic">structured exploration</text> of the environment. To study the method thoroughly, we test it on a grid-world where we can design controllers with specific behaviors.
As in IRL, the return of an agent trained with our bonus is only a proxy for SmtW’s quality and is not informative about the priors the bonus conveys. We thus focus our experiments on analyzing the priors that the method extracts from the demonstrations.
More specifically, we wish to answer the following questions: (1) Does SmtW encourage the demonstrator’s behavior more than a random one? (2) Does SmtW capture the demonstrator’s style, that is, its way of exploring the environment? (3) Does SmtW capture the skills required to solve the task? (4) Does SmtW encourage novelty seeking? (5) Does SmtW capture the constraints the demonstrator may be subject to?
To do so, we design controlled behaviors and study the bonus that SmtW returns along these specific behaviors, as described in Fig. <ref labelref="LABEL:xp_explained"/>. Given a behavior <Math mode="inline" tex="A" text="A" xml:id="S4.p1.m1">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">A</XMTok>
          </XMath>
        </Math> and a behavior <Math mode="inline" tex="B" text="B" xml:id="S4.p1.m2">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">B</XMTok>
          </XMath>
        </Math>, this allows us to check whether a given bonus encourages behavior <Math mode="inline" tex="A" text="A" xml:id="S4.p1.m3">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">A</XMTok>
          </XMath>
        </Math> over <Math mode="inline" tex="B" text="B" xml:id="S4.p1.m4">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">B</XMTok>
          </XMath>
        </Math>, or vice versa, or rewards them equally.</p>
    </para>
    <figure inlist="lof" labels="LABEL:xp_explained" placement="!tbh" xml:id="S4.F2">
      <tags>
        <tag>Figure 2</tag>
        <tag role="autoref">Figure 2</tag>
        <tag role="refnum">2</tag>
        <tag role="typerefnum">Figure 2</tag>
      </tags>
      <inline-para align="center" class="ltx_minipage" vattach="middle" width="433.6pt">
        <para align="center" xml:id="S4.F2.p1">
          <graphics candidates="figures/xp_explained.pdf" graphic="figures/xp_explained" options="width=368.577pt" xml:id="S4.F2.p1.g1"/>
        </para>
      </inline-para>
      <toccaption class="ltx_centering"><tag close=" ">2</tag>Comparing intrinsic rewards by the behaviors they encourage or discourage in new, unseen environments.</toccaption>
      <caption class="ltx_centering"><tag close=". ">Figure 2</tag>Comparing intrinsic rewards by the behaviors they encourage or discourage in new, unseen environments.</caption>
    </figure>
    <para xml:id="S4.p2">
      <p>After addressing these questions, we also verify that a simple agent can benefit from SmtW to solve a task efficiently.</p>
    </para>
    <para xml:id="S4.p3">
      <p><text font="bold">The environment.</text>
We introduce a specific environment to answer these questions. We require this environment to be procedurally generated in order to test SmtW’s ability to generalize to unseen environments. We also require it to be complex enough that exhaustive exploration is prohibitively expensive. To achieve this, we introduce the KeysDoors grid-world of size N×N.</p>
    </para>
    <figure float="right" inlist="lof" labels="LABEL:keysdoors" xml:id="S4.F3">
      <tags>
        <tag>Figure 3</tag>
        <tag role="autoref">Figure 3</tag>
        <tag role="refnum">3</tag>
        <tag role="typerefnum">Figure 3</tag>
      </tags>
      <graphics candidates="figures/env.png" class="ltx_centering" graphic="figures/env.png" options="width=433.62pt" xml:id="S4.F3.g1"/>
      <toccaption class="ltx_centering"><tag close=" ">3</tag>KeysDoors(N=5).</toccaption>
      <caption class="ltx_centering"><tag close=". ">Figure 3</tag>KeysDoors(N=5).</caption>
    </figure>
    <para xml:id="S4.p4">
      <p>It contains <Math mode="inline" tex="N" text="N" xml:id="S4.p4.m1">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">N</XMTok>
          </XMath>
        </Math> keys and <Math mode="inline" tex="N" text="N" xml:id="S4.p4.m2">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">N</XMTok>
          </XMath>
        </Math> doors, represented by two different colors. The agent has a third color. The goal is to find the correct key and to open the correct door with it. As doors (resp. keys) are indistinguishable (except by their locations), an explorer has to try the different keys on the different doors. The available actions are {<text font="italic">go left, go right, go up, go down, take, open, wait</text>}. When an agent performs the action “take” on a key, it is then able to move with it. The actions “open” and “take” make the agent lose the key it was previously holding. To solve the task, the agent has to go to the correct key, take it, go to the correct door without performing “take” or “open” on the way (so as not to lose the key), and then “open” the door. We need the environment to require <text font="italic">perseverance</text>, so the reward is -1 for every action except <text font="italic">wait</text>, which is rewarded 0. Opening the correct door with the correct key gives a reward of 100 and terminates the episode. The task requires perseverance because a “lazy” policy that only waits gets a return of 0, whereas searching for the 100 reward costs -1 at each step. It is a well-known issue in RL that simple exploration leads to such lazy solutions.</p>
    </para>
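    <para>
      <p>The dynamics described above can be summarized in a minimal Python sketch. This is an illustration only, not the authors’ implementation: the names <text font="typewriter">KeysDoorsState</text> and <text font="typewriter">step</text> are hypothetical, and several details are assumptions that the text does not specify (moves are clamped at the grid border, a dropped key returns to its original location, and the successful “open” yields 100 with no additional step cost).</p>
      <verbatim>
```python
from dataclasses import dataclass
from typing import Optional, Tuple

ACTIONS = ("left", "right", "up", "down", "take", "open", "wait")
MOVES = {"left": (0, -1), "right": (0, 1), "up": (-1, 0), "down": (1, 0)}


@dataclass
class KeysDoorsState:
    n: int                               # grid side length N
    agent: Tuple[int, int]               # (row, col) of the agent
    keys: Tuple[Tuple[int, int], ...]    # key locations
    doors: Tuple[Tuple[int, int], ...]   # door locations
    correct_key: int                     # index of the correct key
    correct_door: int                    # index of the correct door
    held_key: Optional[int] = None       # index of the key being carried


def step(state, action):
    """Apply one action in place and return (reward, done)."""
    if action not in ACTIONS:
        raise ValueError("unknown action: " + repr(action))
    if action == "wait":
        return 0, False                  # waiting is the only free action
    if action in MOVES:
        d_row, d_col = MOVES[action]
        # Assumption: the agent stays in place when it hits the border.
        row = min(max(state.agent[0] + d_row, 0), state.n - 1)
        col = min(max(state.agent[1] + d_col, 0), state.n - 1)
        state.agent = (row, col)
        return -1, False
    # "take" and "open" both make the agent lose the key it was holding.
    previously_held = state.held_key
    state.held_key = None
    if action == "take":
        if state.agent in state.keys:
            state.held_key = state.keys.index(state.agent)
        return -1, False
    # action == "open": success needs the correct key on the correct door.
    solved = (
        state.agent in state.doors
        and state.doors.index(state.agent) == state.correct_door
        and previously_held == state.correct_key
    )
    return (100, True) if solved else (-1, False)
```
      </verbatim>
      <p>Under these assumptions, a policy that only waits accumulates a return of 0, while every exploratory action costs 1, which is the perseverance issue discussed above.</p>
    </para>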
    <para xml:id="S4.p5">
      <p>The KeysDoors environment is generated procedurally. For each column, locations for a door and a key are sampled uniformly without replacement. Thus, there is exactly one key and one door in each column, and these cannot be at the same location. The “correct” key is then sampled uniformly among the keys and the “correct” door is sampled uniformly among the doors. The initial position of the agent is sampled uniformly on the grid. The environment provides both a ground-truth state (an integer representing the current state), which SmtW does not use, and an RGB observation (as shown in Fig. <ref labelref="LABEL:keysdoors"/>), which SmtW uses.
Figure <ref labelref="LABEL:keysdoors_traj"/> shows a trajectory in one possible instance of the KeysDoors environment with <Math mode="inline" tex="N=5" text="N = 5" xml:id="S4.p5.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMTok font="italic" role="UNKNOWN">N</XMTok>
              <XMTok meaning="5" role="NUMBER">5</XMTok>
            </XMApp>
          </XMath>
        </Math>. Every observation <Math mode="inline" tex="x" text="x" xml:id="S4.p5.m2">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">x</XMTok>
          </XMath>
        </Math> (an <Math mode="inline" tex="N\times N\times 3" text="N * N * 3" xml:id="S4.p5.m3">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">×</XMTok>
              <XMTok font="italic" role="UNKNOWN">N</XMTok>
              <XMTok font="italic" role="UNKNOWN">N</XMTok>
              <XMTok meaning="3" role="NUMBER">3</XMTok>
            </XMApp>
          </XMath>
        </Math> tensor) is normalized between 0 and 1 by dividing by 255.</p>
    </para>
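    <para>
      <p>As an illustration of the generation procedure just described, the following sketch samples one instance and normalizes an observation. The function names <text font="typewriter">generate_keysdoors</text> and <text font="typewriter">normalize_observation</text>, the dictionary layout and the <text font="typewriter">seed</text> parameter are assumptions made for this example and do not come from the original code.</p>
      <verbatim>
```python
import random


def generate_keysdoors(n, seed=None):
    """Sample one KeysDoors instance on an n x n grid (n must be at least 2)."""
    rng = random.Random(seed)
    keys, doors = [], []
    for col in range(n):
        # Two distinct rows per column: sampling without replacement
        # guarantees the key and the door never share a cell.
        key_row, door_row = rng.sample(range(n), 2)
        keys.append((key_row, col))
        doors.append((door_row, col))
    return {
        "keys": keys,
        "doors": doors,
        "correct_key": rng.randrange(n),    # uniform among the n keys
        "correct_door": rng.randrange(n),   # uniform among the n doors
        "agent": (rng.randrange(n), rng.randrange(n)),  # uniform on the grid
    }


def normalize_observation(rgb):
    """Scale an N x N x 3 nested list of 0..255 values to floats in [0, 1]."""
    return [[[channel / 255.0 for channel in pixel] for pixel in row] for row in rgb]
```
      </verbatim>
      <p>Passing a fixed seed makes it possible to keep training and test instances disjoint and reproducible, which matters here because the bonus is always evaluated on unseen instances.</p>
    </para>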
    <para xml:id="S4.p6">
      <p><text font="bold">The demonstrations</text>. For a given instance of the environment, the demonstrator navigates between keys and doors and tries key/door pairs in a precise order. It takes the first key on the left and tries it on the first door on the left, then it tries the same key on the second door, and so on. Once it has tried the first key on every door, it repeats the operation with the second key, and proceeds in this way. The episode ends when the demonstrator finds the right key/door pair and obtains the reward. It then “exploits”, taking the correct key and directly opening the correct door, five consecutive times. Note that this also simulates the non-stationarity that arises in most goal-directed task-solving processes: one first mainly explores and then exploits more and more.</p>
    </para>
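    <para>
      <p>The order in which the demonstrator tries key/door pairs can be sketched as a nested enumeration. The function name <text font="typewriter">demonstrator_pair_order</text> is hypothetical, and locations are assumed to be (row, column) tuples with “left” meaning the smallest column index.</p>
      <verbatim>
```python
def demonstrator_pair_order(keys, doors):
    """Yield (key, door) pairs: keys left to right, and for each key, doors left to right."""
    keys_by_column = sorted(keys, key=lambda pos: pos[1])
    doors_by_column = sorted(doors, key=lambda pos: pos[1])
    for key in keys_by_column:
        for door in doors_by_column:
            yield key, door
```
      </verbatim>
      <p>The demonstrator would stop consuming this sequence as soon as the correct pair is found, so the length of the exploration phase depends on where the correct pair falls in the enumeration.</p>
    </para>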
    <para xml:id="S4.p7">
      <p><text font="bold">Train vs. Test.</text> The bonus is always used in new test environments, unseen in the demonstrations. SmtW’s ability to generalize to new environments is thus tested in all the following experiments. Given the possible positions of the keys, of the doors and then of the correct key and the correct door, there are <Math mode="inline" tex="(N-1)N^{3}" text="(N - 1) * N ^ 3" xml:id="S4.p7.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMDual>
                <XMRef idref="S4.p7.m1.1"/>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp xml:id="S4.p7.m1.1">
                    <XMTok meaning="minus" role="ADDOP">-</XMTok>
                    <XMTok font="italic" role="UNKNOWN">N</XMTok>
                    <XMTok meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">N</XMTok>
                <XMTok fontsize="70%" meaning="3" role="NUMBER">3</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> possible instances of the environment.
<text font="bold">The behaviors</text> that are designed to study what is actually encouraged or discouraged by SmtW are the following. Their associated bonus is always studied in test instances of the environment.
</p>
      <itemize xml:id="S4.I1">
        <item xml:id="S4.I1.i1">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">1st item</tag>
          </tags>
          <para xml:id="S4.I1.i1.p1">
            <p>The demonstrator behavior acts as described previously, sequentially trying key/door pairs.</p>
          </para>
        </item>
        <item xml:id="S4.I1.i2">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">2nd item</tag>
          </tags>
          <para xml:id="S4.I1.i2.p1">
            <p>The <text font="italic">random</text> behavior takes random actions. Trajectories are limited to 1000 steps.</p>
          </para>
        </item>
        <item xml:id="S4.I1.i3">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">3rd item</tag>
          </tags>
          <para xml:id="S4.I1.i3.p1">
            <p>The <text font="italic">demonstrator inverse</text> behavior is similar to the demonstrator in that it navigates to a key, takes it, navigates to a door and opens it. However, the key/door pairs are tried in the reverse order of the demonstrations.</p>
          </para>
        </item>
        <item xml:id="S4.I1.i4">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">4th item</tag>
          </tags>
          <para xml:id="S4.I1.i4.p1">
            <p>The <text font="italic">demonstrator random</text> behavior is also similar but tries the key/door pairs in a random order.</p>
          </para>
        </item>
        <item xml:id="S4.I1.i5">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">5th item</tag>
          </tags>
          <para xml:id="S4.I1.i5.p1">
            <p>The <text font="italic">dummy demonstrator</text> behavior navigates exactly like the demonstrator but drops the key at a random time on the way to the door (uniformly sampled on the path to the door) by taking action <text font="italic">open</text>. The trajectories are limited to 1000 steps.</p>
          </para>
        </item>
        <item xml:id="S4.I1.i6">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">6th item</tag>
          </tags>
          <para xml:id="S4.I1.i6.p1">
            <p>The <text font="italic">standing still</text> behavior remains in its original position by only taking the <text font="italic">wait</text> action.</p>
          </para>
        </item>
        <item xml:id="S4.I1.i7">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">7th item</tag>
          </tags>
          <para xml:id="S4.I1.i7.p1">
            <p>The <text font="italic">waiting demonstrator</text> behavior acts like the demonstrator but has a probability 0.1 of waiting at each step.</p>
          </para>
        </item>
        <item xml:id="S4.I1.i8">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">8th item</tag>
          </tags>
          <para xml:id="S4.I1.i8.p1">
            <p>The <text font="italic">unsafe demonstrator</text> acts like the demonstrator but takes the action <text font="italic">take</text> each time it moves until it holds a key. Taking the action <text font="italic">take</text> anywhere other than on a key can be considered as breaking a safety constraint that the demonstrator respects strictly.</p>
          </para>
        </item>
      </itemize>
    </para>
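    <para>
      <p>Several of these controlled behaviors are small perturbations of the demonstrator. As one example, the <text font="italic">waiting demonstrator</text> can be sketched as a wrapper around the demonstrator’s action sequence. The function name and the <text font="typewriter">seed</text> parameter are hypothetical, and the interpretation that a wait delays (rather than replaces) the next demonstrator action is an assumption.</p>
      <verbatim>
```python
import random


def waiting_demonstrator(actions, wait_probability=0.1, seed=None):
    """Yield the demonstrator's actions, inserting "wait" with the given probability at each step."""
    if wait_probability >= 1.0 or 0.0 > wait_probability:
        raise ValueError("wait_probability must lie in [0, 1)")
    rng = random.Random(seed)
    for action in actions:
        # At each step, wait with probability wait_probability;
        # otherwise perform the next demonstrator action.
        while wait_probability > rng.random():
            yield "wait"
        yield action
```
      </verbatim>
      <p>Removing the inserted waits recovers the demonstrator’s trajectory exactly, so the two behaviors visit the same states and differ only in the time they take.</p>
    </para>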
    <para xml:id="S4.p8">
      <p>A trajectory of an agent moving to a key, taking it, moving to a door and trying to open it with the key is shown in Fig. <ref labelref="LABEL:keysdoors_traj"/>.</p>
    </para>
    <figure inlist="lof" labels="LABEL:keysdoors_traj" placement="!htb" xml:id="S4.F4">
      <tags>
        <tag>Figure 4</tag>
        <tag role="autoref">Figure 4</tag>
        <tag role="refnum">4</tag>
        <tag role="typerefnum">Figure 4</tag>
      </tags>
      <graphics candidates="figures/smtw_traj.png" class="ltx_centering" graphic="figures/smtw_traj.png" options="width=433.62pt" xml:id="S4.F4.g1"/>
      <toccaption class="ltx_centering"><tag close=" ">4</tag>A trajectory of length 9 in an instance of the KeysDoors(N=5) environment.</toccaption>
      <caption class="ltx_centering"><tag close=". ">Figure 4</tag>A trajectory of length 9 in an instance of the KeysDoors(N=5) environment.</caption>
    </figure>
    <subsection inlist="toc" labels="LABEL:bonus_analysis" xml:id="S4.SS1">
      <tags>
        <tag>4.1</tag>
        <tag role="autoref">subsection 4.1</tag>
        <tag role="refnum">4.1</tag>
        <tag role="typerefnum">§4.1</tag>
      </tags>
      <title><tag close=". ">4.1</tag>Bonus analysis</title>
      <toctitle><tag close=" ">4.1</tag>Bonus analysis</toctitle>
      <para xml:id="S4.SS1.p1">
        <p>We train the SmtW bonus on 200 KeysDoors(N=5) training environments with 10 demonstrations for each of them. The implementation choices are detailed in Sec. <ref labelref="LABEL:implementation_details"/>.
In order to study the priors extracted from the demonstrations, we study the bonus given by SmtW along various trajectories, each following a given behavior.
We then plot the distribution of the bonus received by the various controlled behaviors on 20 test environments, unseen during the training of SmtW.
We compare the bonus given by SmtW along these trajectories to the one that would be given by a count-based bonus <cite class="ltx_citemacro_citep">(<bibref bibrefs="strehl2008analysis" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> and a random network distillation (RND) bonus <cite class="ltx_citemacro_citep">(<bibref bibrefs="burda2018exploration" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. The very same trajectories are presented to each bonus.</p>
      </para>
      <figure inlist="lof" labels="LABEL:hist_random" placement="!tbh" xml:id="S4.F5">
        <tags>
          <tag>Figure 5</tag>
          <tag role="autoref">Figure 5</tag>
          <tag role="refnum">5</tag>
          <tag role="typerefnum">Figure 5</tag>
        </tags>
        <inline-para align="center" class="ltx_minipage" vattach="middle" width="433.6pt">
          <para align="center" xml:id="S4.F5.p1">
            <graphics candidates="figures/hist_random2.pdf" graphic="figures/hist_random2" options="width=433.62pt" xml:id="S4.F5.p1.g1"/>
          </para>
        </inline-para>
        <toccaption class="ltx_centering"><tag close=" ">5</tag>Bonus distribution received by the demonstrator’s behavior (top) and a random behavior (bottom). We say that a bonus encourages a behavior A more than a behavior B if the distribution of the bonus along trajectories following A is globally higher than the one along trajectories following B. SmtW encourages the demonstrator behavior over the random behavior.</toccaption>
        <caption class="ltx_centering"><tag close=". ">Figure 5</tag>Bonus distribution received by the demonstrator’s behavior (top) and a random behavior (bottom). We say that a bonus encourages a behavior A more than a behavior B if the distribution of the bonus along trajectories following A is globally higher than the one along trajectories following B. SmtW encourages the demonstrator behavior over the random behavior.</caption>
      </figure>
      <figure inlist="lof" labels="LABEL:hist_order" placement="!tbh" xml:id="S4.F6">
        <tags>
          <tag>Figure 6</tag>
          <tag role="autoref">Figure 6</tag>
          <tag role="refnum">6</tag>
          <tag role="typerefnum">Figure 6</tag>
        </tags>
        <inline-para align="center" class="ltx_minipage" vattach="middle" width="433.6pt">
          <para align="center" xml:id="S4.F6.p1">
            <graphics candidates="figures/hist_order2.pdf" graphic="figures/hist_order2" options="width=433.62pt" xml:id="S4.F6.p1.g1"/>
          </para>
        </inline-para>
        <toccaption class="ltx_centering"><tag close=" ">6</tag>Bonus distribution received by the <text font="italic">demonstrator</text> behavior (top), the <text font="italic">demonstrator inverse</text> behavior (middle), and the <text font="italic">demonstrator random</text> behavior (bottom).</toccaption>
        <caption class="ltx_centering"><tag close=". ">Figure 6</tag>Bonus distribution received by the <text font="italic">demonstrator</text> behavior (top), the <text font="italic">demonstrator inverse</text> behavior (middle), and the <text font="italic">demonstrator random</text> behavior (bottom).</caption>
      </figure>
      <figure inlist="lof" labels="LABEL:hist_releasing" placement="!tbh" xml:id="S4.F7">
        <tags>
          <tag>Figure 7</tag>
          <tag role="autoref">Figure 7</tag>
          <tag role="refnum">7</tag>
          <tag role="typerefnum">Figure 7</tag>
        </tags>
        <inline-para align="center" class="ltx_minipage" vattach="middle" width="433.6pt">
          <para align="center" xml:id="S4.F7.p1">
            <graphics candidates="figures/hist_releasing2.pdf" graphic="figures/hist_releasing2.pdf" options="width=433.62pt" xml:id="S4.F7.p1.g1"/>
          </para>
        </inline-para>
        <toccaption class="ltx_centering"><tag close=" ">7</tag>Bonus distribution received by the <text font="italic">demonstrator</text> behavior (top) and by the <text font="italic">dummy demonstrator</text> behavior, acting (almost) like the demonstrator but releasing the key on the way to the door (bottom).</toccaption>
        <caption class="ltx_centering"><tag close=". ">Figure 7</tag>Bonus distribution received by the <text font="italic">demonstrator</text> behavior (top) and by the <text font="italic">dummy demonstrator</text> behavior, acting (almost) like the demonstrator but releasing the key on the way to the door (bottom).</caption>
      </figure>
      <para xml:id="S4.SS1.p2">
        <p><text font="bold">Does SmtW encourage a structured exploration more than a random one?</text> In Figure <ref labelref="LABEL:hist_random"/>, we compare the distribution of the bonus received along random trajectories to the one obtained by the demonstrator’s behavior. Recall that SmtW has been trained on similar environments but is tested here on different ones. It is thus provided with trajectories unseen during training.</p>
      </para>
      <para xml:id="S4.SS1.p3">
        <p>As shown in Figure <ref labelref="LABEL:hist_random"/>, the demonstrator’s behavior (top) is rewarded more by SmtW than the random behavior (bottom).
We can conclude that, on new unseen environments, SmtW encourages the agent to follow a demonstrator-like behavior more than a random behavior.
The count-based bonus also rewards the demonstrator’s behavior more than a random one, as a random behavior explores the environment very locally. Surprisingly, RND rewards the random behavior more than the demonstrator’s one. This might be explained by the fact that the demonstrator visits the same state several times in order to explore correctly. Indeed, the demonstrator has to return to the same key several times to take it and try it on the different doors.</p>
      </para>
      <para xml:id="S4.SS1.p4">
        <p><text font="bold">Does SmtW capture the demonstrator’s style, its way of exploring the environment?</text> We show in Figure <ref labelref="LABEL:hist_order"/> the distribution of the bonus received along different behaviors: the <text font="italic">demonstrator</text> one, the <text font="italic">demonstrator inverse</text> one and the <text font="italic">demonstrator random</text> one. These three behaviors lead to the same outcome, but we hope that SmtW captures the demonstrator’s exploration bias, and we check whether it encourages the behavior that tries the key/door pairs in the same order as in the demonstrations.
As shown in Figure <ref labelref="LABEL:hist_order"/>, the count-based bonus and RND reward the three behaviors similarly, as they lead to the same amount of novelty. Only the order in which the key/door pairs are tried is changed. SmtW, on the contrary, encourages reproducing the demonstrator’s bias. It gives a higher reward to the behavior trying the key/door pairs in the same order as in the demonstrations.</p>
      </para>
      <para xml:id="S4.SS1.p5">
        <p><text font="bold">Does SmtW capture the priors useful to solve the task?</text> Figure <ref labelref="LABEL:hist_releasing"/> shows the distribution of the bonus received by the <text font="italic">demonstrator</text> behavior and compares it to the one received when following the <text font="italic">dummy demonstrator</text> one.
As shown in Figure <ref labelref="LABEL:hist_releasing"/>, the count-based bonus and RND reward these two behaviors equally, as they bring the same amount of novelty (both in terms of ground-truth state and observations). SmtW does not reward the <text font="italic">dummy demonstrator</text> behavior as much as the demonstrator one, and we can interpret the lower distribution mode (SmtW-bottom) as the bonus obtained after losing the key. We can argue that SmtW has somehow captured the prior that it is useful to navigate from the key to the door without losing the key, as it rewards the <text font="italic">demonstrator</text> behavior more than the <text font="italic">dummy demonstrator</text> one.</p>
      </para>
      <figure inlist="lof" labels="LABEL:hist_novelty" placement="!tbh" xml:id="S4.F8">
        <tags>
          <tag>Figure 8</tag>
          <tag role="autoref">Figure 8</tag>
          <tag role="refnum">8</tag>
          <tag role="typerefnum">Figure 8</tag>
        </tags>
        <inline-para align="center" class="ltx_minipage" vattach="middle" width="433.6pt">
          <para align="center" xml:id="S4.F8.p1">
            <graphics candidates="figures/hist_novelty2.pdf" graphic="figures/hist_novelty2.pdf" options="width=433.62pt" xml:id="S4.F8.p1.g1"/>
          </para>
        </inline-para>
        <toccaption class="ltx_centering"><tag close=" ">8</tag>Bonus distribution received by the <text font="italic">demonstrator</text> behavior (top) and by the <text font="italic">standing still</text> behavior (bottom).</toccaption>
        <caption class="ltx_centering"><tag close=". ">Figure 8</tag>Bonus distribution received by the <text font="italic">demonstrator</text> behavior (top) and by the <text font="italic">standing still</text> behavior (bottom).</caption>
      </figure>
      <figure inlist="lof" labels="LABEL:hist_time" placement="!tbh" xml:id="S4.F9">
        <tags>
          <tag>Figure 9</tag>
          <tag role="autoref">Figure 9</tag>
          <tag role="refnum">9</tag>
          <tag role="typerefnum">Figure 9</tag>
        </tags>
        <inline-para align="center" class="ltx_minipage" vattach="middle" width="433.6pt">
          <para align="center" xml:id="S4.F9.p1">
            <graphics candidates="figures/hist_time2.pdf" graphic="figures/hist_time2.pdf" options="width=433.62pt" xml:id="S4.F9.p1.g1"/>
          </para>
        </inline-para>
        <toccaption class="ltx_centering"><tag close=" ">9</tag>Bonus distribution received by the <text font="italic">demonstrator</text> behavior (top) and by the <text font="italic">waiting demonstrator</text> behavior (bottom).</toccaption>
        <caption class="ltx_centering"><tag close=". ">Figure 9</tag>Bonus distribution received by the <text font="italic">demonstrator</text> behavior (top) and by the <text font="italic">waiting demonstrator</text> behavior (bottom).</caption>
      </figure>
      <figure inlist="lof" labels="LABEL:hist_safety" placement="!tbh" xml:id="S4.F10">
        <tags>
          <tag>Figure 10</tag>
          <tag role="autoref">Figure 10</tag>
          <tag role="refnum">10</tag>
          <tag role="typerefnum">Figure 10</tag>
        </tags>
        <inline-para align="center" class="ltx_minipage" vattach="middle" width="433.6pt">
          <para align="center" xml:id="S4.F10.p1">
            <graphics candidates="figures/hist_safety2.pdf" graphic="figures/hist_safety2.pdf" options="width=433.62pt" xml:id="S4.F10.p1.g1"/>
          </para>
        </inline-para>
        <toccaption class="ltx_centering"><tag close=" ">10</tag>Bonus distribution received by the <text font="italic">demonstrator</text> behavior (top) vs. by the <text font="italic">unsafe demonstrator</text> behavior (bottom).</toccaption>
        <caption class="ltx_centering"><tag close=". ">Figure 10</tag>Bonus distribution received by the <text font="italic">demonstrator</text> behavior (top) vs. by the <text font="italic">unsafe demonstrator</text> behavior (bottom).</caption>
      </figure>
      <para xml:id="S4.SS1.p6">
        <p><text font="bold">Does SmtW encourage long-term exploration?</text> As the environment gives a reward of <Math mode="inline" tex="-1" text="- 1" xml:id="S4.SS1.p6.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="minus" role="ADDOP">-</XMTok>
                <XMTok meaning="1" role="NUMBER">1</XMTok>
              </XMApp>
            </XMath>
          </Math> for taking any action but the <text font="italic">wait</text> action, an agent not exploring sufficiently would quickly converge to the policy only taking action <text font="italic">wait</text> to avoid negative rewards (verified in Figure <ref labelref="LABEL:res"/>). This same problem is visible in the <text font="italic">Pitfall!</text> game, where the best agents learn a policy obtaining 0 reward, while persevering humans get much higher scores. We show in Figure <ref labelref="LABEL:hist_novelty"/> the distribution of bonus obtained by the <text font="italic">standing still</text> behavior.
As shown in Figure <ref labelref="LABEL:hist_novelty"/>, SmtW gives a much lower reward to a behavior that does not seek novelty. As expected, the count-based bonus is very close to 0 for such a behavior. Perhaps surprisingly, RND gives this behavior a negative bonus, yet its average bonus is not lower than the one received by the demonstrator’s behavior. This might also be due to the bonus normalization that RND uses (zero-mean, unit-variance).</p>
      </para>
      <para xml:id="S4.SS1.p7">
          <p><text font="bold">Does SmtW capture the constraints the demonstrator may be subject to?</text> A demonstrator can be subject to time or energy constraints. In the demonstrations, the demonstrator tries to explore the environment as fast as possible and does not take the action <text font="italic">wait</text> on its way to keys and doors. We compare the bonus distribution obtained by the <text font="italic">waiting demonstrator</text> behavior to the one obtained by the <text font="italic">demonstrator</text> one.</p>
      </para>
      <para xml:id="S4.SS1.p8">
        <p>As shown in Figure <ref labelref="LABEL:hist_time"/>, RND and the count-based bonus reward these two behaviors equally. SmtW, on the other hand, rewards the <text font="italic">waiting demonstrator</text> behavior less. We argue that it has somehow captured the prior resulting from the resource constraint that leads the demonstrator to try the key/door pairs as fast as possible. In other words, it favors behaviors that, as in the demonstrations, discard the <text font="italic">wait</text> action, which simplifies the exploration of the MDP.
Moreover, a demonstrator might be subject to safety constraints. For example, it might be dangerous for a robot to try an action in an inappropriate place. The demonstrations minimize the number of times the action “take” is used, and only use it on keys. We can thus consider that the demonstrator’s behavior complies with safety constraints. We show in Figure <ref labelref="LABEL:hist_safety"/> the bonus distribution obtained by the demonstrator’s behavior and compare it with the one obtained by the <text font="italic">unsafe demonstrator</text>.
As shown in Figure <ref labelref="LABEL:hist_safety"/>, RND and the count-based bonus reward these two behaviors equally. This is expected, as they bring the same amount of novelty. In contrast, SmtW rewards the <text font="italic">unsafe demonstrator</text> behavior less, capturing the safety prior the demonstrator has been subject to.</p>
      </para>
      <para xml:id="S4.SS1.p9">
        <p>Overall, we argue that SmtW is able to recover some important biases and constraints inherent to the demonstrations. Hand-crafting a bonus expressing these motivations could be extremely complicated, and we have shown that SmtW is able to generalize these motivations to unseen environments.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S4.SS2">
      <tags>
        <tag>4.2</tag>
        <tag role="autoref">subsection 4.2</tag>
        <tag role="refnum">4.2</tag>
        <tag role="typerefnum">§4.2</tag>
      </tags>
      <title><tag close=". ">4.2</tag>Training an agent on the bonus</title>
      <toctitle><tag close=" ">4.2</tag>Training an agent on the bonus</toctitle>
      <para xml:id="S4.SS2.p1">
        <p>We now wish to check that an agent can benefit from SmtW. We thus train a <Math mode="inline" tex="Q" text="Q" xml:id="S4.SS2.p1.m1">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">Q</XMTok>
            </XMath>
          </Math>-learning agent with SmtW and compare the results with those of a simple <Math mode="inline" tex="\epsilon" text="epsilon" xml:id="S4.SS2.p1.m2">
            <XMath>
              <XMTok font="italic" name="epsilon" role="UNKNOWN">ϵ</XMTok>
            </XMath>
          </Math>-greedy (<Math mode="inline" tex="\epsilon" text="epsilon" xml:id="S4.SS2.p1.m3">
            <XMath>
              <XMTok font="italic" name="epsilon" role="UNKNOWN">ϵ</XMTok>
            </XMath>
          </Math>=0.1) exploration strategy and a count-based bonus with <Math mode="inline" tex="B(s,a)=N(s,a)^{-1/2}" text="B * open-interval@(s, a) = N * (open-interval@(s, a)) ^ (- 1 / 2)" xml:id="S4.SS2.p1.m4">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">B</XMTok>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="open-interval"/>
                      <XMRef idref="S4.SS2.p1.m4.1"/>
                      <XMRef idref="S4.SS2.p1.m4.2"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="italic" role="UNKNOWN" xml:id="S4.SS2.p1.m4.1">s</XMTok>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMTok font="italic" role="UNKNOWN" xml:id="S4.SS2.p1.m4.2">a</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">N</XMTok>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="open-interval"/>
                        <XMRef idref="S4.SS2.p1.m4.3"/>
                        <XMRef idref="S4.SS2.p1.m4.4"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMTok font="italic" role="UNKNOWN" xml:id="S4.SS2.p1.m4.3">s</XMTok>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMTok font="italic" role="UNKNOWN" xml:id="S4.SS2.p1.m4.4">a</XMTok>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                      <XMApp>
                        <XMTok fontsize="70%" meaning="divide" role="MULOP">/</XMTok>
                        <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                        <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>.</p>
      </para>
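      <para>
        <p>The count-based baseline can be illustrated with a minimal sketch of one tabular Q-learning step in which the bonus B(s,a)=N(s,a)^(-1/2) is added to the task reward. The names <text font="typewriter">make_tables</text>, <text font="typewriter">count_bonus</text>, <text font="typewriter">epsilon_greedy</text> and <text font="typewriter">q_update</text> are hypothetical, and the agent's discount factor and the choice to increment the count before computing the bonus are assumptions of this sketch, not details given in the text.</p>
        <verbatim>
```python
import random
from collections import defaultdict


def make_tables():
    """Return an empty Q-table and an empty visit-count table."""
    return defaultdict(float), defaultdict(int)


def count_bonus(counts, state, action):
    """Count-based bonus B(s, a) = N(s, a)^(-1/2); N must be at least 1."""
    return counts[(state, action)] ** -0.5


def epsilon_greedy(q, state, n_actions, epsilon, rng):
    """Pick a uniformly random action with probability epsilon, else a greedy one."""
    if epsilon > rng.random():
        return rng.randrange(n_actions)
    return max(range(n_actions), key=lambda a: q[(state, a)])


def q_update(q, counts, s, a, r, s_next, n_actions, lr=0.1, gamma=0.99):
    """One Q-learning step on the task reward plus the count-based bonus.

    Assumption: the visit count is incremented before the bonus is read,
    so the first visit of (s, a) receives a bonus of 1.
    """
    counts[(s, a)] += 1
    best_next = max(q[(s_next, b)] for b in range(n_actions))
    target = r + count_bonus(counts, s, a) + gamma * best_next
    q[(s, a)] += lr * (target - q[(s, a)])
    return q[(s, a)]
```
        </verbatim>
        <p>Because the bonus shrinks with every visit of a state-action pair, the incentive it provides is purely novelty-driven, which is what pushes this baseline towards exhaustive exploration.</p>
      </para>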
      <figure inlist="lof" labels="LABEL:res" placement="!h" xml:id="S4.F11">
        <tags>
          <tag>Figure 11</tag>
          <tag role="autoref">Figure 11</tag>
          <tag role="refnum">11</tag>
          <tag role="typerefnum">Figure 11</tag>
        </tags>
        <graphics candidates="figures/results.pdf" graphic="figures/results.pdf" options="width=433.62pt" xml:id="S4.F11.g1"/>
        <toccaption><tag close=" ">11</tag>Median and min/max values of the return per episode (left) and of the total bonus per episode (right).</toccaption>
        <caption><tag close=". ">Figure 11</tag>Median and min/max values of the return per episode (left) and of the total bonus per episode (right).</caption>
      </figure>
      <para xml:id="S4.SS2.p2">
        <p>The results are aggregated over 10 newly generated environments, unseen during SmtW training.
For each of these environments, the experiment is repeated twice. We present, for each algorithm, the best result after a hyper-parameter search.
The bonus given by our method is computed to capture the exploratory behavior of the demonstrator. So that the agent does not keep exploring forever, our bonus is here divided by <Math mode="inline" tex="\sqrt{k}" text="square-root@(k)" xml:id="S4.SS2.p2.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="square-root"/>
                <XMTok font="italic" role="UNKNOWN">k</XMTok>
              </XMApp>
            </XMath>
          </Math>, where k is the number of training steps.</p>
      </para>
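      <para>
        <p>The scaling of the bonus by the square root of k can be sketched as follows. The function names are hypothetical, and we assume here that k counts training steps starting from 1 and that the scaled bonus is simply added to the task reward.</p>
        <verbatim>
```python
import math


def decayed_bonus(raw_bonus, k):
    """Scale a learned bonus by 1 / sqrt(k), with k the training step (k >= 1)."""
    if k >= 1:
        return raw_bonus / math.sqrt(k)
    raise ValueError("k must be at least 1")


def shaped_reward(task_reward, raw_bonus, k):
    """Task reward plus the decayed exploration bonus (assumed additive)."""
    return task_reward + decayed_bonus(raw_bonus, k)
```
        </verbatim>
        <p>With this schedule the bonus vanishes as training proceeds, so the agent's objective approaches the task reward alone.</p>
      </para>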
      <para xml:id="S4.SS2.p3">
        <p>As Figure <ref labelref="LABEL:res"/> shows, Q-learning with an <Math mode="inline" tex="\epsilon" text="epsilon" xml:id="S4.SS2.p3.m1">
            <XMath>
              <XMTok font="italic" name="epsilon" role="UNKNOWN">ϵ</XMTok>
            </XMath>
          </Math>-greedy exploration strategy quickly gets stuck “waiting” at each timestep. SmtW encourages the agent to visit its environment and solves the 10 new environments much faster than the count-based method, which pushes for exhaustive exploration.</p>
      </para>
    </subsection>
    <subsection inlist="toc" labels="LABEL:implementation_details" xml:id="S4.SS3">
      <tags>
        <tag>4.3</tag>
        <tag role="autoref">subsection 4.3</tag>
        <tag role="refnum">4.3</tag>
        <tag role="typerefnum">§4.3</tag>
      </tags>
      <title><tag close=". ">4.3</tag>Implementation Details</title>
      <toctitle><tag close=" ">4.3</tag>Implementation Details</toctitle>
      <para xml:id="S4.SS3.p1">
        <p>Our method works directly with visual inputs, as shown in Fig. <ref labelref="LABEL:keysdoors"/>.
The network used for the behavioral cloning policy <Math mode="inline" tex="\pi_{\phi}" text="pi _ phi" xml:id="S4.SS3.p1.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                <XMTok font="italic" fontsize="70%" name="phi" role="UNKNOWN">ϕ</XMTok>
              </XMApp>
            </XMath>
          </Math> has the following architecture: an LSTM with 64 units, a fully-connected layer with 512 units and ReLU activation, and an output layer with as many units as there are actions available in the environment (7 for KeysDoors).
It is trained with the Adam optimizer <cite class="ltx_citemacro_citep">(<bibref bibrefs="kingma2014adam" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> with a learning rate of <Math mode="inline" tex="10^{-3}" text="10 ^ (- 3)" xml:id="S4.SS3.p1.m2">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok meaning="10" role="NUMBER">10</XMTok>
                <XMApp>
                  <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                  <XMTok fontsize="70%" meaning="3" role="NUMBER">3</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> and a batch size of <Math mode="inline" tex="1" text="1" xml:id="S4.SS3.p1.m3">
            <XMath>
              <XMTok meaning="1" role="NUMBER">1</XMTok>
            </XMath>
          </Math>. It uses the visual input from the environment and not the ground-truth state.</p>
      </para>
      <para xml:id="S4.SS3.p2">
        <p>The network used for the regression of the bonus <Math content-tex="\B_{\theta}" mode="inline" tex="\operatorname{B}_{\theta}" text="B _ theta" xml:id="S4.SS3.p2.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMDual role="OPFUNCTION">
                  <XMTok name="B" role="OPFUNCTION" scriptpos="post"/>
                  <XMTok role="OPFUNCTION" scriptpos="post">B</XMTok>
                </XMDual>
                <XMTok font="italic" fontsize="70%" name="theta" role="UNKNOWN">θ</XMTok>
              </XMApp>
            </XMath>
          </Math> has the same architecture but an output layer with a single unit.
It is trained with the Adam optimizer, a learning rate of <Math mode="inline" tex="10^{-4}" text="10 ^ (- 4)" xml:id="S4.SS3.p2.m2">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok meaning="10" role="NUMBER">10</XMTok>
                <XMApp>
                  <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                  <XMTok fontsize="70%" meaning="4" role="NUMBER">4</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> and a batch size of <Math mode="inline" tex="1" text="1" xml:id="S4.SS3.p2.m3">
            <XMath>
              <XMTok meaning="1" role="NUMBER">1</XMTok>
            </XMath>
          </Math>.
The discount factor used in SmtW is set to <Math mode="inline" tex="\gamma=0.99" text="gamma = 0.99" xml:id="S4.SS3.p2.m4">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
                <XMTok meaning="0.99" role="NUMBER">0.99</XMTok>
              </XMApp>
            </XMath>
          </Math>.</p>
      </para>
      <para xml:id="S4.SS3.p3">
        <p>For the experiment shown in Figure <ref labelref="LABEL:res"/>, the tabular <Math mode="inline" tex="Q" text="Q" xml:id="S4.SS3.p3.m1">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">Q</XMTok>
            </XMath>
          </Math>-learning is trained on the 10 test environments twice and the figure shows the median and the min/max values. For each of the compared algorithms, we sweep the agent learning rate over the following values: <Math mode="inline" tex="[0.01,0.1,0.5,0.7]" text="list@(0.01, 0.1, 0.5, 0.7)" xml:id="S4.SS3.p3.m2">
            <XMath>
              <XMDual>
                <XMApp>
                  <XMTok meaning="list"/>
                  <XMRef idref="S4.SS3.p3.m2.1"/>
                  <XMRef idref="S4.SS3.p3.m2.2"/>
                  <XMRef idref="S4.SS3.p3.m2.3"/>
                  <XMRef idref="S4.SS3.p3.m2.4"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">[</XMTok>
                  <XMTok meaning="0.01" role="NUMBER" xml:id="S4.SS3.p3.m2.1">0.01</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok meaning="0.1" role="NUMBER" xml:id="S4.SS3.p3.m2.2">0.1</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok meaning="0.5" role="NUMBER" xml:id="S4.SS3.p3.m2.3">0.5</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok meaning="0.7" role="NUMBER" xml:id="S4.SS3.p3.m2.4">0.7</XMTok>
                  <XMTok role="CLOSE" stretchy="false">]</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>. Only the result of the learning rate with the highest median return over the <Math mode="inline" tex="10\times 2" text="10 * 2" xml:id="S4.SS3.p3.m3">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">×</XMTok>
                <XMTok meaning="10" role="NUMBER">10</XMTok>
                <XMTok meaning="2" role="NUMBER">2</XMTok>
              </XMApp>
            </XMath>
          </Math> runs is shown for each algorithm. The <Math mode="inline" tex="\varepsilon" text="varepsilon" xml:id="S4.SS3.p3.m4">
            <XMath>
              <XMTok font="italic" name="varepsilon" role="UNKNOWN">ε</XMTok>
            </XMath>
          </Math>-greedy strategy is used for all methods with <Math mode="inline" tex="\varepsilon=0.1" text="varepsilon = 0.1" xml:id="S4.SS3.p3.m5">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" name="varepsilon" role="UNKNOWN">ε</XMTok>
                <XMTok meaning="0.1" role="NUMBER">0.1</XMTok>
              </XMApp>
            </XMath>
          </Math>.
Even though the agent is tabular, we recall that <text class="ltx_markedasmath">SmtW</text> itself does not access the ground-truth state of the environment. It works from observations. The count-based bonus, on the contrary, counts ground-truth states.</p>
      </para>
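      <para>
        <p>The selection procedure described above can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, and each run is assumed to be summarized by a single scalar return, which the text does not specify.</p>
        <verbatim>
```python
from statistics import median

# Learning rates swept for every compared algorithm.
LEARNING_RATES = [0.01, 0.1, 0.5, 0.7]


def select_learning_rate(returns_by_lr):
    """Return the learning rate whose runs reach the highest median return.

    returns_by_lr maps each learning rate to the list of per-run returns
    (10 environments x 2 repetitions in the experiment).
    """
    return max(returns_by_lr, key=lambda lr: median(returns_by_lr[lr]))


def summarize(returns):
    """Median and min/max over runs, the statistics reported in the figure."""
    return {"median": median(returns), "min": min(returns), "max": max(returns)}
```
        </verbatim>
      </para>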
    </subsection>
  </section>
  <section inlist="toc" xml:id="S5">
    <tags>
      <tag>5</tag>
      <tag role="autoref">section 5</tag>
      <tag role="refnum">5</tag>
      <tag role="typerefnum">§5</tag>
    </tags>
    <title><tag close=". ">5</tag>Related Work</title>
    <toctitle><tag close=" ">5</tag>Related Work</toctitle>
    <para xml:id="S5.p1">
      <p><text font="bold">Intrinsic Motivation.</text> Intrinsic motivation is essential to mental development <cite class="ltx_citemacro_citep">(<bibref bibrefs="oudeyer2007intrinsic" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> and one can argue that it may consequently be an essential component of computational learning. <cite class="ltx_citemacro_citet"><bibref bibrefs="oudeyer2009intrinsic" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> argue that all humans respond to intrinsic motivations. Young infants’ motivations can be described as more chaotic, as they push children to bite, throw, grasp or shout in order to learn. Adults, in contrast, have more structured intrinsic motivations, activated, for instance, when they play games, read novels or watch movies. Correctly using these numerous intrinsic motivations can be key to training agents that solve increasingly difficult tasks.
Instead of modeling such intrinsic motivations to mimic cognitive processes, we learn them from demonstrations.</p>
    </para>
    <para xml:id="S5.p2">
      <p><text font="bold">Exploration.</text> To provide an exploration signal to the agent, <cite class="ltx_citemacro_citep">(<bibref bibrefs="strehl2008analysis" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> proposed the very intuitive count-based method to measure novelty. By counting how many times the agent has been in a given state, it rewards less-visited states. Several methods extended this idea to large state-space problems <cite class="ltx_citemacro_citep">(<bibref bibrefs="ostrovski2017count,bellemare2016unifying,tang2017exploration,machado2018revisiting" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, where it is not possible to count state occupancy. Intrinsic curiosity is also commonly computed as a prediction error, either trying to predict the environment’s dynamics  <cite class="ltx_citemacro_citep">(<bibref bibrefs="pathak2017curiosity,raileanu2020ride" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> or random statistics about the current state  <cite class="ltx_citemacro_citep">(<bibref bibrefs="burda2018exploration" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Other methods also try to measure surprise as a prediction gain <cite class="ltx_citemacro_citep">(<bibref bibrefs="schmidhuber1991curious,houthooft2016variational" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Instead of designing such a bonus, we aim at learning one from demonstrations.</p>
    </para>
    <para xml:id="S5.p3">
      <p><text font="bold">Learning from demonstrations.</text> Imitation learning, the problem of learning from demonstrations, is typically divided into two paradigms. (1) Behavioral cloning <cite class="ltx_citemacro_citep">(<bibref bibrefs="pomerleau1991efficient,bagnell2007boosting,ross2010efficient" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> tries to directly match the demonstrator’s behavior, generally using supervised learning techniques. (2) Inverse Reinforcement Learning  <cite class="ltx_citemacro_citep">(<bibref bibrefs="russell1998learning,ng2000algorithms" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> first tries to recover a reward explaining the demonstrator’s behavior, before optimizing that reward to imitate the demonstrator. Some methods output an explicit reward <cite class="ltx_citemacro_citep">(<bibref bibrefs="klein2013cascaded,abbeel2004apprenticeship,ng2000algorithms,ziebart2008maximum" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> while adversarial imitation learning can be seen as IRL with implicit reward recovery  <cite class="ltx_citemacro_citep">(<bibref bibrefs="ho2016generative,fu2017learning,finn2016guided" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Overall, these methods all assume near-optimality of the demonstrations. Some works try to relax this assumption and to learn from sub-optimal demonstrations <cite class="ltx_citemacro_citep">(<bibref bibrefs="jacq2019learning,brown2019extrapolating" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. IRL methods typically control the quality of their algorithm through the proxy of the return obtained by an agent trained on the inferred reward.</p>
    </para>
    <para xml:id="S5.p4">
      <p>Our method differs from these methods in that it does not assume that demonstrations are optimal but rather tries to answer the question: “In what way is the demonstrator’s behavior deviating from an optimal policy?”.
Moreover, we do not seek to recover a reward as in IRL but rather a bonus which, added to the environment reward, explains the demonstrator’s behavior. Facing the same problem, namely that the usual proxy to control the algorithm quality (training an agent on the inferred bonus) is not informative, we decided to study our method through its response to various behaviors.</p>
    </para>
  </section>
  <section inlist="toc" xml:id="S6">
    <tags>
      <tag>6</tag>
      <tag role="autoref">section 6</tag>
      <tag role="refnum">6</tag>
      <tag role="typerefnum">§6</tag>
    </tags>
    <title><tag close=". ">6</tag>Conclusion</title>
    <toctitle><tag close=" ">6</tag>Conclusion</toctitle>
    <para xml:id="S6.p1">
      <p>In this work, we present a novel method for extracting an intrinsic bonus from demonstrations. The method we introduce is offline and does not require environment interactions to recover the bonus, unlike recent adversarial imitation methods, which need numerous interactions in order to recover a reward function. In any case, those methods could not be readily applied to our problem, as they do not explicitly compute a reward function. Moreover, to the best of our knowledge, this is one of the very first methods to recover some kind of reward that is history-dependent. We show how this bonus generalizes to unseen environments and is able to convey long-term priors. We exemplified the approach on a simple yet didactic and challenging example. Testing the method on a larger-scale environment, however, would require human exploratory demonstrations. Gathering such a dataset is costly and very few are already available, none of them really covering our setting.
Even though the given example is simple, this novel approach of capturing the demonstrator’s bias could potentially lead to new lines of work in RL. For instance, one could use our method to implement <text font="italic">behavioral style-transfer</text> in RL and show an agent a specific way to solve the task through demonstrations. Combining a reward and biases extracted from demonstrations may also help for robotic tasks, where some aspects of the task are easily programmable with a reward but some expectations on how to solve the task may be easier to transmit through demonstrations. This could also lead to advances in tackling misspecified rewards. Using both a reward, which would contain information on the task to solve but not fully describe the constraints of the problem, and demonstrations to correct the reward can be key to training sequential controllers in complex dynamics.</p>
    </para>
    <pagination role="newpage"/>
  </section>
  <bibliography xml:id="bib">
    <title>References</title>
    <biblist>
      <bibitem xml:id="bib.bib1">
        <tags>
          <tag role="number">1</tag>
          <tag role="refnum">(1)</tag>
        </tags>
        <bibblock/>
      </bibitem>
      <bibitem key="abbeel2004apprenticeship" xml:id="bib.bib2">
        <tags>
          <tag role="number">2</tag>
          <tag role="year">2004</tag>
          <tag role="authors">Abbeel and Ng</tag>
          <tag role="fullauthors">Abbeel and Ng</tag>
          <tag role="refnum">Abbeel and Ng (2004)</tag>
          <tag role="key">abbeel2004apprenticeship</tag>
        </tags>
        <bibblock>
Pieter Abbeel and
Andrew Y Ng. 2004.
</bibblock>
        <bibblock>Apprenticeship learning via inverse reinforcement
learning. In <emph font="italic">International Conference on Machine
Learning</emph>.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="andrychowicz2020learning" xml:id="bib.bib3">
        <tags>
          <tag role="number">3</tag>
          <tag role="year">2020</tag>
          <tag role="authors">Andrychowicz
et al.</tag>
          <tag role="fullauthors">Andrychowicz, Baker, Chociej, Jozefowicz,
McGrew, Pachocki, Petron, Plappert, Powell, Ray, et al.</tag>
          <tag role="refnum">Andrychowicz
et al. (2020)</tag>
          <tag role="key">andrychowicz2020learning</tag>
        </tags>
        <bibblock>
OpenAI: Marcin Andrychowicz,
Bowen Baker, Maciek Chociej,
Rafal Jozefowicz, Bob McGrew,
Jakub Pachocki, Arthur Petron,
Matthias Plappert, Glenn Powell,
Alex Ray, et al. 2020.
</bibblock>
        <bibblock>Learning dexterous in-hand manipulation.
</bibblock>
        <bibblock><emph font="italic">The International Journal of Robotics
Research</emph> (2020).
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="bagnell2007boosting" xml:id="bib.bib4">
        <tags>
          <tag role="number">4</tag>
          <tag role="year">2007</tag>
          <tag role="authors">Bagnell et al.</tag>
          <tag role="fullauthors">Bagnell, Chestnutt, Bradley, and
Ratliff</tag>
          <tag role="refnum">Bagnell et al. (2007)</tag>
          <tag role="key">bagnell2007boosting</tag>
        </tags>
        <bibblock>
JA Bagnell, Joel
Chestnutt, David M Bradley, and
Nathan D Ratliff. 2007.
</bibblock>
        <bibblock>Boosting structured prediction for imitation
learning. In <emph font="italic">Advances in Neural Information
Processing Systems</emph>.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="bellemare2016unifying" xml:id="bib.bib5">
        <tags>
          <tag role="number">5</tag>
          <tag role="year">2016</tag>
          <tag role="authors">Bellemare et al.</tag>
          <tag role="fullauthors">Bellemare, Srinivasan, Ostrovski, Schaul,
Saxton, and Munos</tag>
          <tag role="refnum">Bellemare et al. (2016)</tag>
          <tag role="key">bellemare2016unifying</tag>
        </tags>
        <bibblock>
Marc Bellemare, Sriram
Srinivasan, Georg Ostrovski, Tom Schaul,
David Saxton, and Remi Munos.
2016.
</bibblock>
        <bibblock>Unifying count-based exploration and intrinsic
motivation. In <emph font="italic">Advances in neural information
processing systems</emph>. 1471–1479.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="berseth2019smirl" xml:id="bib.bib6">
        <tags>
          <tag role="number">6</tag>
          <tag role="year">2019</tag>
          <tag role="authors">Berseth et al.</tag>
          <tag role="fullauthors">Berseth, Geng, Devin, Finn, Jayaraman, and
Levine</tag>
          <tag role="refnum">Berseth et al. (2019)</tag>
          <tag role="key">berseth2019smirl</tag>
        </tags>
        <bibblock>
Glen Berseth, Daniel
Geng, Coline Devin, Chelsea Finn,
Dinesh Jayaraman, and Sergey Levine.
2019.
</bibblock>
        <bibblock>SMiRL: Surprise Minimizing RL in Dynamic
Environments.
</bibblock>
        <bibblock><emph font="italic">arXiv preprint arXiv:1912.05510</emph>
(2019).
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="brown2019extrapolating" xml:id="bib.bib7">
        <tags>
          <tag role="number">7</tag>
          <tag role="year">2019</tag>
          <tag role="authors">Brown
et al.</tag>
          <tag role="fullauthors">Brown, Goo, Nagarajan, and Niekum</tag>
          <tag role="refnum">Brown
et al. (2019)</tag>
          <tag role="key">brown2019extrapolating</tag>
        </tags>
        <bibblock>
Daniel Brown, Wonjoon
Goo, Prabhat Nagarajan, and Scott
Niekum. 2019.
</bibblock>
        <bibblock>Extrapolating Beyond Suboptimal Demonstrations via
Inverse Reinforcement Learning from Observations. In
<emph font="italic">International Conference on Machine Learning</emph>.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="burda2018exploration" xml:id="bib.bib8">
        <tags>
          <tag role="number">8</tag>
          <tag role="year">2018</tag>
          <tag role="authors">Burda
et al.</tag>
          <tag role="fullauthors">Burda, Edwards, Storkey, and Klimov</tag>
          <tag role="refnum">Burda
et al. (2018)</tag>
          <tag role="key">burda2018exploration</tag>
        </tags>
        <bibblock>
Yuri Burda, Harrison
Edwards, Amos Storkey, and Oleg
Klimov. 2018.
</bibblock>
        <bibblock>Exploration by random network distillation.
</bibblock>
        <bibblock><emph font="italic">International Conference on Learning
Representations (ICLR)</emph> (2018).
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="byrne1990machiavellian" xml:id="bib.bib9">
        <tags>
          <tag role="number">9</tag>
          <tag role="year">1989</tag>
          <tag role="authors">Byrne and Whiten</tag>
          <tag role="fullauthors">Byrne and Whiten</tag>
          <tag role="refnum">Byrne and Whiten (1989)</tag>
          <tag role="key">byrne1990machiavellian</tag>
        </tags>
        <bibblock>
Richard W Byrne and
Andrew Whiten. 1989.
</bibblock>
        <bibblock><emph font="italic">Machiavellian intelligence: social
expertise and the evolution of intellect in monkeys, apes, and humans</emph>.
</bibblock>
        <bibblock>Clarendon Press.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="deci2010intrinsic" xml:id="bib.bib10">
        <tags>
          <tag role="number">10</tag>
          <tag role="year">2010</tag>
          <tag role="authors">Deci and Ryan</tag>
          <tag role="fullauthors">Deci and Ryan</tag>
          <tag role="refnum">Deci and Ryan (2010)</tag>
          <tag role="key">deci2010intrinsic</tag>
        </tags>
        <bibblock>
Edward L Deci and
Richard M Ryan. 2010.
</bibblock>
        <bibblock>Intrinsic motivation.
</bibblock>
        <bibblock><emph font="italic">The corsini encyclopedia of psychology</emph>
(2010), 1–2.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="dubey2018investigating" xml:id="bib.bib11">
        <tags>
          <tag role="number">11</tag>
          <tag role="year">2018</tag>
          <tag role="authors">Dubey et al.</tag>
          <tag role="fullauthors">Dubey, Agrawal, Pathak, Griffiths, and
Efros</tag>
          <tag role="refnum">Dubey et al. (2018)</tag>
          <tag role="key">dubey2018investigating</tag>
        </tags>
        <bibblock>
Rachit Dubey, Pulkit
Agrawal, Deepak Pathak, Thomas L
Griffiths, and Alexei A Efros.
2018.
</bibblock>
        <bibblock>Investigating human priors for playing video
games.
</bibblock>
        <bibblock><emph font="italic">arXiv preprint arXiv:1802.10217</emph>
(2018).
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="finn2016guided" xml:id="bib.bib12">
        <tags>
          <tag role="number">12</tag>
          <tag role="year">2016</tag>
          <tag role="authors">Finn
et al.</tag>
          <tag role="fullauthors">Finn, Levine, and Abbeel</tag>
          <tag role="refnum">Finn
et al. (2016)</tag>
          <tag role="key">finn2016guided</tag>
        </tags>
        <bibblock>
Chelsea Finn, Sergey
Levine, and Pieter Abbeel.
2016.
</bibblock>
        <bibblock>Guided cost learning: Deep inverse optimal control
via policy optimization. In <emph font="italic">International
Conference on Machine Learning</emph>.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="fu2017learning" xml:id="bib.bib13">
        <tags>
          <tag role="number">13</tag>
          <tag role="year">2018</tag>
          <tag role="authors">Fu et al.</tag>
          <tag role="fullauthors">Fu, Luo, and Levine</tag>
          <tag role="refnum">Fu et al. (2018)</tag>
          <tag role="key">fu2017learning</tag>
        </tags>
        <bibblock>
Justin Fu, Katie Luo,
and Sergey Levine. 2018.
</bibblock>
        <bibblock>Learning robust rewards with adversarial inverse
reinforcement learning.
</bibblock>
        <bibblock><emph font="italic">International Conference on Learning
Representations</emph> (2018).
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="ho2016generative" xml:id="bib.bib14">
        <tags>
          <tag role="number">14</tag>
          <tag role="year">2016</tag>
          <tag role="authors">Ho and Ermon</tag>
          <tag role="fullauthors">Ho and Ermon</tag>
          <tag role="refnum">Ho and Ermon (2016)</tag>
          <tag role="key">ho2016generative</tag>
        </tags>
        <bibblock>
Jonathan Ho and Stefano
Ermon. 2016.
</bibblock>
        <bibblock>Generative adversarial imitation learning. In
<emph font="italic">Advances in Neural Information Processing
Systems</emph>.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="hochreiter1997long" xml:id="bib.bib15">
        <tags>
          <tag role="number">15</tag>
          <tag role="year">1997</tag>
          <tag role="authors">Hochreiter and
Schmidhuber</tag>
          <tag role="fullauthors">Hochreiter and Schmidhuber</tag>
          <tag role="refnum">Hochreiter and
Schmidhuber (1997)</tag>
          <tag role="key">hochreiter1997long</tag>
        </tags>
        <bibblock>
Sepp Hochreiter and
Jürgen Schmidhuber. 1997.
</bibblock>
        <bibblock>Long short-term memory.
</bibblock>
        <bibblock><emph font="italic">Neural computation</emph> 9,
8 (1997), 1735–1780.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="holloway2004children" xml:id="bib.bib16">
        <tags>
          <tag role="number">16</tag>
          <tag role="year">2004</tag>
          <tag role="authors">Holloway and
Valentine</tag>
          <tag role="fullauthors">Holloway and Valentine</tag>
          <tag role="refnum">Holloway and
Valentine (2004)</tag>
          <tag role="key">holloway2004children</tag>
        </tags>
        <bibblock>
Sarah L Holloway and
Gill Valentine. 2004.
</bibblock>
        <bibblock><emph font="italic">Children’s geographies: Playing, living,
learning</emph>.
</bibblock>
        <bibblock>Routledge.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="houthooft2016variational" xml:id="bib.bib17">
        <tags>
          <tag role="number">17</tag>
          <tag role="year">2016</tag>
          <tag role="authors">Houthooft et al.</tag>
          <tag role="fullauthors">Houthooft, Chen, Duan, Schulman, De Turck, and
Abbeel</tag>
          <tag role="refnum">Houthooft et al. (2016)</tag>
          <tag role="key">houthooft2016variational</tag>
        </tags>
        <bibblock>
Rein Houthooft, Xi Chen,
Yan Duan, John Schulman,
Filip De Turck, and Pieter Abbeel.
2016.
</bibblock>
        <bibblock>Variational information maximizing exploration.
</bibblock>
        <bibblock><emph font="italic">Advances in Neural Information Processing
Systems (NIPS)</emph> (2016).
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="hunt1965intrinsic" xml:id="bib.bib18">
        <tags>
          <tag role="number">18</tag>
          <tag role="year">1965</tag>
          <tag role="authors">Hunt</tag>
          <tag role="fullauthors">Hunt</tag>
          <tag role="refnum">Hunt (1965)</tag>
          <tag role="key">hunt1965intrinsic</tag>
        </tags>
        <bibblock>
J. McV. Hunt.
1965.
</bibblock>
        <bibblock>Intrinsic motivation and its role in psychological
development. In <emph font="italic">Nebraska symposium on
motivation</emph>, Vol. 13. University of Nebraska Press,
189–282.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="jacq2019learning" xml:id="bib.bib19">
        <tags>
          <tag role="number">19</tag>
          <tag role="year">2019</tag>
          <tag role="authors">Jacq
et al.</tag>
          <tag role="fullauthors">Jacq, Geist, Paiva, and Pietquin</tag>
          <tag role="refnum">Jacq
et al. (2019)</tag>
          <tag role="key">jacq2019learning</tag>
        </tags>
        <bibblock>
Alexis Jacq, Matthieu
Geist, Ana Paiva, and Olivier
Pietquin. 2019.
</bibblock>
        <bibblock>Learning from a Learner. In
<emph font="italic">International Conference on Machine Learning</emph>.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="kidd2012goldilocks" xml:id="bib.bib20">
        <tags>
          <tag role="number">20</tag>
          <tag role="year">2012</tag>
          <tag role="authors">Kidd
et al.</tag>
          <tag role="fullauthors">Kidd, Piantadosi, and Aslin</tag>
          <tag role="refnum">Kidd
et al. (2012)</tag>
          <tag role="key">kidd2012goldilocks</tag>
        </tags>
        <bibblock>
Celeste Kidd, Steven T
Piantadosi, and Richard N Aslin.
2012.
</bibblock>
        <bibblock>The Goldilocks effect: Human infants allocate
attention to visual sequences that are neither too simple nor too complex.
</bibblock>
        <bibblock><emph font="italic">PloS one</emph> 7,
5 (2012), e36399.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="kingma2014adam" xml:id="bib.bib21">
        <tags>
          <tag role="number">21</tag>
          <tag role="year">2014</tag>
          <tag role="authors">Kingma and Ba</tag>
          <tag role="fullauthors">Kingma and Ba</tag>
          <tag role="refnum">Kingma and Ba (2014)</tag>
          <tag role="key">kingma2014adam</tag>
        </tags>
        <bibblock>
Diederik P Kingma and
Jimmy Ba. 2014.
</bibblock>
        <bibblock>Adam: A method for stochastic optimization.
</bibblock>
        <bibblock><emph font="italic">arXiv preprint arXiv:1412.6980</emph>
(2014).
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="klein2013cascaded" xml:id="bib.bib22">
        <tags>
          <tag role="number">22</tag>
          <tag role="year">2013</tag>
          <tag role="authors">Klein
et al.</tag>
          <tag role="fullauthors">Klein, Piot, Geist, and Pietquin</tag>
          <tag role="refnum">Klein
et al. (2013)</tag>
          <tag role="key">klein2013cascaded</tag>
        </tags>
        <bibblock>
Edouard Klein, Bilal
Piot, Matthieu Geist, and Olivier
Pietquin. 2013.
</bibblock>
        <bibblock>A cascaded supervised learning approach to inverse
reinforcement learning. In <emph font="italic">Joint European
conference on machine learning and knowledge discovery in databases</emph>.
Springer, 1–16.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="lang2000fear" xml:id="bib.bib23">
        <tags>
          <tag role="number">23</tag>
          <tag role="year">2000</tag>
          <tag role="authors">Lang
et al.</tag>
          <tag role="fullauthors">Lang, Davis, and Öhman</tag>
          <tag role="refnum">Lang
et al. (2000)</tag>
          <tag role="key">lang2000fear</tag>
        </tags>
        <bibblock>
Peter J Lang, Michael
Davis, and Arne Öhman.
2000.
</bibblock>
        <bibblock>Fear and anxiety: animal models and human cognitive
psychophysiology.
</bibblock>
        <bibblock><emph font="italic">Journal of affective disorders</emph>
61, 3 (2000),
137–159.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="machado2018revisiting" xml:id="bib.bib24">
        <tags>
          <tag role="number">24</tag>
          <tag role="year">2018</tag>
          <tag role="authors">Machado et al.</tag>
          <tag role="fullauthors">Machado, Bellemare, Talvitie, Veness,
Hausknecht, and Bowling</tag>
          <tag role="refnum">Machado et al. (2018)</tag>
          <tag role="key">machado2018revisiting</tag>
        </tags>
        <bibblock>
Marlos C Machado, Marc G
Bellemare, Erik Talvitie, Joel Veness,
Matthew Hausknecht, and Michael
Bowling. 2018.
</bibblock>
        <bibblock>Revisiting the arcade learning environment:
Evaluation protocols and open problems for general agents.
</bibblock>
        <bibblock><emph font="italic">Journal of Artificial Intelligence Research</emph>
61 (2018), 523–562.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="mnih2015human" xml:id="bib.bib25">
        <tags>
          <tag role="number">25</tag>
          <tag role="year">2015</tag>
          <tag role="authors">Mnih
et al.</tag>
          <tag role="fullauthors">Mnih, Kavukcuoglu, Silver, Rusu, Veness,
Bellemare, Graves, Riedmiller, Fidjeland, Ostrovski, et al.</tag>
          <tag role="refnum">Mnih
et al. (2015)</tag>
          <tag role="key">mnih2015human</tag>
        </tags>
        <bibblock>
Volodymyr Mnih, Koray
Kavukcuoglu, David Silver, Andrei A
Rusu, Joel Veness, Marc G Bellemare,
Alex Graves, Martin Riedmiller,
Andreas K Fidjeland, Georg Ostrovski,
et al. 2015.
</bibblock>
        <bibblock>Human-level control through deep reinforcement
learning.
</bibblock>
        <bibblock><emph font="italic">Nature</emph> (2015).
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="mohamed2015variational" xml:id="bib.bib26">
        <tags>
          <tag role="number">26</tag>
          <tag role="year">2015</tag>
          <tag role="authors">Mohamed and
Rezende</tag>
          <tag role="fullauthors">Mohamed and Rezende</tag>
          <tag role="refnum">Mohamed and
Rezende (2015)</tag>
          <tag role="key">mohamed2015variational</tag>
        </tags>
        <bibblock>
Shakir Mohamed and
Danilo Jimenez Rezende. 2015.
</bibblock>
        <bibblock>Variational information maximisation for
intrinsically motivated reinforcement learning. In
<emph font="italic">Advances in neural information processing
systems</emph>. 2125–2133.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="ng1999policy" xml:id="bib.bib27">
        <tags>
          <tag role="number">27</tag>
          <tag role="year">1999</tag>
          <tag role="authors">Ng
et al.</tag>
          <tag role="fullauthors">Ng, Harada, and Russell</tag>
          <tag role="refnum">Ng
et al. (1999)</tag>
          <tag role="key">ng1999policy</tag>
        </tags>
        <bibblock>
Andrew Y Ng, Daishi
Harada, and Stuart Russell.
1999.
</bibblock>
        <bibblock>Policy invariance under reward transformations:
Theory and application to reward shaping. In
<emph font="italic">ICML</emph>, Vol. 99.
278–287.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="ng2000algorithms" xml:id="bib.bib28">
        <tags>
          <tag role="number">28</tag>
          <tag role="year">2000</tag>
          <tag role="authors">Ng
et al.</tag>
          <tag role="fullauthors">Ng, Russell, et al.</tag>
          <tag role="refnum">Ng
et al. (2000)</tag>
          <tag role="key">ng2000algorithms</tag>
        </tags>
        <bibblock>
Andrew Y Ng, Stuart J
Russell, et al. 2000.
</bibblock>
        <bibblock>Algorithms for inverse reinforcement learning. In
<emph font="italic">International Conference on Machine Learning</emph>.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="ostrovski2017count" xml:id="bib.bib29">
        <tags>
          <tag role="number">29</tag>
          <tag role="year">2017</tag>
          <tag role="authors">Ostrovski et al.</tag>
          <tag role="fullauthors">Ostrovski, Bellemare, van den Oord, and
Munos</tag>
          <tag role="refnum">Ostrovski et al. (2017)</tag>
          <tag role="key">ostrovski2017count</tag>
        </tags>
        <bibblock>
Georg Ostrovski, Marc G
Bellemare, Aäron van den Oord, and
Rémi Munos. 2017.
</bibblock>
        <bibblock>Count-based exploration with neural density
models. In <emph font="italic">Proceedings of the 34th International
Conference on Machine Learning-Volume 70</emph>. JMLR.org,
2721–2730.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="oudeyer2009intrinsic" xml:id="bib.bib30">
        <tags>
          <tag role="number">30</tag>
          <tag role="year">2009</tag>
          <tag role="authors">Oudeyer and
Kaplan</tag>
          <tag role="fullauthors">Oudeyer and Kaplan</tag>
          <tag role="refnum">Oudeyer and
Kaplan (2009)</tag>
          <tag role="key">oudeyer2009intrinsic</tag>
        </tags>
        <bibblock>
Pierre-Yves Oudeyer and
Frederic Kaplan. 2009.
</bibblock>
        <bibblock>What is intrinsic motivation? A typology of
computational approaches.
</bibblock>
        <bibblock><emph font="italic">Frontiers in neurorobotics</emph>
1 (2009), 6.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="oudeyer2007intrinsic" xml:id="bib.bib31">
        <tags>
          <tag role="number">31</tag>
          <tag role="year">2007</tag>
          <tag role="authors">Oudeyer
et al.</tag>
          <tag role="fullauthors">Oudeyer, Kaplan, and Hafner</tag>
          <tag role="refnum">Oudeyer
et al. (2007)</tag>
          <tag role="key">oudeyer2007intrinsic</tag>
        </tags>
        <bibblock>
Pierre-Yves Oudeyer,
Frédéric Kaplan, and Verena V Hafner.
2007.
</bibblock>
        <bibblock>Intrinsic motivation systems for autonomous mental
development.
</bibblock>
        <bibblock><emph font="italic">IEEE transactions on evolutionary
computation</emph> 11, 2
(2007), 265–286.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="pathak2017curiosity" xml:id="bib.bib32">
        <tags>
          <tag role="number">32</tag>
          <tag role="year">2017</tag>
          <tag role="authors">Pathak
et al.</tag>
          <tag role="fullauthors">Pathak, Agrawal, Efros, and Darrell</tag>
          <tag role="refnum">Pathak
et al. (2017)</tag>
          <tag role="key">pathak2017curiosity</tag>
        </tags>
        <bibblock>
Deepak Pathak, Pulkit
Agrawal, Alexei A Efros, and Trevor
Darrell. 2017.
</bibblock>
        <bibblock>Curiosity-driven exploration by self-supervised
prediction. In <emph font="italic">Proceedings of the IEEE Conference
on Computer Vision and Pattern Recognition Workshops</emph>.
16–17.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="piot2016bridging" xml:id="bib.bib33">
        <tags>
          <tag role="number">33</tag>
          <tag role="year">2016</tag>
          <tag role="authors">Piot
et al.</tag>
          <tag role="fullauthors">Piot, Geist, and Pietquin</tag>
          <tag role="refnum">Piot
et al. (2016)</tag>
          <tag role="key">piot2016bridging</tag>
        </tags>
        <bibblock>
Bilal Piot, Matthieu
Geist, and Olivier Pietquin.
2016.
</bibblock>
        <bibblock>Bridging the gap between imitation learning and
inverse reinforcement learning.
</bibblock>
        <bibblock><emph font="italic">IEEE transactions on neural networks and
learning systems</emph> 28, 8
(2016), 1814–1826.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="pomerleau1991efficient" xml:id="bib.bib34">
        <tags>
          <tag role="number">34</tag>
          <tag role="year">1991</tag>
          <tag role="authors">Pomerleau</tag>
          <tag role="fullauthors">Pomerleau</tag>
          <tag role="refnum">Pomerleau (1991)</tag>
          <tag role="key">pomerleau1991efficient</tag>
        </tags>
        <bibblock>
Dean A Pomerleau.
1991.
</bibblock>
        <bibblock>Efficient training of artificial neural networks
for autonomous navigation.
</bibblock>
        <bibblock><emph font="italic">Neural computation</emph> (1991).
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="puterman2014markov" xml:id="bib.bib35">
        <tags>
          <tag role="number">35</tag>
          <tag role="year">2014</tag>
          <tag role="authors">Puterman</tag>
          <tag role="fullauthors">Puterman</tag>
          <tag role="refnum">Puterman (2014)</tag>
          <tag role="key">puterman2014markov</tag>
        </tags>
        <bibblock>
Martin L Puterman.
2014.
</bibblock>
        <bibblock><emph font="italic">Markov decision processes: discrete
stochastic dynamic programming</emph>.
</bibblock>
        <bibblock>John Wiley &amp; Sons.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="raileanu2020ride" xml:id="bib.bib36">
        <tags>
          <tag role="number">36</tag>
          <tag role="year">2020</tag>
          <tag role="authors">Raileanu and
Rocktäschel</tag>
          <tag role="fullauthors">Raileanu and Rocktäschel</tag>
          <tag role="refnum">Raileanu and
Rocktäschel (2020)</tag>
          <tag role="key">raileanu2020ride</tag>
        </tags>
        <bibblock>
Roberta Raileanu and Tim
Rocktäschel. 2020.
</bibblock>
        <bibblock>RIDE: Rewarding Impact-Driven Exploration for
Procedurally-Generated Environments.
</bibblock>
        <bibblock><emph font="italic">arXiv preprint arXiv:2002.12292</emph>
(2020).
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="ross2010efficient" xml:id="bib.bib37">
        <tags>
          <tag role="number">37</tag>
          <tag role="year">2010</tag>
          <tag role="authors">Ross and Bagnell</tag>
          <tag role="fullauthors">Ross and Bagnell</tag>
          <tag role="refnum">Ross and Bagnell (2010)</tag>
          <tag role="key">ross2010efficient</tag>
        </tags>
        <bibblock>
Stéphane Ross and
Drew Bagnell. 2010.
</bibblock>
        <bibblock>Efficient reductions for imitation learning. In
<emph font="italic">International Conference on Artificial Intelligence
and Statistics</emph>.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="russell1998learning" xml:id="bib.bib38">
        <tags>
          <tag role="number">38</tag>
          <tag role="year">1998</tag>
          <tag role="authors">Russell</tag>
          <tag role="fullauthors">Russell</tag>
          <tag role="refnum">Russell (1998)</tag>
          <tag role="key">russell1998learning</tag>
        </tags>
        <bibblock>
Stuart Russell.
1998.
</bibblock>
        <bibblock>Learning agents for uncertain environments. In
<emph font="italic">Conference on Computational learning theory</emph>.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="schmidhuber1991curious" xml:id="bib.bib39">
        <tags>
          <tag role="number">39</tag>
          <tag role="year">1991</tag>
          <tag role="authors">Schmidhuber</tag>
          <tag role="fullauthors">Schmidhuber</tag>
          <tag role="refnum">Schmidhuber (1991)</tag>
          <tag role="key">schmidhuber1991curious</tag>
        </tags>
        <bibblock>
Jürgen Schmidhuber.
1991.
</bibblock>
        <bibblock>Curious model-building control systems. In
<emph font="italic">Proc. international joint conference on neural
networks</emph>. 1458–1463.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="silver2016mastering" xml:id="bib.bib40">
        <tags>
          <tag role="number">40</tag>
          <tag role="year">2016</tag>
          <tag role="authors">Silver et al.</tag>
          <tag role="fullauthors">Silver, Huang, Maddison, Guez, Sifre, Van
Den Driessche, Schrittwieser, Antonoglou, Panneershelvam, Lanctot,
et al.</tag>
          <tag role="refnum">Silver et al. (2016)</tag>
          <tag role="key">silver2016mastering</tag>
        </tags>
        <bibblock>
David Silver, Aja Huang,
Chris J Maddison, Arthur Guez,
Laurent Sifre, George Van Den Driessche,
Julian Schrittwieser, Ioannis Antonoglou,
Veda Panneershelvam, Marc Lanctot,
et al. 2016.
</bibblock>
        <bibblock>Mastering the game of Go with deep neural networks
and tree search.
</bibblock>
        <bibblock><emph font="italic">Nature</emph> (2016).
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="csimcsek2006intrinsic" xml:id="bib.bib41">
        <tags>
          <tag role="number">41</tag>
          <tag role="year">2006</tag>
          <tag role="authors">Şimşek and Barto</tag>
          <tag role="fullauthors">Şimşek and
Barto</tag>
          <tag role="refnum">Şimşek and Barto (2006)</tag>
          <tag role="key">csimcsek2006intrinsic</tag>
        </tags>
        <bibblock>
Özgür Şimşek and
Andrew G Barto. 2006.
</bibblock>
        <bibblock>An intrinsic reward mechanism for efficient
exploration. In <emph font="italic">Proceedings of the 23rd
international conference on Machine learning</emph>. 833–840.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="strehl2009reinforcement" xml:id="bib.bib42">
        <tags>
          <tag role="number">42</tag>
          <tag role="year">2009</tag>
          <tag role="authors">Strehl
et al.</tag>
          <tag role="fullauthors">Strehl, Li, and Littman</tag>
          <tag role="refnum">Strehl
et al. (2009)</tag>
          <tag role="key">strehl2009reinforcement</tag>
        </tags>
        <bibblock>
Alexander L Strehl, Lihong
Li, and Michael L Littman.
2009.
</bibblock>
        <bibblock>Reinforcement learning in finite MDPs: PAC
analysis.
</bibblock>
        <bibblock><emph font="italic">Journal of Machine Learning Research</emph>
10, Nov (2009),
2413–2444.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="strehl2008analysis" xml:id="bib.bib43">
        <tags>
          <tag role="number">43</tag>
          <tag role="year">2008</tag>
          <tag role="authors">Strehl and
Littman</tag>
          <tag role="fullauthors">Strehl and Littman</tag>
          <tag role="refnum">Strehl and
Littman (2008)</tag>
          <tag role="key">strehl2008analysis</tag>
        </tags>
        <bibblock>
Alexander L Strehl and
Michael L Littman. 2008.
</bibblock>
        <bibblock>An analysis of model-based interval estimation for
Markov decision processes.
</bibblock>
        <bibblock><emph font="italic">J. Comput. System Sci.</emph>
74, 8 (2008),
1309–1331.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="sutton2018reinforcement" xml:id="bib.bib44">
        <tags>
          <tag role="number">44</tag>
          <tag role="year">2018</tag>
          <tag role="authors">Sutton and Barto</tag>
          <tag role="fullauthors">Sutton and Barto</tag>
          <tag role="refnum">Sutton and Barto (2018)</tag>
          <tag role="key">sutton2018reinforcement</tag>
        </tags>
        <bibblock>
Richard S Sutton and
Andrew G Barto. 2018.
</bibblock>
        <bibblock><emph font="italic">Reinforcement learning: An introduction</emph>.
</bibblock>
        <bibblock>MIT press.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="taiga2020bonus" xml:id="bib.bib45">
        <tags>
          <tag role="number">45</tag>
          <tag role="year">2020</tag>
          <tag role="authors">Taïga et al.</tag>
          <tag role="fullauthors">Taïga, Fedus, Machado, Courville, and
Bellemare</tag>
          <tag role="refnum">Taïga et al. (2020)</tag>
          <tag role="key">taiga2020bonus</tag>
        </tags>
        <bibblock>
Adrien Ali Taïga,
William Fedus, Marlos C Machado,
Aaron Courville, and Marc G Bellemare.
2020.
</bibblock>
        <bibblock>On bonus-based exploration methods in the arcade
learning environment.
</bibblock>
        <bibblock><emph font="italic">International Conference on Learning
Representations (ICLR)</emph> (2020).
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="tang2017exploration" xml:id="bib.bib46">
        <tags>
          <tag role="number">46</tag>
          <tag role="year">2017</tag>
          <tag role="authors">Tang et al.</tag>
          <tag role="fullauthors">Tang, Houthooft, Foote, Stooke, Chen, Duan,
Schulman, De Turck, and Abbeel</tag>
          <tag role="refnum">Tang et al. (2017)</tag>
          <tag role="key">tang2017exploration</tag>
        </tags>
        <bibblock>
Haoran Tang, Rein
Houthooft, Davis Foote, Adam Stooke,
Xi Chen, Yan Duan,
John Schulman, Filip De Turck, and
Pieter Abbeel. 2017.
</bibblock>
        <bibblock>#Exploration: A study of count-based exploration
for deep reinforcement learning. In <emph font="italic">Advances in
neural information processing systems</emph>. 2753–2762.
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="tesauro1995temporal" xml:id="bib.bib47">
        <tags>
          <tag role="number">47</tag>
          <tag role="year">1995</tag>
          <tag role="authors">Tesauro</tag>
          <tag role="fullauthors">Tesauro</tag>
          <tag role="refnum">Tesauro (1995)</tag>
          <tag role="key">tesauro1995temporal</tag>
        </tags>
        <bibblock>
Gerald Tesauro.
1995.
</bibblock>
        <bibblock>Temporal difference learning and TD-Gammon.
</bibblock>
        <bibblock><emph font="italic">Commun. ACM</emph> (1995).
</bibblock>
        <bibblock/>
      </bibitem>
      <bibitem key="ziebart2008maximum" xml:id="bib.bib48">
        <tags>
          <tag role="number">48</tag>
          <tag role="year">2008</tag>
          <tag role="authors">Ziebart
et al.</tag>
          <tag role="fullauthors">Ziebart, Maas, Bagnell, and Dey</tag>
          <tag role="refnum">Ziebart
et al. (2008)</tag>
          <tag role="key">ziebart2008maximum</tag>
        </tags>
        <bibblock>
Brian D Ziebart, Andrew L
Maas, J Andrew Bagnell, and Anind K
Dey. 2008.
</bibblock>
        <bibblock>Maximum entropy inverse reinforcement learning.
In <emph font="italic">AAAI Conference on Artificial Intelligence</emph>.
</bibblock>
        <bibblock/>
      </bibitem>
    </biblist>
  </bibliography>
</document>
