<?xml version="1.0" encoding="UTF-8"?>
<?latexml searchpaths="/home/japhy/scienceReplication.artiswrong.com/paper_files/arxiv/2301.10067/latex_extracted"?>
<?latexml class="article"?>
<?latexml package="neurips_2022" options="preprint, nonatbib"?>
<?latexml package="inputenc" options="utf8"?>
<?latexml package="fontenc" options="T1"?>
<?latexml package="hyperref"?>
<?latexml package="url"?>
<?latexml package="booktabs"?>
<?latexml package="amsfonts"?>
<?latexml package="nicefrac"?>
<?latexml package="microtype"?>
<?latexml package="xcolor"?>
<?latexml package="amsmath"?>
<?latexml package="amssymb"?>
<?latexml package="array"?>
<?latexml package="graphicx" options="pdftex"?>
<?latexml package="tikz"?>
<?latexml RelaxNGSchema="LaTeXML"?>
<document xmlns="http://dlmf.nist.gov/LaTeXML" class="ltx_authors_1line">
  <resource src="LaTeXML.css" type="text/css"/>
  <resource src="ltx-article.css" type="text/css"/>
  <title>Intrinsic Motivation in Model-based Reinforcement Learning: A Brief Review</title>
  <creator role="author">
    <personname>Artem Latyshev <break/>Moscow Institute of Physics and Technology<break/>Moscow, Russia <break/><text font="typewriter">latyshev.ak@phystech.edu</text> <break/>&amp;Aleksandr I. Panov <break/>Moscow Institute of Physics and Technology <break/>Federal Research Center “Computer Science and Control” of Russian Academy of Sciences <break/>AIRI <break/>Moscow, Russia <break/><text font="typewriter">panov@airi.net</text> <break/></personname>
  </creator>
  <abstract name="Abstract">
    <p>The reinforcement learning research area contains a wide range of methods for solving the problems of intelligent agent control. Despite the progress that has been made, the task of creating a highly autonomous agent is still a significant challenge. One potential solution to this problem is intrinsic motivation, a concept derived from developmental psychology. This review considers the existing methods for determining intrinsic motivation based on the world model obtained by the agent. We propose a systematic approach to current research in this field, which consists of three categories of methods, distinguished by the way they utilize a world model in the agent’s components: complementary intrinsic reward, exploration policy, and intrinsically motivated goals. The proposed unified framework describes the architecture of agents using a world model and intrinsic motivation to improve learning. The potential for developing new techniques in this area of research is also examined.</p>
  </abstract>
  <section inlist="toc" xml:id="S1">
    <tags>
      <tag>1</tag>
      <tag role="autoref">section 1</tag>
      <tag role="refnum">1</tag>
      <tag role="typerefnum">§1</tag>
    </tags>
    <title><tag close=" ">1</tag>Introduction</title>
    <para xml:id="S1.p1">
      <p>Reinforcement learning (RL) methods demonstrate impressive results on many problems of generating behavior for artificial agents. They achieve human-level performance in games <cite class="ltx_citemacro_cite"><bibref bibrefs="mnih_human-level_2015" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, in classic tasks with a large search space <cite class="ltx_citemacro_cite"><bibref bibrefs="silver_mastering_2017" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, and in the control of robotic systems <cite class="ltx_citemacro_cite"><bibref bibrefs="schulman_proximal_2017" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>. Although such algorithms already outperform experts on specific tasks, the way they learn from a scalar reward signal provided by the environment is far from how a person learns. It is well established that human beings not only pay attention to the outcome of a task, but also rely on their understanding of cause-and-effect relationships in the surrounding environment and on their previously acquired skills. With these universal capabilities and knowledge, a person can quickly learn to solve complex problems and reuse the accumulated knowledge.</p>
    </para>
    <para xml:id="S1.p2">
      <p>On the one hand, many RL methods learn a world model <cite class="ltx_citemacro_cite"><bibref bibrefs="moerland_model-based_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, which allows the agent to store and generalize knowledge about the dynamical properties of the environment. On the other hand, even with a model, the agent forms a policy based on a task-specific reward signal, so that no learning occurs in its absence. One solution is suggested by analogy with human learning and the psychology of motivation: <emph font="italic">human behavior determined by intrinsic motivation <cite class="ltx_citemacro_cite"><bibref bibrefs="ryan_intrinsic_2000" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite> leads to effective learning in the absence of extrinsic drives (such as reward signals for an RL agent)</emph>. The adoption of intrinsic motivation has led to a new approach in the development of artificial agents <cite class="ltx_citemacro_cite"><bibref bibrefs="oudeyer_what_2007,baldassarre_intrinsically_2013,aubret_survey_2019,aubret_information-theoretic_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, where the key element is the intrinsic reward signal, which serves as a substitute for rewards in RL.
</p>
    </para>
    <para xml:id="S1.p3">
      <p>The intrinsic motivation approach provides a way to gather information about the environment through the agent’s own actions, and thus enables task-agnostic learning of the world model. In turn, access to a model makes it possible to define a wide variety of intrinsic motivation signals. These signals are used to train the exploration policy, to form internal goals, and to determine their priority. The world model and intrinsic motivation thus complement each other.</p>
    </para>
    <para xml:id="S1.p4">
      <p>The objectives of this article are to review existing model-based RL algorithms that use intrinsic motivation methods, to propose a unified framework that systematizes the accumulated knowledge in this field, and to consider directions for further development.</p>
    </para>
    <para xml:id="S1.p5">
      <p>The structure of the review is as follows. Sections <ref labelref="LABEL:sec:background"/> and <ref labelref="LABEL:sec:agent_learning"/> describe the formal statement of the problem. In Section <ref labelref="LABEL:sec:interaction_model_im"/>, we present our primary contribution by examining model-based methods for determining intrinsic motivation. We group these methods into three research directions: complementary intrinsic reward (Section <ref labelref="LABEL:subsec:additional_reward"/>), exploration policy (Section <ref labelref="LABEL:subsec:expl_policy"/>), and intrinsically motivated goals (Section <ref labelref="LABEL:subsec:intr_goals"/>). For each direction, specific examples are analyzed and the impact on the learning process is discussed. In Section <ref labelref="LABEL:sec:problems"/>, we consider the open problems in improving current architectures of intrinsically motivated agents and directions for future development.</p>
    </para>
  </section>
  <section inlist="toc" labels="LABEL:sec:background" xml:id="S2">
    <tags>
      <tag>2</tag>
      <tag role="autoref">section 2</tag>
      <tag role="refnum">2</tag>
      <tag role="typerefnum">§2</tag>
    </tags>
    <title><tag close=" ">2</tag>Background</title>
    <para xml:id="S2.p1">
      <p>A Markov decision process (MDP) is formally defined by the tuple <Math mode="inline" tex="\langle S,A,T,\rho,\gamma\rangle" text="list@(S, A, T, rho, gamma)" xml:id="S2.p1.m1">
          <XMath>
            <XMDual>
              <XMApp>
                <XMTok meaning="list"/>
                <XMRef idref="S2.p1.m1.1"/>
                <XMRef idref="S2.p1.m1.2"/>
                <XMRef idref="S2.p1.m1.3"/>
                <XMRef idref="S2.p1.m1.4"/>
                <XMRef idref="S2.p1.m1.5"/>
              </XMApp>
              <XMWrap>
                <XMTok name="langle" role="OPEN" stretchy="false">⟨</XMTok>
                <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m1.1">S</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m1.2">A</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m1.3">T</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok font="italic" name="rho" role="UNKNOWN" xml:id="S2.p1.m1.4">ρ</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok font="italic" name="gamma" role="UNKNOWN" xml:id="S2.p1.m1.5">γ</XMTok>
                <XMTok name="rangle" role="CLOSE" stretchy="false">⟩</XMTok>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math>. For each state in the environment <Math mode="inline" tex="s\in S" text="s element-of S" xml:id="S2.p1.m2">
          <XMath>
            <XMApp>
              <XMTok meaning="element-of" name="in" role="RELOP">∈</XMTok>
              <XMTok font="italic" role="UNKNOWN">s</XMTok>
              <XMTok font="italic" role="UNKNOWN">S</XMTok>
            </XMApp>
          </XMath>
        </Math> and possible action <Math mode="inline" tex="a\in A" text="a element-of A" xml:id="S2.p1.m3">
          <XMath>
            <XMApp>
              <XMTok meaning="element-of" name="in" role="RELOP">∈</XMTok>
              <XMTok font="italic" role="UNKNOWN">a</XMTok>
              <XMTok font="italic" role="UNKNOWN">A</XMTok>
            </XMApp>
          </XMath>
        </Math> the transition function <Math mode="inline" tex="T" text="T" xml:id="S2.p1.m4">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">T</XMTok>
          </XMath>
        </Math> and the reward function <Math mode="inline" tex="R" text="R" xml:id="S2.p1.m5">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">R</XMTok>
          </XMath>
        </Math> are given. The transition function <Math mode="inline" tex="T:S\times A\rightarrow p(S)" text="T colon S * A rightarrow p * S" xml:id="S2.p1.m6">
          <XMath>
            <XMApp>
              <XMTok name="colon" role="METARELOP">:</XMTok>
              <XMTok font="italic" role="UNKNOWN">T</XMTok>
              <XMApp>
                <XMTok name="rightarrow" role="ARROW">→</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">×</XMTok>
                  <XMTok font="italic" role="UNKNOWN">S</XMTok>
                  <XMTok font="italic" role="UNKNOWN">A</XMTok>
                </XMApp>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">p</XMTok>
                  <XMDual>
                    <XMRef idref="S2.p1.m6.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m6.1">S</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> determines the conditional probability <Math mode="inline" tex="p(s_{t+1}|a_{t},s_{t})" text="p * conditional@(s _ (t + 1), list@(a _ t, s _ t))" xml:id="S2.p1.m7">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" role="UNKNOWN">p</XMTok>
              <XMDual>
                <XMRef idref="S2.p1.m7.1"/>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp xml:id="S2.p1.m7.1">
                    <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">s</XMTok>
                      <XMApp>
                        <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                        <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="list"/>
                        <XMRef idref="S2.p1.m7.1.1"/>
                        <XMRef idref="S2.p1.m7.1.2"/>
                      </XMApp>
                      <XMWrap>
                        <XMApp xml:id="S2.p1.m7.1.1">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="italic" role="UNKNOWN">a</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                        </XMApp>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMApp xml:id="S2.p1.m7.1.2">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="italic" role="UNKNOWN">s</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                        </XMApp>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math> of reaching the next state <Math mode="inline" tex="s_{t+1}\in S" text="s _ (t + 1) element-of S" xml:id="S2.p1.m8">
          <XMath>
            <XMApp>
              <XMTok meaning="element-of" name="in" role="RELOP">∈</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">s</XMTok>
                <XMApp>
                  <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                </XMApp>
              </XMApp>
              <XMTok font="italic" role="UNKNOWN">S</XMTok>
            </XMApp>
          </XMath>
        </Math> from the current one <Math mode="inline" tex="s_{t}\in S" text="s _ t element-of S" xml:id="S2.p1.m9">
          <XMath>
            <XMApp>
              <XMTok meaning="element-of" name="in" role="RELOP">∈</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">s</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
              </XMApp>
              <XMTok font="italic" role="UNKNOWN">S</XMTok>
            </XMApp>
          </XMath>
        </Math> after executing action <Math mode="inline" tex="a_{t}\in A" text="a _ t element-of A" xml:id="S2.p1.m10">
          <XMath>
            <XMApp>
              <XMTok meaning="element-of" name="in" role="RELOP">∈</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">a</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
              </XMApp>
              <XMTok font="italic" role="UNKNOWN">A</XMTok>
            </XMApp>
          </XMath>
        </Math> in the environment. The reward function <Math mode="inline" tex="R:S\times A\times S\rightarrow\mathbb{R}" text="R colon S * A * S rightarrow R" xml:id="S2.p1.m11">
          <XMath>
            <XMApp>
              <XMTok name="colon" role="METARELOP">:</XMTok>
              <XMTok font="italic" role="UNKNOWN">R</XMTok>
              <XMApp>
                <XMTok name="rightarrow" role="ARROW">→</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">×</XMTok>
                  <XMTok font="italic" role="UNKNOWN">S</XMTok>
                  <XMTok font="italic" role="UNKNOWN">A</XMTok>
                  <XMTok font="italic" role="UNKNOWN">S</XMTok>
                </XMApp>
                <XMTok font="blackboard" role="UNKNOWN">R</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> assigns a scalar value (the reward signal) to each transition triple <Math mode="inline" tex="\tau=(s_{t},a_{t},s_{t+1})" text="tau = vector@(s _ t, a _ t, s _ (t + 1))" xml:id="S2.p1.m12">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
              <XMDual>
                <XMApp>
                  <XMTok meaning="vector"/>
                  <XMRef idref="S2.p1.m12.1"/>
                  <XMRef idref="S2.p1.m12.2"/>
                  <XMRef idref="S2.p1.m12.3"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp xml:id="S2.p1.m12.1">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S2.p1.m12.2">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">a</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S2.p1.m12.3">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math>. The initial state of the agent in the environment <Math mode="inline" tex="s_{0}\in S" text="s _ 0 element-of S" xml:id="S2.p1.m13">
          <XMath>
            <XMApp>
              <XMTok meaning="element-of" name="in" role="RELOP">∈</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">s</XMTok>
                <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
              </XMApp>
              <XMTok font="italic" role="UNKNOWN">S</XMTok>
            </XMApp>
          </XMath>
        </Math> is drawn from the corresponding distribution <Math mode="inline" tex="\rho:S\rightarrow p(S)" text="rho colon S rightarrow p * S" xml:id="S2.p1.m14">
          <XMath>
            <XMApp>
              <XMTok name="colon" role="METARELOP">:</XMTok>
              <XMTok font="italic" name="rho" role="UNKNOWN">ρ</XMTok>
              <XMApp>
                <XMTok name="rightarrow" role="ARROW">→</XMTok>
                <XMTok font="italic" role="UNKNOWN">S</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">p</XMTok>
                  <XMDual>
                    <XMRef idref="S2.p1.m14.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="italic" role="UNKNOWN" xml:id="S2.p1.m14.1">S</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>. The factor <Math mode="inline" tex="\gamma\in[0,1]" text="gamma element-of closed-interval@(0, 1)" xml:id="S2.p1.m15">
          <XMath>
            <XMApp>
              <XMTok meaning="element-of" name="in" role="RELOP">∈</XMTok>
              <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
              <XMDual>
                <XMApp>
                  <XMTok meaning="closed-interval"/>
                  <XMRef idref="S2.p1.m15.1"/>
                  <XMRef idref="S2.p1.m15.2"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">[</XMTok>
                  <XMTok meaning="0" role="NUMBER" xml:id="S2.p1.m15.1">0</XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok meaning="1" role="NUMBER" xml:id="S2.p1.m15.2">1</XMTok>
                  <XMTok role="CLOSE" stretchy="false">]</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math> is a discounting parameter.</p>
    </para>
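    <para>
      <p>The components of this definition can be made concrete with a small tabular example. The following Python sketch is only an illustration under stated assumptions: the two states, the two actions, all probabilities, the reward rule, and the helper names <text font="typewriter">sample</text> and <text font="typewriter">step</text> are hypothetical and do not come from any of the reviewed methods.</p>
      <verbatim font="typewriter">
```python
import random

# Hypothetical two-state, two-action MDP; every number here is illustrative.
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# Transition function T: T[(s, a)] maps each next state to p(s_next | a, s).
T = {
    ("s0", "stay"): {"s0": 0.9, "s1": 0.1},
    ("s0", "move"): {"s0": 0.2, "s1": 0.8},
    ("s1", "stay"): {"s0": 0.0, "s1": 1.0},
    ("s1", "move"): {"s0": 0.7, "s1": 0.3},
}

RHO = {"s0": 1.0, "s1": 0.0}  # initial-state distribution rho
GAMMA = 0.95                  # discounting parameter gamma


def reward(s, a, s_next):
    """Assumed reward R(s_t, a_t, s_{t+1}): 1 for moving from s0 to s1, else 0."""
    return 1.0 if (s == "s0" and s_next == "s1") else 0.0


def sample(dist, rng):
    """Draw one outcome from a dict that maps outcomes to probabilities."""
    outcomes = list(dist)
    weights = [dist[o] for o in outcomes]
    return rng.choices(outcomes, weights=weights, k=1)[0]


def step(s, a, rng):
    """Sample s_{t+1} from T and return the transition triple with its reward."""
    s_next = sample(T[(s, a)], rng)
    return (s, a, s_next), reward(s, a, s_next)
```
</verbatim>
      <p>Under these assumptions, calling <text font="typewriter">sample(RHO, rng)</text> with <text font="typewriter">rng = random.Random(0)</text> draws an initial state, and <text font="typewriter">step(s, a, rng)</text> returns one transition triple together with its scalar reward.</p>
    </para>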
    <para xml:id="S2.p2">
      <p>The policy <Math mode="inline" tex="\pi" text="pi" xml:id="S2.p2.m1">
          <XMath>
            <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
          </XMath>
        </Math> governs the agent’s behavior: it is a function <Math mode="inline" tex="\pi:S\rightarrow p(A)" text="pi colon S rightarrow p * A" xml:id="S2.p2.m2">
          <XMath>
            <XMApp>
              <XMTok name="colon" role="METARELOP">:</XMTok>
              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
              <XMApp>
                <XMTok name="rightarrow" role="ARROW">→</XMTok>
                <XMTok font="italic" role="UNKNOWN">S</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">p</XMTok>
                  <XMDual>
                    <XMRef idref="S2.p2.m2.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="italic" role="UNKNOWN" xml:id="S2.p2.m2.1">A</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> that defines the probability <Math mode="inline" tex="p(a_{t}|s_{t})" text="p * conditional@(a _ t, s _ t)" xml:id="S2.p2.m3">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" role="UNKNOWN">p</XMTok>
              <XMDual>
                <XMRef idref="S2.p2.m3.1"/>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp xml:id="S2.p2.m3.1">
                    <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">a</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    </XMApp>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">s</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math> of executing the action in the current state <Math mode="inline" tex="s_{t}" text="s _ t" xml:id="S2.p2.m4">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">s</XMTok>
              <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
            </XMApp>
          </XMath>
        </Math>.</p>
    </para>
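    <para>
      <p>A stochastic policy of this form can be sketched as a lookup table from states to action distributions. The Python fragment below is a minimal illustration only: the table <text font="typewriter">PI</text>, its state and action names, its probabilities, and the helper functions are assumptions introduced here for the example.</p>
      <verbatim font="typewriter">
```python
import random

# Hypothetical tabular policy: PI[s] maps each action to p(a | s).
PI = {
    "s0": {"stay": 0.25, "move": 0.75},
    "s1": {"stay": 1.0, "move": 0.0},
}


def act(state, rng):
    """Sample an action a_t from the distribution PI[state]."""
    actions = list(PI[state])
    weights = [PI[state][a] for a in actions]
    return rng.choices(actions, weights=weights, k=1)[0]


def empirical_action_freq(state, n, seed=0):
    """Estimate p(a | state) from n sampled actions."""
    rng = random.Random(seed)
    counts = {a: 0 for a in PI[state]}
    for _ in range(n):
        counts[act(state, rng)] += 1
    return {a: c / n for a, c in counts.items()}
```
</verbatim>
      <p>With enough samples, the frequencies returned by <text font="typewriter">empirical_action_freq</text> should approach the probabilities stored in <text font="typewriter">PI</text>, which is what defining the policy as a conditional distribution over actions means in practice.</p>
    </para>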
    <para xml:id="S2.p3">
      <p>Reinforcement learning methods solve the optimization problem of maximizing the expected return:</p>
    </para>
    <para xml:id="S2.p4">
      <equation labels="LABEL:eq:mdp_opt" xml:id="S2.E1">
        <tags>
          <tag>(1)</tag>
          <tag role="autoref">Equation 1</tag>
          <tag role="refnum">1</tag>
        </tags>
        <Math content-tex="\pi^{*}=\argmax_{\pi}\mathbb{E}_{\begin{subarray}{c}s_{0}\sim\rho(s_{0}),a_{t}%&#10;\sim\pi(a_{t}|s_{t}),\\&#10;s_{t+1}\sim p(s_{t+1}|s_{t},a_{t})\end{subarray}}\left[\sum_{t=0}^{\infty}%&#10;\gamma^{t}R(s_{t},a_{t},s_{t+1})\right]." mode="display" tex="\pi^{*}=\operatorname*{argmax}_{\pi}\mathbb{E}_{\begin{subarray}{c}s_{0}\sim%&#10;\rho(s_{0}),a_{t}\sim\pi(a_{t}|s_{t}),\\&#10;s_{t+1}\sim p(s_{t+1}|s_{t},a_{t})\end{subarray}}\left[\sum_{t=0}^{\infty}%&#10;\gamma^{t}R(s_{t},a_{t},s_{t+1})\right]." text="pi ^ * = (argmax _ pi)@(E _ list[[[s0∼ρ(s0),at∼π(at|st),]], [s _ (t + 1) similar-to p * conditional@(s _ (t + 1), list@(s _ t, a _ t))]]) * delimited-[]@(((sum _ (t = 0)) ^ infinity)@(gamma ^ t * R * vector@(s _ t, a _ t, s _ (t + 1))))" xml:id="S2.E1.m1">
          <XMath>
            <XMDual>
              <XMRef idref="S2.E1.m1.2"/>
              <XMWrap>
                <XMApp xml:id="S2.E1.m1.2">
                  <XMTok meaning="equals" role="RELOP">=</XMTok>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                    <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMApp>
                      <XMApp scriptpos="mid">
                        <XMTok role="SUBSCRIPTOP" scriptpos="mid1"/>
                        <XMDual role="OPERATOR">
                          <XMTok name="argmax" role="OPERATOR" scriptpos="mid"/>
                          <XMTok role="OPERATOR" scriptpos="mid">argmax</XMTok>
                        </XMDual>
                        <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
                      </XMApp>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="blackboard" role="UNKNOWN">E</XMTok>
                        <XMArray class="ltx_align_c" name="list" rowsep="0.0pt">
                          <XMRow>
                            <XMCell>
                              <XMArg>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post3"/>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                                  <XMTok fontsize="50%" meaning="0" role="NUMBER">0</XMTok>
                                </XMApp>
                                <XMTok fontsize="70%" meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                                <XMTok font="italic" fontsize="70%" name="rho" role="UNKNOWN">ρ</XMTok>
                                <XMWrap>
                                  <XMTok fontsize="70%" role="OPEN" stretchy="false">(</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post3"/>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                                    <XMTok fontsize="50%" meaning="0" role="NUMBER">0</XMTok>
                                  </XMApp>
                                  <XMTok fontsize="70%" role="CLOSE" stretchy="false">)</XMTok>
                                </XMWrap>
                                <XMTok fontsize="70%" role="PUNCT">,</XMTok>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post3"/>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                                  <XMTok font="italic" fontsize="50%" role="UNKNOWN">t</XMTok>
                                </XMApp>
                                <XMTok fontsize="70%" meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                                <XMTok font="italic" fontsize="70%" name="pi" role="UNKNOWN">π</XMTok>
                                <XMWrap>
                                  <XMTok fontsize="70%" role="OPEN" stretchy="false">(</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post3"/>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                                    <XMTok font="italic" fontsize="50%" role="UNKNOWN">t</XMTok>
                                  </XMApp>
                                  <XMTok fontsize="70%" role="VERTBAR" stretchy="false">|</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post3"/>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                                    <XMTok font="italic" fontsize="50%" role="UNKNOWN">t</XMTok>
                                  </XMApp>
                                  <XMTok fontsize="70%" role="CLOSE" stretchy="false">)</XMTok>
                                </XMWrap>
                                <XMTok fontsize="70%" role="PUNCT">,</XMTok>
                              </XMArg>
                            </XMCell>
                          </XMRow>
                          <XMRow>
                            <XMCell>
                              <XMApp>
                                <XMTok fontsize="70%" meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post3"/>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                                  <XMApp>
                                    <XMTok fontsize="50%" meaning="plus" role="ADDOP">+</XMTok>
                                    <XMTok font="italic" fontsize="50%" role="UNKNOWN">t</XMTok>
                                    <XMTok fontsize="50%" meaning="1" role="NUMBER">1</XMTok>
                                  </XMApp>
                                </XMApp>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">p</XMTok>
                                  <XMDual>
                                    <XMRef idref="S2.E1.m1.1"/>
                                    <XMWrap>
                                      <XMTok fontsize="70%" role="OPEN" stretchy="false">(</XMTok>
                                      <XMApp xml:id="S2.E1.m1.1">
                                        <XMTok fontsize="70%" meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                                        <XMApp>
                                          <XMTok role="SUBSCRIPTOP" scriptpos="post3"/>
                                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                                          <XMApp>
                                            <XMTok fontsize="50%" meaning="plus" role="ADDOP">+</XMTok>
                                            <XMTok font="italic" fontsize="50%" role="UNKNOWN">t</XMTok>
                                            <XMTok fontsize="50%" meaning="1" role="NUMBER">1</XMTok>
                                          </XMApp>
                                        </XMApp>
                                        <XMDual>
                                          <XMApp>
                                            <XMTok meaning="list"/>
                                            <XMRef idref="S2.E1.m1.1.1"/>
                                            <XMRef idref="S2.E1.m1.1.2"/>
                                          </XMApp>
                                          <XMWrap>
                                            <XMApp xml:id="S2.E1.m1.1.1">
                                              <XMTok role="SUBSCRIPTOP" scriptpos="post3"/>
                                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                                              <XMTok font="italic" fontsize="50%" role="UNKNOWN">t</XMTok>
                                            </XMApp>
                                            <XMTok fontsize="70%" role="PUNCT">,</XMTok>
                                            <XMApp xml:id="S2.E1.m1.1.2">
                                              <XMTok role="SUBSCRIPTOP" scriptpos="post3"/>
                                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                                              <XMTok font="italic" fontsize="50%" role="UNKNOWN">t</XMTok>
                                            </XMApp>
                                          </XMWrap>
                                        </XMDual>
                                      </XMApp>
                                      <XMTok fontsize="70%" role="CLOSE" stretchy="false">)</XMTok>
                                    </XMWrap>
                                  </XMDual>
                                </XMApp>
                              </XMApp>
                            </XMCell>
                          </XMRow>
                        </XMArray>
                      </XMApp>
                    </XMApp>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="delimited-[]"/>
                        <XMRef idref="S2.E1.m1.2.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="true">[</XMTok>
                        <XMApp xml:id="S2.E1.m1.2.1">
                          <XMApp scriptpos="mid">
                            <XMTok role="SUPERSCRIPTOP" scriptpos="mid2"/>
                            <XMApp scriptpos="mid">
                              <XMTok role="SUBSCRIPTOP" scriptpos="mid2"/>
                              <XMTok mathstyle="display" meaning="sum" role="SUMOP" scriptpos="mid">∑</XMTok>
                              <XMApp>
                                <XMTok fontsize="70%" meaning="equals" role="RELOP">=</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                              </XMApp>
                            </XMApp>
                            <XMTok fontsize="70%" meaning="infinity" name="infty" role="ID">∞</XMTok>
                          </XMApp>
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMApp>
                              <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                              <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                            </XMApp>
                            <XMTok font="italic" role="UNKNOWN">R</XMTok>
                            <XMDual>
                              <XMApp>
                                <XMTok meaning="vector"/>
                                <XMRef idref="S2.E1.m1.2.1.1"/>
                                <XMRef idref="S2.E1.m1.2.1.2"/>
                                <XMRef idref="S2.E1.m1.2.1.3"/>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">(</XMTok>
                                <XMApp xml:id="S2.E1.m1.2.1.1">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                                  <XMTok font="italic" role="UNKNOWN">s</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                </XMApp>
                                <XMTok role="PUNCT">,</XMTok>
                                <XMApp xml:id="S2.E1.m1.2.1.2">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                                  <XMTok font="italic" role="UNKNOWN">a</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                </XMApp>
                                <XMTok role="PUNCT">,</XMTok>
                                <XMApp xml:id="S2.E1.m1.2.1.3">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                                  <XMTok font="italic" role="UNKNOWN">s</XMTok>
                                  <XMApp>
                                    <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                                  </XMApp>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                              </XMWrap>
                            </XMDual>
                          </XMApp>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="true">]</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMApp>
                <XMTok role="PERIOD">.</XMTok>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math>
      </equation>
    </para>
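    <para>
      <p>As a rough illustration of the sum inside the expectation above, the following Python sketch computes the discounted return of a single sampled trajectory. The function name <text font="typewriter">discounted_return</text>, the truncation to a finite horizon, and the example reward values are assumptions made for this illustration only.</p>
      <verbatim font="typewriter">
```python
def discounted_return(rewards, gamma):
    """Sum of gamma**t * r_t over a finite list of rewards (truncated horizon)."""
    total = 0.0
    # Iterate backwards so that each step needs one multiply and one add.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Hypothetical rewards R(s_t, a_t, s_{t+1}) collected along one trajectory.
rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1.0 + 0.5*0.0 + 0.25*2.0 = 1.5
```
</verbatim>
      <p>Averaging this quantity over many trajectories sampled under a fixed policy would give a Monte Carlo estimate of the objective that the optimal policy maximizes.</p>
    </para>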
    <subsection inlist="toc" xml:id="S2.SS1">
      <tags>
        <tag>2.1</tag>
        <tag role="autoref">subsection 2.1</tag>
        <tag role="refnum">2.1</tag>
        <tag role="typerefnum">§2.1</tag>
      </tags>
      <title><tag close=" ">2.1</tag>Model-based RL</title>
      <para xml:id="S2.SS1.p1">
        <p>There are two main approaches to the problem of sequential decision-making: reinforcement learning and planning <cite class="ltx_citemacro_cite"><bibref bibrefs="moerland_model-based_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>. Model-based RL (MBRL) combines the two. <emph font="italic">MBRL covers the class of MDP algorithms that use a world model (the planning part) and store a global solution (the RL part) <cite class="ltx_citemacro_cite"><bibref bibrefs="moerland_model-based_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite></emph>. Here the global solution is the optimal policy <Math mode="inline" tex="\pi^{*}" text="pi ^ *" xml:id="S2.SS1.p1.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
              </XMApp>
            </XMath>
          </Math> (eq. <ref labelref="LABEL:eq:mdp_opt"/>). The world model <Math mode="inline" tex="\mathcal{M}" text="M" xml:id="S2.SS1.p1.m2">
            <XMath>
              <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
            </XMath>
          </Math> is a set of functions that describe the dynamics of the environment <Math mode="inline" tex="\mathcal{E}" text="E" xml:id="S2.SS1.p1.m3">
            <XMath>
              <XMTok font="caligraphic" role="UNKNOWN">E</XMTok>
            </XMath>
          </Math>.</p>
      </para>
      <para xml:id="S2.SS1.p2">
        <p>The model can either be learned or be known in advance. In the first case, it is learned from experience collected while interacting with the environment. In the second case, no training is required. The second option is rare in real problems, since the environment dynamics are unknown in most cases. In the following discussion, the world model is assumed to be learned unless otherwise stated.</p>
      </para>
      <para xml:id="S2.SS1.p3">
        <p>The formal description of the environment dynamics varies widely. In every variant, however, performing actions in the environment causes a transition from one state to another. The possible components of a model, whose choice determines the specific variant, are shown in Fig. <ref labelref="LABEL:fig:fullmodel"/>.</p>
      </para>
      <figure inlist="lof" labels="LABEL:fig:fullmodel" placement="h" xml:id="S2.F1">
        <tags>
          <tag>Figure 1</tag>
          <tag role="autoref">Figure 1</tag>
          <tag role="refnum">1</tag>
          <tag role="typerefnum">Figure 1</tag>
        </tags>
        <graphics candidates="full_model_eng.png" class="ltx_centering" graphic="full_model_eng.png" options="width=433.62pt" xml:id="S2.F1.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">1</tag>The structure of the model. <Math mode="inline" tex="S^{n}" text="S ^ n" xml:id="S2.F1.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">S</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
              </XMApp>
            </XMath>
          </Math> — the history of <Math mode="inline" tex="n" text="n" xml:id="S2.F1.m2">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">n</XMTok>
            </XMath>
          </Math> states. <Math mode="inline" tex="\mathcal{Z}" text="Z" xml:id="S2.F1.m3">
            <XMath>
              <XMTok font="caligraphic" role="UNKNOWN">Z</XMTok>
            </XMath>
          </Math> — the latent space; a special case is when <Math mode="inline" tex="\mathcal{Z}" text="Z" xml:id="S2.F1.m4">
            <XMath>
              <XMTok font="caligraphic" role="UNKNOWN">Z</XMTok>
            </XMath>
          </Math> matches <Math mode="inline" tex="S" text="S" xml:id="S2.F1.m5">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">S</XMTok>
            </XMath>
          </Math>. <Math mode="inline" tex="\mathcal{M}^{k}" text="M ^ k" xml:id="S2.F1.m6">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
              </XMApp>
            </XMath>
          </Math> — an ensemble of <Math mode="inline" tex="k" text="k" xml:id="S2.F1.m7">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">k</XMTok>
            </XMath>
          </Math> individual models. <Math mode="inline" tex="A^{h}" text="A ^ h" xml:id="S2.F1.m8">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">A</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">h</XMTok>
              </XMApp>
            </XMath>
          </Math> — a sequence of <Math mode="inline" tex="h" text="h" xml:id="S2.F1.m9">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">h</XMTok>
            </XMath>
          </Math> actions. <Math mode="inline" tex="\mathcal{Z}^{k}" text="Z ^ k" xml:id="S2.F1.m10">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="caligraphic" role="UNKNOWN">Z</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
              </XMApp>
            </XMath>
          </Math> — the result of the model execution. <Math mode="inline" tex="\tilde{S}" text="tilde@(S)" xml:id="S2.F1.m11">
            <XMath>
              <XMApp>
                <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                <XMTok font="italic" role="UNKNOWN">S</XMTok>
              </XMApp>
            </XMath>
          </Math> — the state after <Math mode="inline" tex="h" text="h" xml:id="S2.F1.m12">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">h</XMTok>
            </XMath>
          </Math> steps from the initial one.</toccaption>
        <caption class="ltx_centering"><tag close=": ">Figure 1</tag>The structure of the model. <Math mode="inline" tex="S^{n}" text="S ^ n" xml:id="S2.F1.m13">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">S</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
              </XMApp>
            </XMath>
          </Math> — the history of <Math mode="inline" tex="n" text="n" xml:id="S2.F1.m14">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">n</XMTok>
            </XMath>
          </Math> states. <Math mode="inline" tex="\mathcal{Z}" text="Z" xml:id="S2.F1.m15">
            <XMath>
              <XMTok font="caligraphic" role="UNKNOWN">Z</XMTok>
            </XMath>
          </Math> — the latent space; a special case is when <Math mode="inline" tex="\mathcal{Z}" text="Z" xml:id="S2.F1.m16">
            <XMath>
              <XMTok font="caligraphic" role="UNKNOWN">Z</XMTok>
            </XMath>
          </Math> matches <Math mode="inline" tex="S" text="S" xml:id="S2.F1.m17">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">S</XMTok>
            </XMath>
          </Math>. <Math mode="inline" tex="\mathcal{M}^{k}" text="M ^ k" xml:id="S2.F1.m18">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
              </XMApp>
            </XMath>
          </Math> — an ensemble of <Math mode="inline" tex="k" text="k" xml:id="S2.F1.m19">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">k</XMTok>
            </XMath>
          </Math> individual models. <Math mode="inline" tex="A^{h}" text="A ^ h" xml:id="S2.F1.m20">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">A</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">h</XMTok>
              </XMApp>
            </XMath>
          </Math> — a sequence of <Math mode="inline" tex="h" text="h" xml:id="S2.F1.m21">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">h</XMTok>
            </XMath>
          </Math> actions. <Math mode="inline" tex="\mathcal{Z}^{k}" text="Z ^ k" xml:id="S2.F1.m22">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="caligraphic" role="UNKNOWN">Z</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
              </XMApp>
            </XMath>
          </Math> — the result of the model execution. <Math mode="inline" tex="\tilde{S}" text="tilde@(S)" xml:id="S2.F1.m23">
            <XMath>
              <XMApp>
                <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                <XMTok font="italic" role="UNKNOWN">S</XMTok>
              </XMApp>
            </XMath>
          </Math> — the state after <Math mode="inline" tex="h" text="h" xml:id="S2.F1.m24">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">h</XMTok>
            </XMath>
          </Math> steps from the initial one.</caption>
      </figure>
      <para xml:id="S2.SS1.p4">
        <p>The model predicts the next agent state <Math mode="inline" tex="\tilde{S}" text="tilde@(S)" xml:id="S2.SS1.p4.m1">
            <XMath>
              <XMApp>
                <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                <XMTok font="italic" role="UNKNOWN">S</XMTok>
              </XMApp>
            </XMath>
          </Math> based on the history of <Math mode="inline" tex="n+1" text="n + 1" xml:id="S2.SS1.p4.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="plus" role="ADDOP">+</XMTok>
                <XMTok font="italic" role="UNKNOWN">n</XMTok>
                <XMTok meaning="1" role="NUMBER">1</XMTok>
              </XMApp>
            </XMath>
          </Math> previous states <Math mode="inline" tex="S^{n}" text="S ^ n" xml:id="S2.SS1.p4.m3">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">S</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
              </XMApp>
            </XMath>
          </Math><note mark="1" role="footnote" xml:id="footnote1"><tags>
              <tag>1</tag>
              <tag role="autoref">footnote 1</tag>
              <tag role="refnum">1</tag>
              <tag role="typerefnum">footnote 1</tag>
            </tags><Math mode="inline" tex="n&gt;0" text="n &gt; 0" xml:id="footnote1.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="greater-than" role="RELOP">&gt;</XMTok>
                  <XMTok font="italic" role="UNKNOWN">n</XMTok>
                  <XMTok meaning="0" role="NUMBER">0</XMTok>
                </XMApp>
              </XMath>
            </Math> means partial observability of the environment, where the history is necessary to determine Markov states.</note> and the sequence of <Math mode="inline" tex="h" text="h" xml:id="S2.SS1.p4.m4">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">h</XMTok>
            </XMath>
          </Math> actions <Math mode="inline" tex="A^{h}" text="A ^ h" xml:id="S2.SS1.p4.m5">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">A</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">h</XMTok>
              </XMApp>
            </XMath>
          </Math>. To construct such a transition, a special latent space <Math mode="inline" tex="\mathcal{Z}" text="Z" xml:id="S2.SS1.p4.m6">
            <XMath>
              <XMTok font="caligraphic" role="UNKNOWN">Z</XMTok>
            </XMath>
          </Math> is considered. It makes it possible to extract the features that are essential to the structure of the environment and to perform transitions in a simpler space than the original state space. The transitions discussed above correspond to the forward dynamics model. The scheme also contains the inverse dynamics model, in which <Math mode="inline" tex="A^{h}" text="A ^ h" xml:id="S2.SS1.p4.m7">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">A</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">h</XMTok>
              </XMApp>
            </XMath>
          </Math> is calculated from <Math mode="inline" tex="S^{n}" text="S ^ n" xml:id="S2.SS1.p4.m8">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">S</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
              </XMApp>
            </XMath>
          </Math> and <Math mode="inline" tex="\tilde{S}" text="tilde@(S)" xml:id="S2.SS1.p4.m9">
            <XMath>
              <XMApp>
                <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                <XMTok font="italic" role="UNKNOWN">S</XMTok>
              </XMApp>
            </XMath>
          </Math>.</p>
      </para>
      <para xml:id="S2.SS1.p5">
        <p>The standard way is to use one model <Math mode="inline" tex="\mathcal{M}" text="M" xml:id="S2.SS1.p5.m1">
            <XMath>
              <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
            </XMath>
          </Math> <cite class="ltx_citemacro_cite"><bibref bibrefs="pathak_curiosity-driven_2017" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>, but an ensemble <Math mode="inline" tex="\mathcal{M}^{k}" text="M ^ k" xml:id="S2.SS1.p5.m2">
            <XMath>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
              </XMApp>
            </XMath>
          </Math> of several models is also widespread <cite class="ltx_citemacro_cite"><bibref bibrefs="hafez_deep_2019,mendonca_discovering_2021,pathak_self-supervised_2019,shyam_model-based_2019" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>. In the latter case, each member of the ensemble acts as an independent model that makes its own prediction. These predictions can be used individually or combined into a single result.</p>
      </para>
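      <para>
        <p>The two ways of using an ensemble mentioned above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: the toy linear members built by <text font="typewriter">make_linear_model</text>, the helper <text font="typewriter">ensemble_predict</text>, and the choice of the element-wise mean as the combination rule are all hypothetical and are not taken from the cited works.</p>
        <verbatim font="typewriter">
```python
def make_linear_model(weight, bias):
    """Return a toy forward model: next_state = weight * state + action + bias."""
    def model(state, action):
        return [weight * s + action + bias for s in state]
    return model


def ensemble_predict(models, state, action):
    """Collect one prediction per ensemble member and their element-wise mean."""
    predictions = [m(state, action) for m in models]
    k = len(predictions)
    mean = [sum(dim) / k for dim in zip(*predictions)]
    return predictions, mean


# Hypothetical ensemble of k = 3 members with slightly different parameters.
models = [make_linear_model(w, b) for w, b in [(1.0, 0.0), (1.0, 0.5), (1.0, 1.0)]]
predictions, mean = ensemble_predict(models, state=[0.0, 2.0], action=1.0)
print(predictions)  # one prediction per member, usable as is
print(mean)         # predictions combined into a single result
```
</verbatim>
        <p>Other combination rules, such as sampling one member at random, would fit the same interface.</p>
      </para>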
    </subsection>
    <subsection inlist="toc" xml:id="S2.SS2">
      <tags>
        <tag>2.2</tag>
        <tag role="autoref">subsection 2.2</tag>
        <tag role="refnum">2.2</tag>
        <tag role="typerefnum">§2.2</tag>
      </tags>
      <title><tag close=" ">2.2</tag>Intrinsic motivation</title>
      <para xml:id="S2.SS2.p1">
        <p>Many interesting and innovative ideas in RL have been borrowed from psychological theories of human behavior. One example is the concept of intrinsic motivation. According to the founders of self-determination theory <cite class="ltx_citemacro_cite">(<bibref bibrefs="ryan_intrinsic_2000" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>, p. 3)</cite>: <emph font="italic">"Intrinsic motivation is defined as the doing of an activity for its inherent satisfactions rather than for some separable consequence"</emph>.</p>
      </para>
      <para xml:id="S2.SS2.p2">
        <p>An intrinsically motivated person enjoys an activity without any extrinsic stimuli. Extrinsic motivation is the opposite: it is determined entirely by external factors such as teacher praise, monetary rewards, and fear of punishment. Reinforcement learning is built on reward and punishment signals, so extrinsic motivation is the foundation of learning there. However, it was shown in <cite class="ltx_citemacro_cite"><bibref bibrefs="barto_intrinsic_2005" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite> that intrinsic motivation also fits naturally into the mathematical formalism of computational reinforcement learning. In this case, a special reward signal is introduced: the intrinsic reward <Math mode="inline" tex="R_{int}" text="R _ (i * n * t)" xml:id="S2.SS2.p2.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">R</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>.</p>
      </para>
      <para xml:id="S2.SS2.p3">
        <p>The calculation of <Math mode="inline" tex="R_{int}" text="R _ (i * n * t)" xml:id="S2.SS2.p3.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">R</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> can use not only the transition triple <Math mode="inline" tex="\langle s_{t},a_{t},s_{t+1}\rangle" text="list@(s _ t, a _ t, s _ (t + 1))" xml:id="S2.SS2.p3.m2">
            <XMath>
              <XMDual>
                <XMApp>
                  <XMTok meaning="list"/>
                  <XMRef idref="S2.SS2.p3.m2.1"/>
                  <XMRef idref="S2.SS2.p3.m2.2"/>
                  <XMRef idref="S2.SS2.p3.m2.3"/>
                </XMApp>
                <XMWrap>
                  <XMTok name="langle" role="OPEN" stretchy="false">⟨</XMTok>
                  <XMApp xml:id="S2.SS2.p3.m2.1">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S2.SS2.p3.m2.2">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">a</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S2.SS2.p3.m2.3">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMTok name="rangle" role="CLOSE" stretchy="false">⟩</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>, as is the case for extrinsic rewards, but also the internal representations <Math mode="inline" tex="\mathcal{Z}" text="Z" xml:id="S2.SS2.p3.m3">
            <XMath>
              <XMTok font="caligraphic" role="UNKNOWN">Z</XMTok>
            </XMath>
          </Math> and the model <Math mode="inline" tex="\mathcal{M}" text="M" xml:id="S2.SS2.p3.m4">
            <XMath>
              <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
            </XMath>
          </Math>. A detailed discussion of variations in the definition of the intrinsic reward signal is presented in Section <ref labelref="LABEL:subsec:intr_signal"/>.</p>
      </para>
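      <para>
        <p>One possible instance of such a signal, shown here only as a hedged sketch, is the prediction error of a forward model in the latent space. The functions <text font="typewriter">encode</text>, <text font="typewriter">forward_model</text>, and <text font="typewriter">intrinsic_reward</text> below are hypothetical toy stand-ins for a learned encoder, a learned world model, and the reward computation; the squared-error form is an assumption chosen for illustration.</p>
        <verbatim font="typewriter">
```python
def encode(state):
    """Toy encoder into the latent space Z: keep only the first coordinate."""
    return [state[0]]


def forward_model(latent, action):
    """Toy stand-in for a learned forward model M acting in the latent space."""
    return [z + action for z in latent]


def intrinsic_reward(state, action, next_state):
    """Squared error between the predicted and the observed next latent state."""
    predicted = forward_model(encode(state), action)
    observed = encode(next_state)
    return sum((p - o) ** 2 for p, o in zip(predicted, observed))


# A transition the model predicts well gives zero reward; a surprising one does not.
print(intrinsic_reward([0.0, 5.0], 1.0, [1.0, 7.0]))  # 0.0
print(intrinsic_reward([0.0, 5.0], 1.0, [3.0, 7.0]))  # 4.0
```
</verbatim>
        <p>The example uses the transition triple, the latent representation, and the model together, which is the combination of inputs described in the paragraph above.</p>
      </para>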
    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:agent_learning" xml:id="S3">
    <tags>
      <tag>3</tag>
      <tag role="autoref">section 3</tag>
      <tag role="refnum">3</tag>
      <tag role="typerefnum">§3</tag>
    </tags>
    <title><tag close=" ">3</tag>Intrinsically motivated agent learning</title>
    <para xml:id="S3.p1">
      <p>Intrinsically motivated reinforcement learning considers both extrinsic and intrinsic motivation signals. It is therefore essential to determine how these two signals interact with each other and with the other parts of the agent: the model and the policy (see Fig. <ref labelref="LABEL:fig:fullagent"/>).</p>
    </para>
    <figure inlist="lof" labels="LABEL:fig:fullagent" placement="h" xml:id="S3.F2">
      <tags>
        <tag>Figure 2</tag>
        <tag role="autoref">Figure 2</tag>
        <tag role="refnum">2</tag>
        <tag role="typerefnum">Figure 2</tag>
      </tags>
      <graphics candidates="full_agent_eng.png" class="ltx_centering" graphic="full_agent_eng.png" options="width=433.62pt" xml:id="S3.F2.g1"/>
      <toccaption class="ltx_centering"><tag close=" ">2</tag>The scheme of interacting components when an agent is trained by intrinsic motivation methods based on the world model. Highlighted elements — the basis of the approach under consideration.</toccaption>
      <caption class="ltx_centering"><tag close=": ">Figure 2</tag>The scheme of interacting components when an agent is trained by intrinsic motivation methods based on the world model. Highlighted elements — the basis of the approach under consideration.</caption>
    </figure>
    <para xml:id="S3.p2">
      <p>The left side of the diagram (see Fig. <ref labelref="LABEL:fig:fullagent"/>) shows the extrinsic motivation, which is determined by the task goal. It is the main driver of task policy learning for the given MDP. On the right side, the intrinsic motivation is determined by the characteristics of the world model and the environment and by their interaction. This type of motivation provides the agent with:</p>
      <itemize xml:id="S3.I1">
        <item xml:id="S3.I1.i1">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">1st item</tag>
          </tags>
          <para xml:id="S3.I1.i1.p1">
            <p>a complementary intrinsic reward <Math mode="inline" tex="R_{int}^{*}" text="(R _ (i * n * t)) ^ *" xml:id="S3.I1.i1.p1.m1">
                <XMath>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">R</XMTok>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                  </XMApp>
                </XMath>
              </Math> that corrects the main task reward from the environment and thereby affects the learning of the target policy;</p>
          </para>
        </item>
        <item xml:id="S3.I1.i2">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">2nd item</tag>
          </tags>
          <para xml:id="S3.I1.i2.p1">
            <p>an exploration policy <Math mode="inline" tex="\pi_{\epsilon}" text="pi _ epsilon" xml:id="S3.I1.i2.p1.m1">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                    <XMTok font="italic" fontsize="70%" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                  </XMApp>
                </XMath>
              </Math>, which guides the agent as it collects experience in the environment for training the entire system;</p>
          </para>
        </item>
        <item xml:id="S3.I1.i3">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">3rd item</tag>
          </tags>
          <para xml:id="S3.I1.i3.p1">
            <p>a set of intrinsic goals <Math mode="inline" tex="\mathcal{G}_{int}" text="G _ (i * n * t)" xml:id="S3.I1.i3.p1.m1">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="caligraphic" role="UNKNOWN">G</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    </XMApp>
                  </XMApp>
                </XMath>
              </Math> and a schedule for learning them, which are task-agnostic and significantly increase the agent’s autonomy.</p>
          </para>
        </item>
      </itemize>
    </para>
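    <para>
      <p>The first of these components can be illustrated with a minimal Python sketch. It assumes, purely for illustration, that the intrinsic reward is the prediction error of the world model and that it corrects the extrinsic reward additively with a weight beta. The names prediction_error, corrected_reward, and beta are hypothetical, and the methods reviewed here may combine the two signals differently.</p>
      <verbatim>
```python
from typing import Sequence


def prediction_error(predicted: Sequence[float], observed: Sequence[float]) -> float:
    """Mean squared error between the predicted and the observed next state."""
    if not observed or len(predicted) != len(observed):
        raise ValueError("state vectors must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(predicted, observed)) / len(observed)


def corrected_reward(r_ext: float, r_int: float, beta: float = 0.1) -> float:
    """Additive correction of the task reward (assumed form; beta is hypothetical)."""
    return r_ext + beta * r_int


# Toy example: the model mispredicts the first state coordinate by 1.0.
r_int = prediction_error(predicted=[0.0, 1.0], observed=[1.0, 1.0])
total = corrected_reward(r_ext=1.0, r_int=r_int, beta=0.5)
```
      </verbatim>
      <p>Under this assumed form, setting beta to zero recovers learning from the extrinsic reward alone, while larger values make poorly predicted transitions more attractive to the agent.</p>
    </para>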
    <para xml:id="S3.p3">
      <p>Not every existing method of intrinsic motivation implements all three components <Math mode="inline" tex="R_{int}^{*},\pi_{\epsilon},\mathcal{G}_{int}" text="list@((R _ (i * n * t)) ^ *, pi _ epsilon, G _ (i * n * t))" xml:id="S3.p3.m1">
          <XMath>
            <XMDual>
              <XMApp>
                <XMTok meaning="list"/>
                <XMRef idref="S3.p3.m1.1"/>
                <XMRef idref="S3.p3.m1.2"/>
                <XMRef idref="S3.p3.m1.3"/>
              </XMApp>
              <XMWrap>
                <XMApp xml:id="S3.p3.m1.1">
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">R</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp xml:id="S3.p3.m1.2">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                  <XMTok font="italic" fontsize="70%" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp xml:id="S3.p3.m1.3">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="caligraphic" role="UNKNOWN">G</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  </XMApp>
                </XMApp>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math> at once. Table <ref labelref="LABEL:table:intr_methods"/> indicates the presence or absence of each component in the agents considered in this review (the “Signal type” column is described in Section <ref labelref="LABEL:subsec:intr_signal"/>).</p>
    </para>
    <para xml:id="S3.p4">
      <p>Agent learning comprises learning the model and the two policies, which is formulated as an optimization problem (see Section <ref labelref="LABEL:subsec:losses"/> for details). The main source of information in this case is the agent’s interaction with the world. By choosing actions in the current state, the agent obtains training examples, which are usually stored in memory <Math mode="inline" tex="\mathcal{D}" text="D" xml:id="S3.p4.m1">
          <XMath>
            <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
          </XMath>
        </Math>, which serves as the source of the training set.</p>
    </para>
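    <para>
      <p>The role of the memory can be sketched as a bounded buffer of transitions from which training batches are drawn. The sketch below is only an illustration under stated assumptions: the class names Transition and ReplayMemory, the fixed capacity, and uniform sampling are hypothetical choices, not a description of any particular method in this review.</p>
      <verbatim>
```python
import random
from collections import deque
from typing import Deque, List, NamedTuple


class Transition(NamedTuple):
    state: int
    action: int
    reward: float
    next_state: int


class ReplayMemory:
    """Bounded FIFO memory; the oldest transitions are dropped when it is full."""

    def __init__(self, capacity: int) -> None:
        self._buffer: Deque[Transition] = deque(maxlen=capacity)

    def add(self, transition: Transition) -> None:
        self._buffer.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> List[Transition]:
        # Uniform sampling without replacement, capped at the current size.
        k = min(batch_size, len(self._buffer))
        return rng.sample(list(self._buffer), k)

    def __len__(self) -> int:
        return len(self._buffer)


# Toy interaction loop: five transitions are written into a memory of size three.
memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.add(Transition(state=t, action=0, reward=0.0, next_state=t + 1))
batch = memory.sample(batch_size=2, rng=random.Random(0))
```
      </verbatim>
      <p>In this sketch the same stored transitions could serve as training examples for both the model and the policies, which is one common way to reuse experience.</p>
    </para>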
    <table inlist="lot" labels="LABEL:table:intr_methods" xml:id="S3.T1">
      <tags>
        <tag>Table 1</tag>
        <tag role="autoref">Table 1</tag>
        <tag role="refnum">1</tag>
        <tag role="typerefnum">Table 1</tag>
      </tags>
      <toccaption class="ltx_centering"><tag close=" ">1</tag>The presence of components used by methods of intrinsic motivation based on the world model. <Math mode="inline" tex="\chi[\mathcal{M}]^{*}" text="chi * (delimited-[]@(M)) ^ *" xml:id="S3.T1.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" name="chi" role="UNKNOWN">χ</XMTok>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="delimited-[]"/>
                    <XMRef idref="S3.T1.m1.1"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">[</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m1.1">M</XMTok>
                    <XMTok role="CLOSE" stretchy="false">]</XMTok>
                  </XMWrap>
                </XMDual>
                <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> indicates that no scalar motivational signal is computed, but the structure of the world model is used.</toccaption>
      <caption class="ltx_centering"><tag close=": ">Table 1</tag>The presence of components used by methods of intrinsic motivation based on the world model. <Math mode="inline" tex="\chi[\mathcal{M}]^{*}" text="chi * (delimited-[]@(M)) ^ *" xml:id="S3.T1.m2">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" name="chi" role="UNKNOWN">χ</XMTok>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="delimited-[]"/>
                    <XMRef idref="S3.T1.m2.1"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">[</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m2.1">M</XMTok>
                    <XMTok role="CLOSE" stretchy="false">]</XMTok>
                  </XMWrap>
                </XMDual>
                <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> indicates that no scalar motivational signal is computed, but the structure of the world model is used.</caption>
      <tabular class="ltx_centering ltx_guessed_headers" vattach="middle">
        <thead>
          <tr>
            <td align="left" border="r tt" thead="column">Method name</td>
            <td align="center" border="tt" thead="column"><Math mode="inline" tex="R^{*}_{int}" text="(R ^ *) _ (i * n * t)" xml:id="S3.T1.m3">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">R</XMTok>
                      <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    </XMApp>
                  </XMApp>
                </XMath>
              </Math></td>
            <td align="center" border="tt" thead="column"><Math mode="inline" tex="\pi_{\epsilon}" text="pi _ epsilon" xml:id="S3.T1.m4">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                    <XMTok font="italic" fontsize="70%" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                  </XMApp>
                </XMath>
              </Math></td>
            <td align="center" border="tt" thead="column"><Math mode="inline" tex="\mathcal{G}_{int}" text="G _ (i * n * t)" xml:id="S3.T1.m5">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="caligraphic" role="UNKNOWN">G</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    </XMApp>
                  </XMApp>
                </XMath>
              </Math></td>
            <td align="center" border="tt" thead="column">Signal type</td>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td align="left" border="r t">SelMo<cite class="ltx_citemacro_cite"><bibref bibrefs="groth_is_2021" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center" border="t">—</td>
            <td align="center" border="t">—</td>
            <td align="center" border="t">+</td>
            <td align="center" border="t"><Math mode="inline" tex="\mathcal{L}[\mathcal{M}]" text="L * delimited-[]@(M)" xml:id="S3.T1.m6">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="delimited-[]"/>
                        <XMRef idref="S3.T1.m6.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">[</XMTok>
                        <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m6.1">M</XMTok>
                        <XMTok role="CLOSE" stretchy="false">]</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">ICM<cite class="ltx_citemacro_cite"><bibref bibrefs="pathak_curiosity-driven_2017" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">+</td>
            <td align="center">—</td>
            <td align="center">—</td>
            <td align="center"><Math mode="inline" tex="\mathcal{L}[\mathcal{M}]" text="L * delimited-[]@(M)" xml:id="S3.T1.m7">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="delimited-[]"/>
                        <XMRef idref="S3.T1.m7.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">[</XMTok>
                        <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m7.1">M</XMTok>
                        <XMTok role="CLOSE" stretchy="false">]</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">EMI<cite class="ltx_citemacro_cite"><bibref bibrefs="kim_emi_2019" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">+</td>
            <td align="center">—</td>
            <td align="center">—</td>
            <td align="center"><Math mode="inline" tex="\mathcal{L}[\mathcal{M}]" text="L * delimited-[]@(M)" xml:id="S3.T1.m8">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="delimited-[]"/>
                        <XMRef idref="S3.T1.m8.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">[</XMTok>
                        <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m8.1">M</XMTok>
                        <XMTok role="CLOSE" stretchy="false">]</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">Plan2Explore<cite class="ltx_citemacro_cite"><bibref bibrefs="sekar_planning_2020" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">—</td>
            <td align="center">+</td>
            <td align="center">—</td>
            <td align="center"><Math mode="inline" tex="\mathcal{L}[\mathcal{M}]" text="L * delimited-[]@(M)" xml:id="S3.T1.m9">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="delimited-[]"/>
                        <XMRef idref="S3.T1.m9.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">[</XMTok>
                        <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m9.1">M</XMTok>
                        <XMTok role="CLOSE" stretchy="false">]</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">LEXA<cite class="ltx_citemacro_cite"><bibref bibrefs="mendonca_discovering_2021" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">—</td>
            <td align="center">+</td>
            <td align="center">+</td>
            <td align="center"><Math mode="inline" tex="\mathcal{L}[\mathcal{M}]" text="L * delimited-[]@(M)" xml:id="S3.T1.m10">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="delimited-[]"/>
                        <XMRef idref="S3.T1.m10.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">[</XMTok>
                        <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m10.1">M</XMTok>
                        <XMTok role="CLOSE" stretchy="false">]</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">MEEE<cite class="ltx_citemacro_cite"><bibref bibrefs="yao_sample_2021" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">+</td>
            <td align="center">—</td>
            <td align="center">—</td>
            <td align="center"><Math mode="inline" tex="\mathcal{L}[\mathcal{M}]" text="L * delimited-[]@(M)" xml:id="S3.T1.m11">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="delimited-[]"/>
                        <XMRef idref="S3.T1.m11.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">[</XMTok>
                        <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m11.1">M</XMTok>
                        <XMTok role="CLOSE" stretchy="false">]</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">MAX<cite class="ltx_citemacro_cite"><bibref bibrefs="shyam_model-based_2019" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">—</td>
            <td align="center">+</td>
            <td align="center">—</td>
            <td align="center"><Math mode="inline" tex="\mathcal{L}[\mathcal{M}]" text="L * delimited-[]@(M)" xml:id="S3.T1.m12">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="delimited-[]"/>
                        <XMRef idref="S3.T1.m12.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">[</XMTok>
                        <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m12.1">M</XMTok>
                        <XMTok role="CLOSE" stretchy="false">]</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">AWML<cite class="ltx_citemacro_cite"><bibref bibrefs="kim_active_2020" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">—</td>
            <td align="center">+</td>
            <td align="center">—</td>
            <td align="center"><Math mode="inline" tex="\Delta\mathcal{M}" text="Delta * M" xml:id="S3.T1.m13">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok name="Delta" role="UNKNOWN">Δ</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">VIME<cite class="ltx_citemacro_cite"><bibref bibrefs="houthooft_vime_2017" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">+</td>
            <td align="center">—</td>
            <td align="center">—</td>
            <td align="center"><Math mode="inline" tex="\Delta\mathcal{M}" text="Delta * M" xml:id="S3.T1.m14">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok name="Delta" role="UNKNOWN">Δ</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">Deep ICAC<cite class="ltx_citemacro_cite"><bibref bibrefs="hafez_deep_2019" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">+</td>
            <td align="center">—</td>
            <td align="center">—</td>
            <td align="center"><Math mode="inline" tex="\Delta\mathcal{M}" text="Delta * M" xml:id="S3.T1.m15">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok name="Delta" role="UNKNOWN">Δ</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">GDE<cite class="ltx_citemacro_cite"><bibref bibrefs="volpi_goal-directed_2020" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">+</td>
            <td align="center">—</td>
            <td align="center">—</td>
            <td align="center"><Math mode="inline" tex="\chi[\mathcal{M}]" text="chi * delimited-[]@(M)" xml:id="S3.T1.m16">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" name="chi" role="UNKNOWN">χ</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="delimited-[]"/>
                        <XMRef idref="S3.T1.m16.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">[</XMTok>
                        <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m16.1">M</XMTok>
                        <XMTok role="CLOSE" stretchy="false">]</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">WRW<cite class="ltx_citemacro_cite"><bibref bibrefs="mezghani_walk_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">—</td>
            <td align="center">+</td>
            <td align="center">+</td>
            <td align="center"><Math mode="inline" tex="\chi[\mathcal{M}]" text="chi * delimited-[]@(M)" xml:id="S3.T1.m17">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" name="chi" role="UNKNOWN">χ</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="delimited-[]"/>
                        <XMRef idref="S3.T1.m17.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">[</XMTok>
                        <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m17.1">M</XMTok>
                        <XMTok role="CLOSE" stretchy="false">]</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">EC<cite class="ltx_citemacro_cite"><bibref bibrefs="savinov_episodic_2019" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">+</td>
            <td align="center">—</td>
            <td align="center">—</td>
            <td align="center"><Math mode="inline" tex="\chi[\mathcal{M}]" text="chi * delimited-[]@(M)" xml:id="S3.T1.m18">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" name="chi" role="UNKNOWN">χ</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="delimited-[]"/>
                        <XMRef idref="S3.T1.m18.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">[</XMTok>
                        <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m18.1">M</XMTok>
                        <XMTok role="CLOSE" stretchy="false">]</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">CEE-US<cite class="ltx_citemacro_cite"><bibref bibrefs="sancaktar_curious_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">—</td>
            <td align="center">+</td>
            <td align="center">—</td>
            <td align="center"><Math mode="inline" tex="\mathcal{L}[\mathcal{M}]" text="L * delimited-[]@(M)" xml:id="S3.T1.m19">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="delimited-[]"/>
                        <XMRef idref="S3.T1.m19.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">[</XMTok>
                        <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m19.1">M</XMTok>
                        <XMTok role="CLOSE" stretchy="false">]</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">Director<cite class="ltx_citemacro_cite"><bibref bibrefs="hafner_deep_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">—</td>
            <td align="center">—</td>
            <td align="center">+</td>
            <td align="center"><Math mode="inline" tex="\mathcal{L}[\mathcal{M}]" text="L * delimited-[]@(M)" xml:id="S3.T1.m20">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="delimited-[]"/>
                        <XMRef idref="S3.T1.m20.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">[</XMTok>
                        <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m20.1">M</XMTok>
                        <XMTok role="CLOSE" stretchy="false">]</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">CC-RIG<cite class="ltx_citemacro_cite"><bibref bibrefs="nair_contextual_2020" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">—</td>
            <td align="center">—</td>
            <td align="center">+</td>
            <td align="center"><Math mode="inline" tex="\chi[\mathcal{M}]^{*}" text="chi * (delimited-[]@(M)) ^ *" xml:id="S3.T1.m21">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" name="chi" role="UNKNOWN">χ</XMTok>
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="delimited-[]"/>
                          <XMRef idref="S3.T1.m21.1"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">[</XMTok>
                          <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m21.1">M</XMTok>
                          <XMTok role="CLOSE" stretchy="false">]</XMTok>
                        </XMWrap>
                      </XMDual>
                      <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                    </XMApp>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="r">SMORL<cite class="ltx_citemacro_cite"><bibref bibrefs="zadaianchuk_smorl_2020" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center">—</td>
            <td align="center">—</td>
            <td align="center">+</td>
            <td align="center"><Math mode="inline" tex="\chi[\mathcal{M}]^{*}" text="chi * (delimited-[]@(M)) ^ *" xml:id="S3.T1.m22">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" name="chi" role="UNKNOWN">χ</XMTok>
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="delimited-[]"/>
                          <XMRef idref="S3.T1.m22.1"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">[</XMTok>
                          <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m22.1">M</XMTok>
                          <XMTok role="CLOSE" stretchy="false">]</XMTok>
                        </XMWrap>
                      </XMDual>
                      <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                    </XMApp>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
          <tr>
            <td align="left" border="bb r">SRICS<cite class="ltx_citemacro_cite"><bibref bibrefs="zadaianchuk_self-supervised_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite></td>
            <td align="center" border="bb">—</td>
            <td align="center" border="bb">—</td>
            <td align="center" border="bb">+</td>
            <td align="center" border="bb"><Math mode="inline" tex="\chi[\mathcal{M}]^{*}" text="chi * (delimited-[]@(M)) ^ *" xml:id="S3.T1.m23">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" name="chi" role="UNKNOWN">χ</XMTok>
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="delimited-[]"/>
                          <XMRef idref="S3.T1.m23.1"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">[</XMTok>
                          <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.T1.m23.1">M</XMTok>
                          <XMTok role="CLOSE" stretchy="false">]</XMTok>
                        </XMWrap>
                      </XMDual>
                      <XMTok fontsize="70%" meaning="times" role="MULOP">*</XMTok>
                    </XMApp>
                  </XMApp>
                </XMath>
              </Math></td>
          </tr>
        </tbody>
      </tabular>
    </table>
    <subsection inlist="toc" labels="LABEL:subsec:buffers" xml:id="S3.SS1">
      <tags>
        <tag>3.1</tag>
        <tag role="autoref">subsection 3.1</tag>
        <tag role="refnum">3.1</tag>
        <tag role="typerefnum">§3.1</tag>
      </tags>
      <title><tag close=" ">3.1</tag>Training set</title>
      <para xml:id="S3.SS1.p1">
        <p>An agent has three learning components: task policy <Math mode="inline" tex="\pi_{g}" text="pi _ g" xml:id="S3.SS1.p1.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">g</XMTok>
              </XMApp>
            </XMath>
          </Math>, exploration policy <Math mode="inline" tex="\pi_{\epsilon}" text="pi _ epsilon" xml:id="S3.SS1.p1.m2">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                <XMTok font="italic" fontsize="70%" name="epsilon" role="UNKNOWN">ϵ</XMTok>
              </XMApp>
            </XMath>
          </Math>, and model <Math mode="inline" tex="\mathcal{M}" text="M" xml:id="S3.SS1.p1.m3">
            <XMath>
              <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
            </XMath>
          </Math>. In general, each component has its own memory (training set): <Math mode="inline" tex="\mathcal{D}_{g}" text="D _ g" xml:id="S3.SS1.p1.m4">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">g</XMTok>
              </XMApp>
            </XMath>
          </Math>, <Math mode="inline" tex="\mathcal{D}_{\epsilon}" text="D _ epsilon" xml:id="S3.SS1.p1.m5">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                <XMTok font="italic" fontsize="70%" name="epsilon" role="UNKNOWN">ϵ</XMTok>
              </XMApp>
            </XMath>
          </Math>, <Math mode="inline" tex="\mathcal{D}_{\mathcal{M}}" text="D _ M" xml:id="S3.SS1.p1.m6">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">M</XMTok>
              </XMApp>
            </XMath>
          </Math>, respectively (see Fig. <ref labelref="LABEL:fig:buffers"/>). Each memory stores trajectories of the agent <Math mode="inline" tex="\tau_{H}=(s_{i},a_{i},s_{i+1})_{0:H-1}" text="tau _ H = (vector@(s _ i, a _ i, s _ (i + 1))) _ (0 colon H - 1)" xml:id="S3.SS1.p1.m7">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">H</XMTok>
                </XMApp>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="vector"/>
                      <XMRef idref="S3.SS1.p1.m7.1"/>
                      <XMRef idref="S3.SS1.p1.m7.2"/>
                      <XMRef idref="S3.SS1.p1.m7.3"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S3.SS1.p1.m7.1">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" role="UNKNOWN">s</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      </XMApp>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMApp xml:id="S3.SS1.p1.m7.2">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" role="UNKNOWN">a</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      </XMApp>
                      <XMTok role="PUNCT">,</XMTok>
                      <XMApp xml:id="S3.SS1.p1.m7.3">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" role="UNKNOWN">s</XMTok>
                        <XMApp>
                          <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                          <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                        </XMApp>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                  <XMApp>
                    <XMTok fontsize="70%" name="colon" role="METARELOP">:</XMTok>
                    <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">H</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> consisting of <Math mode="inline" tex="H" text="H" xml:id="S3.SS1.p1.m8">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">H</XMTok>
            </XMath>
          </Math> transitions from one state <Math mode="inline" tex="s_{i}" text="s _ i" xml:id="S3.SS1.p1.m9">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">s</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
              </XMApp>
            </XMath>
          </Math> to another <Math mode="inline" tex="s_{i+1}" text="s _ (i + 1)" xml:id="S3.SS1.p1.m10">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">s</XMTok>
                <XMApp>
                  <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> when action <Math mode="inline" tex="a_{i}" text="a _ i" xml:id="S3.SS1.p1.m11">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">a</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
              </XMApp>
            </XMath>
          </Math> is executed. Here the action is sampled from the policy, and the next state is sampled from the transition dynamics.</p>
      </para>
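      <para>
        <p>As a rough illustration of how such a memory could be filled, the following Python sketch rolls out a single trajectory of H transitions. It is not the implementation of any method discussed here: the names collect_trajectory, toy_policy and toy_transition, as well as the one-dimensional random-walk dynamics, are hypothetical and chosen only to make the example runnable. The transition argument stands for whatever supplies the next state.</p>
      </para>

```python
import random


def collect_trajectory(policy, transition, initial_state, horizon):
    """Roll out one trajectory tau_H as a list of (s_i, a_i, s_{i+1}) triples.

    policy(state) samples an action; transition(state, action) samples the
    next state. Both callables are assumptions made for this sketch.
    """
    trajectory = []
    state = initial_state
    for _ in range(horizon):
        action = policy(state)                  # a_i sampled from the policy
        next_state = transition(state, action)  # s_{i+1} sampled from the dynamics
        trajectory.append((state, action, next_state))
        state = next_state
    return trajectory


# Hypothetical toy problem: a 1-D random walk with deterministic dynamics.
rng = random.Random(0)


def toy_policy(state):
    # Ignores the state and steps left or right uniformly at random.
    return rng.choice([-1, 1])


def toy_transition(state, action):
    # Stand-in for the transition dynamics: the action shifts the state.
    return state + action


# The memory (training set) is a collection of stored trajectories.
memory = []
memory.append(collect_trajectory(toy_policy, toy_transition, initial_state=0, horizon=5))
```

      <para>
        <p>Each stored triple chains into the next one, so the final state of one transition is the initial state of the following transition. Passing a different policy callable yields a different memory from the same routine.</p>
      </para>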
      <para xml:id="S3.SS1.p2">
        <p>To fill a memory, a policy interacts with the environment (storing transitions from the true dynamics), with the model ("dreaming," where the learned model determines the transitions), or with a mixture of the two (some trajectories are generated by the model and the others by the environment):</p>
      </para>
      <para xml:id="S3.SS1.p3">
        <equation xml:id="S3.E2">
          <tags>
            <tag>(2)</tag>
            <tag role="autoref">Equation 2</tag>
            <tag role="refnum">2</tag>
          </tags>
          <Math mode="display" tex="\begin{gathered}\displaystyle\mathcal{D}_{g}=\{\tau_{H}=(s_{i},a_{i},s_{i+1})_%&#10;{0:H-1}|a_{i}\sim\pi_{g}(s_{i}),s_{i+1}\sim[\mathcal{M},T](s_{i},a_{i})\},\\&#10;\displaystyle\mathcal{D}_{\epsilon}=\{\tau_{H}=(s_{i},a_{i},s_{i+1})_{0:H-1}|a%&#10;_{i}\sim\pi_{\epsilon}(s_{i}),s_{i+1}\sim[\mathcal{M},T](s_{i},a_{i})\}.\end{gathered}" text="formulae@(D _ g = conditional-set@(tau _ H = (vector@(s _ i, a _ i, s _ (i + 1))) _ (0 colon H - 1), formulae@(a _ i similar-to pi _ g * s _ i, s _ (i + 1) similar-to closed-interval@(M, T) * open-interval@(s _ i, a _ i))), D _ epsilon = conditional-set@(tau _ H = (vector@(s _ i, a _ i, s _ (i + 1))) _ (0 colon H - 1), formulae@(a _ i similar-to pi _ epsilon * s _ i, s _ (i + 1) similar-to closed-interval@(M, T) * open-interval@(s _ i, a _ i))))" xml:id="S3.E2.m1">
            <XMath>
              <XMDual>
                <XMDual>
                  <XMRef idref="S3.E2.m1.93"/>
                  <XMWrap>
                    <XMDual xml:id="S3.E2.m1.93">
                      <XMApp>
                        <XMTok meaning="formulae"/>
                        <XMRef idref="S3.E2.m1.93.1"/>
                        <XMRef idref="S3.E2.m1.93.2"/>
                      </XMApp>
                      <XMWrap>
                        <XMApp xml:id="S3.E2.m1.93.1">
                          <XMRef idref="S3.E2.m1.3"/>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                            <XMRef idref="S3.E2.m1.1"/>
                            <XMRef idref="S3.E2.m1.2.1"/>
                          </XMApp>
                          <XMDual>
                            <XMApp>
                              <XMTok meaning="conditional-set"/>
                              <XMRef idref="S3.E2.m1.93.1.1"/>
                              <XMRef idref="S3.E2.m1.93.1.2"/>
                            </XMApp>
                            <XMWrap>
                              <XMRef idref="S3.E2.m1.4"/>
                              <XMApp xml:id="S3.E2.m1.93.1.1">
                                <XMRef idref="S3.E2.m1.7"/>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMRef idref="S3.E2.m1.5"/>
                                  <XMRef idref="S3.E2.m1.6.1"/>
                                </XMApp>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMDual>
                                    <XMApp>
                                      <XMTok meaning="vector"/>
                                      <XMRef idref="S3.E2.m1.93.1.1.1"/>
                                      <XMRef idref="S3.E2.m1.93.1.1.2"/>
                                      <XMRef idref="S3.E2.m1.93.1.1.3"/>
                                    </XMApp>
                                    <XMWrap>
                                      <XMRef idref="S3.E2.m1.8"/>
                                      <XMApp xml:id="S3.E2.m1.93.1.1.1">
                                        <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                        <XMRef idref="S3.E2.m1.9"/>
                                        <XMRef idref="S3.E2.m1.10.1"/>
                                      </XMApp>
                                      <XMRef idref="S3.E2.m1.11"/>
                                      <XMApp xml:id="S3.E2.m1.93.1.1.2">
                                        <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                        <XMRef idref="S3.E2.m1.12"/>
                                        <XMRef idref="S3.E2.m1.13.1"/>
                                      </XMApp>
                                      <XMRef idref="S3.E2.m1.14"/>
                                      <XMApp xml:id="S3.E2.m1.93.1.1.3">
                                        <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                        <XMRef idref="S3.E2.m1.15"/>
                                        <XMRef idref="S3.E2.m1.16.1"/>
                                      </XMApp>
                                      <XMRef idref="S3.E2.m1.17"/>
                                    </XMWrap>
                                  </XMDual>
                                  <XMRef idref="S3.E2.m1.18.1"/>
                                </XMApp>
                              </XMApp>
                              <XMRef idref="S3.E2.m1.19"/>
                              <XMDual xml:id="S3.E2.m1.93.1.2">
                                <XMApp>
                                  <XMTok meaning="formulae"/>
                                  <XMRef idref="S3.E2.m1.93.1.2.1"/>
                                  <XMRef idref="S3.E2.m1.93.1.2.2"/>
                                </XMApp>
                                <XMWrap>
                                  <XMApp xml:id="S3.E2.m1.93.1.2.1">
                                    <XMRef idref="S3.E2.m1.22"/>
                                    <XMApp>
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMRef idref="S3.E2.m1.20"/>
                                      <XMRef idref="S3.E2.m1.21.1"/>
                                    </XMApp>
                                    <XMApp>
                                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                      <XMApp>
                                        <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                        <XMRef idref="S3.E2.m1.23"/>
                                        <XMRef idref="S3.E2.m1.24.1"/>
                                      </XMApp>
                                      <XMDual>
                                        <XMRef idref="S3.E2.m1.93.1.2.1.1"/>
                                        <XMWrap>
                                          <XMRef idref="S3.E2.m1.25"/>
                                          <XMApp xml:id="S3.E2.m1.93.1.2.1.1">
                                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                            <XMRef idref="S3.E2.m1.26"/>
                                            <XMRef idref="S3.E2.m1.27.1"/>
                                          </XMApp>
                                          <XMRef idref="S3.E2.m1.28"/>
                                        </XMWrap>
                                      </XMDual>
                                    </XMApp>
                                  </XMApp>
                                  <XMRef idref="S3.E2.m1.29"/>
                                  <XMApp xml:id="S3.E2.m1.93.1.2.2">
                                    <XMRef idref="S3.E2.m1.32"/>
                                    <XMApp>
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMRef idref="S3.E2.m1.30"/>
                                      <XMRef idref="S3.E2.m1.31.1"/>
                                    </XMApp>
                                    <XMApp>
                                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                      <XMApp>
                                        <XMTok meaning="closed-interval"/>
                                        <XMRef idref="S3.E2.m1.34"/>
                                        <XMRef idref="S3.E2.m1.36"/>
                                      </XMApp>
                                      <XMDual>
                                        <XMApp>
                                          <XMTok meaning="open-interval"/>
                                          <XMRef idref="S3.E2.m1.93.1.2.2.1"/>
                                          <XMRef idref="S3.E2.m1.93.1.2.2.2"/>
                                        </XMApp>
                                        <XMWrap>
                                          <XMRef idref="S3.E2.m1.38"/>
                                          <XMApp xml:id="S3.E2.m1.93.1.2.2.1">
                                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                            <XMRef idref="S3.E2.m1.39"/>
                                            <XMRef idref="S3.E2.m1.40.1"/>
                                          </XMApp>
                                          <XMRef idref="S3.E2.m1.41"/>
                                          <XMApp xml:id="S3.E2.m1.93.1.2.2.2">
                                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                            <XMRef idref="S3.E2.m1.42"/>
                                            <XMRef idref="S3.E2.m1.43.1"/>
                                          </XMApp>
                                          <XMRef idref="S3.E2.m1.44"/>
                                        </XMWrap>
                                      </XMDual>
                                    </XMApp>
                                  </XMApp>
                                </XMWrap>
                              </XMDual>
                              <XMRef idref="S3.E2.m1.45"/>
                            </XMWrap>
                          </XMDual>
                        </XMApp>
                        <XMTok role="PUNCT"/>
                        <XMApp xml:id="S3.E2.m1.93.2">
                          <XMRef idref="S3.E2.m1.49"/>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                            <XMRef idref="S3.E2.m1.47"/>
                            <XMRef idref="S3.E2.m1.48.1"/>
                          </XMApp>
                          <XMDual>
                            <XMApp>
                              <XMTok meaning="conditional-set"/>
                              <XMRef idref="S3.E2.m1.93.2.1"/>
                              <XMRef idref="S3.E2.m1.93.2.2"/>
                            </XMApp>
                            <XMWrap>
                              <XMRef idref="S3.E2.m1.50"/>
                              <XMApp xml:id="S3.E2.m1.93.2.1">
                                <XMRef idref="S3.E2.m1.53"/>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMRef idref="S3.E2.m1.51"/>
                                  <XMRef idref="S3.E2.m1.52.1"/>
                                </XMApp>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMDual>
                                    <XMApp>
                                      <XMTok meaning="vector"/>
                                      <XMRef idref="S3.E2.m1.93.2.1.1"/>
                                      <XMRef idref="S3.E2.m1.93.2.1.2"/>
                                      <XMRef idref="S3.E2.m1.93.2.1.3"/>
                                    </XMApp>
                                    <XMWrap>
                                      <XMRef idref="S3.E2.m1.54"/>
                                      <XMApp xml:id="S3.E2.m1.93.2.1.1">
                                        <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                        <XMRef idref="S3.E2.m1.55"/>
                                        <XMRef idref="S3.E2.m1.56.1"/>
                                      </XMApp>
                                      <XMRef idref="S3.E2.m1.57"/>
                                      <XMApp xml:id="S3.E2.m1.93.2.1.2">
                                        <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                        <XMRef idref="S3.E2.m1.58"/>
                                        <XMRef idref="S3.E2.m1.59.1"/>
                                      </XMApp>
                                      <XMRef idref="S3.E2.m1.60"/>
                                      <XMApp xml:id="S3.E2.m1.93.2.1.3">
                                        <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                        <XMRef idref="S3.E2.m1.61"/>
                                        <XMRef idref="S3.E2.m1.62.1"/>
                                      </XMApp>
                                      <XMRef idref="S3.E2.m1.63"/>
                                    </XMWrap>
                                  </XMDual>
                                  <XMRef idref="S3.E2.m1.64.1"/>
                                </XMApp>
                              </XMApp>
                              <XMRef idref="S3.E2.m1.65"/>
                              <XMDual xml:id="S3.E2.m1.93.2.2">
                                <XMApp>
                                  <XMTok meaning="formulae"/>
                                  <XMRef idref="S3.E2.m1.93.2.2.1"/>
                                  <XMRef idref="S3.E2.m1.93.2.2.2"/>
                                </XMApp>
                                <XMWrap>
                                  <XMApp xml:id="S3.E2.m1.93.2.2.1">
                                    <XMRef idref="S3.E2.m1.68"/>
                                    <XMApp>
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMRef idref="S3.E2.m1.66"/>
                                      <XMRef idref="S3.E2.m1.67.1"/>
                                    </XMApp>
                                    <XMApp>
                                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                      <XMApp>
                                        <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                        <XMRef idref="S3.E2.m1.69"/>
                                        <XMRef idref="S3.E2.m1.70.1"/>
                                      </XMApp>
                                      <XMDual>
                                        <XMRef idref="S3.E2.m1.93.2.2.1.1"/>
                                        <XMWrap>
                                          <XMRef idref="S3.E2.m1.71"/>
                                          <XMApp xml:id="S3.E2.m1.93.2.2.1.1">
                                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                            <XMRef idref="S3.E2.m1.72"/>
                                            <XMRef idref="S3.E2.m1.73.1"/>
                                          </XMApp>
                                          <XMRef idref="S3.E2.m1.74"/>
                                        </XMWrap>
                                      </XMDual>
                                    </XMApp>
                                  </XMApp>
                                  <XMRef idref="S3.E2.m1.75"/>
                                  <XMApp xml:id="S3.E2.m1.93.2.2.2">
                                    <XMRef idref="S3.E2.m1.78"/>
                                    <XMApp>
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMRef idref="S3.E2.m1.76"/>
                                      <XMRef idref="S3.E2.m1.77.1"/>
                                    </XMApp>
                                    <XMApp>
                                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                      <XMApp>
                                        <XMTok meaning="closed-interval"/>
                                        <XMRef idref="S3.E2.m1.80"/>
                                        <XMRef idref="S3.E2.m1.82"/>
                                      </XMApp>
                                      <XMDual>
                                        <XMApp>
                                          <XMTok meaning="open-interval"/>
                                          <XMRef idref="S3.E2.m1.93.2.2.2.1"/>
                                          <XMRef idref="S3.E2.m1.93.2.2.2.2"/>
                                        </XMApp>
                                        <XMWrap>
                                          <XMRef idref="S3.E2.m1.84"/>
                                          <XMApp xml:id="S3.E2.m1.93.2.2.2.1">
                                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                            <XMRef idref="S3.E2.m1.85"/>
                                            <XMRef idref="S3.E2.m1.86.1"/>
                                          </XMApp>
                                          <XMRef idref="S3.E2.m1.87"/>
                                          <XMApp xml:id="S3.E2.m1.93.2.2.2.2">
                                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                            <XMRef idref="S3.E2.m1.88"/>
                                            <XMRef idref="S3.E2.m1.89.1"/>
                                          </XMApp>
                                          <XMRef idref="S3.E2.m1.90"/>
                                        </XMWrap>
                                      </XMDual>
                                    </XMApp>
                                  </XMApp>
                                </XMWrap>
                              </XMDual>
                              <XMRef idref="S3.E2.m1.91"/>
                            </XMWrap>
                          </XMDual>
                        </XMApp>
                      </XMWrap>
                    </XMDual>
                    <XMTok role="PERIOD"/>
                  </XMWrap>
                </XMDual>
                <XMArray name="gathered" vattach="middle">
                  <XMRow>
                    <XMCell align="center">
                      <XMWrap>
                        <XMApp xml:id="S3.E2.m1.94">
                          <XMTok meaning="equals" role="RELOP" xml:id="S3.E2.m1.3">=</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                            <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.E2.m1.1">D</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.2.1">g</XMTok>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.4">{</XMTok>
                            <XMApp xml:id="S3.E2.m1.94.1">
                              <XMTok meaning="equals" role="RELOP" xml:id="S3.E2.m1.7">=</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMTok font="italic" name="tau" role="UNKNOWN" xml:id="S3.E2.m1.5">τ</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.6.1">H</XMTok>
                              </XMApp>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMWrap>
                                  <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.8">(</XMTok>
                                  <XMApp xml:id="S3.E2.m1.94.1.1">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.9">s</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.10.1">i</XMTok>
                                  </XMApp>
                                  <XMTok role="PUNCT" xml:id="S3.E2.m1.11">,</XMTok>
                                  <XMApp xml:id="S3.E2.m1.94.1.2">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.12">a</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.13.1">i</XMTok>
                                  </XMApp>
                                  <XMTok role="PUNCT" xml:id="S3.E2.m1.14">,</XMTok>
                                  <XMApp xml:id="S3.E2.m1.94.1.3">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.15">s</XMTok>
                                    <XMApp xml:id="S3.E2.m1.16.1">
                                      <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                                    </XMApp>
                                  </XMApp>
                                  <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.17">)</XMTok>
                                </XMWrap>
                                <XMApp xml:id="S3.E2.m1.18.1">
                                  <XMTok fontsize="70%" name="colon" role="METARELOP">:</XMTok>
                                  <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                                  <XMApp>
                                    <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">H</XMTok>
                                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                                  </XMApp>
                                </XMApp>
                              </XMApp>
                            </XMApp>
                            <XMTok role="VERTBAR" stretchy="false" xml:id="S3.E2.m1.19">|</XMTok>
                            <XMWrap xml:id="S3.E2.m1.94.2">
                              <XMApp xml:id="S3.E2.m1.94.2.1">
                                <XMTok meaning="similar-to" name="sim" role="RELOP" xml:id="S3.E2.m1.22">∼</XMTok>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.20">a</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.21.1">i</XMTok>
                                </XMApp>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="italic" name="pi" role="UNKNOWN" xml:id="S3.E2.m1.23">π</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.24.1">g</XMTok>
                                  </XMApp>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.25">(</XMTok>
                                    <XMApp xml:id="S3.E2.m1.94.2.1.1">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.26">s</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.27.1">i</XMTok>
                                    </XMApp>
                                    <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.28">)</XMTok>
                                  </XMWrap>
                                </XMApp>
                              </XMApp>
                              <XMTok role="PUNCT" xml:id="S3.E2.m1.29">,</XMTok>
                              <XMApp xml:id="S3.E2.m1.94.2.2">
                                <XMTok meaning="similar-to" name="sim" role="RELOP" xml:id="S3.E2.m1.32">∼</XMTok>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.30">s</XMTok>
                                  <XMApp xml:id="S3.E2.m1.31.1">
                                    <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                                  </XMApp>
                                </XMApp>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.33">[</XMTok>
                                    <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.E2.m1.34">M</XMTok>
                                    <XMTok role="PUNCT" xml:id="S3.E2.m1.35">,</XMTok>
                                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.36">T</XMTok>
                                    <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.37">]</XMTok>
                                  </XMWrap>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.38">(</XMTok>
                                    <XMApp xml:id="S3.E2.m1.94.2.2.1">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.39">s</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.40.1">i</XMTok>
                                    </XMApp>
                                    <XMTok role="PUNCT" xml:id="S3.E2.m1.41">,</XMTok>
                                    <XMApp xml:id="S3.E2.m1.94.2.2.2">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.42">a</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.43.1">i</XMTok>
                                    </XMApp>
                                    <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.44">)</XMTok>
                                  </XMWrap>
                                </XMApp>
                              </XMApp>
                            </XMWrap>
                            <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.45">}</XMTok>
                          </XMWrap>
                        </XMApp>
                        <XMTok role="PUNCT" xml:id="S3.E2.m1.46">,</XMTok>
                      </XMWrap>
                    </XMCell>
                  </XMRow>
                  <XMRow>
                    <XMCell align="center">
                      <XMWrap>
                        <XMApp xml:id="S3.E2.m1.95">
                          <XMTok meaning="equals" role="RELOP" xml:id="S3.E2.m1.49">=</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                            <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.E2.m1.47">D</XMTok>
                            <XMTok font="italic" fontsize="70%" name="epsilon" role="UNKNOWN" xml:id="S3.E2.m1.48.1">ϵ</XMTok>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.50">{</XMTok>
                            <XMApp xml:id="S3.E2.m1.95.1">
                              <XMTok meaning="equals" role="RELOP" xml:id="S3.E2.m1.53">=</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMTok font="italic" name="tau" role="UNKNOWN" xml:id="S3.E2.m1.51">τ</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.52.1">H</XMTok>
                              </XMApp>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMWrap>
                                  <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.54">(</XMTok>
                                  <XMApp xml:id="S3.E2.m1.95.1.1">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.55">s</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.56.1">i</XMTok>
                                  </XMApp>
                                  <XMTok role="PUNCT" xml:id="S3.E2.m1.57">,</XMTok>
                                  <XMApp xml:id="S3.E2.m1.95.1.2">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.58">a</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.59.1">i</XMTok>
                                  </XMApp>
                                  <XMTok role="PUNCT" xml:id="S3.E2.m1.60">,</XMTok>
                                  <XMApp xml:id="S3.E2.m1.95.1.3">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.61">s</XMTok>
                                    <XMApp xml:id="S3.E2.m1.62.1">
                                      <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                                    </XMApp>
                                  </XMApp>
                                  <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.63">)</XMTok>
                                </XMWrap>
                                <XMApp xml:id="S3.E2.m1.64.1">
                                  <XMTok fontsize="70%" name="colon" role="METARELOP">:</XMTok>
                                  <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                                  <XMApp>
                                    <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">H</XMTok>
                                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                                  </XMApp>
                                </XMApp>
                              </XMApp>
                            </XMApp>
                            <XMTok role="VERTBAR" stretchy="false" xml:id="S3.E2.m1.65">|</XMTok>
                            <XMWrap xml:id="S3.E2.m1.95.2">
                              <XMApp xml:id="S3.E2.m1.95.2.1">
                                <XMTok meaning="similar-to" name="sim" role="RELOP" xml:id="S3.E2.m1.68">∼</XMTok>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.66">a</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.67.1">i</XMTok>
                                </XMApp>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMTok font="italic" name="pi" role="UNKNOWN" xml:id="S3.E2.m1.69">π</XMTok>
                                    <XMTok font="italic" fontsize="70%" name="epsilon" role="UNKNOWN" xml:id="S3.E2.m1.70.1">ϵ</XMTok>
                                  </XMApp>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.71">(</XMTok>
                                    <XMApp xml:id="S3.E2.m1.95.2.1.1">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.72">s</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.73.1">i</XMTok>
                                    </XMApp>
                                    <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.74">)</XMTok>
                                  </XMWrap>
                                </XMApp>
                              </XMApp>
                              <XMTok role="PUNCT" xml:id="S3.E2.m1.75">,</XMTok>
                              <XMApp xml:id="S3.E2.m1.95.2.2">
                                <XMTok meaning="similar-to" name="sim" role="RELOP" xml:id="S3.E2.m1.78">∼</XMTok>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.76">s</XMTok>
                                  <XMApp xml:id="S3.E2.m1.77.1">
                                    <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                                  </XMApp>
                                </XMApp>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.79">[</XMTok>
                                    <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.E2.m1.80">M</XMTok>
                                    <XMTok role="PUNCT" xml:id="S3.E2.m1.81">,</XMTok>
                                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.82">T</XMTok>
                                    <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.83">]</XMTok>
                                  </XMWrap>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false" xml:id="S3.E2.m1.84">(</XMTok>
                                    <XMApp xml:id="S3.E2.m1.95.2.2.1">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.85">s</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.86.1">i</XMTok>
                                    </XMApp>
                                    <XMTok role="PUNCT" xml:id="S3.E2.m1.87">,</XMTok>
                                    <XMApp xml:id="S3.E2.m1.95.2.2.2">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMTok font="italic" role="UNKNOWN" xml:id="S3.E2.m1.88">a</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E2.m1.89.1">i</XMTok>
                                    </XMApp>
                                    <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.90">)</XMTok>
                                  </XMWrap>
                                </XMApp>
                              </XMApp>
                            </XMWrap>
                            <XMTok role="CLOSE" stretchy="false" xml:id="S3.E2.m1.91">}</XMTok>
                          </XMWrap>
                        </XMApp>
                        <XMTok role="PERIOD" xml:id="S3.E2.m1.92">.</XMTok>
                      </XMWrap>
                    </XMCell>
                  </XMRow>
                </XMArray>
              </XMDual>
            </XMath>
          </Math>
        </equation>
      </para>
      <para xml:id="S3.SS1.p4">
        <p>The world model can be trained only on trajectories generated by the true dynamics of the environment, since those dynamics are exactly what the model is meant to learn. The actions in these trajectories, however, may come from the exploration policy, from the task policy, or from a mixture of the two:</p>
      </para>
      <para class="ltx_noindent" xml:id="S3.SS1.p5">
        <equation labels="LABEL:eq:model_buffer" xml:id="S3.E3">
          <tags>
            <tag>(3)</tag>
            <tag role="autoref">Equation 3</tag>
            <tag role="refnum">3</tag>
          </tags>
          <Math mode="display" tex="\mathcal{D}_{\mathcal{M}}=\{\tau_{H}=(s_{i},a_{i},s_{i+1})_{0:H-1}|a_{i}\sim[%&#10;\pi_{g},\pi_{\epsilon}](s_{i}),s_{i+1}\sim T(s_{i},a_{i})\}." text="D _ M = conditional-set@(tau _ H = (vector@(s _ i, a _ i, s _ (i + 1))) _ (0 colon H - 1), formulae@(a _ i similar-to closed-interval@(pi _ g, pi _ epsilon) * s _ i, s _ (i + 1) similar-to T * open-interval@(s _ i, a _ i)))" xml:id="S3.E3.m1">
            <XMath>
              <XMDual>
                <XMRef idref="S3.E3.m1.1"/>
                <XMWrap>
                  <XMApp xml:id="S3.E3.m1.1">
                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                      <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">M</XMTok>
                    </XMApp>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="conditional-set"/>
                        <XMRef idref="S3.E3.m1.1.1"/>
                        <XMRef idref="S3.E3.m1.1.2"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">{</XMTok>
                        <XMApp xml:id="S3.E3.m1.1.1">
                          <XMTok meaning="equals" role="RELOP">=</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">H</XMTok>
                          </XMApp>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMDual>
                              <XMApp>
                                <XMTok meaning="vector"/>
                                <XMRef idref="S3.E3.m1.1.1.1"/>
                                <XMRef idref="S3.E3.m1.1.1.2"/>
                                <XMRef idref="S3.E3.m1.1.1.3"/>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">(</XMTok>
                                <XMApp xml:id="S3.E3.m1.1.1.1">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                  <XMTok font="italic" role="UNKNOWN">s</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                </XMApp>
                                <XMTok role="PUNCT">,</XMTok>
                                <XMApp xml:id="S3.E3.m1.1.1.2">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                  <XMTok font="italic" role="UNKNOWN">a</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                </XMApp>
                                <XMTok role="PUNCT">,</XMTok>
                                <XMApp xml:id="S3.E3.m1.1.1.3">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                  <XMTok font="italic" role="UNKNOWN">s</XMTok>
                                  <XMApp>
                                    <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                                  </XMApp>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false">)</XMTok>
                              </XMWrap>
                            </XMDual>
                            <XMApp>
                              <XMTok fontsize="70%" name="colon" role="METARELOP">:</XMTok>
                              <XMTok fontsize="70%" meaning="0" role="NUMBER">0</XMTok>
                              <XMApp>
                                <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">H</XMTok>
                                <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                              </XMApp>
                            </XMApp>
                          </XMApp>
                        </XMApp>
                        <XMTok role="VERTBAR" stretchy="false">|</XMTok>
                        <XMDual xml:id="S3.E3.m1.1.2">
                          <XMApp>
                            <XMTok meaning="formulae"/>
                            <XMRef idref="S3.E3.m1.1.2.1"/>
                            <XMRef idref="S3.E3.m1.1.2.2"/>
                          </XMApp>
                          <XMWrap>
                            <XMApp xml:id="S3.E3.m1.1.2.1">
                              <XMTok meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                <XMTok font="italic" role="UNKNOWN">a</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                              </XMApp>
                              <XMApp>
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMDual>
                                  <XMApp>
                                    <XMTok meaning="closed-interval"/>
                                    <XMRef idref="S3.E3.m1.1.2.1.1"/>
                                    <XMRef idref="S3.E3.m1.1.2.1.2"/>
                                  </XMApp>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false">[</XMTok>
                                    <XMApp xml:id="S3.E3.m1.1.2.1.1">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                      <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">g</XMTok>
                                    </XMApp>
                                    <XMTok role="PUNCT">,</XMTok>
                                    <XMApp xml:id="S3.E3.m1.1.2.1.2">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                      <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                                      <XMTok font="italic" fontsize="70%" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                                    </XMApp>
                                    <XMTok role="CLOSE" stretchy="false">]</XMTok>
                                  </XMWrap>
                                </XMDual>
                                <XMDual>
                                  <XMRef idref="S3.E3.m1.1.2.1.3"/>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                                    <XMApp xml:id="S3.E3.m1.1.2.1.3">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                      <XMTok font="italic" role="UNKNOWN">s</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                    </XMApp>
                                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                  </XMWrap>
                                </XMDual>
                              </XMApp>
                            </XMApp>
                            <XMTok role="PUNCT">,</XMTok>
                            <XMApp xml:id="S3.E3.m1.1.2.2">
                              <XMTok meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                <XMTok font="italic" role="UNKNOWN">s</XMTok>
                                <XMApp>
                                  <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                                </XMApp>
                              </XMApp>
                              <XMApp>
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMTok font="italic" role="UNKNOWN">T</XMTok>
                                <XMDual>
                                  <XMApp>
                                    <XMTok meaning="open-interval"/>
                                    <XMRef idref="S3.E3.m1.1.2.2.1"/>
                                    <XMRef idref="S3.E3.m1.1.2.2.2"/>
                                  </XMApp>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                                    <XMApp xml:id="S3.E3.m1.1.2.2.1">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                      <XMTok font="italic" role="UNKNOWN">s</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                    </XMApp>
                                    <XMTok role="PUNCT">,</XMTok>
                                    <XMApp xml:id="S3.E3.m1.1.2.2.2">
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                      <XMTok font="italic" role="UNKNOWN">a</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                    </XMApp>
                                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                  </XMWrap>
                                </XMDual>
                              </XMApp>
                            </XMApp>
                          </XMWrap>
                        </XMDual>
                        <XMTok role="CLOSE" stretchy="false">}</XMTok>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                  <XMTok role="PERIOD">.</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>
        </equation>
      </para>
      <para xml:id="S3.SS1.p6">
        <p>Note that the main objective of the task policy is to solve the MDP of the initial task, whereas the exploration policy serves to gather as much information about the world as possible. Its trajectories contain a greater variety of transitions than those of the task policy, which improves model learning.</p>
      </para>
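      <para>
        <p>The buffer definitions in Equations 2 and 3 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the scalar states and actions, the toy dynamics and policies, and the names rollout, collect, and buffers are all assumptions made for illustration. The sketch shows only the routing rule: a trajectory is stored in the buffer of the policy that produced it, whether it was sampled from the world model or from the environment, and it is additionally stored in the model buffer only when it was sampled from the true dynamics.</p>
      </para>

```python
import random
from typing import Callable, Dict, List, Tuple

# Hypothetical toy types: scalar states and actions, chosen only for illustration.
State = float
Action = float
Transition = Tuple[State, Action, State]
Policy = Callable[[State], Action]
Dynamics = Callable[[State, Action], State]


def rollout(policy: Policy, dynamics: Dynamics, s0: State, horizon: int) -> List[Transition]:
    """Collect tau_H = (s_i, a_i, s_{i+1}) for i = 0, ..., H-1."""
    trajectory = []
    s = s0
    for _ in range(horizon):
        a = policy(s)            # a_i ~ pi(s_i)
        s_next = dynamics(s, a)  # s_{i+1} ~ M(s_i, a_i) or T(s_i, a_i)
        trajectory.append((s, a, s_next))
        s = s_next
    return trajectory


def collect(buffers: Dict[str, list], policy_name: str, policy: Policy,
            source: str, dynamics: Dynamics, s0: State, horizon: int) -> List[Transition]:
    """Route one trajectory into the buffers following Equations 2 and 3."""
    tau = rollout(policy, dynamics, s0, horizon)
    buffers[policy_name].append(tau)   # D_g or D_eps: model or real transitions
    if source == "env":
        buffers["model"].append(tau)   # D_M: real transitions only
    return tau


# Toy stand-ins (assumptions): true dynamics T, learned model M, two policies.
def true_dynamics(s: State, a: Action) -> State:
    return s + a


def world_model(s: State, a: Action) -> State:
    return s + 0.9 * a


def task_policy(s: State) -> Action:
    return 1.0


def exploration_policy(s: State) -> Action:
    return random.uniform(-1.0, 1.0)


buffers = {"g": [], "eps": [], "model": []}
collect(buffers, "g", task_policy, "env", true_dynamics, 0.0, 3)
collect(buffers, "g", task_policy, "model", world_model, 0.0, 3)
collect(buffers, "eps", exploration_policy, "env", true_dynamics, 0.0, 3)
collect(buffers, "eps", exploration_policy, "model", world_model, 0.0, 3)
```

      <para>
        <p>After these four rollouts, each policy buffer holds one real and one imagined trajectory, while the model buffer holds only the two real ones.</p>
      </para>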
      <figure inlist="lof" labels="LABEL:fig:buffers" placement="h" xml:id="S3.F3">
        <tags>
          <tag>Figure 3</tag>
          <tag role="autoref">Figure 3</tag>
          <tag role="refnum">3</tag>
          <tag role="typerefnum">Figure 3</tag>
        </tags>
        <graphics candidates="buffers_eng.png" class="ltx_centering" graphic="buffers_eng.png" options="width=433.62pt" xml:id="S3.F3.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">3</tag>Data collection for training. The task and exploration policies collect training data from the world model and the environment into the memory buffers <Math mode="inline" tex="\mathcal{D}_{g},\mathcal{D}_{\epsilon},\mathcal{D}_{\mathcal{M}}" text="list@(D _ g, D _ epsilon, D _ M)" xml:id="S3.F3.m1">
            <XMath>
              <XMDual>
                <XMApp>
                  <XMTok meaning="list"/>
                  <XMRef idref="S3.F3.m1.1"/>
                  <XMRef idref="S3.F3.m1.2"/>
                  <XMRef idref="S3.F3.m1.3"/>
                </XMApp>
                <XMWrap>
                  <XMApp xml:id="S3.F3.m1.1">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">g</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S3.F3.m1.2">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                    <XMTok font="italic" fontsize="70%" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S3.F3.m1.3">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                    <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">M</XMTok>
                  </XMApp>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>.</toccaption>
        <caption class="ltx_centering"><tag close=": ">Figure 3</tag>Data collection for training. The task and exploration policies collect training data from the world model and the environment into the memory buffers <Math mode="inline" tex="\mathcal{D}_{g},\mathcal{D}_{\epsilon},\mathcal{D}_{\mathcal{M}}" text="list@(D _ g, D _ epsilon, D _ M)" xml:id="S3.F3.m2">
            <XMath>
              <XMDual>
                <XMApp>
                  <XMTok meaning="list"/>
                  <XMRef idref="S3.F3.m2.1"/>
                  <XMRef idref="S3.F3.m2.2"/>
                  <XMRef idref="S3.F3.m2.3"/>
                </XMApp>
                <XMWrap>
                  <XMApp xml:id="S3.F3.m2.1">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">g</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S3.F3.m2.2">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                    <XMTok font="italic" fontsize="70%" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S3.F3.m2.3">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                    <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">M</XMTok>
                  </XMApp>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>.</caption>
      </figure>
    </subsection>
    <subsection inlist="toc" labels="LABEL:subsec:losses" xml:id="S3.SS2">
      <tags>
        <tag>3.2</tag>
        <tag role="autoref">subsection 3.2</tag>
        <tag role="refnum">3.2</tag>
        <tag role="typerefnum">§3.2</tag>
      </tags>
      <title><tag close=" ">3.2</tag>Loss functions</title>
      <para xml:id="S3.SS2.p1">
        <p>Formally, learning is posed as an optimization problem: find the minimum of a function, usually called the loss function <Math mode="inline" tex="\mathcal{L}" text="L" xml:id="S3.SS2.p1.m1">
            <XMath>
              <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
            </XMath>
          </Math>. Accordingly, each of the main agent components has its own loss functional <Math mode="inline" tex="\mathcal{L}_{g},\mathcal{L}_{\epsilon},\mathcal{L}_{\mathcal{M}}" text="list@(L _ g, L _ epsilon, L _ M)" xml:id="S3.SS2.p1.m2">
            <XMath>
              <XMDual>
                <XMApp>
                  <XMTok meaning="list"/>
                  <XMRef idref="S3.SS2.p1.m2.1"/>
                  <XMRef idref="S3.SS2.p1.m2.2"/>
                  <XMRef idref="S3.SS2.p1.m2.3"/>
                </XMApp>
                <XMWrap>
                  <XMApp xml:id="S3.SS2.p1.m2.1">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">g</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S3.SS2.p1.m2.2">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                    <XMTok font="italic" fontsize="70%" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S3.SS2.p1.m2.3">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                    <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">M</XMTok>
                  </XMApp>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>. However, it is also possible to optimize a single objective for the whole system that combines the corresponding component losses. The arguments of the loss functions for the exploration policy, the task policy, and the world model are shown in the agent diagram (see Fig. <ref labelref="LABEL:fig:fullagent"/>). We consider each case in detail below.</p>
      </para>
      <paragraph inlist="toc" xml:id="S3.SS2.SSS0.Px1">
        <title>The task policy</title>
        <para class="ltx_noindent" xml:id="S3.SS2.SSS0.Px1.p1">
          <p>naturally requires state-action pairs and the reward signal for learning. The reward can be purely extrinsic, or it can be complemented by signals defined by intrinsic motivation. Mixing different motivations into a single reward is a standard way of combining them. However, it introduces a bias into the MDP problem, which is defined for the task by extrinsic rewards only. The reward conveys information about the task to the policy only implicitly, whereas the policy can also depend on the goal explicitly. Thus, the loss function for the task policy can be written as:</p>
          <equation xml:id="S3.E4">
            <tags>
              <tag>(4)</tag>
              <tag role="autoref">Equation 4</tag>
              <tag role="refnum">4</tag>
            </tags>
            <Math mode="display" tex="\mathcal{L}_{g}=\mathbb{E}_{\tau_{H}\sim\mathcal{D}_{g}}l(\tau_{H},[R,R_{int}]%&#10;,g,\pi_{g})," text="L _ g = E _ (tau _ H similar-to D _ g) * l * vector@(tau _ H, closed-interval@(R, R _ (i * n * t)), g, pi _ g)" xml:id="S3.E4.m1">
              <XMath>
                <XMDual>
                  <XMRef idref="S3.E4.m1.3"/>
                  <XMWrap>
                    <XMApp xml:id="S3.E4.m1.3">
                      <XMTok meaning="equals" role="RELOP">=</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">g</XMTok>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="blackboard" role="UNKNOWN">E</XMTok>
                          <XMApp>
                            <XMTok fontsize="70%" meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                              <XMTok font="italic" fontsize="70%" name="tau" role="UNKNOWN">τ</XMTok>
                              <XMTok font="italic" fontsize="50%" role="UNKNOWN">H</XMTok>
                            </XMApp>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                              <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">D</XMTok>
                              <XMTok font="italic" fontsize="50%" role="UNKNOWN">g</XMTok>
                            </XMApp>
                          </XMApp>
                        </XMApp>
                        <XMTok font="italic" role="UNKNOWN">l</XMTok>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="vector"/>
                            <XMRef idref="S3.E4.m1.3.1"/>
                            <XMRef idref="S3.E4.m1.3.2"/>
                            <XMRef idref="S3.E4.m1.2"/>
                            <XMRef idref="S3.E4.m1.3.3"/>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false">(</XMTok>
                            <XMApp xml:id="S3.E4.m1.3.1">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">H</XMTok>
                            </XMApp>
                            <XMTok role="PUNCT">,</XMTok>
                            <XMDual xml:id="S3.E4.m1.3.2">
                              <XMApp>
                                <XMTok meaning="closed-interval"/>
                                <XMRef idref="S3.E4.m1.1"/>
                                <XMRef idref="S3.E4.m1.3.2.1"/>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">[</XMTok>
                                <XMTok font="italic" role="UNKNOWN" xml:id="S3.E4.m1.1">R</XMTok>
                                <XMTok role="PUNCT">,</XMTok>
                                <XMApp xml:id="S3.E4.m1.3.2.1">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                                  <XMTok font="italic" role="UNKNOWN">R</XMTok>
                                  <XMApp>
                                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                  </XMApp>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false">]</XMTok>
                              </XMWrap>
                            </XMDual>
                            <XMTok role="PUNCT">,</XMTok>
                            <XMTok font="italic" role="UNKNOWN" xml:id="S3.E4.m1.2">g</XMTok>
                            <XMTok role="PUNCT">,</XMTok>
                            <XMApp xml:id="S3.E4.m1.3.3">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">g</XMTok>
                            </XMApp>
                            <XMTok role="CLOSE" stretchy="false">)</XMTok>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                    </XMApp>
                    <XMTok role="PUNCT">,</XMTok>
                  </XMWrap>
                </XMDual>
              </XMath>
            </Math>
          </equation>
          <p>where <Math mode="inline" tex="\mathcal{D}_{g}" text="D _ g" xml:id="S3.SS2.SSS0.Px1.p1.m1">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">g</XMTok>
                </XMApp>
              </XMath>
            </Math> is the memory (see Section <ref labelref="LABEL:subsec:buffers"/>), <Math mode="inline" tex="g" text="g" xml:id="S3.SS2.SSS0.Px1.p1.m2">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">g</XMTok>
              </XMath>
            </Math> is the goal, <Math mode="inline" tex="[R,R_{int}]" text="closed-interval@(R, R _ (i * n * t))" xml:id="S3.SS2.SSS0.Px1.p1.m3">
              <XMath>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="closed-interval"/>
                    <XMRef idref="S3.SS2.SSS0.Px1.p1.m3.1"/>
                    <XMRef idref="S3.SS2.SSS0.Px1.p1.m3.2"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">[</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="S3.SS2.SSS0.Px1.p1.m3.1">R</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMApp xml:id="S3.SS2.SSS0.Px1.p1.m3.2">
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">R</XMTok>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">]</XMTok>
                  </XMWrap>
                </XMDual>
              </XMath>
            </Math> is the combined reward signal, and <Math mode="inline" tex="l" text="l" xml:id="S3.SS2.SSS0.Px1.p1.m4">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">l</XMTok>
              </XMath>
            </Math> is one of the standard reinforcement learning loss functions, such as the discounted return (see eq. (<ref labelref="LABEL:eq:mdp_opt"/>)).</p>
        </para>
      </paragraph>
      <paragraph inlist="toc" xml:id="S3.SS2.SSS0.Px2">
        <title>The exploration policy</title>
        <para class="ltx_noindent" xml:id="S3.SS2.SSS0.Px2.p1">
          <p>solves a similar optimization problem but, unlike the task policy, relies solely on intrinsic rewards. Such a policy is task-agnostic, since intrinsic motivation does not aim at achieving the goal. The loss function is then:</p>
          <equation labels="LABEL:eq:loss_pi_e" xml:id="S3.E5">
            <tags>
              <tag>(5)</tag>
              <tag role="autoref">Equation 5</tag>
              <tag role="refnum">5</tag>
            </tags>
            <Math mode="display" tex="\mathcal{L}_{\epsilon}=\mathbb{E}_{\tau_{H}\sim\mathcal{D}_{\epsilon}}l(\tau_{%&#10;H},R_{int},\pi_{\epsilon})," text="L _ epsilon = E _ (tau _ H similar-to D _ epsilon) * l * vector@(tau _ H, R _ (i * n * t), pi _ epsilon)" xml:id="S3.E5.m1">
              <XMath>
                <XMDual>
                  <XMRef idref="S3.E5.m1.1"/>
                  <XMWrap>
                    <XMApp xml:id="S3.E5.m1.1">
                      <XMTok meaning="equals" role="RELOP">=</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                        <XMTok font="italic" fontsize="70%" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="blackboard" role="UNKNOWN">E</XMTok>
                          <XMApp>
                            <XMTok fontsize="70%" meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                              <XMTok font="italic" fontsize="70%" name="tau" role="UNKNOWN">τ</XMTok>
                              <XMTok font="italic" fontsize="50%" role="UNKNOWN">H</XMTok>
                            </XMApp>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                              <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">D</XMTok>
                              <XMTok font="italic" fontsize="50%" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                            </XMApp>
                          </XMApp>
                        </XMApp>
                        <XMTok font="italic" role="UNKNOWN">l</XMTok>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="vector"/>
                            <XMRef idref="S3.E5.m1.1.1"/>
                            <XMRef idref="S3.E5.m1.1.2"/>
                            <XMRef idref="S3.E5.m1.1.3"/>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false">(</XMTok>
                            <XMApp xml:id="S3.E5.m1.1.1">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="italic" name="tau" role="UNKNOWN">τ</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">H</XMTok>
                            </XMApp>
                            <XMTok role="PUNCT">,</XMTok>
                            <XMApp xml:id="S3.E5.m1.1.2">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="italic" role="UNKNOWN">R</XMTok>
                              <XMApp>
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                              </XMApp>
                            </XMApp>
                            <XMTok role="PUNCT">,</XMTok>
                            <XMApp xml:id="S3.E5.m1.1.3">
                              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                              <XMTok font="italic" fontsize="70%" name="epsilon" role="UNKNOWN">ϵ</XMTok>
                            </XMApp>
                            <XMTok role="CLOSE" stretchy="false">)</XMTok>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                    </XMApp>
                    <XMTok role="PUNCT">,</XMTok>
                  </XMWrap>
                </XMDual>
              </XMath>
            </Math>
          </equation>
          <p>where <Math mode="inline" tex="l" text="l" xml:id="S3.SS2.SSS0.Px2.p1.m1">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">l</XMTok>
              </XMath>
            </Math> is one of the standard reinforcement learning loss functions, as for the task policy.</p>
        </para>
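        <para class="ltx_noindent">
          <p>As a rough illustration of Eqs. (4) and (5), the sketch below estimates both policy losses as sample means over a batch of stored trajectories. It is not the implementation of any particular method. The function names, the additive mixing of the two rewards with weight beta, the discount gamma, and the choice of the negative discounted return for l are all assumptions made for this example. The dependence on the goal g and on the policy parameters is omitted.</p>
          <verbatim>
```python
# Hypothetical sketch: each trajectory is a pair
# (extrinsic rewards R, intrinsic rewards R_int) of equal length.


def discounted_return(rewards, gamma):
    """Sum of gamma**t * rewards[t], accumulated from the last step backwards."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def task_policy_loss(batch, beta=0.1, gamma=0.99):
    """Monte Carlo estimate of L_g with l = negative discounted return.

    The mixed reward R + beta * R_int is an assumed way of combining
    [R, R_int]; beta is a hypothetical mixing weight.
    """
    losses = []
    for extrinsic, intrinsic in batch:
        mixed = [r + beta * r_int for r, r_int in zip(extrinsic, intrinsic)]
        losses.append(-discounted_return(mixed, gamma))
    return sum(losses) / len(losses)


def exploration_policy_loss(batch, gamma=0.99):
    """Monte Carlo estimate of L_eps: same l, but intrinsic rewards only."""
    losses = [-discounted_return(intrinsic, gamma) for _, intrinsic in batch]
    return sum(losses) / len(losses)
```
          </verbatim>
          <p>Under these assumptions, setting beta to zero reduces the task loss to the purely extrinsic objective, which makes the bias introduced by reward mixing easy to inspect.</p>
        </para>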
      </paragraph>
      <paragraph inlist="toc" xml:id="S3.SS2.SSS0.Px3">
        <title>The world model</title>
        <para class="ltx_noindent" xml:id="S3.SS2.SSS0.Px3.p1">
          <p>learning procedure differs from policy learning, since it solves a supervised learning problem rather than a reinforcement learning one. In supervised learning the true examples are already known: the policies collected them in the environment, and the model only needs to learn to reproduce them. A wide variety of loss functions is used for the world model, but most of them measure the difference between the model's predictions and the true examples from the training set:</p>
          <equation xml:id="S3.E6">
            <tags>
              <tag>(6)</tag>
              <tag role="autoref">Equation 6</tag>
              <tag role="refnum">6</tag>
            </tags>
            <Math mode="display" tex="\begin{gathered}\displaystyle\mathcal{L}_{\mathcal{M}}=\mathbb{E}_{\tau_{H}%&#10;\sim\mathcal{D}_{\mathcal{M}}}l_{M}(\tau_{H},\mathcal{M}),\\&#10;\displaystyle\mathcal{L}_{\mathcal{M}}=\mathbb{E}_{\tau_{H}\sim\mathcal{D}_{%&#10;\mathcal{M}}}l_{F}(\mathcal{M}(S^{n},A^{h}),\tilde{S}),\\&#10;\displaystyle\mathcal{L}_{\mathcal{M}}=\mathbb{E}_{\tau_{H}\sim\mathcal{D}_{%&#10;\mathcal{M}}}l_{I}(\mathcal{M}(S^{n},\tilde{S}),A^{h}),\end{gathered}" text="formulae@(L _ M = E _ (tau _ H similar-to D _ M) * l _ M * open-interval@(tau _ H, M), formulae@(L _ M = E _ (tau _ H similar-to D _ M) * l _ F * open-interval@(M * open-interval@(S ^ n, A ^ h), tilde@(S)), L _ M = E _ (tau _ H similar-to D _ M) * l _ I * open-interval@(M * open-interval@(S ^ n, tilde@(S)), A ^ h)))" xml:id="S3.E6.m1">
              <XMath>
                <XMDual>
                  <XMDual>
                    <XMRef idref="S3.E6.m1.55"/>
                    <XMWrap>
                      <XMDual xml:id="S3.E6.m1.55">
                        <XMApp>
                          <XMTok meaning="formulae"/>
                          <XMRef idref="S3.E6.m1.55.1"/>
                          <XMRef idref="S3.E6.m1.55.2"/>
                        </XMApp>
                        <XMWrap>
                          <XMApp xml:id="S3.E6.m1.55.1">
                            <XMRef idref="S3.E6.m1.3"/>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                              <XMRef idref="S3.E6.m1.1"/>
                              <XMRef idref="S3.E6.m1.2.1"/>
                            </XMApp>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMRef idref="S3.E6.m1.4"/>
                                <XMRef idref="S3.E6.m1.5.1"/>
                              </XMApp>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMRef idref="S3.E6.m1.6"/>
                                <XMRef idref="S3.E6.m1.7.1"/>
                              </XMApp>
                              <XMDual>
                                <XMApp>
                                  <XMTok meaning="open-interval"/>
                                  <XMRef idref="S3.E6.m1.55.1.1"/>
                                  <XMRef idref="S3.E6.m1.12"/>
                                </XMApp>
                                <XMWrap>
                                  <XMRef idref="S3.E6.m1.8"/>
                                  <XMApp xml:id="S3.E6.m1.55.1.1">
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMRef idref="S3.E6.m1.9"/>
                                    <XMRef idref="S3.E6.m1.10.1"/>
                                  </XMApp>
                                  <XMRef idref="S3.E6.m1.11"/>
                                  <XMRef idref="S3.E6.m1.12"/>
                                  <XMRef idref="S3.E6.m1.13"/>
                                </XMWrap>
                              </XMDual>
                            </XMApp>
                          </XMApp>
                          <XMTok role="PUNCT"/>
                          <XMDual xml:id="S3.E6.m1.55.2">
                            <XMApp>
                              <XMTok meaning="formulae"/>
                              <XMRef idref="S3.E6.m1.55.2.1"/>
                              <XMRef idref="S3.E6.m1.55.2.2"/>
                            </XMApp>
                            <XMWrap>
                              <XMApp xml:id="S3.E6.m1.55.2.1">
                                <XMRef idref="S3.E6.m1.17"/>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMRef idref="S3.E6.m1.15"/>
                                  <XMRef idref="S3.E6.m1.16.1"/>
                                </XMApp>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMRef idref="S3.E6.m1.18"/>
                                    <XMRef idref="S3.E6.m1.19.1"/>
                                  </XMApp>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMRef idref="S3.E6.m1.20"/>
                                    <XMRef idref="S3.E6.m1.21.1"/>
                                  </XMApp>
                                  <XMDual>
                                    <XMApp>
                                      <XMTok meaning="open-interval"/>
                                      <XMRef idref="S3.E6.m1.55.2.1.1"/>
                                      <XMRef idref="S3.E6.m1.32"/>
                                    </XMApp>
                                    <XMWrap>
                                      <XMRef idref="S3.E6.m1.22"/>
                                      <XMApp xml:id="S3.E6.m1.55.2.1.1">
                                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                        <XMRef idref="S3.E6.m1.23"/>
                                        <XMDual>
                                          <XMApp>
                                            <XMTok meaning="open-interval"/>
                                            <XMRef idref="S3.E6.m1.55.2.1.1.1"/>
                                            <XMRef idref="S3.E6.m1.55.2.1.1.2"/>
                                          </XMApp>
                                          <XMWrap>
                                            <XMRef idref="S3.E6.m1.24"/>
                                            <XMApp xml:id="S3.E6.m1.55.2.1.1.1">
                                              <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                              <XMRef idref="S3.E6.m1.25"/>
                                              <XMRef idref="S3.E6.m1.26.1"/>
                                            </XMApp>
                                            <XMRef idref="S3.E6.m1.27"/>
                                            <XMApp xml:id="S3.E6.m1.55.2.1.1.2">
                                              <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                              <XMRef idref="S3.E6.m1.28"/>
                                              <XMRef idref="S3.E6.m1.29.1"/>
                                            </XMApp>
                                            <XMRef idref="S3.E6.m1.30"/>
                                          </XMWrap>
                                        </XMDual>
                                      </XMApp>
                                      <XMRef idref="S3.E6.m1.31"/>
                                      <XMRef idref="S3.E6.m1.32"/>
                                      <XMRef idref="S3.E6.m1.33"/>
                                    </XMWrap>
                                  </XMDual>
                                </XMApp>
                              </XMApp>
                              <XMTok role="PUNCT"/>
                              <XMApp xml:id="S3.E6.m1.55.2.2">
                                <XMRef idref="S3.E6.m1.37"/>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMRef idref="S3.E6.m1.35"/>
                                  <XMRef idref="S3.E6.m1.36.1"/>
                                </XMApp>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMRef idref="S3.E6.m1.38"/>
                                    <XMRef idref="S3.E6.m1.39.1"/>
                                  </XMApp>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMRef idref="S3.E6.m1.40"/>
                                    <XMRef idref="S3.E6.m1.41.1"/>
                                  </XMApp>
                                  <XMDual>
                                    <XMApp>
                                      <XMTok meaning="open-interval"/>
                                      <XMRef idref="S3.E6.m1.55.2.2.1"/>
                                      <XMRef idref="S3.E6.m1.55.2.2.2"/>
                                    </XMApp>
                                    <XMWrap>
                                      <XMRef idref="S3.E6.m1.42"/>
                                      <XMApp xml:id="S3.E6.m1.55.2.2.1">
                                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                        <XMRef idref="S3.E6.m1.43"/>
                                        <XMDual>
                                          <XMApp>
                                            <XMTok meaning="open-interval"/>
                                            <XMRef idref="S3.E6.m1.55.2.2.1.1"/>
                                            <XMRef idref="S3.E6.m1.48"/>
                                          </XMApp>
                                          <XMWrap>
                                            <XMRef idref="S3.E6.m1.44"/>
                                            <XMApp xml:id="S3.E6.m1.55.2.2.1.1">
                                              <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                              <XMRef idref="S3.E6.m1.45"/>
                                              <XMRef idref="S3.E6.m1.46.1"/>
                                            </XMApp>
                                            <XMRef idref="S3.E6.m1.47"/>
                                            <XMRef idref="S3.E6.m1.48"/>
                                            <XMRef idref="S3.E6.m1.49"/>
                                          </XMWrap>
                                        </XMDual>
                                      </XMApp>
                                      <XMRef idref="S3.E6.m1.50"/>
                                      <XMApp xml:id="S3.E6.m1.55.2.2.2">
                                        <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                        <XMRef idref="S3.E6.m1.51"/>
                                        <XMRef idref="S3.E6.m1.52.1"/>
                                      </XMApp>
                                      <XMRef idref="S3.E6.m1.53"/>
                                    </XMWrap>
                                  </XMDual>
                                </XMApp>
                              </XMApp>
                            </XMWrap>
                          </XMDual>
                        </XMWrap>
                      </XMDual>
                      <XMTok role="PUNCT"/>
                    </XMWrap>
                  </XMDual>
                  <XMArray name="gathered" vattach="middle">
                    <XMRow>
                      <XMCell align="center">
                        <XMWrap>
                          <XMApp xml:id="S3.E6.m1.56">
                            <XMTok meaning="equals" role="RELOP" xml:id="S3.E6.m1.3">=</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                              <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.E6.m1.1">L</XMTok>
                              <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN" xml:id="S3.E6.m1.2.1">M</XMTok>
                            </XMApp>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMTok font="blackboard" role="UNKNOWN" xml:id="S3.E6.m1.4">E</XMTok>
                                <XMApp xml:id="S3.E6.m1.5.1">
                                  <XMTok fontsize="70%" meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post8"/>
                                    <XMTok font="italic" fontsize="70%" name="tau" role="UNKNOWN">τ</XMTok>
                                    <XMTok font="italic" fontsize="50%" role="UNKNOWN">H</XMTok>
                                  </XMApp>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post8"/>
                                    <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">D</XMTok>
                                    <XMTok font="caligraphic" fontsize="50%" role="UNKNOWN">M</XMTok>
                                  </XMApp>
                                </XMApp>
                              </XMApp>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMTok font="italic" role="UNKNOWN" xml:id="S3.E6.m1.6">l</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E6.m1.7.1">M</XMTok>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false" xml:id="S3.E6.m1.8">(</XMTok>
                                <XMApp xml:id="S3.E6.m1.56.1">
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" name="tau" role="UNKNOWN" xml:id="S3.E6.m1.9">τ</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E6.m1.10.1">H</XMTok>
                                </XMApp>
                                <XMTok role="PUNCT" xml:id="S3.E6.m1.11">,</XMTok>
                                <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.E6.m1.12">M</XMTok>
                                <XMTok role="CLOSE" stretchy="false" xml:id="S3.E6.m1.13">)</XMTok>
                              </XMWrap>
                            </XMApp>
                          </XMApp>
                          <XMTok role="PUNCT" xml:id="S3.E6.m1.14">,</XMTok>
                        </XMWrap>
                      </XMCell>
                    </XMRow>
                    <XMRow>
                      <XMCell align="center">
                        <XMWrap>
                          <XMApp xml:id="S3.E6.m1.57">
                            <XMTok meaning="equals" role="RELOP" xml:id="S3.E6.m1.17">=</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                              <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.E6.m1.15">L</XMTok>
                              <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN" xml:id="S3.E6.m1.16.1">M</XMTok>
                            </XMApp>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMTok font="blackboard" role="UNKNOWN" xml:id="S3.E6.m1.18">E</XMTok>
                                <XMApp xml:id="S3.E6.m1.19.1">
                                  <XMTok fontsize="70%" meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post8"/>
                                    <XMTok font="italic" fontsize="70%" name="tau" role="UNKNOWN">τ</XMTok>
                                    <XMTok font="italic" fontsize="50%" role="UNKNOWN">H</XMTok>
                                  </XMApp>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post8"/>
                                    <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">D</XMTok>
                                    <XMTok font="caligraphic" fontsize="50%" role="UNKNOWN">M</XMTok>
                                  </XMApp>
                                </XMApp>
                              </XMApp>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMTok font="italic" role="UNKNOWN" xml:id="S3.E6.m1.20">l</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E6.m1.21.1">F</XMTok>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false" xml:id="S3.E6.m1.22">(</XMTok>
                                <XMApp xml:id="S3.E6.m1.57.1">
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.E6.m1.23">M</XMTok>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false" xml:id="S3.E6.m1.24">(</XMTok>
                                    <XMApp xml:id="S3.E6.m1.57.1.1">
                                      <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                      <XMTok font="italic" role="UNKNOWN" xml:id="S3.E6.m1.25">S</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E6.m1.26.1">n</XMTok>
                                    </XMApp>
                                    <XMTok role="PUNCT" xml:id="S3.E6.m1.27">,</XMTok>
                                    <XMApp xml:id="S3.E6.m1.57.1.2">
                                      <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                      <XMTok font="italic" role="UNKNOWN" xml:id="S3.E6.m1.28">A</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E6.m1.29.1">h</XMTok>
                                    </XMApp>
                                    <XMTok role="CLOSE" stretchy="false" xml:id="S3.E6.m1.30">)</XMTok>
                                  </XMWrap>
                                </XMApp>
                                <XMTok role="PUNCT" xml:id="S3.E6.m1.31">,</XMTok>
                                <XMApp xml:id="S3.E6.m1.32">
                                  <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">S</XMTok>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false" xml:id="S3.E6.m1.33">)</XMTok>
                              </XMWrap>
                            </XMApp>
                          </XMApp>
                          <XMTok role="PUNCT" xml:id="S3.E6.m1.34">,</XMTok>
                        </XMWrap>
                      </XMCell>
                    </XMRow>
                    <XMRow>
                      <XMCell align="center">
                        <XMWrap>
                          <XMApp xml:id="S3.E6.m1.58">
                            <XMTok meaning="equals" role="RELOP" xml:id="S3.E6.m1.37">=</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                              <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.E6.m1.35">L</XMTok>
                              <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN" xml:id="S3.E6.m1.36.1">M</XMTok>
                            </XMApp>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMTok font="blackboard" role="UNKNOWN" xml:id="S3.E6.m1.38">E</XMTok>
                                <XMApp xml:id="S3.E6.m1.39.1">
                                  <XMTok fontsize="70%" meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post8"/>
                                    <XMTok font="italic" fontsize="70%" name="tau" role="UNKNOWN">τ</XMTok>
                                    <XMTok font="italic" fontsize="50%" role="UNKNOWN">H</XMTok>
                                  </XMApp>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post8"/>
                                    <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">D</XMTok>
                                    <XMTok font="caligraphic" fontsize="50%" role="UNKNOWN">M</XMTok>
                                  </XMApp>
                                </XMApp>
                              </XMApp>
                              <XMApp>
                                <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                <XMTok font="italic" role="UNKNOWN" xml:id="S3.E6.m1.40">l</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E6.m1.41.1">I</XMTok>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false" xml:id="S3.E6.m1.42">(</XMTok>
                                <XMApp xml:id="S3.E6.m1.58.1">
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMTok font="caligraphic" role="UNKNOWN" xml:id="S3.E6.m1.43">M</XMTok>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false" xml:id="S3.E6.m1.44">(</XMTok>
                                    <XMApp xml:id="S3.E6.m1.58.1.1">
                                      <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                      <XMTok font="italic" role="UNKNOWN" xml:id="S3.E6.m1.45">S</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E6.m1.46.1">n</XMTok>
                                    </XMApp>
                                    <XMTok role="PUNCT" xml:id="S3.E6.m1.47">,</XMTok>
                                    <XMApp xml:id="S3.E6.m1.48">
                                      <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                                      <XMTok font="italic" role="UNKNOWN">S</XMTok>
                                    </XMApp>
                                    <XMTok role="CLOSE" stretchy="false" xml:id="S3.E6.m1.49">)</XMTok>
                                  </XMWrap>
                                </XMApp>
                                <XMTok role="PUNCT" xml:id="S3.E6.m1.50">,</XMTok>
                                <XMApp xml:id="S3.E6.m1.58.2">
                                  <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                  <XMTok font="italic" role="UNKNOWN" xml:id="S3.E6.m1.51">A</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN" xml:id="S3.E6.m1.52.1">h</XMTok>
                                </XMApp>
                                <XMTok role="CLOSE" stretchy="false" xml:id="S3.E6.m1.53">)</XMTok>
                              </XMWrap>
                            </XMApp>
                          </XMApp>
                          <XMTok role="PUNCT" xml:id="S3.E6.m1.54">,</XMTok>
                        </XMWrap>
                      </XMCell>
                    </XMRow>
                  </XMArray>
                </XMDual>
              </XMath>
            </Math>
          </equation>
          <p>where <Math mode="inline" tex="S^{n},A^{h},\tilde{S}" text="list@(S ^ n, A ^ h, tilde@(S))" xml:id="S3.SS2.SSS0.Px3.p1.m1">
              <XMath>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="list"/>
                    <XMRef idref="S3.SS2.SSS0.Px3.p1.m1.2"/>
                    <XMRef idref="S3.SS2.SSS0.Px3.p1.m1.3"/>
                    <XMRef idref="S3.SS2.SSS0.Px3.p1.m1.1"/>
                  </XMApp>
                  <XMWrap>
                    <XMApp xml:id="S3.SS2.SSS0.Px3.p1.m1.2">
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">S</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                    </XMApp>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMApp xml:id="S3.SS2.SSS0.Px3.p1.m1.3">
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">A</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">h</XMTok>
                    </XMApp>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMApp xml:id="S3.SS2.SSS0.Px3.p1.m1.1">
                      <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                      <XMTok font="italic" role="UNKNOWN">S</XMTok>
                    </XMApp>
                  </XMWrap>
                </XMDual>
              </XMath>
            </Math> are the states, actions, and predicted states sampled from <Math mode="inline" tex="\mathcal{D}_{\mathcal{M}}" text="D _ M" xml:id="S3.SS2.SSS0.Px3.p1.m2">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                  <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">M</XMTok>
                </XMApp>
              </XMath>
            </Math>, <Math mode="inline" tex="l_{M}" text="l _ M" xml:id="S3.SS2.SSS0.Px3.p1.m3">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">l</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">M</XMTok>
                </XMApp>
              </XMath>
            </Math> is a function that measures the difference between the prediction and the ground truth (e.g., the sum of squared differences for vector states), and <Math mode="inline" tex="l_{I}" text="l _ I" xml:id="S3.SS2.SSS0.Px3.p1.m4">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">l</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">I</XMTok>
                </XMApp>
              </XMath>
            </Math> and <Math mode="inline" tex="l_{F}" text="l _ F" xml:id="S3.SS2.SSS0.Px3.p1.m5">
              <XMath>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">l</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">F</XMTok>
                </XMApp>
              </XMath>
            </Math> are the loss functions of the inverse and forward dynamics models, respectively.</p>
        </para>
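        <para>
          <p>The forward and inverse losses described above can be illustrated with a minimal sketch. It assumes one-step transitions rather than the multi-step sequences of the equation, uses the sum of squared differences for both losses, and works on a toy one-dimensional environment; the names squared_error, forward_loss, inverse_loss and the toy models are hypothetical and do not come from the surveyed methods.</p>
          <verbatim font="typewriter">
```python
from typing import Callable, List, Sequence, Tuple

Vector = Sequence[float]
Sample = Tuple[Vector, Vector, Vector]  # (state, action, next_state)
Model = Callable[[Vector, Vector], Vector]


def squared_error(pred: Vector, target: Vector) -> float:
    """Sum of squared differences between two vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target))


def forward_loss(model: Model, batch: List[Sample]) -> float:
    """Batch mean of l_F: the model maps (state, action) to a next state."""
    return sum(squared_error(model(s, a), s_next) for s, a, s_next in batch) / len(batch)


def inverse_loss(model: Model, batch: List[Sample]) -> float:
    """Batch mean of l_I: the model maps (state, next_state) to an action."""
    return sum(squared_error(model(s, s_next), a) for s, a, s_next in batch) / len(batch)


# Toy 1-D environment with dynamics s' = s + a (assumed for illustration).
batch: List[Sample] = [((0.0,), (1.0,), (1.0,)), ((2.0,), (-1.0,), (1.0,))]


def exact_forward(s: Vector, a: Vector) -> Vector:
    return (s[0] + a[0],)


def biased_forward(s: Vector, a: Vector) -> Vector:
    # Same dynamics with a constant offset, to produce a nonzero loss.
    return (s[0] + a[0] + 0.5,)


def exact_inverse(s: Vector, s_next: Vector) -> Vector:
    return (s_next[0] - s[0],)


loss_exact = forward_loss(exact_forward, batch)
loss_biased = forward_loss(biased_forward, batch)
loss_inverse = inverse_loss(exact_inverse, batch)
```
          </verbatim>
          <p>In this sketch the forward model is scored against the next state and the inverse model against the action taken, which mirrors the argument order of the two losses in the equation.</p>
        </para>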
      </paragraph>
    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:interaction_model_im" xml:id="S4">
    <tags>
      <tag>4</tag>
      <tag role="autoref">section 4</tag>
      <tag role="refnum">4</tag>
      <tag role="typerefnum">§4</tag>
    </tags>
    <title><tag close=" ">4</tag>Intrinsic motivation and model</title>
    <para xml:id="S4.p1">
      <p>A learned world model contains the agent’s knowledge about the dynamics of the surrounding world. On the one hand, this reduces the number of environment interactions needed to learn the policy: data from the model compensates for the reduced flow of information from the environment. On the other hand, direct access to the transition dynamics through the model simplifies the exploration problem.</p>
    </para>
    <para xml:id="S4.p2">
      <p>The application of a model for exploration<note mark="2" role="footnote" xml:id="footnote2"><tags>
            <tag>2</tag>
            <tag role="autoref">footnote 2</tag>
            <tag role="refnum">2</tag>
            <tag role="typerefnum">footnote 2</tag>
          </tags>Here exploration is understood in a broad sense: the agent searches not only for possible states of the environment but also for skills, i.e., behavioral policies for various goals.</note> is an area that directly overlaps with intrinsic motivation methods. Three approaches are distinguished by how they influence the task policy (see Fig. <ref labelref="LABEL:fig:model_explore"/>):</p>
    </para>
    <para xml:id="S4.p3">
      <itemize xml:id="S4.I1">
        <item xml:id="S4.I1.i1">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">1st item</tag>
          </tags>
          <para xml:id="S4.I1.i1.p1">
            <p>the first is based on modifying the reward <Math mode="inline" tex="R" text="R" xml:id="S4.I1.i1.p1.m1">
                <XMath>
                  <XMTok font="italic" role="UNKNOWN">R</XMTok>
                </XMath>
              </Math> for the agent: the intrinsic reward is added to the task reward (see Section <ref labelref="LABEL:subsec:additional_reward"/>);</p>
          </para>
        </item>
        <item xml:id="S4.I1.i2">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">2nd item</tag>
          </tags>
          <para xml:id="S4.I1.i2.p1">
            <p>the second is based on changing how data is collected into the memory <Math mode="inline" tex="\mathcal{D}" text="D" xml:id="S4.I1.i2.p1.m1">
                <XMath>
                  <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                </XMath>
              </Math> by using an exploration policy (see Section <ref labelref="LABEL:subsec:expl_policy"/>);</p>
          </para>
        </item>
        <item xml:id="S4.I1.i3">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">3rd item</tag>
          </tags>
          <para xml:id="S4.I1.i3.p1">
            <p>the third is based on setting goals <Math mode="inline" tex="g" text="g" xml:id="S4.I1.i3.p1.m1">
                <XMath>
                  <XMTok font="italic" role="UNKNOWN">g</XMTok>
                </XMath>
              </Math> (determined by signals of intrinsic motivation, as well as by the structure of the world) as learning problems for the task policy <Math mode="inline" tex="\pi_{g}" text="pi _ g" xml:id="S4.I1.i3.p1.m2">
                <XMath>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">g</XMTok>
                  </XMApp>
                </XMath>
              </Math> (see Section <ref labelref="LABEL:subsec:intr_goals"/>).</p>
          </para>
        </item>
      </itemize>
    </para>
    <para xml:id="S4.p4">
      <p>Each of these three approaches relies on an intrinsic motivation signal, and maximizing this signal gives the agent its exploratory capabilities.</p>
    </para>
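    <para>
      <p>The first two approaches in the list above can be sketched in a few lines. This is only an illustration under stated assumptions: the weighting coefficient beta, the scalar toy environment, and the function names shaped_reward and collect are hypothetical and are not taken from any particular method in the survey.</p>
      <verbatim font="typewriter">
```python
from typing import Callable, List, Tuple

Transition = Tuple[float, float, float]  # (state, action, next_state)


def shaped_reward(task_reward: float, intrinsic_reward: float, beta: float = 0.1) -> float:
    """Approach 1: add a weighted intrinsic reward to the task reward.

    beta is an assumed trade-off coefficient, not a value from the source.
    """
    return task_reward + beta * intrinsic_reward


def collect(
    exploration_policy: Callable[[float], float],
    step: Callable[[float, float], float],
    start_state: float,
    n_steps: int,
    memory: List[Transition],
) -> List[Transition]:
    """Approach 2: fill the memory D with transitions gathered by a
    separate exploration policy; the task policy is trained on D later."""
    state = start_state
    for _ in range(n_steps):
        action = exploration_policy(state)
        next_state = step(state, action)
        memory.append((state, action, next_state))
        state = next_state
    return memory


# Toy usage: dynamics s' = s + a and a policy that always moves right.
memory = collect(lambda s: 1.0, lambda s, a: s + a, 0.0, 3, [])
total = shaped_reward(task_reward=1.0, intrinsic_reward=2.0, beta=0.5)
```
      </verbatim>
      <p>In the first approach the task policy itself sees the modified reward, while in the second it is affected only indirectly, through the data that the exploration policy places in memory.</p>
    </para>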
    <figure inlist="lof" labels="LABEL:fig:model_explore" placement="h" xml:id="S4.F4">
      <tags>
        <tag>Figure 4</tag>
        <tag role="autoref">Figure 4</tag>
        <tag role="refnum">4</tag>
        <tag role="typerefnum">Figure 4</tag>
      </tags>
      <graphics candidates="model_explore_eng.png" class="ltx_centering" graphic="model_explore_eng.png" options="width=433.62pt" xml:id="S4.F4.g1"/>
      <toccaption class="ltx_centering"><tag close=" ">4</tag>Model and intrinsic motivation.</toccaption>
      <caption class="ltx_centering"><tag close=": ">Figure 4</tag>Model and intrinsic motivation.</caption>
    </figure>
    <subsection inlist="toc" labels="LABEL:subsec:intr_signal" xml:id="S4.SS1">
      <tags>
        <tag>4.1</tag>
        <tag role="autoref">subsection 4.1</tag>
        <tag role="refnum">4.1</tag>
        <tag role="typerefnum">§4.1</tag>
      </tags>
      <title><tag close=" ">4.1</tag>Intrinsic signal</title>
      <para xml:id="S4.SS1.p1">
        <p>A feedback signal is needed to train the policy. It assigns a numerical value to the quadruple <Math mode="inline" tex="\langle s_{t},a_{t},s_{t+1},g\rangle" text="list@(s _ t, a _ t, s _ (t + 1), g)" xml:id="S4.SS1.p1.m1">
            <XMath>
              <XMDual>
                <XMApp>
                  <XMTok meaning="list"/>
                  <XMRef idref="S4.SS1.p1.m1.2"/>
                  <XMRef idref="S4.SS1.p1.m1.3"/>
                  <XMRef idref="S4.SS1.p1.m1.4"/>
                  <XMRef idref="S4.SS1.p1.m1.1"/>
                </XMApp>
                <XMWrap>
                  <XMTok name="langle" role="OPEN" stretchy="false">⟨</XMTok>
                  <XMApp xml:id="S4.SS1.p1.m1.2">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S4.SS1.p1.m1.3">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">a</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S4.SS1.p1.m1.4">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S4.SS1.p1.m1.1">g</XMTok>
                  <XMTok name="rangle" role="CLOSE" stretchy="false">⟩</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>. Intrinsic motivation methods evaluate this signal in two main ways: <emph font="italic">knowledge-based</emph>, which takes into account only states and actions <Math mode="inline" tex="\langle s_{t},a_{t},s_{t+1}\rangle" text="list@(s _ t, a _ t, s _ (t + 1))" xml:id="S4.SS1.p1.m2">
            <XMath>
              <XMDual>
                <XMApp>
                  <XMTok meaning="list"/>
                  <XMRef idref="S4.SS1.p1.m2.1"/>
                  <XMRef idref="S4.SS1.p1.m2.2"/>
                  <XMRef idref="S4.SS1.p1.m2.3"/>
                </XMApp>
                <XMWrap>
                  <XMTok name="langle" role="OPEN" stretchy="false">⟨</XMTok>
                  <XMApp xml:id="S4.SS1.p1.m2.1">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S4.SS1.p1.m2.2">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">a</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S4.SS1.p1.m2.3">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMTok name="rangle" role="CLOSE" stretchy="false">⟩</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>, and <emph font="italic">competence-based</emph>, which characterizes both intrinsic goals and the possibility of achieving them <Math mode="inline" tex="\langle s_{t},a_{t},s_{t+1},g\rangle" text="list@(s _ t, a _ t, s _ (t + 1), g)" xml:id="S4.SS1.p1.m3">
            <XMath>
              <XMDual>
                <XMApp>
                  <XMTok meaning="list"/>
                  <XMRef idref="S4.SS1.p1.m3.2"/>
                  <XMRef idref="S4.SS1.p1.m3.3"/>
                  <XMRef idref="S4.SS1.p1.m3.4"/>
                  <XMRef idref="S4.SS1.p1.m3.1"/>
                </XMApp>
                <XMWrap>
                  <XMTok name="langle" role="OPEN" stretchy="false">⟨</XMTok>
                  <XMApp xml:id="S4.SS1.p1.m3.2">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S4.SS1.p1.m3.3">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">a</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMApp xml:id="S4.SS1.p1.m3.4">
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    <XMApp>
                      <XMTok fontsize="70%" meaning="plus" role="ADDOP">+</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                  <XMTok font="italic" role="UNKNOWN" xml:id="S4.SS1.p1.m3.1">g</XMTok>
                  <XMTok name="rangle" role="CLOSE" stretchy="false">⟩</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>. This division, however, does not reflect the specifics of using a model to determine intrinsic motivation.</p>
      </para>
      <para xml:id="S4.SS1.p2">
        <p>One of the agent’s objectives is to search for information about the real dynamics of the environment in order to train its world model. Such exploration needs markers that indicate how well the process is going. These markers, the intrinsic motivation signals, use</p>
        <itemize xml:id="S4.I2">
          <item xml:id="S4.I2.i1">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">1st item</tag>
            </tags>
            <para xml:id="S4.I2.i1.p1">
              <p>the current uncertainty of the model, i.e., some model error;</p>
            </para>
          </item>
          <item xml:id="S4.I2.i2">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">2nd item</tag>
            </tags>
            <para xml:id="S4.I2.i2.p1">
              <p>the knowledge that the model gains from the received data;</p>
            </para>
          </item>
          <item xml:id="S4.I2.i3">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">3rd item</tag>
            </tags>
            <para xml:id="S4.I2.i3.p1">
              <p>the morphology of the environment, which makes it possible to identify important transitions, actions, or states.</p>
            </para>
          </item>
        </itemize>
      </para>
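      <para>
        <p>The first marker in the list above, model error, can be turned into an intrinsic reward in at least two ways: as the gap between the model prediction and the observed next state, or as the disagreement among the members of a model ensemble. The sketch below is a hedged illustration only; measuring disagreement by population variance and the names prediction_error_reward and disagreement_reward are assumptions made for this example.</p>
        <verbatim font="typewriter">
```python
from statistics import pvariance
from typing import List, Sequence

Vector = Sequence[float]


def prediction_error_reward(predicted_next: Vector, observed_next: Vector) -> float:
    """Intrinsic reward as the absolute prediction error, summed over
    state dimensions: large where the model is still wrong."""
    return sum(abs(p - o) for p, o in zip(predicted_next, observed_next))


def disagreement_reward(ensemble_predictions: List[Vector]) -> float:
    """Intrinsic reward as the variance across ensemble members,
    averaged over state dimensions (variance is an assumed choice
    of dispersion measure)."""
    per_dimension = list(zip(*ensemble_predictions))
    return sum(pvariance(values) for values in per_dimension) / len(per_dimension)


# Toy usage with a 2-D state and a two-member ensemble.
r_error = prediction_error_reward((1.0, 2.0), (1.5, 2.0))
r_agree = disagreement_reward([(1.0, 3.0), (1.0, 3.0)])
r_disagree = disagreement_reward([(0.0, 3.0), (2.0, 3.0)])
```
        </verbatim>
        <p>The prediction-error variant needs the true next state, so it can be computed only after acting in the environment, whereas the ensemble variant needs only the models' own predictions.</p>
      </para>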
      <paragraph inlist="toc" xml:id="S4.SS1.SSS0.Px1">
        <title>Model uncertainty <Math mode="inline" tex="\mathcal{L}[\mathcal{M}]" text="L * delimited-[]@(M)" xml:id="S4.SS1.SSS0.Px1.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="delimited-[]"/>
                    <XMRef idref="S4.SS1.SSS0.Px1.m1.1"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">[</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN" xml:id="S4.SS1.SSS0.Px1.m1.1">M</XMTok>
                    <XMTok role="CLOSE" stretchy="false">]</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math></title>
        <para class="ltx_noindent" xml:id="S4.SS1.SSS0.Px1.p1">
          <p>is the most extensive group of methods for determining the intrinsic motivation signal:</p>
          <equationgroup labels="LABEL:eq:lm" xml:id="S4.E7">
            <tags>
              <tag>(7)</tag>
              <tag role="autoref">Equation 7</tag>
              <tag role="refnum">7</tag>
            </tags>
            <equation xml:id="S4.E7X">
              <MathFork>
                <Math tex="\displaystyle\mathcal{L}[\mathcal{M}]:R_{int}=\begin{cases}|\tilde{S}-\mathcal%&#10;{M}(S)|,&amp;\text{between the model and true dynamics};\\&#10;D[\mathcal{M}^{k}(S)],&amp;\text{between ensemble components,}\end{cases}" text="L * delimited-[]@(M) colon R _ (i * n * t) = cases@(absolute-value@(tilde@(S) - M * S), [between the model and true dynamics], D * delimited-[]@(M ^ k * S), [between ensemble components,])" xml:id="S4.E7X.m1">
                  <XMath>
                    <XMApp>
                      <XMTok name="colon" role="METARELOP">:</XMTok>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="delimited-[]"/>
                            <XMRef idref="S4.E7X.m1.1"/>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false">[</XMTok>
                            <XMTok font="caligraphic" role="UNKNOWN" xml:id="S4.E7X.m1.1">M</XMTok>
                            <XMTok role="CLOSE" stretchy="false">]</XMTok>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="equals" role="RELOP">=</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                          <XMTok font="italic" role="UNKNOWN">R</XMTok>
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="cases"/>
                            <XMRef idref="S4.E7.m1.1"/>
                            <XMRef idref="S4.E7.m1.2"/>
                            <XMRef idref="S4.E7.m1.3"/>
                            <XMRef idref="S4.E7.m1.4"/>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="true">{</XMTok>
                            <XMArray>
                              <XMRow>
                                <XMCell align="left">
                                  <XMDual xml:id="S4.E7.m1.1">
                                    <XMRef idref="S4.E7.m1.1.2"/>
                                    <XMWrap>
                                      <XMDual xml:id="S4.E7.m1.1.2">
                                        <XMApp>
                                          <XMTok meaning="absolute-value"/>
                                          <XMRef idref="S4.E7.m1.1.2.1"/>
                                        </XMApp>
                                        <XMWrap>
                                          <XMTok role="VERTBAR" stretchy="false">|</XMTok>
                                          <XMApp xml:id="S4.E7.m1.1.2.1">
                                            <XMTok meaning="minus" role="ADDOP">-</XMTok>
                                            <XMApp>
                                              <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                                              <XMTok font="italic" role="UNKNOWN">S</XMTok>
                                            </XMApp>
                                            <XMApp>
                                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                              <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
                                              <XMDual>
                                                <XMRef idref="S4.E7.m1.1.1"/>
                                                <XMWrap>
                                                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                                                  <XMTok font="italic" role="UNKNOWN" xml:id="S4.E7.m1.1.1">S</XMTok>
                                                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                                </XMWrap>
                                              </XMDual>
                                            </XMApp>
                                          </XMApp>
                                          <XMTok role="VERTBAR" stretchy="false">|</XMTok>
                                        </XMWrap>
                                      </XMDual>
                                      <XMTok role="PUNCT">,</XMTok>
                                    </XMWrap>
                                  </XMDual>
                                </XMCell>
                                <XMCell align="left">
                                  <XMDual xml:id="S4.E7.m1.2">
                                    <XMRef idref="S4.E7.m1.2.1"/>
                                    <XMWrap>
                                      <XMText xml:id="S4.E7.m1.2.1">between the model and true dynamics</XMText>
                                      <XMTok role="PUNCT">;</XMTok>
                                    </XMWrap>
                                  </XMDual>
                                </XMCell>
                              </XMRow>
                              <XMRow>
                                <XMCell align="left">
                                  <XMDual xml:id="S4.E7.m1.3">
                                    <XMRef idref="S4.E7.m1.3.2"/>
                                    <XMWrap>
                                      <XMApp xml:id="S4.E7.m1.3.2">
                                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                        <XMTok font="italic" role="UNKNOWN">D</XMTok>
                                        <XMDual>
                                          <XMApp>
                                            <XMTok meaning="delimited-[]"/>
                                            <XMRef idref="S4.E7.m1.3.2.1"/>
                                          </XMApp>
                                          <XMWrap>
                                            <XMTok role="OPEN" stretchy="false">[</XMTok>
                                            <XMApp xml:id="S4.E7.m1.3.2.1">
                                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                              <XMApp>
                                                <XMTok role="SUPERSCRIPTOP" scriptpos="post12"/>
                                                <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
                                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                                              </XMApp>
                                              <XMDual>
                                                <XMRef idref="S4.E7.m1.3.1"/>
                                                <XMWrap>
                                                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                                                  <XMTok font="italic" role="UNKNOWN" xml:id="S4.E7.m1.3.1">S</XMTok>
                                                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                                </XMWrap>
                                              </XMDual>
                                            </XMApp>
                                            <XMTok role="CLOSE" stretchy="false">]</XMTok>
                                          </XMWrap>
                                        </XMDual>
                                      </XMApp>
                                      <XMTok role="PUNCT">,</XMTok>
                                    </XMWrap>
                                  </XMDual>
                                </XMCell>
                                <XMCell align="left">
                                  <XMText xml:id="S4.E7.m1.4">between ensemble components,</XMText>
                                </XMCell>
                              </XMRow>
                            </XMArray>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                    </XMApp>
                  </XMath>
                </Math>
                <MathBranch>
                  <td align="right"><Math tex="\displaystyle\mathcal{L}[\mathcal{M}]:R_{int}" text="L * delimited-[]@(M) colon R _ (i * n * t)" xml:id="S4.E7X.m2">
                      <XMath>
                        <XMApp>
                          <XMTok name="colon" role="METARELOP">:</XMTok>
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                            <XMDual>
                              <XMApp>
                                <XMTok meaning="delimited-[]"/>
                                <XMRef idref="S4.E7X.m2.1"/>
                              </XMApp>
                              <XMWrap>
                                <XMTok role="OPEN" stretchy="false">[</XMTok>
                                <XMTok font="caligraphic" role="UNKNOWN" xml:id="S4.E7X.m2.1">M</XMTok>
                                <XMTok role="CLOSE" stretchy="false">]</XMTok>
                              </XMWrap>
                            </XMDual>
                          </XMApp>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                            <XMTok font="italic" role="UNKNOWN">R</XMTok>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                            </XMApp>
                          </XMApp>
                        </XMApp>
                      </XMath>
                    </Math></td>
                  <td align="left"><Math tex="\displaystyle=\begin{cases}|\tilde{S}-\mathcal{M}(S)|,&amp;\text{between the model%&#10; and true dynamics};\\&#10;D[\mathcal{M}^{k}(S)],&amp;\text{between ensemble components,}\end{cases}" text="absent = cases@(absolute-value@(tilde@(S) - M * S), [between the model and true dynamics], D * delimited-[]@(M ^ k * S), [between ensemble components,])" xml:id="S4.E7X.m3">
                      <XMath>
                        <XMApp>
                          <XMTok meaning="equals" role="RELOP">=</XMTok>
                          <XMTok meaning="absent"/>
                          <XMDual>
                            <XMApp>
                              <XMTok meaning="cases"/>
                              <XMRef idref="S4.E7.m1.1.mf"/>
                              <XMRef idref="S4.E7.m1.2.mf"/>
                              <XMRef idref="S4.E7.m1.3.mf"/>
                              <XMRef idref="S4.E7.m1.4.mf"/>
                            </XMApp>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="true">{</XMTok>
                              <XMArray>
                                <XMRow>
                                  <XMCell align="left">
                                    <XMDual xml:id="S4.E7.m1.1.mf">
                                      <XMRef idref="S4.E7.m1.1.mf.2"/>
                                      <XMWrap>
                                        <XMDual xml:id="S4.E7.m1.1.mf.2">
                                          <XMApp>
                                            <XMTok meaning="absolute-value"/>
                                            <XMRef idref="S4.E7.m1.1.mf.2.1"/>
                                          </XMApp>
                                          <XMWrap>
                                            <XMTok role="VERTBAR" stretchy="false">|</XMTok>
                                            <XMApp xml:id="S4.E7.m1.1.mf.2.1">
                                              <XMTok meaning="minus" role="ADDOP">-</XMTok>
                                              <XMApp>
                                                <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                                                <XMTok font="italic" role="UNKNOWN">S</XMTok>
                                              </XMApp>
                                              <XMApp>
                                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                                <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
                                                <XMDual>
                                                  <XMRef idref="S4.E7.m1.1.mf.1"/>
                                                  <XMWrap>
                                                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                                                    <XMTok font="italic" role="UNKNOWN" xml:id="S4.E7.m1.1.mf.1">S</XMTok>
                                                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                                  </XMWrap>
                                                </XMDual>
                                              </XMApp>
                                            </XMApp>
                                            <XMTok role="VERTBAR" stretchy="false">|</XMTok>
                                          </XMWrap>
                                        </XMDual>
                                        <XMTok role="PUNCT">,</XMTok>
                                      </XMWrap>
                                    </XMDual>
                                  </XMCell>
                                  <XMCell align="left">
                                    <XMDual xml:id="S4.E7.m1.2.mf">
                                      <XMRef idref="S4.E7.m1.2.mf.1"/>
                                      <XMWrap>
                                        <XMText xml:id="S4.E7.m1.2.mf.1">between the model and true dynamics</XMText>
                                        <XMTok role="PUNCT">;</XMTok>
                                      </XMWrap>
                                    </XMDual>
                                  </XMCell>
                                </XMRow>
                                <XMRow>
                                  <XMCell align="left">
                                    <XMDual xml:id="S4.E7.m1.3.mf">
                                      <XMRef idref="S4.E7.m1.3.mf.2"/>
                                      <XMWrap>
                                        <XMApp xml:id="S4.E7.m1.3.mf.2">
                                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                          <XMTok font="italic" role="UNKNOWN">D</XMTok>
                                          <XMDual>
                                            <XMApp>
                                              <XMTok meaning="delimited-[]"/>
                                              <XMRef idref="S4.E7.m1.3.mf.2.1"/>
                                            </XMApp>
                                            <XMWrap>
                                              <XMTok role="OPEN" stretchy="false">[</XMTok>
                                              <XMApp xml:id="S4.E7.m1.3.mf.2.1">
                                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                                <XMApp>
                                                  <XMTok role="SUPERSCRIPTOP" scriptpos="post12"/>
                                                  <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
                                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">k</XMTok>
                                                </XMApp>
                                                <XMDual>
                                                  <XMRef idref="S4.E7.m1.3.mf.1"/>
                                                  <XMWrap>
                                                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                                                    <XMTok font="italic" role="UNKNOWN" xml:id="S4.E7.m1.3.mf.1">S</XMTok>
                                                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                                  </XMWrap>
                                                </XMDual>
                                              </XMApp>
                                              <XMTok role="CLOSE" stretchy="false">]</XMTok>
                                            </XMWrap>
                                          </XMDual>
                                        </XMApp>
                                        <XMTok role="PUNCT">,</XMTok>
                                      </XMWrap>
                                    </XMDual>
                                  </XMCell>
                                  <XMCell align="left">
                                    <XMText xml:id="S4.E7.m1.4.mf">between ensemble components,</XMText>
                                  </XMCell>
                                </XMRow>
                              </XMArray>
                            </XMWrap>
                          </XMDual>
                        </XMApp>
                      </XMath>
                    </Math></td>
                </MathBranch>
              </MathFork>
            </equation>
          </equationgroup>
          <p>where <Math mode="inline" tex="|\ldots-\ldots|" text="absolute-value@(ldots - ldots)" xml:id="S4.SS1.SSS0.Px1.p1.m1">
              <XMath>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="absolute-value"/>
                    <XMRef idref="S4.SS1.SSS0.Px1.p1.m1.1"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="VERTBAR" stretchy="false">|</XMTok>
                    <XMApp xml:id="S4.SS1.SSS0.Px1.p1.m1.1">
                      <XMTok meaning="minus" role="ADDOP">-</XMTok>
                      <XMTok name="ldots" role="ID">…</XMTok>
                      <XMTok name="ldots" role="ID">…</XMTok>
                    </XMApp>
                    <XMTok role="VERTBAR" stretchy="false">|</XMTok>
                  </XMWrap>
                </XMDual>
              </XMath>
            </Math> denotes the difference between elements, and <Math mode="inline" tex="D[\dots]" text="D * delimited-[]@(dots)" xml:id="S4.SS1.SSS0.Px1.p1.m2">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">D</XMTok>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="delimited-[]"/>
                      <XMRef idref="S4.SS1.SSS0.Px1.p1.m2.1"/>
                    </XMApp>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">[</XMTok>
                      <XMTok name="dots" role="ID" xml:id="S4.SS1.SSS0.Px1.p1.m2.1">…</XMTok>
                      <XMTok role="CLOSE" stretchy="false">]</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math> measures the diversity of elements (e.g., standard deviation or variance). One intrinsic reward signal is the prediction error for the next state, obtained by comparing the predicted state with the true one (see the first case of eq. (<ref labelref="LABEL:eq:lm"/>); ICM <cite class="ltx_citemacro_cite"><bibref bibrefs="pathak_curiosity-driven_2017" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>, SelMo <cite class="ltx_citemacro_cite"><bibref bibrefs="groth_is_2021" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>, EMI <cite class="ltx_citemacro_cite"><bibref bibrefs="kim_emi_2019" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>, Director <cite class="ltx_citemacro_cite"><bibref bibrefs="hafner_deep_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>). However, this approach requires constant access to the true state. An alternative is to use the variance of the ensemble predictions as the signal, see the second case of eq. (<ref labelref="LABEL:eq:lm"/>) (Plan2Explore <cite class="ltx_citemacro_cite"><bibref bibrefs="sekar_planning_2020" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>, LEXA <cite class="ltx_citemacro_cite"><bibref bibrefs="mendonca_discovering_2021" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>, MEEE <cite class="ltx_citemacro_cite"><bibref bibrefs="yao_sample_2021" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>; MAX <cite class="ltx_citemacro_cite"><bibref bibrefs="shyam_model-based_2019" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>, whose signal is similar but uses the Jensen-Shannon divergence).</p>
        </para>
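        <para class="ltx_noindent">
          <p>The two cases of eq. (<ref labelref="LABEL:eq:lm"/>) can be illustrated with a minimal sketch. It assumes that states are plain lists of floats, that the difference is measured with the Euclidean norm, and that diversity is the per-dimension variance averaged over the state dimensions. The function names are hypothetical and are not taken from any of the cited methods, each of which makes its own choices for these quantities.</p>
          <verbatim font="typewriter">
```python
import math
from statistics import pvariance


def prediction_error_reward(predicted_next_state, true_next_state):
    """First case: distance between the true and the predicted next state.

    The Euclidean norm is an assumption; any distance could be used.
    """
    squared = [
        (s_true - s_pred) ** 2
        for s_pred, s_true in zip(predicted_next_state, true_next_state)
    ]
    return math.sqrt(sum(squared))


def ensemble_disagreement_reward(ensemble_predictions):
    """Second case: diversity of the predictions of K ensemble members.

    ensemble_predictions holds K predicted next states of equal length.
    Diversity is taken here as the population variance per state dimension,
    averaged over dimensions (an assumption made for this sketch).
    """
    per_dimension = zip(*ensemble_predictions)
    variances = [pvariance(values) for values in per_dimension]
    return sum(variances) / len(variances)


# Toy usage with hand-picked numbers.
print(prediction_error_reward([1.0, 0.0], [1.0, 2.0]))          # 2.0
print(ensemble_disagreement_reward([[0.0, 0.0], [2.0, 0.0]]))   # 0.5
```
          </verbatim>
          <p>Note that the first function needs the true next state, whereas the second only needs the predictions of the ensemble, which mirrors the distinction drawn above.</p>
        </para>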
      </paragraph>
      <paragraph inlist="toc" xml:id="S4.SS1.SSS0.Px2">
        <title>Knowledge gain <Math mode="inline" tex="\Delta\mathcal{M}" text="Delta * M" xml:id="S4.SS1.SSS0.Px2.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok name="Delta" role="UNKNOWN">Δ</XMTok>
                <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
              </XMApp>
            </XMath>
          </Math></title>
        <para class="ltx_noindent" xml:id="S4.SS1.SSS0.Px2.p1">
          <p>covers signals that measure how much the model changes when new information is received:</p>
          <equation xml:id="S4.E8">
            <tags>
              <tag>(8)</tag>
              <tag role="autoref">Equation 8</tag>
              <tag role="refnum">8</tag>
            </tags>
            <Math mode="display" tex="\Delta\mathcal{M}:R_{int}(t)=|\mathcal{M}(t)-\mathcal{M}(t-n)|," text="Delta * M colon R _ (i * n * t) * t = absolute-value@(M * t - M * (t - n))" xml:id="S4.E8.m1">
              <XMath>
                <XMDual>
                  <XMRef idref="S4.E8.m1.3"/>
                  <XMWrap>
                    <XMApp xml:id="S4.E8.m1.3">
                      <XMTok name="colon" role="METARELOP">:</XMTok>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMTok name="Delta" role="UNKNOWN">Δ</XMTok>
                        <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="equals" role="RELOP">=</XMTok>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="italic" role="UNKNOWN">R</XMTok>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                            </XMApp>
                          </XMApp>
                          <XMDual>
                            <XMRef idref="S4.E8.m1.1"/>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false">(</XMTok>
                              <XMTok font="italic" role="UNKNOWN" xml:id="S4.E8.m1.1">t</XMTok>
                              <XMTok role="CLOSE" stretchy="false">)</XMTok>
                            </XMWrap>
                          </XMDual>
                        </XMApp>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="absolute-value"/>
                            <XMRef idref="S4.E8.m1.3.1"/>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="VERTBAR" stretchy="false">|</XMTok>
                            <XMApp xml:id="S4.E8.m1.3.1">
                              <XMTok meaning="minus" role="ADDOP">-</XMTok>
                              <XMApp>
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
                                <XMDual>
                                  <XMRef idref="S4.E8.m1.2"/>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                                    <XMTok font="italic" role="UNKNOWN" xml:id="S4.E8.m1.2">t</XMTok>
                                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                  </XMWrap>
                                </XMDual>
                              </XMApp>
                              <XMApp>
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
                                <XMDual>
                                  <XMRef idref="S4.E8.m1.3.1.1"/>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                                    <XMApp xml:id="S4.E8.m1.3.1.1">
                                      <XMTok meaning="minus" role="ADDOP">-</XMTok>
                                      <XMTok font="italic" role="UNKNOWN">t</XMTok>
                                      <XMTok font="italic" role="UNKNOWN">n</XMTok>
                                    </XMApp>
                                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                  </XMWrap>
                                </XMDual>
                              </XMApp>
                            </XMApp>
                            <XMTok role="VERTBAR" stretchy="false">|</XMTok>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                    </XMApp>
                    <XMTok role="PUNCT">,</XMTok>
                  </XMWrap>
                </XMDual>
              </XMath>
            </Math>
          </equation>
          <p>where <Math mode="inline" tex="\mathcal{M}(t)" text="M * t" xml:id="S4.SS1.SSS0.Px2.p1.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
                  <XMDual>
                    <XMRef idref="S4.SS1.SSS0.Px2.p1.m1.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMTok font="italic" role="UNKNOWN" xml:id="S4.SS1.SSS0.Px2.p1.m1.1">t</XMTok>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
              </XMath>
            </Math> is the model state at time <Math mode="inline" tex="t" text="t" xml:id="S4.SS1.SSS0.Px2.p1.m2">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">t</XMTok>
              </XMath>
            </Math>, and <Math mode="inline" tex="n" text="n" xml:id="S4.SS1.SSS0.Px2.p1.m3">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">n</XMTok>
              </XMath>
            </Math> is the number of time steps over which the change in the model is tracked. Examples of such signals are the difference in the loss function across several steps (AWML <Math mode="inline" tex="\gamma" text="gamma" xml:id="S4.SS1.SSS0.Px2.p1.m4">
              <XMath>
                <XMTok font="italic" name="gamma" role="UNKNOWN">γ</XMTok>
              </XMath>
            </Math>-Progress <cite class="ltx_citemacro_cite"><bibref bibrefs="kim_active_2020" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>), the Kullback-Leibler divergence between the current and updated model predictions (VIME <cite class="ltx_citemacro_cite"><bibref bibrefs="houthooft_vime_2017" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>), and the improvement in ensemble predictions (Deep ICAC <cite class="ltx_citemacro_cite"><bibref bibrefs="hafez_deep_2019" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>).</p>
        </para>
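        <para class="ltx_noindent">
          <p>As a rough illustration of a knowledge-gain signal, the sketch below tracks a scalar summary of the model, here its loss, and returns the absolute change over a fixed horizon of n steps. Using the loss as the model summary and returning zero until enough history is available are assumptions of this sketch. The class name is hypothetical, and the code is not the implementation of any of the cited methods.</p>
          <verbatim font="typewriter">
```python
from collections import deque


class LossProgressReward:
    """Intrinsic reward as the absolute change of the model loss over n steps."""

    def __init__(self, horizon):
        # Keep the last horizon + 1 losses, so that the oldest stored value
        # is exactly `horizon` steps behind the newest one.
        self.history = deque(maxlen=horizon + 1)

    def update(self, loss):
        """Record the loss at the current step and return the reward."""
        self.history.append(loss)
        if len(self.history) != self.history.maxlen:
            return 0.0  # not enough history yet (assumption of this sketch)
        return abs(self.history[-1] - self.history[0])


# Toy usage: a loss that decreases over four model updates.
tracker = LossProgressReward(horizon=2)
for loss in [1.0, 0.8, 0.5, 0.4]:
    print(tracker.update(loss))
```
          </verbatim>
          <p>With a horizon of two steps, the first two calls return zero, and later calls compare the current loss with the loss recorded two steps earlier.</p>
        </para>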
      </paragraph>
      <paragraph inlist="toc" xml:id="S4.SS1.SSS0.Px3">
        <title>Environment morphology <Math mode="inline" tex="\chi[\mathcal{M}]" text="chi * delimited-[]@(M)" xml:id="S4.SS1.SSS0.Px3.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" name="chi" role="UNKNOWN">χ</XMTok>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="delimited-[]"/>
                    <XMRef idref="S4.SS1.SSS0.Px3.m1.1"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">[</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN" xml:id="S4.SS1.SSS0.Px3.m1.1">M</XMTok>
                    <XMTok role="CLOSE" stretchy="false">]</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math></title>
        <para class="ltx_noindent" xml:id="S4.SS1.SSS0.Px3.p1">
          <p>defines a set of reward signals that characterize the structural properties of the world:</p>
          <equation xml:id="S4.E9">
            <tags>
              <tag>(9)</tag>
              <tag role="autoref">Equation 9</tag>
              <tag role="refnum">9</tag>
            </tags>
            <Math mode="display" tex="\chi[\mathcal{M}]:R_{int}=X[\mathcal{M},S]," text="chi * delimited-[]@(M) colon R _ (i * n * t) = X * closed-interval@(M, S)" xml:id="S4.E9.m1">
              <XMath>
                <XMDual>
                  <XMRef idref="S4.E9.m1.4"/>
                  <XMWrap>
                    <XMApp xml:id="S4.E9.m1.4">
                      <XMTok name="colon" role="METARELOP">:</XMTok>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMTok font="italic" name="chi" role="UNKNOWN">χ</XMTok>
                        <XMDual>
                          <XMApp>
                            <XMTok meaning="delimited-[]"/>
                            <XMRef idref="S4.E9.m1.1"/>
                          </XMApp>
                          <XMWrap>
                            <XMTok role="OPEN" stretchy="false">[</XMTok>
                            <XMTok font="caligraphic" role="UNKNOWN" xml:id="S4.E9.m1.1">M</XMTok>
                            <XMTok role="CLOSE" stretchy="false">]</XMTok>
                          </XMWrap>
                        </XMDual>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="equals" role="RELOP">=</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="italic" role="UNKNOWN">R</XMTok>
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMTok font="italic" role="UNKNOWN">X</XMTok>
                          <XMDual>
                            <XMApp>
                              <XMTok meaning="closed-interval"/>
                              <XMRef idref="S4.E9.m1.2"/>
                              <XMRef idref="S4.E9.m1.3"/>
                            </XMApp>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false">[</XMTok>
                              <XMTok font="caligraphic" role="UNKNOWN" xml:id="S4.E9.m1.2">M</XMTok>
                              <XMTok role="PUNCT">,</XMTok>
                              <XMTok font="italic" role="UNKNOWN" xml:id="S4.E9.m1.3">S</XMTok>
                              <XMTok role="CLOSE" stretchy="false">]</XMTok>
                            </XMWrap>
                          </XMDual>
                        </XMApp>
                      </XMApp>
                    </XMApp>
                    <XMTok role="PUNCT">,</XMTok>
                  </XMWrap>
                </XMDual>
              </XMath>
            </Math>
          </equation>
          <p>where <Math mode="inline" tex="X" text="X" xml:id="S4.SS1.SSS0.Px3.p1.m1">
              <XMath>
                <XMTok font="italic" role="UNKNOWN">X</XMTok>
              </XMath>
            </Math> is a specific function that numerically characterizes one of the morphological properties of the environment. Such a signal is not explicitly associated with the learning process of either the policy or the world model, since it involves no comparison, neither of models with each other nor of a model with itself at different time steps. This intrinsic motivation signal characterizes the morphology of the environment in order to mark states or actions that are important in that particular environment. For example, empowerment <cite class="ltx_citemacro_cite"><bibref bibrefs="klyubin_all_2005,volpi_goal-directed_2020" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite> defines the reward as the information capacity between the sequence of actions <Math mode="inline" tex="A^{h}" text="A ^ h" xml:id="S4.SS1.SSS0.Px3.p1.m2">
              <XMath>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">A</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">h</XMTok>
                </XMApp>
              </XMath>
            </Math> and the subsequent state <Math mode="inline" tex="\tilde{S}" text="tilde@(S)" xml:id="S4.SS1.SSS0.Px3.p1.m3">
              <XMath>
                <XMApp>
                  <XMTok name="tilde" role="OVERACCENT" stretchy="false">~</XMTok>
                  <XMTok font="italic" role="UNKNOWN">S</XMTok>
                </XMApp>
              </XMath>
            </Math>. Another example is the reachability signal, which helps to form intrinsic goals for the agent <cite class="ltx_citemacro_cite"><bibref bibrefs="mezghani_walk_2022,savinov_episodic_2019" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>.</p>
        </para>
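        <para>
          <p>As an illustration of such a structural signal, the following minimal sketch computes h-step empowerment for a hypothetical deterministic one-dimensional corridor. It relies on the special case that, for deterministic dynamics, the capacity between action sequences and the resulting state reduces to the logarithm of the number of distinct reachable states. The corridor, the function names, and all parameters are assumptions made for illustration and are not taken from the cited works.</p>
          <verbatim font="typewriter"><![CDATA[
```python
import itertools
import math

ACTIONS = (-1, 0, 1)  # hypothetical action set: left, stay, right


def step(state, action, size):
    """Deterministic corridor dynamics: move and clip to [0, size - 1]."""
    return min(max(state + action, 0), size - 1)


def empowerment(state, horizon, size):
    """h-step empowerment (in bits) for deterministic dynamics.

    With deterministic transitions, the capacity between the action
    sequence A^h and the final state equals log2 of the number of
    distinct final states reachable within `horizon` steps.
    """
    reachable = set()
    for seq in itertools.product(ACTIONS, repeat=horizon):
        s = state
        for a in seq:
            s = step(s, a, size)
        reachable.add(s)
    return math.log2(len(reachable))


# Structural intrinsic reward for every cell of a 7-cell corridor, horizon 2.
rewards = [empowerment(s, horizon=2, size=7) for s in range(7)]
```
]]></verbatim>
          <p>In this toy setting, cells next to a wall receive a lower value than cells in the middle of the corridor, so the signal marks states from which the agent has more options, independently of any learning progress.</p>
        </para>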
      </paragraph>
    </subsection>
    <subsection inlist="toc" labels="LABEL:subsec:additional_reward" xml:id="S4.SS2">
      <tags>
        <tag>4.2</tag>
        <tag role="autoref">subsection 4.2</tag>
        <tag role="refnum">4.2</tag>
        <tag role="typerefnum">§4.2</tag>
      </tags>
      <title><tag close=" ">4.2</tag>Intrinsic reward as a complement to extrinsic</title>
      <para xml:id="S4.SS2.p1">
        <p>Defining the reward signal is one way to influence the performance of the task policy. The extrinsic reward is the main component that determines the target behavior of the agent. However, it is often difficult to define an extrinsic signal that keeps the agent from stagnating in local optima and allows learning in a reasonable time. Closely related is the sparse reward problem, in which the agent cannot make progress because it rarely receives a feedback signal. One way to overcome this difficulty is to complement the sparse extrinsic reward with an auxiliary dense signal, the intrinsic reward (e.g., to solve the game "Montezuma’s Revenge" <cite class="ltx_citemacro_cite"><bibref bibrefs="burda_exploration_2018" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>).</p>
      </para>
      <para xml:id="S4.SS2.p2">
        <p>The extrinsic and intrinsic signals are typically mixed as a linear combination of rewards:</p>
        <equation labels="LABEL:eq:lin_intr_extr" xml:id="S4.E10">
          <tags>
            <tag>(10)</tag>
            <tag role="autoref">Equation 10</tag>
            <tag role="refnum">10</tag>
          </tags>
          <Math mode="display" tex="[R,R_{int}]:r=R+\alpha R_{int}," text="closed-interval@(R, R _ (i * n * t)) colon r = R + alpha * R _ (i * n * t)" xml:id="S4.E10.m1">
            <XMath>
              <XMDual>
                <XMRef idref="S4.E10.m1.2"/>
                <XMWrap>
                  <XMApp xml:id="S4.E10.m1.2">
                    <XMTok name="colon" role="METARELOP">:</XMTok>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="closed-interval"/>
                        <XMRef idref="S4.E10.m1.1"/>
                        <XMRef idref="S4.E10.m1.2.1"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">[</XMTok>
                        <XMTok font="italic" role="UNKNOWN" xml:id="S4.E10.m1.1">R</XMTok>
                        <XMTok role="PUNCT">,</XMTok>
                        <XMApp xml:id="S4.E10.m1.2.1">
                          <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="italic" role="UNKNOWN">R</XMTok>
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">]</XMTok>
                      </XMWrap>
                    </XMDual>
                    <XMApp>
                      <XMTok meaning="equals" role="RELOP">=</XMTok>
                      <XMTok font="italic" role="UNKNOWN">r</XMTok>
                      <XMApp>
                        <XMTok meaning="plus" role="ADDOP">+</XMTok>
                        <XMTok font="italic" role="UNKNOWN">R</XMTok>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMTok font="italic" name="alpha" role="UNKNOWN">α</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="italic" role="UNKNOWN">R</XMTok>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                            </XMApp>
                          </XMApp>
                        </XMApp>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMTok role="PUNCT">,</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>
        </equation>
        <p>which is used in the classical version of policy learning (see eq. (<ref labelref="LABEL:eq:loss_pi_e"/>)). For example, such a method is implemented in <cite class="ltx_citemacro_cite"><bibref bibrefs="pathak_curiosity-driven_2017,houthooft_vime_2017,kim_emi_2019" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>. However, it is also possible to combine the intrinsic and extrinsic value functions instead of the raw rewards (e.g., the MEEE algorithm <cite class="ltx_citemacro_cite"><bibref bibrefs="yao_sample_2021" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>), which keeps the information collected about the extrinsic task in a value function separate from the exploratory signal.</p>
      </para>
      <para xml:id="S4.SS2.p3">
        <p>The mixed reward signal is biased with respect to the given MDP. To eliminate this effect, either the coefficient <Math mode="inline" tex="\alpha" text="alpha" xml:id="S4.SS2.p3.m1">
            <XMath>
              <XMTok font="italic" name="alpha" role="UNKNOWN">α</XMTok>
            </XMath>
          </Math> (see eq. (<ref labelref="LABEL:eq:lin_intr_extr"/>)) must be adaptive and decay over time, or the intrinsic signal itself must converge to zero as the agent learns. The latter property holds for the <Math mode="inline" tex="\mathcal{L}[\mathcal{M}]" text="L * delimited-[]@(M)" xml:id="S4.SS2.p3.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="delimited-[]"/>
                    <XMRef idref="S4.SS2.p3.m2.1"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">[</XMTok>
                    <XMTok font="caligraphic" role="UNKNOWN" xml:id="S4.SS2.p3.m2.1">M</XMTok>
                    <XMTok role="CLOSE" stretchy="false">]</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math><note mark="3" role="footnote" xml:id="footnote3"><tags>
              <tag>3</tag>
              <tag role="autoref">footnote 3</tag>
              <tag role="refnum">3</tag>
              <tag role="typerefnum">footnote 3</tag>
            </tags>In some cases, however, the prediction error yields a persistent reward in response to uncontrollable noise in the environment, e.g., the noisy TV problem <cite class="ltx_citemacro_cite"><bibref bibrefs="pathak_curiosity-driven_2017" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>.</note> and <Math mode="inline" tex="\Delta\mathcal{M}" text="Delta * M" xml:id="S4.SS2.p3.m3">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok name="Delta" role="UNKNOWN">Δ</XMTok>
                <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
              </XMApp>
            </XMath>
          </Math> signals.</p>
      </para>
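      <para>
        <p>A minimal sketch of the linear mixing in eq. (<ref labelref="LABEL:eq:lin_intr_extr"/>) with a decaying coefficient is given below. The exponential schedule, the half-life parameter, and the function names are assumptions made for illustration; the surveyed methods use their own schedules.</p>
        <verbatim font="typewriter"><![CDATA[
```python
def alpha_schedule(step, alpha0=1.0, half_life=10_000):
    """Hypothetical schedule: alpha halves every `half_life` training steps."""
    return alpha0 * 0.5 ** (step / half_life)


def mixed_reward(r_ext, r_int, step, alpha0=1.0, half_life=10_000):
    """Linear mixing r = R + alpha * R_int with a time-dependent alpha."""
    return r_ext + alpha_schedule(step, alpha0, half_life) * r_int


# Early in training the intrinsic term dominates a sparse extrinsic reward;
# late in training the mixed reward approaches the extrinsic one.
early = mixed_reward(r_ext=0.0, r_int=2.0, step=0)
late = mixed_reward(r_ext=1.0, r_int=2.0, step=200_000)
```
]]></verbatim>
        <p>Because the coefficient tends to zero, the bias introduced by the intrinsic term vanishes in the limit, even if the intrinsic signal itself does not converge to zero.</p>
      </para>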
      <para xml:id="S4.SS2.p4">
        <p>The scheme for applying the model through a complementary reward to improve exploration consists of several components (see Fig. <ref labelref="LABEL:fig:model_reward"/>). The world model generates an intrinsic reward. The intrinsic motivation signal, mixed with the task reward, trains the task policy and provides it with an exploratory component. This component helps the agent to collect the necessary data in the <Math mode="inline" tex="\mathcal{D}_{\mathcal{M}}" text="D _ M" xml:id="S4.SS2.p4.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">M</XMTok>
              </XMApp>
            </XMath>
          </Math> memory (see eq. (<ref labelref="LABEL:eq:model_buffer"/>)). The model is used in this way until it is sufficiently well trained (so that the intrinsic motivation signal drops to zero) or until the task policy becomes successful (a heuristic for the adaptive coefficient <Math mode="inline" tex="\alpha" text="alpha" xml:id="S4.SS2.p4.m2">
            <XMath>
              <XMTok font="italic" name="alpha" role="UNKNOWN">α</XMTok>
            </XMath>
          </Math>).</p>
      </para>
      <figure inlist="lof" labels="LABEL:fig:model_reward" placement="h" xml:id="S4.F5">
        <tags>
          <tag>Figure 5</tag>
          <tag role="autoref">Figure 5</tag>
          <tag role="refnum">5</tag>
          <tag role="typerefnum">Figure 5</tag>
        </tags>
        <graphics candidates="model_reward_eng.png" class="ltx_centering" graphic="model_reward_eng.png" options="width=325.215pt" xml:id="S4.F5.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">5</tag>The scheme of the world model application through the reward to the training of the agent.</toccaption>
        <caption class="ltx_centering"><tag close=": ">Figure 5</tag>The scheme of the world model application through the reward to the training of the agent.</caption>
      </figure>
    </subsection>
    <subsection inlist="toc" labels="LABEL:subsec:expl_policy" xml:id="S4.SS3">
      <tags>
        <tag>4.3</tag>
        <tag role="autoref">subsection 4.3</tag>
        <tag role="refnum">4.3</tag>
        <tag role="typerefnum">§4.3</tag>
      </tags>
      <title><tag close=" ">4.3</tag>Exploration policy</title>
      <para xml:id="S4.SS3.p1">
        <p>Mixing an intrinsic signal with an extrinsic reward ties the learning of the policy to the learning of the world model, which can become a problem. A model trained in this way is biased toward achieving a specific goal and therefore needs additional fine-tuning on new tasks. This problem is resolved by introducing an additional exploration policy that is independent of the task policy and whose only objective is to generate training data that does not depend on the specific goal given to the agent.</p>
      </para>
      <para xml:id="S4.SS3.p2">
        <p>The problem of learning an exploratory policy (see Section <ref labelref="LABEL:subsec:losses"/>) is defined similarly to learning a task policy, except that the reward is determined only by intrinsic motivation.</p>
      </para>
      <para xml:id="S4.SS3.p3">
        <p>The exploration policy can work alone, training the agent’s model, which is later used for specific tasks in the environment. Thus, an agent with a world model can learn to achieve a goal set by extrinsic motivation without interacting with the environment and can immediately perform the task at an acceptable level. For example, in <cite class="ltx_citemacro_cite"><bibref bibrefs="sekar_planning_2020" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite> and <cite class="ltx_citemacro_cite"><bibref bibrefs="shyam_model-based_2019" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>, data collection by the exploration policies (Plan2Explore and MAX, respectively) into <Math mode="inline" tex="\mathcal{D}_{\mathcal{M}}" text="D _ M" xml:id="S4.SS3.p3.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">M</XMTok>
              </XMApp>
            </XMath>
          </Math> is based on the variance of the ensemble of models. Moreover, the policy itself is trained inside the model: the memory <Math mode="inline" tex="\mathcal{D}_{\epsilon}" text="D _ epsilon" xml:id="S4.SS3.p3.m2">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                <XMTok font="italic" fontsize="70%" name="epsilon" role="UNKNOWN">ϵ</XMTok>
              </XMApp>
            </XMath>
          </Math> is determined by the model <Math mode="inline" tex="\mathcal{M}" text="M" xml:id="S4.SS3.p3.m3">
            <XMath>
              <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
            </XMath>
          </Math>. The CEE-US agent <cite class="ltx_citemacro_cite"><bibref bibrefs="sancaktar_curious_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite> works according to a similar scheme but uses an ensemble of graph networks. The task policy can also help train the model by collecting data into the <Math mode="inline" tex="\mathcal{D}_{\mathcal{M}}" text="D _ M" xml:id="S4.SS3.p3.m4">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="caligraphic" role="UNKNOWN">D</XMTok>
                <XMTok font="caligraphic" fontsize="70%" role="UNKNOWN">M</XMTok>
              </XMApp>
            </XMath>
          </Math> memory. Usually, the proportion of samples from different policies is an agent hyperparameter (see <cite class="ltx_citemacro_cite"><bibref bibrefs="mendonca_discovering_2021" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>).</p>
      </para>
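      <para>
        <p>The ensemble-variance signal mentioned above can be sketched as follows. This is a toy illustration under stated assumptions, not the implementation of Plan2Explore or MAX: the ensemble members are hypothetical linear dynamics models, and the intrinsic reward is taken to be the variance of their next-state predictions averaged over state dimensions.</p>
        <verbatim font="typewriter"><![CDATA[
```python
import random
from statistics import pvariance


def predict(weights, state, action):
    """Next-state prediction of one hypothetical linear dynamics model."""
    x = list(state) + list(action)
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def disagreement_reward(ensemble, state, action):
    """Intrinsic reward: ensemble variance averaged over state dimensions."""
    preds = [predict(weights, state, action) for weights in ensemble]
    per_dim = [pvariance(dim_preds) for dim_preds in zip(*preds)]
    return sum(per_dim) / len(per_dim)


rng = random.Random(0)
D, A, K = 3, 2, 5  # state size, action size, ensemble size (all assumed)
base = [[rng.gauss(0, 1) for _ in range(D + A)] for _ in range(D)]

# Members that agree (well-explored region) versus perturbed members.
agreeing = [base for _ in range(K)]
disagreeing = [[[w + rng.gauss(0, 0.5) for w in row] for row in base]
               for _ in range(K)]

s = [rng.gauss(0, 1) for _ in range(D)]
a = [rng.gauss(0, 1) for _ in range(A)]
r_known = disagreement_reward(agreeing, s, a)
r_novel = disagreement_reward(disagreeing, s, a)
```
]]></verbatim>
        <p>When all members make the same prediction the reward vanishes, whereas disagreement between members produces a positive reward that directs the exploration policy toward poorly modeled transitions.</p>
      </para>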
      <para xml:id="S4.SS3.p4">
        <p>In addition to training the model to solve previously unknown tasks, the exploration policy can be used as an initialization for the task policy. For example, in <cite class="ltx_citemacro_cite"><bibref bibrefs="groth_is_2021" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>, the authors propose to save copies of the exploration policy during learning to use them as a set of pretrained skills.</p>
      </para>
      <para xml:id="S4.SS3.p5">
        <p>Thus, the world model is involved in agent training through the exploration policy as follows (see Fig. <ref labelref="LABEL:fig:model_policy"/>). The model trains the exploration policy on intrinsic reward and on data generated by the model. This policy then collects from the environment the data most relevant for training the world model. A side application of the exploration policy is the initialization of the task policy and the definition of intrinsically motivated goals (e.g., the LEXA agent <cite class="ltx_citemacro_cite"><bibref bibrefs="mendonca_discovering_2021" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>).</p>
      </para>
      <figure inlist="lof" labels="LABEL:fig:model_policy" placement="h" xml:id="S4.F6">
        <tags>
          <tag>Figure 6</tag>
          <tag role="autoref">Figure 6</tag>
          <tag role="refnum">6</tag>
          <tag role="typerefnum">Figure 6</tag>
        </tags>
        <graphics candidates="model_policy_eng.png" class="ltx_centering" graphic="model_policy_eng.png" options="width=325.215pt" xml:id="S4.F6.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">6</tag>The scheme of the model application through the exploration policy to the training of the agent.</toccaption>
        <caption class="ltx_centering"><tag close=": ">Figure 6</tag>The scheme of the model application through the exploration policy to the training of the agent.</caption>
      </figure>
    </subsection>
    <subsection inlist="toc" labels="LABEL:subsec:intr_goals" xml:id="S4.SS4">
      <tags>
        <tag>4.4</tag>
        <tag role="autoref">subsection 4.4</tag>
        <tag role="refnum">4.4</tag>
        <tag role="typerefnum">§4.4</tag>
      </tags>
      <title><tag close=" ">4.4</tag>Intrinsically motivated goals</title>
      <para xml:id="S4.SS4.p1">
        <p>To train an agent to achieve goals, it is necessary to define the space of goals and set a schedule for their selection. The standard way to create a goal space is to make it identical to the state space. The choice of goals can itself be represented as an MDP: the agent has a hierarchical structure in which, at the lower level, the policy <Math mode="inline" tex="\pi(a|s,g)" text="pi * conditional@(a, list@(s, g))" xml:id="S4.SS4.p1.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                <XMDual>
                  <XMRef idref="S4.SS4.p1.m1.3"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp xml:id="S4.SS4.p1.m1.3">
                      <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                      <XMTok font="italic" role="UNKNOWN">a</XMTok>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="list"/>
                          <XMRef idref="S4.SS4.p1.m1.1"/>
                          <XMRef idref="S4.SS4.p1.m1.2"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok font="italic" role="UNKNOWN" xml:id="S4.SS4.p1.m1.1">s</XMTok>
                          <XMTok role="PUNCT">,</XMTok>
                          <XMTok font="italic" role="UNKNOWN" xml:id="S4.SS4.p1.m1.2">g</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> selects actions, and at the next level, the policy <Math mode="inline" tex="\pi(g|s)" text="pi * conditional@(g, s)" xml:id="S4.SS4.p1.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" name="pi" role="UNKNOWN">π</XMTok>
                <XMDual>
                  <XMRef idref="S4.SS4.p1.m2.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMApp xml:id="S4.SS4.p1.m2.1">
                      <XMTok meaning="conditional" role="MODIFIEROP" stretchy="false">|</XMTok>
                      <XMTok font="italic" role="UNKNOWN">g</XMTok>
                      <XMTok font="italic" role="UNKNOWN">s</XMTok>
                    </XMApp>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
            </XMath>
          </Math> selects the goals that determine the behavior of the lower level. This representation makes it possible to apply standard methods of model-based policy learning to train the upper-level policy with intrinsic rewards (e.g., Director <cite class="ltx_citemacro_cite"><bibref bibrefs="hafner_deep_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>).</p>
      </para>
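      <para>
        <p>The two-level structure described above can be sketched on a toy example. The sketch below assumes a one-dimensional line as the environment, a goal space equal to the state space, a randomly sampling upper level, and a hand-coded lower level; all of these are illustrative assumptions and do not reproduce the Director agent.</p>
        <verbatim font="typewriter"><![CDATA[
```python
import random


def high_level_policy(state, goal_space, rng):
    """pi(g | s): hypothetical manager that samples a goal different from s."""
    candidates = [g for g in goal_space if g != state]
    return rng.choice(candidates)


def low_level_policy(state, goal):
    """pi(a | s, g): hypothetical worker that steps toward the goal."""
    if goal > state:
        return 1
    if goal < state:
        return -1
    return 0


def rollout(start, goal_space, steps, goal_period, seed=0):
    """Run the two-level agent; a new goal is chosen every goal_period steps."""
    rng = random.Random(seed)
    state, goal, trajectory = start, None, []
    for t in range(steps):
        if t % goal_period == 0:
            goal = high_level_policy(state, goal_space, rng)
        action = low_level_policy(state, goal)
        state += action
        trajectory.append((state, goal, action))
    return trajectory


trajectory = rollout(start=0, goal_space=range(5), steps=12, goal_period=4)
```
]]></verbatim>
        <p>In a learned agent, both functions would be trained policies, and the upper level would receive the intrinsic reward; here they only make the interface between the two levels explicit.</p>
      </para>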
      <para xml:id="S4.SS4.p2">
        <p>Many agents do not use such a homogeneous hierarchical representation: they define goals from the history of the agent’s interaction with the environment or the world model. In this case, the policy learning algorithm consists of two subtasks. The first is to collect a set of goals or construct a goal space. The second is to sample goals from the previously formed set.</p>
      </para>
      <paragraph inlist="toc" xml:id="S4.SS4.SSS0.Px1">
        <title>Formation of the goal set.</title>
        <para xml:id="S4.SS4.SSS0.Px1.p1">
          <p>The world model can define a set of goals whose pursuit improves the model itself. Here, goals are states in which the model makes poor predictions. For example, an agent uses an exploratory policy trained on data from the model to acquire a set of states that become potential targets for the policy (LEXA <cite class="ltx_citemacro_cite"><bibref bibrefs="mendonca_discovering_2021" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>).</p>
        </para>
        <para xml:id="S4.SS4.SSS0.Px1.p2">
          <p>The world model can also select the entire set of states that the agent can achieve, on the basis of information about their reachability, so as not to try to learn the impossible. This approach usually uses a simplified model version, which determines the probability of reaching one state from another in some fixed number of steps (e.g., <cite class="ltx_citemacro_cite"><bibref bibrefs="mezghani_walk_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>). Another way is to bind the representation of the goals with the states from which they are achievable (e.g., CC-RIG <cite class="ltx_citemacro_cite"><bibref bibrefs="nair_contextual_2020" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>).</p>
        </para>
        <para xml:id="S4.SS4.SSS0.Px1.p3">
          <p>The model can also be used to form an algorithmically convenient representation of transitions and states in the environment. Such a representation highlights the morphology of the world, from which a set of goals can be determined without unnecessary complications. For example, one can use an object-oriented representation in which the state vector is built from raw sensory information and its components correspond to the characteristics of individual objects (as implemented in the SMORL algorithm <cite class="ltx_citemacro_cite"><bibref bibrefs="zadaianchuk_smorl_2020" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
                </bibref></cite>). Alternatively, the model can define an interaction graph with individual objects as nodes. This makes it possible to decompose the achievement of a common goal into subgoals, each of which is an intrinsic goal for the agent (e.g., the SRICS algorithm <cite class="ltx_citemacro_cite"><bibref bibrefs="zadaianchuk_self-supervised_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>).</p>
        </para>
      </paragraph>
      <paragraph inlist="toc" xml:id="S4.SS4.SSS0.Px2">
        <title>The choice of goals</title>
        <para xml:id="S4.SS4.SSS0.Px2.p1">
          <p>is primarily determined by the same principles that formed the space of goals. When forming a set of goals, a numerical selection criterion is used (e.g., the probability of reaching a goal, see above), which naturally serves as the priority for choosing a goal<note mark="4" role="footnote" xml:id="footnote4"><tags>
                <tag>4</tag>
                <tag role="autoref">footnote 4</tag>
                <tag role="refnum">4</tag>
                <tag role="typerefnum">footnote 4</tag>
              </tags>It is worth noting that this is not the only possible signal, but it is one that uses information from the world model. There are other signals, for example, the agent’s success in achieving the goal, see the review <cite class="ltx_citemacro_cite"><bibref bibrefs="oudeyer_what_2007" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase>(</bibrefphrase>
                  <bibrefphrase>)</bibrefphrase>
                </bibref></cite>.</note>. Alternatively, the constructed interaction graph can specify the learning schedule for individual nodes (SRICS <cite class="ltx_citemacro_cite"><bibref bibrefs="zadaianchuk_self-supervised_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                <bibrefphrase>(</bibrefphrase>
                <bibrefphrase>)</bibrefphrase>
              </bibref></cite>).</p>
        </para>
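        <para>
          <p>A minimal sketch of goal selection driven by a numerical criterion is given below. It assumes that a learned reachability model supplies, for each candidate goal, an estimated probability of reaching it in a fixed number of steps; the scores, the threshold, and the choice of sampling proportionally to the score are hypothetical and serve only as an illustration.</p>
          <verbatim font="typewriter"><![CDATA[
```python
import random


def select_goal(reachability, rng, threshold=0.05):
    """Sample a goal with probability proportional to its reachability score.

    reachability: dict mapping a candidate goal to the estimated probability
                  of reaching it (hypothetical values standing in for the
                  output of a learned reachability model).
    Goals scored below `threshold` are treated as unreachable and dropped.
    """
    feasible = {g: p for g, p in reachability.items() if p >= threshold}
    if not feasible:
        raise ValueError("no feasible goals")
    goals = list(feasible)
    weights = [feasible[g] for g in goals]
    return rng.choices(goals, weights=weights, k=1)[0]


scores = {"door": 0.6, "key": 0.3, "roof": 0.01}
rng = random.Random(0)
samples = [select_goal(scores, rng) for _ in range(1000)]
```
]]></verbatim>
          <p>The same interface accommodates other priorities, for example the model prediction error in a goal state, by replacing the scores passed to the function.</p>
        </para>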
        <para xml:id="S4.SS4.SSS0.Px2.p2">
          <p>Thus, the world model makes it possible to choose goals for further learning either through intrinsic reward (see Fig. 7, arrow A) or through an exploration policy (see Fig. 7, arrow B). The model can also help to choose goals that are important in a particular environment based on the learned structure of the world (see Fig. 7, arrow C).</p>
        </para>
        <figure inlist="lof" labels="LABEL:fig:model_goal" placement="h" xml:id="S4.F7">
          <tags>
            <tag>Figure 7</tag>
            <tag role="autoref">Figure 7</tag>
            <tag role="refnum">7</tag>
            <tag role="typerefnum">Figure 7</tag>
          </tags>
          <graphics candidates="model_goal_eng.png" class="ltx_centering" graphic="model_goal_eng.png" options="width=325.215pt" xml:id="S4.F7.g1"/>
          <toccaption class="ltx_centering"><tag close=" ">7</tag>The scheme of the model application through setting goals to the agent’s training. Dashed arrows reveal the relationship between goal setting and model-based intrinsic motivation.</toccaption>
          <caption class="ltx_centering"><tag close=": ">Figure 7</tag>The scheme of the model application through setting goals to the agent’s training. Dashed arrows reveal the relationship between goal setting and model-based intrinsic motivation.</caption>
        </figure>
      </paragraph>
    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:problems" xml:id="S5">
    <tags>
      <tag>5</tag>
      <tag role="autoref">section 5</tag>
      <tag role="refnum">5</tag>
      <tag role="typerefnum">§5</tag>
    </tags>
    <title><tag close=" ">5</tag>Challenges and problems</title>
    <para xml:id="S5.p1">
      <p>The main application of intrinsic motivation methods is to address reinforcement learning problems related to exploring the environment and the agent’s capabilities (see the review <cite class="ltx_citemacro_cite"><bibref bibrefs="aubret_survey_2019" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>). These problems are sparse rewards, construction of the latent space, formation of abstract skills, and the training schedule.</p>
    </para>
    <para xml:id="S5.p2">
      <p>Reinforcement learning algorithms perform well when the agent receives a dense reward, i.e., a reward for almost every action it performs. However, in some cases such a signal is absent or <emph font="italic">sparse</emph> (e.g., in "Montezuma’s Revenge" <cite class="ltx_citemacro_cite"><bibref bibrefs="bellemare13arcade" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>). Intrinsic motivation methods suggest using a specific exploratory policy (see Section <ref labelref="LABEL:subsec:expl_policy"/>) or adding a dense intrinsic reward signal that directs the agent toward exploration. Moreover, the trained world model corrects the agent’s behavior by taking into account what is already known about the environment.</p>
    </para>
    <para xml:id="S5.p3">
      <p>In deep reinforcement learning, the <emph font="italic">latent representations</emph> that artificial neural networks learn from the reward signal are task-specific, which makes it difficult to apply the trained agent to tasks with a different reward signal. Intrinsic motivation enables task-independent learning aimed at obtaining useful information from the environment and building representations that reflect the environment dynamics rather than the specifics of the task (similar methods were considered when discussing the formation of a goal space; see the review <cite class="ltx_citemacro_cite"><bibref bibrefs="aubret_survey_2019" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> for more details).</p>
    </para>
    <para xml:id="S5.p4">
      <p>Training an agent with skills<note mark="5" role="footnote" xml:id="footnote5"><tags>
            <tag>5</tag>
            <tag role="autoref">footnote 5</tag>
            <tag role="refnum">5</tag>
            <tag role="typerefnum">footnote 5</tag>
          </tags>abstract actions that group certain behaviors to achieve an intermediate goal</note> raises two more problems: <emph font="italic">the formation of skills</emph> and <emph font="italic">the learning schedule</emph>. In the literature on intrinsic motivation, this area of research is referred to as competence-based motivation. The studies <cite class="ltx_citemacro_cite"><bibref bibrefs="oudeyer_what_2007,aubret_survey_2019" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> consider this type of motivation in detail, and <cite class="ltx_citemacro_cite"><bibref bibrefs="forestier_intrinsically_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> proposes the IMGEP approach, which takes it as a basis. As discussed above, one way to form goals, each of which explicitly defines a corresponding skill, is to use a world model. The relationships between the representations formed in the model then make it possible to determine the sequence of learning goals.</p>
    </para>
    <para xml:id="S5.p5">
      <p>Intrinsic motivation addresses many RL problems. However, intrinsic motivation methods have problems and limitations of their own. Many intrinsic reward signals are based on the prediction error. If the environment is stochastic, such an error may be irreducible, so it keeps attracting the agent without improving its behavior; a well-known example is the noisy TV problem <cite class="ltx_citemacro_cite"><bibref bibrefs="burda_exploration_2018" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>. Using an exploratory policy raises the question of when and for how long exploratory motives should determine the agent’s behavior. Many intrinsic motivation methods, especially those that do not use a world model, provide only short-term exploration. The model-based approach with planning addresses this drawback.</p>
    </para>
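    <para>
      <p>The irreducible prediction error described above can be illustrated with a minimal sketch. It is not taken from any of the cited methods: the one-parameter running-mean predictor, the function name <text font="typewriter">mean_prediction_error</text>, the learning rate, and the uniform noise source are all assumptions made for illustration. The sketch trains the same predictor on a deterministic transition and on a purely random one and compares the remaining squared prediction error, which plays the role of the intrinsic reward.</p>
      <p><verbatim font="typewriter">
```python
import random


def mean_prediction_error(sample_next, steps=5000, lr=0.05, seed=0):
    """Train a one-parameter predictor online (hypothetical toy model).

    Returns the mean squared prediction error, used here as the intrinsic
    reward, averaged over the last 1000 steps.
    """
    rng = random.Random(seed)
    prediction = 0.0
    errors = []
    for _ in range(steps):
        target = sample_next(rng)
        # Intrinsic reward for this step: squared prediction error.
        errors.append((target - prediction) ** 2)
        # Move the prediction toward the observed target.
        prediction += lr * (target - prediction)
    tail = errors[-1000:]
    return sum(tail) / len(tail)


# Deterministic transition: the next observation is always the same,
# so the error (and the intrinsic reward) vanishes as the model learns.
deterministic = mean_prediction_error(lambda rng: 1.0)

# "Noisy TV": the next observation is pure noise, so the error stays
# close to the noise variance no matter how long the model is trained.
noisy_tv = mean_prediction_error(lambda rng: rng.uniform(-1.0, 1.0))

print(f"deterministic: {deterministic:.6f}")
print(f"noisy TV:      {noisy_tv:.6f}")
```
</verbatim></p>
      <p>In this toy setting the reward from the deterministic transition decays to nearly zero, while the reward from the noise source does not decay, so an agent that maximizes prediction error would keep returning to the noise source.</p>
    </para>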
    <para xml:id="S5.p6">
      <p>Problems faced by some intrinsic motivation methods can be solved by others. For example, to choose between exploration and exploitation, each of these behaviors can be represented as a separate skill and trained within that paradigm (see Section <ref labelref="LABEL:subsec:intr_goals"/>). Thus, integrating intrinsic motivation approaches into a single system is a promising task that remains unsolved, although there is some work in this direction (IMGEP <cite class="ltx_citemacro_cite"><bibref bibrefs="forestier_intrinsically_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, Vygotskian Artificial Intelligence <cite class="ltx_citemacro_cite"><bibref bibrefs="colas_vygotskian_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>).</p>
    </para>
  </section>
  <section inlist="toc" labels="LABEL:sec:conclusion" xml:id="S6">
    <tags>
      <tag>6</tag>
      <tag role="autoref">section 6</tag>
      <tag role="refnum">6</tag>
      <tag role="typerefnum">§6</tag>
    </tags>
    <title><tag close=" ">6</tag>Conclusion</title>
    <para xml:id="S6.p1">
      <p>This article reviews existing intrinsic motivation methods that build a world model for learning. On the one hand, a model improves learning performance in the environment by providing additional experience; however, this is standard practice in many reinforcement learning algorithms unrelated to intrinsic motivation. On the other hand, the model, as a source of the experience accumulated by the agent, determines the intrinsic motivation that guides the agent in exploration.</p>
    </para>
    <para xml:id="S6.p2">
      <p>We present a framework that systematizes the considered methods. Three ways of applying the world model determine the main classes of the proposed classification. The first is based on <emph font="italic">complementing the main reward</emph> with intrinsic signals from the model. Among these signals, we identify three types: uncertainty <Math mode="inline" tex="\mathcal{L}[\mathcal{M}]" text="L * delimited-[]@(M)" xml:id="S6.p2.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
              <XMDual>
                <XMApp>
                  <XMTok meaning="delimited-[]"/>
                  <XMRef idref="S6.p2.m1.1"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">[</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN" xml:id="S6.p2.m1.1">M</XMTok>
                  <XMTok role="CLOSE" stretchy="false">]</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math> (usually model prediction error), knowledge gain <Math mode="inline" tex="\Delta\mathcal{M}" text="Delta * M" xml:id="S6.p2.m2">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok name="Delta" role="UNKNOWN">Δ</XMTok>
              <XMTok font="caligraphic" role="UNKNOWN">M</XMTok>
            </XMApp>
          </XMath>
        </Math> (change in the model itself when new information is received), and environment morphology <Math mode="inline" tex="\chi[\mathcal{M}]" text="chi * delimited-[]@(M)" xml:id="S6.p2.m3">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" name="chi" role="UNKNOWN">χ</XMTok>
              <XMDual>
                <XMApp>
                  <XMTok meaning="delimited-[]"/>
                  <XMRef idref="S6.p2.m3.1"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">[</XMTok>
                  <XMTok font="caligraphic" role="UNKNOWN" xml:id="S6.p2.m3.1">M</XMTok>
                  <XMTok role="CLOSE" stretchy="false">]</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math> (internal relationships between elements of the world model). The second class relies on applying the model to train <emph font="italic">the exploration policy</emph>. The third class contains methods that form <emph font="italic">the goals and the sequence in which they are achieved</emph>, based on the model’s signals and structural properties.</p>
    </para>
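    <para>
      <p>To make the three signal types concrete, the following sketch computes one possible instance of each on a toy tabular model. All choices here are illustrative assumptions rather than definitions from the reviewed methods: the class <text font="typewriter">CountModel</text>, the use of surprise (negative log-likelihood) as the uncertainty signal, the KL divergence between the model’s predictions after and before an update as the knowledge gain, and the number of distinct observed successors as a stand-in for a structural (morphology) signal.</p>
      <p><verbatim font="typewriter">
```python
import math
from collections import defaultdict


class CountModel:
    """Hypothetical toy world model: smoothed transition counts."""

    def __init__(self, n_states):
        self.n_states = n_states
        self.counts = defaultdict(lambda: [0] * n_states)

    def probs(self, state, action):
        # Predictive distribution over next states with add-one smoothing.
        row = self.counts[(state, action)]
        total = sum(row) + self.n_states
        return [(c + 1) / total for c in row]

    def update(self, state, action, next_state):
        self.counts[(state, action)][next_state] += 1


def intrinsic_signals(model, state, action, next_state):
    """Return (uncertainty, knowledge_gain, morphology) for one transition."""
    before = model.probs(state, action)
    # Uncertainty: surprise of the observed transition under the model.
    uncertainty = -math.log(before[next_state])
    model.update(state, action, next_state)
    after = model.probs(state, action)
    # Knowledge gain: how much the update changed the model's prediction,
    # measured as KL(after || before).
    knowledge_gain = sum(q * math.log(q / p) for p, q in zip(before, after))
    # Morphology (assumed proxy): number of distinct successors observed
    # for this state-action pair in the learned transition graph.
    morphology = sum(1 for c in model.counts[(state, action)] if c > 0)
    return uncertainty, knowledge_gain, morphology


model = CountModel(n_states=4)
first = intrinsic_signals(model, state=0, action=0, next_state=1)
for _ in range(50):
    last = intrinsic_signals(model, state=0, action=0, next_state=1)

print("first visit:", first)
print("after 50 repeats:", last)
```
</verbatim></p>
      <p>In this sketch both the uncertainty and the knowledge gain shrink as the same transition is repeated, whereas the structural signal depends only on the shape of the learned transition graph.</p>
    </para>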
    <para xml:id="S6.p3">
      <p>Intrinsic motivation algorithms often implement only one of the ways to apply the model (see Table <ref labelref="LABEL:table:intr_methods"/>) without using its full potential. Consistently coordinating complementary intrinsic rewards, the exploration policy, and intrinsic goals is the direction of our future research.</p>
    </para>
  </section>
  <bibliography citestyle="numbers" files="bibliography" xml:id="bib">
    <title>References</title>
  </bibliography>
</document>
