<?xml version="1.0" encoding="UTF-8"?>
<?latexml searchpaths="/home/japhy/scienceReplication.artiswrong.com/paper_files/arxiv/2005.02880/latex_extracted"?>
<?latexml class="article"?>
<?latexml package="iclr2020_conference,times"?>
<?latexml package="amsmath,amsfonts,bm"?>
<?latexml package="hyperref"?>
<?latexml package="url"?>
<?latexml package="ifthen"?>
<?latexml package="graphicx"?>
<?latexml package="xcolor"?>
<?latexml RelaxNGSchema="LaTeXML"?>
<document xmlns="http://dlmf.nist.gov/LaTeXML" class="ltx_authors_1line">
  <resource src="LaTeXML.css" type="text/css"/>
  <resource src="ltx-article.css" type="text/css"/>
  <title>Exploring Exploration: Comparing Children with RL Agents in Unified Environments</title>
  <creator role="author">
    <personname>Eliza Kosoy<Math mode="inline" tex="{}^{1}" text="^1" xml:id="m1">
        <XMath>
          <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
            <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
          </XMApp>
        </XMath>
      </Math>  , Jasmine Collins<Math mode="inline" tex="{}^{1}" text="^1" xml:id="m2">
        <XMath>
          <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
            <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
          </XMApp>
        </XMath>
      </Math>, David M. Chan<Math mode="inline" tex="{}^{1}" text="^1" xml:id="m3">
        <XMath>
          <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
            <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
          </XMApp>
        </XMath>
      </Math>, Sandy Huang<Math mode="inline" tex="{}^{2}" text="^2" xml:id="m4">
        <XMath>
          <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
            <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
          </XMApp>
        </XMath>
      </Math>, Deepak Pathak<Math mode="inline" tex="{}^{1}" text="^1" xml:id="m5">
        <XMath>
          <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
            <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
          </XMApp>
        </XMath>
      </Math>,<break/><text font="bold">Pulkit Agrawal<Math mode="inline" tex="{}^{3}" text="^3" xml:id="m6">
          <XMath>
            <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
              <XMTok font="medium" fontsize="70%" meaning="3" role="NUMBER">3</XMTok>
            </XMApp>
          </XMath>
        </Math>, John Canny<Math mode="inline" tex="{}^{1}" text="^1" xml:id="m7">
          <XMath>
            <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
              <XMTok font="medium" fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
            </XMApp>
          </XMath>
        </Math>, Alison Gopnik<Math mode="inline" tex="{}^{1}" text="^1" xml:id="m8">
          <XMath>
            <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
              <XMTok font="medium" fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
            </XMApp>
          </XMath>
        </Math>, &amp; Jessica B. Hamrick<Math mode="inline" tex="{}^{2}" text="^2" xml:id="m9">
          <XMath>
            <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
              <XMTok font="medium" fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
            </XMApp>
          </XMath>
        </Math><break/><Math mode="inline" tex="{}^{1}" text="^1" xml:id="m10">
          <XMath>
            <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
              <XMTok font="medium" fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
            </XMApp>
          </XMath>
        </Math></text>University of California, Berkeley<break/><Math mode="inline" tex="{}^{2}" text="^2" xml:id="m11">
        <XMath>
          <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
            <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
          </XMApp>
        </XMath>
      </Math>DeepMind, London <break/><Math mode="inline" tex="{}^{3}" text="^3" xml:id="m12">
        <XMath>
          <XMApp role="FLOATSUPERSCRIPT" scriptpos="1">
            <XMTok fontsize="70%" meaning="3" role="NUMBER">3</XMTok>
          </XMApp>
        </XMath>
      </Math>Massachusetts Institute of Technology <break/></personname>
    <contact role="thanks">Corresponding author: <text font="typewriter">eko@berkeley.edu</text></contact>
  </creator>
  <abstract name="Abstract">
    <p>Research in developmental psychology consistently shows that children explore the world thoroughly and efficiently and that this exploration allows them to learn. In turn, this early learning supports more robust generalization and intelligent behavior later in life.
While much work has gone into developing methods for exploration in machine learning, artificial agents have not yet reached the high standard set by their human counterparts.
In this work we propose using DeepMind Lab <cite class="ltx_citemacro_citep">(<bibref bibrefs="beattie2016deepmind" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
          <bibrefphrase>, </bibrefphrase>
        </bibref>)</cite> as a platform to directly compare child and agent behaviors and to develop new exploration techniques.
We outline two ongoing experiments to demonstrate the effectiveness of a direct comparison, and identify a number of open research questions that we believe can be tested using this methodology.</p>
  </abstract>
  <section inlist="toc" labels="LABEL:sec:intro" xml:id="S1">
    <tags>
      <tag>1</tag>
      <tag role="autoref">section 1</tag>
      <tag role="refnum">1</tag>
      <tag role="typerefnum">§1</tag>
    </tags>
    <title><tag close=" ">1</tag>The Problem of Exploration</title>
    <para xml:id="S1.p1">
      <p>The problem of exploration is one of the most fundamental issues in reinforcement learning (RL): how should an agent gather enough experience from different parts of the world in order to later produce approximately optimal behavior?
Questions surrounding exploration have been investigated for as long as the field has existed <cite class="ltx_citemacro_citep">(<bibref bibrefs="thompson1933likelihood,robbins1952some,gittins1979dynamic" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, and they continue to be a major focus of research today, with recent approaches estimating various quantities to guide exploration, such as visit counts <cite class="ltx_citemacro_citep">(<bibref bibrefs="bellemare2013arcade,ostrovski2017count,martin2017count,tang2017exploration,machado2018count" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, uncertainty <cite class="ltx_citemacro_citep">(<bibref bibrefs="osband2016deep,burda2018exploration" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, surprise <cite class="ltx_citemacro_citep">(<bibref bibrefs="schmidhuber1991curious,pathak2017curiosity" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, learning progress <cite class="ltx_citemacro_citep">(<bibref bibrefs="kaplan2004maximizing,baranes2013active" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, disagreement <cite class="ltx_citemacro_citep">(<bibref bibrefs="pathak2019self" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, or other forms of novelty <cite class="ltx_citemacro_citep">(<bibref bibrefs="fu2017ex2" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>; see <cite class="ltx_citemacro_citet"><bibref bibrefs="Lavet_2018" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> for a recent review.
Yet, despite these efforts, the problem of exploration remains far from solved.
Indeed, algorithms which achieve state-of-the-art performance on benchmarks such as Atari often still rely on simple exploration strategies like <Math mode="inline" tex="\epsilon" text="epsilon" xml:id="S1.p1.m1">
          <XMath>
            <XMTok font="italic" name="epsilon" role="UNKNOWN">ϵ</XMTok>
          </XMath>
        </Math>-greedy combined with huge amounts of computation <cite class="ltx_citemacro_citep">(<bibref bibrefs="kapturowski2018recurrent" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>.</p>
    </para>
    <para xml:id="S1.p2">
      <p>As with artificial agents, exploration is a key feature of human behavior.
Dating back to <cite class="ltx_citemacro_citet"><bibref bibrefs="Piaget1933" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, developmental researchers have conceived of children as active and curious learners who are intrinsically motivated to explore the world in systematic and rational ways <cite class="ltx_citemacro_citep">(<bibref bibrefs="Schulz_2007,Cook,Legare_2012,Schulz_2012,Ruggeri_2019" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>; see <cite class="ltx_citemacro_citet"><bibref bibrefs="Schulz_2012" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> for a review.
The simplest example of this exploration may be in the way that active movement through space informs both object understanding and navigation.
For example, when infants become mobile and begin to crawl, this exploration appears to allow them to learn about both space and objects <cite class="ltx_citemacro_citep">(<bibref bibrefs="campos2000travel" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Even 11-month-old infants choose to physically explore objects that violate expectations of object solidity or object support rather than a novel distractor object <cite class="ltx_citemacro_citep">(<bibref bibrefs="Stahl_2015" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>.
Older children also explore in more complex ways, both when evidence does not conform to their expectations <cite class="ltx_citemacro_citep">(<bibref bibrefs="Bonawitz_2012,Legare_2012" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> and when evidence is causally confounded <cite class="ltx_citemacro_citep">(<bibref bibrefs="Schulz_2007,Glymour_2007,Cook,Schulz_2012" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>.
The simple fact that children know less than adults may make them more open to new kinds of learning and exploration <cite class="ltx_citemacro_citep">(<bibref bibrefs="Lucas_2014,Gopnik_2017" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Recent evidence suggests that children do indeed explore more than adults, and that this translates into higher amounts of learning, even when exploration comes at a cost <cite class="ltx_citemacro_citep">(<bibref bibrefs="Liquin_2019,schulz2019searching,Sumner_2019,Hartley_2019" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>.
Moreover, such learning is rapid and supports powerful, abstract generalizations <cite class="ltx_citemacro_citep">(<bibref bibrefs="Xu_2017,Bonawitz_2020" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>.
For example, <cite class="ltx_citemacro_citet"><bibref bibrefs="Xu_2017" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> found that in the course of playing with a toy that was activated by different shaped and colored blocks, preschoolers could develop an abstract “overhypothesis” about how the toy functioned, such as determining whether the blocks worked based on their color or their shape, and use that overhypothesis to make inferences about a new toy or block.</p>
    </para>
    <para xml:id="S1.p3">
      <p>The generalization and rapid learning that result from children’s exploration stand in stark contrast to what modern RL agents exhibit.
We suggest that by performing <emph font="italic">direct</emph>, controlled comparisons between children and agents, we may be able to leverage insights from children’s exploratory behavior to improve the design of RL algorithms.
Although previous approaches to exploration have been motivated qualitatively by human behavior <cite class="ltx_citemacro_citep">(e.g. <bibref bibrefs="pathak2017curiosity" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, they typically have not included direct comparisons, making it difficult to know whether such methods actually capture the behavior they are inspired by.
For example, while learning exhibited by children is typically measured over a handful of trials, the learning done by curious RL agents is measured over millions of environment steps.
Moreover, experiments with children have not been performed in the types of navigation-centric environments where agents are often trained; we therefore do not know the extent to which results in the developmental literature even apply to such agents.</p>
    </para>
    <para xml:id="S1.p4">
      <p>Prior work has demonstrated the value of human baselines as a useful comparison for agent behavior in other settings.
For example, the baselines on Atari games provided by <cite class="ltx_citemacro_citet"><bibref bibrefs="mnih2015human" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> have been widely influential in RL, providing a point of comparison for hundreds of experiments.
Other work goes beyond using human data as a baseline, for example by using it to illuminate key differences between human and agent priors <cite class="ltx_citemacro_citep">(<bibref bibrefs="dubey2018investigating" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>.
However, with a few interesting exceptions <cite class="ltx_citemacro_citep">(<bibref bibrefs="bambach2018toddler,seita2019zpd" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, most existing comparisons have been done with adults.
We argue instead for using <emph font="italic">children</emph> as direct inspiration for research in exploration. Very young children learn extensively, and, unlike adults, they explore widely, ubiquitously and effectively with little direct training, explicit education or reflection. In fact, arguably most human learning takes place in childhood.</p>
    </para>
    <para xml:id="S1.p5">
      <p>Here, we present a methodology based on DeepMind Lab <cite class="ltx_citemacro_citep">(<bibref bibrefs="beattie2016deepmind" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> for directly comparing child and agent behavior in simulated exploration tasks, allowing us to precisely test questions about how children explore, how agents explore, and how and why they differ.
Using this methodology, we propose two candidate experiments designed to test key qualitative predictions of different exploration algorithms with respect to what is known about children’s exploration behavior in other domains.
Although we are still collecting data in these experiments, we present some preliminary analyses which already raise interesting questions, setting the stage for further research to inspire new approaches to the problem of exploration.</p>
    </para>
    <figure inlist="lof" labels="LABEL:fig:Kid_mazes" xml:id="S1.F1">
      <tags>
        <tag>Figure 1</tag>
        <tag role="autoref">Figure 1</tag>
        <tag role="refnum">1</tag>
        <tag role="typerefnum">Figure 1</tag>
      </tags>
      <inline-para align="center" class="ltx_minipage" vattach="middle" width="325.2pt">
        <para xml:id="S1.F1.p1">
          <graphics candidates="images/kid_mazes_2.pdf" graphic="images/kid_mazes_2.pdf" options="width=433.62pt" xml:id="S1.F1.p1.g1"/>
        </para>
      </inline-para>
      <toccaption class="ltx_centering"><tag close=" ">1</tag>(Left) Child using the Arduino-based controller to explore a maze in DeepMind Lab. (Middle) The maze that the child sees on the screen. (Right) Top-down view of maze layout.</toccaption>
      <caption class="ltx_centering"><tag close=": ">Figure 1</tag>(Left) Child using the Arduino-based controller to explore a maze in DeepMind Lab. (Middle) The maze that the child sees on the screen. (Right) Top-down view of maze layout.</caption>
    </figure>
  </section>
  <section inlist="toc" labels="LABEL:sec:questions" xml:id="S2">
    <tags>
      <tag>2</tag>
      <tag role="autoref">section 2</tag>
      <tag role="refnum">2</tag>
      <tag role="typerefnum">§2</tag>
    </tags>
    <title><tag close=" ">2</tag>Directly Comparing Children and Agents</title>
    <para xml:id="S2.p1">
      <p>We propose using DeepMind Lab <cite class="ltx_citemacro_citep">(<bibref bibrefs="beattie2016deepmind" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> as a unified environment for training and evaluating both humans and agents.
DeepMind Lab is a learning environment, based on the Quake game engine, that provides a suite of challenging 3D navigation and puzzle-solving tasks for learning agents.
Completing these tasks requires physical or spatial navigation capabilities, and the tasks are modeled after games that children themselves play.
In our experimental setup, children are allowed to interact with the DeepMind Lab environment through a custom Arduino-based controller shown in <ref class="ltx_refmacro_autoref" labelref="LABEL:fig:Kid_mazes" show="autoref"/>.
This controller exposes the same four actions that agents would use in this environment (move forward, move back, turn left, and turn right).
More technical details about the environment are available in Appendix <ref labelref="LABEL:app:environment"/>.</p>
    </para>
    <para xml:id="S2.p2">
      <p>DeepMind Lab allows us to place agents and humans on more even footing, enabling a more precise exploration of the differences in child and agent behavior.
In particular, we emphasize that the Lab environment is more ecologically valid than other standard RL environments, that it enables more controlled comparisons than are typically seen in the RL literature, and that it provides an avenue for developing new cognitive models.</p>
    </para>
    <para xml:id="S2.p3">
      <p><text font="bold">Ecological validity.</text>
One key reason for gathering data from children and agents in the same environment is that it forces agents to be evaluated in a more ecologically valid setting, compared to grid world-like settings or 2D Atari games <cite class="ltx_citemacro_citep">(<bibref bibrefs="bellemare2013arcade" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>.
DeepMind Lab combines rich visuals of a simulated world with first-person view, which is much closer to the situated nature of human experience (and which may play an important role in generalization; see <cite class="ltx_citemacro_citet"><bibref bibrefs="hill2019emergent" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>).
Furthermore, navigating through the mazes in DeepMind Lab is interesting enough to hold children’s attention throughout the task.</p>
    </para>
    <para xml:id="S2.p4">
      <p><text font="bold">Controlled comparisons.</text>
Comparing children directly to agents provides a standard baseline for the evaluation of agent behavior, and can assist in identifying areas of promising research in deep RL.
For example, while much of the literature in developmental psychology has focused on free exploration behavior, the majority of work on exploration in artificial intelligence and machine learning has been for goal-seeking domains.
Thus, the quality of an exploration method is typically measured by how much it improves the learning speed and final performance of an agent on a particular task, rather than how well it enables this agent to acquire knowledge and generalize to other tasks.
While recent work in meta-learning <cite class="ltx_citemacro_citep">(e.g. <bibref bibrefs="finn2017model" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> has begun to expand the metrics that we use beyond single-task reward to transfer efficiency, such papers often lack a strong baseline for human performance and behavior.</p>
    </para>
    <para xml:id="S2.p5">
      <p><text font="bold">Cognitive modeling.</text>
In addition to allowing for an ecologically valid experimental setting, these direct comparisons give strong direction in the development of new cognitive models of behavior, furthering the “virtuous cycle” between cognitive science and AI <cite class="ltx_citemacro_citep">(<bibref bibrefs="hassabis2017neuroscience" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>.
Collecting experimental data from children in the exact environment in which we will test our artificial agents allows us to directly evaluate learned behavior, as well as to design challenging test-time environments that reveal the circumstances under which agent behavior and child behavior diverge.
These divergences can shed light on issues in both RL and cognitive science: How do RL agents react to classical pitfalls for humans?
How do humans react to the classical pitfalls for artificial agents?
Can we create a unified theory?
</p>
    </para>
  </section>
  <section inlist="toc" labels="LABEL:sec:experiments" xml:id="S3">
    <tags>
      <tag>3</tag>
      <tag role="autoref">section 3</tag>
      <tag role="refnum">3</tag>
      <tag role="typerefnum">§3</tag>
    </tags>
    <title><tag close=" ">3</tag>Illustrative Experiments and Results</title>
    <para xml:id="S3.p1">
      <p>Although we are still in the process of collecting and analyzing the data, our preliminary results begin to demonstrate the utility of a direct comparison between children and agents. Both experiments below have been approved by UC Berkeley’s IRB.</p>
    </para>
    <subsection inlist="toc" labels="LABEL:sec:exp_1" xml:id="S3.SS1">
      <tags>
        <tag>3.1</tag>
        <tag role="autoref">subsection 3.1</tag>
        <tag role="refnum">3.1</tag>
        <tag role="typerefnum">§3.1</tag>
      </tags>
      <title><tag close=" ">3.1</tag>Experiment 1: Free versus goal-directed exploration</title>
      <para xml:id="S3.SS1.p1">
        <p>Our first experiment is designed to determine if there are differences in the exploration strategies of children who are faced with an unknown environment.
In this experiment, children explored the virtual DeepMind Lab mazes using a custom-built child-friendly controller (see Appendix <ref labelref="LABEL:app:exp1"/> for full details and maze layouts).
They completed two mazes one after another, each with the same layout.
In the first maze, they were told to explore freely (the “no-goal” condition), and in the second maze they were told to search for a “gummy” (the “goal” condition).
Our initial results suggest that children exhibit a wide range of variability in how much they explore in the no-goal condition, with “low explorers” only exploring about 22% of the maze, “medium explorers” exploring about 44%, and “high explorers” exploring up to 71% of the maze.
We see a relationship between the level of exploration in the no-goal condition and the number of steps taken to find the gummy in the goal condition: low explorers take the most steps to reach the goal (95 on average), medium explorers take 89 steps, and high explorers take 66 steps on average.</p>
      </para>
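      <para>
        <p>As a rough illustration of how an exploration percentage of this kind could be computed, the sketch below treats the maze as a set of traversable grid cells and reports the fraction of those cells that a trajectory enters at least once. The grid discretization, the function name <text font="typewriter">coverage</text>, and the toy maze are assumptions made for illustration; they are not the analysis used in the study, which is described in the appendix.</p>
        <verbatim font="typewriter">
```python
def coverage(visited, open_cells):
    """Fraction of traversable maze cells that the explorer entered at least once.

    Assumption: the maze is discretized into (row, col) cells; this is a
    hypothetical representation, not the one used in the study.
    """
    open_set = set(open_cells)
    if not open_set:
        raise ValueError("maze has no traversable cells")
    # Count each cell once, and ignore any positions outside the maze.
    return len(set(visited).intersection(open_set)) / len(open_set)


# Hypothetical 3x3 open maze and a short trajectory that revisits its start cell.
open_cells = [(r, c) for r in range(3) for c in range(3)]
trajectory = [(0, 0), (0, 1), (0, 0), (1, 0)]
print(coverage(trajectory, open_cells))
```
</verbatim>
        <p>Under this definition, revisiting a cell does not increase coverage, so a child who paces back and forth in one corridor would still count as a low explorer.</p>
      </para>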
      <para xml:id="S3.SS1.p2">
        <p>We also find that children’s search strategies differ between the no-goal and goal conditions.
We compared children’s behavior to that of a depth-first search (DFS) agent, which pursues an unexplored path until it reaches a dead end, at which point it turns around and explores the last unexplored path it has seen.
More details of this agent and an analysis of the experiment are available in Appendix <ref labelref="LABEL:app:exp1"/>.
We find that children made choices consistent with DFS 89.61% of the time in the no-goal condition, compared with 96.04% of the time in the goal condition (<Math mode="inline" tex="p=0.0073" text="p = 0.0073" xml:id="S3.SS1.p2.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" role="UNKNOWN">p</XMTok>
                <XMTok meaning="0.0073" role="NUMBER">0.0073</XMTok>
              </XMApp>
            </XMath>
          </Math>).</p>
      </para>
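      <para>
        <p>The DFS behavior described above can be sketched as follows. This is a minimal illustration on a grid maze, not the agent used in the study: the cell representation, the fixed order in which neighbours are tried, and the name <text font="typewriter">dfs_explore</text> are all assumptions.</p>
        <verbatim font="typewriter">
```python
def dfs_explore(open_cells, start):
    """Return the order in which a depth-first explorer first enters each cell.

    The explorer keeps following an unexplored neighbour. At a dead end it
    backtracks to the most recent cell that still has an unexplored neighbour.
    Assumption: cells are (row, col) pairs and moves are tried in a fixed order.
    """
    open_set = set(open_cells)
    moves = [(-1, 0), (0, 1), (1, 0), (0, -1)]  # assumed order: up, right, down, left
    order = [start]
    seen = {start}
    stack = [start]  # current path from the start cell to the explorer
    while stack:
        r, c = stack[-1]
        for dr, dc in moves:
            nxt = (r + dr, c + dc)
            if nxt in open_set and nxt not in seen:
                seen.add(nxt)
                order.append(nxt)
                stack.append(nxt)
                break
        else:
            stack.pop()  # dead end: step back along the current path
    return order


# Hypothetical T-shaped maze: a corridor along the top row with a branch going down.
cells = [(0, 0), (0, 1), (0, 2), (1, 1), (2, 1)]
print(dfs_explore(cells, (0, 0)))
```
</verbatim>
        <p>Scoring a human trajectory against such an agent additionally requires a rule for deciding when a choice counts as consistent with DFS; that rule is left to the appendix and is not reproduced here.</p>
      </para>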
      <para xml:id="S3.SS1.p3">
        <p>In combination, the above two results suggest that during the no-goal condition children build a mental model of the maze, which in “high explorers” is necessarily more complete.
During the goal condition, children are able to leverage this mental model to perform more efficient, DFS-like search towards the goal. These preliminary results suggest that we can start to understand the children’s behavior in terms of existing algorithms for search and exploration.
In contrast, RL agents are unlikely to exhibit directed, efficient exploration. Most state-of-the-art approaches for guiding exploration in RL agents (Sec. <ref labelref="LABEL:sec:intro"/>) depend on the agent first stumbling upon an interesting area by chance, and then encourage the agent to revisit that area until it is no longer “interesting.” In other words, RL agents are retrospective, rather than prospective, explorers.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S3.SS2">
      <tags>
        <tag>3.2</tag>
        <tag role="autoref">subsection 3.2</tag>
        <tag role="refnum">3.2</tag>
        <tag role="typerefnum">§3.2</tag>
      </tags>
      <title><tag close=" ">3.2</tag>Experiment 2: Sparse versus dense rewards</title>
      <para xml:id="S3.SS2.p1">
        <p>RL agents typically learn best using dense reward signals.
However, dense rewards make agents less incentivized to explore, and can thus lead to poor generalization.
We are interested in characterizing the extent to which dense rewards affect exploration behavior in children.
If children are less susceptible to over-fitting to dense rewards, their behavior could shed light on how to design RL algorithms with better generalization.</p>
      </para>
      <para xml:id="S3.SS2.p2">
        <p>To test this, we designed a second experiment with children aged 4-6, in which children complete two mazes, each with three phases. In the first phase, the children explore the maze in one of three conditions: a “no-goal” condition, where there is no goal; a “sparse” condition, where a goal exists but there is no local reward; or a “dense” condition, where a goal and local rewards leading to the goal are present (see Appendix <ref labelref="LABEL:app:exp2"/> for specific experiment details).
In the second phase, children are asked to find the goal again, which is in the same location as during exploration.
In the final phase, they are asked to find the goal, but the optimal route to the goal is blocked.
We hypothesize that both children and RL agents in the “dense” condition will (1) follow the dense-reward path directly to the goal in the first phase and (2) find the goal faster in the second phase (since they can repeat the previous dense-reward path), but (3) take longer to find the goal in the final phase than those in the “sparse” condition, because the previous dense-reward path to the goal is now blocked. Some RL agents in the “dense” condition may not find the goal at all in the final phase if they are unable to switch from exploitation to exploration when they find that the path is blocked.</p>
      </para>
      <para xml:id="S3.SS2.p3">
        <p>While we are still collecting data, initial experimental data suggest that children are less likely to explore an area in the dense rewards condition. Surprisingly, however, that lack of exploration does not hurt their performance in the final phase.</p>
      </para>
    </subsection>
  </section>
  <section inlist="toc" xml:id="S4">
    <tags>
      <tag>4</tag>
      <tag role="autoref">section 4</tag>
      <tag role="refnum">4</tag>
      <tag role="typerefnum">§4</tag>
    </tags>
    <title><tag close=" ">4</tag>Discussion</title>
    <para xml:id="S4.p1">
      <p>Even the most sophisticated methods for exploration in RL tend to explore only in the service of a specific goal, and are usually driven by error rather than seeking information.
We believe that truly intelligent agents must do as children do: actively explore their environments, perform experiments, and gather information to weave together into a rich model of the world, which can later be used to rapidly perform new tasks.
Our proposed paradigm uses DeepMind Lab to support this endeavor by allowing us to identify the areas where agents and children already act similarly and those in which they do not.
Indeed, our preliminary results already suggest qualitative differences between the exploration behavior of children and agents: for example, we expect that most deep RL agents will not replicate the DFS-like behavior that we observed in children in Experiment 1.</p>
    </para>
    <para xml:id="S4.p2">
      <p>This work only begins to touch on a number of deep questions regarding how children and agents explore.
The two experiments presented here address how much children and agents are willing to explore, whether free and goal-directed exploration strategies differ, and how reward shaping affects exploration.
Yet our setup allows us to ask many more questions, and we have concrete plans to do so.
These include: how easily do agents and children get distracted by irrelevant stimuli or objects in a maze?
To what extent can children and agents remember and integrate information during exploration to aid in future tasks?
How do children react to both positive and negative rewards, and explore mazes safely?
In asking these questions, we will be able to acquire a deeper understanding of the way that children and agents explore novel environments, and how to close the gap between them.</p>
    </para>
  </section>
  <section inlist="toc" xml:id="S5">
    <tags>
      <tag>5</tag>
      <tag role="autoref">section 5</tag>
      <tag role="refnum">5</tag>
      <tag role="typerefnum">§5</tag>
    </tags>
    <title><tag close=" ">5</tag>Acknowledgements</title>
    <para xml:id="S5.p1">
      <p>We would like to thank David Szepesvari and Alyosha Efros for helpful discussions and feedback on this work.</p>
    </para>
    <para xml:id="S5.p2">
      <p>This material is based upon work supported by the
DARPA – Machine Common Sense grant (HR001119S0005) and the National Science Foundation Graduate Research Fellowship under Grant No. 1752814.</p>
    </para>
  </section>
  <bibliography citestyle="authoryear" files="iclr2020_conference" xml:id="bib">
    <title>References</title>
  </bibliography>
  <pagination role="newpage"/>
  <appendix xml:id="Ax1">
    <title>Appendix/Supplementary Materials</title>
  </appendix>
  <appendix inlist="toc" labels="LABEL:app:environment" xml:id="A1">
    <tags>
      <tag>Appendix A</tag>
      <tag role="autoref">Appendix A</tag>
      <tag role="refnum">A</tag>
      <tag role="typerefnum">Appendix A</tag>
    </tags>
    <title><tag close=" ">Appendix A</tag>DeepMind Lab Environment</title>
    <toctitle><tag close=" ">A</tag>DeepMind Lab Environment</toctitle>
    <para xml:id="A1.p1">
      <p>Observations in the DeepMind Lab environment are rendered at 30 FPS (close to human perception), and actions cause the avatar in the maze to move forward or backward or to turn left or right. Each forward or backward action moves the avatar 10-15 game units in the game space, which is equivalent to about one fifth of a cell in Figure <ref labelref="LABEL:fig:exp1"/>.
Because both children and agents can interact with DeepMind Lab, we record the same type of state, action, and trajectory information in both cases.
Trajectories are discretized by determining which cell the avatar is in (see Figure <ref labelref="LABEL:fig:exp1"/>) and recording a new state only when that cell differs from the cell at the previous time step. These trajectories are then directly compared.</p>
    </para>
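    <para>
      <p>The discretization step described above can be sketched as follows. This is only an illustrative sketch, not the released analysis code: it assumes that positions are logged as (x, y) pairs in game units and that cells are squares of a fixed side length. The function name and the <text font="typewriter">cell_size</text> parameter are hypothetical.</p>
```python
from typing import Iterable, List, Tuple

Cell = Tuple[int, int]


def discretize_trajectory(
    positions: Iterable[Tuple[float, float]], cell_size: float
) -> List[Cell]:
    """Map continuous (x, y) positions to grid cells, keeping only cell changes."""
    cells: List[Cell] = []
    for x, y in positions:
        # Assumption: cells are axis-aligned squares with side `cell_size`.
        cell = (int(x // cell_size), int(y // cell_size))
        # Record a new state only when the avatar has entered a different cell.
        if not cells or cells[-1] != cell:
            cells.append(cell)
    return cells


# Made-up positions for illustration; these are not data from the study.
frames = [(5.0, 5.0), (20.0, 5.0), (65.0, 5.0), (70.0, 5.0), (70.0, 130.0)]
print(discretize_trajectory(frames, cell_size=60.0))  # [(0, 0), (1, 0), (1, 2)]
```
      <p>Because consecutive duplicates are dropped, the resulting sequence has the same form whether the raw positions come from a child or from an agent, which is what allows the trajectories to be compared directly.</p>
    </para>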
  </appendix>
  <appendix inlist="toc" xml:id="A2">
    <tags>
      <tag>Appendix B</tag>
      <tag role="autoref">Appendix B</tag>
      <tag role="refnum">B</tag>
      <tag role="typerefnum">Appendix B</tag>
    </tags>
    <title><tag close=" ">Appendix B</tag>Experimental Details</title>
    <toctitle><tag close=" ">B</tag>Experimental Details</toctitle>
    <subsection inlist="toc" labels="LABEL:app:exp1" xml:id="A2.SS1">
      <tags>
        <tag>B.1</tag>
        <tag role="autoref">subsection B.1</tag>
        <tag role="refnum">B.1</tag>
        <tag role="typerefnum">§B.1</tag>
      </tags>
      <title><tag close=" ">B.1</tag>Experiment 1</title>
      <figure inlist="lof" labels="LABEL:fig:exp1" placement="h" xml:id="A2.F2">
        <tags>
          <tag>Figure 2</tag>
          <tag role="autoref">Figure 2</tag>
          <tag role="refnum">2</tag>
          <tag role="typerefnum">Figure 2</tag>
        </tags>
        <inline-para align="center" class="ltx_minipage" vattach="middle" width="355.6pt">
          <para xml:id="A2.F2.p1">
            <graphics candidates="images/3mazes_EXP1.png" graphic="images/3mazes_EXP1.png" options="width=433.62pt" xml:id="A2.F2.p1.g1"/>
          </para>
        </inline-para>
        <toccaption class="ltx_centering"><tag close=" ">2</tag>(Top) What the child sees at the start of each of the 3 parts of the maze game in DeepMind Lab. (Bottom) Top-down view of the maze layout for each of the 3 parts of the experiment.</toccaption>
        <caption class="ltx_centering"><tag close=": ">Figure 2</tag>(Top) What the child sees at the start of each of the 3 parts of the maze game in DeepMind Lab. (Bottom) Top-down view of the maze layout for each of the 3 parts of the experiment.</caption>
      </figure>
      <para xml:id="A2.SS1.p1">
        <p>Our first experiment is designed to determine if there are differences in the exploration strategies of children when faced with an unknown environment.
In this experiment, children explored the virtual DeepMind Lab mazes using our child-friendly controller.
They completed three mazes one after another, each with the same overall layout.
In the first maze, they were told to explore freely (the “no-goal” condition); in the second maze, they were told to search for a “gummy” (the “goal” condition); and in the third maze, they were told to search for the gummy when the most direct path to the goal is blocked (the “blocked” condition).
This “blocked” condition design is directly inspired by Tolman mazes <cite class="ltx_citemacro_citep">(<bibref bibrefs="Tollman1946" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> designed for rats, which demonstrate the ability to re-orient themselves to find a reward in a blocked condition.
For this experiment, we pre-registered a sample size of 50 children aged 4-5.</p>
      </para>
      <para xml:id="A2.SS1.p2">
        <p>Our initial results suggest that children exhibit a wide range of variability in how much they explore in the no-goal condition. Using K-means, we group the children into 3 types of explorers (low, medium, and high) based on how much they explored in the first “no-goal” part. “Low explorers” only explored about 22% of the maze, “medium explorers” explored about 44%, and “high explorers” explored up to 71% of the maze.
Low explorers take about 94.89 steps on average to reach the goal, medium explorers take about 89.4 steps, and high explorers take about 66.01 steps.
This suggests that naturally exploring more in Part A (without prompts) helps children find the goal in Part B. We do not find any correlations between explorer type and steps taken to reach the goal in Part C (the blocked condition), but plan to explore this further.</p>
      </para>
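      <para>
        <p>As an illustration of the grouping step, the sketch below clusters one-dimensional exploration fractions into three groups with Lloyd's algorithm, the standard K-means procedure. The text above only states that K-means was used, so the initialization scheme, the function name, and the example values are assumptions made for this sketch; the values are invented and are not the study data.</p>
```python
from typing import List, Sequence


def kmeans_1d(values: Sequence[float], k: int = 3, iters: int = 100) -> List[int]:
    """Cluster scalars with Lloyd's algorithm; label 0 is the lowest-centroid group."""
    ordered = sorted(values)
    # Assumed deterministic initialisation: spread centroids over the sorted data.
    centers = [ordered[(2 * i + 1) * len(ordered) // (2 * k)] for i in range(k)]
    labels: List[int] = []
    for _ in range(iters):
        # Assignment step: each value goes to its nearest centroid.
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: move each centroid to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break  # converged
        centers = new_centers
    # Relabel so that 0 = low, 1 = medium, 2 = high explorers.
    rank = {j: r for r, j in enumerate(sorted(range(k), key=lambda j: centers[j]))}
    return [rank[lab] for lab in labels]


# Hypothetical fractions of the maze explored by seven children.
explored = [0.20, 0.24, 0.22, 0.45, 0.43, 0.70, 0.72]
print(kmeans_1d(explored))  # [0, 0, 0, 1, 1, 2, 2]
```
        <p>With a single feature, K-means amounts to choosing two thresholds on the exploration fraction, so the low, medium, and high labels are easy to interpret.</p>
      </para>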
      <para xml:id="A2.SS1.p3">
        <p><text font="bold">Comparison to Depth First Search:</text> Depth first search is a systematic search algorithm which operates by greedily traversing a path until no further traversal can be made. It then backtracks to the most recent branching point which has unexplored branches, and then explores down a new, unexplored branch.</p>
      </para>
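      <para>
        <p>For concreteness, a minimal sketch of depth first search as an agent walking through a maze is given below. It assumes the maze is represented as a dictionary mapping each cell to its list of adjacent open cells; this representation, the function name, and the small T-shaped example maze are hypothetical.</p>
```python
from typing import Dict, List, Tuple

Cell = Tuple[int, int]


def dfs_walk(maze: Dict[Cell, List[Cell]], start: Cell) -> List[Cell]:
    """Return the sequence of cells an agent occupies while exploring by DFS."""
    visited = {start}
    stack = [start]  # current path from the start cell to the agent
    walk = [start]
    while stack:
        here = stack[-1]
        unvisited = [n for n in maze[here] if n not in visited]
        if unvisited:
            nxt = unvisited[0]  # greedily continue down a new branch
            visited.add(nxt)
            stack.append(nxt)
            walk.append(nxt)
        else:
            stack.pop()  # no unexplored branch here: backtrack one cell
            if stack:
                walk.append(stack[-1])
    return walk


# Hypothetical T-shaped maze: a corridor (0,0)-(1,0)-(2,0) with a side arm (1,1).
t_maze = {
    (0, 0): [(1, 0)],
    (1, 0): [(0, 0), (2, 0), (1, 1)],
    (2, 0): [(1, 0)],
    (1, 1): [(1, 0)],
}
print(dfs_walk(t_maze, (0, 0)))
```
        <p>In this example the agent walks to the dead end at (2, 0), backtracks to the junction at (1, 0), and only then enters the remaining arm, which is the backtracking pattern described above.</p>
      </para>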
      <para xml:id="A2.SS1.p4">
        <p>In addition to the metrics mentioned above, we also compute what we call “Consistency” between a child’s behavior and the depth first search algorithm. This metric serves as a proxy for systematic behavior and attempts to examine whether children explore a maze in systematic ways. Indeed, depth first search is an extremely efficient way to locally explore a maze (unlike breadth first search, which hops around the open list, and would be difficult for a child to replicate). To compute this metric, we begin by discretizing the space of the maze into cells, and computing the trajectory for the child based on the order of cells visited. A step in a child’s trajectory is “consistent” with the depth first search algorithm if:</p>
        <enumerate xml:id="A2.I1">
          <item xml:id="A2.I1.i1">
            <tags>
              <tag>1.</tag>
              <tag role="autoref">item 1</tag>
              <tag role="refnum">1</tag>
              <tag role="typerefnum">item 1</tag>
            </tags>
            <para xml:id="A2.I1.i1.p1">
              <p>The child does not visit a state they have previously visited unless there are no adjacent unvisited neighbors.</p>
            </para>
          </item>
          <item xml:id="A2.I1.i2">
            <tags>
              <tag>2.</tag>
              <tag role="autoref">item 2</tag>
              <tag role="refnum">2</tag>
              <tag role="typerefnum">item 2</tag>
            </tags>
            <para xml:id="A2.I1.i2.p1">
              <p>If all neighboring states have been visited, the child moves in the direction of the most recent unexplored branch.</p>
            </para>
          </item>
        </enumerate>
        <p>Measuring consistency across a child’s entire trajectory would lead to large numbers of consistent states (as they traverse down long corridors), making it difficult to measure the actual behavioral differences between the children and the agents. To avoid this confound, we restrict our analyses to “decision points” in the maze, that is, cells that do not have exactly two neighboring cells. The code for this analysis is made available at <ref class="ltx_url" font="typewriter" href="https://github.com/CannyLab/ExpExp">https://github.com/CannyLab/ExpExp</ref>. Comparisons to other local search algorithms such as jump-point search are also an interesting avenue for future work.</p>
      </para>
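      <para>
        <p>One possible reading of the two rules above, restricted to decision points, is sketched below. It is not the released analysis code and may differ from it. The sketch assumes a connected maze given as a dictionary from each cell to its adjacent open cells, and a trajectory in which consecutive cells are adjacent. It interprets “the direction of the most recent unexplored branch” as moving one step closer, by shortest path, to the most recently visited cell that still has an unvisited neighbor, and it leaves steps unscored once nothing remains to explore. All names are hypothetical.</p>
```python
from collections import deque
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]
Maze = Dict[Cell, List[Cell]]


def bfs_distances(maze: Maze, source: Cell) -> Dict[Cell, int]:
    """Shortest-path distance (in cells) from source to every reachable cell."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        cell = queue.popleft()
        for n in maze[cell]:
            if n not in dist:
                dist[n] = dist[cell] + 1
                queue.append(n)
    return dist


def dfs_consistency(maze: Maze, trajectory: List[Cell]) -> Optional[float]:
    """Fraction of scored decision-point steps that satisfy the two rules."""
    consistent = scored = 0
    for t in range(len(trajectory) - 1):
        here, nxt = trajectory[t], trajectory[t + 1]
        if len(maze[here]) == 2:
            continue  # corridor cell, not a decision point
        visited = set(trajectory[: t + 1])
        if any(n not in visited for n in maze[here]):
            # Rule 1: with an unvisited neighbour available, do not revisit.
            ok = nxt not in visited
        else:
            # Rule 2 (assumed reading): step toward the most recently visited
            # cell that still has an unvisited neighbour.
            target = next(
                (c for c in reversed(trajectory[: t + 1])
                 if any(n not in visited for n in maze[c])),
                None,
            )
            if target is None:
                continue  # nothing left to explore; step is not scored
            dist = bfs_distances(maze, target)
            ok = dist[nxt] == dist[here] - 1
        scored += 1
        consistent += ok
    return consistent / scored if scored else None


# Hypothetical T-shaped maze and two made-up trajectories.
t_maze = {
    (0, 0): [(1, 0)],
    (1, 0): [(0, 0), (2, 0), (1, 1)],
    (2, 0): [(1, 0)],
    (1, 1): [(1, 0)],
}
print(dfs_consistency(t_maze, [(0, 0), (1, 0), (2, 0), (1, 0), (1, 1)]))  # 1.0
print(dfs_consistency(t_maze, [(0, 0), (1, 0), (0, 0), (1, 0), (2, 0)]))  # 0.75
```
        <p>In the second trajectory, the step from the junction (1, 0) back to the already visited cell (0, 0) violates the first rule while unvisited neighbors remain, so three of the four scored steps are consistent.</p>
      </para>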
    </subsection>
    <subsection inlist="toc" labels="LABEL:app:exp2" xml:id="A2.SS2">
      <tags>
        <tag>B.2</tag>
        <tag role="autoref">subsection B.2</tag>
        <tag role="refnum">B.2</tag>
        <tag role="typerefnum">§B.2</tag>
      </tags>
      <title><tag close=" ">B.2</tag>Experiment 2</title>
      <figure inlist="lof" labels="LABEL:fig:exp2" xml:id="A2.F3">
        <tags>
          <tag>Figure 3</tag>
          <tag role="autoref">Figure 3</tag>
          <tag role="refnum">3</tag>
          <tag role="typerefnum">Figure 3</tag>
        </tags>
        <inline-para align="center" class="ltx_minipage" vattach="middle" width="355.6pt">
          <para xml:id="A2.F3.p1">
            <graphics candidates="images/EXP2_conds.png" graphic="images/EXP2_conds.png" options="width=433.62pt" xml:id="A2.F3.p1.g1"/>
          </para>
        </inline-para>
        <toccaption class="ltx_centering"><tag close=" ">3</tag>(Left) Maze outlines for the no-apples or “sparse reward” condition. (Right) Maze outlines for the apples or “dense reward” condition.</toccaption>
        <caption class="ltx_centering"><tag close=": ">Figure 3</tag>(Left) Maze outlines for the no-apples or “sparse reward” condition. (Right) Maze outlines for the apples or “dense reward” condition.</caption>
      </figure>
      <para xml:id="A2.SS2.p1">
        <p>For our second pilot experiment, we pre-registered a sample size of 60 children aged 4-6. The experiment is designed to examine the impact of a dense versus sparse reward structure on the exploration patterns of both children and agents.</p>
      </para>
      <para xml:id="A2.SS2.p2">
        <p>This task contains 6 parts: children complete two mazes, each with three phases. There are 2 conditions. In the “dense reward” or “apples” condition, children follow a path of apples in the maze that leads them to the goal. The apples are removed in the subsequent maze but the goal remains in the same place, so children have to find the goal without the aid of the apples. They then have to find the goal in a maze where the main path to the goal is blocked from the start. In the “sparse reward” or “no apples” condition, children freely explore a maze that has no goal present. They are then asked to look for a goal in the subsequent maze, and finally have to find the goal in a maze where the main path to the goal is blocked from the start. For both conditions, the same tasks are repeated in a new maze design to test for generalization. Figure 3 shows the maze outlines. We will measure the percentage of the maze explored in each part, the number of steps taken, the cells crossed, the time to reach the goal, and the percentage of the maze re-explored in each section.</p>
      </para>
      <para xml:id="A2.SS2.p3">
        <p>We expect that, in line with state-of-the-art RL agents, children in the “dense” condition will (1) follow the dense-reward path directly to the goal in the first phase and (2) find the goal faster in the second phase (since they can repeat the previous dense-reward path), but (3) take longer to find the goal in the final phase than children in the “sparse” condition, because the previous dense-reward path to the goal is now blocked.
While we are still collecting data, initial experimental data suggest that children are less likely to explore an area in the dense rewards condition, but, surprisingly, that does not hurt their performance in the final phase.</p>
      </para>
      <para xml:id="A2.SS2.p4">
        <p>While we do not yet know why the lack of exploration does not hurt their performance in the final phase, this experiment shows why it is useful to study children: their behavior here is surprising, and it is certainly an instance where agents’ behavior differs from that of children.</p>
      </para>
    </subsection>
  </appendix>
</document>
