<?xml version="1.0" encoding="UTF-8"?>
<?latexml searchpaths="/home/japhy/scienceReplication.artiswrong.com/paper_files/arxiv/2009.01810/latex_extracted"?>
<?latexml class="IEEEtran" options="conference"?>
<!--  %The preceding line is only needed to identify funding in the first footnote. If that is unneeded, please comment it out. --><?latexml package="cite"?>
<?latexml package="amsmath,amssymb,amsfonts"?>
<?latexml package="algorithmic"?>
<?latexml package="graphicx"?>
<?latexml package="textcomp"?>
<?latexml package="xcolor"?>
<?latexml package="csquotes"?>
<?latexml package="gensymb"?>
<?latexml package="makecell"?>
<?latexml RelaxNGSchema="LaTeXML"?>
<document xmlns="http://dlmf.nist.gov/LaTeXML" class="ltx_authors_1line">
  <resource src="LaTeXML.css" type="text/css"/>
  <resource src="ltx-article.css" type="text/css"/>
  <title>SEDRo: A Simulated Environment for Developmental Robotics</title>
  <creator role="author">
    <personname>Aishwarya Pothula, Md Ashaduzzaman Rubel Mondol, Sanath Narasimhan, Sm Mazharul Islam, Deokgun Park

</personname>
    <contact role="affiliation"><text font="italic">Computer Science and Engineering</text> <break/><text font="italic">University of Texas at Arlington<break/></text>Arlington, Texas USA <break/>{aishwarya.pothula, mdashaduzzaman.mondol, sanath.narasimhan, sxi7321}@mavs.uta.edu, deokgun.park@uta.edu</contact>
  </creator>
  <abstract name="Abstract">
    <p>Even with impressive advances in application-specific models, we still lack knowledge about how to build a model that can learn in a human-like way and perform multiple tasks.
To learn in a human-like way, a model needs diverse experiences comparable to a human’s.
In this paper, we introduce our ongoing effort to build a simulated environment for developmental robotics (SEDRo). SEDRo provides diverse human experiences ranging from those of a fetus to those of a 12-month-old infant. A series of simulated tests based on developmental psychology will be used to evaluate the progress of a learning model.
We anticipate that SEDRo will lower the cost of entry and facilitate research in the developmental robotics community.</p>
  </abstract>
  <keywords>
Baby robots, Sensorimotor development, Embodiment
</keywords>
<!--  %**** root.tex Line 25 **** 
     %“author–Removed for Blind Review˝-->  <section inlist="toc" xml:id="S1">
    <tags>
      <tag>I</tag>
      <tag role="refnum">I</tag>
      <tag role="typerefnum">§I</tag>
    </tags>
    <title><tag close=" ">I</tag><text font="smallcaps">Introduction</text></title>
    <para xml:id="S1.p1">
      <p>Imagine a robot that can work as a butler. It can handle many tasks and talk with other butler robots to do even more tasks. Alas, one cannot buy or build one today even with an unlimited budget. A butler robot is not available because we do not know how to program it. Current approaches require huge amounts of data to teach a single skill <cite class="ltx_citemacro_cite">[<bibref bibrefs="lake2017building" separator="," yyseparator=","/>]</cite>, and the data requirement grows exponentially with the number of tasks. While we have made remarkable progress in solving tasks with well-defined structures, such as those with explicit rewards or ground truth, we do not know how to generalize this single-task capability to multiple tasks. Turing suggested <cite class="ltx_citemacro_cite">[<bibref bibrefs="turing1950computing" separator="," yyseparator=","/>]</cite>:</p>
    </para>
    <para xml:id="S1.p2">
      <p>“Instead of trying to produce a programme to simulate the adult mind, why not rather try to produce one which simulates the child’s?”</p>
    </para>
    <para xml:id="S1.p3">
      <p>Humans are born with a vast blank memory and a mechanism for filling it. Let us call this mechanism <text font="italic">the learning mechanism</text> in this paper. With diverse experiences as input, the mechanism fills the contents of the memory as shown in Fig. <ref labelref="LABEL:fig:example"/>. After a few years, we can do many things in multiple domains such as perception, motor control, social interaction, language, and physics.
We claim that the following issues in previous approaches have made the search for the learning mechanism difficult, and we propose a new approach to mitigate them.</p>
    </para>
    <figure inlist="lof" labels="LABEL:fig:example" placement="b" xml:id="S1.F1">
      <tags>
        <tag>Fig. 1</tag>
        <tag role="refnum">1</tag>
        <tag role="typerefnum">Fig. 1</tag>
      </tags>
      <block align="center" depth="0.0pt" width="433.6pt">
        <graphics candidates="figures/child_development.pdf" graphic="figures/child_development.pdf" xml:id="S1.F1.g1"/>
      </block>
      <toccaption class="ltx_centering"><tag close=" ">1</tag><text fontsize="90%">A child begins with a vast blank memory and a learning mechanism. The learning mechanism uses diverse experiences to fill the memory. There can be many different skill domains, such as perception, motor control, and reasoning. </text></toccaption>
      <caption class="ltx_centering"><tag close=": ">Fig. 1</tag><text fontsize="90%">A child begins with a vast blank memory and a learning mechanism. The learning mechanism uses diverse experiences to fill the memory. There can be many different skill domains, such as perception, motor control, and reasoning. </text></caption>
    </figure>
    <para xml:id="S1.p4">
      <itemize xml:id="S1.I1">
        <item xml:id="S1.I1.i1">
          <tags>
            <tag>•</tag>
            <tag role="typerefnum">1st item</tag>
          </tags>
          <para xml:id="S1.I1.i1.p1">
            <p><text font="bold">Targeting a single skill rather than diverse skills (the main issue)</text> – While a human child can learn to do many things simultaneously, we have focused largely on developing models that can do only a single task. This approach has resulted in overfitted solutions that cannot be generalized to diverse tasks.</p>
          </para>
        </item>
        <item xml:id="S1.I1.i2">
          <tags>
            <tag>•</tag>
            <tag role="typerefnum">2nd item</tag>
          </tags>
          <para xml:id="S1.I1.i2.p1">
            <p><text font="bold">Use of refined and focused datasets rather than diverse and noisy datasets (the first common pattern)</text> – Because the focus is to teach one skill, we tend to build a refined dataset or an environment that contains only task-relevant information. This results in <text font="italic">spoon-fed human-edited sensory data</text> <cite class="ltx_citemacro_cite">[<bibref bibrefs="weng2001autonomous" separator="," yyseparator=","/>]</cite>. Compare this with how humans learn from unstructured data such as visual and auditory input, find underlying structures, and apply those structures to many domains <cite class="ltx_citemacro_cite">[<bibref bibrefs="gopnik1999scientist" separator="," yyseparator=","/>]</cite>.</p>
          </para>
        </item>
        <item xml:id="S1.I1.i3">
          <tags>
            <tag>•</tag>
            <tag role="typerefnum">3rd item</tag>
          </tags>
          <para xml:id="S1.I1.i3.p1">
            <p><text font="bold">Relying on explicit rewards rather than on other mechanisms (the second common pattern)</text> – While operant conditioning is a powerful mechanism <cite class="ltx_citemacro_cite">[<bibref bibrefs="skinner1938behavior" separator="," yyseparator=","/>]</cite>, we tend to rely too heavily on explicit rewards to guide learning.
Designing a reward mechanism might be easy for a single task. However, it becomes exponentially more difficult as the number of target tasks increases.
If we compare the language acquisition abilities of humans and robots, robots can quickly learn to navigate according to verbal instructions <cite class="ltx_citemacro_cite">[<bibref bibrefs="hermann2017grounded,chaplot2018gated,chen2019touchdown" separator="," yyseparator=","/>]</cite> but do not know how to generalize this to other tasks such as cooking. In contrast, human infants cannot follow verbal instructions for a very long time. One cannot, for example, toilet-train an 8-month-old infant by giving a treat whenever the infant uses the bathroom unaided. But gradually, around two years of age when they acquire language, children can use it to do many tasks <cite class="ltx_citemacro_cite">[<bibref bibrefs="gopnik1999scientist" separator="," yyseparator=","/>]</cite>. One key difference is that robots are trained using explicit rewards, whereas infants are not.</p>
          </para>
        </item>
        <item xml:id="S1.I1.i4">
          <tags>
            <tag>•</tag>
            <tag role="typerefnum">4th item</tag>
          </tags>
          <para xml:id="S1.I1.i4.p1">
            <p><text font="bold">Too many necessary components rather than a sufficient set of the learning mechanism (the third common pattern)</text> – Finally, we tend to identify individual necessary mechanisms rather than propose a sufficient set of mechanisms. The learning mechanism is a system of multiple components. Some might classify the components into two categories: 1) innate or built-in mechanisms and 2) universal principles that drive learning. Examples of innate mechanisms are reflexes, the hippocampus, and the limbic system. Universal principles explain the driving force behind learning and can usually be written as a succinct mathematical formulation, such as intrinsic motivation <cite class="ltx_citemacro_cite">[<bibref bibrefs="oudeyer2007intrinsic,schmidhuber2010formal" separator="," yyseparator=","/>]</cite>, Bayesian statistics <cite class="ltx_citemacro_cite">[<bibref bibrefs="gopnik2004theory" separator="," yyseparator=","/>]</cite>, or the free energy principle <cite class="ltx_citemacro_cite">[<bibref bibrefs="friston2010free" separator="," yyseparator=","/>]</cite>. As we can see, there are many candidate components, and we anticipate that the learning mechanism will be a set of multiple components.
However, for a single application, a single component or a small subset of these components might do the job.
The problem is that we cannot simply concatenate the solutions from multiple domains, because they are not independent.
Therefore, a more critical but neglected question is what constitutes a sufficient set of components for all problems humans can solve.</p>
          </para>
        </item>
      </itemize>
    </para>
    <para xml:id="S1.p5">
      <p>In summary, we tend to build models for single tasks, resulting in overfitted solutions that cannot be generalized to multiple tasks.
From this perspective, we need regularization.
Regularization by sharing is an effective pattern, as demonstrated by convolutional neural networks (CNNs) and recurrent neural networks (RNNs) <cite class="ltx_citemacro_cite">[<bibref bibrefs="srivastava2014dropout" separator="," yyseparator=","/>]</cite>.
We claim that we need to regularize by enforcing the use of the same learning mechanism to perform multiple tasks, as Allen Newell suggested in his unified theories of cognition <cite class="ltx_citemacro_cite">[<bibref bibrefs="newell1994unified" separator="," yyseparator=","/>]</cite>.</p>
    </para>
    <para xml:id="S1.p6">
      <p>Then why has past research focused on developing models for individual tasks? Imagine that a researcher has decided to build an agent that can perform many tasks as a human can.
The first problem she encounters is that there is no simulated environment that can provide the diverse experiences required to acquire skills across multiple domains.</p>
    </para>
    <para xml:id="S1.p7">
      <p>To solve this problem, we introduce our ongoing effort to build a Simulated Environment for Developmental Robotics (SEDRo). SEDRo provides diverse experiences similar to those of human infants from the fetal stage to 12 months of age. SEDRo also simulates developmental psychology experiments to evaluate the progress of intelligence development in multiple domains.</p>
    </para>
    <para xml:id="S1.p8">
      <p>There are two generalizable lessons in our work.
First, we point out that the learning environment should provide experiences for multiple tasks, and we provide a proof-of-concept example.
Fig. <ref labelref="LABEL:fig:environment"/> shows screenshots of SEDRo.
In our environment, the learning agent has to rely on interactions with other characters such as a mother character, who teaches language as a human mother does.
Other characters have to intelligently react to the random babbling of the baby in a diverse but reasonable way.
Programming a mother character for all situations is intractable, and it becomes increasingly challenging to provide an experience for open-ended learning when social learning is involved.
In this paper, we address this issue by focusing on the earlier stages of development, from the fetal stage to 12 months of age, when only a few words are acquired.
This period is more tractable because conversations between the mother and the baby tend to be one-directional rather than interactive back-and-forth exchanges.</p>
    </para>
    <figure inlist="lof" labels="LABEL:fig:environment" placement="tb" xml:id="S1.F2">
      <tags>
        <tag>Fig. 2</tag>
        <tag role="refnum">2</tag>
        <tag role="typerefnum">Fig. 2</tag>
      </tags>
      <block align="center" depth="0.0pt" width="411.9pt">
        <graphics candidates="figures/environment.pdf" graphic="figures/environment.pdf" xml:id="S1.F2.g1"/>
      </block>
      <toccaption class="ltx_centering"><tag close=" ">2</tag><text fontsize="90%">Screenshots of the SEDRo environment. The environment simulates a fetus in the womb (left) and an infant (the learning agent) in a crib with a mother character (right).</text></toccaption>
      <caption class="ltx_centering"><tag close=": ">Fig. 2</tag><text fontsize="90%">Screenshots of the SEDRo environment. The environment simulates a fetus in the womb (left) and an infant (the learning agent) in a crib with a mother character (right).</text></caption>
    </figure>
    <para xml:id="S1.p9">
      <p>The second generalizable lesson is that we can build upon prior research in developmental psychology to evaluate the developmental progress of a non-verbal artificial agent. Because our environment cannot provide sufficient language exposure beyond the first 12 months, the agent cannot acquire advanced language beyond the first few words.
Consequently, we cannot evaluate the developmental progress of the agent based on its ability to follow verbal instructions or answer questions correctly.
We overcome this challenge by using studies from developmental psychology.
There are many experiments revealing developmental milestones for non-verbal infants.
We can simulate and make use of those experiments in SEDRo for developmental assessments.
As a concrete example, Kellman and Spelke used the habituation-dishabituation paradigm to find that babies acquire <text font="italic">perceptual completion</text> around four months of age <cite class="ltx_citemacro_cite">[<bibref bibrefs="kellman1983perception" separator="," yyseparator=","/>]</cite>.
With SEDRo, models can be computationally evaluated by simulating and running experiments to compare behaviors of the agent to the intellectual progress of human infants.
Fig. <ref labelref="LABEL:fig:evaluation"/> explains these experiments in more detail and shows screenshots of our simulated environment.</p>
    </para>
    <para xml:id="S1.p10">
      <p>The rest of this paper is organized as follows. In Section II, we survey related work covering different types of simulated environments for developing AI and various evaluation methods for non-verbal agents. Then, in Section III, we describe our proposed environment, SEDRo. Finally, in Section IV, we conclude by pointing out some major limitations of the current version of SEDRo, along with plans for resolving them.</p>
    </para>
    <figure inlist="lof" labels="LABEL:fig:evaluation" placement="htb" xml:id="S1.F3">
      <tags>
        <tag>Fig. 3</tag>
        <tag role="refnum">3</tag>
        <tag role="typerefnum">Fig. 3</tag>
      </tags>
      <block align="center" depth="0.0pt" width="394.6pt">
        <graphics candidates="figures/evaluation.pdf" graphic="figures/evaluation.pdf" xml:id="S1.F3.g1"/>
      </block>
      <toccaption class="ltx_centering"><tag close=" ">3</tag><text fontsize="90%">Example evaluation methods for non-verbal infants and simulated experiments for the artificial agent. (a) An experimental setup for examining visual patterns in response to stimuli <cite class="ltx_citemacro_cite">[<bibref bibrefs="haith1988expectation" separator="," yyseparator=","/>]</cite>. A baby views a series of visual stimuli at two or more locations. Visual patterns of the baby, such as looking time and looking preference, are then analyzed.
Depending on the developmental stage, babies tend to attend more to novel stimuli <cite class="ltx_citemacro_cite">[<bibref bibrefs="gilmore2002examining" separator="," yyseparator=","/>]</cite>. (b) For example, we can examine whether the baby can differentiate between male and female faces <cite class="ltx_citemacro_cite">[<bibref bibrefs="charlesworth1969role" separator="," yyseparator=","/>]</cite>. (c) When young infants under three months of age see a rod moving behind a box, they perceive it as two rods. However, babies past four months of age perceive it as a single rod and are surprised when they are shown two rods <cite class="ltx_citemacro_cite">[<bibref bibrefs="stahl2015observing" separator="," yyseparator=","/>]</cite>. (e) In our simulated environment, we model eye gaze and central vision. (f) The focused area gives a clear view, while peripheral vision gives a blurred image. (g) Simulated object unity-perception task. (h) Test for innate physics. </text> </toccaption>
      <caption class="ltx_centering"><tag close=": ">Fig. 3</tag><text fontsize="90%">Example evaluation methods for non-verbal infants and simulated experiments for the artificial agent. (a) An experimental setup for examining visual patterns in response to stimuli <cite class="ltx_citemacro_cite">[<bibref bibrefs="haith1988expectation" separator="," yyseparator=","/>]</cite>. A baby views a series of visual stimuli at two or more locations. Visual patterns of the baby, such as looking time and looking preference, are then analyzed.
Depending on the developmental stage, babies tend to attend more to novel stimuli <cite class="ltx_citemacro_cite">[<bibref bibrefs="gilmore2002examining" separator="," yyseparator=","/>]</cite>. (b) For example, we can examine whether the baby can differentiate between male and female faces <cite class="ltx_citemacro_cite">[<bibref bibrefs="charlesworth1969role" separator="," yyseparator=","/>]</cite>. (c) When young infants under three months of age see a rod moving behind a box, they perceive it as two rods. However, babies past four months of age perceive it as a single rod and are surprised when they are shown two rods <cite class="ltx_citemacro_cite">[<bibref bibrefs="stahl2015observing" separator="," yyseparator=","/>]</cite>. (e) In our simulated environment, we model eye gaze and central vision. (f) The focused area gives a clear view, while peripheral vision gives a blurred image. (g) Simulated object unity-perception task. (h) Test for innate physics. </text> </caption>
    </figure>
  </section>
  <section inlist="toc" xml:id="S2">
    <tags>
      <tag>II</tag>
      <tag role="refnum">II</tag>
      <tag role="typerefnum">§II</tag>
    </tags>
    <title><tag close=" ">II</tag><text font="smallcaps">Background</text></title>
    <para xml:id="S2.p1">
      <p>We review previous literature on 1) simulated environments for artificial agents and 2) evaluation methods for non-verbal agents.</p>
    </para>
    <subsection inlist="toc" xml:id="S2.SS1">
      <tags>
        <tag>II-A</tag>
        <tag role="refnum">II-A</tag>
        <tag role="typerefnum">§II-A</tag>
      </tags>
      <title><tag close=" ">II-A</tag><text font="italic">Simulated Environments for AI</text></title>
      <para xml:id="S2.SS1.p1">
        <p>Several environments have been developed for AI research, especially for reinforcement learning research <cite class="ltx_citemacro_cite">[<bibref bibrefs="brockman2016openai,beattie2016deepmind" separator="," yyseparator=","/>]</cite>. The overarching goal was to provide a common benchmark and to lower the barrier to entry for researchers. Examples include environments in which agents get rewards by following verbal instructions in navigation <cite class="ltx_citemacro_cite">[<bibref bibrefs="chen2019touchdown,savva2019habitat,chaplot2018gated,hermann2017grounded,shridhar2019alfred" separator="," yyseparator=","/>]</cite> or by giving correct answers (question answering) <cite class="ltx_citemacro_cite">[<bibref bibrefs="das2018embodied" separator="," yyseparator=","/>]</cite>.
Though we have made substantial progress in reinforcement learning with explicit rewards, it is difficult to transfer the resulting models toward artificial general intelligence (AGI).
The difficulty in transferring arises mainly because humans depend on neither explicit rewards nor labeled data to learn <cite class="ltx_citemacro_cite">[<bibref bibrefs="hull1943principles,white1959motivation" separator="," yyseparator=","/>]</cite>.
Many previous studies have sought to overcome this limitation.
Principles such as intrinsic motivation <cite class="ltx_citemacro_cite">[<bibref bibrefs="oudeyer2007intrinsic" separator="," yyseparator=","/>]</cite> and free energy <cite class="ltx_citemacro_cite">[<bibref bibrefs="friston2009predictive" separator="," yyseparator=","/>]</cite> have been proposed as the underlying mechanisms for learning <cite class="ltx_citemacro_cite">[<bibref bibrefs="gopnik2004theory,bonawitz2014probabilistic,hawkins2016neurons,nagai2019predictive,schmidt2020self" separator="," yyseparator=","/>]</cite>.
A number of simulated environments have been proposed to test these hypotheses in the robotics context.
We can classify previous environments into artificial environments and human-inspired environments.</p>
      </para>
      <paragraph inlist="toc" xml:id="S2.SS1.SSS0.Px1">
        <title>Artificial Environments</title>
        <para xml:id="S2.SS1.SSS0.Px1.p1">
          <p>Oudeyer et al. proposed a mathematical formulation for intrinsic motivation and demonstrated similar observations using both simulation and robots <cite class="ltx_citemacro_cite">[<bibref bibrefs="oudeyer2007intrinsic" separator="," yyseparator=","/>]</cite>.
Similarly, Haber et al. showed that an agent driven by intrinsic rewards begins by exploring its environment and then, in later stages, begins to interact with objects <cite class="ltx_citemacro_cite">[<bibref bibrefs="haber2018learning" separator="," yyseparator=","/>]</cite>.
These works were conducted in 3D simulated environments built using game physics engines.
However, previous works usually focused on developing and testing a single component of the mechanism, such as self-other cognition, imitation, or joint attention <cite class="ltx_citemacro_cite">[<bibref bibrefs="nagai2019predictive" separator="," yyseparator=","/>]</cite>.
While it is relatively easy to build artificial environments to test a single component, it is difficult to extend such environments to multiple tasks.
To overcome this limitation, human-inspired environments have also been studied.</p>
        </para>
<!--  %**** background.tex Line 25 **** -->      </paragraph>
      <paragraph inlist="toc" xml:id="S2.SS1.SSS0.Px2">
        <title>Human-inspired Environment</title>
        <para xml:id="S2.SS1.SSS0.Px2.p1">
          <p>Human-inspired environments offer a clear benefit: we can rest assured that infant-like experiences are sufficient for the development of human-level intelligence.
Meltzoff et al. elaborated this idea with evidence from developmental psychology, neuroscience, and machine learning <cite class="ltx_citemacro_cite">[<bibref bibrefs="meltzoff2009foundations" separator="," yyseparator=","/>]</cite>.
The idea of using human-like experience to nurture AI has been actively pursued in the  <text font="italic">Developmental Robotics (DevRob)</text> or <text font="italic">Epigenetic Robotics</text> community <cite class="ltx_citemacro_cite">[<bibref bibrefs="lungarella2003developmental,asada2009cognitive,cangelosi2015developmental" separator="," yyseparator=","/>]</cite>.
However, it is challenging to simulate the real world.
Therefore, researchers have used 1) physical robots in the real world or 2) simulated environments of a simplified real world, focusing on narrow skills.
Weng et al. developed SAIL robots that explored the world alongside humans to autonomously acquire skills in navigation and object perception <cite class="ltx_citemacro_cite">[<bibref bibrefs="weng2001autonomous" separator="," yyseparator=","/>]</cite>.
Later, iCub <cite class="ltx_citemacro_cite">[<bibref bibrefs="metta2008icub" separator="," yyseparator=","/>]</cite>, a humanoid robot modelled after human babies, was developed and used for developmental robotics research. Using physical robots, studies on perception and physical behaviors with objects can be conducted <cite class="ltx_citemacro_cite">[<bibref bibrefs="ruesch2008multimodal,gaudiello2016trust,marocco2010grounding,serhan2019replication" separator="," yyseparator=","/>]</cite>.
However, physical robots are expensive, and providing identical experiences for reproducible research remains an open problem.
To lower the cost of entry for research in robotics, many simulators were developed <cite class="ltx_citemacro_cite">[<bibref bibrefs="koenig2004design,Webots04" separator="," yyseparator=","/>]</cite>.
Simulations of human development have modelled stages as early as the fetus <cite class="ltx_citemacro_cite">[<bibref bibrefs="kuniyoshi2006early,mori2010human" separator="," yyseparator=","/>]</cite>, as there is evidence that fetuses acquire auditory <cite class="ltx_citemacro_cite">[<bibref bibrefs="moon2013language" separator="," yyseparator=","/>]</cite> and sensorimotor coordination skills.
To tackle the challenge of simulating natural interactions with human users, Murnane et al. used virtual reality to allow humans to interact with the robots in the simulation <cite class="ltx_citemacro_cite">[<bibref bibrefs="murnane2019virtual,murnane2019learning" separator="," yyseparator=","/>]</cite>.
Using this method, data for human-robot interaction can be accumulated.</p>
        </para>
      </paragraph>
    </subsection>
    <subsection inlist="toc" xml:id="S2.SS2">
      <tags>
        <tag>II-B</tag>
        <tag role="refnum">II-B</tag>
        <tag role="typerefnum">§II-B</tag>
      </tags>
      <title><tag close=" ">II-B</tag><text font="italic">Evaluation methods for non-verbal agents</text></title>
      <para xml:id="S2.SS2.p1">
        <p>There are many tests for human-level intelligence, including the Turing test, robot college student test, kitchen tests, and AI preschool test <cite class="ltx_citemacro_cite">[<bibref bibrefs="adams2012mapping" separator="," yyseparator=","/>]</cite>.
However, most tests require a capability for language and cannot be used for evaluating progressive intelligence in diverse domains.</p>
      </para>
      <paragraph inlist="toc" xml:id="S2.SS2.SSS0.Px1">
        <title>Tests in Developmental psychology</title>
        <para xml:id="S2.SS2.SSS0.Px1.p1">
          <p>Researchers in developmental psychology developed various evaluation schemes using behavior patterns related to familiarity and novelty.
These include the visual expectation paradigm <cite class="ltx_citemacro_cite">[<bibref bibrefs="haith1988expectation" separator="," yyseparator=","/>]</cite>, preferential looking <cite class="ltx_citemacro_cite">[<bibref bibrefs="fantz1956method" separator="," yyseparator=","/>]</cite>, the habituation-dishabituation paradigm <cite class="ltx_citemacro_cite">[<bibref bibrefs="kaplan1986habituation" separator="," yyseparator=","/>]</cite>, and contingent changes in the rate of pacifier-sucking behaviors <cite class="ltx_citemacro_cite">[<bibref bibrefs="moon2013language" separator="," yyseparator=","/>]</cite>.
For instance, these methods build on the observation that babies look longer at and attend more to novel scenes than to familiar scenes. Using these methods, developmental milestones have been studied in many skill domains, such as visual <cite class="ltx_citemacro_cite">[<bibref bibrefs="bushnell2001mother" separator="," yyseparator=","/>]</cite>, auditory <cite class="ltx_citemacro_cite">[<bibref bibrefs="kuhl2007speech" separator="," yyseparator=","/>]</cite>, motor <cite class="ltx_citemacro_cite">[<bibref bibrefs="clifton1993visually" separator="," yyseparator=","/>]</cite>, social <cite class="ltx_citemacro_cite">[<bibref bibrefs="maurer1981infants,courage2002infant" separator="," yyseparator=","/>]</cite>, language <cite class="ltx_citemacro_cite">[<bibref bibrefs="kuhl2007speech" separator="," yyseparator=","/>]</cite>, and physics <cite class="ltx_citemacro_cite">[<bibref bibrefs="kellman1983perception" separator="," yyseparator=","/>]</cite>.</p>
        </para>
      </paragraph>
      <paragraph inlist="toc" xml:id="S2.SS2.SSS0.Px2">
        <title>Psychology-inspired Test for AI</title>
        <para xml:id="S2.SS2.SSS0.Px2.p1">
          <p>Previous research has used human psychological metrics to evaluate artificial agents.
For example, Leibo et al. used human psychology paradigms such as visual search, change detection, and random dot motion discrimination <cite class="ltx_citemacro_cite">[<bibref bibrefs="leibo2018psychlab" separator="," yyseparator=","/>]</cite>.
However, their work tests adult-level psychological perception and does not provide developmental milestones. It is also limited to the visual perception domain and does not provide the integrated experience required to learn and perform diverse tasks.
Piloto et al. suggested the evaluation of physics concepts that are inspired by developmental psychology <cite class="ltx_citemacro_cite">[<bibref bibrefs="piloto2018probing" separator="," yyseparator=","/>]</cite>. They developed a dataset that examines object persistence, unchangeableness, continuity, solidity, and containment using violation-of-expectation (VOE) methods.
The study of complete and diverse tasks at the human level is challenging.
Crosby et al. instead drew on various intelligent animal behaviors in a simulated environment <cite class="ltx_citemacro_cite">[<bibref bibrefs="crosby2019animal" separator="," yyseparator=","/>]</cite>.
Their work provides tests for ten cognitive categories and a playground in which agents can gain the experience needed to learn those skills.
SEDRo builds upon their work to extend those approaches to human-level intelligence.</p>
        </para>
<!--  %**** root.tex Line 50 **** -->      </paragraph>
    </subsection>
  </section>
  <section inlist="toc" xml:id="S3">
    <tags>
      <tag>III</tag>
      <tag role="refnum">III</tag>
      <tag role="typerefnum">§III</tag>
    </tags>
    <title><tag close=" ">III</tag><text font="smallcaps">Simulating Experience of Human Infants</text></title>
    <para xml:id="S3.p1">
      <p>In this section, we describe the proposed Simulated Environment for Developmental Robotics (SEDRo). Fig. <ref labelref="LABEL:fig:big_picture"/> illustrates the primary components of SEDRo and their inter-relations.
The two main components of SEDRo are the learning agent (with red border) and the simulated environment (with green border).
Within the simulated environment, there are a caregiver character, surrounding objects (e.g., toys, cribs, and walls), and, most importantly, the body of the agent.
The agent interacts with the simulated environment by controlling the muscles in its body in response to the sensor signals.
Interaction between the agent and the caregiver allows cognitive bootstrapping and social learning, while interactions between the agent and the surrounding objects increase gradually as the agent reaches later developmental stages.
The caregiver character can also interact with the surrounding objects to introduce them to the agent at the earlier stages of development.</p>
    </para>
    <para xml:id="S3.p2">
      <p>Although the environment does not explicitly award rewards, the reward mechanism still plays a role in learning. Rather than relying on the environment for rewards, the agent itself is responsible for generating them. For example, if the agent can get food from its environment, this input is given to the agent as a number representing the amount of food in its stomach. It is then the agent’s role to generate a negative reward if there is no food in its stomach and a positive reward when new food is given. In this sense, the body itself is part of the environment, and what we refer to as the agent is only the brain, which is why the agent’s body in Fig. <ref labelref="LABEL:fig:big_picture"/> is shown in green font.</p>
    </para>
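    <para>
      <p>As a minimal sketch of this idea, the following Python function shows how an agent might derive its own reward from the stomach signal. The function name, the reward magnitudes of +1 and -1, and the use of the previous reading to detect newly given food are illustrative assumptions and are not part of SEDRo.</p>
      <verbatim font="typewriter"><![CDATA[
```python
def intrinsic_reward(prev_food: float, food: float) -> float:
    """Hypothetical agent-side reward computed from the stomach-content signal.

    prev_food and food are the amounts of food sensed at the previous and
    current steps; the magnitudes returned here are arbitrary.
    """
    if food > prev_food:
        # New food was given since the previous step: positive reward.
        return 1.0
    if food <= 0.0:
        # The stomach is empty: negative reward.
        return -1.0
    # Otherwise the signal is treated as neutral.
    return 0.0
```
]]></verbatim>
      <p>Under these assumptions, the environment only reports the stomach content, and the mapping from that number to a reward lives entirely inside the agent, so different cognitive architectures are free to replace it.</p>
    </para>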
    <figure inlist="lof" labels="LABEL:fig:big_picture" placement="t!" xml:id="S3.F4">
      <tags>
        <tag>Fig. 4</tag>
        <tag role="refnum">4</tag>
        <tag role="typerefnum">Fig. 4</tag>
      </tags>
      <block align="center" depth="0.0pt" width="433.6pt">
        <graphics candidates="figures/big_picture.png" graphic="figures/big_picture.png" xml:id="S3.F4.g1"/>
      </block>
      <toccaption class="ltx_centering"><tag close=" ">4</tag><text fontsize="90%">Ecosystem of SEDRo environment. </text></toccaption>
      <caption class="ltx_centering"><tag close=": ">Fig. 4</tag><text fontsize="90%">Ecosystem of SEDRo environment. </text></caption>
    </figure>
    <para xml:id="S3.p3">
      <p>SEDRo provides the diverse experiences of human infants from the fetal stage through the first 12 months of life.
A newborn’s brain must learn to control its body. We can compare this with trying to learn to operate a machine using a control panel of 1,000 by 1,000 LEDs and 1,000 by 1,000 buttons. To make this even more challenging, the LEDs and buttons are not labeled, as shown in Fig. <ref labelref="LABEL:fig:model"/>. Each LED blinks, perhaps sparsely. If you push some buttons, the blinking pattern of the LEDs sometimes seems to change and sometimes does not, and the relationship is not easy to track. Making sense of this huge matrix of buttons and LEDs is the task of what Piaget called the sensorimotor stage <cite class="ltx_citemacro_cite">[<bibref bibrefs="piaget1952origins" separator="," yyseparator=","/>]</cite>. The role of a brain model is to compose an output behavior vector given a sensor vector.</p>
    </para>
<!--  %**** plan.tex Line 25 **** -->    <figure inlist="lof" labels="LABEL:fig:model" placement="t!" xml:id="S3.F5">
      <tags>
        <tag>Fig. 5</tag>
        <tag role="refnum">5</tag>
        <tag role="typerefnum">Fig. 5</tag>
      </tags>
      <block align="center" depth="0.0pt" width="433.6pt">
        <graphics candidates="figures/model.pdf" graphic="figures/model.pdf" xml:id="S3.F5.g1"/>
      </block>
      <toccaption class="ltx_centering"><tag close=" ">5</tag><text fontsize="90%">This diagram illustrates our assumption about the learning mechanism.
Pressing buttons on the 1,000 by 1,000 button control panel affects the flickering patterns on the 1,000 by 1,000 LED panel.
The button panel represents the motor output vector, and the LED panel represents the sensory input vector.
Please note that there are no labels on those two vectors and the learning mechanism needs to learn how to operate the body. </text></toccaption>
      <caption class="ltx_centering"><tag close=": ">Fig. 5</tag><text fontsize="90%">This diagram illustrates our assumption about the learning mechanism.
Pressing buttons on the 1,000 by 1,000 button control panel affects the flickering patterns on the 1,000 by 1,000 LED panel.
The button panel represents the motor output vector, and the LED panel represents the sensory input vector.
Please note that there are no labels on those two vectors and the learning mechanism needs to learn how to operate the body. </text></caption>
    </figure>
    <subsection inlist="toc" xml:id="S3.SS1">
      <tags>
        <tag>III-A</tag>
        <tag role="refnum">III-A</tag>
        <tag role="typerefnum">§III-A</tag>
      </tags>
      <title><tag close=" ">III-A</tag><text font="italic">Curriculum for Development</text></title>
      <para xml:id="S3.SS1.p1">
        <p>To make learning easier, human infants develop along a curriculum that scaffolds the sensory and motor capabilities involved <cite class="ltx_citemacro_cite">[<bibref bibrefs="smith2018developing,turkewitz1982limitations,berlyne1960conflict,mirvis1991flow" separator="," yyseparator=","/>]</cite>. For example, in the fetal stage there are no visual inputs. The small subset of LEDs and buttons available at that stage can be isolated to master new skills such as sucking a thumb or rotating the body. In the first three months, babies are very near-sighted and have no mobility, which makes many visual signals stationary. At later stages, when babies learn to sit and grasp, they develop alternative learning strategies that use rotating viewpoints and the contingent speech of caregivers.</p>
      </para>
      <para xml:id="S3.SS1.p2">
        <p>In SEDRo, the input and output signals change as the agent develops. For example, the agent in the womb stage has no visual input signals; these become available after birth. For the first 3 months, however, the visual signals simulate nearsightedness. Muscles also develop over time: full force at an early stage is not enough for the agent to crawl or stand, but it increases steadily to afford walking in the later stages.</p>
      </para>
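      <para>
        <p>The staged availability of signals described above could be expressed as a small configuration function, sketched below in Python. The names <text font="typewriter">StageConfig</text> and <text font="typewriter">stage_config</text>, the convention that a negative age denotes the fetal stage, and the linear growth of muscle strength from 10% to 100% over 12 months are all hypothetical choices made for illustration; the text only states that vision is absent in the womb, near-sighted for the first 3 months, and that muscle force increases steadily.</p>
        <verbatim font="typewriter"><![CDATA[
```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    vision_enabled: bool    # False in the womb stage
    near_sighted: bool      # True during the first 3 months after birth
    max_force_scale: float  # fraction of full muscle strength (illustrative)


def stage_config(age_months: float) -> StageConfig:
    """Return a hypothetical sensorimotor configuration for a given age.

    A negative age_months denotes the fetal stage (an assumption of this sketch).
    """
    if age_months < 0:
        return StageConfig(vision_enabled=False, near_sighted=False,
                           max_force_scale=0.1)
    # Illustrative linear growth of strength, capped at full strength.
    scale = min(1.0, 0.1 + 0.9 * (age_months / 12.0))
    return StageConfig(vision_enabled=True, near_sighted=age_months < 3,
                       max_force_scale=scale)
```
]]></verbatim>
        <p>Keeping the schedule in one function like this would make it easy to swap in a capability-based schedule instead of a purely age-based one.</p>
      </para>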
      <figure inlist="lof" labels="LABEL:fig:vision" placement="b!" xml:id="S3.F6">
        <tags>
          <tag>Fig. 6</tag>
          <tag role="refnum">6</tag>
          <tag role="typerefnum">Fig. 6</tag>
        </tags>
        <block align="center" depth="0.0pt" width="390.3pt">
          <graphics candidates="figures/vision.pdf" graphic="figures/vision.pdf" xml:id="S3.F6.g1"/>
        </block>
        <toccaption class="ltx_centering"><tag close=" ">6</tag><text fontsize="90%">The visual system of the agents in SEDRo. The orange laser beam in (a) shows visual attention. Each eye has central vision (c and e) and peripheral vision (d and f). For debugging purposes, a main view is provided as shown in (b). </text></toccaption>
        <caption class="ltx_centering"><tag close=": ">Fig. 6</tag><text fontsize="90%">The visual system of the agents in SEDRo. The orange laser beam in (a) shows visual attention. Each eye has central vision (c and e) and peripheral vision (d and f). For debugging purposes, a main view is provided as shown in (b). </text></caption>
      </figure>
    </subsection>
    <subsection inlist="toc" xml:id="S3.SS2">
      <tags>
        <tag>III-B</tag>
        <tag role="refnum">III-B</tag>
        <tag role="typerefnum">§III-B</tag>
      </tags>
      <title><tag close=" ">III-B</tag><text font="italic">Specification for I/O Vectors</text></title>
      <para xml:id="S3.SS2.p1">
        <p>The sensory input consists of touch, vision, acceleration, gravity, and proprioception.
Visual attention plays an important role in the evaluation of non-verbal human infants, as explained in Sec. <ref labelref="LABEL:sec:evaluation"/>. In SEDRo, the artificial infant controls eye movement with three parameters: vertical angle, horizontal angle, and focal length.
To simulate central and peripheral vision with two eyes, four images are generated: one central and one peripheral image for each of the left and right eyes.
Central vision has an 8° field of view (FOV) and a higher resolution, while peripheral vision has a 100° FOV and a lower resolution. One additional image, intended to represent the reconstructed visual imagery in the brain, is provided for debugging.
The implementation uses five cameras with different settings in the game engine.</p>
      </para>
      <para xml:id="S3.SS2.p2">
        <p>The touch sensors are spread all over the body, but their density varies. The face, lips, and hands have a higher density of touch sensors than the torso. In the current version, there are 2,110 touch sensors. Each body part is segmented into meshes, and each touch sensor provides a binary feature that indicates whether contact has been made.
We have implemented the touch feature using collision information provided by the game engine.</p>
      </para>
      <para xml:id="S3.SS2.p3">
        <p>The motor output vector consists of muscle torques that drive 53 motors, including 9 degrees of freedom (DOF) in each hand, inspired by iCub <cite class="ltx_citemacro_cite">[<bibref bibrefs="metta2008icub" separator="," yyseparator=","/>]</cite>.
The main loop of the environment runs at 100 steps per second, a rate motivated by the human brain <cite class="ltx_citemacro_cite">[<bibref bibrefs="hawkins2004on" separator="," yyseparator=","/>]</cite>. At each step, the agent reads a sensory input vector and writes a motor output vector.</p>
      </para>
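      <para>
        <p>The read-sense, write-motor cycle can be sketched as follows. This is only an illustration under stated assumptions: <text font="typewriter">StubInfantEnv</text>, <text font="typewriter">run</text>, and <text font="typewriter">zero_policy</text> are hypothetical names, the stub returns random touch bits instead of simulated contacts, and the Gym-like <text font="typewriter">reset</text>/<text font="typewriter">step</text> methods do not reproduce the actual SEDRo interface. Only the sizes (2,110 touch sensors, 53 motors) and the rate of 100 steps per second come from the text.</p>
        <verbatim font="typewriter"><![CDATA[
```python
import random

N_TOUCH = 2110          # binary touch sensors (from the text)
N_MOTORS = 53           # motor torques (from the text)
STEPS_PER_SECOND = 100  # main-loop rate (from the text)


class StubInfantEnv:
    """Hypothetical stand-in with a Gym-like interface; not the SEDRo API."""

    def __init__(self, seed=0):
        self._rng = random.Random(seed)

    def _observe(self):
        # Only touch is stubbed; vision, acceleration, gravity, and
        # proprioception are omitted from this sketch.
        return {"touch": [self._rng.randint(0, 1) for _ in range(N_TOUCH)]}

    def reset(self):
        return self._observe()

    def step(self, torques):
        if len(torques) != N_MOTORS:
            raise ValueError("expected one torque per motor")
        # No reward is returned: the environment does not award rewards.
        return self._observe()


def zero_policy(observation):
    """Trivial placeholder policy that applies no torque."""
    return [0.0] * N_MOTORS


def run(env, policy, seconds):
    """Run the sense-act loop for the given amount of simulated time."""
    observation = env.reset()
    steps = int(seconds * STEPS_PER_SECOND)
    for _ in range(steps):
        observation = env.step(policy(observation))
    return steps, observation
```
]]></verbatim>
        <p>For example, <text font="typewriter">run(StubInfantEnv(), zero_policy, 0.5)</text> performs 50 steps, one per 10 ms of simulated time.</p>
      </para>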
    </subsection>
    <subsection inlist="toc" xml:id="S3.SS3">
      <tags>
        <tag>III-C</tag>
        <tag role="refnum">III-C</tag>
        <tag role="typerefnum">§III-C</tag>
      </tags>
      <title><tag close=" ">III-C</tag><text font="italic">Social Event Scenario</text></title>
      <para xml:id="S3.SS3.p1">
        <p>Social interaction plays an important role in human development. In SEDRo, we are building scenarios for social interaction according to the following process:</p>
        <enumerate xml:id="S3.I1">
          <item xml:id="S3.I1.i1">
            <tags>
              <tag>1.</tag>
              <tag role="refnum">1</tag>
              <tag role="typerefnum">item 1</tag>
            </tags>
            <para xml:id="S3.I1.i1.p1">
              <p>We start by choosing a meaningful interaction pattern by reviewing developmental psychology literature.</p>
            </para>
          </item>
          <item xml:id="S3.I1.i2">
            <tags>
              <tag>2.</tag>
              <tag role="refnum">2</tag>
              <tag role="typerefnum">item 2</tag>
            </tags>
            <para xml:id="S3.I1.i2.p1">
              <p>We write a scenario for the chosen interaction.</p>
            </para>
          </item>
          <item xml:id="S3.I1.i3">
            <tags>
              <tag>3.</tag>
              <tag role="refnum">3</tag>
              <tag role="typerefnum">item 3</tag>
            </tags>
            <para xml:id="S3.I1.i3.p1">
              <p>Actors perform that scenario, and we capture their behaviors using a motion capture facility.</p>
            </para>
          </item>
          <item xml:id="S3.I1.i4">
            <tags>
              <tag>4.</tag>
              <tag role="refnum">4</tag>
              <tag role="typerefnum">item 4</tag>
            </tags>
            <para xml:id="S3.I1.i4.p1">
              <p>We add the recorded scenario into SEDRo along with a schedule for the event.</p>
            </para>
          </item>
        </enumerate>
      </para>
      <para xml:id="S3.SS3.p2">
        <p>Building a library of social events is time-consuming, and we anticipate that the SEDRo environment will expand over the coming years. We will version SEDRo so that research using it is reproducible.</p>
      </para>
<!--  %**** plan.tex Line 75 **** -->      <table inlist="lot" xml:id="S3.T1">
        <tags>
          <tag><text fontsize="90%">TABLE I</text></tag>
          <tag role="refnum"><text fontsize="90%">I</text></tag>
          <tag role="typerefnum"><text fontsize="90%">TABLE I</text></tag>
        </tags>
        <toccaption><tag close=" "><text fontsize="90%">I</text></tag><text fontsize="90%">Summary of Developmental Milestones (M represents months after birth)</text></toccaption>
        <caption fontsize="90%"><tag close=": ">TABLE I</tag>Summary of Developmental Milestones (M represents months after birth)</caption>
        <tabular class="ltx_centering" vattach="middle">
          <tr>
            <td align="center" border="l rr t"><text fontsize="90%">Stage</text></td>
            <td align="center" border="r t"><text fontsize="90%">Fetus Stage</text></td>
            <td align="center" border="r t"><text fontsize="90%">Immobile Stage<break/>(Less than 3 Months)</text></td>
            <td align="center" border="r t"><text fontsize="90%">Crawling Stage<break/>(4-10 Months)</text></td>
            <td align="center" border="r t"><text fontsize="90%">Walking Stage<break/>(11-18 Months)</text></td>
          </tr>
          <tr>
            <td align="center" border="l rr tt"><text fontsize="90%">Description</text></td>
            <td align="center" border="r tt"><text fontsize="90%">No vision</text></td>
            <td align="center" border="r tt"><text fontsize="90%">Near-sighted vision.</text></td>
            <td align="center" border="r tt"><text fontsize="90%">Fully developed vision. Sit and interact with objects. Interact with other persons by babbling.</text></td>
            <td align="center" border="r tt"><text fontsize="90%">Fully developed muscles. First words.</text></td>
          </tr>
          <tr>
            <td align="center" border="l rr t"><text fontsize="90%">Vision</text></td>
            <td border="r t"/>
            <td align="center" border="r t"><text fontsize="90%">Visual expectation (0 vs 3M) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="haith1988expectation,canfield1991young,adler2008infants,wentworth2002spatiotemporal" separator="," yyseparator=","/><text fontsize="90%">]</text></cite><text fontsize="90%">, face preference (1 vs 2M) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="maurer1976developmental,haan2002specialization,maurer1981infants,morton1991conspec" separator="," yyseparator=","/><text fontsize="90%">]</text></cite><text fontsize="90%">, face preference (2-3 days) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="bushnell2001mother,bushneil1989neonatal" separator="," yyseparator=","/><text fontsize="90%">]</text></cite><text fontsize="90%">, gender detection (0 vs 3M) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="leinbach1993categorical,quinn2002representation" separator="," yyseparator=","/><text fontsize="90%">]</text></cite><text fontsize="90%">, depth perception (0 vs 2M) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="campos1970cardiac" separator="," yyseparator=","/><text fontsize="90%">]</text></cite></td>
            <td align="center" border="r t"><text fontsize="90%">Visual scan pattern (2 vs 11 weeks) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="bronson1991infant" separator="," yyseparator=","/><text fontsize="90%">]</text></cite><text fontsize="90%">, tracking occluded objects (4 vs 6 Months) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="johnson2003development" separator="," yyseparator=","/><text fontsize="90%">]</text></cite><text fontsize="90%">, lost ability to distinguish faces of different gender (3 vs 9M) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="quinn2002representation" separator="," yyseparator=","/><text fontsize="90%">]</text></cite></td>
            <td align="center" border="r t"><text fontsize="90%">Novelty preference inversion (6-12 Months) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="roder2000infants,colombo2006emergence" separator="," yyseparator=","/><text fontsize="90%">]</text></cite></td>
          </tr>
          <tr>
            <td align="center" border="l rr t"><text fontsize="90%">Joint attention</text></td>
            <td border="r t"/>
            <td align="center" border="r t"><text fontsize="90%">Left/right attention manipulation</text></td>
            <td align="center" border="r t"><text fontsize="90%">Gaze angle detection, fixation of first salient object</text></td>
            <td align="center" border="r t"><text fontsize="90%">Mutual gaze through eye contact </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="kaplan2006challenges" separator="," yyseparator=","/><text fontsize="90%">]</text></cite><text fontsize="90%">, fixation of any salient object, declarative pointing, drawing attention</text></td>
          </tr>
          <tr>
            <td align="center" border="l rr t"><text fontsize="90%">Motor</text></td>
            <td align="center" border="r t"><text fontsize="90%">Hand/face contacts (11 gestation weeks) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="mori2010human" separator="," yyseparator=","/><text fontsize="90%">]</text></cite></td>
            <td align="center" border="r t"><text fontsize="90%">Open hand grasping </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="von1984developmental,white1964observations" separator="," yyseparator=","/><text fontsize="90%">]</text></cite></td>
            <td align="center" border="r t"><text fontsize="90%">Recognizing own motion vs others (3 vs 5 Months) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="bahrick1985detection,gergely1999early" separator="," yyseparator=","/><text fontsize="90%">]</text></cite></td>
            <td align="center" border="r t"><text fontsize="90%">Partial integration of visual and motor skill (9 Months) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="berthier2001using" separator="," yyseparator=","/><text fontsize="90%">]</text></cite></td>
          </tr>
          <tr>
            <td align="center" border="l rr t"><text fontsize="90%">Language</text></td>
            <td border="r t"/>
            <td align="center" border="r t"><text fontsize="90%">Differentiate mother tongue and foreign language </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="moon2013language" separator="," yyseparator=","/><text fontsize="90%">]</text></cite><text fontsize="90%">, marginal babbling</text></td>
            <td align="center" border="r t"><text fontsize="90%">Canonical babbling</text></td>
            <td align="center" border="r t"><text fontsize="90%">Intentional gestures, single words, word-gesture combination </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="hoff2013language" separator="," yyseparator=","/><text fontsize="90%">]</text></cite></td>
          </tr>
          <tr>
            <td align="center" border="l rr t"><text fontsize="90%">Reasoning</text></td>
            <td border="r t"/>
            <td align="center" border="r t"><text fontsize="90%">Self-perception at mirror (3 Months) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="courage2002infant" separator="," yyseparator=","/><text fontsize="90%">]</text></cite></td>
            <td align="center" border="r t"><text fontsize="90%">Fear of heights (after crawling) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="campos1992early,kermoian1988locomotor" separator="," yyseparator=","/><text fontsize="90%">]</text></cite><text fontsize="90%">, allocentric spatial frame of reference (9 Months) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="acredolo1978development,acredolo1980developmental,acredolo1984role" separator="," yyseparator=","/><text fontsize="90%">]</text></cite></td>
            <td align="center" border="r t"><text fontsize="90%">Mark test (15 Months) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="lewis2012social" separator="," yyseparator=","/><text fontsize="90%">]</text></cite><text fontsize="90%">, adapted use of hook (12 Months) </text><cite class="ltx_citemacro_cite"><text fontsize="90%">[</text><bibref bibrefs="van1994affordances" separator="," yyseparator=","/><text fontsize="90%">]</text></cite></td>
          </tr>
          <tr>
            <td align="center" border="l rr t"><break/></td>
            <td border="t"/>
            <td border="t"/>
            <td border="t"/>
            <td border="t"/>
          </tr>
        </tabular>
      </table>
    </subsection>
    <subsection inlist="toc" labels="LABEL:sec:evaluation" xml:id="S3.SS4">
      <tags>
        <tag>III-D</tag>
        <tag role="refnum">III-D</tag>
        <tag role="typerefnum">§III-D</tag>
      </tags>
      <title><tag close=" ">III-D</tag><text font="italic">Evaluation Framework for Non-verbal Agents</text></title>
      <para xml:id="S3.SS4.p1">
        <p>We have developed an evaluation framework for the development of skills in multiple domains by simulating established experiments from developmental psychology.
There are multiple developmental milestones in multiple skill domains.
At each stage, there are key milestones that the agent needs to satisfy.
Researchers may choose to replay relevant experiences if the agent does not achieve those milestones.
Consequently, the agent’s experience adapts to its current capability rather than following a fixed time schedule.
Fig. <ref labelref="LABEL:fig:evaluation"/> shows example tasks from developmental psychology and screenshots of preliminary prototypes simulating those experiments.</p>
      </para>
      <para xml:id="S3.SS4.p2">
        <p>In the current version, we have developed a visual expectation paradigm experiment with a moving rod.
The visual attention pattern over the moving rod is available as a separate channel in the Gym interface.
Table I summarizes our plan for evaluation experiments in domains such as vision, motor, attention, and reasoning.
Each evaluation targets a behavior pattern that is expected to differ between two stages of human development. For example, two-month-old infants cannot predict a regular pattern, but at 3.5 months infants exhibit anticipatory eye movements 200 ms before the actual pattern appears, which demonstrates visual expectation <cite class="ltx_citemacro_cite">[<bibref bibrefs="haith1988expectation,canfield1991young,adler2008infants,wentworth2002spatiotemporal" separator="," yyseparator=","/>]</cite>.
We leverage such known developmental milestones to develop suites of simulated experiments for evaluating the development of the artificial agent.
The evaluation will conduct multiple experiments and compare the results with those of human participants.</p>
      </para>
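      <para>
        <p>One way such a visual expectation experiment might be scored is sketched below in Python. The function <text font="typewriter">anticipation_rate</text>, its input format (per-trial stimulus onset times and the times at which gaze first reached the upcoming stimulus location), and the rule that a trial counts as anticipatory when gaze arrives at least 200 ms before onset are assumptions made for illustration; they are not the scoring procedure of SEDRo or of the cited studies.</p>
        <verbatim font="typewriter"><![CDATA[
```python
def anticipation_rate(onsets_ms, gaze_arrivals_ms, lead_ms=200.0):
    """Fraction of trials with an anticipatory gaze shift (hypothetical metric).

    onsets_ms: stimulus onset time of each trial, in milliseconds.
    gaze_arrivals_ms: time at which gaze first reached the upcoming stimulus
        location in each trial, or None if it never did.
    lead_ms: how far ahead of onset the gaze must arrive to count.
    """
    if len(onsets_ms) != len(gaze_arrivals_ms):
        raise ValueError("one gaze arrival entry is required per trial")
    if not onsets_ms:
        return 0.0
    anticipatory = sum(
        1
        for onset, arrival in zip(onsets_ms, gaze_arrivals_ms)
        if arrival is not None and onset - arrival >= lead_ms
    )
    return anticipatory / len(onsets_ms)
```
]]></verbatim>
        <p>A rate computed this way for the artificial agent at two developmental stages could then be compared with the pattern reported for human infants, for instance a low rate at the earlier stage and a higher rate at the later one.</p>
      </para>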
    </subsection>
    <subsection inlist="toc" xml:id="S3.SS5">
      <tags>
        <tag>III-E</tag>
        <tag role="refnum">III-E</tag>
        <tag role="typerefnum">§III-E</tag>
      </tags>
      <title><tag close=" ">III-E</tag><text font="italic">Implementation Detail</text></title>
      <para xml:id="S3.SS5.p1">
        <p>We use Unity3D 2018.4 to develop the environment. The Unity ML-Agents toolkit <cite class="ltx_citemacro_cite">[<bibref bibrefs="juliani2018unity" separator="," yyseparator=","/>]</cite> is used to implement an OpenAI Gym interface <cite class="ltx_citemacro_cite">[<bibref bibrefs="brockman2016openai" separator="," yyseparator=","/>]</cite>. To record the behaviors of the actors, we use Motive:Body software with an OptiTrack motion capture system with ten Prime 17W cameras.</p>
      </para>
    </subsection>
  </section>
  <section inlist="toc" xml:id="S4">
    <tags>
      <tag>IV</tag>
      <tag role="refnum">IV</tag>
      <tag role="typerefnum">§IV</tag>
    </tags>
    <title><tag close=" ">IV</tag><text font="smallcaps">Discussion</text></title>
    <para xml:id="S4.p1">
      <p>Because SEDRo is a work in progress, we discuss here its limitations, a few alternative approaches, and future work.</p>
    </para>
    <subsection inlist="toc" xml:id="S4.SS1">
      <tags>
        <tag>IV-A</tag>
        <tag role="refnum">IV-A</tag>
        <tag role="typerefnum">§IV-A</tag>
      </tags>
      <title><tag close=" ">IV-A</tag><text font="italic">Limitations</text></title>
      <para xml:id="S4.SS1.p1">
        <p>A major limitation of our work is the lack of back-and-forth interactive conversation between the caregivers and the infant agent.
Currently, only two types of conversations are supported:</p>
      </para>
      <para xml:id="S4.SS1.p2">
        <itemize xml:id="S4.I1">
          <item xml:id="S4.I1.i1">
            <tags>
              <tag>•</tag>
              <tag role="typerefnum">1st item</tag>
            </tags>
            <para xml:id="S4.I1.i1.p1">
              <p>Caregiver initiated conversation that will be played according to a pre-determined schedule, and</p>
            </para>
          </item>
          <item xml:id="S4.I1.i2">
            <tags>
              <tag>•</tag>
              <tag role="typerefnum">2nd item</tag>
            </tags>
            <para xml:id="S4.I1.i2.p1">
              <p>Contingent response that will be played conditioned on infant agent behaviors such as cooing or touching toys.</p>
            </para>
          </item>
        </itemize>
      </para>
      <para xml:id="S4.SS1.p3">
        <p>Although building diverse scenarios for these two conversation types is itself a challenge because of the sheer number of experiences required, they would not provide enough experience for language acquisition beyond the first-year level.
One potential approach to overcome this limitation would be to ask humans to interact with the artificial agent using a virtual reality technique <cite class="ltx_citemacro_cite">[<bibref bibrefs="murnane2019learning,murnane2019virtual" separator="," yyseparator=","/>]</cite>.
Another option would be to use a physically embodied robot and ask humans to take care of it.
We claim that SEDRo can be used to test cognitive architectures before physical robot experiments become necessary, thereby reducing the number of candidate architectures for the expensive physical robot experiments.</p>
      </para>
      <para xml:id="S4.SS1.p4">
        <p>In SEDRo, we simulate the human infant experiences, but an alternative is to use a completely artificial environment that is not relevant to human experience but still requires skills in many domains.
For example, emergent communication behaviors were observed in the reinforcement learning environment with multiple agents <cite class="ltx_citemacro_cite">[<bibref bibrefs="eccles2019biases,cao2018emergent,das2018tarmac,foerster2016learning" separator="," yyseparator=","/>]</cite>.
Although such research might yield clues about the underlying human learning mechanism, it might be challenging to apply the results to human-robot interaction because language is a set of arbitrary symbols shared between members <cite class="ltx_citemacro_cite">[<bibref bibrefs="kottur2017natural" separator="," yyseparator=","/>]</cite>.
</p>
      </para>
      <para xml:id="S4.SS1.p5">
        <p>Another possibility is to transform existing resources into an open-ended learning environment.
Using YouTube videos to create a diverse experience is one example.
However, Smith and Slone pointed out that these kinds of approaches use shallow information about a lot of things, whereas human infants begin by learning a lot about a few things <cite class="ltx_citemacro_cite">[<bibref bibrefs="smith2017developmental" separator="," yyseparator=","/>]</cite>.
In addition, visual information from the first years of human life constitutes an egocentric view of the world. The allocentric view emerges only later, after 12 months of age.
Furthermore, humans rely heavily on social interactions to learn.
While infants can learn a language by being tutored by an instructor, they cannot learn by watching a recorded video of the same tutoring <cite class="ltx_citemacro_cite">[<bibref bibrefs="kuhl2007speech" separator="," yyseparator=","/>]</cite>.
Therefore, we think that certain necessary skills have to be acquired before learning from those resources becomes feasible.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S4.SS2">
      <tags>
        <tag>IV-B</tag>
        <tag role="refnum">IV-B</tag>
        <tag role="typerefnum">§IV-B</tag>
      </tags>
      <title><tag close=" ">IV-B</tag><text font="italic">Conclusion and Future Work</text></title>
      <para xml:id="S4.SS2.p1">
        <p>We are building SEDRo, an environment that simulates the early experiences of a human from the stage of a fetus to 12 months of age.
The open-ended and unsupervised nature of the environment requires agents to avoid fitting to specific tasks.
To evaluate the development of intelligent behaviors of non-verbal artificial agents, a set of experiments in developmental psychology will be simulated in SEDRo.
We expect researchers in the AI and robotics community to discover the learning mechanism for artificial general intelligence by testing different cognitive architectures using the open-ended learning environment developed in our project.</p>
      </para>
    </subsection>
  </section>
  <bibliography bibstyle="IEEEtran" citestyle="numbers" files="references" sort="true" xml:id="bib">
    <title>References</title>
  </bibliography>
</document>
