<?xml version="1.0" encoding="UTF-8"?>
<?latexml searchpaths="/home/japhy/scienceReplication.artiswrong.com/paper_files/arxiv/2003.10469/latex_extracted"?>
<!--  %updated April 2002 by Antje Endemann --><!--  %Based on CVPR 07 and LNCS, with modifications by DAF, AZ and elle, 2008 and AA, 2010, and CC, 2011; TT, 2014; AAS, 2016; AAS, 2020 --><?latexml class="llncs"?>
<?latexml package="graphicx"?>
<?latexml package="comment"?>
<?latexml package="amsmath,amssymb"?>
<?latexml package="acronym"?>
<?latexml package="color"?>
<?latexml package="multibib" options="resetlabels,labeled"?>
<?latexml package="footmisc" options="symbol"?>
<?latexml package="geometry" options="width=122mm,left=12mm,paperwidth=146mm,height=193mm,top=12mm,paperheight=217mm"?>
<?latexml package="tabularx, lipsum"?>
<?latexml package="hyperref"?>
<!--  %“usepackage–float˝ --><?latexml package="xr"?>
<?latexml RelaxNGSchema="LaTeXML"?>
<document xmlns="http://dlmf.nist.gov/LaTeXML" class="ltx_authors_1line">
  <resource src="LaTeXML.css" type="text/css"/>
  <resource src="ltx-article.css" type="text/css"/>
  <title>Learning Object Permanence from Video</title>
  <creator role="author">
    <personname>Aviv Shamsian  Ofri Kleinfeld<sup>*</sup>   Amir Globerson  Gal Chechik</personname>
    <contact role="thanks">equal contribution</contact>
    <contact role="institute">1Bar-Ilan University, Ramat-Gan, Israel </contact>
    <contact role="emailmark">1</contact>
    <contact role="institute">1Bar-Ilan University, Ramat-Gan, Israel </contact>
    <contact role="emailmark">1</contact>
    <contact role="institute">2Tel Aviv University, Tel Aviv, Israel </contact>
    <contact role="emailmark">2</contact>
    <contact role="institute">1Bar-Ilan University, Ramat-Gan, Israel </contact>
    <contact role="emailmark">1</contact>
    <contact role="institute">3NVIDIA Research, Tel-Aviv, Israel
<ref class="ltx_url" font="typewriter" href="https://chechiklab.biu.ac.il/~avivshamsian/OP/OP_HTML.html">https://chechiklab.biu.ac.il/~avivshamsian/OP/OP_HTML.html</ref></contact>
    <contact role="emailmark">3</contact>
  </creator>
  <abstract name="Abstract">
    <p><emph font="italic">Object Permanence</emph> allows people to reason about the location of non-visible objects, by understanding that they continue to exist even when not perceived directly. Object Permanence is critical for building a model of the world, since objects in natural visual scenes dynamically occlude and contain each other. Intensive studies in developmental psychology suggest that object permanence is a challenging task that is learned through extensive experience.</p>
    <p>Here we introduce the setup of learning Object Permanence from labeled videos. We explain why this learning problem should be dissected into four components, where objects are (1) visible, (2) occluded, (3) contained by another object and (4) carried by a containing object.
The fourth subtask, where a target object is carried by a containing object, is particularly challenging because it requires a system to reason about a moving location of an invisible object. We then present a unified deep architecture that learns to predict object location under these four scenarios. We evaluate the architecture and system on a new dataset based on CATER, and find that it outperforms previous localization methods and various baselines.</p>
<!--  %“keywords–Object Permanence, Reasoning, Video Analysis˝ -->  </abstract>
  <section inlist="toc" xml:id="S1">
    <tags>
      <tag>1</tag>
      <tag role="autoref">section 1</tag>
      <tag role="refnum">1</tag>
      <tag role="typerefnum">§1</tag>
    </tags>
    <title><tag close=" ">1</tag>Introduction</title>
    <para xml:id="S1.p1">
      <p>Understanding dynamic natural scenes is often challenged by objects that contain or occlude each other. To reason correctly about such visual scenes, systems need to develop a sense of <text font="italic">Object Permanence</text> (OP) <cite class="ltx_citemacro_cite">[<bibref bibrefs="Piaget1954TheCO" separator="," yyseparator=","/>]</cite>. Namely, the understanding that objects continue to exist and preserve their physical characteristics, even if they are not perceived directly. For example, we want systems to learn that a pedestrian occluded by a truck may emerge from its other side, but that a person entering a car would “disappear” from the scene.</p>
    </para>
    <para xml:id="S1.p2">
      <p>The concept of OP has received substantial attention in the cognitive development literature. Piaget hypothesized that infants develop OP relatively late (at two years of age), suggesting that it is a challenging task that requires deep modelling of the world based on sensory-motor interaction with objects. Later evidence showed that children learn OP for occluded targets early <cite class="ltx_citemacro_cite">[<bibref bibrefs="aguiar1999,baillargeon1991object" separator="," yyseparator=","/>]</cite>. Still, only at a later age do children develop an understanding of objects that are contained by other objects <cite class="ltx_citemacro_cite">[<bibref bibrefs="smitsman2009significance" separator="," yyseparator=","/>]</cite>. Based on these experiments, we hypothesize that reasoning about the location of non-visible objects may be much harder when they are carried inside other moving objects.</p>
    </para>
    <figure inlist="lof" labels="LABEL:fig:object_per" placement="t" xml:id="S1.F1">
      <tags>
        <tag>Figure 1</tag>
        <tag role="autoref">Figure 1</tag>
        <tag role="refnum">1</tag>
        <tag role="typerefnum">Figure 1</tag>
      </tags>
      <p>(a)                 (b)                     (c)                  (d)</p>
      <graphics candidates="Figures/four_types_reason.png" class="ltx_centering" graphic="Figures/four_types_reason.png" options="width=385.9218pt" xml:id="S1.F1.g1"/>
      <toccaption class="ltx_centering"><tag close=" ">1</tag>Inferring object location in rich dynamic scenes involves four different tasks, and two different types of reasoning. (a) The target, a red ball, is fully visible. (b) The target is fully or partially occluded by the static cube. (c) The target is located inside the cube and fully covered. (d) The non-visible target is located inside another moving object; its location changes even though it is not directly visible.</toccaption>
      <caption class="ltx_centering"><tag close=": ">Figure 1</tag>Inferring object location in rich dynamic scenes involves four different tasks, and two different types of reasoning. (a) The target, a red ball, is fully visible. (b) The target is fully or partially occluded by the static cube. (c) The target is located inside the cube and fully covered. (d) The non-visible target is located inside another moving object; its location changes even though it is not directly visible.</caption>
    </figure>
    <para xml:id="S1.p3">
      <p>Reasoning about the location of a target object in a video scene involves four different subtasks of increasing complexity (Figure <ref labelref="LABEL:fig:object_per"/>). These four tasks are based on the state of the target object, depending on whether it is (1) visible, (2) occluded, (3) contained or (4) carried. The <text font="italic">visible</text> case is perhaps the simplest task, and corresponds to object detection, where one aims to localize an object that is visible. Detection has been studied extensively and is viewed as a key component in computer vision systems.
The second task, <text font="italic">occlusion</text>, is to detect a target object that becomes transiently invisible behind a moving occluding object (e.g., a bicycle behind a truck). Tracking objects under occlusion can be very challenging, especially with long-term occlusions <cite class="ltx_citemacro_cite">[<bibref bibrefs="mojtaba2019deep,Kristan2018TheSV,fan2019lasot,wu2015object" separator="," yyseparator=","/>]</cite>.</p>
    </para>
    <para xml:id="S1.p4">
      <p>Third, in a <text font="italic">containment</text> scenario, a target object may be located inside another container object and become non-visible <cite class="ltx_citemacro_cite">[<bibref bibrefs="ullman2019model" separator="," yyseparator=","/>]</cite> (e.g., a person enters a store).
Finally, the fourth case of a <text font="italic">carried</text> object is arguably the most challenging task. It requires inferring the location of a non-visible object located inside a moving containing object (e.g., a person enters a taxi that leaves the scene). Among the challenging aspects of this task are the need to keep a representation of which object should be tracked at every time point and the need to “switch states” dynamically through time. This task has received little attention in the computer vision community so far.</p>
    </para>
    <para xml:id="S1.p5">
      <p>We argue that reasoning about the location of a non-visible object should address two distinct and fundamentally different cases: occlusion and containment. First, to <text font="italic">localize an occluded object</text>, an agent has to build an internal state that models how the object moves. For example, when we observe a person walking in the street, we can predict her ever-changing location even if occluded by a large bus. In this mode, our reasoning mechanism keeps attending to the person and keeps inferring her location from past data. Second, <text font="italic">localizing contained objects</text> is fundamentally different. It requires a reasoning mechanism that switches to attend to the containing object, which is visible. Here, even though the object of interest is not visible, its location can be accurately inferred from the location of the visible containing object. We demonstrate below that incorporating these two reasoning mechanisms leads to more accurate localization in all four subtasks.</p>
    </para>
    <para xml:id="S1.p6">
      <p>Specifically, we develop a unified approach for learning all four object localization subtasks in video. We design a deep architecture that learns to localize objects that may be visible, occluded, contained or carried. Our architecture consists of two reasoning modules designed to reason about (1) carried or contained targets, and (2) occluded or visible targets. The first reasoning component is explicitly designed to answer the question <text font="italic">“Which object should be tracked now?”</text>. It does so by using an LSTM to weight the perceived locations of the objects in the scene. The second reasoning component leverages the information about which object should be tracked and previous known locations of the target to localize the target, even if it is occluded. Finally, we also introduce a dataset that is based on videos from CATER <cite class="ltx_citemacro_cite">[<bibref bibrefs="girdhar2019cater" separator="," yyseparator=","/>]</cite>, enriched with new annotations about task type and about ground-truth location of all objects.</p>
    </para>
    <para xml:id="S1.p7">
      <p>Our main novel contributions are: (1) We conceptualize that localizing non-visible objects requires two types of reasoning: about occluded objects and about carried ones. (2) We define four subtypes of localization tasks and introduce annotations for the CATER dataset to facilitate evaluating each of these subtasks. (3) We describe a new unified architecture for all four subtasks, which can capture the two types of reasoning, and we show empirically that it outperforms multiple strong baselines.</p>
    </para>
  </section>
  <section inlist="toc" xml:id="S2">
    <tags>
      <tag>2</tag>
      <tag role="autoref">section 2</tag>
      <tag role="refnum">2</tag>
      <tag role="typerefnum">§2</tag>
    </tags>
    <title><tag close=" ">2</tag>Related Work</title>
    <subsection inlist="toc" xml:id="S2.SS1">
      <tags>
        <tag>2.1</tag>
        <tag role="autoref">subsection 2.1</tag>
        <tag role="refnum">2.1</tag>
        <tag role="typerefnum">§2.1</tag>
      </tags>
      <title><tag close=" ">2.1</tag>Relational Reasoning in Synthetic Video Datasets</title>
      <para xml:id="S2.SS1.p1">
        <p>Recently, several studies provided synthetic datasets to explore object interaction and reasoning. Many of these studies are based on CLEVR <cite class="ltx_citemacro_cite">[<bibref bibrefs="johnson2017clevr" separator="," yyseparator=","/>]</cite>, a synthetic dataset designed for visual reasoning through visual question answering. CLEVRER <cite class="ltx_citemacro_cite">[<bibref bibrefs="yi2019clevrer" separator="," yyseparator=","/>]</cite> extended CLEVR to video, focusing on the causal structures underlying object interactions. It demonstrated that visual reasoning models that thrive on perception-based tasks often perform poorly in causal reasoning tasks.</p>
      </para>
      <para xml:id="S2.SS1.p2">
        <p>Most relevant for our paper, CATER <cite class="ltx_citemacro_cite">[<bibref bibrefs="girdhar2019cater" separator="," yyseparator=","/>]</cite> is a dataset for reasoning about object action and interactions in video. One of the three tasks defined in CATER, the <text font="italic">snitch localization</text> task, is closely related to the OP problem studied here. It is defined as localizing a target <text font="italic">at the end of a video</text>, where the target is usually visible.
Our work refines their setup by learning to localize the target throughout the full video and by breaking prediction down into four types of localization tasks. As a result, we provide fine-grained insight into the architectures and reasoning required for solving the complex localization task.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S2.SS2">
      <tags>
        <tag>2.2</tag>
        <tag role="autoref">subsection 2.2</tag>
        <tag role="refnum">2.2</tag>
        <tag role="typerefnum">§2.2</tag>
      </tags>
      <title><tag close=" ">2.2</tag>Architectures for Video Reasoning</title>
      <para xml:id="S2.SS2.p1">
        <p>Several recent papers studied the effectiveness of CNN-based architectures for video action recognition. Many approaches use 3D convolutions for spatiotemporal feature learning <cite class="ltx_citemacro_cite">[<bibref bibrefs="carreira2017quo,tran2015learning" separator="," yyseparator=","/>]</cite> and separate the spatial and temporal modalities by adding optical flow as a second stream <cite class="ltx_citemacro_cite">[<bibref bibrefs="feichtenhofer2016spatiotemporal,simonyan2014two" separator="," yyseparator=","/>]</cite>. These models are computationally
expensive because 3D convolution kernels may be costly to compute. As a result, they may be limited to sequences of 20-30 frames <cite class="ltx_citemacro_cite">[<bibref bibrefs="carreira2017quo,tran2015learning" separator="," yyseparator=","/>]</cite>. In <cite class="ltx_citemacro_cite">[<bibref bibrefs="zhou2017temporalrelation" separator="," yyseparator=","/>]</cite> it was proposed to sparsely sample video frames to capture temporal relations in action recognition datasets. However, sparse sampling may be insufficient for long occlusion and containment sequences, which are the core of our OP setting.</p>
      </para>
      <para xml:id="S2.SS2.p2">
        <p>Another strategy for temporal aggregation is to use recurrent architectures like LSTM <cite class="ltx_citemacro_cite">[<bibref bibrefs="hochreiter1997long" separator="," yyseparator=","/>]</cite>, connecting the underlying CNN output along the temporal dimension <cite class="ltx_citemacro_cite">[<bibref bibrefs="yue2015beyond" separator="," yyseparator=","/>]</cite>. <cite class="ltx_citemacro_cite">[<bibref bibrefs="gao2017video,song2017end,sharma2015action" separator="," yyseparator=","/>]</cite> combined LSTM with spatial attention, learning to attend to those parts of the video frame that are relevant for the task as the video progresses. In Section <ref labelref="LABEL:Experiments"/> we experiment with a spatial attention module, which learns to dynamically focus on relevant objects.</p>
      </para>
      <subsubsection inlist="toc" xml:id="S2.SS2.SSS1">
        <tags>
          <tag>2.2.1</tag>
          <tag role="autoref">subsubsection 2.2.1</tag>
          <tag role="refnum">2.2.1</tag>
          <tag role="typerefnum">§2.2.1</tag>
        </tags>
        <title><tag close=" ">2.2.1</tag>Tracking with Object Occlusion.</title>
        <para xml:id="S2.SS2.SSS1.p1">
          <p>A large body of work has been devoted to tracking objects <cite class="ltx_citemacro_cite">[<bibref bibrefs="mojtaba2019deep" separator="," yyseparator=","/>]</cite>. For objects under complex occlusion like carrying, early work studied tracking using classical techniques and without deep learning methods.
For instance, <cite class="ltx_citemacro_cite">[<bibref bibrefs="huang2005tracking,papadourakis2010multiple" separator="," yyseparator=","/>]</cite> used the idea of object permanence to track objects through long-term occlusions. They located objects using adaptive appearance models, spatial distributions
and inter-occlusion relationships.
In contrast, the approach presented in this paper focuses on a single deep differentiable model to learn motion reasoning end-to-end.
<cite class="ltx_citemacro_cite">[<bibref bibrefs="grabner2010tracking" separator="," yyseparator=","/>]</cite> succeeds in tracking occluded targets by learning how their movement is coupled with the movement of other visible objects.
Unfortunately, the dataset studied here, CATER <cite class="ltx_citemacro_cite">[<bibref bibrefs="girdhar2019cater" separator="," yyseparator=","/>]</cite>, has weak object-object motion coupling by design. Specifically, when measuring the
correlation between the movement of the target and other objects (as in <cite class="ltx_citemacro_cite">[<bibref bibrefs="grabner2010tracking" separator="," yyseparator=","/>]</cite>), we found that the correlation in 94% of the videos was not statistically significant.</p>
        </para>
        <para xml:id="S2.SS2.SSS1.p2">
          <p>More recently, models based on Siamese neural networks achieved state-of-the-art results in object tracking <cite class="ltx_citemacro_cite">[<bibref bibrefs="Fan_2019,Li2018HighPV,Zhu_2018_ECCV" separator="," yyseparator=","/>]</cite>. Despite the power of these architectures, tracking highly-occluded objects is still challenging <cite class="ltx_citemacro_cite">[<bibref bibrefs="mojtaba2019deep" separator="," yyseparator=","/>]</cite>.
The tracker of <cite class="ltx_citemacro_cite">[<bibref bibrefs="Zhu_2018_ECCV" separator="," yyseparator=","/>]</cite>, DaSiamRPN, extends the region-proposal sub-network of <cite class="ltx_citemacro_cite">[<bibref bibrefs="Li2018HighPV" separator="," yyseparator=","/>]</cite>. It was designed for long-term tracking and handles full occlusion or out-of-view scenarios. DaSiamRPN was used as a baseline for the snitch localization task in CATER <cite class="ltx_citemacro_cite">[<bibref bibrefs="girdhar2019cater" separator="," yyseparator=","/>]</cite>, and we evaluate its performance on the OP problem in Section <ref labelref="LABEL:Experiments"/>.</p>
        </para>
      </subsubsection>
      <subsubsection inlist="toc" xml:id="S2.SS2.SSS2">
        <tags>
          <tag>2.2.2</tag>
          <tag role="autoref">subsubsection 2.2.2</tag>
          <tag role="refnum">2.2.2</tag>
          <tag role="typerefnum">§2.2.2</tag>
        </tags>
        <title><tag close=" ">2.2.2</tag>Containment.</title>
        <para xml:id="S2.SS2.SSS2.p1">
          <p>Only a few recent studies have explored the idea of containment relations.
<cite class="ltx_citemacro_cite">[<bibref bibrefs="liang2018tracking" separator="," yyseparator=","/>]</cite> recovered incomplete object trajectories by reasoning about containment relations. <cite class="ltx_citemacro_cite">[<bibref bibrefs="ullman2019model" separator="," yyseparator=","/>]</cite> proposed an unsupervised model to categorize spatial relations, including containment between objects. The containment setup defined in these studies differs from the one defined here in that the contained object is always at least partially visible <cite class="ltx_citemacro_cite">[<bibref bibrefs="ullman2019model" separator="," yyseparator=","/>]</cite>, or
the containment does not involve carrying <cite class="ltx_citemacro_cite">[<bibref bibrefs="liang2018tracking,ullman2019model" separator="," yyseparator=","/>]</cite>.</p>
        </para>
<!--  %============================================== -->      </subsubsection>
    </subsection>
  </section>
  <section inlist="toc" xml:id="S3">
    <tags>
      <tag>3</tag>
      <tag role="autoref">section 3</tag>
      <tag role="refnum">3</tag>
      <tag role="typerefnum">§3</tag>
    </tags>
    <title><tag close=" ">3</tag>The Learning Setup: Reasoning about Non-Visible Objects</title>
    <para xml:id="S3.p1">
      <p>We next formally define the OP task and learning setup.
We are given a set of videos <Math mode="inline" tex="v_{1},...,v_{N}" text="list@(v _ 1, ldots, v _ N)" xml:id="S3.p1.m1">
          <XMath>
            <XMDual>
              <XMApp>
                <XMTok meaning="list"/>
                <XMRef idref="S3.p1.m1.2"/>
                <XMRef idref="S3.p1.m1.1"/>
                <XMRef idref="S3.p1.m1.3"/>
              </XMApp>
              <XMWrap>
                <XMApp xml:id="S3.p1.m1.2">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">v</XMTok>
                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMTok name="ldots" role="ID" xml:id="S3.p1.m1.1">…</XMTok>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp xml:id="S3.p1.m1.3">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">v</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">N</XMTok>
                </XMApp>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math> where each frame <Math mode="inline" tex="x^{i}_{t}" text="(x ^ i) _ t" xml:id="S3.p1.m2">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">x</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
              </XMApp>
              <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
            </XMApp>
          </XMath>
        </Math> in video <Math mode="inline" tex="v_{i}" text="v _ i" xml:id="S3.p1.m3">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">v</XMTok>
              <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
            </XMApp>
          </XMath>
        </Math> is accompanied by the bounding box position <Math mode="inline" tex="B^{i}_{t}" text="(B ^ i) _ t" xml:id="S3.p1.m4">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">B</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
              </XMApp>
              <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
            </XMApp>
          </XMath>
        </Math> of the target object as its label. The goal is to predict for each frame a bounding box <Math mode="inline" tex="\hat{B}^{i}_{t}" text="((hat@(B)) ^ i) _ t" xml:id="S3.p1.m5">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMApp>
                  <XMTok name="hat" role="OVERACCENT" stretchy="false">^</XMTok>
                  <XMTok font="italic" role="UNKNOWN">B</XMTok>
                </XMApp>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
              </XMApp>
              <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
            </XMApp>
          </XMath>
        </Math> of the target object that is closest (in terms of <Math mode="inline" tex="L_{1}" text="L _ 1" xml:id="S3.p1.m6">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">L</XMTok>
              <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
            </XMApp>
          </XMath>
        </Math> distance) to the ground-truth bounding box <Math mode="inline" tex="B^{i}_{t}" text="(B ^ i) _ t" xml:id="S3.p1.m7">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">B</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
              </XMApp>
              <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
            </XMApp>
          </XMath>
        </Math>.</p>
    </para>
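    <para>
      <p>As a minimal illustrative sketch of this objective (not the authors' implementation), the snippet below computes the L1 distance between a predicted and a ground-truth bounding box and averages it over the frames of one video. The corner format (x1, y1, x2, y2), the per-video averaging, and the helper names <text font="typewriter">l1_box_distance</text> and <text font="typewriter">mean_video_l1</text> are assumptions made for illustration.</p>
      <verbatim font="typewriter"><![CDATA[
```python
from typing import Sequence

# Assumed box format: corner coordinates (x1, y1, x2, y2).
Box = Sequence[float]


def l1_box_distance(pred: Box, target: Box) -> float:
    """Sum of absolute coordinate differences between two boxes."""
    if len(pred) != 4 or len(target) != 4:
        raise ValueError("each box must have four coordinates")
    return sum(abs(p - t) for p, t in zip(pred, target))


def mean_video_l1(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average per-frame L1 distance over one video (assumed aggregation)."""
    if len(preds) != len(targets) or not preds:
        raise ValueError("need the same non-zero number of frames")
    total = sum(l1_box_distance(p, t) for p, t in zip(preds, targets))
    return total / len(preds)


if __name__ == "__main__":
    # Two toy frames: the first prediction is exact, the second is off by 2 in x1.
    predictions = [(0, 0, 1, 1), (0, 0, 1, 1)]
    ground_truth = [(0, 0, 1, 1), (2, 0, 1, 1)]
    print(mean_video_l1(predictions, ground_truth))
```
]]></verbatim>
      <p>A lower value means that the predicted boxes lie closer to the ground truth, whether or not the target is visible in the frame.</p>
    </para>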
    <para xml:id="S3.p2">
      <p>We define four localization tasks: (1) Localizing a visible object, which we define as an object that is at least partially visible. (2) Localizing an occluded object, which we define as an object that is <text font="italic">fully</text> occluded by another object. (3) Localizing an object contained by another object, and thus also completely non-visible. (4) Localizing an object that is carried along the surface by a containing object; in this case the target moves while being completely non-visible. Together, these four tasks form a localization task that we call the <text font="bold">object-permanence localization task</text>, or OP.</p>
    </para>
    <para xml:id="S3.p3">
      <p>In Section <ref labelref="LABEL:sect:abl"/>, we also study a semi-supervised learning setup, where at training time the location <Math mode="inline" tex="B^{i}_{t}" text="(B ^ i) _ t" xml:id="S3.p3.m1">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">B</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
              </XMApp>
              <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
            </XMApp>
          </XMath>
        </Math> of the target is provided only in frames when it is visible. This would correspond to the case of a child learning object permanence without explicit feedback about where an object is located when it is hidden.</p>
    </para>
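    <para>
      <p>One way to picture this semi-supervised setting is a loss that only counts frames in which the target is visible. The sketch below is an assumption-laden illustration, not the authors' training code: the per-frame visibility flags, the corner box format (x1, y1, x2, y2), and the helper name <text font="typewriter">masked_l1_loss</text> are hypothetical, and unlabeled frames are simply skipped.</p>
      <verbatim font="typewriter"><![CDATA[
```python
from typing import Sequence

# Assumed box format: corner coordinates (x1, y1, x2, y2).
Box = Sequence[float]


def masked_l1_loss(
    preds: Sequence[Box],
    targets: Sequence[Box],
    visible: Sequence[bool],
) -> float:
    """Mean L1 box distance over the frames flagged as visible.

    Frames where the target is hidden contribute nothing, mimicking
    training without location labels for non-visible targets.
    """
    if not (len(preds) == len(targets) == len(visible)):
        raise ValueError("preds, targets and visible must have equal length")
    losses = [
        sum(abs(p - t) for p, t in zip(pred, target))
        for pred, target, is_visible in zip(preds, targets, visible)
        if is_visible
    ]
    if not losses:
        return 0.0  # no labeled frame in this video
    return sum(losses) / len(losses)


if __name__ == "__main__":
    predictions = [(0, 0, 2, 2), (5, 5, 7, 7)]
    labels = [(0, 0, 2, 3), (0, 0, 0, 0)]  # second label is a placeholder
    flags = [True, False]  # target hidden in the second frame
    print(masked_l1_loss(predictions, labels, flags))
```
]]></verbatim>
      <p>Because hidden frames are masked out, the placeholder label of the second frame has no effect on the result.</p>
    </para>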
    <para xml:id="S3.p4">
      <p>It is instructive to note how the task we address here differs from the tasks of relation or reaction recognition <cite class="ltx_citemacro_cite">[<bibref bibrefs="krishna2017visual,lu2016visual,sadeghi2011recognition" separator="," yyseparator=","/>]</cite>.
In these tasks, models are trained to output an explicit label that captures the name of the interaction or relation (e.g., “behind”, “carry”). In our task, the model aims to predict the location of the target (a regression problem), but it is not trained to name the target's state explicitly (e.g., occluded or contained). While it is possible that the model creates some implicit representation describing the visibility type, this is not mandated by the loss or the architecture.</p>
    </para>
    <figure inlist="lof" labels="LABEL:fig:opnet_arch" placement="t" xml:id="S3.F2">
      <tags>
        <tag>Figure 2</tag>
        <tag role="autoref">Figure 2</tag>
        <tag role="refnum">2</tag>
        <tag role="typerefnum">Figure 2</tag>
      </tags>
      <graphics candidates="Figures/opnet_arch.png" class="ltx_centering" graphic="Figures/opnet_arch.png" options="width=433.62pt, height=0.0pt, keepaspectratio=true" xml:id="S3.F2.g1"/>
      <toccaption class="ltx_centering"><tag close=" ">2</tag>The architecture of <text font="italic">Object Permanence network</text> (OPNet) consists of three components. (a) Perception module for detection. (b) Reasoning module for inferring which object to track in case the target is carried or contained. (c) A second reasoning module for occluded or visible targets, and for refining the exact location of the predicted target.
</toccaption>
      <caption class="ltx_centering"><tag close=": ">Figure 2</tag>The architecture of <text font="italic">Object Permanence network</text> (OPNet) consists of three components. (a) Perception module for detection. (b) Reasoning module for inferring which object to track in case the target is carried or contained. (c) A second reasoning module for occluded or visible targets, and for refining the exact location of the predicted target.
</caption>
    </figure>
    <pagination role="newpage"/>
<!--  %================================= -->  </section>
  <section inlist="toc" labels="LABEL:sec:model" xml:id="S4">
    <tags>
      <tag>4</tag>
      <tag role="autoref">section 4</tag>
      <tag role="refnum">4</tag>
      <tag role="typerefnum">§4</tag>
    </tags>
    <title><tag close=" ">4</tag>Our Approach</title>
    <para xml:id="S4.p1">
      <p>We describe a deep network architecture designed to address the four localization subtasks of the OP task. We refer to the architecture as OPNet. It contains three modules that account for the perception and inference computations which facilitate OP (see Figure <ref labelref="LABEL:fig:opnet_arch"/>).</p>
    </para>
    <para xml:id="S4.p2">
      <p><text font="italic">Perception and detection module (Figure <ref labelref="LABEL:fig:opnet_arch"/>a)</text>: A perception module, responsible for detecting and tracking visible objects. We incorporated a Faster R-CNN <cite class="ltx_citemacro_cite">[<bibref bibrefs="ren2015faster" separator="," yyseparator=","/>]</cite> object detection model, fine-tuned on frames from our dataset, as the perception component of our model.
After pre-training, we used the detector to output the bounding boxes together with identifiers of all objects in any given frame. Specifically, we represent a frame using a <Math mode="inline" tex="K\times 5" text="K * 5" xml:id="S4.p2.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">×</XMTok>
              <XMTok font="italic" role="UNKNOWN">K</XMTok>
              <XMTok meaning="5" role="NUMBER">5</XMTok>
            </XMApp>
          </XMath>
        </Math> matrix. Each row in the matrix represents an object using <Math mode="inline" tex="5" text="5" xml:id="S4.p2.m2">
          <XMath>
            <XMTok meaning="5" role="NUMBER">5</XMTok>
          </XMath>
        </Math> values: four values of the bounding box (<Math mode="inline" tex="x_{1},y_{1},x_{2},y_{2}" text="list@(x _ 1, y _ 1, x _ 2, y _ 2)" xml:id="S4.p2.m3">
          <XMath>
            <XMDual>
              <XMApp>
                <XMTok meaning="list"/>
                <XMRef idref="S4.p2.m3.1"/>
                <XMRef idref="S4.p2.m3.2"/>
                <XMRef idref="S4.p2.m3.3"/>
                <XMRef idref="S4.p2.m3.4"/>
              </XMApp>
              <XMWrap>
                <XMApp xml:id="S4.p2.m3.1">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">x</XMTok>
                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp xml:id="S4.p2.m3.2">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">y</XMTok>
                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp xml:id="S4.p2.m3.3">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">x</XMTok>
                  <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp xml:id="S4.p2.m3.4">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">y</XMTok>
                  <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                </XMApp>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math>) and one <text font="italic">visibility bit</text>, which indicates whether the object is visible or not.
As the video progresses, we assign a unique row to each <emph font="italic">newly identified</emph> object. If an object is not detected in a given frame, its corresponding information (assigned row) is set to zero. In practice, <Math mode="inline" tex="K=15" text="K = 15" xml:id="S4.p2.m4">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMTok font="italic" role="UNKNOWN">K</XMTok>
              <XMTok meaning="15" role="NUMBER">15</XMTok>
            </XMApp>
          </XMath>
        </Math> was the maximal number of objects in a single video in our dataset.
Notably, the videos in the dataset we used do not contain two identical objects, but we found that the detector sometimes mistakes one object for another.</p>
    </para>
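    <para>
      <p>The frame representation described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' implementation: the function name, the dictionary-based inputs and the string object identifiers are assumptions made for the example; only the K-by-5 layout, the visibility bit, the zeroed rows for undetected objects and the value K = 15 come from the text.</p>
      <verbatim>
```python
K = 15  # maximal number of objects in a single video, as stated in the text


def frame_matrix(detections, row_of):
    """Build the K x 5 representation of one frame.

    detections: dict mapping an object id to its box (x1, y1, x2, y2),
                containing only the objects detected in this frame.
    row_of:     dict mapping an object id to its assigned row; it is
                updated in place when an object is seen for the first time.
    """
    # Rows of undetected objects stay all-zero (visibility bit = 0).
    matrix = [[0.0] * 5 for _ in range(K)]
    for obj_id, box in detections.items():
        if obj_id not in row_of:
            if len(row_of) == K:
                raise ValueError("more than K distinct objects")
            row_of[obj_id] = len(row_of)  # unique row for a new object
        # Four box coordinates followed by the visibility bit.
        matrix[row_of[obj_id]] = [float(v) for v in box] + [1.0]
    return matrix
```
      </verbatim>
      <p>Calling the function once per frame with the same row assignment dictionary keeps each object in a fixed row for the whole video, so downstream modules can rely on row identity.</p>
    </para>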
    <para xml:id="S4.p3">
      <p><text font="italic">“Who to track?” module (Figure <ref labelref="LABEL:fig:opnet_arch"/>b)</text>: responsible for understanding which object is currently covering the target. This component consists of a single LSTM layer with a hidden dimension of 256 neurons and a linear projection matrix. After applying the LSTM to the object bounding boxes, we project its output to <Math mode="inline" tex="K" text="K" xml:id="S4.p3.m1">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">K</XMTok>
          </XMath>
        </Math> neurons, each representing a distinct object in the frame. Finally, we apply a softmax layer, resulting in a distribution over the objects in the frame. This distribution can be viewed as an attention mask focusing on the object that covers the target in this frame. Importantly, we do not provide explicit supervision to this attention mask (e.g., by explicitly “telling the model” during training what the correct attention mask is). Rather, our only objective is the location of the target. The output of this module is <Math mode="inline" tex="5" text="5" xml:id="S4.p3.m2">
          <XMath>
            <XMTok meaning="5" role="NUMBER">5</XMTok>
          </XMath>
        </Math> numbers per frame. It is computed as the weighted average over the <Math mode="inline" tex="K\times 5" text="K * 5" xml:id="S4.p3.m3">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">×</XMTok>
              <XMTok font="italic" role="UNKNOWN">K</XMTok>
              <XMTok meaning="5" role="NUMBER">5</XMTok>
            </XMApp>
          </XMath>
        </Math> outputs of the previous stage, weighted by the attention mask.</p>
    </para>
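    <para>
      <p>The attention step of this module can be illustrated as follows. The sketch assumes that the per-object scores have already been produced by the LSTM and the linear projection (here they are plain numbers passed in by the caller), and the function names are hypothetical. It only shows the softmax over objects and the attention-weighted average of the object rows.</p>
      <verbatim>
```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frame, scores):
    """Attention-weighted average of the object rows of one frame.

    frame:  list of K rows with 5 values each (box and visibility bit).
    scores: list of K numbers standing in for the projected LSTM output.
    Returns the 5 numbers passed on to the next module.
    """
    weights = softmax(scores)  # distribution over the K objects
    return [sum(w * row[j] for w, row in zip(weights, frame))
            for j in range(5)]
```
      </verbatim>
      <p>When the distribution is sharply peaked on one object, the output approaches the row of that object, which is the behavior one would expect when a single object covers the target.</p>
    </para>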
    <para xml:id="S4.p4">
      <p><text font="italic">“Where is it” module (Figure <ref labelref="LABEL:fig:opnet_arch"/>c)</text>: learns to predict the location of occluded targets. This final component consists of a second LSTM and a projection matrix, and is responsible for predicting the target location. It takes the output of the previous component (<Math mode="inline" tex="5" text="5" xml:id="S4.p4.m1">
          <XMath>
            <XMTok meaning="5" role="NUMBER">5</XMTok>
          </XMath>
        </Math> values per frame), feeds it into the LSTM and projects its output to four units, representing the predicted bounding box of the target for each frame.</p>
    </para>
  </section>
  <section inlist="toc" xml:id="S5">
    <tags>
      <tag>5</tag>
      <tag role="autoref">section 5</tag>
      <tag role="refnum">5</tag>
      <tag role="typerefnum">§5</tag>
    </tags>
    <title><tag close=" ">5</tag>The LA-CATER Dataset</title>
    <para xml:id="S5.p1">
      <p>To train models and evaluate their performance on the four OP subtasks defined above, we introduce a new set of annotations to the CATER dataset <cite class="ltx_citemacro_cite">[<bibref bibrefs="girdhar2019cater" separator="," yyseparator=","/>]</cite>. We refer to these as <text font="italic">Localization Annotations</text> (LA-CATER).</p>
    </para>
    <para xml:id="S5.p2">
      <p>The CATER dataset consists of 5,500 videos generated programmatically using the Blender 3D engine. Each video is 10 seconds long (300 frames) and contains 5 to 10 objects. Each object is characterized by its shape (cube, sphere, cylinder or inverted cone), size (small, medium or large), material (shiny metal or matte rubber) and color (eight colors). Every video contains a small golden sphere, referred to as “the snitch”, which serves as the target object to be localized.</p>
    </para>
    <para xml:id="S5.p3">
      <p>For the purpose of this study, we generated videos following a configuration similar to the one used by CATER, but we computed additional annotations during video generation. Specifically, we augmented the CATER dataset with ground-truth bounding-box locations of all objects. These annotations were programmatically extracted from the Blender engine by projecting the internal 3D coordinates of objects to the 2D pixel space.</p>
    </para>
    <para xml:id="S5.p4">
      <p>We further annotated videos with detailed frame-level annotations. Each frame was labeled with one of four classes: visible, fully occluded, contained (i.e., covered, static and non-visible) and carried (i.e., covered, moving and non-visible). This classification of frames matches the four localization subtasks of the OP problem.
To compute these annotations, we computed the line of sight from the camera position to determine whether the target is occluded by another object or is occluding it.</p>
    </para>
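    <para>
      <p>The four frame-level classes can be summarized as a small decision rule. The sketch below is illustrative only: the function name and the three boolean inputs are hypothetical, and it assumes that a target that is non-visible but not covered counts as fully occluded. The actual annotations were derived from line-of-sight computations in Blender, which this sketch does not reproduce.</p>
      <verbatim>
```python
def frame_label(visible, covered, moving):
    """Map per-frame target flags to one of the four LA-CATER classes.

    visible: the target can be seen from the camera.
    covered: the target sits under another object.
    moving:  the covering object (and hence the target) is moving.
    """
    if visible:
        return "visible"
    if not covered:
        # Assumption: hidden behind, but not under, another object.
        return "occluded"
    return "carried" if moving else "contained"
```
      </verbatim>
    </para>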
    <para xml:id="S5.p5">
      <p><emph font="italic">LA-CATER</emph> includes a total of 14K videos split into train, dev and test datasets. See Table <ref labelref="LABEL:table:dataset"/> for the distribution of video frames over the localization subtasks in each dataset split. Further details about dataset preparation are provided in Appendix <ref labelref="LABEL:sec:la_cater_prep"/>.</p>
    </para>
    <table inlist="lot" labels="LABEL:table:dataset" placement="h" xml:id="S5.T1">
      <tags>
        <tag>Table 1</tag>
        <tag role="autoref">Table 1</tag>
        <tag role="refnum">1</tag>
        <tag role="typerefnum">Table 1</tag>
      </tags>
      <tabular class="ltx_centering ltx_guessed_headers" colsep="4.0pt" vattach="middle">
        <thead>
          <tr>
            <td border="r" thead="column"/>
            <td align="center" border="r t" thead="column"><tabular colsep="4.0pt" vattach="middle">
                <tr>
                  <td align="center"><text font="smallcaps">Number of</text></td>
                </tr>
                <tr>
                  <td align="center"><text font="smallcaps">Videos</text></td>
                </tr>
              </tabular></td>
            <td align="center" border="r t" thead="column"><tabular colsep="4.0pt" vattach="middle">
                <tr>
                  <td align="center"><text font="smallcaps">Visible</text></td>
                </tr>
              </tabular></td>
            <td align="center" border="r t" thead="column"><tabular colsep="4.0pt" vattach="middle">
                <tr>
                  <td align="center"><text font="smallcaps">Occluded</text></td>
                </tr>
              </tabular></td>
            <td align="center" border="r t" thead="column"><tabular colsep="4.0pt" vattach="middle">
                <tr>
                  <td align="center"><text font="smallcaps">Contained</text></td>
                </tr>
              </tabular></td>
            <td align="center" border="r t" thead="column"><tabular colsep="4.0pt" vattach="middle">
                <tr>
                  <td align="center"><text font="smallcaps">Carried</text></td>
                </tr>
              </tabular></td>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td align="center" border="l r t"><text font="smallcaps">Train</text></td>
            <td align="center" border="r t"><text font="smallcaps">9,300</text></td>
            <td align="center" border="r t"><text font="smallcaps">63.00%</text></td>
            <td align="center" border="r t"><text font="smallcaps">3.03%</text></td>
            <td align="center" border="r t"><text font="smallcaps">29.43%</text></td>
            <td align="center" border="r t"><text font="smallcaps">4.54%</text></td>
          </tr>
          <tr>
            <td align="center" border="l r t"><text font="smallcaps">Dev</text></td>
            <td align="center" border="r t"><text font="smallcaps">3,327</text></td>
            <td align="center" border="r t"><text font="smallcaps">63.27%</text></td>
            <td align="center" border="r t"><text font="smallcaps">2.89%</text></td>
            <td align="center" border="r t"><text font="smallcaps">29.19%</text></td>
            <td align="center" border="r t"><text font="smallcaps">4.65%</text></td>
          </tr>
          <tr>
            <td align="center" border="b l r t"><text font="smallcaps">Test</text></td>
            <td align="center" border="b r t"><text font="smallcaps">1,371</text></td>
            <td align="center" border="b r t"><text font="smallcaps">64.13%</text></td>
            <td align="center" border="b r t"><text font="smallcaps">3.07%</text></td>
            <td align="center" border="b r t"><text font="smallcaps">28.56%</text></td>
            <td align="center" border="b r t"><text font="smallcaps">4.24%</text></td>
          </tr>
        </tbody>
      </tabular>
      <toccaption><tag close=" ">1</tag>Fraction of frames per type in the train, dev and test sets of LA-CATER. Occluded and carried target frames make up less than 8% of the frames, but they present the most challenging prediction tasks.</toccaption>
      <caption><tag close=": ">Table 1</tag>Fraction of frames per type in the train, dev and test sets of LA-CATER. Occluded and carried target frames make up less than 8% of the frames, but they present the most challenging prediction tasks.</caption>
    </table>
  </section>
  <section inlist="toc" labels="LABEL:Experiments" xml:id="S6">
    <tags>
      <tag>6</tag>
      <tag role="autoref">section 6</tag>
      <tag role="refnum">6</tag>
      <tag role="typerefnum">§6</tag>
    </tags>
    <title><tag close=" ">6</tag>Experiments</title>
    <para xml:id="S6.p1">
      <p>We describe our experimental setup, compared methods and evaluation metrics. Implementation details are given in Appendix <ref labelref="LABEL:sec:implementation_details"/>.</p>
    </para>
    <subsection inlist="toc" labels="LABEL:baseline_models" xml:id="S6.SS1">
      <tags>
        <tag>6.1</tag>
        <tag role="autoref">subsection 6.1</tag>
        <tag role="refnum">6.1</tag>
        <tag role="typerefnum">§6.1</tag>
      </tags>
      <title><tag close=" ">6.1</tag>Baselines and Model Variants</title>
      <para xml:id="S6.SS1.p1">
        <p>We compare our proposed OPNet with six other architectures designed to solve the OP tasks. Since we are not aware of previously published unified architectures designed to solve all OP tasks at once, we used existing models as components in our baselines. All baseline models receive the predictions of the object detector (perception) component as their input.</p>
      </para>
      <para class="ltx_noindent" xml:id="S6.SS1.p2">
        <p><text font="italic">(A) Programmed Models</text>. We evaluated two programmed models. These models are “hard-coded” rather than learned. They are designed to reflect models that programmatically solve the reasoning task.</p>
        <itemize xml:id="S6.I1">
          <item xml:id="S6.I1.i1">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">1st item</tag>
            </tags>
            <para xml:id="S6.I1.i1.p1">
              <p>(1) <text font="bold">Detector + Tracker</text>. Using the detected location of the target, this method initializes a DaSiamRPN tracker <cite class="ltx_citemacro_cite">[<bibref bibrefs="Zhu_2018_ECCV" separator="," yyseparator=","/>]</cite> to track the target. Whenever the target is no longer visible, the tracker is re-initialized to track the object located at the last known location of the target.</p>
            </para>
          </item>
          <item xml:id="S6.I1.i2">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">2nd item</tag>
            </tags>
            <para xml:id="S6.I1.i2.p1">
              <p>(2) <text font="bold">Detector + Heuristic</text>. When the target is not detected, the model switches from tracking the target to tracking the object located closest to the last known location of the target. The model also employs a heuristic to adjust for the difference in size between the currently tracked object and the original target.</p>
            </para>
          </item>
        </itemize>
      </para>
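    <para>
      <p>The switching rule of the Detector + Heuristic baseline can be sketched as a nearest-object lookup. The text does not specify the distance measure, so the sketch assumes Euclidean distance between box centers; the function names are hypothetical and the size-adjustment heuristic is omitted because its details are not given here.</p>
      <verbatim>
```python
def center(box):
    """Center point of a box given as (x1, y1, x2, y2)."""
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def closest_object(last_target_box, boxes):
    """Pick the detected object nearest to the last known target location.

    last_target_box: last box in which the target was detected.
    boxes:           dict mapping object ids to their current boxes.
    Returns the id of the object whose center is closest (assumed
    Euclidean distance between box centers).
    """
    tx, ty = center(last_target_box)

    def squared_distance(item):
        cx, cy = center(item[1])
        return (cx - tx) ** 2 + (cy - ty) ** 2

    return min(boxes.items(), key=squared_distance)[0]
```
      </verbatim>
    </para>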
      <para class="ltx_noindent" xml:id="S6.SS1.p3">
        <p><text font="italic">(B) Learned Models</text>. We evaluated four learned baselines with an increasing level of representation complexity.</p>
        <itemize xml:id="S6.I2">
          <item xml:id="S6.I2.i1">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">1st item</tag>
            </tags>
            <para xml:id="S6.I2.i1.p1">
              <p>(3) <text font="bold">OPNet</text>. The proposed model, as presented in Section <ref labelref="LABEL:sec:model"/>.</p>
            </para>
          </item>
          <item xml:id="S6.I2.i2">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">2nd item</tag>
            </tags>
            <para xml:id="S6.I2.i2.p1">
              <p>(4) <text font="bold">Baseline LSTM</text>. This model uses a single unidirectional LSTM layer with a hidden state of 512 neurons, operating on the temporal (frames) dimension. The input to the LSTM is the concatenation of the objects' input representations. It is the simplest learned baseline, as the input representation is not transformed non-linearly before being fed to the LSTM.</p>
            </para>
          </item>
          <item xml:id="S6.I2.i3">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">3rd item</tag>
            </tags>
            <para xml:id="S6.I2.i3.p1">
              <p>(5) <text font="bold">Non-Linear + LSTM</text>. This model augments the previous model and increases the complexity of the scene representation. The input representations are upsampled using a linear layer followed by a ReLU activation, resulting in a 256-dimensional vector representation for each object in the frame. These high-dimensional object representations are concatenated and fed into the LSTM.</p>
            </para>
          </item>
          <item xml:id="S6.I2.i4">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">4th item</tag>
            </tags>
            <para xml:id="S6.I2.i4.p1">
              <p>(6) <text font="bold">Transformer + LSTM</text>. This model augments the previous baselines by introducing much more complex representations for the objects in a frame. We utilized a transformer encoder <cite class="ltx_citemacro_cite">[<bibref bibrefs="vaswani2017attention" separator="," yyseparator=","/>]</cite> after up-sampling the input representations, employing self-attention between all the objects in a frame. We used a transformer encoder with 2 layers and 2 attention heads, yielding a single vector containing the target attended values. These attended values, which correspond to each other object in the frame, are then fed into the LSTM.</p>
            </para>
          </item>
          <item xml:id="S6.I2.i5">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">5th item</tag>
            </tags>
            <para xml:id="S6.I2.i5.p1">
              <p>(7) <text font="bold">LSTM + MLP</text>. This model (<text font="italic">Figure <ref labelref="LABEL:fig:opnet_arch"/></text>) ablates the second LSTM module (c) in the model presented in Section <ref labelref="LABEL:sec:model"/>.</p>
            </para>
          </item>
        </itemize>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S6.SS2">
      <tags>
        <tag>6.2</tag>
        <tag role="autoref">subsection 6.2</tag>
        <tag role="refnum">6.2</tag>
        <tag role="typerefnum">§6.2</tag>
      </tags>
      <title><tag close=" ">6.2</tag>Evaluation Metric</title>
      <para xml:id="S6.SS2.p1">
        <p>We evaluate model performance at a given frame <Math mode="inline" tex="t" text="t" xml:id="S6.SS2.p1.m1">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">t</XMTok>
            </XMath>
          </Math> by comparing the predicted target localization with the ground truth (GT) target localization. We use two metrics. The first is the intersection over union (IoU) metric:</p>
        <equation xml:id="S6.E1">
          <tags>
            <tag>(1)</tag>
            <tag role="autoref">Equation 1</tag>
            <tag role="refnum">1</tag>
          </tags>
          <Math mode="display" tex="IoU_{t}=\frac{B^{GT}_{t}\cap B^{p}_{t}}{B^{GT}_{t}\cup B^{p}_{t}}\quad," text="I * o * U _ t = ((B ^ (G * T)) _ t intersection (B ^ p) _ t) / ((B ^ (G * T)) _ t union (B ^ p) _ t)" xml:id="S6.E1.m1">
            <XMath>
              <XMDual>
                <XMRef idref="S6.E1.m1.1"/>
                <XMWrap>
                  <XMApp xml:id="S6.E1.m1.1">
                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" role="UNKNOWN">I</XMTok>
                      <XMTok font="italic" role="UNKNOWN">o</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" role="UNKNOWN">U</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMApp>
                      <XMTok mathstyle="display" meaning="divide" role="FRACOP"/>
                      <XMApp>
                        <XMTok meaning="intersection" name="cap" role="ADDOP">∩</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                          <XMApp>
                            <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                            <XMTok font="italic" role="UNKNOWN">B</XMTok>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">G</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">T</XMTok>
                            </XMApp>
                          </XMApp>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                        </XMApp>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                          <XMApp>
                            <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                            <XMTok font="italic" role="UNKNOWN">B</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">p</XMTok>
                          </XMApp>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                        </XMApp>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="union" name="cup" role="ADDOP">∪</XMTok>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                          <XMApp>
                            <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                            <XMTok font="italic" role="UNKNOWN">B</XMTok>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">G</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">T</XMTok>
                            </XMApp>
                          </XMApp>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                        </XMApp>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                          <XMApp>
                            <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                            <XMTok font="italic" role="UNKNOWN">B</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">p</XMTok>
                          </XMApp>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                        </XMApp>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMTok font="italic" name="quad" role="PUNCT"> </XMTok>
                  <XMTok role="PUNCT">,</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>
        </equation>
        <p>where <Math mode="inline" tex="B^{p}_{t}" text="(B ^ p) _ t" xml:id="S6.SS2.p1.m2">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">B</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">p</XMTok>
                </XMApp>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
              </XMApp>
            </XMath>
          </Math> denotes the predicted bounding box for frame <Math mode="inline" tex="t" text="t" xml:id="S6.SS2.p1.m3">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">t</XMTok>
            </XMath>
          </Math> and <Math mode="inline" tex="B^{GT}_{t}" text="(B ^ (G * T)) _ t" xml:id="S6.SS2.p1.m4">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">B</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">G</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">T</XMTok>
                  </XMApp>
                </XMApp>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
              </XMApp>
            </XMath>
          </Math> denotes the ground truth bounding box for frame <Math mode="inline" tex="t" text="t" xml:id="S6.SS2.p1.m5">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">t</XMTok>
            </XMath>
          </Math>.</p>
      </para>
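      <para>
        <p>For axis-aligned bounding boxes, the IoU defined above reduces to a ratio of rectangle areas. The following is a minimal sketch, assuming boxes are given as (x1, y1, x2, y2) tuples with x1 no greater than x2 and y1 no greater than y2; the function name and the convention of returning zero for an empty union are choices made for the example.</p>
        <verbatim>
```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter
    # Convention for this sketch: degenerate boxes give an IoU of zero.
    return inter / union if union > 0 else 0.0
```
        </verbatim>
        <p>For example, two 2-by-2 boxes offset by one unit in each direction overlap in a unit square, giving an IoU of 1/7.</p>
      </para>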
      <para xml:id="S6.SS2.p2">
        <p>Second, we evaluate models using the mean average precision (MAP) metric. MAP is computed by applying an indicator function to each frame that determines whether the IoU value exceeds a predefined threshold, then averaging over the frames of each video and over all the videos in the dataset.</p>
        <equation xml:id="S6.E2">
          <tags>
            <tag>(2)</tag>
            <tag role="autoref">Equation 2</tag>
            <tag role="refnum">2</tag>
          </tags>
          <Math mode="display" tex="AP=\frac{1}{n}\sum_{t=1}^{n}\mathbf{1_{t}}\quad\ \text{, where}\;\mathbf{1_{t}%&#10;}=\begin{cases}1&amp;IoU_{t}&gt;IoU\;threshold\\&#10;0&amp;\rm{otherwise}\end{cases}" text="formulae@(A * P = (1 / n) * ((sum _ (t = 1)) ^ n)@(1 _ t), [, where] * 1 _ t = cases@(1, I * o * U _ t &gt; I * o * U * t * h * r * e * s * h * o * l * d, 0, otherwise))" xml:id="S6.E2.m1">
            <XMath>
              <XMDual>
                <XMApp>
                  <XMTok meaning="formulae"/>
                  <XMRef idref="S6.E2.m1.5"/>
                  <XMRef idref="S6.E2.m1.6"/>
                </XMApp>
                <XMWrap>
                  <XMApp xml:id="S6.E2.m1.5">
                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" role="UNKNOWN">A</XMTok>
                      <XMTok font="italic" role="UNKNOWN">P</XMTok>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok mathstyle="display" meaning="divide" role="FRACOP"/>
                        <XMTok meaning="1" role="NUMBER">1</XMTok>
                        <XMTok font="italic" role="UNKNOWN">n</XMTok>
                      </XMApp>
                      <XMApp>
                        <XMApp scriptpos="mid">
                          <XMTok role="SUPERSCRIPTOP" scriptpos="mid1"/>
                          <XMApp scriptpos="mid">
                            <XMTok role="SUBSCRIPTOP" scriptpos="mid1"/>
                            <XMTok mathstyle="display" meaning="sum" role="SUMOP" scriptpos="mid">∑</XMTok>
                            <XMApp>
                              <XMTok fontsize="70%" meaning="equals" role="RELOP">=</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                              <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                            </XMApp>
                          </XMApp>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                        </XMApp>
                        <XMApp>
                          <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                          <XMTok font="bold" meaning="1" role="NUMBER">1</XMTok>
                          <XMTok font="bold" fontsize="70%" role="UNKNOWN">t</XMTok>
                        </XMApp>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMTok font="italic" name="quad" role="PUNCT">   </XMTok>
                  <XMApp xml:id="S6.E2.m1.6">
                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMText>, where</XMText>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                        <XMTok font="bold" meaning="1" role="NUMBER"> 1</XMTok>
                        <XMTok font="bold" fontsize="70%" role="UNKNOWN">t</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="cases"/>
                        <XMRef idref="S6.E2.m1.1"/>
                        <XMRef idref="S6.E2.m1.2"/>
                        <XMRef idref="S6.E2.m1.3"/>
                        <XMRef idref="S6.E2.m1.4"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="true">{</XMTok>
                        <XMArray>
                          <XMRow>
                            <XMCell align="left">
                              <XMTok meaning="1" role="NUMBER" xml:id="S6.E2.m1.1">1</XMTok>
                            </XMCell>
                            <XMCell align="left">
                              <XMApp xml:id="S6.E2.m1.2">
                                <XMTok meaning="greater-than" role="RELOP">&gt;</XMTok>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">I</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">o</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post6"/>
                                    <XMTok font="italic" role="UNKNOWN">U</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                  </XMApp>
                                </XMApp>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">I</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">o</XMTok>
                                  <XMTok font="italic" role="UNKNOWN" rpadding="2.8pt">U</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">t</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">h</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">r</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">e</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">s</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">h</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">o</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">l</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">d</XMTok>
                                </XMApp>
                              </XMApp>
                            </XMCell>
                          </XMRow>
                          <XMRow>
                            <XMCell align="left">
                              <XMTok meaning="0" role="NUMBER" xml:id="S6.E2.m1.3">0</XMTok>
                            </XMCell>
                            <XMCell align="left">
                              <XMTok role="UNKNOWN" xml:id="S6.E2.m1.4">otherwise</XMTok>
                            </XMCell>
                          </XMRow>
                        </XMArray>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>
        </equation>
        <equation xml:id="S6.E3">
          <tags>
            <tag>(3)</tag>
            <tag role="autoref">Equation 3</tag>
            <tag role="refnum">3</tag>
          </tags>
          <Math mode="display" tex="MAP=\frac{1}{N}\displaystyle\sum_{v=1}^{N}AP_{v}\quad." text="M * A * P = (1 / N) * ((sum _ (v = 1)) ^ N)@(A * P _ v)" xml:id="S6.E3.m1">
            <XMath>
              <XMDual>
                <XMRef idref="S6.E3.m1.1"/>
                <XMWrap>
                  <XMApp xml:id="S6.E3.m1.1">
                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" role="UNKNOWN">M</XMTok>
                      <XMTok font="italic" role="UNKNOWN">A</XMTok>
                      <XMTok font="italic" role="UNKNOWN">P</XMTok>
                    </XMApp>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMApp>
                        <XMTok mathstyle="display" meaning="divide" role="FRACOP"/>
                        <XMTok meaning="1" role="NUMBER">1</XMTok>
                        <XMTok font="italic" role="UNKNOWN">N</XMTok>
                      </XMApp>
                      <XMApp>
                        <XMApp scriptpos="mid">
                          <XMTok role="SUPERSCRIPTOP" scriptpos="mid1"/>
                          <XMApp scriptpos="mid">
                            <XMTok role="SUBSCRIPTOP" scriptpos="mid1"/>
                            <XMTok mathstyle="display" meaning="sum" role="SUMOP" scriptpos="mid">∑</XMTok>
                            <XMApp>
                              <XMTok fontsize="70%" meaning="equals" role="RELOP">=</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">v</XMTok>
                              <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                            </XMApp>
                          </XMApp>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">N</XMTok>
                        </XMApp>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMTok font="italic" role="UNKNOWN">A</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMTok font="italic" role="UNKNOWN">P</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">v</XMTok>
                          </XMApp>
                        </XMApp>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMTok font="italic" name="quad" role="PUNCT"> </XMTok>
                  <XMTok role="PERIOD">.</XMTok>
                </XMWrap>
              </XMDual>
            </XMath>
          </Math>
        </equation>
        <p>These per-frame metrics allow us to quantify the performance on each of the four OP subtasks separately.</p>
      </para>
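      <para>
        <p>As a rough illustration of how these metrics fit together, the Python sketch below computes a per-frame IoU, a per-video score AP_v, and MAP as the average of AP_v over videos, following the MAP definition above. It is a minimal sketch, not the evaluation code used for the reported results. The corner-format boxes, the function names (iou, video_ap, mean_ap), and the reading of AP_v as the fraction of a video's frames whose IoU exceeds the threshold are assumptions made for illustration.</p>
        <verbatim>
```python
from typing import Sequence, Tuple

# Hypothetical box format: (x1, y1, x2, y2) corners of an axis-aligned box.
Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def video_ap(pred: Sequence[Box], gt: Sequence[Box], threshold: float) -> float:
    """Assumed AP_v: fraction of frames whose IoU exceeds the threshold.

    Expects one predicted and one ground-truth box per frame (non-empty).
    """
    hits = [1.0 if iou(p, g) > threshold else 0.0 for p, g in zip(pred, gt)]
    return sum(hits) / len(hits)


def mean_ap(videos: Sequence[Tuple[Sequence[Box], Sequence[Box]]],
            threshold: float) -> float:
    """MAP: the average of AP_v over all N videos (non-empty)."""
    aps = [video_ap(pred, gt, threshold) for pred, gt in videos]
    return sum(aps) / len(aps)
```
        </verbatim>
        <p>Restricting the frames passed to video_ap to those of a single type (visible, occluded, contained, or carried) would give one per-subtask score under the same assumptions.</p>
      </para>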
    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:Results" xml:id="S7">
    <tags>
      <tag>7</tag>
      <tag role="autoref">section 7</tag>
      <tag role="refnum">7</tag>
      <tag role="typerefnum">§7</tag>
    </tags>
    <title><tag close=" ">7</tag>Results</title>
    <para xml:id="S7.p1">
      <p>We start by comparing OPNet with the baselines presented in Section <ref labelref="LABEL:baseline_models"/>. We then provide more insight into the performance of the models by repeating the evaluations with <text font="italic">“Perfect Perception”</text> in Section <ref labelref="LABEL:sec:perfect_abl"/>. Section <ref labelref="LABEL:sect:abl"/> describes a semi-supervised setting in which training uses visible frames only. Finally, in Section <ref labelref="LABEL:sec:cater_res"/>, we compare OPNet with the models presented in the CATER paper on the original CATER data.</p>
    </para>
    <table inlist="lot" labels="LABEL:table:od_results" placement="!h" xml:id="S7.T2">
      <tags>
        <tag>Table 2</tag>
        <tag role="autoref">Table 2</tag>
        <tag role="refnum">2</tag>
        <tag role="typerefnum">Table 2</tag>
      </tags>
      <block align="center" depth="0.0pt" width="433.6pt">
        <tabular class="ltx_guessed_headers" colsep="2.8pt" vattach="middle">
          <thead>
            <tr>
              <td align="left" border="l r t" thead="column row"><tabular colsep="2.8pt" vattach="middle">
                  <tr>
                    <td align="center">Mean IoU<Math mode="inline" tex="\pm" text="plus-or-minus" xml:id="S7.T2.m1">
                        <XMath>
                          <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        </XMath>
                      </Math> SEM</td>
                  </tr>
                </tabular></td>
              <td align="center" border="r t" thead="column"><tabular colsep="2.8pt" vattach="middle">
                  <tr>
                    <td align="center">Visible</td>
                  </tr>
                </tabular></td>
              <td align="center" border="r t" thead="column"><tabular colsep="2.8pt" vattach="middle">
                  <tr>
                    <td align="center">Occluded</td>
                  </tr>
                </tabular></td>
              <td align="center" border="r t" thead="column"><tabular colsep="2.8pt" vattach="middle">
                  <tr>
                    <td align="center">Contained</td>
                  </tr>
                </tabular></td>
              <td align="center" border="r t" thead="column"><tabular colsep="2.8pt" vattach="middle">
                  <tr>
                    <td align="center">Carried</td>
                  </tr>
                </tabular></td>
              <td align="center" border="r t" thead="column">Overall</td>
            </tr>
          </thead>
          <tbody>
            <tr>
              <td align="left" border="l r t" thead="row"><text font="smallcaps">Detector + Tracker</text></td>
              <td align="center" border="r t">90.27 <Math mode="inline" tex="\pm 0.13" text="plus-or-minus 0.13" xml:id="S7.T2.m2">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.13" role="NUMBER">0.13</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">53.62 <Math mode="inline" tex="\pm 0.58" text="plus-or-minus 0.58" xml:id="S7.T2.m3">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.58" role="NUMBER">0.58</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">39.98 <Math mode="inline" tex="\pm 0.38" text="plus-or-minus 0.38" xml:id="S7.T2.m4">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.38" role="NUMBER">0.38</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">34.45 <Math mode="inline" tex="\pm 0.40" text="plus-or-minus 0.40" xml:id="S7.T2.m5">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.40" role="NUMBER">0.40</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">71.23 <Math mode="inline" tex="\pm 0.51" text="plus-or-minus 0.51" xml:id="S7.T2.m6">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.51" role="NUMBER">0.51</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
            </tr>
            <tr>
              <td align="left" border="l r t" thead="row"><text font="smallcaps">Detector + Heuristic</text></td>
              <td align="center" border="r t">90.06 <Math mode="inline" tex="\pm 0.14" text="plus-or-minus 0.14" xml:id="S7.T2.m7">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.14" role="NUMBER">0.14</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">47.03 <Math mode="inline" tex="\pm 0.73" text="plus-or-minus 0.73" xml:id="S7.T2.m8">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.73" role="NUMBER">0.73</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">55.36 <Math mode="inline" tex="\pm 0.53" text="plus-or-minus 0.53" xml:id="S7.T2.m9">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.53" role="NUMBER">0.53</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">55.87 <Math mode="inline" tex="\pm 0.59" text="plus-or-minus 0.59" xml:id="S7.T2.m10">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.59" role="NUMBER">0.59</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">76.91 <Math mode="inline" tex="\pm 0.43" text="plus-or-minus 0.43" xml:id="S7.T2.m11">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.43" role="NUMBER">0.43</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
            </tr>
            <tr>
              <td align="left" border="l r t" thead="row"><text font="smallcaps">Baseline LSTM</text></td>
              <td align="center" border="r t">81.60 <Math mode="inline" tex="\pm 0.19" text="plus-or-minus 0.19" xml:id="S7.T2.m12">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.19" role="NUMBER">0.19</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">59.80 <Math mode="inline" tex="\pm 0.61" text="plus-or-minus 0.61" xml:id="S7.T2.m13">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.61" role="NUMBER">0.61</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">49.18 <Math mode="inline" tex="\pm 0.64" text="plus-or-minus 0.64" xml:id="S7.T2.m14">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.64" role="NUMBER">0.64</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">21.53 <Math mode="inline" tex="\pm 0.40" text="plus-or-minus 0.40" xml:id="S7.T2.m15">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.40" role="NUMBER">0.40</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">67.20 <Math mode="inline" tex="\pm 0.53" text="plus-or-minus 0.53" xml:id="S7.T2.m16">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.53" role="NUMBER">0.53</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
            </tr>
            <tr>
              <td align="left" border="l r t" thead="row"><text font="smallcaps">Non-Linear + LSTM</text></td>
              <td align="center" border="r t">88.25 <Math mode="inline" tex="\pm 0.14" text="plus-or-minus 0.14" xml:id="S7.T2.m17">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.14" role="NUMBER">0.14</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">70.14 <Math mode="inline" tex="\pm 0.62" text="plus-or-minus 0.62" xml:id="S7.T2.m18">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.62" role="NUMBER">0.62</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">55.66 <Math mode="inline" tex="\pm 0.67" text="plus-or-minus 0.67" xml:id="S7.T2.m19">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.67" role="NUMBER">0.67</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">24.58 <Math mode="inline" tex="\pm 0.44" text="plus-or-minus 0.44" xml:id="S7.T2.m20">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.44" role="NUMBER">0.44</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">73.53 <Math mode="inline" tex="\pm 0.51" text="plus-or-minus 0.51" xml:id="S7.T2.m21">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.51" role="NUMBER">0.51</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
            </tr>
            <tr>
              <td align="left" border="l r t" thead="row"><text font="smallcaps">Transformer + LSTM</text></td>
              <td align="center" border="r t"><text font="bold">90.82</text> <Math mode="inline" tex="\pm 0.14" text="plus-or-minus 0.14" xml:id="S7.T2.m22">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.14" role="NUMBER">0.14</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t"><text font="bold">80.40</text> <Math mode="inline" tex="\pm 0.61" text="plus-or-minus 0.61" xml:id="S7.T2.m23">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.61" role="NUMBER">0.61</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">70.71 <Math mode="inline" tex="\pm 0.78" text="plus-or-minus 0.78" xml:id="S7.T2.m24">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.78" role="NUMBER">0.78</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">28.25 <Math mode="inline" tex="\pm 0.45" text="plus-or-minus 0.45" xml:id="S7.T2.m25">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.45" role="NUMBER">0.45</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">80.27 <Math mode="inline" tex="\pm 0.50" text="plus-or-minus 0.50" xml:id="S7.T2.m26">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.50" role="NUMBER">0.50</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
            </tr>
            <tr>
              <td align="left" border="l r t" thead="row"><text font="smallcaps">OPNet (LSTM + MLP)</text></td>
              <td align="center" border="r t">88.11 <Math mode="inline" tex="\pm 0.16" text="plus-or-minus 0.16" xml:id="S7.T2.m27">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.16" role="NUMBER">0.16</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">55.32 <Math mode="inline" tex="\pm 0.85" text="plus-or-minus 0.85" xml:id="S7.T2.m28">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.85" role="NUMBER">0.85</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">65.18 <Math mode="inline" tex="\pm 0.89" text="plus-or-minus 0.89" xml:id="S7.T2.m29">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.89" role="NUMBER">0.89</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t"><text font="bold">57.59</text> <Math mode="inline" tex="\pm 0.85" text="plus-or-minus 0.85" xml:id="S7.T2.m30">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.85" role="NUMBER">0.85</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="r t">78.85 <Math mode="inline" tex="\pm 0.52" text="plus-or-minus 0.52" xml:id="S7.T2.m31">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.52" role="NUMBER">0.52</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
            </tr>
            <tr>
              <td align="left" border="b l r t" thead="row"><text font="smallcaps">OPNet (LSTM + LSTM)</text></td>
              <td align="center" border="b r t">88.89 <Math mode="inline" tex="\pm 0.16" text="plus-or-minus 0.16" xml:id="S7.T2.m32">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.16" role="NUMBER">0.16</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="b r t">78.83 <Math mode="inline" tex="\pm 0.56" text="plus-or-minus 0.56" xml:id="S7.T2.m33">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.56" role="NUMBER">0.56</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="b r t"><text font="bold">76.79</text> <Math mode="inline" tex="\pm 0.62" text="plus-or-minus 0.62" xml:id="S7.T2.m34">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.62" role="NUMBER">0.62</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="b r t">56.04 <Math mode="inline" tex="\pm 0.77" text="plus-or-minus 0.77" xml:id="S7.T2.m35">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.77" role="NUMBER">0.77</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
              <td align="center" border="b r t"><text font="bold">81.94</text> <Math mode="inline" tex="\pm 0.41" text="plus-or-minus 0.41" xml:id="S7.T2.m36">
                  <XMath>
                    <XMApp>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                      <XMTok meaning="0.41" role="NUMBER">0.41</XMTok>
                    </XMApp>
                  </XMath>
                </Math></td>
            </tr>
          </tbody>
        </tabular>
      </block>
      <toccaption><tag close=" ">2</tag>Mean IoU performance of various models on the LA-CATER test data. “<Math mode="inline" tex="\pm" text="plus-or-minus" xml:id="S7.T2.m37">
          <XMath>
            <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
          </XMath>
        </Math>” denotes the standard error of the mean (SEM). OPNet performs consistently well across all subtasks, and on the contained and carried frames it is significantly better than the other methods.
</toccaption>
      <caption><tag close=": ">Table 2</tag>Mean IoU performance of various models on the LA-CATER test data. “<Math mode="inline" tex="\pm" text="plus-or-minus" xml:id="S7.T2.m38">
          <XMath>
            <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
          </XMath>
        </Math>” denotes the standard error of the mean (SEM). OPNet performs consistently well across all subtasks, and on the contained and carried frames it is significantly better than the other methods.
</caption>
    </table>
    <para xml:id="S7.p2">
      <p>We first compare OPNet and the baselines presented in Section <ref labelref="LABEL:baseline_models"/>. Table <ref labelref="LABEL:table:od_results"/> reports the mean IoU of all models on each of the four subtasks, and Figure <ref labelref="LABEL:fig:map_od"/> presents the mean average precision (MAP) of the models across different IoU thresholds.</p>
    </para>
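    <para>
      <p>For readers who want to reproduce cells of the form “mean IoU ± SEM”, the sketch below shows one way such a cell could be computed from a list of IoU values. It is illustrative only. The helper names (mean_and_sem, format_cell), the use of the sample standard deviation, and the scaling of IoU values in [0, 1] to percentages are assumptions. How IoU values are pooled over frames and videos is not specified here.</p>
      <verbatim>
```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def mean_and_sem(ious: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, standard error of the mean) for at least two IoU values."""
    # Assumption: SEM = sample standard deviation / sqrt(number of values).
    return mean(ious), stdev(ious) / math.sqrt(len(ious))


def format_cell(ious: Sequence[float]) -> str:
    """Format IoU values in [0, 1] as a percentage cell, e.g. '81.94 ± 0.41'."""
    m, sem = mean_and_sem(ious)
    return f"{100 * m:.2f} ± {100 * sem:.2f}"
```
      </verbatim>
      <p>Because the SEM shrinks with the square root of the number of values, small SEMs on large test sets are expected even when per-frame IoU varies widely.</p>
    </para>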
    <para xml:id="S7.p3">
      <p>Table <ref labelref="LABEL:table:od_results"/> shows that OPNet performs consistently well across all subtasks and outperforms all other models overall. On the visible and occluded frames, its performance is similar to that of the baselines. On the contained and carried frames, however, OPNet is significantly better than the other methods. This is likely due to OPNet’s explicit modeling of the object to be tracked.</p>
    </para>
    <para xml:id="S7.p4">
      <p>Table <ref labelref="LABEL:table:od_results"/> also reports results for two variants of OPNet: OPNet (LSTM+MLP) and OPNet (LSTM+LSTM). The former lacks the second module (“Where is it” in Figure <ref labelref="LABEL:fig:map_od"/>), which is meant to handle occlusion, and it indeed underperforms on occlusion frames (the “occluded” and “contained” subtasks).
This highlights the importance of using the two LSTM modules in Figure <ref labelref="LABEL:fig:map_od"/>.</p>
    </para>
    <para xml:id="S7.p5">
      <p>Figure <ref labelref="LABEL:fig:map_od"/> provides insight into the behavior of the programmed models (namely Detector + Tracker and Detector + Heuristic). These models perform well when the IoU threshold is low, which reflects the fact that they have a good coarse estimate of where the target is but fail to localize it more accurately. OPNet, on the other hand, does well at accurate localization, presumably due to its learned “Where is it” module.</p>
    </para>
    <figure inlist="lof" labels="LABEL:fig:map_od" placement="!h" xml:id="S7.F3">
      <tags>
        <tag>Figure 3</tag>
        <tag role="autoref">Figure 3</tag>
        <tag role="refnum">3</tag>
        <tag role="typerefnum">Figure 3</tag>
      </tags>
      <graphics candidates="Figures/map_metric_od_perception.png" class="ltx_centering" graphic="Figures/map_metric_od_perception.png" options="width=411.939pt, height=0.0pt, keepaspectratio=true" xml:id="S7.F3.g1"/>
      <toccaption class="ltx_centering"><tag close=" ">3</tag>Mean average precision (MAP) as a function of IoU thresholds. The two programmed models, Detector+Tracker (blue) and Detector+Heuristic (orange), perform well when the IoU threshold is low, providing a good coarse estimate of target location. OPNet performs well on all subtasks.</toccaption>
      <caption class="ltx_centering"><tag close=": ">Figure 3</tag>Mean average precision (MAP) as a function of IoU thresholds. The two programmed models, Detector+Tracker (blue) and Detector+Heuristic (orange), perform well when the IoU threshold is low, providing a good coarse estimate of target location. OPNet performs well on all subtasks.</caption>
    </figure>
    <subsection inlist="toc" labels="LABEL:sec:perfect_abl" xml:id="S7.SS1">
      <tags>
        <tag>7.1</tag>
        <tag role="autoref">subsection 7.1</tag>
        <tag role="refnum">7.1</tag>
        <tag role="typerefnum">§7.1</tag>
      </tags>
      <title><tag close=" ">7.1</tag>Reasoning with Perfect Perception</title>
      <para xml:id="S7.SS1.p1">
        <p>The OPNet model contains an initial “Perception” module that analyzes the frame pixels to obtain bounding boxes. Errors in this component naturally propagate to the rest of the model and adversely affect results. Here we analyze the effect of the perception module by replacing it with ground-truth bounding boxes and visibility bits. See Appendix <ref labelref="LABEL:sec:pp_annotation"/> for details on extracting ground-truth annotations. In this setup, all errors reflect failures in the reasoning components of the models.</p>
      </para>
      <table inlist="lot" labels="LABEL:table:perfect_per_results" placement="!h" xml:id="S7.T3">
        <tags>
          <tag>Table 3</tag>
          <tag role="autoref">Table 3</tag>
          <tag role="refnum">3</tag>
          <tag role="typerefnum">Table 3</tag>
        </tags>
        <block align="center" depth="0.0pt" width="433.6pt">
          <tabular class="ltx_guessed_headers" vattach="middle">
            <thead>
              <tr>
                <td align="left" border="l r t" thead="column row">Mean IoU <Math mode="inline" tex="\pm" text="plus-or-minus" xml:id="S7.T3.m1">
                    <XMath>
                      <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                    </XMath>
                  </Math> SEM</td>
                <td align="center" border="r t" thead="column">Visible</td>
                <td align="center" border="r t" thead="column">Occluded</td>
                <td align="center" border="r t" thead="column">Contained</td>
                <td align="center" border="r t" thead="column">Carried</td>
                <td align="center" border="r t" thead="column">Overall</td>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td align="left" border="l r t" thead="row"><text font="smallcaps">DETECTOR + TRACKER</text></td>
                <td align="center" border="r t">90.27 <Math mode="inline" tex="\pm 0.13" text="plus-or-minus 0.13" xml:id="S7.T3.m2">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.13" role="NUMBER">0.13</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">53.62 <Math mode="inline" tex="\pm 0.58" text="plus-or-minus 0.58" xml:id="S7.T3.m3">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.58" role="NUMBER">0.58</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">39.98 <Math mode="inline" tex="\pm 0.38" text="plus-or-minus 0.38" xml:id="S7.T3.m4">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.38" role="NUMBER">0.38</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">34.45 <Math mode="inline" tex="\pm 0.40" text="plus-or-minus 0.40" xml:id="S7.T3.m5">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.40" role="NUMBER">0.40</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">71.23 <Math mode="inline" tex="\pm 0.51" text="plus-or-minus 0.51" xml:id="S7.T3.m6">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.51" role="NUMBER">0.51</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
              </tr>
              <tr>
                <td align="left" border="l r t" thead="row"><text font="smallcaps">DETECTOR + HEURISTIC</text></td>
                <td align="center" border="r t"><text font="bold">95.59</text> <Math mode="inline" tex="\pm 0.34" text="plus-or-minus 0.34" xml:id="S7.T3.m7">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.34" role="NUMBER">0.34</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">30.40 <Math mode="inline" tex="\pm 0.81" text="plus-or-minus 0.81" xml:id="S7.T3.m8">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.81" role="NUMBER">0.81</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">59.81 <Math mode="inline" tex="\pm 0.47" text="plus-or-minus 0.47" xml:id="S7.T3.m9">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.47" role="NUMBER">0.47</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">59.33 <Math mode="inline" tex="\pm 0.50" text="plus-or-minus 0.50" xml:id="S7.T3.m10">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.50" role="NUMBER">0.50</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">81.24 <Math mode="inline" tex="\pm 0.49" text="plus-or-minus 0.49" xml:id="S7.T3.m11">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.49" role="NUMBER">0.49</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
              </tr>
              <tr>
                <td align="left" border="l r t" thead="row"><text font="smallcaps">BASELINE LSTM</text></td>
                <td align="center" border="r t">75.22 <Math mode="inline" tex="\pm 0.31" text="plus-or-minus 0.31" xml:id="S7.T3.m12">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.31" role="NUMBER">0.31</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">50.52 <Math mode="inline" tex="\pm 0.75" text="plus-or-minus 0.75" xml:id="S7.T3.m13">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.75" role="NUMBER">0.75</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">45.10 <Math mode="inline" tex="\pm 0.62" text="plus-or-minus 0.62" xml:id="S7.T3.m14">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.62" role="NUMBER">0.62</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">19.12 <Math mode="inline" tex="\pm 0.36" text="plus-or-minus 0.36" xml:id="S7.T3.m15">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.36" role="NUMBER">0.36</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">61.41 <Math mode="inline" tex="\pm 0.53" text="plus-or-minus 0.53" xml:id="S7.T3.m16">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.53" role="NUMBER">0.53</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
              </tr>
              <tr>
                <td align="left" border="l r t" thead="row"><text font="smallcaps">NON-LINEAR + LSTM</text></td>
                <td align="center" border="r t">88.63 <Math mode="inline" tex="\pm 0.25" text="plus-or-minus 0.25" xml:id="S7.T3.m17">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.25" role="NUMBER">0.25</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">65.73 <Math mode="inline" tex="\pm 0.82" text="plus-or-minus 0.82" xml:id="S7.T3.m18">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.82" role="NUMBER">0.82</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">58.77 <Math mode="inline" tex="\pm 0.70" text="plus-or-minus 0.70" xml:id="S7.T3.m19">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.70" role="NUMBER">0.70</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">23.89 <Math mode="inline" tex="\pm 0.41" text="plus-or-minus 0.41" xml:id="S7.T3.m20">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.41" role="NUMBER">0.41</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">74.53 <Math mode="inline" tex="\pm 0.54" text="plus-or-minus 0.54" xml:id="S7.T3.m21">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.54" role="NUMBER">0.54</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
              </tr>
              <tr>
                <td align="left" border="l r t" thead="row"><text font="smallcaps">TRANSFORMER + LSTM</text></td>
                <td align="center" border="r t">93.99 <Math mode="inline" tex="\pm 0.24" text="plus-or-minus 0.24" xml:id="S7.T3.m22">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.24" role="NUMBER">0.24</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t"><text font="bold">81.31</text> <Math mode="inline" tex="\pm 0.88" text="plus-or-minus 0.88" xml:id="S7.T3.m23">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.88" role="NUMBER">0.88</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">75.75 <Math mode="inline" tex="\pm 0.85" text="plus-or-minus 0.85" xml:id="S7.T3.m24">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.85" role="NUMBER">0.85</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">28.01 <Math mode="inline" tex="\pm 0.44" text="plus-or-minus 0.44" xml:id="S7.T3.m25">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.44" role="NUMBER">0.44</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">83.78 <Math mode="inline" tex="\pm 0.55" text="plus-or-minus 0.55" xml:id="S7.T3.m26">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.55" role="NUMBER">0.55</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
              </tr>
              <tr>
                <td align="left" border="l r t" thead="row"><text font="smallcaps">OPNet (LSTM + MLP)</text></td>
                <td align="center" border="r t">88.11 <Math mode="inline" tex="\pm 0.16" text="plus-or-minus 0.16" xml:id="S7.T3.m27">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.16" role="NUMBER">0.16</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">19.39 <Math mode="inline" tex="\pm 0.60" text="plus-or-minus 0.60" xml:id="S7.T3.m28">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.60" role="NUMBER">0.60</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">77.40 <Math mode="inline" tex="\pm 0.68" text="plus-or-minus 0.68" xml:id="S7.T3.m29">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.68" role="NUMBER">0.68</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t"><text font="bold">78.25</text> <Math mode="inline" tex="\pm 0.65" text="plus-or-minus 0.65" xml:id="S7.T3.m30">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.65" role="NUMBER">0.65</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">83.84 <Math mode="inline" tex="\pm 0.48" text="plus-or-minus 0.48" xml:id="S7.T3.m31">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.48" role="NUMBER">0.48</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
              </tr>
              <tr>
                <td align="left" border="b l r t" thead="row"><text font="smallcaps">OPNet (LSTM + LSTM)</text></td>
                <td align="center" border="b r t">88.78 <Math mode="inline" tex="\pm 0.25" text="plus-or-minus 0.25" xml:id="S7.T3.m32">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.25" role="NUMBER">0.25</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="b r t">67.79 <Math mode="inline" tex="\pm 0.69" text="plus-or-minus 0.69" xml:id="S7.T3.m33">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.69" role="NUMBER">0.69</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="b r t"><text font="bold">83.47</text> <Math mode="inline" tex="\pm 0.47" text="plus-or-minus 0.47" xml:id="S7.T3.m34">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.47" role="NUMBER">0.47</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="b r t">76.42 <Math mode="inline" tex="\pm 0.66" text="plus-or-minus 0.66" xml:id="S7.T3.m35">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.66" role="NUMBER">0.66</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="b r t"><text font="bold">85.44</text> <Math mode="inline" tex="\pm 0.38" text="plus-or-minus 0.38" xml:id="S7.T3.m36">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.38" role="NUMBER">0.38</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
              </tr>
            </tbody>
          </tabular>
        </block>
        <toccaption><tag close=" ">3</tag>Mean IoU performance with the <text font="italic">Perfect Perception</text> setup. “<Math mode="inline" tex="\pm" text="plus-or-minus" xml:id="S7.T3.m37">
            <XMath>
              <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
            </XMath>
          </Math>” denotes the standard error of the mean (S.E.M.). Results are similar in nature to those with imperfect, detector-based, perception (Table <ref labelref="LABEL:table:od_results"/>). All models improve when using the ground-truth perception information. The subtask that improves the most with OPNet is the carried task.
</toccaption>
        <caption><tag close=": ">Table 3</tag>Mean IoU performance with the <text font="italic">Perfect Perception</text> setup. “<Math mode="inline" tex="\pm" text="plus-or-minus" xml:id="S7.T3.m38">
            <XMath>
              <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
            </XMath>
          </Math>” denotes the standard error of the mean (S.E.M.). Results are similar in nature to those with imperfect, detector-based, perception (Table <ref labelref="LABEL:table:od_results"/>). All models improve when using the ground-truth perception information. The subtask that improves the most with OPNet is the carried task.
</caption>
      </table>
      <para xml:id="S7.SS1.p2">
        <p>Table <ref labelref="LABEL:table:perfect_per_results"/> provides the IoU performance and Figure <ref labelref="LABEL:fig:map_pp"/> the MAP for all compared methods on all four subtasks. Compared with the previous section (imperfect, detector-based perception), the overall trend is the same, but all models improve when using the ground-truth perception information. Interestingly, the subtask that improves the most from using ground-truth boxes is the carried task. This makes sense, since it is the hardest subtask and the one that relies most on having the correct object locations in every frame.</p>
      </para>
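      <para>
        <p>To make the reported quantities concrete, the mean IoU and its standard error of the mean could be computed along the following lines. This is a minimal sketch under stated assumptions, not the evaluation code used for the experiments: the (x1, y1, x2, y2) box format, the helper names iou and mean_and_sem, the use of the sample standard deviation, and the example boxes are all hypothetical.</p>
        <verbatim font="typewriter">
```python
import math


def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Width and height of the overlap, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def mean_and_sem(values):
    """Mean and standard error of the mean (sample std / sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    if n == 1:
        return mean, 0.0
    var = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(var / n)


# Hypothetical predicted and ground-truth boxes for three frames.
predicted = [(0, 0, 2, 2), (1, 1, 3, 3), (0, 0, 1, 1)]
ground_truth = [(0, 0, 2, 2), (2, 2, 4, 4), (5, 5, 6, 6)]
scores = [iou(p, g) for p, g in zip(predicted, ground_truth)]
mean_iou, sem = mean_and_sem(scores)
```
        </verbatim>
        <p>In this toy example the three frames cover a perfect match, a partial overlap, and a complete miss, which is roughly the range of cases that separates the visible subtask from the carried subtask.</p>
      </para>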
      <figure inlist="lof" labels="LABEL:fig:map_pp" placement="!h" xml:id="S7.F4">
        <tags>
          <tag>Figure 4</tag>
          <tag role="autoref">Figure 4</tag>
          <tag role="refnum">4</tag>
          <tag role="typerefnum">Figure 4</tag>
        </tags>
        <graphics candidates="Figures/map_metric_perfect_perception.png" class="ltx_centering" graphic="Figures/map_metric_perfect_perception.png" options="width=411.939pt, height=0.0pt, keepaspectratio=true" xml:id="S7.F4.g1"/>
<!--  %“vspace–-10pt˝ -->        <toccaption class="ltx_centering"><tag close=" ">4</tag>Mean average precision (MAP) as a function of IoU thresholds for reasoning with Perfect Perception (Section <ref labelref="LABEL:sec:perfect_abl"/>). The most notable performance gain of OPNet (pink and brown curves) was with carried targets (subtask d). </toccaption>
        <caption class="ltx_centering"><tag close=": ">Figure 4</tag>Mean average precision (MAP) as a function of IoU thresholds for reasoning with Perfect Perception (Section <ref labelref="LABEL:sec:perfect_abl"/>). The most notable performance gain of OPNet (pink and brown curves) was with carried targets (subtask d). </caption>
      </figure>
    </subsection>
    <subsection inlist="toc" labels="LABEL:sect:abl" xml:id="S7.SS2">
      <tags>
        <tag>7.2</tag>
        <tag role="autoref">subsection 7.2</tag>
        <tag role="refnum">7.2</tag>
        <tag role="typerefnum">§7.2</tag>
      </tags>
      <title><tag close=" ">7.2</tag>Learning only from Visible Frames</title>
      <para xml:id="S7.SS2.p1">
        <p>We now examine a learning setup in which localization supervision is available only for frames where the target object is visible. This setup corresponds more naturally to the process by which people learn object permanence. For instance, imagine a child learning to track a carried (non-visible) object for the first time and receiving surprising feedback only when the object reappears in the scene.</p>
      </para>
      <para xml:id="S7.SS2.p2">
        <p>In the absence of any supervision when the target is not visible, an auxiliary loss is needed to account for these frames. To this end, we incorporated an auxiliary <emph font="italic">consistency loss</emph> that minimizes the change between predictions in consecutive frames: <Math mode="inline" tex="\mathcal{L}_{consistency}=\frac{1}{n}\sum_{t=1}^{n}\left\lVert b_{t}-b_{t-1}%&#10;\right\rVert^{2}" text="L _ (c * o * n * s * i * s * t * e * n * c * y) = (1 / n) * ((sum _ (t = 1)) ^ n)@((norm@(b _ t - b _ (t - 1))) ^ 2)" xml:id="S7.SS2.p2.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">c</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">o</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">e</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">c</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">y</XMTok>
                  </XMApp>
                </XMApp>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMApp>
                    <XMTok mathstyle="text" meaning="divide" role="FRACOP"/>
                    <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok mathstyle="text" meaning="sum" role="SUMOP" scriptpos="post">∑</XMTok>
                        <XMApp>
                          <XMTok fontsize="70%" meaning="equals" role="RELOP">=</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                          <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                        </XMApp>
                      </XMApp>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                    </XMApp>
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMDual>
                        <XMApp>
                          <XMTok meaning="norm"/>
                          <XMRef idref="S7.SS2.p2.m1.1"/>
                        </XMApp>
                        <XMWrap>
                          <XMTok name="lVert" role="OPEN" stretchy="true">∥</XMTok>
                          <XMApp xml:id="S7.SS2.p2.m1.1">
                            <XMTok meaning="minus" role="ADDOP">-</XMTok>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                              <XMTok font="italic" role="UNKNOWN">b</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                            </XMApp>
                            <XMApp>
                              <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                              <XMTok font="italic" role="UNKNOWN">b</XMTok>
                              <XMApp>
                                <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                              </XMApp>
                            </XMApp>
                          </XMApp>
                          <XMTok name="rVert" role="CLOSE" stretchy="true">∥</XMTok>
                        </XMWrap>
                      </XMDual>
                      <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                    </XMApp>
                  </XMApp>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>. The total loss is defined as a weighted sum of the localization loss and the consistency loss, with weights that balance their different scales: <Math mode="inline" tex="\mathcal{L}=\alpha\cdot\mathcal{L}_{localization}+\beta\cdot\mathcal{L}_{consistency}" text="L = alpha cdot L _ (l * o * c * a * l * i * z * a * t * i * o * n) + beta cdot L _ (c * o * n * s * i * s * t * e * n * c * y)" xml:id="S7.SS2.p2.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                <XMApp>
                  <XMTok meaning="plus" role="ADDOP">+</XMTok>
                  <XMApp>
                    <XMTok name="cdot" role="MULOP">⋅</XMTok>
                    <XMTok font="italic" name="alpha" role="UNKNOWN">α</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">l</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">o</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">c</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">l</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">z</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">o</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                  <XMApp>
                    <XMTok name="cdot" role="MULOP">⋅</XMTok>
                    <XMTok font="italic" name="beta" role="UNKNOWN">β</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">c</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">o</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">e</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">c</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">y</XMTok>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>.
Details on choosing the values of <Math mode="inline" tex="\alpha" text="alpha" xml:id="S7.SS2.p2.m3">
            <XMath>
              <XMTok font="italic" name="alpha" role="UNKNOWN">α</XMTok>
            </XMath>
          </Math> and <Math mode="inline" tex="\beta" text="beta" xml:id="S7.SS2.p2.m4">
            <XMath>
              <XMTok font="italic" name="beta" role="UNKNOWN">β</XMTok>
            </XMath>
          </Math> are provided in the supplementary material.</p>
      </para>
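      <para>
        <p>The combined objective above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than the training code: the function names, the L1 form of the localization term, the treatment of the first prediction as b_0, and the default values of alpha and beta are all assumptions made for the example (the actual weights are described in the supplementary material).</p>
        <verbatim font="typewriter">
```python
def consistency_loss(boxes):
    """Mean squared change between predictions in consecutive frames.

    Assumption: boxes[0] plays the role of b_0, so a sequence of n + 1
    predictions yields n consecutive differences.
    """
    n = max(len(boxes) - 1, 0)
    if n == 0:
        return 0.0
    total = 0.0
    for prev, curr in zip(boxes, boxes[1:]):
        total += sum((c - p) ** 2 for c, p in zip(curr, prev))
    return total / n


def localization_loss(boxes, targets, visible):
    """Assumed stand-in: mean L1 error over frames where the target is visible."""
    errors = [
        sum(abs(b - t) for b, t in zip(box, target))
        for box, target, is_visible in zip(boxes, targets, visible)
        if is_visible
    ]
    return sum(errors) / len(errors) if errors else 0.0


def total_loss(boxes, targets, visible, alpha=1.0, beta=0.5):
    """L = alpha * L_localization + beta * L_consistency (weights hypothetical)."""
    return (alpha * localization_loss(boxes, targets, visible)
            + beta * consistency_loss(boxes))


# Hypothetical three-frame sequence; the target is hidden in the middle frame,
# so its ground-truth box there is never used by the localization term.
boxes = [(0, 0, 1, 1), (1, 0, 2, 1), (1, 0, 2, 1)]
targets = [(0, 0, 1, 1), (9, 9, 9, 9), (1, 0, 3, 1)]
visible = [True, False, True]
loss = total_loss(boxes, targets, visible)
```
        </verbatim>
        <p>Note that the consistency term is computed over all frames, including those where the target is hidden, which is what lets it provide a training signal when no localization supervision is available.</p>
      </para>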
      <table inlist="lot" labels="LABEL:table:visible_only" placement="t" xml:id="S7.T4">
        <tags>
          <tag>Table 4</tag>
          <tag role="autoref">Table 4</tag>
          <tag role="refnum">4</tag>
          <tag role="typerefnum">Table 4</tag>
        </tags>
        <block align="center" depth="0.0pt" width="433.6pt">
          <tabular class="ltx_guessed_headers" vattach="middle">
            <thead>
              <tr>
                <td align="left" border="l r t" thead="column row"><tabular vattach="middle">
                    <tr>
                      <td align="center">Mean IoU</td>
                    </tr>
                  </tabular></td>
                <td align="center" border="r t" thead="column"><tabular vattach="middle">
                    <tr>
                      <td align="center">Visible</td>
                    </tr>
                  </tabular></td>
                <td align="center" border="r t" thead="column"><tabular vattach="middle">
                    <tr>
                      <td align="center">Occluded</td>
                    </tr>
                  </tabular></td>
                <td align="center" border="r t" thead="column"><tabular vattach="middle">
                    <tr>
                      <td align="center">Contained</td>
                    </tr>
                  </tabular></td>
                <td align="center" border="r t" thead="column"><tabular vattach="middle">
                    <tr>
                      <td align="center">Carried</td>
                    </tr>
                  </tabular></td>
                <td align="center" border="r t" thead="column">Overall</td>
              </tr>
            </thead>
            <tbody>
              <tr>
                <td align="left" border="l r t" thead="row"><text font="smallcaps">Baseline LSTM</text></td>
                <td align="center" border="r t">88.61 <Math mode="inline" tex="\pm 0.16" text="plus-or-minus 0.16" xml:id="S7.T4.m1">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.16" role="NUMBER">0.16</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">80.39 <Math mode="inline" tex="\pm 0.54" text="plus-or-minus 0.54" xml:id="S7.T4.m2">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.54" role="NUMBER">0.54</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">68.35 <Math mode="inline" tex="\pm 0.76" text="plus-or-minus 0.76" xml:id="S7.T4.m3">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.76" role="NUMBER">0.76</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">27.39 <Math mode="inline" tex="\pm 0.45" text="plus-or-minus 0.45" xml:id="S7.T4.m4">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.45" role="NUMBER">0.45</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">78.09 <Math mode="inline" tex="\pm 0.49" text="plus-or-minus 0.49" xml:id="S7.T4.m5">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.49" role="NUMBER">0.49</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
              </tr>
              <tr>
                <td align="left" border="l r t" thead="row"><text font="smallcaps">Non Linear + LSTM</text></td>
                <td align="center" border="r t">89.30 <Math mode="inline" tex="\pm 0.15" text="plus-or-minus 0.15" xml:id="S7.T4.m6">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.15" role="NUMBER">0.15</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">82.49 <Math mode="inline" tex="\pm 0.45" text="plus-or-minus 0.45" xml:id="S7.T4.m7">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.45" role="NUMBER">0.45</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">67.25 <Math mode="inline" tex="\pm 0.75" text="plus-or-minus 0.75" xml:id="S7.T4.m8">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.75" role="NUMBER">0.75</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">27.34 <Math mode="inline" tex="\pm 0.45" text="plus-or-minus 0.45" xml:id="S7.T4.m9">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.45" role="NUMBER">0.45</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">78.15 <Math mode="inline" tex="\pm 0.49" text="plus-or-minus 0.49" xml:id="S7.T4.m10">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.49" role="NUMBER">0.49</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
              </tr>
              <tr>
                <td align="left" border="l r t" thead="row"><text font="smallcaps">Transformer + LSTM</text></td>
                <td align="center" border="r t">88.33 <Math mode="inline" tex="\pm 0.15" text="plus-or-minus 0.15" xml:id="S7.T4.m11">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.15" role="NUMBER">0.15</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t"><text font="bold">83.74</text> <Math mode="inline" tex="\pm 0.44" text="plus-or-minus 0.44" xml:id="S7.T4.m12">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.44" role="NUMBER">0.44</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t"><text font="bold">69.93</text> <Math mode="inline" tex="\pm 0.77" text="plus-or-minus 0.77" xml:id="S7.T4.m13">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.77" role="NUMBER">0.77</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t"><text font="bold">27.65</text> <Math mode="inline" tex="\pm 0.54" text="plus-or-minus 0.54" xml:id="S7.T4.m14">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.54" role="NUMBER">0.54</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">78.43 <Math mode="inline" tex="\pm 0.49" text="plus-or-minus 0.49" xml:id="S7.T4.m15">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.49" role="NUMBER">0.49</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
              </tr>
              <tr>
                <td align="left" border="l r t" thead="row"><text font="smallcaps">OPNet (LSTM + MLP)</text></td>
                <td align="center" border="r t">88.45 <Math mode="inline" tex="\pm 0.17" text="plus-or-minus 0.17" xml:id="S7.T4.m16">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.17" role="NUMBER">0.17</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">48.03 <Math mode="inline" tex="\pm 0.82" text="plus-or-minus 0.82" xml:id="S7.T4.m17">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.82" role="NUMBER">0.82</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">10.95 <Math mode="inline" tex="\pm 0.51" text="plus-or-minus 0.51" xml:id="S7.T4.m18">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.51" role="NUMBER">0.51</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">7.28 <Math mode="inline" tex="\pm 0.30" text="plus-or-minus 0.30" xml:id="S7.T4.m19">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.30" role="NUMBER">0.30</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="r t">61.18 <Math mode="inline" tex="\pm 0.69" text="plus-or-minus 0.69" xml:id="S7.T4.m20">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.69" role="NUMBER">0.69</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
              </tr>
              <tr>
                <td align="left" border="b l r t" thead="row"><text font="smallcaps">OPNet (LSTM + LSTM)</text></td>
                <td align="center" border="b r t"><text font="bold">88.95</text> <Math mode="inline" tex="\pm 0.16" text="plus-or-minus 0.16" xml:id="S7.T4.m21">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.16" role="NUMBER">0.16</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="b r t">81.84 <Math mode="inline" tex="\pm 0.48" text="plus-or-minus 0.48" xml:id="S7.T4.m22">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.48" role="NUMBER">0.48</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="b r t">69.01 <Math mode="inline" tex="\pm 0.76" text="plus-or-minus 0.76" xml:id="S7.T4.m23">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.76" role="NUMBER">0.76</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="b r t">27.50 <Math mode="inline" tex="\pm 0.45" text="plus-or-minus 0.45" xml:id="S7.T4.m24">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.45" role="NUMBER">0.45</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
                <td align="center" border="b r t"><text font="bold">78.50</text> <Math mode="inline" tex="\pm 0.49" text="plus-or-minus 0.49" xml:id="S7.T4.m25">
                    <XMath>
                      <XMApp>
                        <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
                        <XMTok meaning="0.49" role="NUMBER">0.49</XMTok>
                      </XMApp>
                    </XMath>
                  </Math></td>
              </tr>
            </tbody>
          </tabular>
        </block>
        <toccaption><tag close=" ">4</tag>IoU performance for the <text font="italic">only visible supervision</text> setting. “<Math mode="inline" tex="\pm" text="plus-or-minus" xml:id="S7.T4.m26">
            <XMath>
              <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
            </XMath>
          </Math>” denotes the standard error of the mean (S.E.M.). The models perform well when the target is visible, fully occluded, or contained without movement, but not when the target is carried.
</toccaption>
        <caption><tag close=": ">Table 4</tag>IoU performance for the <text font="italic">only visible supervision</text> setting. “<Math mode="inline" tex="\pm" text="plus-or-minus" xml:id="S7.T4.m27">
            <XMath>
              <XMTok meaning="plus-or-minus" name="pm" role="ADDOP">±</XMTok>
            </XMath>
          </Math>” denotes the standard error of the mean (S.E.M.). The models perform well when the target is visible, fully occluded, or contained without movement, but not when the target is carried.
</caption>
      </table>
      <para xml:id="S7.SS2.p3">
        <p>Table <ref labelref="LABEL:table:visible_only"/> shows the mean IoU for this setup (compare with Table <ref labelref="LABEL:table:od_results"/>). The baselines perform well when the target is visible, fully occluded or contained without movement.
This behavior is consistent with the inductive bias of the <emph font="italic">consistency loss</emph>: to solve these subtasks, a model usually only needs to predict the last known target location. This explains why the OPNet (LSTM+MLP) model performs so poorly in this setup.</p>
      </para>
      <para xml:id="S7.SS2.p4">
        <p>We note that the performance of non-OPNet models on the “carried” task is similar to that obtained using full supervision (see Table <ref labelref="LABEL:table:od_results"/>, Section <ref labelref="LABEL:sec:Results"/>). This suggests that these models fail to use the supervision for the “carried” task, and further reinforces the observation that localizing a carried object is highly challenging.</p>
      </para>
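      <para>
        <p>For readers who wish to reproduce this kind of summary, the following minimal sketch shows how a mean IoU and its standard error of the mean could be computed from per-sample IoU scores. It is an illustration only and not the evaluation code used in this work: the function names <text font="typewriter">box_iou</text> and <text font="typewriter">mean_and_sem</text>, the (x1, y1, x2, y2) box format, and the choice of what counts as one sample are all assumptions.</p>
        <verbatim>
```python
import math


def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap width and height, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def mean_and_sem(values):
    """Mean and standard error of the mean (sample std divided by sqrt(n))."""
    n = len(values)
    mean = sum(values) / n
    if n == 1:
        return mean, 0.0
    variance = sum((v - mean) ** 2 for v in values) / (n - 1)
    return mean, math.sqrt(variance / n)
```
        </verbatim>
        <p>Under these assumptions, one would call <text font="typewriter">box_iou</text> on each predicted and ground-truth box pair within a subtask and pass the resulting scores to <text font="typewriter">mean_and_sem</text>.</p>
      </para>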
    </subsection>
    <subsection inlist="toc" labels="LABEL:sec:cater_res" xml:id="S7.SS3">
      <tags>
        <tag>7.3</tag>
        <tag role="autoref">subsection 7.3</tag>
        <tag role="refnum">7.3</tag>
        <tag role="typerefnum">§7.3</tag>
      </tags>
      <title><tag close=" ">7.3</tag>Comparison with CATER Data</title>
      <para xml:id="S7.SS3.p1">
        <p>The original CATER paper <cite class="ltx_citemacro_cite">[<bibref bibrefs="girdhar2019cater" separator="," yyseparator=","/>]</cite> considered the “snitch localization” task, which aims to localize the snitch in the last frame of the video and is formalized as a classification problem. The x-y plane was divided into a 6-by-6 grid, and the goal was to predict the correct cell of that grid.</p>
      </para>
      <para xml:id="S7.SS3.p2">
        <p>Here we report the performance of OPNet and relevant baselines evaluated on the same setup as in <cite class="ltx_citemacro_cite">[<bibref bibrefs="girdhar2019cater" separator="," yyseparator=","/>]</cite>, to facilitate comparison between our models and the results reported there. Table <ref labelref="LABEL:table:cater_results"/>
shows the accuracy and <Math mode="inline" tex="L_{1}" text="L _ 1" xml:id="S7.SS3.p2.m1">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">L</XMTok>
                <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
              </XMApp>
            </XMath>
          </Math> distance metrics for this evaluation. OPNet significantly improves over all baselines from <cite class="ltx_citemacro_cite">[<bibref bibrefs="girdhar2019cater" separator="," yyseparator=","/>]</cite>. It reduces the classification error from <Math mode="inline" tex="40\%" text="40percent" xml:id="S7.SS3.p2.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="percent" role="POSTFIX">%</XMTok>
                <XMTok meaning="40" role="NUMBER">40</XMTok>
              </XMApp>
            </XMath>
          </Math> to <Math mode="inline" tex="24\%" text="24percent" xml:id="S7.SS3.p2.m3">
            <XMath>
              <XMApp>
                <XMTok meaning="percent" role="POSTFIX">%</XMTok>
                <XMTok meaning="24" role="NUMBER">24</XMTok>
              </XMApp>
            </XMath>
          </Math>, and the <Math mode="inline" tex="L_{1}" text="L _ 1" xml:id="S7.SS3.p2.m4">
            <XMath>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">L</XMTok>
                <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
              </XMApp>
            </XMath>
          </Math> distance from <Math mode="inline" tex="1.2" text="1.2" xml:id="S7.SS3.p2.m5">
            <XMath>
              <XMTok meaning="1.2" role="NUMBER">1.2</XMTok>
            </XMath>
          </Math> to <Math mode="inline" tex="0.54" text="0.54" xml:id="S7.SS3.p2.m6">
            <XMath>
              <XMTok meaning="0.54" role="NUMBER">0.54</XMTok>
            </XMath>
          </Math>.</p>
      </para>
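      <para>
        <p>To make the two metrics concrete, the following sketch shows one way to map a position on the x-y plane to a cell of the 6-by-6 grid and to score predictions by accuracy and by mean L1 distance between grid cells. It is a hedged illustration, not the evaluation code of the CATER benchmark or of this work: the plane extent of [-3, 3] on both axes, the reading of the L1 distance as a Manhattan distance in grid cells, and the names <text font="typewriter">grid_cell</text> and <text font="typewriter">accuracy_and_l1</text> are assumptions.</p>
        <verbatim>
```python
GRID_SIZE = 6                      # the 6-by-6 grid described in the text
PLANE_MIN, PLANE_MAX = -3.0, 3.0   # assumed extent of the x-y plane


def grid_cell(x, y):
    """Map a point on the plane to a (row, col) cell of the grid."""
    cell_width = (PLANE_MAX - PLANE_MIN) / GRID_SIZE
    col = int((x - PLANE_MIN) // cell_width)
    row = int((y - PLANE_MIN) // cell_width)
    # Clamp so that points on or beyond the boundary fall into the edge cells.
    col = min(max(col, 0), GRID_SIZE - 1)
    row = min(max(row, 0), GRID_SIZE - 1)
    return row, col


def accuracy_and_l1(predicted_cells, true_cells):
    """Fraction of exact cell matches and mean Manhattan distance in cells."""
    pairs = list(zip(predicted_cells, true_cells))
    correct = sum(1 for p, t in pairs if p == t)
    l1_total = sum(abs(p[0] - t[0]) + abs(p[1] - t[1]) for p, t in pairs)
    return correct / len(pairs), l1_total / len(pairs)
```
        </verbatim>
        <p>Reporting both numbers is useful because accuracy only counts exact cell matches, whereas the distance also reflects how far a wrong prediction lands from the true cell.</p>
      </para>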
      <table inlist="lot" labels="LABEL:table:cater_results" placement="h" xml:id="S7.T5">
        <tags>
          <tag>Table 5</tag>
          <tag role="autoref">Table 5</tag>
          <tag role="refnum">5</tag>
          <tag role="typerefnum">Table 5</tag>
        </tags>
        <tabular class="ltx_centering ltx_guessed_headers" vattach="middle">
          <tbody>
            <tr>
              <td border="l r t" thead="row"/>
              <td align="center" border="r t">Accuracy</td>
              <td align="center" border="r t"><Math mode="inline" tex="L_{1}" text="L _ 1" xml:id="S7.T5.m1">
                  <XMath>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMTok font="italic" role="UNKNOWN">L</XMTok>
                      <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                    </XMApp>
                  </XMath>
                </Math> Distance</td>
            </tr>
            <tr>
              <td align="left" border="l r" thead="row">Model</td>
              <td align="center" border="r"><text fontsize="90%">(higher is better)</text></td>
              <td align="center" border="r"><text fontsize="90%">(lower is better)</text></td>
            </tr>
            <tr>
              <td align="left" border="l r t" thead="row"><text font="smallcaps">DaSiamRPN</text></td>
              <td align="center" border="r t">33.9</td>
              <td align="center" border="r t">2.4</td>
            </tr>
            <tr>
              <td align="left" border="l r" thead="row"><text font="smallcaps">TSN-RGB + LSTM</text></td>
              <td align="center" border="r">25.6</td>
              <td align="center" border="r">2.6</td>
            </tr>
            <tr>
              <td align="left" border="l r" thead="row"><text font="smallcaps">R3D + LSTM</text></td>
              <td align="center" border="r">60.2</td>
              <td align="center" border="r">1.2</td>
            </tr>
            <tr>
              <td align="left" border="b l r t" thead="row"><text font="smallcaps">OPNet (Ours)</text></td>
              <td align="center" border="b r t"><text font="bold">74.8</text></td>
              <td align="center" border="b r t"><text font="bold">0.54</text></td>
            </tr>
          </tbody>
        </tabular>
        <toccaption><tag close=" ">5</tag>Classification accuracy on the CATER dataset using the metrics of <cite class="ltx_citemacro_cite">[<bibref bibrefs="girdhar2019cater" separator="," yyseparator=","/>]</cite>. OPNet significantly improves over all baselines for the “snitch localization” task.
</toccaption>
        <caption><tag close=": ">Table 5</tag>Classification accuracy on the CATER dataset using the metrics of <cite class="ltx_citemacro_cite">[<bibref bibrefs="girdhar2019cater" separator="," yyseparator=","/>]</cite>. OPNet significantly improves over all baselines for the “snitch localization” task.
</caption>
      </table>
      <figure inlist="lof" labels="LABEL:fig:all_models_comparison" placement="h!" xml:id="S7.F5">
        <tags>
          <tag>Figure 5</tag>
          <tag role="autoref">Figure 5</tag>
          <tag role="refnum">5</tag>
          <tag role="typerefnum">Figure 5</tag>
        </tags>
        <p align="center">(a)                                                
(b)                                                 
<graphics candidates="Figures/model_comparison_1.png" graphic="Figures/model_comparison_1.png" options="width=214.6419pt" xml:id="S7.F5.g1"/>
<graphics candidates="Figures/model_comparison_2.png" graphic="Figures/model_comparison_2.png" options="width=214.6419pt" xml:id="S7.F5.g2"/></p>
        <toccaption class="ltx_centering"><tag close=" ">5</tag>Screenshots from the model comparison videos. Blue boxes denote the ground-truth location. Yellow boxes denote the predicted location. OPNet (ours) is in the bottom-right panel. <text font="bold">(a)</text> The target is contained and then <text font="italic">carried</text> by the blue cone, and OPNet localizes it successfully. <text font="bold">(b)</text> The target is occluded by the red cone and purple ball. These occlusions confuse all baselines, while OPNet localizes the target accurately.
</toccaption>
        <caption class="ltx_centering"><tag close=": ">Figure 5</tag>Screenshots from the model comparison videos. Blue boxes denote the ground-truth location. Yellow boxes denote the predicted location. OPNet (ours) is in the bottom-right panel. <text font="bold">(a)</text> The target is contained and then <text font="italic">carried</text> by the blue cone, and OPNet localizes it successfully. <text font="bold">(b)</text> The target is occluded by the red cone and purple ball. These occlusions confuse all baselines, while OPNet localizes the target accurately.
</caption>
      </figure>
      <figure inlist="lof" labels="LABEL:fig:win_loss" placement="t!" xml:id="S7.F6">
        <tags>
          <tag>Figure 6</tag>
          <tag role="autoref">Figure 6</tag>
          <tag role="refnum">6</tag>
          <tag role="typerefnum">Figure 6</tag>
        </tags>
        <p>(a)                  (b)                  (c)                 (d)                 (e)</p>
        <graphics candidates="Figures/win_example_grid.png" class="ltx_centering" graphic="Figures/win_example_grid.png" options="width=433.62pt, height=0.0pt, keepaspectratio=true" xml:id="S7.F6.g1"/>
        <graphics candidates="Figures/loss_example_grid.png" class="ltx_centering" graphic="Figures/loss_example_grid.png" options="width=433.62pt, height=0.0pt, keepaspectratio=true" xml:id="S7.F6.g2"/>
        <toccaption class="ltx_centering"><tag close=" ">6</tag>Examples of a success case (top row) and a failure case (bottom row) for localizing a carried object. The blue box marks the ground-truth location. The yellow box marks the predicted location. <text font="italic">Top</text> (a) The target object is visible; (b-c) The target becomes covered and carried by the orange cone; (d-e) The big golden cone covers and carries the orange cone, illustrating recursive containment. The target object is not visible, but OPNet successfully tracks it. <text font="italic">Bottom</text> (c-d) OPNet mistakenly switches to the wrong cone (the yellow cone instead of the brown cone);
(e) OPNet correctly detects when the yellow cone is picked up and switches to track the blue ball that was underneath.
</toccaption>
        <caption class="ltx_centering"><tag close=": ">Figure 6</tag>Examples of a success case (top row) and a failure case (bottom row) for localizing a carried object. The blue box marks the ground-truth location. The yellow box marks the predicted location. <text font="italic">Top</text> (a) The target object is visible; (b-c) The target becomes covered and carried by the orange cone; (d-e) The big golden cone covers and carries the orange cone, illustrating recursive containment. The target object is not visible, but OPNet successfully tracks it. <text font="italic">Bottom</text> (c-d) OPNet mistakenly switches to the wrong cone (the yellow cone instead of the brown cone);
(e) OPNet correctly detects when the yellow cone is picked up and switches to track the blue ball that was underneath.
</caption>
      </figure>
    </subsection>
    <subsection inlist="toc" xml:id="S7.SS4">
      <tags>
        <tag>7.4</tag>
        <tag role="autoref">subsection 7.4</tag>
        <tag role="refnum">7.4</tag>
        <tag role="typerefnum">§7.4</tag>
      </tags>
      <title><tag close=" ">7.4</tag>Qualitative Examples</title>
      <para xml:id="S7.SS4.p1">
        <p>To gain insight into the successes and failures of our model, we now analyze specific examples. We provide two sets of examples to illustrate: (1) Comparison between baselines and variants over the same set of videos; (2) Wins and losses of our approach.</p>
      </para>
<!--  %Errors of our OPNet model typically correspond to cases where there 
     %are multiple objects that are candidates for covering the target and OPNet chooses the wrong one.-->      <para xml:id="S7.SS4.p2">
        <p><text font="bold">Model Comparison</text>. We show two videos comparing OPNet with baselines and other variants. In both videos, four competing methods are applied to the same video scene. We recommend playing videos at a slow speed.</p>
      </para>
      <figure inlist="lof" labels="LABEL:fig:heat_map" placement="h!" xml:id="S7.F7">
        <tags>
          <tag>Figure 7</tag>
          <tag role="autoref">Figure 7</tag>
          <tag role="refnum">7</tag>
          <tag role="typerefnum">Figure 7</tag>
        </tags>
        <p>(a)                                               (b)</p>
        <graphics candidates="Figures/heatmap_win.png" class="ltx_centering" graphic="Figures/heatmap_win.png" options="width=212.4738pt" xml:id="S7.F7.g1"/>
        <graphics candidates="Figures/heatmap_loss.png" class="ltx_centering" graphic="Figures/heatmap_loss.png" options="width=212.4738pt" xml:id="S7.F7.g2"/>
        <toccaption class="ltx_centering"><tag close=" ">7</tag>Switching attention across objects. In each pair of panels, each row traces the probability assigned to an object along the video in the ground truth (left) and predicted attention (right). (a) The system successfully switches attention from object 1 (target) when it is contained by object 6 and then carried by object 3. (b) After a successful switch from object 1 to object 10, the system incorrectly switches to object 3.
</toccaption>
        <caption class="ltx_centering"><tag close=": ">Figure 7</tag>Switching attention across objects. In each pair of panels, each row traces the probability assigned to an object along the video in the ground truth (left) and predicted attention (right). (a) The system successfully switches attention from object 1 (target) when it is contained by object 6 and then carried by object 3. (b) After a successful switch from object 1 to object 10, the system incorrectly switches to object 3.
</caption>
      </figure>
      <para xml:id="S7.SS4.p3">
        <itemize xml:id="S7.I1">
          <item xml:id="S7.I1.i1">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">1st item</tag>
            </tags>
            <para xml:id="S7.I1.i1.p1">
              <p>The first model comparison video (<ref class="ltx_href" font="italic" href="https://youtu.be/TZgoxoKcGrE">https://youtu.be/TZgoxoKcGrE</ref>) shows one visual scene analyzed by four methods.
OPNet (ours) successfully localizes the target throughout the video. When the target is “carried”, the <text font="italic">Transformer</text> model (bottom left) fails to switch: instead of tracking the carrying object, it keeps predicting the last seen location of the target. The <text font="italic">Tracker</text> model (top left) switches to the wrong object. The <text font="italic">Heuristic</text> model (top right) successfully tracks the object containing the target, adjusting well to the target size. See Figure <ref labelref="LABEL:fig:all_models_comparison"/>(a).
</p>
            </para>
          </item>
          <item xml:id="S7.I1.i2">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">2nd item</tag>
            </tags>
            <para xml:id="S7.I1.i2.p1">
              <p>The second model comparison video (<ref class="ltx_href" font="italic" href="https://youtu.be/KoxbhgalazU">https://youtu.be/KoxbhgalazU</ref>) shows a visual scene analyzed by four methods. In this video, the target is occluded by multiple objects, at times fully, which makes it challenging to track.
The <text font="italic">Tracker</text>, <text font="italic">Heuristic</text> and <text font="italic">OPNet MLP</text> models occasionally drift from the target when it is fully occluded by a large object.
OPNet (ours) successfully localizes the target throughout the video. See Figure <ref labelref="LABEL:fig:all_models_comparison"/>(b).</p>
            </para>
          </item>
        </itemize>
      </para>
      <para xml:id="S7.SS4.p4">
        <p><text font="bold">Wins and Losses of OPNet</text>.
We provide examples of OPNet successes and failures that give insight into the behavior and limitations of the model.</p>
      </para>
      <para xml:id="S7.SS4.p5">
        <itemize xml:id="S7.I2">
          <item xml:id="S7.I2.i1">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">1st item</tag>
            </tags>
            <para xml:id="S7.I2.i1.p1">
              <p>The video <ref class="ltx_href" font="italic" href="https://youtu.be/FnturB2Blw8">https://youtu.be/FnturB2Blw8</ref> provides a “win” example. It demonstrates the power of OPNet and its <text font="italic">“who to track”</text> reasoning component. In the video, the model handles phases of recursive containment (“babushka”), which involve “carrying”. This suggests that OPNet learns an implicit representation of the object actions (pick-up, slide, contain, etc.) even though it was not explicitly trained to do so. See Figure <ref labelref="LABEL:fig:win_loss"/> (top row).</p>
            </para>
          </item>
          <item xml:id="S7.I2.i2">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">2nd item</tag>
            </tags>
            <para xml:id="S7.I2.i2.p1">
              <p>The video <ref class="ltx_href" font="italic" href="https://youtu.be/qkdQSHLrGqI">https://youtu.be/qkdQSHLrGqI</ref> illustrates a failure of our model. It shows an example where OPNet fails to switch correctly between tracked objects when the target is “carried”. The model mistakenly switches to the wrong cone (the yellow cone), which already contains another object rather than the target. Interestingly, OPNet properly identifies when the yellow cone is picked up and switches to track the blue ball that was contained by the yellow cone. This suggests that OPNet has implicitly learned the “meaning” of actions performed by objects, without being explicitly trained to do so. See Figure <ref labelref="LABEL:fig:win_loss"/> (bottom row).</p>
            </para>
          </item>
        </itemize>
      </para>
      <para xml:id="S7.SS4.p6">
        <p>Further insight can be gained by comparing the attention mask of the OPNet “Who to Track” module with the ground-truth mask of the containing or carrying object. Figure <ref labelref="LABEL:fig:heat_map"/> compares these masks for success and failure cases. OPNet tracks the correct object in most of the frames.</p>
      </para>
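      <para>
        <p>The comparison between predicted and ground-truth attention can also be summarized by a single number. The sketch below computes the fraction of frames in which the most-attended object matches the object that should be tracked. It is an illustrative assumption about how such an agreement score could be defined, not an analysis performed in this work; the function name <text font="typewriter">attention_agreement</text> and the input layout (one list of per-object weights per frame) are hypothetical.</p>
        <verbatim>
```python
def attention_agreement(predicted, ground_truth):
    """Fraction of frames whose most-attended object matches the ground truth.

    predicted: list of per-frame lists of attention weights, one per object.
    ground_truth: list of per-frame indices of the object to be tracked.
    """
    matches = 0
    for weights, true_index in zip(predicted, ground_truth):
        # Index of the object that receives the largest attention weight.
        best = max(range(len(weights)), key=lambda i: weights[i])
        matches += int(best == true_index)
    return matches / len(ground_truth)
```
        </verbatim>
        <p>A score close to 1 would correspond to the success case in the figure, while an incorrect switch lowers the score for every frame that follows it.</p>
      </para>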
    </subsection>
  </section>
  <section inlist="toc" xml:id="S8">
    <tags>
      <tag>8</tag>
      <tag role="autoref">section 8</tag>
      <tag role="refnum">8</tag>
      <tag role="typerefnum">§8</tag>
    </tags>
    <title><tag close=" ">8</tag>Conclusion</title>
    <para xml:id="S8.p1">
      <p>We considered the problem of localizing one target object in highly dynamic scenes where the object can be occluded, contained or even carried away by another object. We name this task <text font="italic">object permanence</text>, following the cognitive concept of a target object that is physically present in a scene but is occluded and carried in various ways. We presented an architecture called OPNet, whose components correspond to the natural perceptual and reasoning stages of solving OP. Specifically, it has a module that learns to switch attention to another object if it infers that the object contains or carries the target. Our empirical evaluation shows that these components are needed for improving accuracy in this task.</p>
    </para>
    <para xml:id="S8.p2">
      <p>Our results highlight a remaining gap between perfect perception and a pixel-based detector. We expect this gap to be even wider when applying OP to more complex natural videos in an open-world setting. It will be interesting to further improve detection architectures in order to reduce this gap.</p>
    </para>
    <subsection xml:id="S8.SSx1">
      <title>Acknowledgments</title>
      <para xml:id="S8.SSx1.p1">
        <p>This study was funded by a grant to GC from the Israel Science Foundation (ISF 737/2018), and by an equipment grant to GC and Bar-Ilan University from the Israel Science Foundation (ISF 2332/18). AG received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation program (grant ERC HOLI 819080).</p>
      </para>
      <pagination role="newpage"/>
    </subsection>
  </section>
  <bibliography bibstyle="splncs04" citestyle="numbers" files="permanence" xml:id="bib">
    <title>References</title>
  </bibliography>
  <para xml:id="p1">
    <p>Supplementary Material</p>
  </para>
  <appendix inlist="toc" xml:id="Pt0.A1">
    <tags>
      <tag>Appendix 0.A</tag>
      <tag role="autoref">Appendix 0.A</tag>
      <tag role="refnum">0.A</tag>
      <tag role="typerefnum">Appendix 0.A</tag>
    </tags>
    <title><tag close=" ">Appendix 0.A</tag>Error Analysis across the Video Corpus</title>
    <toctitle><tag close=" ">0.A</tag>Error Analysis across the Video Corpus</toctitle>
    <para xml:id="Pt0.A1.p1">
      <p>Videos in our dataset vary substantially in terms of which OP tasks they involve. This has a large effect on localization accuracy, because it is much harder to localize a carried target than a visible one. To gain more insight into the performance of the leading models, we compare the localization IoU on a video-by-video basis.</p>
    </para>
    <para xml:id="Pt0.A1.p2">
      <p>Figure <ref labelref="LABEL:fig:overall_iou_model_comparison"/> depicts the per-video IoU of OPNet and two other strong baselines. Each point corresponds to one video and the color reflects the type of frames in that video. Figure <ref labelref="LABEL:fig:overall_iou_model_comparison"/>(a) shows that OPNet outperforms the Transformer baseline on videos that include <text font="italic">carried</text> frames (colored in orange). Clearly, videos with carried frames are clustered in the lower half of the figure, where OPNet is superior.</p>
    </para>
    <figure inlist="lof" labels="LABEL:fig:overall_iou_model_comparison" xml:id="Pt0.A1.F8">
      <tags>
        <tag>Figure S8</tag>
        <tag role="autoref">Figure S8</tag>
        <tag role="refnum">S8</tag>
        <tag role="typerefnum">Figure S8</tag>
      </tags>
      <p>(a)                                               (b)</p>
      <graphics candidates="Figures/scatter_yellow.png" class="ltx_centering" graphic="Figures/scatter_yellow.png" options="width=214.6419pt" xml:id="Pt0.A1.F8.g1"/>
      <graphics candidates="Figures/scatter_green.png" class="ltx_centering" graphic="Figures/scatter_green.png" options="width=214.6419pt" xml:id="Pt0.A1.F8.g2"/>
      <toccaption class="ltx_centering"><tag close=" ">S8</tag>Sample-by-sample comparison of OPNet with two strong baselines. Each point represents the IoU of a video from the test set, achieved by OPNet and a baseline. <text font="bold">(a)</text> Videos with more than 7% <text font="italic">carried</text> frames are colored in orange. <text font="bold">(b)</text> Videos with more than 7% <text font="italic">occlusion</text> frames are colored in green. Points in the lower part correspond to videos in which OPNet is superior.</toccaption>
      <caption class="ltx_centering"><tag close=": ">Figure S8</tag>Sample-by-sample comparison of OPNet with two strong baselines. Each point represents the IoU of a video from the test set, achieved by OPNet and a baseline. <text font="bold">(a)</text> Videos with more than 7% <text font="italic">carried</text> frames are colored in orange. <text font="bold">(b)</text> Videos with more than 7% <text font="italic">occlusion</text> frames are colored in green. Points in the lower part correspond to videos in which OPNet is superior.</caption>
    </figure>
    <para xml:id="Pt0.A1.p3">
      <p>Similarly, Figure <ref labelref="LABEL:fig:overall_iou_model_comparison"/>(b) compares OPNet with the OPNet (LSTM + MLP) baseline, which contains only the first reasoning component (see Our Approach section). It shows that OPNet outperforms the baseline on videos that include a high number of <text font="italic">occlusion</text> frames (colored in green). It also shows that OPNet is superior for most videos, as illustrated by the large number of points in the lower half of the figure.</p>
    </para>
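    <para>
      <p>The per-video comparison above can be illustrated with a minimal sketch. It assumes axis-aligned bounding boxes given as (x1, y1, x2, y2) tuples and a per-video score equal to the mean IoU over frames; the function names box_iou and per_video_iou are hypothetical and are not taken from the authors' code.</p>
      <verbatim>
```python
from statistics import mean


def box_iou(box_a, box_b):
    """IoU of two axis-aligned boxes given as (x1, y1, x2, y2) (assumed format)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def per_video_iou(predicted, ground_truth):
    """Mean IoU over the frames of one video (assumed aggregation)."""
    return mean(box_iou(p, g) for p, g in zip(predicted, ground_truth))
```
      </verbatim>
      <p>Computing per_video_iou once for OPNet and once for a baseline on the same video would give the two coordinates of one point in a scatter plot of this kind.</p>
    </para>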
  </appendix>
  <appendix inlist="toc" labels="LABEL:sec:implementation_details" xml:id="Pt0.A2">
    <tags>
      <tag>Appendix 0.B</tag>
      <tag role="autoref">Appendix 0.B</tag>
      <tag role="refnum">0.B</tag>
      <tag role="typerefnum">Appendix 0.B</tag>
    </tags>
    <title><tag close=" ">Appendix 0.B</tag>Implementation Details</title>
    <toctitle><tag close=" ">0.B</tag>Implementation Details</toctitle>
    <para xml:id="Pt0.A2.p1">
      <p>We trained OPNet and the baseline variants using an <Math mode="inline" tex="L_{1}" text="L _ 1" xml:id="Pt0.A2.p1.m1">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">L</XMTok>
              <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
            </XMApp>
          </XMath>
        </Math> loss and the Adam optimizer with <Math mode="inline" tex="\beta_{1}=0.9" text="beta _ 1 = 0.9" xml:id="Pt0.A2.p1.m2">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" name="beta" role="UNKNOWN">β</XMTok>
                <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
              </XMApp>
              <XMTok meaning="0.9" role="NUMBER">0.9</XMTok>
            </XMApp>
          </XMath>
        </Math>, <Math mode="inline" tex="\beta_{2}=0.999" text="beta _ 2 = 0.999" xml:id="Pt0.A2.p1.m3">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" name="beta" role="UNKNOWN">β</XMTok>
                <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
              </XMApp>
              <XMTok meaning="0.999" role="NUMBER">0.999</XMTok>
            </XMApp>
          </XMath>
        </Math>, <Math mode="inline" tex="\varepsilon=10^{-8}" text="varepsilon = 10 ^ (- 8)" xml:id="Pt0.A2.p1.m4">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMTok font="italic" name="varepsilon" role="UNKNOWN">ε</XMTok>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                <XMTok meaning="10" role="NUMBER">10</XMTok>
                <XMApp>
                  <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                  <XMTok fontsize="70%" meaning="8" role="NUMBER">8</XMTok>
                </XMApp>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>, and using a batch size of 16. We initialized the learning rate to <Math mode="inline" tex="0.001" text="0.001" xml:id="Pt0.A2.p1.m5">
          <XMath>
            <XMTok meaning="0.001" role="NUMBER">0.001</XMTok>
          </XMath>
        </Math> and employed a learning rate decay policy that multiplied the learning rate by 0.8 whenever the loss failed to improve for 3 consecutive epochs. We tuned all hyperparameters using the validation set. We also experimented with a higher initial learning rate of <Math mode="inline" tex="10^{-2}" text="10 ^ (- 2)" xml:id="Pt0.A2.p1.m6">
          <XMath>
            <XMApp>
              <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
              <XMTok meaning="10" role="NUMBER">10</XMTok>
              <XMApp>
                <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>, but training turned out to be too noisy given the relatively small magnitude of the <Math mode="inline" tex="L_{1}" text="L _ 1" xml:id="Pt0.A2.p1.m7">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">L</XMTok>
              <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
            </XMApp>
          </XMath>
        </Math> loss. We also tried a lower learning rate (<Math mode="inline" tex="10^{-4}" text="10 ^ (- 4)" xml:id="Pt0.A2.p1.m8">
          <XMath>
            <XMApp>
              <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
              <XMTok meaning="10" role="NUMBER">10</XMTok>
              <XMApp>
                <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                <XMTok fontsize="70%" meaning="4" role="NUMBER">4</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>), but training did not converge to a good minimum.</p>
    </para>
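    <para>
      <p>The learning rate decay policy described above can be sketched in plain Python. This is an illustration under stated assumptions, not the authors' implementation: the class name PlateauDecay is hypothetical, the monitored quantity is simply called loss, and the counter is assumed to reset after each decay.</p>
      <verbatim>
```python
class PlateauDecay:
    """Multiply the learning rate by `factor` after `patience` epochs without improvement."""

    def __init__(self, lr=1e-3, factor=0.8, patience=3):
        self.lr = lr                  # initial learning rate from the text (0.001)
        self.factor = factor          # decay factor from the text (0.8)
        self.patience = patience      # epochs without improvement (3)
        self.best = float("inf")      # best loss seen so far
        self.stale_epochs = 0         # consecutive epochs without improvement

    def step(self, loss):
        """Record one epoch's loss and return the learning rate for the next epoch."""
        if self.best > loss:
            # The loss improved: remember it and reset the counter.
            self.best = loss
            self.stale_epochs = 0
        else:
            self.stale_epochs += 1
            if self.stale_epochs >= self.patience:
                # Assumed behavior: decay once, then start counting again.
                self.lr *= self.factor
                self.stale_epochs = 0
        return self.lr
```
      </verbatim>
      <p>With these settings, three consecutive epochs without improvement lower the learning rate from 0.001 to 0.0008. Library schedulers may count the patience window slightly differently.</p>
    </para>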
    <para xml:id="Pt0.A2.p2">
      <p>Each model was trained for 160 epochs, which we verified by manual inspection to be sufficient for all models to converge. Early stopping was based on the validation-set mean IoU.</p>
    </para>
    <para xml:id="Pt0.A2.p3">
      <p>For comparisons with CATER <cite class="ltx_citemacro_cite">[<bibref bibrefs="girdhar2019cater" separator="," yyseparator=","/>]</cite> (Table <ref labelref="LABEL:table:cater_results"/> of the main paper), we used the accuracy values reported in their paper.</p>
    </para>
    <para xml:id="Pt0.A2.p4">
      <p>For the <text font="italic">learning only from visible frames</text> setup (Section <ref labelref="LABEL:sect:abl"/> and Table <ref labelref="LABEL:table:visible_only"/> of the main paper) we used the values <Math mode="inline" tex="\alpha=1" text="alpha = 1" xml:id="Pt0.A2.p4.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMTok font="italic" name="alpha" role="UNKNOWN">α</XMTok>
              <XMTok meaning="1" role="NUMBER">1</XMTok>
            </XMApp>
          </XMath>
        </Math> and <Math mode="inline" tex="\beta=0.5" text="beta = 0.5" xml:id="Pt0.A2.p4.m2">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMTok font="italic" name="beta" role="UNKNOWN">β</XMTok>
              <XMTok meaning="0.5" role="NUMBER">0.5</XMTok>
            </XMApp>
          </XMath>
        </Math>. We used these values to normalize the different scales of <Math mode="inline" tex="\mathcal{L}_{localization}" text="L _ (l * o * c * a * l * i * z * a * t * i * o * n)" xml:id="Pt0.A2.p4.m3">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">l</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">o</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">c</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">l</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">z</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">o</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> and <Math mode="inline" tex="\mathcal{L}_{consistency}" text="L _ (c * o * n * s * i * s * t * e * n * c * y)" xml:id="Pt0.A2.p4.m4">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">c</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">o</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">e</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">c</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">y</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>. We verified via manual inspection that (1) for the first 60-70 epochs the loss component <Math mode="inline" tex="\mathcal{L}_{localization}" text="L _ (l * o * c * a * l * i * z * a * t * i * o * n)" xml:id="Pt0.A2.p4.m5">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">l</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">o</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">c</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">l</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">z</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">o</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> is significantly greater than the loss component <Math mode="inline" tex="\mathcal{L}_{consistency}" text="L _ (c * o * n * s * i * s * t * e * n * c * y)" xml:id="Pt0.A2.p4.m6">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="caligraphic" role="UNKNOWN">L</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">c</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">o</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">s</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">e</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">c</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">y</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>. Thus, in this phase the model improves its predictions when the target is visible; (2) after 60-70 epochs the two loss components have the same scale, so in this phase the model also improves its predictions when the target is not visible.</p>
    </para>
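    <para>
      <p>As a small illustration of how the two weights might enter training, the sketch below assumes that the two loss components are combined as a weighted sum. That form is an assumption made here for illustration and is not stated in this appendix; the function name combined_loss is hypothetical.</p>
      <verbatim>
```python
def combined_loss(localization_loss, consistency_loss, alpha=1.0, beta=0.5):
    """Weighted sum of the two loss components (assumed form, not confirmed by the text)."""
    # alpha and beta default to the values reported for the visible-frames setup.
    return alpha * localization_loss + beta * consistency_loss
```
      </verbatim>
    </para>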
  </appendix>
  <appendix inlist="toc" labels="LABEL:sec:la_cater_prep" xml:id="Pt0.A3">
    <tags>
      <tag>Appendix 0.C</tag>
      <tag role="autoref">Appendix 0.C</tag>
      <tag role="refnum">0.C</tag>
      <tag role="typerefnum">Appendix 0.C</tag>
    </tags>
    <title><tag close=" ">Appendix 0.C</tag>LA-CATER Dataset Preparation</title>
    <toctitle><tag close=" ">0.C</tag>LA-CATER Dataset Preparation</toctitle>
    <para xml:id="Pt0.A3.p1">
      <p>Our new LA-CATER dataset augments the CATER dataset <cite class="ltx_citemacro_cite">[<bibref bibrefs="girdhar2019cater" separator="," yyseparator=","/>]</cite> with ground-truth locations of all objects and with detailed frame-level annotations. Instead of using the videos released by CATER, we generated new videos using their configuration, and extended their code to add ground-truth locations and frame-level annotations.</p>
    </para>
    <para xml:id="Pt0.A3.p2">
      <p>We now describe how we classify frames into the four corresponding OP subtasks. The CATER dataset annotates each frame with the <text font="italic">actions</text> occurring for each object in that frame. These actions are defined as follows:</p>
      <itemize xml:id="Pt0.A3.I1">
        <item xml:id="Pt0.A3.I1.i1">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">1st item</tag>
          </tags>
          <para xml:id="Pt0.A3.I1.i1.p1">
            <p><text font="italic">Slide</text>.
The object changes its position by sliding on the XY-plane.</p>
          </para>
        </item>
        <item xml:id="Pt0.A3.I1.i2">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">2nd item</tag>
          </tags>
          <para xml:id="Pt0.A3.I1.i2.p1">
            <p><text font="italic">Pick-Place</text>.
The object is lifted into the air along the Z-axis, moved to a new location, and placed down.</p>
          </para>
        </item>
        <item xml:id="Pt0.A3.I1.i3">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">3rd item</tag>
          </tags>
          <para xml:id="Pt0.A3.I1.i3.p1">
            <p><text font="italic">Contain</text>.
A special action performed only by cones, in which a cone executes a <text font="italic">Pick-Place</text> action and is positioned on top of another object.</p>
          </para>
        </item>
      </itemize>
    </para>
    <para xml:id="Pt0.A3.p3">
      <itemize xml:id="Pt0.A3.I2">
        <item xml:id="Pt0.A3.I2.i1">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">1st item</tag>
          </tags>
          <para xml:id="Pt0.A3.I2.i1.p1">
            <p><text font="bold">Contained Frames</text>.
We classify a frame as <text font="italic">Contained</text> when the target is contained by a cone. Explicitly, a frame is classified as <text font="italic">Contained</text> when it is annotated with the “contain” action in CATER, with a cone marked as the containing object and the target marked as the contained object. A frame with recursive containment, namely, one in which a containing cone is itself contained by another cone, is also considered to be a <text font="italic">contained</text> frame. Frames are marked as contained from the moment the target is covered until the containing object is picked up as part of a <text font="italic">pick-place</text> action.</p>
          </para>
        </item>
        <item xml:id="Pt0.A3.I2.i2">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">2nd item</tag>
          </tags>
          <para xml:id="Pt0.A3.I2.i2.p1">
            <p><text font="bold">Carried Frames</text>. We mark a frame as <text font="italic">Carried</text> when the target is <text font="italic">contained</text> by a cone (its action is marked in CATER as contained) and <text font="italic">slides</text> along with it. Frames are marked as <text font="italic">carried</text> from the beginning of the <text font="italic">slide</text> action until the end of the <text font="italic">slide</text> action. Thus, only frames corresponding to the <text font="italic">slide</text> action are marked as <text font="italic">carried</text>.</p>
          </para>
        </item>
        <item xml:id="Pt0.A3.I2.i3">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">3rd item</tag>
          </tags>
          <para xml:id="Pt0.A3.I2.i3.p1">
            <p><text font="bold">Occluded Frames</text>. For frame <Math mode="inline" tex="t" text="t" xml:id="Pt0.A3.I2.i3.p1.m1">
                <XMath>
                  <XMTok font="italic" role="UNKNOWN">t</XMTok>
                </XMath>
              </Math>, we define the <text font="italic">occlusion rate (OR)</text> of object <Math mode="inline" tex="x" text="x" xml:id="Pt0.A3.I2.i3.p1.m2">
                <XMath>
                  <XMTok font="italic" role="UNKNOWN">x</XMTok>
                </XMath>
              </Math> by object <Math mode="inline" tex="y" text="y" xml:id="Pt0.A3.I2.i3.p1.m3">
                <XMath>
                  <XMTok font="italic" role="UNKNOWN">y</XMTok>
                </XMath>
              </Math> as</p>
            <equation xml:id="Pt0.A3.E4">
              <tags>
                <tag>(S4)</tag>
                <tag role="autoref">Equation S4</tag>
                <tag role="refnum">S4</tag>
              </tags>
              <Math mode="display" tex="{OR^{x}_{t}(y)}=\begin{cases}\frac{Area^{x}_{t}\cap Area^{y}_{t}}{Area^{x}_{t}%&#10;}&amp;Area^{x}_{t}\leq Area^{y}_{t}\\&#10;\quad\quad 0&amp;Otherwise\end{cases}" text="O * (R ^ x) _ t * y = cases@((A * r * e * (a ^ x) _ t intersection A * r * e * (a ^ y) _ t) / (A * r * e * (a ^ x) _ t), A * r * e * (a ^ x) _ t less= A * r * e * (a ^ y) _ t, 0, O * t * h * e * r * w * i * s * e)" xml:id="Pt0.A3.E4.m1">
                <XMath>
                  <XMApp>
                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" role="UNKNOWN">O</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                        <XMApp>
                          <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                          <XMTok font="italic" role="UNKNOWN">R</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">x</XMTok>
                        </XMApp>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      </XMApp>
                      <XMDual>
                        <XMRef idref="Pt0.A3.E4.m1.5"/>
                        <XMWrap>
                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                          <XMTok font="italic" role="UNKNOWN" xml:id="Pt0.A3.E4.m1.5">y</XMTok>
                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                        </XMWrap>
                      </XMDual>
                    </XMApp>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="cases"/>
                        <XMRef idref="Pt0.A3.E4.m1.1"/>
                        <XMRef idref="Pt0.A3.E4.m1.2"/>
                        <XMRef idref="Pt0.A3.E4.m1.3"/>
                        <XMRef idref="Pt0.A3.E4.m1.4"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="true">{</XMTok>
                        <XMArray>
                          <XMRow>
                            <XMCell align="left">
                              <XMApp xml:id="Pt0.A3.E4.m1.1">
                                <XMTok mathstyle="text" meaning="divide" role="FRACOP"/>
                                <XMApp>
                                  <XMTok fontsize="70%" meaning="intersection" name="cap" role="ADDOP">∩</XMTok>
                                  <XMApp>
                                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">A</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">r</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">e</XMTok>
                                    <XMApp>
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMApp>
                                        <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                                        <XMTok font="italic" fontsize="50%" role="UNKNOWN">x</XMTok>
                                      </XMApp>
                                      <XMTok font="italic" fontsize="50%" role="UNKNOWN">t</XMTok>
                                    </XMApp>
                                  </XMApp>
                                  <XMApp>
                                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">A</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">r</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">e</XMTok>
                                    <XMApp>
                                      <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                      <XMApp>
                                        <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                                        <XMTok font="italic" fontsize="50%" role="UNKNOWN">y</XMTok>
                                      </XMApp>
                                      <XMTok font="italic" fontsize="50%" role="UNKNOWN">t</XMTok>
                                    </XMApp>
                                  </XMApp>
                                </XMApp>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">A</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">r</XMTok>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">e</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post7"/>
                                    <XMApp>
                                      <XMTok role="SUPERSCRIPTOP" scriptpos="post7"/>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                                      <XMTok font="italic" fontsize="50%" role="UNKNOWN">x</XMTok>
                                    </XMApp>
                                    <XMTok font="italic" fontsize="50%" role="UNKNOWN">t</XMTok>
                                  </XMApp>
                                </XMApp>
                              </XMApp>
                            </XMCell>
                            <XMCell align="left">
                              <XMApp xml:id="Pt0.A3.E4.m1.2">
                                <XMTok meaning="less-than-or-equals" name="leq" role="RELOP">≤</XMTok>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">A</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">r</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">e</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post6"/>
                                    <XMApp>
                                      <XMTok role="SUPERSCRIPTOP" scriptpos="post6"/>
                                      <XMTok font="italic" role="UNKNOWN">a</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">x</XMTok>
                                    </XMApp>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                  </XMApp>
                                </XMApp>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">A</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">r</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">e</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post6"/>
                                    <XMApp>
                                      <XMTok role="SUPERSCRIPTOP" scriptpos="post6"/>
                                      <XMTok font="italic" role="UNKNOWN">a</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">y</XMTok>
                                    </XMApp>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                  </XMApp>
                                </XMApp>
                              </XMApp>
                            </XMCell>
                          </XMRow>
                          <XMRow>
                            <XMCell align="left">
                              <XMTok lpadding="20.0pt" meaning="0" role="NUMBER" xml:id="Pt0.A3.E4.m1.3">0</XMTok>
                            </XMCell>
                            <XMCell align="left">
                              <XMApp xml:id="Pt0.A3.E4.m1.4">
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMTok font="italic" role="UNKNOWN">O</XMTok>
                                <XMTok font="italic" role="UNKNOWN">t</XMTok>
                                <XMTok font="italic" role="UNKNOWN">h</XMTok>
                                <XMTok font="italic" role="UNKNOWN">e</XMTok>
                                <XMTok font="italic" role="UNKNOWN">r</XMTok>
                                <XMTok font="italic" role="UNKNOWN">w</XMTok>
                                <XMTok font="italic" role="UNKNOWN">i</XMTok>
                                <XMTok font="italic" role="UNKNOWN">s</XMTok>
                                <XMTok font="italic" role="UNKNOWN">e</XMTok>
                              </XMApp>
                            </XMCell>
                          </XMRow>
                        </XMArray>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMath>
              </Math>
            </equation>
            <p>Where <Math mode="inline" tex="Area^{x}_{t}" text="A * r * e * (a ^ x) _ t" xml:id="Pt0.A3.I2.i3.p1.m4">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">A</XMTok>
                    <XMTok font="italic" role="UNKNOWN">r</XMTok>
                    <XMTok font="italic" role="UNKNOWN">e</XMTok>
                    <XMApp>
                      <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                      <XMApp>
                        <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" role="UNKNOWN">a</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">x</XMTok>
                      </XMApp>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    </XMApp>
                  </XMApp>
                </XMath>
              </Math> is the area of object <Math mode="inline" tex="x" text="x" xml:id="Pt0.A3.I2.i3.p1.m5">
                <XMath>
                  <XMTok font="italic" role="UNKNOWN">x</XMTok>
                </XMath>
              </Math> in frame <Math mode="inline" tex="t" text="t" xml:id="Pt0.A3.I2.i3.p1.m6">
                <XMath>
                  <XMTok font="italic" role="UNKNOWN">t</XMTok>
                </XMath>
              </Math>.
              </p>
          </para>
          <para xml:id="Pt0.A3.I2.i3.p2">
            <p>We define the <text font="italic">distance from camera (DC)</text> of object <Math mode="inline" tex="x" text="x" xml:id="Pt0.A3.I2.i3.p2.m1">
                <XMath>
                  <XMTok font="italic" role="UNKNOWN">x</XMTok>
                </XMath>
              </Math> as:</p>
            <equation xml:id="Pt0.A3.E5">
              <tags>
                <tag>(S5)</tag>
                <tag role="autoref">Equation S5</tag>
                <tag role="refnum">S5</tag>
              </tags>
              <Math mode="display" tex="DC^{x}_{t}=\left\lVert loc^{x}_{t}-loc^{CA}_{t}\right\rVert^{2}\quad" text="D * (C ^ x) _ t = (norm@(l * o * (c ^ x) _ t - l * o * (c ^ (C * A)) _ t)) ^ 2" xml:id="Pt0.A3.E5.m1">
                <XMath>
                  <XMDual>
                    <XMRef idref="Pt0.A3.E5.m1.1"/>
                    <XMWrap>
                      <XMApp xml:id="Pt0.A3.E5.m1.1">
                        <XMTok meaning="equals" role="RELOP">=</XMTok>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMTok font="italic" role="UNKNOWN">D</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMApp>
                              <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="italic" role="UNKNOWN">C</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">x</XMTok>
                            </XMApp>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMApp>
                          <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                          <XMDual>
                            <XMApp>
                              <XMTok meaning="norm"/>
                              <XMRef idref="Pt0.A3.E5.m1.1.1"/>
                            </XMApp>
                            <XMWrap>
                              <XMTok name="lVert" role="OPEN" stretchy="true">∥</XMTok>
                              <XMApp xml:id="Pt0.A3.E5.m1.1.1">
                                <XMTok meaning="minus" role="ADDOP">-</XMTok>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">l</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">o</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                                    <XMApp>
                                      <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                                      <XMTok font="italic" role="UNKNOWN">c</XMTok>
                                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">x</XMTok>
                                    </XMApp>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                  </XMApp>
                                </XMApp>
                                <XMApp>
                                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">l</XMTok>
                                  <XMTok font="italic" role="UNKNOWN">o</XMTok>
                                  <XMApp>
                                    <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                                    <XMApp>
                                      <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                                      <XMTok font="italic" role="UNKNOWN">c</XMTok>
                                      <XMApp>
                                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">C</XMTok>
                                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">A</XMTok>
                                      </XMApp>
                                    </XMApp>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                  </XMApp>
                                </XMApp>
                              </XMApp>
                              <XMTok name="rVert" role="CLOSE" stretchy="true">∥</XMTok>
                            </XMWrap>
                          </XMDual>
                          <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                        </XMApp>
                      </XMApp>
                      <XMTok font="italic" name="quad" role="PUNCT"> </XMTok>
                    </XMWrap>
                  </XMDual>
                </XMath>
              </Math>
            </equation>
            <p><Math mode="inline" tex="loc_{t}^{x}" text="l * o * (c _ t) ^ x" xml:id="Pt0.A3.I2.i3.p2.m2">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">l</XMTok>
                    <XMTok font="italic" role="UNKNOWN">o</XMTok>
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" role="UNKNOWN">c</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      </XMApp>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">x</XMTok>
                    </XMApp>
                  </XMApp>
                </XMath>
              </Math> and <Math mode="inline" tex="loc_{t}^{CA}" text="l * o * (c _ t) ^ (C * A)" xml:id="Pt0.A3.I2.i3.p2.m3">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" role="UNKNOWN">l</XMTok>
                    <XMTok font="italic" role="UNKNOWN">o</XMTok>
                    <XMApp>
                      <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMTok font="italic" role="UNKNOWN">c</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      </XMApp>
                      <XMApp>
                        <XMTok meaning="times" role="MULOP">⁢</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">C</XMTok>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">A</XMTok>
                      </XMApp>
                    </XMApp>
                  </XMApp>
                </XMath>
              </Math> denote the 3D coordinates of object <Math mode="inline" tex="x" text="x" xml:id="Pt0.A3.I2.i3.p2.m4">
                <XMath>
                  <XMTok font="italic" role="UNKNOWN">x</XMTok>
                </XMath>
              </Math> and the camera in frame <Math mode="inline" tex="t" text="t" xml:id="Pt0.A3.I2.i3.p2.m5">
                <XMath>
                  <XMTok font="italic" role="UNKNOWN">t</XMTok>
                </XMath>
              </Math>, respectively.</p>
          </para>
          <para xml:id="Pt0.A3.I2.i3.p3">
            <p>We define an indicator for a <text font="italic">fully occluded (FO)</text> object:</p>
            <equation labels="LABEL:eq:fo" xml:id="Pt0.A3.E6">
              <tags>
                <tag>(S6)</tag>
                <tag role="autoref">Equation S6</tag>
                <tag role="refnum">S6</tag>
              </tags>
              <Math mode="display" tex="{FO^{x}_{t}}=\begin{cases}1&amp;\exists\;\;y\;\;s.t\;\;OR^{x}_{t}(y)=1\;\;\text{%&#10;and}\;\;DC^{x}_{t}\geq DC^{y}_{t}\\&#10;0&amp;\quad\quad\quad\quad Otherwise\end{cases}" text="F * (O ^ x) _ t = cases@(1, formulae@(exists@(y * s), t * O * (R ^ x) _ t * y = 1 * [and] * D * (C ^ x) _ t &gt;= D * (C ^ y) _ t), 0, O * t * h * e * r * w * i * s * e)" xml:id="Pt0.A3.E6.m1">
                <XMath>
                  <XMApp>
                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" role="UNKNOWN">F</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                        <XMApp>
                          <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                          <XMTok font="italic" role="UNKNOWN">O</XMTok>
                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">x</XMTok>
                        </XMApp>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMDual>
                      <XMApp>
                        <XMTok meaning="cases"/>
                        <XMRef idref="Pt0.A3.E6.m1.1"/>
                        <XMRef idref="Pt0.A3.E6.m1.2"/>
                        <XMRef idref="Pt0.A3.E6.m1.3"/>
                        <XMRef idref="Pt0.A3.E6.m1.4"/>
                      </XMApp>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="true">{</XMTok>
                        <XMArray>
                          <XMRow>
                            <XMCell align="left">
                              <XMTok meaning="1" role="NUMBER" xml:id="Pt0.A3.E6.m1.1">1</XMTok>
                            </XMCell>
                            <XMCell align="left">
                              <XMDual xml:id="Pt0.A3.E6.m1.2">
                                <XMApp>
                                  <XMTok meaning="formulae"/>
                                  <XMRef idref="Pt0.A3.E6.m1.2.2"/>
                                  <XMRef idref="Pt0.A3.E6.m1.2.3"/>
                                </XMApp>
                                <XMWrap>
                                  <XMApp xml:id="Pt0.A3.E6.m1.2.2">
                                    <XMTok meaning="exists" role="BIGOP" rpadding="5.6pt">∃</XMTok>
                                    <XMApp>
                                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                      <XMTok font="italic" role="UNKNOWN" rpadding="5.6pt">y</XMTok>
                                      <XMTok font="italic" role="UNKNOWN">s</XMTok>
                                    </XMApp>
                                  </XMApp>
                                  <XMTok role="PERIOD">.</XMTok>
                                  <XMApp xml:id="Pt0.A3.E6.m1.2.3">
                                    <XMTok meaning="multirelation"/>
                                    <XMApp>
                                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                      <XMTok font="italic" role="UNKNOWN" rpadding="5.6pt">t</XMTok>
                                      <XMTok font="italic" role="UNKNOWN">O</XMTok>
                                      <XMApp>
                                        <XMTok role="SUBSCRIPTOP" scriptpos="post6"/>
                                        <XMApp>
                                          <XMTok role="SUPERSCRIPTOP" scriptpos="post6"/>
                                          <XMTok font="italic" role="UNKNOWN">R</XMTok>
                                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">x</XMTok>
                                        </XMApp>
                                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                      </XMApp>
                                      <XMDual>
                                        <XMRef idref="Pt0.A3.E6.m1.2.1"/>
                                        <XMWrap>
                                          <XMTok role="OPEN" stretchy="false">(</XMTok>
                                          <XMTok font="italic" role="UNKNOWN" xml:id="Pt0.A3.E6.m1.2.1">y</XMTok>
                                          <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                        </XMWrap>
                                      </XMDual>
                                    </XMApp>
                                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                                    <XMApp>
                                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                      <XMTok meaning="1" role="NUMBER" rpadding="5.6pt">1</XMTok>
                                      <XMText rpadding="5.6pt">and</XMText>
                                      <XMTok font="italic" role="UNKNOWN">D</XMTok>
                                      <XMApp>
                                        <XMTok role="SUBSCRIPTOP" scriptpos="post6"/>
                                        <XMApp>
                                          <XMTok role="SUPERSCRIPTOP" scriptpos="post6"/>
                                          <XMTok font="italic" role="UNKNOWN">C</XMTok>
                                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">x</XMTok>
                                        </XMApp>
                                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                      </XMApp>
                                    </XMApp>
                                    <XMTok meaning="greater-than-or-equals" name="geq" role="RELOP">≥</XMTok>
                                    <XMApp>
                                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                      <XMTok font="italic" role="UNKNOWN">D</XMTok>
                                      <XMApp>
                                        <XMTok role="SUBSCRIPTOP" scriptpos="post6"/>
                                        <XMApp>
                                          <XMTok role="SUPERSCRIPTOP" scriptpos="post6"/>
                                          <XMTok font="italic" role="UNKNOWN">C</XMTok>
                                          <XMTok font="italic" fontsize="70%" role="UNKNOWN">y</XMTok>
                                        </XMApp>
                                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                      </XMApp>
                                    </XMApp>
                                  </XMApp>
                                </XMWrap>
                              </XMDual>
                            </XMCell>
                          </XMRow>
                          <XMRow>
                            <XMCell align="left">
                              <XMTok meaning="0" role="NUMBER" xml:id="Pt0.A3.E6.m1.3">0</XMTok>
                            </XMCell>
                            <XMCell align="left">
                              <XMApp xml:id="Pt0.A3.E6.m1.4">
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMTok font="italic" lpadding="40.0pt" role="UNKNOWN">O</XMTok>
                                <XMTok font="italic" role="UNKNOWN">t</XMTok>
                                <XMTok font="italic" role="UNKNOWN">h</XMTok>
                                <XMTok font="italic" role="UNKNOWN">e</XMTok>
                                <XMTok font="italic" role="UNKNOWN">r</XMTok>
                                <XMTok font="italic" role="UNKNOWN">w</XMTok>
                                <XMTok font="italic" role="UNKNOWN">i</XMTok>
                                <XMTok font="italic" role="UNKNOWN">s</XMTok>
                                <XMTok font="italic" role="UNKNOWN">e</XMTok>
                              </XMApp>
                            </XMCell>
                          </XMRow>
                        </XMArray>
                      </XMWrap>
                    </XMDual>
                  </XMApp>
                </XMath>
              </Math>
            </equation>
          </para>
          <para xml:id="Pt0.A3.I2.i3.p4">
            <p>We then mark frame <Math mode="inline" tex="t" text="t" xml:id="Pt0.A3.I2.i3.p4.m1">
                <XMath>
                  <XMTok font="italic" role="UNKNOWN">t</XMTok>
                </XMath>
              </Math> as <text font="italic">Occluded</text> when the target is fully occluded by another object, i.e., when <Math mode="inline" tex="FO^{target}_{t}=1" text="F * (O ^ (t * a * r * g * e * t)) _ t = 1" xml:id="Pt0.A3.I2.i3.p4.m2">
                <XMath>
                  <XMApp>
                    <XMTok meaning="equals" role="RELOP">=</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" role="UNKNOWN">F</XMTok>
                      <XMApp>
                        <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                        <XMApp>
                          <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                          <XMTok font="italic" role="UNKNOWN">O</XMTok>
                          <XMApp>
                            <XMTok meaning="times" role="MULOP">⁢</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">r</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">g</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">e</XMTok>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                          </XMApp>
                        </XMApp>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      </XMApp>
                    </XMApp>
                    <XMTok meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                </XMath>
              </Math>.</p>
          </para>
        </item>
        <item xml:id="Pt0.A3.I2.i4">
          <tags>
            <tag>•</tag>
            <tag role="autoref">item </tag>
            <tag role="typerefnum">4th item</tag>
          </tags>
          <para xml:id="Pt0.A3.I2.i4.p1">
            <p><text font="bold">Visible Frames</text>. Finally, we define a frame as <text font="italic">Visible</text> when the target is not <text font="italic">Contained</text>, <text font="italic">Carried</text>, or <text font="italic">Occluded</text>. Thus, the target only needs to be partially visible to be considered <text font="italic">visible</text>. For instance, the target is still considered <text font="italic">visible</text> when it is 20% occluded (e.g., <Math mode="inline" tex="\exists\;\;y\;\;s.t\;\;OR^{x}_{t}(y)=0.2" text="formulae@(exists@(y * s), t * O * (R ^ x) _ t * y = 0.2)" xml:id="Pt0.A3.I2.i4.p1.m1">
                <XMath>
                  <XMDual>
                    <XMApp>
                      <XMTok meaning="formulae"/>
                      <XMRef idref="Pt0.A3.I2.i4.p1.m1.2"/>
                      <XMRef idref="Pt0.A3.I2.i4.p1.m1.3"/>
                    </XMApp>
                    <XMWrap>
                      <XMApp xml:id="Pt0.A3.I2.i4.p1.m1.2">
                        <XMTok meaning="exists" role="BIGOP" rpadding="5.6pt">∃</XMTok>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMTok font="italic" role="UNKNOWN" rpadding="5.6pt">y</XMTok>
                          <XMTok font="italic" role="UNKNOWN">s</XMTok>
                        </XMApp>
                      </XMApp>
                      <XMTok role="PERIOD">.</XMTok>
                      <XMApp xml:id="Pt0.A3.I2.i4.p1.m1.3">
                        <XMTok meaning="equals" role="RELOP">=</XMTok>
                        <XMApp>
                          <XMTok meaning="times" role="MULOP">⁢</XMTok>
                          <XMTok font="italic" role="UNKNOWN" rpadding="5.6pt">t</XMTok>
                          <XMTok font="italic" role="UNKNOWN">O</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                            <XMApp>
                              <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                              <XMTok font="italic" role="UNKNOWN">R</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">x</XMTok>
                            </XMApp>
                            <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                          </XMApp>
                          <XMDual>
                            <XMRef idref="Pt0.A3.I2.i4.p1.m1.1"/>
                            <XMWrap>
                              <XMTok role="OPEN" stretchy="false">(</XMTok>
                              <XMTok font="italic" role="UNKNOWN" xml:id="Pt0.A3.I2.i4.p1.m1.1">y</XMTok>
                              <XMTok role="CLOSE" stretchy="false">)</XMTok>
                            </XMWrap>
                          </XMDual>
                        </XMApp>
                        <XMTok meaning="0.2" role="NUMBER">0.2</XMTok>
                      </XMApp>
                    </XMWrap>
                  </XMDual>
                </XMath>
              </Math>).</p>
          </para>
        </item>
      </itemize>
    </para>
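    <para>
      <p>The occlusion rules above (Eq. S4 to S6 and the Visible definition) can be illustrated with a minimal Python sketch. It is not the annotation code used for the dataset: it assumes that each object is described by an axis-aligned bounding box in the image plane and a 3D location, and the names <text font="typewriter">SceneObject</text>, <text font="typewriter">occlusion_ratio</text>, <text font="typewriter">distance_from_camera</text>, <text font="typewriter">fully_occluded</text> and <text font="typewriter">is_visible</text>, as well as the tolerance <text font="typewriter">tol</text>, are hypothetical. The <text font="italic">Contained</text> and <text font="italic">Carried</text> flags are taken as given inputs.</p>
      <verbatim>
```python
from dataclasses import dataclass
from typing import Iterable, Tuple

Vec3 = Tuple[float, float, float]


@dataclass(frozen=True)
class SceneObject:
    """Hypothetical per-frame record: image-plane box plus 3D location."""
    x0: float
    y0: float
    x1: float
    y1: float
    loc: Vec3

    @property
    def area(self) -> float:
        # Assumption: Area^x_t is the area of an axis-aligned bounding box.
        return max(0.0, self.x1 - self.x0) * max(0.0, self.y1 - self.y0)


def intersection_area(a: SceneObject, b: SceneObject) -> float:
    """Overlap area of two boxes (0 when they do not intersect)."""
    w = max(0.0, min(a.x1, b.x1) - max(a.x0, b.x0))
    h = max(0.0, min(a.y1, b.y1) - max(a.y0, b.y0))
    return w * h


def occlusion_ratio(x: SceneObject, y: SceneObject) -> float:
    """OR^x_t(y): share of x's area covered by y; 0 if x is larger than y."""
    if x.area == 0.0 or x.area > y.area:
        return 0.0
    return intersection_area(x, y) / x.area


def distance_from_camera(obj: SceneObject, camera_loc: Vec3) -> float:
    """DC^x_t as written in Eq. S5 (squared Euclidean norm)."""
    return sum((p - c) ** 2 for p, c in zip(obj.loc, camera_loc))


def fully_occluded(x: SceneObject, others: Iterable[SceneObject],
                   camera_loc: Vec3, tol: float = 1e-9) -> bool:
    """FO^x_t: some y covers x entirely and is not farther from the camera."""
    dc_x = distance_from_camera(x, camera_loc)
    return any(
        occlusion_ratio(x, y) >= 1.0 - tol
        and dc_x >= distance_from_camera(y, camera_loc)
        for y in others
    )


def is_visible(contained: bool, carried: bool, occluded: bool) -> bool:
    """A frame is Visible when the target is not Contained, Carried or Occluded."""
    return not (contained or carried or occluded)
```
      </verbatim>
      <p>Under these assumptions, a target that is only partly covered by a nearer object has an occlusion rate below 1, so <text font="typewriter">fully_occluded</text> is false and the frame remains <text font="italic">Visible</text> unless the target is contained or carried.</p>
    </para>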
  </appendix>
  <appendix inlist="toc" labels="LABEL:sec:pp_annotation" xml:id="Pt0.A4">
    <tags>
      <tag>Appendix 0.D</tag>
      <tag role="autoref">Appendix 0.D</tag>
      <tag role="refnum">0.D</tag>
      <tag role="typerefnum">Appendix 0.D</tag>
    </tags>
    <title><tag close=" ">Appendix 0.D</tag>Annotating Frames in Perfect Perception</title>
    <toctitle><tag close=" ">0.D</tag>Annotating Frames in Perfect Perception</toctitle>
    <para xml:id="Pt0.A4.p1">
      <p>For the perfect-perception setup, we extend the definition of <text font="italic">fully occluded (FO)</text> objects from Eq. <ref labelref="LABEL:eq:fo"/>.
We define an object to be <text font="italic">partially occluded (PO)</text> with respect to an occlusion rate <Math mode="inline" tex="p" text="p" xml:id="Pt0.A4.p1.m1">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">p</XMTok>
          </XMath>
        </Math> as follows:</p>
      <equation labels="LABEL:eq:po" xml:id="Pt0.A4.E7">
        <tags>
          <tag>(S7)</tag>
          <tag role="autoref">Equation S7</tag>
          <tag role="refnum">S7</tag>
        </tags>
        <Math mode="display" tex="{PO^{x}_{t}(p)}=\begin{cases}1&amp;\exists\;\;y\;\;s.t\;\;OR^{x}_{t}(y)\geq p\;\;%&#10;\text{and}\;\;DC^{x}_{t}\geq DC^{y}_{t}\\&#10;0&amp;\quad\quad\quad\quad Otherwise\end{cases}" text="P * (O ^ x) _ t * p = cases@(1, formulae@(exists@(y * s), t * O * (R ^ x) _ t * y &gt;= p * [and] * D * (C ^ x) _ t &gt;= D * (C ^ y) _ t), 0, O * t * h * e * r * w * i * s * e)" xml:id="Pt0.A4.E7.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">P</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                    <XMTok font="italic" role="UNKNOWN">O</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">x</XMTok>
                  </XMApp>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                </XMApp>
                <XMDual>
                  <XMRef idref="Pt0.A4.E7.m1.5"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="Pt0.A4.E7.m1.5">p</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
              <XMDual>
                <XMApp>
                  <XMTok meaning="cases"/>
                  <XMRef idref="Pt0.A4.E7.m1.1"/>
                  <XMRef idref="Pt0.A4.E7.m1.2"/>
                  <XMRef idref="Pt0.A4.E7.m1.3"/>
                  <XMRef idref="Pt0.A4.E7.m1.4"/>
                </XMApp>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="true">{</XMTok>
                  <XMArray>
                    <XMRow>
                      <XMCell align="left">
                        <XMTok meaning="1" role="NUMBER" xml:id="Pt0.A4.E7.m1.1">1</XMTok>
                      </XMCell>
                      <XMCell align="left">
                        <XMDual xml:id="Pt0.A4.E7.m1.2">
                          <XMApp>
                            <XMTok meaning="formulae"/>
                            <XMRef idref="Pt0.A4.E7.m1.2.2"/>
                            <XMRef idref="Pt0.A4.E7.m1.2.3"/>
                          </XMApp>
                          <XMWrap>
                            <XMApp xml:id="Pt0.A4.E7.m1.2.2">
                              <XMTok meaning="exists" role="BIGOP" rpadding="5.6pt">∃</XMTok>
                              <XMApp>
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMTok font="italic" role="UNKNOWN" rpadding="5.6pt">y</XMTok>
                                <XMTok font="italic" role="UNKNOWN">s</XMTok>
                              </XMApp>
                            </XMApp>
                            <XMTok role="PERIOD">.</XMTok>
                            <XMApp xml:id="Pt0.A4.E7.m1.2.3">
                              <XMTok meaning="multirelation"/>
                              <XMApp>
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMTok font="italic" role="UNKNOWN" rpadding="5.6pt">t</XMTok>
                                <XMTok font="italic" role="UNKNOWN">O</XMTok>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post6"/>
                                  <XMApp>
                                    <XMTok role="SUPERSCRIPTOP" scriptpos="post6"/>
                                    <XMTok font="italic" role="UNKNOWN">R</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">x</XMTok>
                                  </XMApp>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                </XMApp>
                                <XMDual>
                                  <XMRef idref="Pt0.A4.E7.m1.2.1"/>
                                  <XMWrap>
                                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                                    <XMTok font="italic" role="UNKNOWN" xml:id="Pt0.A4.E7.m1.2.1">y</XMTok>
                                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                                  </XMWrap>
                                </XMDual>
                              </XMApp>
                              <XMTok meaning="greater-than-or-equals" name="geq" role="RELOP">≥</XMTok>
                              <XMApp>
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMTok font="italic" role="UNKNOWN" rpadding="5.6pt">p</XMTok>
                                <XMText rpadding="5.6pt">and</XMText>
                                <XMTok font="italic" role="UNKNOWN">D</XMTok>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post6"/>
                                  <XMApp>
                                    <XMTok role="SUPERSCRIPTOP" scriptpos="post6"/>
                                    <XMTok font="italic" role="UNKNOWN">C</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">x</XMTok>
                                  </XMApp>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                </XMApp>
                              </XMApp>
                              <XMTok meaning="greater-than-or-equals" name="geq" role="RELOP">≥</XMTok>
                              <XMApp>
                                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                                <XMTok font="italic" role="UNKNOWN">D</XMTok>
                                <XMApp>
                                  <XMTok role="SUBSCRIPTOP" scriptpos="post6"/>
                                  <XMApp>
                                    <XMTok role="SUPERSCRIPTOP" scriptpos="post6"/>
                                    <XMTok font="italic" role="UNKNOWN">C</XMTok>
                                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">y</XMTok>
                                  </XMApp>
                                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                                </XMApp>
                              </XMApp>
                            </XMApp>
                          </XMWrap>
                        </XMDual>
                      </XMCell>
                    </XMRow>
                    <XMRow>
                      <XMCell align="left">
                        <XMTok meaning="0" role="NUMBER" xml:id="Pt0.A4.E7.m1.3">0</XMTok>
                      </XMCell>
                      <XMCell align="left">
                        <XMText lpadding="40.0pt" xml:id="Pt0.A4.E7.m1.4">otherwise</XMText>
                      </XMCell>
                    </XMRow>
                  </XMArray>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math>
      </equation>
      <p>We say that object <Math mode="inline" tex="x" text="x" xml:id="Pt0.A4.p1.m2">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">x</XMTok>
          </XMath>
        </Math> is non-visible in frame <Math mode="inline" tex="t" text="t" xml:id="Pt0.A4.p1.m3">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">t</XMTok>
          </XMath>
        </Math> with respect to <Math mode="inline" tex="p" text="p" xml:id="Pt0.A4.p1.m4">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">p</XMTok>
          </XMath>
        </Math> if <Math mode="inline" tex="PO^{x}_{t}(p)=1" text="P * (O ^ x) _ t * p = 1" xml:id="Pt0.A4.p1.m5">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">P</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">O</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">x</XMTok>
                  </XMApp>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                </XMApp>
                <XMDual>
                  <XMRef idref="Pt0.A4.p1.m5.1"/>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok font="italic" role="UNKNOWN" xml:id="Pt0.A4.p1.m5.1">p</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
              <XMTok meaning="1" role="NUMBER">1</XMTok>
            </XMApp>
          </XMath>
        </Math>. We use the value <Math mode="inline" tex="p=0.7" text="p = 0.7" xml:id="Pt0.A4.p1.m6">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMTok font="italic" role="UNKNOWN">p</XMTok>
              <XMTok meaning="0.7" role="NUMBER">0.7</XMTok>
            </XMApp>
          </XMath>
        </Math> to decide which objects are non-visible. Contained objects are defined as non-visible, regardless of their <Math mode="inline" tex="PO" text="P * O" xml:id="Pt0.A4.p1.m7">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" role="UNKNOWN">P</XMTok>
              <XMTok font="italic" role="UNKNOWN">O</XMTok>
            </XMApp>
          </XMath>
        </Math> value.</p>
    </para>
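    <para>
      <p>The annotation rule in Eq. <ref labelref="LABEL:eq:po"/> can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the annotation code used for the dataset: the function names, the dictionary layout, and the exclusion of the object itself from the candidate occluders are all hypothetical choices, and the occlusion rates are assumed to be precomputed.</p>
      <verbatim>
```python
from typing import Dict, Hashable, Set, Tuple

P_THRESHOLD = 0.7  # the value of p stated in the text


def partially_occluded(
    x: Hashable,
    occlusion_rate: Dict[Tuple[Hashable, Hashable], float],
    dist_cam: Dict[Hashable, float],
    p: float = P_THRESHOLD,
) -> bool:
    """Evaluate PO for object x in a single frame.

    occlusion_rate[(x, y)] plays the role of OR^x_t(y); it is assumed to be
    precomputed, and missing pairs are treated as a rate of 0.
    dist_cam[o] plays the role of DC^o_t, the distance of o from the camera.
    """
    return any(
        occlusion_rate.get((x, y), 0.0) >= p and dist_cam[x] >= dist_cam[y]
        for y in dist_cam
        if y != x  # assumption: an object does not occlude itself
    )


def is_non_visible(
    x: Hashable,
    occlusion_rate: Dict[Tuple[Hashable, Hashable], float],
    dist_cam: Dict[Hashable, float],
    contained: Set[Hashable],
    p: float = P_THRESHOLD,
) -> bool:
    # Contained objects are non-visible regardless of their PO value.
    return x in contained or partially_occluded(x, occlusion_rate, dist_cam, p)
```
      </verbatim>
      <p>In this sketch, an object that is at least as far from the camera as some occluder covering it at a rate of 0.7 or more is labeled non-visible, and containment overrides the occlusion test.</p>
    </para>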
    <para xml:id="Pt0.A4.p2">
      <p>Each object is represented by a 5-dimensional vector containing four bounding-box coordinates in <Math mode="inline" tex="(x_{1},y_{1},x_{2},y_{2})" text="vector@(x _ 1, y _ 1, x _ 2, y _ 2)" xml:id="Pt0.A4.p2.m1">
          <XMath>
            <XMDual>
              <XMApp>
                <XMTok meaning="vector"/>
                <XMRef idref="Pt0.A4.p2.m1.1"/>
                <XMRef idref="Pt0.A4.p2.m1.2"/>
                <XMRef idref="Pt0.A4.p2.m1.3"/>
                <XMRef idref="Pt0.A4.p2.m1.4"/>
              </XMApp>
              <XMWrap>
                <XMTok role="OPEN" stretchy="false">(</XMTok>
                <XMApp xml:id="Pt0.A4.p2.m1.1">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">x</XMTok>
                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp xml:id="Pt0.A4.p2.m1.2">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">y</XMTok>
                  <XMTok fontsize="70%" meaning="1" role="NUMBER">1</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp xml:id="Pt0.A4.p2.m1.3">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">x</XMTok>
                  <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                </XMApp>
                <XMTok role="PUNCT">,</XMTok>
                <XMApp xml:id="Pt0.A4.p2.m1.4">
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">y</XMTok>
                  <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                </XMApp>
                <XMTok role="CLOSE" stretchy="false">)</XMTok>
              </XMWrap>
            </XMDual>
          </XMath>
        </Math> format and an additional visibility bit. Visible objects are represented by their ground-truth bounding boxes and a turned-on visibility bit. Non-visible objects are represented by an all-zero bounding box and a turned-off visibility bit.</p>
    </para>
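    <para>
      <p>The encoding described above can be sketched as a small Python helper. The function name <text font="typewriter">encode_object</text> and the use of 1.0 and 0.0 for the turned-on and turned-off visibility bit are assumptions made for illustration; the text does not specify the numeric values of the bit.</p>
      <verbatim>
```python
from typing import List, Sequence


def encode_object(box: Sequence[float], visible: bool) -> List[float]:
    """Return [x1, y1, x2, y2, visibility_bit] for one object in one frame."""
    if visible:
        x1, y1, x2, y2 = box
        # Visible: ground-truth box plus a turned-on bit (assumed to be 1.0).
        return [float(x1), float(y1), float(x2), float(y2), 1.0]
    # Non-visible: all-zero box plus a turned-off bit (assumed to be 0.0).
    return [0.0, 0.0, 0.0, 0.0, 0.0]
```
      </verbatim>
      <p>For example, under these assumptions a visible object with ground-truth box (10, 20, 50, 80) would be encoded as (10, 20, 50, 80, 1), while the same object in a frame where it is non-visible would be encoded as five zeros.</p>
    </para>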
  </appendix>
</document>
