<?xml version="1.0" encoding="UTF-8"?>
<?latexml searchpaths="/home/japhy/scienceReplication.artiswrong.com/paper_files/arxiv/2204.14114/latex_extracted"?>
<?latexml class="article" options="11pt"?>
<?latexml package="acl"?>
<?latexml package="times"?>
<?latexml package="latexsym"?>
<?latexml package="fontenc" options="T1"?>
<?latexml package="inputenc" options="utf8"?>
<?latexml package="microtype"?>
<?latexml package="caption"?>
<?latexml package="subcaption"?>
<?latexml package="graphicx"?>
<?latexml package="hyperref"?>
<?latexml package="float"?>
<?latexml RelaxNGSchema="LaTeXML"?>
<document xmlns="http://dlmf.nist.gov/LaTeXML" class="ltx_authors_1line">
  <resource src="LaTeXML.css" type="text/css"/>
  <resource src="ltx-article.css" type="text/css"/>
  <title>Developmental Negation Processing in Transformer Language Models</title>
  <creator role="author">
    <personname>Antonio Laverghetta Jr. </personname>
  </creator>
  <creator before="  " role="author">
    <personname>John Licato <break/>Advancing Machine and Human Reasoning (AMHR) Lab <break/>Department of Computer Science and Engineering <break/>University of South Florida <break/>Tampa, FL, USA <break/><text font="typewriter">{alaverghett,licato}@usf.edu</text> <break/></personname>
  </creator>
  <abstract name="Abstract">
    <p>Reasoning with negation is known to be difficult for transformer-based language models. While previous studies have used the tools of psycholinguistics to probe a transformer’s ability to reason over negation, none have focused on the types of negation studied in developmental psychology. We explore how well transformers can process such categories of negation by framing the problem as a natural language inference (NLI) task. We curate a set of diagnostic questions for our target categories from popular NLI datasets and evaluate how well a suite of models reasons over them. We find that models perform consistently better only on certain categories, suggesting clear distinctions in how the categories are processed.<note mark="1" role="footnote" xml:id="footnote1"><tags>
          <tag>1</tag>
          <tag role="autoref">footnote 1</tag>
          <tag role="refnum">1</tag>
          <tag role="typerefnum">footnote 1</tag>
        </tags>Code and data to reproduce our experiments can be found on GitHub: <ref class="ltx_href" href="https://github.com/Advancing-Machine-Human-Reasoning-Lab/negation-processing-ACL-2022">https://github.com/Advancing-Machine-Human-Reasoning-Lab/negation-processing-ACL-2022</ref></note></p>
  </abstract>
  <section inlist="toc" xml:id="S1">
    <tags>
      <tag>1</tag>
      <tag role="autoref">section 1</tag>
      <tag role="refnum">1</tag>
      <tag role="typerefnum">§1</tag>
    </tags>
    <title><tag close=" ">1</tag>Introduction</title>
    <para xml:id="S1.p1">
      <p>Negation is an important construct in language for reasoning over the truth of propositions <cite class="ltx_citemacro_cite"><bibref bibrefs="10.1093/aristotelian/44.1.127" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, garnering interest from philosophy <cite class="ltx_citemacro_cite"><bibref bibrefs="Horn1989-HORANH" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, psycholinguistics <cite class="ltx_citemacro_cite"><bibref bibrefs="zwaan2012experiential" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, and natural language processing (NLP) <cite class="ltx_citemacro_cite"><bibref bibrefs="morante2021recent" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>. While transformer language models (TLMs) <cite class="ltx_citemacro_cite"><bibref bibrefs="NIPS2017_3f5ee243" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> have achieved impressive performance across many NLP tasks, a great deal of recent work has found that they do not process negation well, and often make predictions that would be trivially false in the eyes of a human <cite class="ltx_citemacro_cite"><bibref bibrefs="rogers-etal-2020-primer,ettinger-2020-bert,laverghetta-jr-etal-2021-transformer" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>.</p>
    </para>
    <para xml:id="S1.p2">
      <p>In developmental psychology, there has likewise been a great deal of interest in how a child’s ability to comprehend negation emerges in the early years of life <cite class="ltx_citemacro_cite"><bibref bibrefs="nordmeyer2013measuring,nordmeyer2018early,reuter2018getting,grigoroglou2019toddlers" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>. Unlike NLP, which typically treats negation as a single monolithic competency, this research has long recognized that there are many kinds of negation used in everyday interactions <cite class="ltx_citemacro_cite"><bibref bibrefs="bloom1970language,pea1982origins" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>. These uses range from expressing a child’s rejection of something to clarifying what a child knows. These “developmental” categories of negation do not emerge simultaneously; children tend to start using certain kinds before others <cite class="ltx_citemacro_cite"><bibref bibrefs="nordmeyer2018individual" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>.</p>
    </para>
    <para xml:id="S1.p3">
      <p>Given that these categories represent some of the earliest uses of negation among humans, understanding how well TLMs can master them is important for building more human-like models of language processing. Understanding how well models perform on different categories will indicate whether they have mastered some forms of negation, while also helping to identify failure points. Another interesting question is whether the proficiency of TLMs on these categories is at all related to competencies in human children (e.g., is the category on which models consistently perform best the same one that children most frequently employ?). However, to our knowledge, no prior work in NLP has focused on how well models perform on the forms of negation of interest to developmental psychology.</p>
    </para>
    <para xml:id="S1.p4">
      <p>In this short paper, we investigate how well a suite of TLMs can process developmental negation,<note mark="2" role="footnote" xml:id="footnote2"><tags>
            <tag>2</tag>
            <tag role="autoref">footnote 2</tag>
            <tag role="refnum">2</tag>
            <tag role="typerefnum">footnote 2</tag>
          </tags>By which we mean the forms of negation studied in developmental psychology.</note> by framing the problem as a natural language inference (NLI) task. We develop a rule-based parser to extract problems from existing NLI datasets, and evaluate our models on each category to determine <text font="italic">(i)</text> whether certain categories are more solvable by our models than others, and <text font="italic">(ii)</text> what relationships exist among the categories. We find that models can consistently achieve stronger performance only on certain categories, and that training on combinations or sequences of these categories does not substantially improve a model’s downstream performance.</p>
    </para>
  </section>
  <section inlist="toc" xml:id="S2">
    <tags>
      <tag>2</tag>
      <tag role="autoref">section 2</tag>
      <tag role="refnum">2</tag>
      <tag role="typerefnum">§2</tag>
    </tags>
    <title><tag close=" ">2</tag>Related Work</title>
    <para xml:id="S2.p1">
      <p>Negation is known to be frequently used in everyday conversation. While this includes its logical form, we primarily focus on negation’s psycholinguistic forms, especially those that have been studied in the context of developmental psychology. Negation emerges early in child development, with ‘no’ sometimes being a child’s first word <cite class="ltx_citemacro_cite"><bibref bibrefs="schneider2015large" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, and even infants appear to understand forms of negation <cite class="ltx_citemacro_cite"><bibref bibrefs="Piaget1980,HOCHMANN2021104599" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>. Preschool children use at least three different kinds of negation <cite class="ltx_citemacro_cite"><bibref bibrefs="bloom1970language" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, but possibly as many as nine <cite class="ltx_citemacro_cite"><bibref bibrefs="choi1988semantic" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>. As noted by <cite class="ltx_citemacro_citet"><bibref bibrefs="nordmeyer2018individual" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, one of the first categories children use is <text font="italic">rejection</text>, where a child rejects an object or activity. This is followed by <text font="italic">existence</text>, where a child might express the lack of an object, and later still by <text font="italic">denial</text>, which a child uses to deny the truth of a claim. Larger-scale studies of child-directed speech have found that truth-functional kinds of negation tend to emerge later <cite class="ltx_citemacro_cite"><bibref bibrefs="liu2021english" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, but individual children do vary in their specific order of acquisition <cite class="ltx_citemacro_cite"><bibref bibrefs="nordmeyer2018individual" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>.
It is unknown whether this ordering reflects any deeper dependencies among the different categories, or whether the ordering is reflected in how artificial language models (LMs) learn negation.</p>
    </para>
    <para xml:id="S2.p2">
      <p>In NLP, methods from psycholinguistics have been used to probe the reasoning capabilities of LMs. Results from some studies have indicated that TLMs are not human-like in their processing of negation <cite class="ltx_citemacro_cite"><bibref bibrefs="ettinger-2020-bert,kassner-schutze-2020-negated" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>. A similar line of work has used the NLI task to probe a model’s ability to process negation and found that TLMs will often alter their predictions when negation is inserted or removed, even when the negation does not alter the entailment relationship <cite class="ltx_citemacro_cite"><bibref bibrefs="hossain-etal-2020-analysis,hartmann-etal-2021-multilingual" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>.
As argued by <cite class="ltx_citemacro_citet"><bibref bibrefs="kruszewski-etal-2016-logical" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, part of the challenge of modeling purely logical negation is that a predicate often occurs in very similar contexts regardless of whether it is being negated. They argue that we should view negation as a “graded similarity function”, and show that distributional models can predict human plausibility judgments quite well, even in the presence of negation. Taken together, these works leave it unclear how well distributional models, especially TLMs, actually process negation. We contribute to this literature from a new perspective, by studying how well models can reason over forms of negation common in developmental psychology.</p>
    </para>
  </section>
  <section inlist="toc" labels="LABEL:sec:corpus" xml:id="S3">
    <tags>
      <tag>3</tag>
      <tag role="autoref">section 3</tag>
      <tag role="refnum">3</tag>
      <tag role="typerefnum">§3</tag>
    </tags>
    <title><tag close=" ">3</tag>The Developmental Negation Corpus</title>
    <para xml:id="S3.p1">
      <p>We use the NLI task to study the negation reasoning capabilities of our models. NLI problems consist of two sentences: a premise (<Math mode="inline" tex="p" text="p" xml:id="S3.p1.m1">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">p</XMTok>
          </XMath>
        </Math>) and hypothesis (<Math mode="inline" tex="h" text="h" xml:id="S3.p1.m2">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">h</XMTok>
          </XMath>
        </Math>), and solving such a problem involves assessing whether <Math mode="inline" tex="p" text="p" xml:id="S3.p1.m3">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">p</XMTok>
          </XMath>
        </Math> textually entails <Math mode="inline" tex="h" text="h" xml:id="S3.p1.m4">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">h</XMTok>
          </XMath>
        </Math>. The generic structure of the NLI task makes it suitable for studying a variety of underlying reasoning skills, including negation. We specifically use the SNLI <cite class="ltx_citemacro_cite"><bibref bibrefs="bowman-etal-2015-large" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> and MNLI <cite class="ltx_citemacro_cite"><bibref bibrefs="williams-etal-2018-broad" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> datasets.</p>
    </para>
    <table inlist="lot" labels="LABEL:tab:dataset" placement="tb" xml:id="S3.T1">
      <tags>
        <tag><text fontsize="90%">Table 1</text></tag>
        <tag role="autoref"><text fontsize="80%">Table 1</text></tag>
        <tag role="refnum"><text fontsize="80%">1</text></tag>
        <tag role="typerefnum"><text fontsize="80%">Table 1</text></tag>
      </tags>
      <tabular class="ltx_centering ltx_guessed_headers" vattach="middle">
        <thead>
          <tr>
            <td align="left" border="t" thead="column row"><text fontsize="80%">Category</text></td>
            <td align="center" border="t" thead="column"><text fontsize="80%"># Train</text></td>
            <td align="center" border="t" thead="column"><text fontsize="80%"># Test</text></td>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td align="left" border="t" thead="row"><text fontsize="80%">Possession (</text><Math mode="inline" tex="PO" text="P * O" xml:id="S3.T1.m1">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">P</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">O</XMTok>
                  </XMApp>
                </XMath>
              </Math><text fontsize="80%">)</text></td>
            <td align="center" border="t"><text fontsize="80%">1053</text></td>
            <td align="center" border="t"><text fontsize="80%">520</text></td>
          </tr>
          <tr>
            <td align="left" thead="row"><text fontsize="80%">Existence (</text><Math mode="inline" tex="EX" text="E * X" xml:id="S3.T1.m2">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">E</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">X</XMTok>
                  </XMApp>
                </XMath>
              </Math><text fontsize="80%">)</text></td>
            <td align="center"><text fontsize="80%">5528</text></td>
            <td align="center"><text fontsize="80%">2723</text></td>
          </tr>
          <tr>
            <td align="left" thead="row"><text fontsize="80%">Labeling (</text><Math mode="inline" tex="L" text="L" xml:id="S3.T1.m3">
                <XMath>
                  <XMTok font="italic" fontsize="80%" role="UNKNOWN">L</XMTok>
                </XMath>
              </Math><text fontsize="80%">)</text></td>
            <td align="center"><text fontsize="80%">2241</text></td>
            <td align="center"><text fontsize="80%">1104</text></td>
          </tr>
          <tr>
            <td align="left" thead="row"><text fontsize="80%">Prohibition (</text><Math mode="inline" tex="PR" text="P * R" xml:id="S3.T1.m4">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">P</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">R</XMTok>
                  </XMApp>
                </XMath>
              </Math><text fontsize="80%">)</text></td>
            <td align="center"><text fontsize="80%">814</text></td>
            <td align="center"><text fontsize="80%">400</text></td>
          </tr>
          <tr>
            <td align="left" thead="row"><text fontsize="80%">Inability (</text><Math mode="inline" tex="I" text="I" xml:id="S3.T1.m5">
                <XMath>
                  <XMTok font="italic" fontsize="80%" role="UNKNOWN">I</XMTok>
                </XMath>
              </Math><text fontsize="80%">)</text></td>
            <td align="center"><text fontsize="80%">1384</text></td>
            <td align="center"><text fontsize="80%">682</text></td>
          </tr>
          <tr>
            <td align="left" thead="row"><text fontsize="80%">Epistemic (</text><Math mode="inline" tex="EP" text="E * P" xml:id="S3.T1.m6">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">E</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">P</XMTok>
                  </XMApp>
                </XMath>
              </Math><text fontsize="80%">)</text></td>
            <td align="center"><text fontsize="80%">1903</text></td>
            <td align="center"><text fontsize="80%">936</text></td>
          </tr>
          <tr>
            <td align="left" border="b" thead="row"><text fontsize="80%">Rejection (</text><Math mode="inline" tex="R" text="R" xml:id="S3.T1.m7">
                <XMath>
                  <XMTok font="italic" fontsize="80%" role="UNKNOWN">R</XMTok>
                </XMath>
              </Math><text fontsize="80%">)</text></td>
            <td align="center" border="b"><text fontsize="80%">1737</text></td>
            <td align="center" border="b"><text fontsize="80%">856</text></td>
          </tr>
        </tbody>
      </tabular>
      <toccaption class="ltx_centering"><tag close=" "><text fontsize="80%">1</text></tag><text fontsize="80%">Summary statistics for the curated dataset.</text></toccaption>
      <caption class="ltx_centering" fontsize="80%"><tag close=": "><text fontsize="113%">Table 1</text></tag><text fontsize="113%">Summary statistics for the curated dataset.</text></caption>
    </table>
    <table inlist="lot" labels="LABEL:tab:examples" placement="h" xml:id="S3.T2">
      <tags>
        <tag><text fontsize="90%">Table 2</text></tag>
        <tag role="autoref"><text fontsize="80%">Table 2</text></tag>
        <tag role="refnum"><text fontsize="80%">2</text></tag>
        <tag role="typerefnum"><text fontsize="80%">Table 2</text></tag>
      </tags>
      <tabular class="ltx_centering ltx_guessed_headers" vattach="middle">
        <thead>
          <tr>
            <td align="left" border="t" thead="column row"><text fontsize="80%">Category</text></td>
            <td align="left" border="t" thead="column"><text fontsize="80%">Premise</text></td>
            <td align="left" border="t" thead="column"><text fontsize="80%">Hypothesis</text></td>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td align="left" border="t" thead="row"><Math mode="inline" tex="PO" text="P * O" xml:id="S3.T2.m1">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">P</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">O</XMTok>
                  </XMApp>
                </XMath>
              </Math></td>
            <td align="left" border="t"><text fontsize="80%">yeah you probably don’t have the right temperatures…</text></td>
            <td align="left" border="t"><text fontsize="80%">You probably have ideal temperatures…</text></td>
          </tr>
          <tr>
            <td align="left" thead="row"><Math mode="inline" tex="EX" text="E * X" xml:id="S3.T2.m2">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">E</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">X</XMTok>
                  </XMApp>
                </XMath>
              </Math></td>
            <td align="left"><text fontsize="80%">This analysis pooled estimates…</text></td>
            <td align="left"><text fontsize="80%">The analysis proves that there is no link…</text></td>
          </tr>
          <tr>
            <td align="left" thead="row"><Math mode="inline" tex="L" text="L" xml:id="S3.T2.m3">
                <XMath>
                  <XMTok font="italic" fontsize="80%" role="UNKNOWN">L</XMTok>
                </XMath>
              </Math></td>
            <td align="left"><text fontsize="80%">Not orders, no.</text></td>
            <td align="left"><text fontsize="80%">It is not orders.</text></td>
          </tr>
          <tr>
            <td align="left" thead="row"><Math mode="inline" tex="PR" text="P * R" xml:id="S3.T2.m4">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">P</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">R</XMTok>
                  </XMApp>
                </XMath>
              </Math></td>
            <td align="left"><text fontsize="80%">Two people are sitting against a building near shopping carts.</text></td>
            <td align="left"><text fontsize="80%">Run that way but don’t run into the…</text></td>
          </tr>
          <tr>
            <td align="left" thead="row"><Math mode="inline" tex="I" text="I" xml:id="S3.T2.m5">
                <XMath>
                  <XMTok font="italic" fontsize="80%" role="UNKNOWN">I</XMTok>
                </XMath>
              </Math></td>
            <td align="left"><text fontsize="80%">His manner was unfortunate, I observed thoughtfully.</text></td>
            <td align="left"><text fontsize="80%">I could not pick out what kind of manner he…</text></td>
          </tr>
          <tr>
            <td align="left" thead="row"><Math mode="inline" tex="EP" text="E * P" xml:id="S3.T2.m6">
                <XMath>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">E</XMTok>
                    <XMTok font="italic" fontsize="80%" role="UNKNOWN">P</XMTok>
                  </XMApp>
                </XMath>
              </Math></td>
            <td align="left"><text fontsize="80%">yeah i don’t know why</text></td>
            <td align="left"><text fontsize="80%">I know why</text></td>
          </tr>
          <tr>
            <td align="left" border="b" thead="row"><Math mode="inline" tex="R" text="R" xml:id="S3.T2.m7">
                <XMath>
                  <XMTok font="italic" fontsize="80%" role="UNKNOWN">R</XMTok>
                </XMath>
              </Math></td>
            <td align="left" border="b"><text fontsize="80%">I lowered my voice…</text></td>
            <td align="left" border="b"><text fontsize="80%">I didn’t want to be overheard.</text></td>
          </tr>
        </tbody>
      </tabular>
      <toccaption class="ltx_centering"><tag close=" "><text fontsize="80%">2</text></tag><text fontsize="80%">NLI examples extracted for each category; long examples have been trimmed to fit on one line.</text></toccaption>
      <caption class="ltx_centering" fontsize="80%"><tag close=": "><text fontsize="113%">Table 2</text></tag><text fontsize="113%">NLI examples extracted for each category; long examples have been trimmed to fit on one line.</text></caption>
    </table>
    <para xml:id="S3.p2">
      <p>To automatically identify questions that contain a specific kind of negation, we rely on the work by <cite class="ltx_citemacro_citet"><bibref bibrefs="liu2021english" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, which studied how frequently different kinds of developmental negation occur in child-directed speech, using data from the CHILDES corpus <cite class="ltx_citemacro_cite"><bibref bibrefs="macwhinney2014childes" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>. To do this, they created a simple rule-based parser to automatically tag each sentence in CHILDES with the type of negation it contained (if any). We re-implement their parser, in some cases tweaking the rules slightly to better suit the structure of the NLI task. For each example across all the splits of both datasets, we first obtain a dependency parse of both <Math mode="inline" tex="p" text="p" xml:id="S3.p2.m1">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">p</XMTok>
          </XMath>
        </Math> and <Math mode="inline" tex="h" text="h" xml:id="S3.p2.m2">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">h</XMTok>
          </XMath>
        </Math> using the diaparser package <cite class="ltx_citemacro_cite"><bibref bibrefs="wang-etal-2019-second" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, and check whether either contains an explicit negation marker (“no”, “not”, or “n’t”). If one of the spans contains negation, we check whether its syntactic structure matches the rules of any of our categories and, if so, assign it to that category. We use these questions as the diagnostic set for our experiments, holding out 1/3 of the questions in each category as a <text font="italic">diagnostic test</text> set and leaving the remainder as a <text font="italic">diagnostic train</text> set (we refer to them as such throughout). We place the remaining NLI questions containing no negation in a separate <Math mode="inline" tex="NLI_{train}" text="N * L * I _ (t * r * a * i * n)" xml:id="S3.p2.m3">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" role="UNKNOWN">N</XMTok>
              <XMTok font="italic" role="UNKNOWN">L</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">I</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">r</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                </XMApp>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> set, giving us about 730,000 examples that we use to fine-tune our models on the NLI task. We split out 9,000 questions from this train set at random to use as a <Math mode="inline" tex="NLI_{dev}" text="N * L * I _ (d * e * v)" xml:id="S3.p2.m4">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" role="UNKNOWN">N</XMTok>
              <XMTok font="italic" role="UNKNOWN">L</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">I</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">d</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">e</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">v</XMTok>
                </XMApp>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> set, balanced for each label. In the following, we describe the precise rules used to determine which category a negated example should be assigned to:</p>
    </para>
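    <para>
      <p>The marker check and the per-category split described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' code: the whitespace tokenization, the function names <text font="typewriter">has_negation</text> and <text font="typewriter">split_category</text>, and the fixed random seed are all hypothetical choices.</p>
    </para>

```python
import random

NEGATION_MARKERS = ("no", "not")

def has_negation(span):
    """Return True if the span contains "no", "not", or a token ending in "n't"."""
    # Normalize curly apostrophes so that both spellings of n't are matched.
    for token in span.lower().replace("\u2019", "'").split():
        token = token.strip(".,;:!?\"()")
        if token in NEGATION_MARKERS or token.endswith("n't"):
            return True
    return False

def split_category(examples, test_fraction=1 / 3, seed=0):
    """Shuffle one category's examples; return (diagnostic_train, diagnostic_test)."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # assumed: a seeded random split
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]
```
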
    <paragraph inlist="toc" xml:id="S3.SS0.SSS0.Px1">
      <title>Possession (<Math mode="inline" tex="PO" text="P * O" xml:id="S3.SS0.SSS0.Px1.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" role="UNKNOWN">P</XMTok>
              <XMTok font="italic" role="UNKNOWN">O</XMTok>
            </XMApp>
          </XMath>
        </Math>)</title>
      <para xml:id="S3.SS0.SSS0.Px1.p1">
        <p>We require that the root be a form of <text font="italic">have</text> (<text font="italic">have</text>, <text font="italic">has</text>, or <text font="italic">had</text>), and that the root be directly modified by both the negation and the verb <text font="italic">do</text>.</p>
      </para>
    </paragraph>
    <paragraph inlist="toc" xml:id="S3.SS0.SSS0.Px2">
      <title>Existence (<Math mode="inline" tex="EX" text="E * X" xml:id="S3.SS0.SSS0.Px2.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" role="UNKNOWN">E</XMTok>
              <XMTok font="italic" role="UNKNOWN">X</XMTok>
            </XMApp>
          </XMath>
        </Math>)</title>
      <para xml:id="S3.SS0.SSS0.Px2.p1">
        <p>We require that <text font="italic">there</text> occur in the text before the negative marker, and that the negative marker directly modify a noun phrase, a determiner, or an adverb.</p>
      </para>
    </paragraph>
    <paragraph inlist="toc" xml:id="S3.SS0.SSS0.Px3">
      <title>Labeling (<Math mode="inline" tex="L" text="L" xml:id="S3.SS0.SSS0.Px3.m1">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">L</XMTok>
          </XMath>
        </Math>)</title>
      <para xml:id="S3.SS0.SSS0.Px3.p1">
        <p>We require that the sentence begin with either <text font="italic">That</text> or <text font="italic">It</text>, and that the root of the sentence be a noun that is modified by <text font="italic">is</text> or <text font="italic">’s</text>.</p>
      </para>
    </paragraph>
    <paragraph inlist="toc" xml:id="S3.SS0.SSS0.Px4">
      <title>Prohibition (<Math mode="inline" tex="PR" text="P * R" xml:id="S3.SS0.SSS0.Px4.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" role="UNKNOWN">P</XMTok>
              <XMTok font="italic" role="UNKNOWN">R</XMTok>
            </XMApp>
          </XMath>
        </Math>)</title>
      <para xml:id="S3.SS0.SSS0.Px4.p1">
        <p>We require that the sentence not contain a subject and that the negation be immediately preceded by <text font="italic">do</text>. To avoid conflating this category with others, we filter out cases where the root is one of the explicit markers of another category (e.g., <text font="italic">like</text> or <text font="italic">want</text> in the case of rejection).</p>
      </para>
    </paragraph>
    <paragraph inlist="toc" xml:id="S3.SS0.SSS0.Px5">
      <title>Inability (<Math mode="inline" tex="I" text="I" xml:id="S3.SS0.SSS0.Px5.m1">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">I</XMTok>
          </XMath>
        </Math>)</title>
      <para xml:id="S3.SS0.SSS0.Px5.p1">
        <p>We require that the negation directly modify the root of the sentence, and that the word immediately before the negation is either <text font="italic">can</text> or <text font="italic">could</text> (e.g., <text font="italic">can not do</text>). Prior literature has typically viewed inability from an egocentric perspective. However, we found that allowing only the first person severely restricted the number of examples extracted, and therefore chose to also allow the second and third person.</p>
      </para>
    </paragraph>
    <paragraph inlist="toc" xml:id="S3.SS0.SSS0.Px6">
      <title>Epistemic (<Math mode="inline" tex="EP" text="E * P" xml:id="S3.SS0.SSS0.Px6.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" role="UNKNOWN">E</XMTok>
              <XMTok font="italic" role="UNKNOWN">P</XMTok>
            </XMApp>
          </XMath>
        </Math>)</title>
      <para xml:id="S3.SS0.SSS0.Px6.p1">
        <p>We require that the root be <text font="italic">remember</text>, <text font="italic">know</text>, or <text font="italic">think</text>, and that the root be directly modified by the verb <text font="italic">do</text>.</p>
      </para>
    </paragraph>
    <paragraph inlist="toc" xml:id="S3.SS0.SSS0.Px7">
      <title>Rejection (<Math mode="inline" tex="R" text="R" xml:id="S3.SS0.SSS0.Px7.m1">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">R</XMTok>
          </XMath>
        </Math>)</title>
      <para xml:id="S3.SS0.SSS0.Px7.p1">
        <p>We require that the lemma of the root word be either <text font="italic">like</text> or <text font="italic">want</text>, and that the root is modified by the negative marker.</p>
      </para>
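      <para>
        <p>To make the rule format concrete, the sketch below encodes two of the rules above (rejection and inability) over an already-parsed sentence. It is an illustration only: the <text font="typewriter">Token</text> structure, the dependency label names <text font="typewriter">ROOT</text> and <text font="typewriter">neg</text>, and all function names are assumptions, and the dependency parser itself is not shown.</p>
      </para>

```python
from dataclasses import dataclass

@dataclass
class Token:
    text: str
    lemma: str
    dep: str   # dependency label; "ROOT" and "neg" are assumed label names
    head: int  # index of the syntactic head within the sentence

def find_root(tokens):
    """Index of the first token labelled ROOT, or None if there is none."""
    for i, tok in enumerate(tokens):
        if tok.dep == "ROOT":
            return i
    return None

def negations_on_root(tokens):
    """Indices of negation tokens that directly modify the root."""
    root = find_root(tokens)
    if root is None:
        return []
    return [i for i, tok in enumerate(tokens)
            if tok.dep == "neg" and tok.head == root]

def is_rejection(tokens):
    """Root lemma is like/want and the root is modified by a negation."""
    root = find_root(tokens)
    return (root is not None
            and tokens[root].lemma in ("like", "want")
            and bool(negations_on_root(tokens)))

def is_inability(tokens):
    """A negation modifies the root and directly follows can/could."""
    return any(i > 0 and tokens[i - 1].text.lower() in ("can", "could")
               for i in negations_on_root(tokens))
```
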
      <para xml:id="S3.SS0.SSS0.Px7.p2">
        <p>After performing extraction, categories <Math mode="inline" tex="L" text="L" xml:id="S3.SS0.SSS0.Px7.p2.m1">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">L</XMTok>
            </XMath>
          </Math> and <Math mode="inline" tex="PR" text="P * R" xml:id="S3.SS0.SSS0.Px7.p2.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">P</XMTok>
                <XMTok font="italic" role="UNKNOWN">R</XMTok>
              </XMApp>
            </XMath>
          </Math> contained fewer than 1000 examples, which we deemed insufficient to split into separate train and test sets. To address this, we developed a simple data augmentation approach using the WordNet database <cite class="ltx_citemacro_cite"><bibref bibrefs="miller1998wordnet" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>. From the dependency parse of both <Math mode="inline" tex="p" text="p" xml:id="S3.SS0.SSS0.Px7.p2.m3">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">p</XMTok>
            </XMath>
          </Math> and <Math mode="inline" tex="h" text="h" xml:id="S3.SS0.SSS0.Px7.p2.m4">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">h</XMTok>
            </XMath>
          </Math>, we check if the root of either parse occurs in both spans. If it does, we obtain all synonyms of the word in WordNet and replace the root in both spans with each synonym in turn, creating one new example per synonym. We found that this simple approach increased the number of examples for both <Math mode="inline" tex="L" text="L" xml:id="S3.SS0.SSS0.Px7.p2.m5">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">L</XMTok>
            </XMath>
          </Math> and <Math mode="inline" tex="PR" text="P * R" xml:id="S3.SS0.SSS0.Px7.p2.m6">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">P</XMTok>
                <XMTok font="italic" role="UNKNOWN">R</XMTok>
              </XMApp>
            </XMath>
          </Math> to at least 1500. Note that we performed no augmentation for the other categories, as our parser extracted at least 1500 examples for all other cases. Table <ref labelref="LABEL:tab:dataset"/> shows statistics for the dataset after augmentation.</p>
      </para>
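      <para>
        <p>The augmentation step can be sketched as below. This is a simplified illustration, not the authors' implementation: the synonym table is passed in as a plain dictionary (a hypothetical stand-in for a WordNet lookup), the root is assumed to be known from the parse, and matching is done on whitespace-separated tokens.</p>
      </para>

```python
def augment_pair(premise, hypothesis, root, synonyms):
    """Return new (premise, hypothesis) pairs, one per synonym of the shared root.

    `synonyms` is a hypothetical mapping from a word to a list of synonyms;
    in the paper the synonyms come from WordNet.
    """
    p_tokens, h_tokens = premise.split(), hypothesis.split()
    if root not in p_tokens or root not in h_tokens:
        return []  # the root must occur in both spans
    augmented = []
    for synonym in synonyms.get(root, []):
        # Replace every occurrence of the root in both spans with the synonym.
        new_p = " ".join(synonym if t == root else t for t in p_tokens)
        new_h = " ".join(synonym if t == root else t for t in h_tokens)
        augmented.append((new_p, new_h))
    return augmented
```
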
      <para xml:id="S3.SS0.SSS0.Px7.p3">
        <p>Table <ref labelref="LABEL:tab:examples"/> shows extracted examples, along with their category assignment. We generally found that the extracted examples matched up with the prototypical category quite well, although in some cases their semantics differed slightly. For instance, consider a <Math mode="inline" tex="PR" text="P * R" xml:id="S3.SS0.SSS0.Px7.p3.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">P</XMTok>
                <XMTok font="italic" role="UNKNOWN">R</XMTok>
              </XMApp>
            </XMath>
          </Math> example with <Math mode="inline" tex="p" text="p" xml:id="S3.SS0.SSS0.Px7.p3.m2">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">p</XMTok>
            </XMath>
          </Math> = <text font="italic">don’t miss having a flick through the albums</text> and <Math mode="inline" tex="h" text="h" xml:id="S3.SS0.SSS0.Px7.p3.m3">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">h</XMTok>
            </XMath>
          </Math> = <text font="italic">The pictures of old Madeira show a more interesting city than now</text>, which is an MNLI example originally extracted from a travel guide. Although this technically counts as <Math mode="inline" tex="PR" text="P * R" xml:id="S3.SS0.SSS0.Px7.p3.m4">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">P</XMTok>
                <XMTok font="italic" role="UNKNOWN">R</XMTok>
              </XMApp>
            </XMath>
          </Math>, it does not have quite the same semantics as an actual command. Unfortunately, these ambiguities are not easily resolved, given that negation takes on many forms and may occur at any location within a sentence. We therefore opted to focus on forms of negation that can be easily extracted, and leave improvements to our dataset creation protocol for future work.</p>
      </para>
    </paragraph>
  </section>
  <section inlist="toc" xml:id="S4">
    <tags>
      <tag>4</tag>
      <tag role="autoref">section 4</tag>
      <tag role="refnum">4</tag>
      <tag role="typerefnum">§4</tag>
    </tags>
    <title><tag close=" ">4</tag>Experiments</title>
    <para xml:id="S4.p1">
      <p>Using the curated dataset, we performed a series of exploratory experiments to help us understand how well TLMs process each of the negation categories. We use BERT <cite class="ltx_citemacro_cite"><bibref bibrefs="devlin-etal-2019-bert" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> and RoBERTa <cite class="ltx_citemacro_cite"><bibref bibrefs="liu2019roberta" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, two popular transformer LMs that have demonstrated impressive results on a variety of language understanding tasks. We also examine MiniBERTa <cite class="ltx_citemacro_cite"><bibref bibrefs="warstadt-etal-2020-learning" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> and BabyBERTa <cite class="ltx_citemacro_cite"><bibref bibrefs="huebner-etal-2021-babyberta" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, which are both based on the RoBERTa architecture but were pre-trained on far fewer tokens (10 million and 5 million, respectively), an amount closer to the language a child is exposed to in the first few years of life. We use the Huggingface implementation of all models <cite class="ltx_citemacro_cite"><bibref bibrefs="wolf-etal-2020-transformers" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, and use both the <text font="italic">base</text> and <text font="italic">large</text> versions of BERT and RoBERTa, which differ only in the number of trainable parameters.</p>
    </para>
    <paragraph inlist="toc" xml:id="S4.SS0.SSS0.Px1">
      <title>Experiment 1:</title>
      <para xml:id="S4.SS0.SSS0.Px1.p1">
        <p>We began by investigating whether TLMs would master certain negation categories sooner than others over the course of training. We train our models on <Math mode="inline" tex="NLI_{train}" text="N * L * I _ (t * r * a * i * n)" xml:id="S4.SS0.SSS0.Px1.p1.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">N</XMTok>
                <XMTok font="italic" role="UNKNOWN">L</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">I</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">r</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                  </XMApp>
                </XMApp>
              </XMApp>
            </XMath>
          </Math> for 10 epochs, using a learning rate of <Math mode="inline" tex="1\times 10^{-5}" text="1 * 10 ^ (- 5)" xml:id="S4.SS0.SSS0.Px1.p1.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">×</XMTok>
                <XMTok meaning="1" role="NUMBER">1</XMTok>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok meaning="10" role="NUMBER">10</XMTok>
                  <XMApp>
                    <XMTok fontsize="70%" meaning="minus" role="ADDOP">-</XMTok>
                    <XMTok fontsize="70%" meaning="5" role="NUMBER">5</XMTok>
                  </XMApp>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>, a weight decay of <Math mode="inline" tex="0.01" text="0.01" xml:id="S4.SS0.SSS0.Px1.p1.m3">
            <XMath>
              <XMTok meaning="0.01" role="NUMBER">0.01</XMTok>
            </XMath>
          </Math>, a batch size of <Math mode="inline" tex="16" text="16" xml:id="S4.SS0.SSS0.Px1.p1.m4">
            <XMath>
              <XMTok meaning="16" role="NUMBER">16</XMTok>
            </XMath>
          </Math>, and a maximum sequence length of <Math mode="inline" tex="175" text="175" xml:id="S4.SS0.SSS0.Px1.p1.m5">
            <XMath>
              <XMTok meaning="175" role="NUMBER">175</XMTok>
            </XMath>
          </Math>.<note mark="3" role="footnote" xml:id="footnote3"><tags>
              <tag>3</tag>
              <tag role="autoref">footnote 3</tag>
              <tag role="refnum">3</tag>
              <tag role="typerefnum">footnote 3</tag>
            </tags>We set the maximum sequence length for BabyBERTa to 128, which is the longest that the model supports.</note> We selected these hyperparameters to be similar to those which were previously reported to yield strong results when training on NLI datasets <cite class="ltx_citemacro_cite"><bibref bibrefs="laverghetta-jr-etal-2021-transformer" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>. We additionally evaluated the models on <Math mode="inline" tex="NLI_{dev}" text="N * L * I _ (d * e * v)" xml:id="S4.SS0.SSS0.Px1.p1.m6">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">N</XMTok>
                <XMTok font="italic" role="UNKNOWN">L</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">I</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">d</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">e</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">v</XMTok>
                  </XMApp>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>, and found that they all achieved a Matthews correlation coefficient of at least 0.6 <cite class="ltx_citemacro_cite"><bibref bibrefs="matthews1975comparison" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>, and thus concluded that these hyperparameters were suitable. For every end-of-epoch checkpoint of every model, we obtained evaluation results on each diagnostic test set. Importantly, the models are not finetuned on any negated NLI questions in this experiment, so any knowledge of negation must come from pre-training. Results are shown in Figure <ref labelref="LABEL:fig:experiment_1"/>. We see that the categories are ranked similarly by accuracy across models. For example, <Math mode="inline" tex="L" text="L" xml:id="S4.SS0.SSS0.Px1.p1.m7">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">L</XMTok>
            </XMath>
          </Math> and <Math mode="inline" tex="PO" text="P * O" xml:id="S4.SS0.SSS0.Px1.p1.m8">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">P</XMTok>
                <XMTok font="italic" role="UNKNOWN">O</XMTok>
              </XMApp>
            </XMath>
          </Math> are among the two best-performing categories, while <Math mode="inline" tex="R" text="R" xml:id="S4.SS0.SSS0.Px1.p1.m9">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">R</XMTok>
            </XMath>
          </Math> is generally one of the worst-performing, indicating clear differences in how LMs process the categories. BabyBERTa, unlike the other models, more closely resembles how children acquire negation. For instance, while <Math mode="inline" tex="R" text="R" xml:id="S4.SS0.SSS0.Px1.p1.m10">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">R</XMTok>
            </XMath>
          </Math> is thought to be one of the first categories children acquire, BabyBERTa is the only model where <Math mode="inline" tex="R" text="R" xml:id="S4.SS0.SSS0.Px1.p1.m11">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">R</XMTok>
            </XMath>
          </Math> is one of the highest-ranking categories in terms of accuracy.</p>
      </para>
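      <para>
        <p>For reference, the Matthews correlation used in the development-set check above can be computed as in the sketch below. Because NLI has more than two labels, the sketch uses the standard multiclass generalization of the coefficient; whether the authors used this exact variant is an assumption, and the function name <text font="typewriter">multiclass_mcc</text> is hypothetical.</p>
      </para>

```python
import math
from collections import Counter

def multiclass_mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient from two label sequences."""
    s = len(y_true)                                  # total number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))  # correctly predicted samples
    true_counts = Counter(y_true)                    # t_k: true occurrences of class k
    pred_counts = Counter(y_pred)                    # p_k: predictions of class k
    labels = set(true_counts) | set(pred_counts)
    numerator = c * s - sum(true_counts[k] * pred_counts[k] for k in labels)
    spread_true = s * s - sum(v * v for v in true_counts.values())
    spread_pred = s * s - sum(v * v for v in pred_counts.values())
    denominator = math.sqrt(spread_true * spread_pred)
    # By convention, return 0 when the coefficient is undefined
    # (for example, when every prediction is the same class).
    return numerator / denominator if denominator else 0.0
```

      <para>
        <p>With two labels, this expression reduces to the familiar binary form of the coefficient.</p>
      </para>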
      <figure inlist="lof" labels="LABEL:fig:experiment_1" placement="h!" xml:id="S4.F1">
        <tags>
          <tag><text fontsize="90%">Figure 1</text></tag>
          <tag role="autoref">Figure 1</tag>
          <tag role="refnum">1</tag>
          <tag role="typerefnum">Figure 1</tag>
        </tags>
        <graphics candidates="experiment_1/experiment_1.png" class="ltx_centering" graphic="experiment_1/experiment_1.png" options="width=433.62pt" xml:id="S4.F1.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">1</tag>Performance of models finetuned on <Math mode="inline" tex="NLI_{train}" text="N * L * I _ (t * r * a * i * n)" xml:id="S4.F1.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">N</XMTok>
                <XMTok font="italic" role="UNKNOWN">L</XMTok>
                <XMApp>
                  <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" role="UNKNOWN">I</XMTok>
                  <XMApp>
                    <XMTok meaning="times" role="MULOP">⁢</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">r</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                    <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                  </XMApp>
                </XMApp>
              </XMApp>
            </XMath>
          </Math>, evaluated on each diagnostic test set. We refer to MiniBERTa using its Huggingface model ID (<text font="italic">roberta-base-10M-2</text>).</toccaption>
        <caption class="ltx_centering"><tag close=": "><text fontsize="90%">Figure 1</text></tag><text fontsize="90%">Performance of models finetuned on <Math mode="inline" tex="NLI_{train}" text="N * L * I _ (t * r * a * i * n)" xml:id="S4.F1.m2">
              <XMath>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" role="UNKNOWN">N</XMTok>
                  <XMTok font="italic" role="UNKNOWN">L</XMTok>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                    <XMTok font="italic" role="UNKNOWN">I</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">t</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">r</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">a</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">n</XMTok>
                    </XMApp>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math>, evaluated on each diagnostic test set. We refer to MiniBERTa using its Huggingface model ID (<text font="italic">roberta-base-10M-2</text>).</text></caption>
      </figure>
    </paragraph>
    <paragraph inlist="toc" xml:id="S4.SS0.SSS0.Px2">
      <title>Experiment 2:</title>
      <para xml:id="S4.SS0.SSS0.Px2.p1">
        <p>One might expect that children develop a more abstract understanding of negation as they are exposed to different categories. This was suggested by <cite class="ltx_citemacro_citet"><bibref bibrefs="pea1978development" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>, who argued that more abstract forms of negation develop from less abstract ones, which implies that mastering one form of negation can lead to positive transfer to others. In Experiment 2, we examined how much positive transfer could be obtained by training on one negation category and then testing on the others. We adopt a methodology similar to that of <cite class="ltx_citemacro_citet"><bibref bibrefs="pruksachatkun-etal-2020-intermediate" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>, who explored the conditions that affect intermediate-task transfer learning. Starting from the models trained in Experiment 1, we further finetune each model for 25 epochs on each diagnostic train set separately. We then evaluate the finetuned models on each diagnostic test set, which allows us to examine all pairwise interactions among categories. Figure <ref labelref="LABEL:fig:experiment_2"/> shows the results for all combinations of training and testing categories. Surprisingly, we find that positive transfer generally occurs only when a model is trained on the same category it is tested on; training on a different category has little to no effect on the target category. BabyBERTa is again an exception: we do see positive transfer for most pairs, suggesting that the model is generalizing across categories.</p>
      </para>
      <figure inlist="lof" labels="LABEL:fig:experiment_2" placement="h!" xml:id="S4.F2">
        <tags>
          <tag><text fontsize="90%">Figure 2</text></tag>
          <tag role="autoref">Figure 2</tag>
          <tag role="refnum">2</tag>
          <tag role="typerefnum">Figure 2</tag>
        </tags>
        <graphics candidates="experiment_2/experiment-2.png" class="ltx_centering" graphic="experiment_2/experiment-2.png" options="width=433.62pt" xml:id="S4.F2.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">2</tag>Accuracy of each model on every diagnostic test set, after being finetuned on every diagnostic train set. Plots are color-coded based on the target category.</toccaption>
        <caption class="ltx_centering"><tag close=": "><text fontsize="90%">Figure 2</text></tag><text fontsize="90%">Accuracy of each model on every diagnostic test set, after being finetuned on every diagnostic train set. Plots are color-coded based on the target category.</text></caption>
      </figure>
    </paragraph>
    <paragraph inlist="toc" xml:id="S4.SS0.SSS0.Px3">
      <title>Experiment 3:</title>
      <para xml:id="S4.SS0.SSS0.Px3.p1">
        <p>Building on Experiment 2, we examined how the performance of our models changes when they are trained on all diagnostic categories in sequence. If no positive transfer exists among the categories, we would expect a model’s performance on a particular category to improve only after it has been trained on that same category; even training on multiple other categories should not substantially improve performance on the target. Starting from the models from Experiment 1, we finetune each model for 10 epochs on each diagnostic train set in turn, following the sequence of categories shown on the x-axis of Figure <ref labelref="LABEL:fig:experiment_3"/>. Additionally, we under-sample all diagnostic train sets to have the same number of questions as <Math mode="inline" tex="PR" text="P * R" xml:id="S4.SS0.SSS0.Px3.p1.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">P</XMTok>
                <XMTok font="italic" role="UNKNOWN">R</XMTok>
              </XMApp>
            </XMath>
          </Math>, so that all categories contribute the same amount of data. Figure <ref labelref="LABEL:fig:experiment_3"/> shows the results. For some categories, such as <Math mode="inline" tex="L" text="L" xml:id="S4.SS0.SSS0.Px3.p1.m2">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">L</XMTok>
            </XMath>
          </Math> and <Math mode="inline" tex="PR" text="P * R" xml:id="S4.SS0.SSS0.Px3.p1.m3">
            <XMath>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" role="UNKNOWN">P</XMTok>
                <XMTok font="italic" role="UNKNOWN">R</XMTok>
              </XMApp>
            </XMath>
          </Math>, we see the expected trend: the largest accuracy gain for these categories occurs when the model is trained on the category it is being tested on, and performance drops slightly after it is trained on others. However, for categories such as <Math mode="inline" tex="R" text="R" xml:id="S4.SS0.SSS0.Px3.p1.m4">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">R</XMTok>
            </XMath>
          </Math>, the largest performance gain does not always follow training on the same category. We sometimes see the model continue to improve on <Math mode="inline" tex="R" text="R" xml:id="S4.SS0.SSS0.Px3.p1.m5">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">R</XMTok>
            </XMath>
          </Math> after being trained on <Math mode="inline" tex="R" text="R" xml:id="S4.SS0.SSS0.Px3.p1.m6">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">R</XMTok>
            </XMath>
          </Math>, and in some cases, training on <Math mode="inline" tex="R" text="R" xml:id="S4.SS0.SSS0.Px3.p1.m7">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">R</XMTok>
            </XMath>
          </Math> causes performance on <Math mode="inline" tex="R" text="R" xml:id="S4.SS0.SSS0.Px3.p1.m8">
            <XMath>
              <XMTok font="italic" role="UNKNOWN">R</XMTok>
            </XMath>
          </Math> to <text font="italic">decrease</text>.</p>
      </para>
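      <para>
        <p>The under-sampling step described above can be sketched as follows. This is a minimal illustration: the dictionary layout, the function name <text font="typewriter">undersample</text>, the <text font="typewriter">reference</text> parameter, and the seeded random sampling are assumptions rather than details given in the text.</p>
      </para>

```python
import random

def undersample(train_sets, reference="PR", seed=0):
    """Down-sample every diagnostic train set to the size of the reference set.

    `train_sets` is a hypothetical mapping from category name to a list of
    examples. Sets already smaller than the reference are kept whole.
    """
    target = len(train_sets[reference])
    rng = random.Random(seed)
    return {name: rng.sample(list(examples), min(target, len(examples)))
            for name, examples in train_sets.items()}
```
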
      <figure inlist="lof" labels="LABEL:fig:experiment_3" placement="h!" xml:id="S4.F3">
        <tags>
          <tag><text fontsize="90%">Figure 3</text></tag>
          <tag role="autoref">Figure 3</tag>
          <tag role="refnum">3</tag>
          <tag role="typerefnum">Figure 3</tag>
        </tags>
        <graphics candidates="experiment_3/experiment-3.png" class="ltx_centering" graphic="experiment_3/experiment-3.png" options="width=433.62pt" xml:id="S4.F3.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">3</tag>Results from Experiment 3. The x-axis shows the sequence of categories on which all models were trained, while the y-axis shows the accuracy obtained after being trained on a category.</toccaption>
        <caption class="ltx_centering"><tag close=": "><text fontsize="90%">Figure 3</text></tag><text fontsize="90%">Results from Experiment 3. The x-axis shows the sequence of categories on which all models were trained, while the y-axis shows the accuracy obtained after being trained on a category.</text></caption>
      </figure>
    </paragraph>
  </section>
  <section inlist="toc" xml:id="S5">
    <tags>
      <tag>5</tag>
      <tag role="autoref">section 5</tag>
      <tag role="refnum">5</tag>
      <tag role="typerefnum">§5</tag>
    </tags>
    <title><tag close=" ">5</tag>Discussion and Conclusion</title>
    <para xml:id="S5.p1">
      <p>In this paper, we have explored how well transformers process categories of developmental negation. We find that the performance ranking of the categories is generally consistent across LMs, but that the categories seem to test for orthogonal skills in the majority of LMs. In BabyBERTa, we see significant similarities with the order of negation acquisition in children. Two of the best-performing categories are <Math mode="inline" tex="R" text="R" xml:id="S5.p1.m1">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">R</XMTok>
          </XMath>
        </Math> and <Math mode="inline" tex="L" text="L" xml:id="S5.p1.m2">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">L</XMTok>
          </XMath>
        </Math>, while two of the worst are <Math mode="inline" tex="EX" text="E * X" xml:id="S5.p1.m3">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" role="UNKNOWN">E</XMTok>
              <XMTok font="italic" role="UNKNOWN">X</XMTok>
            </XMApp>
          </XMath>
        </Math> and <Math mode="inline" tex="PR" text="P * R" xml:id="S5.p1.m4">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMTok font="italic" role="UNKNOWN">P</XMTok>
              <XMTok font="italic" role="UNKNOWN">R</XMTok>
            </XMApp>
          </XMath>
        </Math>, which aligns quite well with the order observed by <cite class="ltx_citemacro_citet"><bibref bibrefs="liu2021english" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>. TLMs thus seem to reflect, at least partially, the order of negation acquisition observed in children, although more experiments are needed to understand the extent of this correlation. The finding that category rankings are generally consistent across LMs may have interesting implications, and understanding why LMs struggle with certain categories may help improve their ability to process negation.</p>
    </para>
    <para xml:id="S5.p2">
      <p>Future work can build on these experiments in several ways. In Experiments 2 and 3, we modeled interactions among the negation categories in either a pairwise or a sequential fashion, which is unlikely to reflect how children are exposed to negation. Further experiments that mix all of the categories at once in various proportions might yield a more realistic model of cognitive development. Our approach also requires that each category fit into a specific structure, which limits the number of examples that can be extracted. Future work will need to expand our ruleset to cover more variations in negated utterances. Finally, while we primarily focus on finetuning, pre-training is likely to affect the proficiency of our models on the categories as well. Future work should precisely control the prevalence of each category in the pre-training corpus to observe what effect this has on downstream performance.</p>
    </para>
  </section>
  <bibliography citestyle="authoryear" files="Antonio" xml:id="bib">
    <title>References</title>
  </bibliography>
</document>
