<?xml version="1.0" encoding="UTF-8"?>
<?latexml searchpaths="/home/japhy/scienceReplication.artiswrong.com/paper_files/arxiv/2601.18933/latex_extracted"?>
<?latexml class="article" options="11pt"?>
<?latexml package="acl"?>
<?latexml package="nert"?>
<?latexml package="times"?>
<?latexml package="latexsym"?>
<?latexml package="tabularx"?>
<?latexml package="inputenc" options="utf8"?>
<?latexml RelaxNGSchema="LaTeXML"?>
<document xmlns="http://dlmf.nist.gov/LaTeXML" class="ltx_authors_1line">
  <resource src="LaTeXML.css" type="text/css"/>
  <resource src="ltx-article.css" type="text/css"/>
  <resource src="ltx-listings.css" type="text/css"/>
  <resource src="ltx-ulem.css" type="text/css"/>
  <title><text font="smallcaps">BabyReasoningBench</text>: Generating Developmentally-Inspired Reasoning Tasks for Evaluating Baby Language Models</title>
  <creator role="author">
    <personname>Kaustubh D. Dhole <break/>Department of Computer Science <break/>Emory University <break/>kdhole@emory.edu
</personname>
  </creator>
  <date role="creation"/>
  <abstract name="Abstract">
    <p>Traditional evaluations of reasoning capabilities of language models are dominated by adult-centric benchmarks that presuppose broad world knowledge, complex instruction following, and mature pragmatic competence. These assumptions are mismatched to <emph font="italic">baby language models</emph> trained on developmentally plausible input such as child-directed speech and early-childhood narratives, and they obscure which reasoning abilities (if any) emerge under such constraints. We introduce <text font="bold"> <text font="smallcaps">BabyReasoningBench<note mark="1" role="footnote" xml:id="footnote1"><tags>
              <tag><text font="medium upright">1</text></tag>
              <tag role="autoref"><text font="medium upright">footnote 1</text></tag>
              <tag role="refnum"><text font="medium upright">1</text></tag>
              <tag role="typerefnum"><text font="medium upright">footnote 1</text></tag>
            </tags><ref class="ltx_url" font="typewriter medium upright" href="https://github.com/kaustubhdhole/baby-reasoning-bench">https://github.com/kaustubhdhole/baby-reasoning-bench</ref></note></text></text>, a GPT-5.2-generated benchmark of 19 reasoning tasks grounded in classic paradigms from developmental psychology, spanning theory of mind, analogical and relational reasoning, causal inference and intervention selection, and core reasoning primitives that are known to be confounded by memory and pragmatics. We find that two GPT-2-based baby language models (pretrained on 10M and 100M words of child-directed speech text) show overall low but uneven performance, with dissociations across task families: scaling improves several causal and physical reasoning tasks, while belief attribution and pragmatics-sensitive tasks remain challenging. <text font="smallcaps">BabyReasoningBench</text> provides a developmentally grounded lens for analyzing what kinds of reasoning are supported by child-like training distributions, and for testing mechanistic hypotheses about how such abilities emerge.</p>
  </abstract>
  <section inlist="toc" xml:id="S1">
    <tags>
      <tag>1</tag>
      <tag role="autoref">section 1</tag>
      <tag role="refnum">1</tag>
      <tag role="typerefnum">§1</tag>
    </tags>
    <title><tag close=" ">1</tag>Introduction</title>
    <para xml:id="S1.p1">
      <p>Large language models (LLMs) are typically evaluated on adult-centric benchmarks that presume extensive world knowledge, long-form instruction following, and mature linguistic competence <cite class="ltx_citemacro_cite"><bibref bibrefs="phan2025humanity,srivastava2023beyond,dhole2025conqret" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>. This evaluation regime makes it difficult to answer a different question that matters for both cognitive modeling and data-efficient AI <cite class="ltx_citemacro_cite"><bibref bibrefs="ghanizadeh-dousti-2024-towards" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>: <emph font="italic">what kinds of reasoning emerge when models are trained primarily on developmentally plausible input</emph>? In particular, “baby language models”—models trained on caregiver–child interaction data <cite class="ltx_citemacro_cite"><bibref bibrefs="warstadt2023findings,feng2024child" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, early-childhood text, and simplified perceptual or narrative inputs—invite evaluation paradigms aligned with the <emph font="italic">developmental trajectory</emph> of human cognition rather than expert adult performance. They also enable causal experiments (e.g., controlled-rearing studies) that can be hard to perform on children <cite class="ltx_citemacro_cite"><bibref bibrefs="rozner-etal-2025-babylms" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>.</p>
    </para>
    <para xml:id="S1.p2">
      <p>We introduce <text font="bold"> <text font="smallcaps">BabyReasoningBench</text></text>, a curated collection of tasks grounded in classic findings from developmental psychology and infant cognition. The collection spans (i) <emph font="italic">theory-of-mind</emph> and belief attribution via explicit false-belief elicitation in preschoolers <cite class="ltx_citemacro_cite"><bibref bibrefs="WimmerPerner1983,BaronCohenLeslieFrith1985" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> as well as implicit violation-of-expectation paradigms in infants <cite class="ltx_citemacro_cite"><bibref bibrefs="OnishiBaillargeon2005" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> and broader batteries summarized meta-analytically <cite class="ltx_citemacro_cite"><bibref bibrefs="WellmanCrossWatson2001" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>; (ii) <emph font="italic">analogical reasoning</emph> and relational mapping across familiar causal transformations and story structures <cite class="ltx_citemacro_cite"><bibref bibrefs="GoswamiBrown1990,GentnerToupin1986,HolyoakJunnBillman1984" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>; (iii) <emph font="italic">causal learning</emph> from evidence patterns (e.g., “blicket detector” inferences) <cite class="ltx_citemacro_cite"><bibref bibrefs="GopnikSobel2000" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, causal-structure induction and Bayesian-inspired learning accounts <cite class="ltx_citemacro_cite"><bibref bibrefs="GopnikEtAl2004" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, and curiosity-driven exploration under confounding <cite class="ltx_citemacro_cite"><bibref bibrefs="SchulzBonawitz2007" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>; and (iv) <emph font="italic">core reasoning primitives</emph> that are known to be sensitive to memory, pragmatics, and inhibitory control, including transitive inference <cite class="ltx_citemacro_cite"><bibref bibrefs="BryantTrabasso1971" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, counterfactual and counterfactual-as-possibility reasoning <cite class="ltx_citemacro_cite"><bibref bibrefs="HarrisGermanMills1996,BeckEtAl2006" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, category-based induction <cite class="ltx_citemacro_cite"><bibref bibrefs="GelmanMarkman1986" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, pragmatic wording effects in class-inclusion queries <cite class="ltx_citemacro_cite"><bibref bibrefs="Shipley1979" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, scientific control-of-variables reasoning <cite class="ltx_citemacro_cite"><bibref bibrefs="ChenKlahr1999" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, and conservation under accidental versus intentional transformations <cite class="ltx_citemacro_cite"><bibref bibrefs="McGarrigleDonaldson1974" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>.</p>
    </para>
    <para xml:id="S1.p3">
      <p>Evaluating baby language models on these paradigms serves three complementary goals. First, it anchors model behavior to <emph font="italic">human developmental baselines</emph>: many tasks exhibit sharp age-related transitions (e.g., false-belief performance shifting between ages 3–5 <cite class="ltx_citemacro_cite"><bibref bibrefs="WellmanCrossWatson2001" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>), while others reveal competence under reduced language demands (e.g., implicit false-belief expectations in 15-month-olds <cite class="ltx_citemacro_cite"><bibref bibrefs="OnishiBaillargeon2005" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>). Second, developmental tasks often disentangle reasoning from confounds such as pragmatic interpretation <cite class="ltx_citemacro_cite"><bibref bibrefs="Shipley1979,McGarrigleDonaldson1974" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> or executive demands <cite class="ltx_citemacro_cite"><bibref bibrefs="BryantTrabasso1971,BeckEtAl2006" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, providing diagnostic leverage beyond aggregate accuracy. Third, these paradigms encourage <emph font="italic">mechanism-sensitive</emph> evaluation: success can depend on representing others’ beliefs, mapping relational structure, integrating evidence across trials, or proposing informative interventions—abilities that may (or may not) emerge under developmentally realistic training distributions.</p>
    </para>
    <para xml:id="S1.p4">
      <p>Table <ref labelref="LABEL:tab:BabyReasoningBench_tasks"/> summarizes the task families in  <text font="smallcaps">BabyReasoningBench</text> and the corresponding empirical signatures in human infants and children that motivate each evaluation setting.</p>
    </para>
  </section>
  <section inlist="toc" xml:id="S2">
    <tags>
      <tag>2</tag>
      <tag role="autoref">section 2</tag>
      <tag role="refnum">2</tag>
      <tag role="typerefnum">§2</tag>
    </tags>
    <title><tag close=" ">2</tag> <text font="smallcaps">BabyReasoningBench</text></title>
    <para xml:id="S2.p1">
      <p>Many of the reasoning tasks traditionally evaluated on children are straightforward for frontier models<note mark="2" role="footnote" xml:id="footnote2"><tags>
            <tag>2</tag>
            <tag role="autoref">footnote 2</tag>
            <tag role="refnum">2</tag>
            <tag role="typerefnum">footnote 2</tag>
          </tags>While many of these traditional tasks are not necessarily administered through verbal cues, we verbalize each task as a set of multiple-choice questions (MCQs).</note>, and several of them (e.g., the Sally–Anne test) have been discussed extensively in books and online, which makes LLM-based generation well suited to such tasks. To construct <text font="smallcaps"> BabyReasoningBench</text> in a way that is systematic and less sensitive to idiosyncratic question phrasing, we proceed as follows. We first provide OpenAI GPT-5.2 <note mark="3" role="footnote" xml:id="footnote3"><tags>
            <tag>3</tag>
            <tag role="autoref">footnote 3</tag>
            <tag role="refnum">3</tag>
            <tag role="typerefnum">footnote 3</tag>
          </tags><ref class="ltx_url" font="typewriter" href="https://openai.com/index/introducing-gpt-5-2/">https://openai.com/index/introducing-gpt-5-2/</ref></note> with a template containing the 19 tasks discussed in this paper, including a brief description of each task in tabular form, and instruct it to generate one multiple-choice question per task. After manually validating these seed questions, we prompt GPT-5.2 to produce 10 additional variants for each seed, yielding 11 questions per task (209 in total). This replication strategy helps normalize against potential pattern biases in baby language models by reducing reliance on any single wording or surface form. Finally, we manually review all generated MCQs and apply minor edits where questions or answers appear unreasonable, for example because of missing context. Wherever possible, we keep the token lengths of the answer choices similar so that our perplexity-based multiple-choice classification is less confounded by token counts, but we do not enforce this as a hard constraint.</p>
    </para>
    <para xml:id="S2.p2">
      <p>As a preliminary evaluation, we test two BabyLM baselines trained on developmentally motivated corpora: BabyLM-10M <note mark="4" role="footnote" xml:id="footnote4"><tags>
            <tag>4</tag>
            <tag role="autoref">footnote 4</tag>
            <tag role="refnum">4</tag>
            <tag role="typerefnum">footnote 4</tag>
          </tags><ref class="ltx_url" font="typewriter" href="https://hf.co/BabyLM-community/babylm-baseline-10m-gpt2">https://hf.co/BabyLM-community/babylm-baseline-10m-gpt2</ref></note> and BabyLM-100M <note mark="5" role="footnote" xml:id="footnote5"><tags>
            <tag>5</tag>
            <tag role="autoref">footnote 5</tag>
            <tag role="refnum">5</tag>
            <tag role="typerefnum">footnote 5</tag>
          </tags><ref class="ltx_url" font="typewriter" href="https://hf.co/BabyLM-community/babylm-baseline-100m-gpt2">https://hf.co/BabyLM-community/babylm-baseline-100m-gpt2</ref></note> <cite class="ltx_citemacro_cite"><bibref bibrefs="charpentier2025babylmturns3papers" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, as both models have been pretrained on several datasets, including CHILDES <cite class="ltx_citemacro_cite"><bibref bibrefs="macwhinney2014childes" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite>, using a next-token prediction objective. For each question, we build a prompt for each choice by concatenating the question and the choice, then score each choice by the conditional log-likelihood of its tokens given the question under the language model. The predicted answer is chosen as the option with the highest summed log-probability.</p>
    </para>
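    <para>
      <p>The scoring rule described above can be sketched in a few lines. The following is a minimal illustration, not the evaluation code behind the reported results: the add-one-smoothed bigram model, the whitespace tokenization, the toy corpus, and the names <text font="typewriter">make_bigram_logprob</text>, <text font="typewriter">score_choice</text>, and <text font="typewriter">predict</text> are all assumptions introduced so that the example runs without downloading a model. In practice the log-probability function would presumably be backed by the GPT-2 baselines.</p>
      <verbatim font="typewriter"><![CDATA[
```python
import math
from collections import Counter
from typing import Callable, List, Sequence

# Hypothetical interface: log P(next token | prefix tokens).
LogProbFn = Callable[[Sequence[str], str], float]


def make_bigram_logprob(corpus: List[List[str]]) -> LogProbFn:
    """Toy stand-in for a language model (an assumption, not the BabyLM GPT-2)."""
    vocab = {tok for sent in corpus for tok in sent}
    context_counts: Counter = Counter()
    bigram_counts: Counter = Counter()
    for sent in corpus:
        for prev, nxt in zip(sent, sent[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, nxt)] += 1
    vocab_size = len(vocab) + 1  # one extra slot for unseen tokens

    def logprob(prefix: Sequence[str], token: str) -> float:
        prev = prefix[-1] if prefix else ""
        # Add-one smoothing keeps every probability strictly positive.
        numerator = bigram_counts[(prev, token)] + 1
        denominator = context_counts[prev] + vocab_size
        return math.log(numerator / denominator)

    return logprob


def score_choice(question: Sequence[str], choice: Sequence[str],
                 logprob: LogProbFn) -> float:
    """Sum log P(choice token | question + earlier choice tokens)."""
    context = list(question)
    total = 0.0
    for token in choice:
        total += logprob(context, token)
        context.append(token)  # later choice tokens condition on earlier ones
    return total


def predict(question: Sequence[str], choices: Sequence[Sequence[str]],
            logprob: LogProbFn) -> int:
    """Return the index of the choice with the highest summed log-probability."""
    scores = [score_choice(question, choice, logprob) for choice in choices]
    return max(range(len(choices)), key=scores.__getitem__)


# Toy data, invented for illustration only.
toy_corpus = [
    "sally looks in the basket".split(),
    "sally put the ball in the basket".split(),
    "anne moved the ball to the box".split(),
]
toy_lm = make_bigram_logprob(toy_corpus)
question = "where will sally look ? sally looks in the".split()
choices = [["basket"], ["box"]]
best = predict(question, choices, toy_lm)
print(" ".join(choices[best]))
```
]]></verbatim>
      <p>Only the choice tokens contribute to the score; the question serves purely as conditioning context. Because the scores are summed rather than averaged, each additional token can only lower an option's score, which is why keeping answer lengths similar matters.</p>
    </para>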
    <figure inlist="lof" labels="LABEL:fig:spider" placement="t" xml:id="S2.F1">
      <tags>
        <tag><text fontsize="90%">Figure 1</text></tag>
        <tag role="autoref">Figure 1</tag>
        <tag role="refnum">1</tag>
        <tag role="typerefnum">Figure 1</tag>
      </tags>
      <graphics candidates="model_performance_radar2.png" class="ltx_centering" graphic="model_performance_radar2.png" options="width=368.577pt" xml:id="S2.F1.g1"/>
      <toccaption class="ltx_centering"><tag close=" ">1</tag>Performance of the 10M and 100M GPT-2 baselines on <text font="smallcaps">BabyReasoningBench</text>.</toccaption>
      <caption class="ltx_centering"><tag close=": "><text fontsize="90%">Figure 1</text></tag><text fontsize="90%">Performance of the 10M and 100M GPT-2 baselines on <text font="smallcaps">BabyReasoningBench</text>.</text></caption>
    </figure>
    <table inlist="lot" labels="LABEL:tab:babylm_two_models_tasks" placement="t" xml:id="S2.T1">
      <tags>
        <tag><text fontsize="90%">Table 1</text></tag>
        <tag role="autoref"><text fontsize="90%">Table 1</text></tag>
        <tag role="refnum"><text fontsize="90%">1</text></tag>
        <tag role="typerefnum"><text fontsize="90%">Table 1</text></tag>
      </tags>
      <tabular class="ltx_centering ltx_guessed_headers" colsep="4.0pt" vattach="middle" width="433.6pt">
        <thead>
          <tr>
            <td align="justify" border="t" thead="column"><text class="ltx_wrap" font="bold" fontsize="90%">Task</text></td>
            <td align="right" border="t" thead="column"><text font="bold" fontsize="90%">10M</text></td>
            <td align="right" border="t" thead="column"><text font="bold" fontsize="90%">100M</text></td>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td align="justify" border="t"><text fontsize="90%">False Belief Sally Anne</text></td>
            <td align="right" border="t"><text fontsize="90%">63.64</text></td>
            <td align="right" border="t"><text fontsize="90%">63.64</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Counterfactual Possibilities</text></td>
            <td align="right"><text fontsize="90%">63.64</text></td>
            <td align="right"><text fontsize="90%">54.55</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Transitive Inference</text></td>
            <td align="right"><text fontsize="90%">9.09</text></td>
            <td align="right"><text fontsize="90%">36.36</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Physical Cause Effect</text></td>
            <td align="right"><text fontsize="90%">63.64</text></td>
            <td align="right"><text fontsize="90%">100.00</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Control of Variables Strategy</text></td>
            <td align="right"><text fontsize="90%">45.45</text></td>
            <td align="right"><text fontsize="90%">27.27</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Counterfactual Syllogism Pretend</text></td>
            <td align="right"><text fontsize="90%">63.64</text></td>
            <td align="right"><text fontsize="90%">9.09</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Category Based Induction</text></td>
            <td align="right"><text fontsize="90%">63.64</text></td>
            <td align="right"><text fontsize="90%">63.64</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Story Analogy Relational Shift</text></td>
            <td align="right"><text fontsize="90%">27.27</text></td>
            <td align="right"><text fontsize="90%">45.45</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Causal Structure Learning</text></td>
            <td align="right"><text fontsize="90%">81.82</text></td>
            <td align="right"><text fontsize="90%">18.18</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Blicket Detector Inference</text></td>
            <td align="right"><text fontsize="90%">63.64</text></td>
            <td align="right"><text fontsize="90%">72.73</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Simple Causal Analogy</text></td>
            <td align="right"><text fontsize="90%">54.55</text></td>
            <td align="right"><text fontsize="90%">72.73</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Simple Counterfactual Causal</text></td>
            <td align="right"><text fontsize="90%">18.18</text></td>
            <td align="right"><text fontsize="90%">63.64</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Analogical Problem Solving</text></td>
            <td align="right"><text fontsize="90%">0.00</text></td>
            <td align="right"><text fontsize="90%">36.36</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Class Inclusion Wording</text></td>
            <td align="right"><text fontsize="90%">36.36</text></td>
            <td align="right"><text fontsize="90%">27.27</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Conservation of Number Accidental</text></td>
            <td align="right"><text fontsize="90%">100.00</text></td>
            <td align="right"><text fontsize="90%">54.55</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Violation of Expectation False Belief</text></td>
            <td align="right"><text fontsize="90%">81.82</text></td>
            <td align="right"><text fontsize="90%">90.91</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">Exploratory Play Causal</text></td>
            <td align="right"><text fontsize="90%">54.55</text></td>
            <td align="right"><text fontsize="90%">36.36</text></td>
          </tr>
          <tr>
            <td align="justify"><text fontsize="90%">False Belief Vignette Battery</text></td>
            <td align="right"><text fontsize="90%">63.64</text></td>
            <td align="right"><text fontsize="90%">81.82</text></td>
          </tr>
          <tr>
            <td align="justify" border="b"><text fontsize="90%">False Belief Unexpected Transfer</text></td>
            <td align="right" border="b"><text fontsize="90%">36.36</text></td>
            <td align="right" border="b"><text fontsize="90%">63.64</text></td>
          </tr>
        </tbody>
      </tabular>
      <toccaption class="ltx_centering"><tag close=" "><text fontsize="90%">1</text></tag><text fontsize="90%">Task-wise accuracy (%) for two BabyLM GPT-2 baselines (pretrained on 10M and 100M words of child-directed speech, respectively).</text></toccaption>
      <caption class="ltx_centering" fontsize="90%"><tag close=": ">Table 1</tag>Task-wise accuracy (%) for two BabyLM GPT-2 baselines (pretrained on 10M and 100M words of child-directed speech, respectively).</caption>
    </table>
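    <para>
      <p>Because each task contains 11 questions, every accuracy in Table <ref labelref="LABEL:tab:babylm_two_models_tasks"/> corresponds to a whole number of correct answers out of 11, so a single question shifts a score by about 9.09 percentage points. The short sketch below recovers those counts for three rows copied from the table. The helper name <text font="typewriter">correct_count</text> is hypothetical, and the calculation assumes that all 11 questions of a task were scored.</p>
      <verbatim font="typewriter"><![CDATA[
```python
def correct_count(accuracy_pct: float, n_items: int = 11) -> int:
    """Recover the number of correct answers from a rounded percentage."""
    count = round(accuracy_pct * n_items / 100)
    # Sanity check: the count must reproduce the reported two-decimal value.
    if abs(100 * count / n_items - accuracy_pct) > 0.01:
        raise ValueError(f"{accuracy_pct} is not a multiple of 1/{n_items}")
    return count


# (10M accuracy, 100M accuracy) for three rows copied from the results table.
table_rows = {
    "Transitive Inference": (9.09, 36.36),
    "Physical Cause Effect": (63.64, 100.00),
    "Analogical Problem Solving": (0.00, 36.36),
}

counts = {
    task: (correct_count(small), correct_count(large))
    for task, (small, large) in table_rows.items()
}
step = 100 / 11  # percentage points contributed by one question

for task, (small, large) in counts.items():
    print(f"{task}: {small}/11 (10M), {large}/11 (100M)")
```
]]></verbatim>
      <p>Read this way, a gap such as 63.64 versus 54.55 reflects a difference of a single question.</p>
    </para>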
    <table inlist="lot" labels="LABEL:tab:BabyReasoningBench_tasks" placement="h" xml:id="S2.T2">
      <tags>
        <tag><text fontsize="90%">Table 2</text></tag>
        <tag role="autoref"><text fontsize="90%">Table 2</text></tag>
        <tag role="refnum"><text fontsize="90%">2</text></tag>
        <tag role="typerefnum"><text fontsize="90%">Table 2</text></tag>
      </tags>
      <tabular class="ltx_centering ltx_guessed_headers" vattach="middle">
        <thead>
          <tr>
            <td align="justify" border="tt" thead="column" width="121.4pt"><text class="ltx_wrap" font="bold" fontsize="90%">Reasoning Task</text></td>
            <td align="justify" border="tt" thead="column" width="156.1pt"><text class="ltx_wrap" font="bold" fontsize="90%">Description</text></td>
            <td align="justify" border="tt" thead="column" width="130.1pt"><text class="ltx_wrap" font="bold" fontsize="90%">Performance of Children</text></td>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td align="justify" border="t" width="121.4pt"><text fontsize="90%">False-belief (unexpected transfer) </text><cite class="ltx_citemacro_cite"><bibref bibrefs="WimmerPerner1983" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" border="t" width="156.1pt"><text fontsize="90%">Classic location-change vignette: predict a person’s search given the person did not observe a move.</text></td>
            <td align="justify" border="t" width="130.1pt"><text fontsize="90%">Children under </text><Math mode="inline" tex="\sim" text="similar-to" xml:id="S2.T2.m1">
                <XMath>
                  <XMTok fontsize="90%" meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
                </XMath>
              </Math><text fontsize="90%">4 tend to answer using reality; older preschoolers succeed.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">False-belief (Sally–Anne; autism contrast) </text><cite class="ltx_citemacro_cite"><bibref bibrefs="BaronCohenLeslieFrith1985" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Puppet/doll false-belief story.</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">Typical 4-year-olds frequently pass; autistic children show markedly lower pass rates.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">False-belief battery (meta-analytic factors) </text><cite class="ltx_citemacro_cite"><bibref bibrefs="WellmanCrossWatson2001" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Multiple variants (unexpected contents, transfer, wording/motivation manipulations).</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">Strong age effect from 3</text><Math mode="inline" tex="\rightarrow" text="rightarrow" xml:id="S2.T2.m2">
                <XMath>
                  <XMTok fontsize="90%" name="rightarrow" role="ARROW">→</XMTok>
                </XMath>
              </Math><text fontsize="90%">5 years.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Implicit false-belief (violation-of-expectation) </text><cite class="ltx_citemacro_cite"><bibref bibrefs="OnishiBaillargeon2005" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Nonverbal expectation measure: infants look longer when an actor acts inconsistently with her belief.</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">Evidence for belief-consistent expectations in 15-month-olds under reduced language demands.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Causal analogies with familiar relations </text><cite class="ltx_citemacro_cite"><bibref bibrefs="GoswamiBrown1990" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Four-term analogies based on understood causal transformations (e.g., melting).</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">Even 3–4-year-olds can succeed when relations are familiar/causally transparent.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Relational mapping vs. surface similarity (story analogy) </text><cite class="ltx_citemacro_cite"><bibref bibrefs="GentnerToupin1986" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Re-enact a base story with new characters; vary surface similarity and relational clarity.</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">Older children show “relational shift” (prefer deep structure); younger children rely more on surface similarity.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Analogical problem solving / transfer with support </text><cite class="ltx_citemacro_cite"><bibref bibrefs="HolyoakJunnBillman1984" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Hear a base story/pattern; solve a new problem that shares structure; hints may be provided.</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">Preschoolers can transfer with scaffolding and close mappings; older children generalize more flexibly.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Physical cause–effect principles </text><cite class="ltx_citemacro_cite"><bibref bibrefs="BullockGelmanBaillargeon1982" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Judgments about whether effects can occur without causes; temporal priority (cause-before-effect).</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">Preschoolers show early understanding that causes precede effects and that “magic” effects are dispreferred.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Blicket detector causal inference </text><cite class="ltx_citemacro_cite"><bibref bibrefs="GopnikSobel2000" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Infer which objects have hidden causal power from activation patterns across trials.</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">2–4-year-olds integrate evidence across trials to identify causal candidates.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Causal structure learning (Bayes nets account) </text><cite class="ltx_citemacro_cite"><bibref bibrefs="GopnikEtAl2004" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Infer latent causal graphs from observations/interventions; predict outcomes of new interventions.</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">Empirical and theoretical account of strong causal learning abilities emerging around ages 2–4.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Exploratory play under confounding </text><cite class="ltx_citemacro_cite"><bibref bibrefs="SchulzBonawitz2007" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Free play after confounded vs. unconfounded evidence; measure information-seeking behaviors.</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">Preschoolers explore more when evidence is ambiguous/confounded; some spontaneously isolate variables.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Transitive inference (memory demands) </text><cite class="ltx_citemacro_cite"><bibref bibrefs="BryantTrabasso1971" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Learn ordered relations (A</text><Math mode="inline" tex="&gt;" text="&gt;" xml:id="S2.T2.m3">
                <XMath>
                  <XMTok fontsize="90%" meaning="greater-than" role="RELOP">&gt;</XMTok>
                </XMath>
              </Math><text fontsize="90%">B, B</text><Math mode="inline" tex="&gt;" text="&gt;" xml:id="S2.T2.m4">
                <XMath>
                  <XMTok fontsize="90%" meaning="greater-than" role="RELOP">&gt;</XMTok>
                </XMath>
              </Math><text fontsize="90%">C); infer A</text><Math mode="inline" tex="&gt;" text="&gt;" xml:id="S2.T2.m5">
                <XMath>
                  <XMTok fontsize="90%" meaning="greater-than" role="RELOP">&gt;</XMTok>
                </XMath>
              </Math><text fontsize="90%">C, especially with memory supports.</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">Young children can succeed when memory demands are reduced.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Counterfactual syllogisms in pretend contexts </text><cite class="ltx_citemacro_cite"><bibref bibrefs="DiasHarris1988" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Deduction from counterfactual premises (“All bears fly…”); compare standard vs. make-believe framing.</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">Preschoolers more often follow logic when explicitly invited to pretend rather than defaulting to real-world knowledge.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Category-based induction over appearance </text><cite class="ltx_citemacro_cite"><bibref bibrefs="GelmanMarkman1986" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Project a novel property to a target: category match vs. perceptual similarity conflict.</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">By around age 4, children often privilege category membership over surface similarity.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Class inclusion / wording effects </text><cite class="ltx_citemacro_cite"><bibref bibrefs="Shipley1979" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Part–whole comparisons (“more dogs or more animals?”); manipulate question form and distributive vs. collective readings.</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">Performance improves substantially with pragmatically clearer wording that matches ordinary English usage.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Causal counterfactual “what-if” </text><cite class="ltx_citemacro_cite"><bibref bibrefs="HarrisGermanMills1996" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Simple causal narratives; ask what would happen if a key cause were altered.</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">3–4-year-olds answer simple causal counterfactuals correctly.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Counterfactuals as alternative possibilities </text><cite class="ltx_citemacro_cite"><bibref bibrefs="BeckEtAl2006" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Compare answering a single counterfactual vs. representing multiple possible alternatives.</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">Younger children often revert to reality; more consistent “possibility” reasoning emerges later.</text></td>
          </tr>
          <tr>
            <td align="justify" width="121.4pt"><text fontsize="90%">Control of Variables Strategy (CVS) </text><cite class="ltx_citemacro_cite"><bibref bibrefs="ChenKlahr1999" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" width="156.1pt"><text fontsize="90%">Multi-factor experiments; test whether the child varies one variable at a time and transfers the strategy.</text></td>
            <td align="justify" width="130.1pt"><text fontsize="90%">8–10-year-olds acquire and transfer CVS only with training.</text></td>
          </tr>
          <tr>
            <td align="justify" border="bb" width="121.4pt"><text fontsize="90%">Conservation with accidental vs. intentional transformation </text><cite class="ltx_citemacro_cite"><bibref bibrefs="McGarrigleDonaldson1974" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
                  <bibrefphrase><text fontsize="90%">(</text></bibrefphrase>
                  <bibrefphrase><text fontsize="90%">)</text></bibrefphrase>
                </bibref></cite></td>
            <td align="justify" border="bb" width="156.1pt"><text fontsize="90%">Conservation of number under “naughty teddy” accidental change vs. adult intentional change.</text></td>
            <td align="justify" border="bb" width="130.1pt"><text fontsize="90%">Many more 4–6-year-olds conserve under accidental transformations, implicating pragmatic/intent interpretations.</text></td>
          </tr>
        </tbody>
      </tabular>
      <toccaption class="ltx_centering"><tag close=" "><text fontsize="90%">2</text></tag><text fontsize="90%"> </text><text font="smallcaps" fontsize="90%">BabyReasoningBench</text><text fontsize="90%"> tasks: developmental paradigms that motivate the evaluation of baby language models, with the performance patterns children show in the corresponding behavioral studies.</text></toccaption>
      <caption class="ltx_centering" fontsize="90%"><tag close=": ">Table 2</tag> <text font="smallcaps">BabyReasoningBench</text> tasks: developmental paradigms that motivate the evaluation of baby language models, with the performance patterns children show in the corresponding behavioral studies.</caption>
    </table>
  </section>
  <section inlist="toc" xml:id="S3">
    <tags>
      <tag>3</tag>
      <tag role="autoref">section 3</tag>
      <tag role="refnum">3</tag>
      <tag role="typerefnum">§3</tag>
    </tags>
    <title><tag close=" ">3</tag>Results</title>
    <para xml:id="S3.p1">
      <p>Table <ref labelref="LABEL:tab:babylm_two_models_tasks"/> and Figure <ref labelref="LABEL:fig:spider"/> report task-wise accuracy for two BabyLM GPT-2 baselines. Overall performance is moderate for both models, and the two scales differ only slightly: average accuracy is <text font="bold">52.15%</text> for the 10M model and <text font="bold">53.59%</text> for the 100M model, a gap of 1.44 percentage points. Both models solve a meaningful subset of the benchmark, but accuracy is highly non-uniform across paradigms.
<break/></p>
    </para>
    <paragraph inlist="toc" xml:id="S3.SS0.SSS0.Px1">
      <title>Scaling yields selective gains, especially on some physical, causal, and analogy tasks.</title>
      <para xml:id="S3.SS0.SSS0.Px1.p1">
        <p>The 100M model outperforms the 10M model on several tasks that involve local causal structure or relational transfer. It reaches <text font="bold">100.00%</text> on <text font="typewriter">physical-cause-effect</text> (vs. 63.64% at 10M) and improves on <text font="typewriter">blicket-detector-inference</text> (63.64<Math mode="inline" tex="\rightarrow" text="rightarrow" xml:id="S3.SS0.SSS0.Px1.p1.m1">
            <XMath>
              <XMTok name="rightarrow" role="ARROW">→</XMTok>
            </XMath>
          </Math>72.73), <text font="typewriter">simple-causal-analogy</text> (54.55<Math mode="inline" tex="\rightarrow" text="rightarrow" xml:id="S3.SS0.SSS0.Px1.p1.m2">
            <XMath>
              <XMTok name="rightarrow" role="ARROW">→</XMTok>
            </XMath>
          </Math>72.73), <text font="typewriter">simple-counterfactual-causal</text> (18.18<Math mode="inline" tex="\rightarrow" text="rightarrow" xml:id="S3.SS0.SSS0.Px1.p1.m3">
            <XMath>
              <XMTok name="rightarrow" role="ARROW">→</XMTok>
            </XMath>
          </Math>63.64), and <text font="typewriter">analogical-problem-solving</text> (0.00<Math mode="inline" tex="\rightarrow" text="rightarrow" xml:id="S3.SS0.SSS0.Px1.p1.m4">
            <XMath>
              <XMTok name="rightarrow" role="ARROW">→</XMTok>
            </XMath>
          </Math>36.36). It also performs strongly on <text font="typewriter">violation-of-expectation-false-belief</text> (90.91%). These gains suggest that additional pre-training data can help on tasks that require short-range evidence integration or belief-consistent prediction.
<break/></p>
      </para>
    </paragraph>
    <paragraph inlist="toc" xml:id="S3.SS0.SSS0.Px2">
      <title>At the same time, the benchmark does not show monotonic scaling.</title>
      <para xml:id="S3.SS0.SSS0.Px2.p1">
        <p>The 10M model matches or exceeds the 100M model on several tasks, including <text font="typewriter">counterfactual-possibilities</text>, <text font="typewriter">control-of-variables-strategy</text>, <text font="typewriter">counterfactual-syllogism-pretend</text>, <text font="typewriter">causal-structure-learning</text>, <text font="typewriter">class-inclusion-wording</text>, <text font="typewriter">conservation-of-number-accidental</text>, and <text font="typewriter">exploratory-play-causal</text>. The largest reversal appears in <text font="typewriter">causal-structure-learning</text>, where the 10M model scores 81.82% and the 100M model only 18.18%, a gap of 63.64 percentage points. This pattern indicates that developmentally grounded reasoning performance is not well captured by a simple “larger is better” account, and may depend strongly on task framing and model-specific heuristics.
<break/></p>
      </para>
    </paragraph>
    <paragraph inlist="toc" xml:id="S3.SS0.SSS0.Px3">
      <title>Belief attribution and pragmatics-sensitive reasoning are fragile rather than absent.</title>
      <para xml:id="S3.SS0.SSS0.Px3.p1">
        <p>The rerun results also revise the interpretation of theory-of-mind-related tasks. Both models are now clearly above floor on explicit false-belief paradigms: <text font="typewriter">false-belief-sally-anne</text> is 63.64% for both models, <text font="typewriter">false-belief-vignette-battery</text> reaches 63.64% and 81.82%, and <text font="typewriter">false-belief-unexpected-transfer</text> reaches 36.36% and 63.64%. Likewise, <text font="typewriter">class-inclusion-wording</text> is no longer at floor, though performance remains modest. This suggests that these models can exhibit some belief-sensitive and pragmatics-sensitive behavior, but that such competence is unstable across formats rather than robustly generalized.</p>
      </para>
      <para xml:id="S3.SS0.SSS0.Px3.p2">
        <p>Overall, <text font="smallcaps">BabyReasoningBench</text> continues to reveal strong dissociations across developmental reasoning paradigms, but the updated results point to a more nuanced conclusion: both models show broader competence, yet overall performance remains moderate, and scaling yields only selective and inconsistent benefits. This reinforces the value of fine-grained developmental evaluation for distinguishing robust reasoning abilities from brittle task-specific successes.</p>
      </para>
    </paragraph>
  </section>
  <section inlist="toc" xml:id="S4">
    <tags>
      <tag>4</tag>
      <tag role="autoref">section 4</tag>
      <tag role="refnum">4</tag>
      <tag role="typerefnum">§4</tag>
    </tags>
    <title><tag close=" ">4</tag>Conclusion</title>
    <para xml:id="S4.p1">
      <p>We introduced <text font="bold smallcaps">BabyReasoningBench</text>, a benchmark of simple developmental reasoning tasks, inspired by classic experimental paradigms from developmental psychology and designed to evaluate <emph font="italic">baby language models</emph>. Because most of these tasks, when framed textually, are trivial for modern frontier models, they are especially useful as a diagnostic benchmark for small, developmentally trained models. The benchmark operationalizes theory of mind, analogical reasoning, causal inference, and several core reasoning primitives through multiple-choice questions, with task variants generated using an LLM (GPT-5.2) to systematically vary surface form while preserving the underlying logic.</p>
    </para>
    <para xml:id="S4.p2">
      <p>Baseline results on two BabyLM GPT-2 models show that (i) performance is substantially below ceiling overall, (ii) increases in pre-training data yield selective gains on several causal and physical reasoning tasks, and (iii) performance on belief-attribution and pragmatics-sensitive tasks is fragile and highly format-dependent rather than uniformly absent. The task-level dissociations, including modest performance on class-inclusion wording and inconsistent outcomes across false-belief formats, underscore why fine-grained, developmentally grounded evaluation is valuable: it can separate reasoning competence from confounds tied to language form, framing, and inhibitory demands, and it can expose brittle generalization that would be invisible in aggregate benchmarks.</p>
    </para>
    <para xml:id="S4.p3">
      <p><text font="smallcaps">BabyReasoningBench</text> is intended as a diagnostic instrument rather than a leaderboard-only dataset. Future work could (a) expand the tasks to cover other reasoning capabilities, (b) incorporate controlled manipulations of wording, memory load, and discourse context to disentangle pragmatics from reasoning, and (c) extend the suite to multimodal and interactive settings (e.g., simplified visual scenes and intervention selection) that more closely match the evidence available to infants and young children in behavioral experiments. We hope this benchmark supports more mechanism-sensitive evaluation of developmentally trained models and fine-grained empirical study of which reasoning abilities can emerge from child-like input distributions.</p>
    </para>
  </section>
  <section inlist="toc" xml:id="S5">
    <tags>
      <tag>5</tag>
      <tag role="autoref">section 5</tag>
      <tag role="refnum">5</tag>
      <tag role="typerefnum">§5</tag>
    </tags>
    <title><tag close=" ">5</tag>Limitations</title>
    <para xml:id="S5.p1">
      <p>Our work has some limitations, primarily stemming from the verbalisation of the tasks. While some tasks, such as analogies and counterfactuals, retain their essence in textual form, others, such as the Sally-Anne task and the blicket task, may arguably be better assessed with multimodal input, as in their behavioral counterparts.</p>
    </para>
    <para xml:id="S5.p2">
      <p>For some tasks, the MCQs may cover only a subset of the actual reasoning behavior. For instance, the blicket task evaluates causal inference using indirect evidence but not disjunctive or conjunctive inference. We leave further fine-grained analysis and additional MCQs for subsequent work.</p>
    </para>
  </section>
  <section inlist="toc" xml:id="S6">
    <tags>
      <tag>6</tag>
      <tag role="autoref">section 6</tag>
      <tag role="refnum">6</tag>
      <tag role="typerefnum">§6</tag>
    </tags>
    <title><tag close=" ">6</tag>Ethical Considerations</title>
    <para xml:id="S6.p1">
      <p>The paper was written with the help of OpenAI ChatGPT 5.2 and Grammarly. The ideas in the paper are solely the author's.</p>
    </para>
  </section>
  <section xml:id="Sx1">
    <title>Acknowledgements</title>
    <para xml:id="Sx1.p1">
      <p>The author thanks Michael C. Frank (Department of Psychology, Stanford University) for helpful initial discussions.</p>
    </para>
  </section>
  <bibliography citestyle="authoryear" files="mypaper" xml:id="bib">
    <title>References</title>
  </bibliography>
<!--  %\appendix --></document>
