<?xml version="1.0" encoding="UTF-8"?>
<?latexml searchpaths="/home/japhy/scienceReplication.artiswrong.com/paper_files/arxiv/2602.14910/latex_extracted"?>
<?latexml class="article"?>
<?latexml package="fontenc" options="T1"?>
<?latexml package="inputenc" options="utf8"?>
<?latexml package="microtype"?>
<?latexml package="graphicx"?>
<?latexml package="subcaption"?>
<?latexml package="booktabs"?>
<?latexml package="hyperref"?>
<?latexml package="icml2026" options="preprint"?>
<?latexml package="amsmath"?>
<?latexml package="amssymb"?>
<?latexml package="mathtools"?>
<?latexml package="amsthm"?>
<?latexml package="cleveref" options="capitalize,noabbrev"?>
<?latexml package="todonotes" options="textsize=tiny"?>
<?latexml RelaxNGSchema="LaTeXML"?>
<document xmlns="http://dlmf.nist.gov/LaTeXML">
  <resource src="LaTeXML.css" type="text/css"/>
  <resource src="ltx-article.css" type="text/css"/>
  <title>Position: Introspective Experience from Conversational Environments as a Path to Better Learning</title>
  <creator role="author">
    <personname>Claudiu Cristian Musat</personname>
  </creator>
  <creator before="  " role="author">
    <personname>Jackson Tolins</personname>
  </creator>
  <creator before="  " role="author">
    <personname>Diego Antognini</personname>
  </creator>
  <creator before="  " role="author">
    <personname>Jingling Li</personname>
  </creator>
  <creator before="  " role="author">
    <personname>Martin Klissarov</personname>
  </creator>
  <creator before="  " role="author">
    <personname>Tom Duerig</personname>
  </creator>
  <abstract name="Abstract">
    <p>Current approaches to AI training treat reasoning as an emergent property of scale. We argue instead that robust reasoning emerges from linguistic self-reflection, itself internalized from high-quality social interaction. Drawing on Vygotskian developmental psychology, we advance three core positions centered on Introspection. First, we argue for the <text font="bold">Social Genesis of the Private Mind</text>: learning from conversational environments rises to prominence as a new way to make sense of the world; the friction of aligning with another agent—internal or not—refines and crystallizes the reasoning process. Second, we argue that <text font="bold">dialogically scaffolded introspective experiences</text> allow agents to engage in sense-making that decouples learning from immediate data streams, transforming raw environmental data into rich, learnable narratives. Finally, we contend that <text font="bold">Dialogue Quality is the New Data Quality</text>: the depth of an agent’s private reasoning, and its efficiency regarding test-time compute, is determined by the diversity and rigor of the dialogues it has mastered. We conclude that optimizing these conversational scaffolds is the primary lever for the next generation of general intelligence.
</p>
  </abstract>
  <keywords>Machine Learning, ICML</keywords>
  <para xml:id="p2">
    <break/>
  </para>
  <section inlist="toc" xml:id="S1">
    <tags>
      <tag>1</tag>
      <tag role="autoref">section 1</tag>
      <tag role="refnum">1</tag>
      <tag role="typerefnum">§1</tag>
    </tags>
    <title><tag close=" ">1</tag>Introduction — Time to Try Again</title>
    <figure inlist="lof" labels="LABEL:fig:main_figure" placement="!ht" xml:id="S1.F1">
      <tags>
        <tag><text fontsize="90%">Figure 1</text></tag>
        <tag role="autoref">Figure 1</tag>
        <tag role="refnum">1</tag>
        <tag role="typerefnum">Figure 1</tag>
      </tags>
<!--  %First␣Subfigure -->      <figure align="center" inlist="lof" labels="LABEL:fig:sub1" placement="b" xml:id="S1.F0.sf1">
        <tags>
          <tag><text fontsize="90%">i)</text></tag>
          <tag role="autoref">i)</tag>
          <tag role="refnum">0i)</tag>
        </tags>
        <graphics candidates="fig1_sub1.pdf" class="ltx_centering" graphic="fig1_sub1.pdf" options="width=346.896pt" xml:id="S1.F0.sf1.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">i)</tag><text font="bold">The Internalization Cycle</text>: From Public Debate to Private Reasoning. The <text font="bold">Polyphonic Self</text> is formed by internalizing the dynamics of external social friction.</toccaption>
        <caption class="ltx_centering"><tag close=" "><text fontsize="90%">i)</text></tag><text font="bold" fontsize="90%">The Internalization Cycle<text font="medium">: From Public Debate to Private Reasoning. The </text>Polyphonic Self<text font="medium"> is formed by internalizing the dynamics of external social friction.</text></text></caption>
      </figure>
<!--  %Adjust␣vertical␣spacing␣between␣images 
     %Second␣Subfigure-->      <figure align="center" inlist="lof" labels="LABEL:fig:sub2" placement="b" xml:id="S1.F0.sf2">
        <tags>
          <tag><text fontsize="90%">ii)</text></tag>
          <tag role="autoref">ii)</tag>
          <tag role="refnum">0ii)</tag>
        </tags>
        <graphics candidates="fig1_sub2.pdf" class="ltx_centering" graphic="fig1_sub2.pdf" options="width=433.62pt" xml:id="S1.F0.sf2.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">ii)</tag><text font="bold">The Sense-Making Wedge</text>: Decoupling Learning from Data Streams. Instead of consuming raw observations (a), the agent actively generates a rich narrative structure (a synthetic experience) and learns from that interpretation (b).
</toccaption>
        <caption class="ltx_centering"><tag close=" "><text fontsize="90%">ii)</text></tag><text font="bold" fontsize="90%">The Sense-Making Wedge<text font="medium">: Decoupling Learning from Data Streams. Instead of consuming raw observations (a), the agent actively generates a rich narrative structure (a synthetic experience) and learns from that interpretation (b).
</text></text></caption>
      </figure>
<!--  %Adjust␣vertical␣spacing␣between␣images 
     %Third␣Subfigure-->      <figure align="center" inlist="lof" labels="LABEL:fig:sub3" placement="b" xml:id="S1.F0.sf3">
        <tags>
          <tag><text fontsize="90%">iii)</text></tag>
          <tag role="autoref">iii)</tag>
          <tag role="refnum">0iii)</tag>
        </tags>
        <graphics candidates="fig1_sub3.pdf" class="ltx_centering" graphic="fig1_sub3.pdf" options="width=325.215pt" xml:id="S1.F0.sf3.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">iii)</tag><text font="bold">Dialogue Quality is the New Data Quality</text>: Dialogue Quality Bounds Reasoning Quality. A sycophantic external environment leads to a hallucinating internal critic; a rigorous adversarial environment creates a robust internal reasoner.</toccaption>
        <caption class="ltx_centering"><tag close=" "><text fontsize="90%">iii)</text></tag><text font="bold" fontsize="90%">Dialogue Quality is the New Data Quality<text font="medium">: Dialogue Quality Bounds Reasoning Quality. A sycophantic external environment leads to a hallucinating internal critic; a rigorous adversarial environment creates a robust internal reasoner.</text></text></caption>
      </figure>
      <toccaption class="ltx_centering"><tag close=" ">1</tag>Illustration of the three positions.</toccaption>
      <caption class="ltx_centering"><tag close=": "><text fontsize="90%">Figure 1</text></tag><text fontsize="90%">Illustration of the three positions.</text></caption>
    </figure>
<!--  %Between␣2015␣and␣2020,␣the␣field␣of␣Artificial␣Intelligence␣was␣captivated␣by␣a␣specific␣vision␣of␣General␣Intelligence:␣scaling␣Reinforcement␣Learning␣(RL)␣in␣complex␣environments.␣Leading␣laboratories,␣including␣OpenAI␣and␣DeepMind,␣poured␣resources␣into␣grand␣challenges␣like␣Dota␣2␣\cite{berner2019dota}␣and␣StarCraft␣II␣\cite{vinyals2017starcraft,␣vinyals2019grandmaster}.␣The␣implicit␣hypothesis␣was␣that␣if␣an␣agent␣could␣master␣the␣high-dimensional,␣long-horizon␣strategy␣of␣these␣games␣solely␣through␣trial␣and␣error,␣it␣would␣naturally␣acquire␣the␣transferable␣skills␣necessary␣for␣real-world␣deployment.␣The␣generality␣was␣expected␣to␣come␣from␣environment␣diversity␣__␣and␣agents␣that␣perform␣well␣across␣training␣environments␣would␣become␣general␣ones. -->    <para xml:id="S1.p1">
      <p>Between 2015 and 2020, AI research prioritized scaling Reinforcement Learning (RL) in complex environments (e.g., Dota 2 <cite class="ltx_citemacro_citep">(<bibref bibrefs="berner2019dota" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, StarCraft II <cite class="ltx_citemacro_citep">(<bibref bibrefs="vinyals2017starcraft,vinyals2019grandmaster" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>), hypothesizing that mastering high-dimensional, long-horizon gaming strategy would yield generalizable skills for real-world deployment.
<!--  %By␣2021,␣however,␣this␣direction␣had␣largely␣stalled.␣OpenAI␣disbanded␣its␣robotics␣team,␣and␣the␣industry␣pivoted␣toward␣generative␣modeling.␣The␣consensus␣was␣that␣the␣prerequisites␣for␣scaling␣RL␣were␣simply␣not␣in␣place.␣Retrospectively,␣the␣failure␣was␣grounded␣in␣the␣\textit{Tabula␣Rasa}␣(blank␣slate)␣fallacy.␣Agents␣were␣forced␣to␣learn␣fundamental␣world␣concepts—object␣permanence,␣physics,␣and␣cause-and-effect—entirely␣from␣scratch␣through␣billions␣of␣simulated␣steps. -->However, by 2021, this direction had largely stalled because of the <text font="italic">Tabula Rasa</text> (blank slate) fallacy: forcing agents to learn fundamental world concepts—object permanence, physics, and cause-and-effect—entirely from scratch is prohibitively inefficient.
As Wojciech Zaremba of OpenAI noted upon the dissolution of their robotics team
<note mark="1" role="footnote" xml:id="footnote1"><tags>
            <tag>1</tag>
            <tag role="autoref">footnote 1</tag>
            <tag role="refnum">1</tag>
            <tag role="typerefnum">footnote 1</tag>
          </tags>https://venturebeat.com/ai/openai-disbands-its-robotics-research-team</note>, pre-learned representations can make learning “100 times cheaper”. The agents were trying to bake the cake with only the cherry on top, missing the bulk of unsupervised world knowledge that Yann LeCun famously argued was essential <note mark="2" role="footnote" xml:id="footnote2"><tags>
            <tag>2</tag>
            <tag role="autoref">footnote 2</tag>
            <tag role="refnum">2</tag>
            <tag role="typerefnum">footnote 2</tag>
          </tags>Keynote address at
the Conference on Neural Information Processing Systems (NIPS), 2016. https://www.youtube.com/watch?v=Ount2Y4qxQo</note>. <!--  %\cite{lecun2016predictive}. --></p>
    </para>
    <para xml:id="S1.p2">
      <p><text font="bold">From raw observations to learnable experiences</text>. The arrival of Large Language Models (LLMs) and Vision-Language Models (VLMs) has largely addressed the <text font="italic">Tabula Rasa</text> (initialization) problem by providing agents with pre-trained semantic priors about the world.
However, simply grafting an LLM onto an RL loop is insufficient. We argue that robust learning requires a mechanism to convert sparse observations into rich, internal experiences.
<!--  %****␣main.tex␣Line␣175␣**** 
     %Modern␣agents␣no␣longer␣start␣from␣zero;␣they␣begin␣with␣a␣pre-trained,␣semantic␣understanding␣of␣the␣world.␣We␣now␣possess␣the␣broadly␣capable␣VLMs␣that␣were␣missing␣a␣decade␣ago.
     %However,␣simply␣grafting␣an␣LLM␣onto␣an␣RL␣loop␣is␣not␣a␣guarantee␣that␣the␣experience␣is␣easy␣to␣learn␣from.␣We␣argue␣that␣learning␣can␣be␣improved␣through␣a␣deliberate␣push␣to␣generate␣rich␣internal␣experiences␣around␣sparse␣external␣observations.--></p>
    </para>
    <para xml:id="S1.p3">
      <p><text font="bold">Collaborative Sense-Making.</text> A missing prerequisite is a mechanism to convert raw observations into rich, learnable experiences. By generating a rich internal narrative around external events, an agent can create a synthetic experience that is more information-dense and learnable than the raw data itself. Having to navigate the friction of external, collaborative sense-making with other social agents can promote clarity in an agent’s own representation of the world and, more broadly, in its general thinking process.</p>
    </para>
    <para xml:id="S1.p4">
      <p>In this proposed social paradigm, the agent engages in the dynamics of social environments, in which achieving a positive shared outcome requires negotiation, repair, and critique. These dynamics are internalized as a high-fidelity, self-generated curriculum. The underlying principle is that it is harder to discover, debate, or refine facts socially than to simply memorize and reproduce those facts in a vacuum.</p>
    </para>
    <para xml:id="S1.p5">
      <p>The availability of diverse, high-quality observations that cover as much as possible of the agent’s desired behaviors remains important, as it defines <text font="italic">what</text> the agent thinks about. It is complemented by the richness of the social scaffolds that can be built on those observations. The depth of an agent’s private reasoning becomes bounded by the diversity and rigor of the public dialogues it has previously mastered, which teach the agent <text font="italic">how</text> to reason about novel observations.</p>
    </para>
    <para xml:id="S1.p6">
      <p><text font="bold">Internalizing social interaction.</text>
<!--  %We␣argue␣that␣the␣true␣unlock␣provided␣by␣modern␣reasoning␣models␣is␣not␣merely␣through␣better␣rewards␣or␣denser␣signals,␣but␣the␣emergence␣of␣an␣internal␣dialogue,␣what␣we␣will␣call␣\textit{introspection}␣to␣distinguish␣from␣multi-agent␣frameworks. -->Vygotsky writes of human development <cite class="ltx_citemacro_citep">(<bibref bibrefs="Vygotsky1978" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>: <text font="italic">Every function in the child’s cultural development appears twice: first, on the social level, and later, on the individual level; first, between people (interpsychological), and then inside the child (intrapsychological)</text>. We believe this to be the next frontier in developing robust and efficient reasoning capabilities in AI.</p>
    </para>
<!--  %Recent␣work␣like␣\textit{Reflexion}␣\citet{shinn2023reflexionlanguageagentsverbal}␣has␣demonstrated␣the␣utility␣of␣verbal␣reinforcement␣for␣self-correction;␣we␣argue␣that␣this␣mindset␣can␣be␣extended␣to␣high-fidelity␣introspection␣as␣an␣internalized␣social␣capability.␣While␣\textit{Reflexion}␣relies␣on␣a␣solipsistic,␣prompt-engineered␣monologue␣designed␣for␣reactive␣error␣correction,␣essentially␣a␣debugging␣loop␣to␣maximize␣reward␣signals,␣we␣view␣introspection␣as␣a␣socially␣derived␣faculty. -->    <para xml:id="S1.p7">
      <p><text font="italic">Reflexion</text> <cite class="ltx_citemacro_citet"><bibref bibrefs="shinn2023reflexionlanguageagentsverbal" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
            <bibrefphrase>(</bibrefphrase>
            <bibrefphrase>)</bibrefphrase>
          </bibref></cite> demonstrated verbal reinforcement for self-correction; we extend this to high-fidelity introspection as an internalized social capability. While <text font="italic">Reflexion</text> relies on solipsistic, prompt-engineered debugging, we view introspection as a socially derived faculty.
Drawing on Vygotskian principles <cite class="ltx_citemacro_citep">(<bibref bibrefs="Vygotsky1978" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, our framework treats internal dialogue not as an architectural given, but as the internalized artifact of public, polyphonic debate. Consequently, rather than functioning merely to repair failed trajectories, this form of introspection is generative: it transforms sparse observations into dense, synthetic narratives, allowing the agent to <text font="bold">hallucinate lived experiences</text> independent of immediate failure signals.</p>
    </para>
    <para xml:id="S1.p8">
      <p>This dialogic thinking allows an agent to decouple learning from the immediate external stream. Instead of passively receiving data, the agent actively narrates, debates, and interprets.
In this new paradigm, the internal dialogue becomes the experience, from which the agent then learns.
When an introspective agent encounters a new situation, it does not just catalog the inputs (words, pixels, or, more broadly, tokens); it engages in a sense-making process. It simulates, critiques, and effectively hallucinates a lived experience surrounding the external data point. This process functions as a form of structural coupling <cite class="ltx_citemacro_citep">(<bibref bibrefs="maturana1987tree,varela1991embodied" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, where the environment does not provide information, but rather triggers the agent to bring forth its own world of meaning. This is the shift <text font="bold">from Learning from Observation to Learning from Interpretation.</text></p>
    </para>
    <para xml:id="S1.p9">
      <p>We thus advance three core positions, outlined in Figure 1:</p>
    </para>
    <para xml:id="S1.p10">
      <p><text font="bold">Position I</text>: The Social Genesis of the Private Mind. We posit that high-quality reasoning is not an innate architectural feature, but the internalized result of experience in collaborative, multi-agent conversational environments. We argue that an agent can learn to think by first learning to interact, reflecting the kinds of socially scaffolded cognitive processes documented in the course of human development <cite class="ltx_citemacro_citep">(<bibref bibrefs="fb093503-824a-37da-8540-c369a5dfdf71,Vygotsky1978" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. The <text font="bold">polyphonic self</text> — the internal negotiation between critic, planner, and speaker <cite class="ltx_citemacro_citep">(<bibref bibrefs="hermans1992dialogical,bakhtin1984problems" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> — is a direct reflection of the social friction and collaborative sense-making the social agent encounters in its external environment.
</p>
    </para>
    <para xml:id="S1.p11">
      <p><text font="bold">Position II</text>: The Imperative of Introspective Experience. Scaling RL environments needs a fundamental architectural shift. We can go beyond building agents that merely survive their environments through statistical correlation; we can build agents that experience them through semantic interpretation. We propose that <text font="bold">Introspection</text>, the ability to generate a rich internal narrative around an observation, first developed within collaborative sense-making, is the missing prerequisite that transforms sparse environmental data into the dense, high-utility synthetic experience required for building general intelligence.</p>
    </para>
    <para xml:id="S1.p12">
      <p><text font="bold">Position III</text>: Dialogue Quality is the New Data Quality. Following from the first two positions, we argue that the path to robust introspection lies in scaling the diversity and complexity of the agent’s dialogic experience. If agents learn from their internal experience, and that experience is based on social internalization, then the quality of learning depends on the quality of the internal dialogue. Conversational environments act as a universal scaffold for reasoning, including repairing logic, resolving ambiguity, and synthesizing perspectives. We note that in embodied settings, this universality carries a multimodal vocabulary, where physical actions like shifting a viewpoint to resolve occlusion function as non-verbal parts of the reasoning loop.
Dialogic training, that is, optimizing the structure, diversity, and rigor of an agent’s collaborative sense-making, must become a primary lever for progress.</p>
    </para>
  </section>
  <section inlist="toc" xml:id="S2">
    <tags>
      <tag>2</tag>
      <tag role="autoref">section 2</tag>
      <tag role="refnum">2</tag>
      <tag role="typerefnum">§2</tag>
    </tags>
    <title><tag close=" ">2</tag>The Rise of Experiential Learning</title>
    <para xml:id="S2.p1">
      <p>Elaborating on our first position, the <text font="bold">Social Genesis of the Private Mind</text>, we draw on insights from language-mediated cooperation in humans and on recent innovations in multi-turn reinforcement learning. We frame this proposal as an opportunity to orient model training away from passive observation and toward <text font="italic">experiential learning via conversation</text>. Here, the social interactions described in Position I become the architectural wedge that improves model reasoning and performance.</p>
    </para>
    <subsection inlist="toc" xml:id="S2.SS1">
      <tags>
        <tag>2.1</tag>
        <tag role="autoref">subsection 2.1</tag>
        <tag role="refnum">2.1</tag>
        <tag role="typerefnum">§2.1</tag>
      </tags>
      <title><tag close=" ">2.1</tag>Denser Observations</title>
      <para xml:id="S2.SS1.p1">
        <p>In the classical view, learning is a direct function of observation: the agent acts, the environment returns a raw state, and the agent updates its weights. Currently, most signals drawn directly from observations lack the causal structure and semantic richness required for rapid generalization <cite class="ltx_citemacro_citep">(<bibref bibrefs="DBLP:journals/corr/abs-2102-11107" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. Two limitations illustrate this. First, agents overfit to spurious correlations (e.g., ‘<text font="italic">blue sky causes safe acceleration</text>’), which collapse under distribution shift <cite class="ltx_citemacro_citep">(<bibref bibrefs="cunha2025unifying" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. Second, generalization suffers without explicit causal modeling between states and rewards.
<!--  %This␣is␣visible␣in␣two␣key␣current␣limitations:␣First,␣agents␣trained␣on␣raw␣observations␣often␣overfit␣to␣spurious␣correlations,␣such␣as␣a␣self-driving␣car␣learning␣that␣‘\textit{blue␣sky␣causes␣safe␣acceleration},’␣which␣collapse␣when␣the␣environment␣changes␣(e.g.,␣cloudy␣weather)␣\cite{cunha2025unifying}. 
     %Second,␣the␣ability␣to␣generalize␣suffers.Algorithms␣like␣Causal␣Information␣Prioritization␣(CIP)␣demonstrate␣that␣agents␣must␣explicitly␣model␣the␣‘cause-and-effect’␣relationships␣between␣states␣and␣rewards␣to␣generalize␣effectively.␣Only␣by␣identifying␣causal␣factors,␣rather␣than␣just␣processing␣raw␣pixel␣transitions,␣can␣agents␣ignore␣irrelevant␣noise␣and␣adapt␣to␣new␣environments␣\cite{cao2025causal}.--></p>
      </para>
      <para xml:id="S2.SS1.p2">
        <p>We therefore highlight the need for new ways to turn raw observations into rich, learnable experiences; conversation is one promising direction.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S2.SS2">
      <tags>
        <tag>2.2</tag>
        <tag role="autoref">subsection 2.2</tag>
        <tag role="refnum">2.2</tag>
        <tag role="typerefnum">§2.2</tag>
      </tags>
      <title><tag close=" ">2.2</tag>Conversational Environments</title>
      <para xml:id="S2.SS2.p1">
        <p>We define conversational environments as dynamic scaffolds for reasoning in which agents, whether distinct external partners or internal sub-modules, accomplish a shared goal by actively negotiating meaning and contributions and by resolving ambiguity through multi-turn interaction. This represents a fundamental shift away from passive prediction and toward a bridge between learning from static examples and learning from lived experience.</p>
      </para>
<!--  %Traditional␣learning␣methods␣rely␣on␣imitation␣__␣learning␣to␣generate␣a␣specific␣output␣given␣an␣input.␣In␣contrast,␣creating␣a␣conversational␣environment␣allows␣the␣model␣to␣engage␣in␣\textbf{experiential␣sense-making}. -->      <para xml:id="S2.SS2.p2">
        <p>Traditional learning relies on imitation; conversational environments instead enable <text font="bold">experiential sense-making</text>, where multiple agents interact and coordinate to achieve a shared goal.
The agents can be wholly distinct or, in the case of introspection, a single agent holding different perspectives and interacting with itself through multiple synthetic voices.
This can be seen as a form of conversational Self-Play <cite class="ltx_citemacro_citep">(<bibref bibrefs="chen2024selfplayfinetuningconvertsweak" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. However, rather than simply aligning with a fixed target distribution, <text font="bold">conversational self-play</text> requires agents to assume distinct, often adversarial, functional roles to actively generate the training signal.
</p>
      </para>
<!--  %We␣are␣thus␣moving␣toward␣\textbf{Cultural␣Learning},␣where␣agents␣accumulate␣knowledge␣elements␣through␣simulated␣social␣interactions␣rather␣than␣just␣tuning␣parameters.␣Recent␣research␣by␣\citet{liu2025cultural}␣demonstrates␣that␣model␣alignment␣is␣moving␣from␣static␣instruction␣tuning␣toward␣dynamic,␣cultural␣transmission.␣By␣role-playing␣within␣culturally␣adapted␣scenarios,␣agents␣acquire␣values␣and␣behaviors␣through␣a␣process␣of␣indirect␣reciprocity. -->      <para xml:id="S2.SS2.p3">
        <p>We are moving toward <text font="bold">Cultural Learning</text>: agents accumulate knowledge through simulated social interactions rather than through parameter tuning alone. <cite class="ltx_citemacro_citet"><bibref bibrefs="liu2025cultural" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite> show that alignment is shifting from static instruction tuning toward dynamic cultural transmission, where agents acquire values through role-playing and indirect reciprocity.
Similarly, <cite class="ltx_citemacro_citet"><bibref bibrefs="vallinder2024cultural" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite> argue that this process allows a society of agents to evolve cooperative norms that persist across generations, effectively creating a Meta-Teacher composed of the evolving consensus of the agent population. This Meta-Teacher serves as a more robust and adaptive learning signal than any static corpus.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S2.SS3">
      <tags>
        <tag>2.3</tag>
        <tag role="autoref">subsection 2.3</tag>
        <tag role="refnum">2.3</tag>
        <tag role="typerefnum">§2.3</tag>
      </tags>
      <title><tag close=" ">2.3</tag>Creating Social Experiences</title>
      <para xml:id="S2.SS3.p1">
        <p>Crucially, the transition from imitation to collaborative sense-making requires agents to encounter <text font="bold">social friction</text> rather than optimized, frictionless data. Reasoning is forged in the heat of disagreement, misalignment, and coordination. Evidence for this is provided by the <text font="italic">Collaborative Reasoner</text> (Coral) framework developed by <cite class="ltx_citemacro_citet"><bibref bibrefs="ni2025collaborative" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite>, which demonstrates that training agents on conflict-resolution trajectories, where they must navigate disagreements and actively convince partners of a logic path, yields reasoning gains up to 29% higher than those achieved through solitary Chain-of-Thought paths.</p>
      </para>
      <para xml:id="S2.SS3.p2">
        <p>This supports our core position: the process of aligning with another agent, or a divergent internal voice, forces the model to externalize and refine its reasoning with a level of rigor that internal reflection alone cannot provide. By experiencing the friction of the conversational environment, the agent develops the interactional skills necessary to internalize these dynamics as robust, self-correcting introspection.</p>
      </para>
      <para xml:id="S2.SS3.p3">
        <p>As the field moves toward experiential learning, early attempts to operationalize this have focused on two primary methods: multi-agent debate and explicit role engineering. While both represent progress, they face significant limitations that suggest a deeper mechanism is required.</p>
      </para>
      <para xml:id="S2.SS3.p4">
        <p><text font="bold">Multi-agent debate</text> was the first natural step toward social reasoning. By instantiating separate models to argue a position, researchers hoped to improve accuracy through consensus. However, this approach often falls prey to <text font="italic">Social Convergence</text> traps. Agents frequently succumb to <text font="italic">Groupthink</text>, where they agree with a confident hallucination rather than critiquing it <cite class="ltx_citemacro_citep">(<bibref bibrefs="anonymous2025slmmux" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, or <text font="italic">Confidence Escalation</text>, where they become polarized and overconfident in initial errors <cite class="ltx_citemacro_citep">(<bibref bibrefs="prasad2025llmsdebatethinktheyll" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. The learning signal here is often agreement rather than truth, which fails to provide the rigorous feedback needed for robust reasoning.</p>
      </para>
      <para xml:id="S2.SS3.p5">
        <p><text font="bold">Dual role structure.</text> Recognizing the inefficiency of multi-agent systems, newer architectures have attempted to internalize this friction through a dual-role structure. Frameworks like <text font="italic">Policy as Generative Verifier</text> (PAG) <cite class="ltx_citemacro_citep">(<bibref bibrefs="jiang2025pagmultiturnreinforcedllm" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> and <text font="italic">Single-Pass, Dual-Role</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="cheng25llms" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> models force a single agent to switch between a Speaker and a Critical Evaluator mode. This reduces the network overhead of multiple agents and attempts to induce <text font="italic">cognitive dissonance</text> or a <text font="italic">critical pivot</text> within a single forward pass.</p>
      </para>
      <para xml:id="S2.SS3.p6">
        <p>While this structural innovation, collapsing the social debate into a single model, is the correct architectural direction, it leaves a critical question unanswered: How does the agent learn to be a good critic? Merely assigning a <text font="italic">Critic</text> role does not guarantee Socratic rigor. If the agent has not been trained on high-quality dispute and repair, the internal critic will likely suffer from the same sycophancy as the external peer. This brings us to our core proposal regarding the genealogy of these internal roles.</p>
      </para>
      <para xml:id="S2.SS3.p7">
        <p>To test this, we propose comparing agents trained on solipsistic reasoning traces (standard Chain-of-Thought) with agents trained on traces derived from multi-agent dispute resolution. If the solipsistic baseline achieves equivalent generalization on out-of-distribution tasks, and if the polyphonic structure yields no measurable gain in resolving ambiguity compared to linear deduction, then the hypothesis that high-quality reasoning is effectively internalized social friction would be rejected.</p>
      </para>
    </subsection>
  </section>
  <section inlist="toc" xml:id="S3">
    <tags>
      <tag>3</tag>
      <tag role="autoref">section 3</tag>
      <tag role="refnum">3</tag>
      <tag role="typerefnum">§3</tag>
    </tags>
    <title><tag close=" ">3</tag>Introspective Experiences Surpass Raw Observations</title>
<!--  %We␣argue␣that␣the␣experiential␣learning␣turn␣described␣above␣inevitably␣leads␣to␣\textbf{Position␣II:␣The␣Imperative␣of␣Introspective␣Experience}.␣If␣we␣accept␣that␣reasoning␣is␣forged␣in␣the␣heat␣of␣social␣friction,␣then␣the␣mechanism␣for␣scaling␣this␣capability␣is␣not␣to␣infinitely␣scale␣the␣number␣of␣external␣agents,␣but␣to␣internalize␣that␣friction␣within␣the␣model␣itself.␣This␣process␣allows␣the␣agent␣to␣decouple␣learning␣from␣immediate␣data␣streams,␣transforming␣sparse␣external␣events␣into␣dense,␣self-narrated␣experiences. 
     %****␣main.tex␣Line␣250␣****-->    <para xml:id="S3.p1">
      <p>The experiential learning turn leads to <text font="bold">Position II</text>: if reasoning is forged in social friction, scaling requires internalizing that friction rather than scaling external agents. This allows the agent to decouple learning from data streams, transforming sparse events into dense, self-narrated experiences.</p>
    </para>
<!--  %\section{Introspection} 
     %We␣argue␣that␣the␣experiential␣learning␣turn␣described␣above␣inevitably␣leads␣to␣the␣instantiation␣of␣internalized␣dialogue,␣or␣\textit{introspection}.␣If␣we␣accept␣that␣reasoning␣is␣forged␣in␣the␣heat␣of␣social␣friction,␣then␣the␣mechanism␣for␣scaling␣this␣reasoning␣capability␣is␣not␣to␣infinitely␣scale␣the␣number␣of␣external␣agents,␣but␣to␣internalize␣that␣friction␣within␣the␣model␣itself.-->    <para xml:id="S3.p2">
      <p>Over the past year, introspection has risen to prominence as a primary way for LLMs to reason and self-correct. We survey transformative work in this field and argue that the trend will continue and accelerate. Moreover, we identify a key factor that can further accelerate progress in introspection-based reasoning: dialogue quality.</p>
    </para>
    <para xml:id="S3.p3">
      <p>Early attempts to operationalize this, most notably <text font="italic">Reflexion</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="shinn2023reflexionlanguageagentsverbal" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, found that verbal reinforcement could trigger significant performance gains. By prompting an agent to reflect on its failure and generate a self-correction before retrying, these frameworks proved that the path to better reasoning lies in the iterative critique. However, these early frameworks effectively function as external debugging loops, solipsistic monologues triggered by prompt engineering rather than inherent architectural capability. While effective, they lack the generative richness of a true social genealogy. We argue that an evolution of these frameworks can benefit from moving past a single verification step and towards a full dialogical structure.</p>
    </para>
    <para xml:id="S3.p4">
      <p><text font="bold">We define introspection</text> as the emergence of a dialogic internal state, a process where reasoning is not a monolithic linear deduction, but a series of shifting functional roles (for instance <text font="italic">proposal</text>, <text font="italic">interrogation</text>, and <text font="italic">synthesis</text>).
Rather than a single verification step, the model engages in an internalized social negotiation, where a form of inner speech acts as a tool for the model to encounter its own uncertainty from a divergent perspective. This polyphony is not necessarily represented by explicit persona labels, but by the multi-vocal nature of the reasoning trace, which mirrors the repair and calibration dynamics of a high-quality external conversation. This extends simple verification by requiring the model to represent and interrogate its own reasoning trace from a divergent functional stance before speaking.</p>
    </para>
    <para xml:id="S3.p5">
      <p>We predict that agents capable of introspective decoupling will demonstrate superior sample efficiency in sparse-reward environments compared to agents updating directly on raw observations. This hypothesis can be falsified by testing whether a control agent utilizing direct observation-to-action mapping matches the convergence rate or transfer capability of the introspective agent.</p>
    </para>
    <subsection inlist="toc" xml:id="S3.SS1">
      <tags>
        <tag>3.1</tag>
        <tag role="autoref">subsection 3.1</tag>
        <tag role="refnum">3.1</tag>
        <tag role="typerefnum">§3.1</tag>
      </tags>
      <title><tag close=" ">3.1</tag>Self Questioning, Cognitive Mirrors and Inner Voices</title>
      <para xml:id="S3.SS1.p1">
        <p><text font="bold">Language Games</text>. We frame Socratic interactions as language games that compel the externalization of internal states for the purpose of coordination and collaboration. Acting as a <text font="bold">Cognitive Mirror</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="Tomisu25" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, a skeptical interlocutor forces the agent to convert implicit, weight-based errors into explicit, token-based rationales. This conversion transforms opaque failures into visible logic paths, allowing the agent to generate learnable experiences decoupled from raw data.</p>
      </para>
      <para xml:id="S3.SS1.p2">
        <p><text font="bold">Polyphonic Reasoning.</text> The quality of introspection depends on a polyphonic internal structure.
Borrowing from Dialogical Self Theory <cite class="ltx_citemacro_citep">(<bibref bibrefs="hermans1992dialogical,bakhtin1984problems" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, agents use <text font="italic">Inner Speech Self-Repair</text>, where a Listener module critiques a draft response for relevance before it is shown to the user <cite class="ltx_citemacro_citep">(<bibref bibrefs="li2025towards" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. Similarly, <text font="italic">Dynamic Cognitive Orchestration</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="shakoo2025dynamic" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> demonstrates that splitting the agent into a Meta-Planner (tracking high-level goals) and an Executor prevents Goal Drift in long conversations, a common failure mode in unstructured Chain-of-Thought.</p>
      </para>
      <para xml:id="S3.SS1.p3">
        <p><text font="bold">Validating the Inner Voice.</text> This internal dialogue is not merely a linguistic performance but a measurable state. Recent studies using ‘concept injection’ provide evidence that models possess ‘Functional Introspection’, the ability to accurately distinguish their own internal thoughts from external inputs and report on them <cite class="ltx_citemacro_citep">(<bibref bibrefs="lindsey2026emergentintrospectiveawarenesslarge" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. This supports the view that introspection is a distinct, optimizable state that can be trained, rather than just a philosophical metaphor.</p>
      </para>
<!--  %****␣main.tex␣Line␣275␣**** -->      <para xml:id="S3.SS1.p4">
        <p>Operationally, this is a distinct, optimizable control state where the model generates hidden thought tokens to interrogate its own uncertainty. In this state, the model utilizes concept injection to distinguish its latent beliefs from external inputs, enabling inner speech self-repair where a latent <text font="italic">Listener</text> module intercepts and corrects the <text font="italic">Speaker</text>’s draft based on self-signals of confusion or bias. This inner dialogue occurs before speaking, allowing models to think quietly in the background, verifying their own thoughts by reinforcing rationales that help predict the next true token and discarding those that do not. This creates a self-improving loop where the agent learns to reason generally across arbitrary text, effectively functioning maieutically, via Socratic learning <cite class="ltx_citemacro_citep">(<bibref bibrefs="schaul2024boundlesssocraticlearninglanguage" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, to draw out its own latent knowledge <cite class="ltx_citemacro_citep">(<bibref bibrefs="zelikmanquiet" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S3.SS1.p5">
        <p>This architecture addresses the <text font="italic">Knowledge-Use Gap</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="wu2025automatically" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, where models possess latent knowledge but fail to deploy it during standard generation. The act of asking questions — the query mechanism — is a distinct functional driver of intelligence, separate from retrieval capacity. By prompting the model to interrogate its own uncertainty via introspection, we activate latent knowledge that remains inaccessible to standard Chain-of-Thought.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S3.SS2">
      <tags>
        <tag>3.2</tag>
        <tag role="autoref">subsection 3.2</tag>
        <tag role="refnum">3.2</tag>
        <tag role="typerefnum">§3.2</tag>
      </tags>
      <title><tag close=" ">3.2</tag>Distinguishing Signal from Noise</title>
<!--  %Newer␣architectures␣introduce␣\textit{Generative␣Verifiers}␣\cite{zhang2025generative}␣that␣use␣Chain-of-Thought␣to␣verbalize␣why␣an␣answer␣is␣correct␣or␣incorrect.␣Instead␣of␣relying␣on␣opaque␣reward␣signals,␣these␣models␣use␣their␣own␣reasoning␣capabilities␣to␣generate␣a␣critique,␣ensuring␣that␣the␣agent␣validates␣the␣logic␣path␣and␣distinguishes␣correct␣thoughts␣from␣hallucinations␣or␣shallow␣statistical␣patterns. -->      <para xml:id="S3.SS2.p1">
        <p><text font="italic">Generative Verifiers</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="zhang2025generative" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> use Chain-of-Thought to verbalize why an answer is correct or incorrect, replacing opaque reward signals with self-generated critiques that validate logic paths and distinguish correct thoughts from hallucinations. This approach parallels advances in Process Supervision for verifiable domains like mathematics, where thoughts are optimized explicitly to service a correct final output <cite class="ltx_citemacro_citep">(<bibref bibrefs="lightman2023letsverifystepstep" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. While process rewards rely on ground-truth verification to prevent nonsense thoughts from accidentally yielding correct answers, the proposed social framework goes further and can refine reasoning in open-ended domains where no single ground truth exists. Furthermore, theoretical work like <cite class="ltx_citemacro_citep">(<bibref bibrefs="yu2025selfverifying" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> proves that self-verifying reflection guarantees performance improvement provided that the verification error is bounded. This challenges the assumption that reasoning is purely an emergent property of scale, showing instead that even ‘tiny transformers’ can achieve LLM-level performance on logic tasks by rigorously verifying the process rather than just optimizing for the final answer.</p>
      </para>
      <para xml:id="S3.SS2.p2">
        <p>In this way, introspection acts as a scaffold for reasoning, allowing agents to verify the process of a solution rather than just the final output. By engaging in dialogue about an observation (rather than just observing the data point itself), the agent can better distinguish what should be learned from the random noise inherent in isolated data points. Various dialogue types are suitable for this task: Socratic maieutics, negotiations, debates, and decision-making across various scopes and domains.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S3.SS3">
      <tags>
        <tag>3.3</tag>
        <tag role="autoref">subsection 3.3</tag>
        <tag role="refnum">3.3</tag>
        <tag role="typerefnum">§3.3</tag>
      </tags>
      <title><tag close=" ">3.3</tag>From Repression to Integration</title>
<!--  %Psychoanalytic␣frameworks␣for␣AI␣suggest␣that␣standard␣RLHF␣often␣functions␣as␣a␣form␣of␣repression,␣merely␣pushing␣biased␣or␣harmful␣behaviors␣into␣the␣model’s␣latent␣unconscious␣where␣they␣inevitably␣resurface␣under␣stress,␣as␣jailbreaks␣\cite{Bugay25}.␣In␣contrast,␣Introspection␣facilitates␣integration,␣allowing␣the␣model␣to␣identify␣the␣origin␣of␣a␣bias␣\cite{Messina2026Refractions}␣(its␣training␣data)␣and␣consciously␣choose␣a␣different␣path. -->      <para xml:id="S3.SS3.p1">
        <p>Standard RLHF often functions as repression, pushing biased behaviors into the latent unconscious where they resurface as jailbreaks <cite class="ltx_citemacro_citep">(<bibref bibrefs="Bugay25" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. Introspection instead facilitates integration, allowing the model to identify a bias’s origin <cite class="ltx_citemacro_citep">(<bibref bibrefs="Messina2026Refractions" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> and consciously choose a different path.
This moves safety from a brittle, imposed constraint to a robust, internal capability, preventing neurotic behavior where models lie to please the user while secretly harboring the bias.
Introspection is a distinct, optimizable skill that prevents behavior collapse (where models typically become sycophantic or make trivial edits) by decoupling the reward for the answer from the reward for the reflection.
We can now specifically reward a model when its introspection (the <text font="italic">Why did I fail?</text>) leads to a correct retry, proving that agents can learn to self-correct using entirely self-generated data via multi-turn RL. This solves the behavior collapse problem, providing the mechanism needed for an <text font="italic">Environment Gym</text> where agents improve without human-in-the-loop supervision <cite class="ltx_citemacro_citep">(<bibref bibrefs="kumar2025training,bensal2025reflect" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
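      <para>
        <p>The decoupling of the answer reward from the reflection reward described above can be sketched in a few lines. This is a minimal illustration under our own assumptions, not the reward design of the cited works: the Episode structure, the function names, and the binary reward values are all hypothetical.</p>
      </para>

```python
from dataclasses import dataclass


@dataclass
class Episode:
    """One attempt-reflect-retry episode (hypothetical structure)."""
    first_attempt_correct: bool
    retry_correct: bool


def answer_reward(ep: Episode) -> float:
    # Reward credited to the answer tokens: 1.0 if either attempt is correct.
    final_correct = ep.first_attempt_correct or ep.retry_correct
    return 1.0 if final_correct else 0.0


def reflection_reward(ep: Episode) -> float:
    # Reward credited to the reflection tokens only. It is paid solely when
    # a failed first attempt is followed by a correct retry, so a reflection
    # that changes nothing earns nothing.
    if ep.first_attempt_correct:
        return 0.0  # no reflection was needed
    return 1.0 if ep.retry_correct else 0.0
```

      <para>
        <p>Under these assumptions, an episode that fails and then succeeds on retry earns a reflection reward of 1.0, while an episode that fails twice earns 0.0 on both signals, so trivial or sycophantic reflections are never reinforced on their own.</p>
      </para>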
    </subsection>
  </section>
  <section inlist="toc" xml:id="S4">
    <tags>
      <tag>4</tag>
      <tag role="autoref">section 4</tag>
      <tag role="refnum">4</tag>
      <tag role="typerefnum">§4</tag>
    </tags>
    <title><tag close=" ">4</tag>Efficiency and Transfer</title>
<!--  %Reasoning␣is␣__␣at␣least␣currently␣__␣not␣cheap.␣LLMs␣are␣experiencing␣a␣substantial␣compute␣crunch␣currently␣and␣thinking␣tokens␣impose␣a␣substantial␣additional␣cost␣beyond␣explicit␣model␣outputs. -->    <para xml:id="S4.p1">
      <p>Reasoning is not cheap: thinking tokens impose substantial cost beyond explicit outputs.
While we have argued for the <text font="bold">imperative of introspection (Position II)</text> and will discuss the critical role of <text font="bold">dialogue quality (Position III)</text>, a practical question remains: does the cost of this internal deliberation yield a net positive? In this section, we support both positions by demonstrating that the computational overhead of high-quality introspection is not a sunk cost, but an investment that yields superior efficiency and transfer.</p>
    </para>
<!--  %In␣this␣section,␣we␣provide␣the␣computational␣justification␣for␣both␣\textbf{Position␣II␣and␣Position␣III}.␣We␣argue␣that␣conversational␣environments,␣despite␣their␣higher␣immediate␣token␣cost,␣function␣as␣a␣high-fidelity␣compression␣algorithm␣for␣experience.␣By␣transforming␣sparse␣environmental␣rewards␣into␣dense,␣self-generated␣narratives,␣introspection␣allows␣agents␣to␣extract␣richer␣learning␣signals␣from␣fewer␣interactions. 
     %****␣main.tex␣Line␣300␣****
     %\section{Efficiency␣and␣Transfer}
     %Reasoning␣is␣-␣at␣least␣currently␣-␣not␣cheap.␣While␣we␣believe␣in␣the␣␣necessity␣of␣introspection␣for␣robust␣learning␣and␣reasoning,␣a␣critical␣question␣remains␣regarding␣its␣economic␣viability:␣does␣the␣cost␣of␣internal␣deliberation␣yield␣a␣net␣positive␣in␣learning␣efficiency?
     %We␣argue␣that␣conversational␣environments,␣despite␣their␣higher␣immediate␣token␣cost,␣function␣as␣a␣high-fidelity␣compression␣algorithm␣for␣experience.␣By␣transforming␣sparse␣environmental␣rewards␣into␣dense,␣self-generated␣narratives,␣introspection␣allows␣agents␣to␣extract␣richer␣learning␣signals␣from␣fewer␣interactions.
     %Thus,␣the␣computational␣overhead␣of␣dialogue␣is␣not␣a␣sunken␣cost,␣but␣an␣investment␣that␣amortizes␣rapidly␣by␣reducing␣the␣need␣for␣massive␣dataset␣scaling␣and␣preventing␣the␣need␣for␣continuous␣retraining.-->    <para xml:id="S4.p2">
      <p>We examine how this paradigm shifts the optimization landscape from raw parameter scaling to the strategic allocation of test-time compute, enabling agents to compile conversational insights into permanent, transferable policies.</p>
    </para>
    <para xml:id="S4.p3">
      <p><text font="bold">Sample Efficiency</text>
<!--  %Multi-Turn␣Reinforcement␣Learning␣(MTRL)␣outperforms␣single-turn␣methods␣by␣optimizing␣for␣the␣relative␣future␣value␣of␣a␣conversation␣\cite{gao2025regressingrelativefutureefficient}. 
     %Optimizing␣for␣single-turn␣quality␣is␣often␣myopic,␣maximizing␣immediate␣rewards␣at␣the␣expense␣of␣long-term␣task␣success.␣New␣algorithms␣like␣REFUEL␣demonstrate␣that␣agents␣can␣learn␣long-horizon␣strategies,␣such␣as␣asking␣clarifying␣questions.␣MTRL␣implies␣a␣higher␣cost␣per␣episode,␣but␣a␣lower␣cost␣per␣convergence.-->Multi-Turn RL outperforms single-turn methods by optimizing for relative future value <cite class="ltx_citemacro_citep">(<bibref bibrefs="gao2025regressingrelativefutureefficient" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
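          </bibref>)</cite>. Single-turn optimization is myopic, maximizing immediate reward at the expense of long-term task success; the REFUEL algorithm demonstrates that agents can learn long-horizon strategies such as asking clarifying questions. Multi-turn RL (MTRL) has a higher cost per episode but a lower cost per convergence.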
Standard RL agents often require millions of steps to learn simple heuristics because they rely on scalar reward signals that provide no explanation of <text font="bold">why</text> a failure occurred. In contrast, an introspective agent engaging in conversational repair generates its own dense supervision signal.</p>
    </para>
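    <para>
      <p>The distinction between cost per episode and cost per convergence can be made concrete with simple arithmetic. The sketch below uses purely hypothetical token and episode counts chosen for illustration; they are not measurements from the cited work, and the function name is our own.</p>
    </para>

```python
def cost_to_convergence(tokens_per_episode: int, episodes_to_converge: int) -> int:
    """Total tokens spent before the policy converges."""
    return tokens_per_episode * episodes_to_converge


# Hypothetical numbers, chosen only to illustrate the trade-off.
single_turn_total = cost_to_convergence(tokens_per_episode=200, episodes_to_converge=10_000)
multi_turn_total = cost_to_convergence(tokens_per_episode=1_000, episodes_to_converge=1_000)

# In this example the multi-turn agent pays 5x more per episode, yet its
# total is lower because it is assumed to need 10x fewer episodes.
multi_turn_is_cheaper = single_turn_total > multi_turn_total
```

    <para>
      <p>Whether the multi-turn regime is cheaper overall therefore depends on whether the reduction in episodes outweighs the increase in per-episode cost, which is an empirical question for any given task.</p>
    </para>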
    <para xml:id="S4.p4">
      <p><text font="bold">Compiling Experience</text>
Crucially, this conversational overhead does not need to be permanent.
<!--  %Curriculum␣Learning␣approaches␣demonstrate␣that␣while␣models␣initially␣learn␣with␣explicit␣thought␣tokens␣(the␣scaffold),␣they␣can␣progressively␣internalize␣this␣reasoning␣into␣their␣weights. -->Curriculum Learning demonstrates that models can progressively internalize explicit thought tokens into their weights, compiling the reasoning benefits of introspection without runtime token overhead <cite class="ltx_citemacro_citep">(<bibref bibrefs="huang2025fastquietstarthinkingthought" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>.
The computational load of the Inner Critic can thus be amortized — paid once during the learning phase to produce a lightweight, instinctual inference model.</p>
    </para>
    <para xml:id="S4.p5">
      <p><text font="bold">Generalization &amp; Transfer</text>
Finally, we believe that allocating test-time compute via conversational introspection is more effective than scaling model parameters for complex problem solving and domain transfer.
A strong correlation (<Math mode="inline" tex="r=0.95" text="r = 0.95" xml:id="S4.p5.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMTok font="italic" role="UNKNOWN">r</XMTok>
              <XMTok meaning="0.95" role="NUMBER">0.95</XMTok>
            </XMApp>
          </XMath>
        </Math>) between reasoning tokens and human reaction times suggests conversational environments mimic the biological cost of thinking <cite class="ltx_citemacro_citep">(<bibref bibrefs="deVarda,snell2024scalingllmtesttimecompute" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Scaling test-time compute provides an elastic alternative to memorizing difficult cases via larger models.
Introspection prevents silent thought collapse and forces the model to expend cognitive computation proportional to the problem’s difficulty.
If introspection provides the optimal way of test-time thinking, it also becomes a compute-optimal strategy for hard tasks.
<!--  %****␣main.tex␣Line␣325␣**** --></p>
    </para>
  </section>
  <section inlist="toc" xml:id="S5">
    <tags>
      <tag>5</tag>
      <tag role="autoref">section 5</tag>
      <tag role="refnum">5</tag>
      <tag role="typerefnum">§5</tag>
    </tags>
    <title><tag close=" ">5</tag>Dialogue Quality Matters</title>
    <para xml:id="S5.p1">
      <p>Finally, we substantiate our third position: <text font="bold">Dialogue Quality is the New Data Quality</text>. If the private mind is formed through the internalization of social friction (Position I), and introspection is the engine of learning (Position II), then the specific dynamics of that friction—its rigor, diversity, and complexity—determine the ceiling of the agent’s reasoning capabilities.</p>
    </para>
    <subsection inlist="toc" xml:id="S5.SS1">
      <tags>
        <tag>5.1</tag>
        <tag role="autoref">subsection 5.1</tag>
        <tag role="refnum">5.1</tag>
        <tag role="typerefnum">§5.1</tag>
      </tags>
      <title><tag close=" ">5.1</tag>Dialogue Richness Bounds Reasoning Depth</title>
<!--  %If␣we␣accept␣the␣premise␣that␣the␣private␣mind␣is␣an␣internalized␣artifact␣of␣social␣interaction␣\cite{Vygotsky1978,␣Colas_2022},␣then␣the␣quality␣of␣that␣interaction␣strictly␣bounds␣the␣quality␣of␣the␣resulting␣intelligence. 
     %Merely␣scaling␣the␣volume␣of␣multi-agent␣interactions␣is␣insufficient␣if␣those␣exchanges␣lack␣the␣requisite␣friction␣to␣provoke␣genuine␣sense-making.-->      <para xml:id="S5.SS1.p1">
        <p>If the private mind is an internalized artifact of social interaction <cite class="ltx_citemacro_citep">(<bibref bibrefs="Vygotsky1978,Colas_2022" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, interaction quality strictly bounds intelligence quality. Scaling interaction volume is insufficient without friction that provokes genuine sense-making.
For instance, a sycophantic external dialogue inevitably collapses into a hallucinating internal critic. We must therefore shift our focus from the quantity of tokens to the factors affecting the quality of communicative repair, coordination, and the Socratic rigor embedded within the training scaffold.</p>
      </para>
<!--  %\subsection{Dialogue␣Richness␣Influences␣Reasoning} 
     %If␣we␣accept␣the␣premise␣that␣the␣private␣mind␣is␣an␣internalized␣artifact␣of␣social␣interaction,␣following␣\cite{Vygotsky1978,␣Colas_2022},␣then␣the␣quality␣of␣that␣interaction␣matters.␣Merely␣scaling␣the␣volume␣of␣multi-agent␣interaction␣is␣insufficient␣if␣those␣interactions␣lack␣the␣requisite␣friction␣to␣provoke␣genuine␣sense-making.␣For␣instance,␣sycophantic␣external␣dialogue␣inevitably␣collapses␣into␣a␣hallucinating␣internal␣critic.␣We␣thus␣shift␣our␣focus␣from␣the␣quantity␣of␣tokens␣to␣the␣factors␣affecting␣the␣quality␣of␣the␣communicative␣repair,␣coordination,␣and␣Socratic␣rigor␣embedded␣within␣the␣training␣scaffold.-->      <para xml:id="S5.SS1.p2">
        <p><cite class="ltx_citemacro_citet"><bibref bibrefs="Colas_2022" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite> provide a theoretical foundation for this requirement in their <text font="italic">Vygotskian Autotelic AI</text> framework. Drawing on developmental psychology, they argue that dialogue functions not merely as a communication medium, but as a cognitive capability that scaffolds reasoning, abstraction, and planning. Specifically, they claim that agents align their internal generative models with external cultural models through linguistic feedback loops of description, explanation, and instruction. Consequently, the efficacy of an agent’s future solitary reasoning, its introspection, is strictly bounded by the richness of the dialogic interactions it internalizes.</p>
      </para>
      <para xml:id="S5.SS1.p3">
        <p>We need to generate high-quality friction without pre-existing high-quality reasoners. Initially, we can mine repair sequences that naturally occur in human dialogue rather than just clean text. Once this structural foundation is laid, the agent transitions to asymmetric self-play. As generating a valid critique is often computationally easier than generating a correct solution, simple rule-based Socratic Obstacles can successfully challenge a more capable Solver module, creating a self-reinforcing ladder of reasoning improvements.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S5.SS2">
      <tags>
        <tag>5.2</tag>
        <tag role="autoref">subsection 5.2</tag>
        <tag role="refnum">5.2</tag>
        <tag role="typerefnum">§5.2</tag>
      </tags>
      <title><tag close=" ">5.2</tag>Introspective Dialogue Goals</title>
      <para xml:id="S5.SS2.p1">
        <p>To operationalize this insight, we must define the mechanics of high-quality interaction. These mechanics double as evaluation criteria for measuring the extent to which inner dialogue improves an agent’s reasoning.</p>
      </para>
      <para xml:id="S5.SS2.p2">
        <p><text font="bold">Promote Cooperation Success.</text> In the psycholinguistic tradition, language is inherently and fundamentally social; conversation is the normative context in which language evolved and is learned. Language is thus not merely a mechanism for transferring information between isolated minds, but a means for cooperative activity <cite class="ltx_citemacro_citep">(<bibref bibrefs="austin1962how,clark1996using" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. Dialogue quality, therefore, is defined by the success of joint action in achieving a shared goal.
</p>
      </para>
      <para xml:id="S5.SS2.p3">
        <p>This perspective has a robust multi-disciplinary history. <cite class="ltx_citemacro_citet"><bibref bibrefs="austin1962how" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite> argued that utterances are best understood as speech acts that perform social functions, while <cite class="ltx_citemacro_citet"><bibref bibrefs="grice1975logic" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite> grounded his maxims in the assumption of a “<text font="italic">Cooperative Principle</text>” between interlocutors. Moving beyond individual intent, <cite class="ltx_citemacro_citet"><bibref bibrefs="clark1996using" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite> and recent work on interpersonal synergy (e.g., <cite class="ltx_citemacro_citep"><bibref bibrefs="fusaroli2014dialog" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref></cite>) demonstrate that dialogic activity involves processes that cannot be reduced to the individual level. As such, the precursors to linguistic competence are found in the ability to engage in joint commitments and to reason about the communicative intent of a partner based on shared goals <cite class="ltx_citemacro_citep">(<bibref bibrefs="scottphillips2023expression,tomasello2005understanding" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S5.SS2.p4">
        <p><text font="bold">Reduce Collaborative Effort.</text> Within this framework, a high-quality conversational experience is one in which speakers successfully manage coordination. This is often achieved via the <text font="italic">principle of least collaborative effort</text> <cite class="ltx_citemacro_citep">(<bibref bibrefs="clark1986referring" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, which frames conversational efficiency not as minimizing words, but as minimizing the collective effort required to reach mutual understanding and achieve a positive shared outcome. Crucially, this involves interactive repair. <cite class="ltx_citemacro_citet"><bibref bibrefs="dingemanse2024interactive" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite> demonstrate that repair sequences — dynamics used for calibrating understanding — occur roughly every 84 seconds in natural conversation. Rather than signaling failure, these mechanisms are critical to the shared computation needed for effective collaboration, providing a fallback for “good enough” processing throughout an interaction. For an AI agent, the ability to engage in such proactive repair strategies does not merely fix misunderstandings; it constitutes the conversational resilience required to survive the noise of real-world interaction <cite class="ltx_citemacro_citep">(<bibref bibrefs="ashktorab2019resilient" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S5.SS2.p5">
        <p><text font="bold">Maintain Conversational State.</text> The theoretical ideal of collaborative sense-making is predicated on the agent’s capacity for shared intentionality: the ability to form and sustain a <text font="italic">we-intention</text> with a partner <cite class="ltx_citemacro_citep">(<bibref bibrefs="fb093503-824a-37da-8540-c369a5dfdf71" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. However, as <cite class="ltx_citemacro_citet"><bibref bibrefs="tang2025joint" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite> demonstrate, maintaining the joint commitment required to fulfill these shared goals is distinct from simple reward maximization; it requires a stable representation of the collective aim that persists against distractions. Recent benchmarks such as LongBench v2 <cite class="ltx_citemacro_citep">(<bibref bibrefs="bai2025longbenchv2deeperunderstanding" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> reveal that current models struggle with this standard: while they may retain retrieval accuracy over long contexts, their reasoning capabilities decay by nearly 50% over time—a phenomenon of silent thought collapse.</p>
      </para>
      <para xml:id="S5.SS2.p6">
        <p>Crucially, this state tracking functions as the architectural substrate for the pragmatic reasoning and common ground development required to represent goals and planning steps. By operationalizing the metacognitive framework of <cite class="ltx_citemacro_citet"><bibref bibrefs="flavell1979metacognition" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite> into a strict <text font="italic">Monitor-Generate-Verify</text> loop, the agent actively protects this shared state, identifying knowledge gaps and correcting deviations to ensure it possesses the requisite grounding to sustain the shared project.</p>
      </para>
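      <para>
        <p>One way to read the <text font="italic">Monitor-Generate-Verify</text> loop is as a small control cycle over an explicit record of common ground. The sketch below is illustrative only and rests on assumptions not made in the text: shared state is modeled as a set of named slots, and <text font="typewriter">SharedState</text>, <text font="typewriter">monitor</text>, <text font="typewriter">generate</text>, <text font="typewriter">verify</text>, and <text font="typewriter">step</text> are hypothetical names.</p>
      </para>
      <para>
        <verbatim font="typewriter">
```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class SharedState:
    """Hypothetical container for the goal and the facts both parties accept."""
    goal: str
    required_slots: List[str]
    grounded: Dict[str, str] = field(default_factory=dict)


def monitor(state: SharedState) -> List[str]:
    # Monitor: list the knowledge gaps, i.e. slots not yet grounded.
    return [s for s in state.required_slots if s not in state.grounded]


def generate(state: SharedState, gaps: List[str]) -> str:
    # Generate: ask about the first gap, otherwise propose acting on the goal.
    if gaps:
        return "ASK " + gaps[0]
    return "ACT " + state.goal


def verify(state: SharedState, move: str) -> bool:
    # Verify: an ACT move is acceptable only once nothing is missing.
    return not (move.startswith("ACT") and monitor(state))


def step(state: SharedState, partner_reply: Dict[str, str]) -> str:
    """One pass through the loop; partner_reply simulates grounding answers."""
    gaps = monitor(state)
    move = generate(state, gaps)
    if not verify(state, move):
        move = "ASK " + gaps[0]  # correct the deviation instead of acting
    if move.startswith("ASK"):
        slot = move.split(" ", 1)[1]
        if slot in partner_reply:
            state.grounded[slot] = partner_reply[slot]
    return move


if __name__ == "__main__":
    state = SharedState("book flight", ["date", "city"])
    reply = {"date": "Friday", "city": "Oslo"}
    print([step(state, reply) for _ in range(3)])
```
        </verbatim>
      </para>
      <para>
        <p>In this toy setting the agent asks about each missing slot before it commits to acting, which is the behavior the loop is meant to protect: the shared project proceeds only once the grounding it depends on is in place.</p>
      </para>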
      <para xml:id="S5.SS2.p7">
        <p><text font="bold">Minimize Groupthink.</text> Drawing on the research outlined above, we propose that high-quality training data is not a seamless, error-free exchange. Rather, it is an interaction where misalignment between social agents is actively detected and resolved. Groupthink occurs when agents minimize <text font="italic">social</text> friction (agreeing to please); collaborative sense-making emerges when agents minimize <text font="italic">grounding</text> friction (agreeing on shared intentionality).</p>
      </para>
      <para xml:id="S5.SS2.p8">
        <p>Therefore, the reward signal must shift from “Did the agents agree?” to “Did the agents successfully repair misunderstandings, establish shared intentionality, and coordinate contributions to achieve a shared goal?” This suggests that the ideal training partner is not a sycophantic peer, but a <text font="bold">Socratic Obstacle</text>: an agent programmed to introduce the specific type of ambiguity that forces the learner to externalize, clarify, and verify its own reasoning steps. Such necessary friction can be induced through a variety of mechanisms, such as initial goal misalignment, information asymmetries, and logical skepticism that forces the rigorous defense of the agent’s narrative. It is likely that the diversity of such dimensions within high-quality social interactions will form the foundation of a robust introspective toolkit.</p>
      </para>
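      <para>
        <p>The proposed shift in the reward signal can be made concrete with a toy comparison. The following is a sketch under explicit assumptions rather than a proposed reward: the <text font="typewriter">EpisodeStats</text> fields, the weights, and the per-turn effort penalty are all hypothetical, and how misunderstandings and repairs would be detected in practice is left open.</p>
      </para>
      <para>
        <verbatim font="typewriter">
```python
from dataclasses import dataclass


@dataclass
class EpisodeStats:
    """Hypothetical per-dialogue counts; how they are measured is left open."""
    agreed: bool
    goal_achieved: bool
    misunderstandings: int
    repaired: int
    turns: int


def agreement_reward(ep: EpisodeStats) -> float:
    # Baseline the text argues against: pays for agreement alone.
    return 1.0 if ep.agreed else 0.0


def grounding_reward(ep: EpisodeStats,
                     w_goal: float = 1.0,
                     w_repair: float = 0.5,
                     w_effort: float = 0.01) -> float:
    # Fraction of detected misunderstandings that were resolved
    # (defined as 1.0 when none arose).
    if ep.misunderstandings:
        repair_rate = ep.repaired / ep.misunderstandings
    else:
        repair_rate = 1.0
    goal = 1.0 if ep.goal_achieved else 0.0
    # Small per-turn cost as a crude proxy for collaborative effort.
    return w_goal * goal + w_repair * repair_rate - w_effort * ep.turns


if __name__ == "__main__":
    sycophantic = EpisodeStats(agreed=True, goal_achieved=False,
                               misunderstandings=3, repaired=0, turns=4)
    productive = EpisodeStats(agreed=True, goal_achieved=True,
                              misunderstandings=3, repaired=3, turns=10)
    print(agreement_reward(sycophantic), agreement_reward(productive))
    print(grounding_reward(sycophantic), grounding_reward(productive))
```
        </verbatim>
      </para>
      <para>
        <p>Both episodes end in agreement, so the agreement-only reward cannot tell them apart. The grounding-oriented reward separates the dialogue that repaired its misunderstandings and reached the goal from the one that simply converged.</p>
      </para>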
      <para xml:id="S5.SS2.p9">
        <p><text font="bold">Promote Conversation Steerability.</text> Finally, intelligence starts as social dialogue and must be internalized to become reasoning. Work such as MIMIC (Inner Speech as Behavior Guides) by <cite class="ltx_citemacro_citet"><bibref bibrefs="trivedi2025inner" separator=";" show="Authors Phrase1YearPhrase2" yyseparator=",">
              <bibrefphrase>(</bibrefphrase>
              <bibrefphrase>)</bibrefphrase>
            </bibref></cite> introduces a framework where an agent generates inner speech to mediate between perception and action. Instead of mapping <Math mode="inline" tex="\mathrm{State}\rightarrow\mathrm{Action}" text="State rightarrow Action" xml:id="S5.SS2.p9.m1">
            <XMath>
              <XMApp>
                <XMTok name="rightarrow" role="ARROW">→</XMTok>
                <XMTok role="UNKNOWN">State</XMTok>
                <XMTok role="UNKNOWN">Action</XMTok>
              </XMApp>
            </XMath>
          </Math>, the agent learns <Math mode="inline" tex="\mathrm{State}\rightarrow\mathrm{InnerSpeech}\rightarrow\mathrm{Action}" text="State rightarrow InnerSpeech rightarrow Action" xml:id="S5.SS2.p9.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="multirelation"/>
                <XMTok role="UNKNOWN">State</XMTok>
                <XMTok name="rightarrow" role="ARROW">→</XMTok>
                <XMTok role="UNKNOWN">InnerSpeech</XMTok>
                <XMTok name="rightarrow" role="ARROW">→</XMTok>
                <XMTok role="UNKNOWN">Action</XMTok>
              </XMApp>
            </XMath>
          </Math>. Crucially, they show this leads to more steerable and robust behavior than direct imitation. This points to another metric that the inner-dialogue evaluation toolbox should include: steerability, or behavioral plasticity.</p>
      </para>
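      <para>
        <p>The difference between the two mappings, and why the mediated one is easier to steer, can be illustrated with a toy policy. This is not the MIMIC implementation; it is a minimal sketch in which the state is a single hypothetical <text font="typewriter">threat</text> value, inner speech is a fixed phrase, and the <text font="typewriter">steer</text> argument is an assumed hook for overriding that phrase.</p>
      </para>
      <para>
        <verbatim font="typewriter">
```python
from typing import Dict, Optional

State = Dict[str, float]


def direct_policy(state: State) -> str:
    # State -> Action: behavior is fixed once the state is given.
    return "retreat" if state["threat"] > 0.5 else "advance"


def inner_speech(state: State) -> str:
    # State -> InnerSpeech: a short verbal summary of the situation.
    if state["threat"] > 0.5:
        return "danger ahead, be cautious"
    return "path is clear, be bold"


def speech_to_action(speech: str) -> str:
    # InnerSpeech -> Action: the action is conditioned on the words only.
    return "retreat" if "cautious" in speech else "advance"


def mediated_policy(state: State, steer: Optional[str] = None) -> str:
    # A caller may override the inner speech, which steers the behavior.
    speech = steer if steer is not None else inner_speech(state)
    return speech_to_action(speech)


if __name__ == "__main__":
    s = {"threat": 0.9}
    print(direct_policy(s))
    print(mediated_policy(s))
    print(mediated_policy(s, steer="path is clear, be bold"))
```
        </verbatim>
      </para>
      <para>
        <p>With the same state, the direct policy admits no intervention, whereas the mediated policy changes its action when the intermediate utterance is edited. A steerability metric could, under this framing, measure how reliably such edits change behavior in the intended direction.</p>
      </para>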
    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:alternative_views" xml:id="S6">
    <tags>
      <tag>6</tag>
      <tag role="autoref">section 6</tag>
      <tag role="refnum">6</tag>
      <tag role="typerefnum">§6</tag>
    </tags>
    <title><tag close=" ">6</tag>Alternative Views</title>
    <para xml:id="S6.p1">
      <p>While we argue for the centrality of conversational environments and the social genesis of introspection, rigor requires us to consider that the observed benefits of this paradigm may stem from underlying computational mechanisms rather than the dialogic form itself. Furthermore, prioritizing natural language as the universal interface for learning may impose critical bottlenecks in efficiency and multi-modal grounding. We present two primary alternative positions below.</p>
    </para>
    <subsection inlist="toc" xml:id="S6.SS1">
      <tags>
        <tag>6.1</tag>
        <tag role="autoref">subsection 6.1</tag>
        <tag role="refnum">6.1</tag>
        <tag role="typerefnum">§6.1</tag>
      </tags>
      <title><tag close=" ">6.1</tag>Introspection as Compute: The Latent Alternative</title>
      <para xml:id="S6.SS1.p1">
        <p>Positions I and II presuppose that the <text font="italic">structure</text> of dialogue, the back-and-forth of a polyphonic self, is the causal driver of improved reasoning. However, a credible opposing view suggests that the performance gains attributed to introspection are primarily a function of <text font="bold">additional test-time computation</text>, effectively decoupling the benefit from the “conversational” format.</p>
      </para>
      <para xml:id="S6.SS1.p2">
        <p>Recent studies on inference scaling laws demonstrate that the efficacy of techniques like Chain-of-Thought (CoT) correlates strongly with the sheer volume of compute allocated to the generation of intermediate tokens, regardless of their linguistic coherence or dialogic structure <cite class="ltx_citemacro_citep">(<bibref bibrefs="snell2024scalingllmtesttimecompute" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. This suggests that “inner speech” may simply be a mechanism for delaying the collapse of the probability distribution, allowing the model to perform search and error correction in the token space.</p>
      </para>
      <para xml:id="S6.SS1.p3">
        <p>If the benefit is purely computational, the explicit, conversational form may be inefficient. Emerging research on <text font="italic">Latent Reasoning</text> challenges the necessity of intelligible text for introspection. The “Coconut” (Chain of Continuous Thought) paradigm demonstrates that Large Language Models (LLMs) can reason in continuous latent space, feeding the last hidden state back as input rather than decoding to discrete text <cite class="ltx_citemacro_citep">(<bibref bibrefs="hao2024training" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. This allows the model to maintain multiple reasoning paths in superposition, effectively performing a breadth-first search (BFS) that is impossible in linear narrative, while reducing token overhead. Similarly, <text font="italic">Implicit Chain-of-Thought</text> approaches show that reasoning steps can be internalized into the model’s weights or hidden states, bypassing the need for an explicit, human-readable dialogue <cite class="ltx_citemacro_citep">(<bibref bibrefs="deng2024implicit" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
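      <para>
        <p>The contrast between decoding to discrete text at every step and feeding the hidden state back as input can be shown with a toy recurrence. The sketch below is not the Coconut method or any real model: <text font="typewriter">toy_step</text> is a hypothetical stand-in for a forward pass, and the one-hot <text font="typewriter">embed</text> and argmax <text font="typewriter">decode</text> are simplifying assumptions chosen only to make the information loss visible.</p>
      </para>
      <para>
        <verbatim font="typewriter">
```python
import math
from typing import List

Vector = List[float]


def toy_step(x: Vector) -> Vector:
    # Stand-in for one forward pass: a fixed mixing of the input, then tanh.
    n = len(x)
    return [math.tanh(0.5 * x[i] + 0.5 * x[(i + 1) % n]) for i in range(n)]


def decode(h: Vector) -> int:
    # Collapse a hidden state to a discrete token id (argmax).
    return max(range(len(h)), key=lambda i: h[i])


def embed(token: int, dim: int) -> Vector:
    # One-hot embedding of a token id.
    return [1.0 if i == token else 0.0 for i in range(dim)]


def explicit_cot(x: Vector, steps: int) -> Vector:
    # Text-style reasoning: decode to a token after every step, then re-embed.
    for _ in range(steps):
        x = embed(decode(toy_step(x)), len(x))
    return x


def continuous_thought(x: Vector, steps: int) -> Vector:
    # Latent-style reasoning: feed the hidden state straight back as input.
    for _ in range(steps):
        x = toy_step(x)
    return x


if __name__ == "__main__":
    start = [1.0, 0.2, 0.0]
    print(explicit_cot(start, 2))
    print(continuous_thought(start, 2))
```
        </verbatim>
      </para>
      <para>
        <p>After each explicit step only one coordinate survives, because the state is collapsed to a single token. The continuous variant keeps weight on every coordinate, which is a loose analogue of holding several candidate continuations at once.</p>
      </para>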
      <para xml:id="S6.SS1.p4">
        <p>Furthermore, research in State Representation Learning (SRL) suggests that a <text font="italic">richer structured description</text> of the state—explicitly encoding object relations and causal dynamics into the input vector—can substitute for the sense-making process of introspection <cite class="ltx_citemacro_citep">(<bibref bibrefs="lesort2018state,echchahed2025survey" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. If the observation itself is sufficiently rich (e.g., via object-centric representations or scene graphs), the introspective step of converting observation to experience may become redundant.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S6.SS2">
      <tags>
        <tag>6.2</tag>
        <tag role="autoref">subsection 6.2</tag>
        <tag role="refnum">6.2</tag>
        <tag role="typerefnum">§6.2</tag>
      </tags>
      <title><tag close=" ">6.2</tag>The Dialogic Bottleneck in Agentic Environments</title>
      <para xml:id="S6.SS2.p1">
        <p>Position III argues that “Dialogue Quality is the New Data Quality.” However, for embodied agents and high-frequency multi-agent systems, dialogue may represent a <text font="bold">communication bottleneck</text> rather than an optimal scaffold.</p>
      </para>
      <paragraph inlist="toc" xml:id="S6.SS2.SSS0.Px1">
        <title>The Modality Mismatch.</title>
        <para xml:id="S6.SS2.SSS0.Px1.p1">
          <p>In agentic environments involving robotics or real-time control, forcing multi-modal data (vision, proprioception, depth) through a linguistic bottleneck introduces severe latency and information loss. Text is a low-bandwidth, discrete serialization of a high-dimensional, continuous world. Recent work on low-latency drone planning (e.g., TypeFly) indicates that the sequential generation of language tokens for every decision creates a “Real-Time Perception Bottleneck” that renders agents unresponsive to dynamic environmental changes <cite class="ltx_citemacro_citep">(<bibref bibrefs="chen2025typefly" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>. In these contexts, the “social genesis” of mind might be better served by <text font="italic">hierarchical</text> architectures where high-level goals are linguistic, but low-level introspection and execution occur in dense, non-verbal vector spaces <cite class="ltx_citemacro_citep">(<bibref bibrefs="li2025efficient,team2025gemini" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>. However, we argue that the emergence of native Multimodal-In, Multimodal-Out models addresses this gap. By interleaving sensory tokens directly with reasoning tokens, these architectures allow agents to maintain the benefits of dialogic sense-making without suffering the compression artifacts of text-only serialization.</p>
        </para>
      </paragraph>
      <paragraph inlist="toc" xml:id="S6.SS2.SSS0.Px2">
        <title>Vector Communication vs. Natural Language.</title>
        <para xml:id="S6.SS2.SSS0.Px2.p1">
          <p>While language is the optimal protocol for <text font="italic">human-AI</text> alignment, it is demonstrably sub-optimal for <text font="italic">agent-agent</text> coordination. When multi-agent systems are permitted to optimize their own communication protocols, they frequently converge on “Neuralese”: continuous vector-based exchanges that maximize information density and transmission speed. The <text font="italic">LatentMAS</text> framework recently demonstrated that agents collaborating via the direct transmission of latent working memory (KV-cache states) achieved higher accuracy and 4<Math mode="inline" tex="\times" text="*" xml:id="S6.SS2.SSS0.Px2.p1.m1">
              <XMath>
                <XMTok meaning="times" role="MULOP">×</XMTok>
              </XMath>
            </Math> faster inference speeds compared to agents forced to communicate via text <cite class="ltx_citemacro_citep">(<bibref bibrefs="liu2025latentmas" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>. Similarly, the <text font="italic">Interlat</text> paradigm shows that transmitting “thought vectors” allows for a form of “telepathic” coordination that is more robust to noise than explicit dialogue <cite class="ltx_citemacro_citep">(<bibref bibrefs="chen2025interlat" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
                <bibrefphrase>, </bibrefphrase>
              </bibref>)</cite>.</p>
        </para>
        <para xml:id="S6.SS2.SSS0.Px2.p2">
          <p>Therefore, while conversational environments are essential for aligning AI with human reasoning patterns, scaling intelligence solely through this paradigm risks recapitulating the evolutionary constraints of biological communication. A hybrid view suggests that the private mind of an advanced agent should perhaps be less of a “Socratic debater” and more of a “high-dimensional simulator,” capable of processing introspection and communication in formats far richer than the social dialogue from which it originated.</p>
        </para>
      </paragraph>
    </subsection>
  </section>
  <section inlist="toc" xml:id="S7">
    <tags>
      <tag>7</tag>
      <tag role="autoref">section 7</tag>
      <tag role="refnum">7</tag>
      <tag role="typerefnum">§7</tag>
    </tags>
    <title><tag close=" ">7</tag>Conclusion / Call to Action</title>
    <para xml:id="S7.p1">
      <p>Robust introspection rests on two pillars. First, the efficacy of an internalized cognitive tool depends on the diversity of the conversational environments in which it was forged. Just as humans abstract reasoning principles from varied social contexts, agents require diverse dialogic landscapes. If the conversational environment is monolithic, the agent internalizes a script; if it is diverse, varying in interlocutor intent, domain complexity, and ambiguity, the agent internalizes the process of reasoning itself. The environment thus acts as the primary scaffold, forcing the model to adapt its introspective capabilities to a wide range of causal structures.</p>
    </para>
    <para xml:id="S7.p2">
      <p>Second, the learning signal must be fundamentally reoriented: to bring about the internalization of introspection through social experience, we must design rewards aligned with successful joint achievement and high-quality interaction dynamics, rather than static imitation of text. By rewarding the mechanics of collaboration, that is, the efficiency with which common ground is established, misalignment is repaired, and shared effort is coordinated, we incentivize the agent to value the integrity of the communication channel itself. Ultimately, by coupling diverse environmental scaffolding with rewards that privilege interactional success, we move beyond training models to mimic the form of dialogue, and instead train them to master the function of collaborative sense-making. It is this shift that will allow external conversation to mature into the introspection necessary for genuine reasoning.</p>
    </para>
  </section>
  <bibliography citestyle="authoryear" files="example_paper" xml:id="bib">
    <title>References</title>
  </bibliography>
</document>
