<?xml version="1.0" encoding="UTF-8"?>
<?latexml searchpaths="/home/japhy/scienceReplication.artiswrong.com/paper_files/arxiv/2310.06983/latex_extracted"?>
<?latexml class="article" options="twocolumn"?>
<!--  %Language setting --><!--  %Replace ‘english’ with e.g. ‘spanish’ to change the document language --><?latexml package="babel" options="english"?>
<?latexml package="csquotes"?>
<?latexml package="biblatex" options="
backend=biber,
style=numeric,
"?>
<!--  %style=draft __¿ more transparency, numeric for final product --><!--  %Set page size and margins --><!--  %Replace ‘letterpaper’ with ‘a4paper’ for UK/EU standard size --><?latexml package="geometry" options="letterpaper,top=2cm,bottom=2cm,left=3cm,right=3cm,marginparwidth=1.75cm"?>
<!--  %Useful packages --><?latexml package="amsmath"?>
<?latexml package="graphicx"?>
<?latexml package="hyperref" options="colorlinks=true, allcolors=blue"?>
<?latexml RelaxNGSchema="LaTeXML"?>
<document xmlns="http://dlmf.nist.gov/LaTeXML" class="ltx_authors_1line">
  <resource src="LaTeXML.css" type="text/css"/>
  <resource src="ltx-article.css" type="text/css"/>
  <title>Violation of Expectation via Metacognitive Prompting Reduces Theory of Mind Prediction Error in Large Language Models</title>
  <creator role="author" xml:lang="en">
    <personname>Courtland Leer, Vincent Trost, Vineeth Voruganti<break/>
<text font="italic">Plastic Labs Inc.</text> <break/><text font="typewriter">[courtland, vince, vineeth]@plasticlabs.ai</text>
</personname>
  </creator>
  <abstract name="Abstract" xml:lang="en">
    <p>Recent research shows that Large Language Models (LLMs) exhibit a compelling level of proficiency in Theory of Mind (ToM) tasks. This ability to impute unobservable mental states to others is vital to human social cognition and may prove equally important in principal-agent relations between individual humans and Artificial Intelligences (AIs). In this paper, we explore how a mechanism studied in developmental psychology known as Violation of Expectation (VoE) can be implemented to reduce errors in LLM prediction about users by leveraging emergent ToM affordances. And we introduce a <text font="italic">metacognitive prompting</text> framework to apply VoE in the context of an AI tutor. By storing and retrieving facts derived in cases where LLM expectation about the user was violated, we find that LLMs are able to learn about users in ways that echo theories of human learning. Finally, we discuss latent hazards and augmentative opportunities associated with modeling user psychology and propose ways to mitigate risk along with possible directions for future inquiry.</p>
  </abstract>
  <section inlist="toc" xml:id="S1" xml:lang="en">
    <tags>
      <tag>1</tag>
      <tag role="autoref">section 1</tag>
      <tag role="refnum">1</tag>
      <tag role="typerefnum">§1</tag>
    </tags>
    <title><tag close=" ">1</tag>Motivation</title>
    <para xml:id="S1.p1">
      <p>Plastic Labs is a research-driven product company whose mission is to eliminate the principal-agent problem <cite class="ltx_citemacro_cite">[<bibref bibrefs="jensen2019theory" separator="," yyseparator=","/>]</cite> horizontally across human-AI interaction. In a near future of abundant intelligence, every human becomes a potent principal and every service an agentic AI. Alignment of incentives and information, then, must occur at the scale of the individual. Enabling models to deeply understand and cohere to user psychology will be critical and underscores the importance of research at the intersection of human and machine learning.</p>
    </para>
  </section>
  <section inlist="toc" xml:id="S2" xml:lang="en">
    <tags>
      <tag>2</tag>
      <tag role="autoref">section 2</tag>
      <tag role="refnum">2</tag>
      <tag role="typerefnum">§2</tag>
    </tags>
    <title><tag close=" ">2</tag>Introduction</title>
    <para xml:id="S2.p1">
      <p>Large Language Models (LLMs) have been shown to have a number of emergent abilities <cite class="ltx_citemacro_cite">[<bibref bibrefs="wei2022emergent" separator="," yyseparator=","/>]</cite>. Among those is Theory of Mind (ToM), defined as “the ability to impute unobservable mental states to others” <cite class="ltx_citemacro_cite">[<bibref bibrefs="kosinski2023theory" separator="," yyseparator=","/>]</cite>. The emergence of this specific capability is of significant interest, as it promises to give LLMs the ability to empathize and develop strong psychological models of others, as humans do naturally.
<!--  %**** main.tex Line 50 **** --></p>
    </para>
    <para xml:id="S2.p2">
      <p>But how do you best position LLMs to demonstrate these qualities? Typical methods posit that connecting data sources deemed personal (e.g. email, documents, notes, activity, etc.) is sufficient for learning about a user. Yet these methods assume individual persons are merely the aggregate of their intentionally produced, often superficial, digital artifacts. Critical context is lacking — the kind of psychological data humans automatically glean from social cognition and use in ToM (e.g. beliefs, emotions, desires, thoughts, intentions, knowledge, history, etc.).</p>
    </para>
    <para xml:id="S2.p3">
      <p>We propose an entirely passive approach to collect this data, informed by how developmental psychology suggests humans begin constructing models of the world from the earliest stages <cite class="ltx_citemacro_cite">[<bibref bibrefs="onishi200515" separator="," yyseparator=","/>]</cite>. This cognitive mechanism, known as Violation of Expectation (VoE) <cite class="ltx_citemacro_cite">[<bibref bibrefs="brod2022explicitly" separator="," yyseparator=","/>]</cite>, compares predictions about environments against sense data from experience to learn from the difference, i.e. errors in prediction.</p>
    </para>
    <para xml:id="S2.p4">
      <p>Inspired by prompting methodologies like Chain-of-Thought <cite class="ltx_citemacro_cite">[<bibref bibrefs="wei2022chain" separator="," yyseparator=","/>]</cite> and Metaprompt Programming <cite class="ltx_citemacro_cite">[<bibref bibrefs="reynolds2021prompt" separator="," yyseparator=","/>]</cite>, we design a <text font="italic">metacognitive prompting</text> framework for LLMs to mimic the VoE learning process. And we show that VoE-data-informed social reasoning about users results in less ToM prediction error.</p>
    </para>
    <para xml:id="S2.p5">
      <p>This paper has the following two objectives:</p>
      <enumerate xml:id="S2.I1">
        <item xml:id="S2.I1.i1">
          <tags>
            <tag>1.</tag>
            <tag role="autoref">item 1</tag>
            <tag role="refnum">1</tag>
            <tag role="typerefnum">item 1</tag>
          </tags>
          <para xml:id="S2.I1.i1.p1">
            <p>Demonstrate the general utility of a metacognitive prompting framework for VoE in reducing ToM prediction error in a domain-specific application: <ref class="ltx_href" href="https://chat.bloombot.ai">Bloom</ref>, a free AI tutor available on the web and via Discord.</p>
          </para>
        </item>
        <item xml:id="S2.I1.i2">
          <tags>
            <tag>2.</tag>
            <tag role="autoref">item 2</tag>
            <tag role="refnum">2</tag>
            <tag role="typerefnum">item 2</tag>
          </tags>
          <para xml:id="S2.I1.i2.p1">
            <p>Discuss at length opportunities for future work, including the practical and philosophical implications of this emergent capability to create psychological renderings of humans and ways to leverage confidential computing environments to secure them.</p>
          </para>
        </item>
      </enumerate>
    </para>
    <para xml:id="S2.p6">
      <p>We use OpenAI’s GPT-4<note mark="1" role="footnote" xml:id="footnote1"><tags>
            <tag>1</tag>
            <tag role="autoref">footnote 1</tag>
            <tag role="refnum">1</tag>
            <tag role="typerefnum">footnote 1</tag>
          </tags>GPT-4 32k version: 0613</note> API throughout this experiment and its evaluation.</p>
    </para>
    <figure inlist="lof" labels="LABEL:fig:diagram" xml:id="S2.F1">
      <tags>
        <tag>Figure 1</tag>
        <tag role="autoref">Figure 1</tag>
        <tag role="refnum">1</tag>
        <tag role="typerefnum">Figure 1</tag>
      </tags>
      <graphics candidates="voe-powered_bloom_paper_fig.png" class="ltx_centering" graphic="voe-powered_bloom_paper_fig.png" options="width=433.62pt" xml:id="S2.F1.g1"/>
      <toccaption class="ltx_centering"><tag close=" ">1</tag>Framework. Contained in the grey dotted box is an application’s core conversation loop (e.g. our AI tutor, <ref class="ltx_href" href="https://chat.bloombot.ai">Bloom</ref>) and drawn in blue is the metacognitive prompting framework described in section <ref labelref="LABEL:sec:methods"/>.</toccaption>
      <caption class="ltx_centering"><tag close=": ">Figure 1</tag>Framework. Contained in the grey dotted box is an application’s core conversation loop (e.g. our AI tutor, <ref class="ltx_href" href="https://chat.bloombot.ai">Bloom</ref>) and drawn in blue is the metacognitive prompting framework described in section <ref labelref="LABEL:sec:methods"/>.</caption>
    </figure>
  </section>
  <section inlist="toc" labels="LABEL:sec:prompting" xml:id="S3" xml:lang="en">
    <tags>
      <tag>3</tag>
      <tag role="autoref">section 3</tag>
      <tag role="refnum">3</tag>
      <tag role="typerefnum">§3</tag>
    </tags>
    <title><tag close=" ">3</tag>Framing and Related Work</title>
    <para xml:id="S3.p1">
      <p><text font="bold">Predictive Coding and Theory of Mind</text>.
While not yet a complete theory, Predictive Coding (PC) continues to gain traction as a framework for understanding how modeling and learning occur in biological brains. At a high level, PC hypothesizes that mental models of reality are built and employed by comparing predictions about environments with sensory perception <cite class="ltx_citemacro_cite">[<bibref bibrefs="schultz1997neural" separator="," yyseparator=","/>]</cite>. PC-inspired approaches to machine learning show great initial promise as biologically plausible AI training methodologies <cite class="ltx_citemacro_cite">[<bibref bibrefs="salvatori2023braininspired" separator="," yyseparator=","/>]</cite>.</p>
    </para>
    <para xml:id="S3.p2">
      <p>ToM is the ability of some organisms to ascribe mental states to others despite lacking direct access to any experience but their own. Notably, PC “may provide an important new window on the neural computations underlying theory of mind” as ToM “exhibit[s] a key signature of predictive coding: reduced activity to predictable stimuli” <cite class="ltx_citemacro_cite">[<bibref bibrefs="koster2013theory" separator="," yyseparator=","/>]</cite>. That is, when others behave in line with our predictions (i.e. our ToM projections are accurate), less is learned. The inverse also applies: prediction errors enhance our capacity for high-fidelity ToM over time.</p>
    </para>
<!--  %**** main.tex Line 75 **** -->    <para xml:id="S3.p3">
      <p><text font="bold">Emergent Behaviors</text>.
Researchers have long been interested in getting large language models to exhibit “thinking” and “reasoning” behaviors. A number of papers have been influential in pioneering ways to elicit these via prompting <cite class="ltx_citemacro_cite">[<bibref bibrefs="brown2020language,wei2022chain,kojima2023large,yao2023react" separator="," yyseparator=","/>]</cite>. As model architectures have scaled, these abilities appear to have emerged without explicit training <cite class="ltx_citemacro_cite">[<bibref bibrefs="wei2022emergent" separator="," yyseparator=","/>]</cite>. While there’s considerable debate concerning the distinction between “emergent abilities” and “in-context learning,” <cite class="ltx_citemacro_cite">[<bibref bibrefs="lu2023emergent" separator="," yyseparator=","/>]</cite> these phenomena display clear utility, regardless of taxonomy.</p>
    </para>
    <para xml:id="S3.p4">
      <p>Quantifying just how vast the space of latent “overhung” LLM capabilities really is constitutes a major area of formal and enthusiast-driven inquiry. ToM is one such highly compelling research domain. Kosinski <cite class="ltx_citemacro_cite">[<bibref bibrefs="kosinski2023theory" separator="," yyseparator=","/>]</cite> shows that OpenAI’s GPT series of models can pass fundamental developmental behavior tests. Some papers demonstrate how to improve these abilities <cite class="ltx_citemacro_cite">[<bibref bibrefs="moghaddam2023boosting" separator="," yyseparator=","/>]</cite> and others analyze these methods critically, questioning the premise of ToM emerging in LLMs <cite class="ltx_citemacro_cite">[<bibref bibrefs="ullman2023large,shapira2023clever" separator="," yyseparator=","/>]</cite>.</p>
    </para>
    <para xml:id="S3.p5">
      <p>Adjacently, there’s a clear trend of researchers pushing the limit of what types of cognitive tasks can be offloaded to LLMs. In order to scale supervision, eliminate human feedback, avoid evasive responses, and have transparent governing principles, Anthropic has experimented with delegating the work of human feedback to the LLM itself in their “constitutional” approach <cite class="ltx_citemacro_cite">[<bibref bibrefs="bai2022constitutional" separator="," yyseparator=","/>]</cite>. Other papers looking to achieve similar types of outcomes, without needing to update model weights, rely on in-context methods entirely <cite class="ltx_citemacro_cite">[<bibref bibrefs="shinn2023reflexion,zhou2023solving" separator="," yyseparator=","/>]</cite>.</p>
    </para>
    <para xml:id="S3.p6">
      <p><text font="bold">Violation of Expectation</text>.
One prime task candidate, which leverages emergent ToM abilities, is VoE. Similar to explanations from PC theories of cognition, VoE is an explicit mechanism that reduces prediction errors to learn about reality.</p>
    </para>
    <para xml:id="S3.p7">
      <p>While much of VoE happens in the unconscious mind and from an early age <cite class="ltx_citemacro_cite">[<bibref bibrefs="onishi200515" separator="," yyseparator=","/>]</cite>, research suggests that deliberate prediction making and error reduction also lead to enhanced learning outcomes <cite class="ltx_citemacro_cite">[<bibref bibrefs="brod2022explicitly" separator="," yyseparator=","/>]</cite>.</p>
    </para>
    <para xml:id="S3.p8">
      <p>Just as PC may play a role in ToM, VoE is a lightweight framework for identifying the data needed to minimize ToM error. Predictions are generated and compared against percepts, and learning is derived from the difference.</p>
    </para>
    <para xml:id="S3.p9">
      <p><text font="bold">Prompting Paradigms</text>.

Chain-of-Thought <cite class="ltx_citemacro_cite">[<bibref bibrefs="wei2022chain" separator="," yyseparator=","/>]</cite> prompting clearly shows that LLMs are capable “reasoning” generators and that this species of prompting can reduce the probability of generating incorrect answers. Yet, as this method is limited to one inference, the model often disregards that reasoning, especially during ToM-related tasks.</p>
    </para>
    <para xml:id="S3.p10">
      <p>Metaprompt Programming <cite class="ltx_citemacro_cite">[<bibref bibrefs="reynolds2021prompt" separator="," yyseparator=","/>]</cite> seeks to solve the laborious process of manually generating task-specific prompts (which are more efficacious than general ones) by leveraging LLMs’ ability to few-shot prompt themselves dynamically.</p>
    </para>
    <para xml:id="S3.p11">
      <p>Deliberate VoE as a learning method, ToM, and these prompting approaches all echo the human phenomenon of metacognition: put simply, thinking about thought. In the next section we introduce a <text font="italic">metacognitive prompting</text> framework in which the LLM generates ToM “thoughts” to be used in further generation as part of a VoE framework to passively acquire psychological data about the user.</p>
    </para>
  </section>
  <section inlist="toc" labels="LABEL:sec:methods" xml:id="S4" xml:lang="en">
    <tags>
      <tag>4</tag>
      <tag role="autoref">section 4</tag>
      <tag role="refnum">4</tag>
      <tag role="typerefnum">§4</tag>
    </tags>
    <title><tag close=" ">4</tag>Methods</title>
    <para xml:id="S4.p1">
      <p>The cognitive mechanism VoE can be broken down into two circular steps:
<!--  %**** main.tex Line 100 **** --></p>
      <enumerate xml:id="S4.I1">
        <item xml:id="S4.I1.i1">
          <tags>
            <tag>1.</tag>
            <tag role="autoref">item 1</tag>
            <tag role="refnum">1</tag>
            <tag role="typerefnum">item 1</tag>
          </tags>
          <para xml:id="S4.I1.i1.p1">
            <p>Making predictions about reality based on past learning.</p>
          </para>
        </item>
        <item xml:id="S4.I1.i2">
          <tags>
            <tag>2.</tag>
            <tag role="autoref">item 2</tag>
            <tag role="refnum">2</tag>
            <tag role="typerefnum">item 2</tag>
          </tags>
          <para xml:id="S4.I1.i2.p1">
            <p>Learning from the delta between predictions and reality.</p>
          </para>
        </item>
      </enumerate>
      <p>In the typical chat setting of a conversational LLM application, this means making a prediction about the next user input and comparing that with the actual input in order to derive psychological facts about the user at each conversational turn. We employ metacognitive prompting across both core parts of our framework shown in Figure <ref labelref="LABEL:fig:diagram"/>: our <text font="italic">user prediction task</text> and our <text font="italic">violation of expectation task</text>.</p>
    </para>
    <figure inlist="lof" labels="LABEL:fig:results" placement="ht" xml:id="S4.F2">
      <tags>
        <tag>Figure 2</tag>
        <tag role="autoref">Figure 2</tag>
        <tag role="refnum">2</tag>
        <tag role="typerefnum">Figure 2</tag>
      </tags>
      <tabular class="ltx_centering ltx_guessed_headers" vattach="middle">
        <thead>
          <tr>
            <td align="left" border="l r t" thead="column row"><text font="bold">Assessment</text></td>
            <td align="right" border="r t" thead="column"><text font="bold">VoE N</text></td>
            <td align="right" border="r t" thead="column"><text font="bold">VoE Pct</text></td>
            <td align="right" border="r t" thead="column"><text font="bold">Non-VoE N</text></td>
            <td align="right" border="r t" thead="column"><text font="bold">Non-VoE Pct</text></td>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td align="left" border="l r t" thead="row">1. Very</td>
            <td align="right" border="r t">35</td>
            <td align="right" border="r t">0.106</td>
            <td align="right" border="r t">96</td>
            <td align="right" border="r t">0.151</td>
          </tr>
          <tr>
            <td align="left" border="l r" thead="row">2. Somewhat</td>
            <td align="right" border="r">78</td>
            <td align="right" border="r">0.237</td>
            <td align="right" border="r">77</td>
            <td align="right" border="r">0.121</td>
          </tr>
          <tr>
            <td align="left" border="l r" thead="row">3. Neutral</td>
            <td align="right" border="r">17</td>
            <td align="right" border="r">0.052</td>
            <td align="right" border="r">22</td>
            <td align="right" border="r">0.035</td>
          </tr>
          <tr>
            <td align="left" border="l r" thead="row">4. Poorly</td>
            <td align="right" border="r">90</td>
            <td align="right" border="r">0.274</td>
            <td align="right" border="r">170</td>
            <td align="right" border="r">0.267</td>
          </tr>
          <tr>
            <td align="left" border="b l r" thead="row">5. Wrong</td>
            <td align="right" border="b r">109</td>
            <td align="right" border="b r">0.331</td>
            <td align="right" border="b r">272</td>
            <td align="right" border="b r">0.427</td>
          </tr>
        </tbody>
      </tabular>
      <toccaption class="ltx_centering"><tag close=" ">2</tag>Results from A/B test in the <ref class="ltx_href" href="https://chat.bloombot.ai">Bloom</ref> Web UI.</toccaption>
      <caption class="ltx_centering"><tag close=": ">Figure 2</tag>Results from A/B test in the <ref class="ltx_href" href="https://chat.bloombot.ai">Bloom</ref> Web UI.</caption>
    </figure>
    <subsection inlist="toc" xml:id="S4.SS1">
      <tags>
        <tag>4.1</tag>
        <tag role="autoref">subsection 4.1</tag>
        <tag role="refnum">4.1</tag>
        <tag role="typerefnum">§4.1</tag>
      </tags>
      <title><tag close=" ">4.1</tag>Metacognitive Prompting</title>
      <para xml:id="S4.SS1.p1">
        <p>Synthesized from the influences mentioned in Section <ref labelref="LABEL:sec:prompting"/>, we introduce the concept of <text font="italic">metacognitive prompting</text>. The core idea is prompting the model to generate “thoughts” about an assigned task, then using those “thoughts” as useful context in the following inference steps. We find that, in practice, this method of forced metacognition enhances the LLM’s ability to take context into account for ToM tasks (more discussion in Section <ref labelref="LABEL:sec:measuring_coherence"/>, “Measuring Coherence”).</p>
      </para>
<!--  %**** main.tex Line 125 **** -->      <para xml:id="S4.SS1.p2">
        <p><text font="bold">Task 1: User Prediction and Revision</text>.
Given the history of the current conversation, we prompt the LLM to generate a ToM thought including:</p>
        <itemize xml:id="S4.I2">
          <item xml:id="S4.I2.i1">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">1st item</tag>
            </tags>
            <para xml:id="S4.I2.i1.p1">
              <p>Reasoning about the user’s internal mental state</p>
            </para>
          </item>
          <item xml:id="S4.I2.i2">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">2nd item</tag>
            </tags>
            <para xml:id="S4.I2.i2.p1">
              <p>Likely possibilities for the next user input</p>
            </para>
          </item>
          <item xml:id="S4.I2.i3">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">3rd item</tag>
            </tags>
            <para xml:id="S4.I2.i3.p1">
              <p>A list of any additional data that would be useful to improve the prediction</p>
            </para>
          </item>
        </itemize>
        <p>The list serves as a query over a vector store to retrieve relevant VoE-derived user facts from prior interactions.</p>
      </para>
      <para xml:id="S4.SS1.p3">
        <p>We then prompt the model in a separate inference to revise the original ToM thought given new information, i.e. the retrieved facts that have been derived and stored by VoE. These facts are psychological in nature and taken into account to produce a revision with reduced prediction error.</p>
      </para>
      <para xml:id="S4.SS1.p4">
        <p><text font="bold">Task 2: Violation of Expectation and Revision</text>.
We employ the same prompting paradigm again in the VoE implementation.</p>
      </para>
      <para xml:id="S4.SS1.p5">
        <p>The first step is to generate a “thought” about the difference between prediction and reality in the previous user prediction task. This compares <text font="italic">expectation</text> — the revised user prediction — with <text font="italic">violation</text> — the actual user input. That is, how was expectation violated? If there were errors in the user predictions, what were they and why?</p>
      </para>
      <para xml:id="S4.SS1.p6">
        <p>This thought is sent to the next step, which generates a fact (or list of facts). In this step, we include the following:</p>
        <itemize xml:id="S4.I3">
          <item xml:id="S4.I3.i1">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">1st item</tag>
            </tags>
            <para xml:id="S4.I3.i1.p1">
              <p>Most recent LLM message sent to the user</p>
            </para>
          </item>
          <item xml:id="S4.I3.i2">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">2nd item</tag>
            </tags>
            <para xml:id="S4.I3.i2.p1">
              <p>Revised user prediction thought</p>
            </para>
          </item>
          <item xml:id="S4.I3.i3">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">3rd item</tag>
            </tags>
            <para xml:id="S4.I3.i3.p1">
              <p>Actual user response</p>
            </para>
          </item>
          <item xml:id="S4.I3.i4">
            <tags>
              <tag>•</tag>
              <tag role="autoref">item </tag>
              <tag role="typerefnum">4th item</tag>
            </tags>
            <para xml:id="S4.I3.i4.p1">
              <p>Thought about how expectation was violated</p>
            </para>
          </item>
        </itemize>
        <p>Given this context, fact(s) relevant to the user’s actual response are generated. This generation constitutes what was learned from VoE, i.e. prediction errors in ToM.</p>
      </para>
<!--  %**** main.tex Line 150 **** -->      <para xml:id="S4.SS1.p7">
        <p>Finally, we run a simple redundancy check on the derived facts, then write them to a vector store. We used the OpenAI Embeddings API for the experiment in this paper.</p>
      </para>
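      <para>
        <p>The two tasks above can be sketched end to end as follows. This is a minimal illustration under stated assumptions, not the implementation used in Bloom: the <text font="typewriter">LLM</text> callable, the <text font="typewriter">FactStore</text> class, the function names, and the prompt wording are all hypothetical, the store ranks facts by shared words as a stand-in for embedding similarity, and for brevity the whole thought (rather than only its list of useful data) is used as the retrieval query.</p>
        <verbatim font="typewriter"><![CDATA[
```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for a chat-completion call: prompt in, text out.
LLM = Callable[[str], str]


@dataclass
class FactStore:
    """Toy in-memory store; word overlap stands in for embedding similarity."""
    facts: List[str] = field(default_factory=list)

    def add(self, fact: str) -> bool:
        # Simple redundancy check: skip empty or case-insensitive duplicates.
        normalized = fact.strip()
        if not normalized or normalized.lower() in (f.lower() for f in self.facts):
            return False
        self.facts.append(normalized)
        return True

    def query(self, text: str, k: int = 3) -> List[str]:
        # Rank stored facts by the number of words shared with the query.
        words = set(text.lower().split())
        ranked = sorted(self.facts,
                        key=lambda f: len(words.intersection(f.lower().split())),
                        reverse=True)
        return ranked[:k]


def predict_user(llm: LLM, store: FactStore, history: str) -> str:
    """Task 1: generate a ToM thought, retrieve facts, then revise."""
    thought = llm(f"Conversation so far:\n{history}\n"
                  "Reason about the user's mental state, list likely next "
                  "inputs, and list additional data that would help.")
    facts = store.query(thought)
    return llm(f"Original thought:\n{thought}\nKnown user facts:\n"
               + "\n".join(facts) + "\nRevise the thought.")


def learn_from_violation(llm: LLM, store: FactStore, ai_message: str,
                         prediction: str, user_input: str) -> List[str]:
    """Task 2: describe the violation, derive facts, store the new ones."""
    voe_thought = llm(f"Prediction:\n{prediction}\nActual input:\n{user_input}\n"
                      "How was the expectation violated, and why?")
    raw = llm(f"AI message:\n{ai_message}\nPrediction:\n{prediction}\n"
              f"Actual input:\n{user_input}\nVoE thought:\n{voe_thought}\n"
              "List facts about the user, one per line.")
    # Only facts that pass the redundancy check are written and returned.
    return [fact for fact in raw.splitlines() if store.add(fact)]
```
]]></verbatim>
        <p>In a conversational loop, <text font="typewriter">predict_user</text> would run after each AI message and <text font="typewriter">learn_from_violation</text> once the actual user input arrives, so that facts stored on one turn become available to the revision step on later turns.</p>
      </para>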
    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:experiment" xml:id="S5" xml:lang="en">
    <tags>
      <tag>5</tag>
      <tag role="autoref">section 5</tag>
      <tag role="refnum">5</tag>
      <tag role="typerefnum">§5</tag>
    </tags>
    <title><tag close=" ">5</tag>Experiments</title>
    <para xml:id="S5.p1">
      <p>Our experiment aims to show that using VoE-derived data reduces error in LLM prediction about the next user input. This is especially useful and testable in conversations, so we use data from our AI tutor, Bloom, which is specifically prompted to keep a conversation moving forward to produce learning outcomes for users.</p>
    </para>
    <para xml:id="S5.p2">
      <p>Traditional conversation datasets often lean toward trivial dialogue, while instruction-following datasets are predominantly one-sided and transactional. Such datasets lack interpersonal dynamics, offering limited scope for substantive social cognition. Thus, our experiment employs an A/B test with two versions of our AI tutor, whose conversations more closely reflect psychologically informative social interactions between humans.</p>
    </para>
    <para xml:id="S5.p3">
      <p>The first version, the control, relies solely on past conversation to predict what the user will say next. The second version, the experimental, uses our metacognitive prompting framework in the background to make predictions. Crucially, and as described in Section <ref labelref="LABEL:sec:methods"/>, the framework leverages VoE to increase the amount of information at the model’s disposal to predict user responses. These VoE facts are introduced to the AI tutor through the additional “thought revision” phase in the conversational loop, allowing it to reduce prediction error and cohere more closely with the user’s psychology.</p>
    </para>
    <para xml:id="S5.p4">
      <p>We use the same LLM, GPT-4, to classify how well each version predicts each user input. Because LLMs are competent arbiters of token similarity, its assessment is a useful signal of whether VoE data can reduce LLM prediction error.</p>
    </para>
    <para xml:id="S5.p5">
      <p>We do so by prompting GPT-4 to choose from 5 options that assess the degree to which a generated user prediction thought is accurate. The choices include “very,” “somewhat,” “neutral,” “poorly,” and “wrong.” We include the most recent AI message, thought prediction, and actual user response in the context window. The evaluation scripts can be found on GitHub<note mark="2" role="footnote" xml:id="footnote2"><tags>
            <tag>2</tag>
            <tag role="autoref">footnote 2</tag>
            <tag role="refnum">2</tag>
            <tag role="typerefnum">footnote 2</tag>
          </tags>https://github.com/plastic-labs/voe-paper-eval</note>.</p>
    </para>
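    <para>
      <p>The classification step described above might be sketched as follows. This is an illustration only and not the evaluation script from the repository cited in the footnote: the function names and prompt wording are hypothetical, and the actual model call is left out so that the example stays self-contained.</p>
      <verbatim font="typewriter"><![CDATA[
```python
from typing import Optional

# The five labels named in the text, ordered from most to least accurate.
LABELS = ("very", "somewhat", "neutral", "poorly", "wrong")


def build_assessment_prompt(ai_message: str, prediction: str, user_input: str) -> str:
    """Assemble the context the judge sees: AI message, prediction, actual reply."""
    return (
        "Assess how accurately the prediction anticipated the user's reply.\n"
        f"AI message: {ai_message}\n"
        f"Predicted user thought: {prediction}\n"
        f"Actual user response: {user_input}\n"
        f"Answer with exactly one of: {', '.join(LABELS)}."
    )


def parse_assessment(reply: str) -> Optional[str]:
    """Return the label if the reply is exactly one of the five options."""
    cleaned = reply.strip().strip('."“”').lower()
    return cleaned if cleaned in LABELS else None
```
]]></verbatim>
      <p>Returning <text font="typewriter">None</text> for anything outside the five options keeps malformed judge replies out of the tallies instead of silently mapping them to a label.</p>
    </para>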
  </section>
  <section inlist="toc" labels="LABEL:sec:dataset" xml:id="S6" xml:lang="en">
    <tags>
      <tag>6</tag>
      <tag role="autoref">section 6</tag>
      <tag role="refnum">6</tag>
      <tag role="typerefnum">§6</tag>
    </tags>
    <title><tag close=" ">6</tag>Results</title>
    <para xml:id="S6.p1">
      <p><text font="bold">Dataset</text>.

This experiment uses a dataset of conversations users had with Bloom. We built it by running an A/B test on the backend of Bloom’s web interface. Only conversations of 3 or more turns are included. We recorded 59 conversations where the VoE version was active and 55 conversations where it was not. Within those, we collected 329 message examples from the VoE version and 637 from the non-VoE version. We discuss that difference in the “Considerations” paragraph of this section.</p>
    </para>
    <para xml:id="S6.p2">
      <p><text font="bold">Chi Square Test</text>.
We gave the model the freedom to choose granular assessments such as “somewhat”, “neutral”, and “poorly” rather than forcing it into a binary classification, but found that it rarely used the “neutral” option. For the analysis, we collapse the five-point scale: the top two ratings (“very” and “somewhat”) are grouped as “good”, the lowest two ratings (“poorly” and “wrong”) are grouped as “bad”, and neutral ratings are omitted.</p>
    </para>
    <para xml:id="S6.p3">
      <p>We want to test the independence of two categorical variables: <text font="italic">assessment</text> (good or bad) and <text font="italic">group</text> (VoE or non-VoE). The observed frequencies are given in the following table:</p>
    </para>
<!--  %**** main.tex Line 175 **** -->    <para align="center" xml:id="S6.p4">
      <tabular class="ltx_guessed_headers" vattach="middle">
        <thead>
          <tr>
            <td border="l r t" thead="column row"/>
            <td align="center" border="r t" thead="column">VoE</td>
            <td align="center" border="r t" thead="column">Non-VoE</td>
          </tr>
        </thead>
        <tbody>
          <tr>
            <td align="center" border="l r t" thead="row">Good</td>
            <td align="center" border="r t">113</td>
            <td align="center" border="r t">173</td>
          </tr>
          <tr>
            <td align="center" border="b l r" thead="row">Bad</td>
            <td align="center" border="b r">199</td>
            <td align="center" border="b r">442</td>
          </tr>
        </tbody>
      </tabular>
    </para>
    <para xml:id="S6.p5">
      <p>The Chi-square test statistic is calculated as:</p>
    </para>
    <para xml:id="S6.p6">
      <equation xml:id="S6.Ex1">
        <Math mode="display" tex="\chi^{2}=\sum\frac{(O_{ij}-E_{ij})^{2}}{E_{ij}}" text="chi ^ 2 = sum@((O _ (i * j) - E _ (i * j)) ^ 2 / E _ (i * j))" xml:id="S6.Ex1.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" name="chi" role="UNKNOWN">χ</XMTok>
                <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
              </XMApp>
              <XMApp>
                <XMTok mathstyle="display" meaning="sum" role="SUMOP" scriptpos="mid">∑</XMTok>
                <XMApp>
                  <XMTok mathstyle="display" meaning="divide" role="FRACOP"/>
                  <XMApp>
                    <XMTok role="SUPERSCRIPTOP" scriptpos="post2"/>
                    <XMDual>
                      <XMRef idref="S6.Ex1.m1.1"/>
                      <XMWrap>
                        <XMTok role="OPEN" stretchy="false">(</XMTok>
                        <XMApp xml:id="S6.Ex1.m1.1">
                          <XMTok meaning="minus" role="ADDOP">-</XMTok>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                            <XMTok font="italic" role="UNKNOWN">O</XMTok>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                            </XMApp>
                          </XMApp>
                          <XMApp>
                            <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                            <XMTok font="italic" role="UNKNOWN">E</XMTok>
                            <XMApp>
                              <XMTok meaning="times" role="MULOP">⁢</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                              <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                            </XMApp>
                          </XMApp>
                        </XMApp>
                        <XMTok role="CLOSE" stretchy="false">)</XMTok>
                      </XMWrap>
                    </XMDual>
                    <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                  </XMApp>
                  <XMApp>
                    <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                    <XMTok font="italic" role="UNKNOWN">E</XMTok>
                    <XMApp>
                      <XMTok meaning="times" role="MULOP">⁢</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                    </XMApp>
                  </XMApp>
                </XMApp>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>
      </equation>
    </para>
    <para xml:id="S6.p7">
      <p>where <Math mode="inline" tex="O_{ij}" text="O _ (i * j)" xml:id="S6.p7.m1">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">O</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> are the observed frequencies and <Math mode="inline" tex="E_{ij}" text="E _ (i * j)" xml:id="S6.p7.m2">
          <XMath>
            <XMApp>
              <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
              <XMTok font="italic" role="UNKNOWN">E</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
              </XMApp>
            </XMApp>
          </XMath>
        </Math> are the expected frequencies under the null hypothesis of independence. The expected frequencies are calculated as:</p>
    </para>
    <para xml:id="S6.p8">
      <equation xml:id="S6.Ex2">
        <Math mode="display" tex="E_{ij}=\frac{(\text{row total}_{i})(\text{column total}_{j})}{\text{grand total}}" text="E _ (i * j) = ([row total] _ i * [column total] _ j) / [grand total]" xml:id="S6.Ex2.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok role="SUBSCRIPTOP" scriptpos="post1"/>
                <XMTok font="italic" role="UNKNOWN">E</XMTok>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                  <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                </XMApp>
              </XMApp>
              <XMApp>
                <XMTok mathstyle="display" meaning="divide" role="FRACOP"/>
                <XMApp>
                  <XMTok meaning="times" role="MULOP">⁢</XMTok>
                  <XMDual>
                    <XMRef idref="S6.Ex2.m1.1"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S6.Ex2.m1.1">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                        <XMText>row total</XMText>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">i</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                  <XMDual>
                    <XMRef idref="S6.Ex2.m1.2"/>
                    <XMWrap>
                      <XMTok role="OPEN" stretchy="false">(</XMTok>
                      <XMApp xml:id="S6.Ex2.m1.2">
                        <XMTok role="SUBSCRIPTOP" scriptpos="post2"/>
                        <XMText>column total</XMText>
                        <XMTok font="italic" fontsize="70%" role="UNKNOWN">j</XMTok>
                      </XMApp>
                      <XMTok role="CLOSE" stretchy="false">)</XMTok>
                    </XMWrap>
                  </XMDual>
                </XMApp>
                <XMText>grand total</XMText>
              </XMApp>
            </XMApp>
          </XMath>
        </Math>
      </equation>
    </para>
    <para xml:id="S6.p9">
      <p>For each cell, we calculate the expected frequency and then the contribution to the Chi-square statistic. The degrees of freedom for the test are <Math mode="inline" tex="(R-1)(C-1)" text="(R - 1) * (C - 1)" xml:id="S6.p9.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="times" role="MULOP">⁢</XMTok>
              <XMDual>
                <XMRef idref="S6.p9.m1.1"/>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp xml:id="S6.p9.m1.1">
                    <XMTok meaning="minus" role="ADDOP">-</XMTok>
                    <XMTok font="italic" role="UNKNOWN">R</XMTok>
                    <XMTok meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
              <XMDual>
                <XMRef idref="S6.p9.m1.2"/>
                <XMWrap>
                  <XMTok role="OPEN" stretchy="false">(</XMTok>
                  <XMApp xml:id="S6.p9.m1.2">
                    <XMTok meaning="minus" role="ADDOP">-</XMTok>
                    <XMTok font="italic" role="UNKNOWN">C</XMTok>
                    <XMTok meaning="1" role="NUMBER">1</XMTok>
                  </XMApp>
                  <XMTok role="CLOSE" stretchy="false">)</XMTok>
                </XMWrap>
              </XMDual>
            </XMApp>
          </XMath>
        </Math>, where <Math mode="inline" tex="R" text="R" xml:id="S6.p9.m2">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">R</XMTok>
          </XMath>
        </Math> is the number of rows and <Math mode="inline" tex="C" text="C" xml:id="S6.p9.m3">
          <XMath>
            <XMTok font="italic" role="UNKNOWN">C</XMTok>
          </XMath>
        </Math> is the number of columns.</p>
    </para>
    <para xml:id="S6.p10">
      <p>The chi-square test indicated a significant relationship between assessment and group, <Math mode="inline" tex="\chi^{2}(1,927)=5.97" text="chi ^ 2 * open-interval@(1, 927) = 5.97" xml:id="S6.p10.m1">
          <XMath>
            <XMApp>
              <XMTok meaning="equals" role="RELOP">=</XMTok>
              <XMApp>
                <XMTok meaning="times" role="MULOP">⁢</XMTok>
                <XMApp>
                  <XMTok role="SUPERSCRIPTOP" scriptpos="post1"/>
                  <XMTok font="italic" name="chi" role="UNKNOWN">χ</XMTok>
                  <XMTok fontsize="70%" meaning="2" role="NUMBER">2</XMTok>
                </XMApp>
                <XMDual>
                  <XMApp>
                    <XMTok meaning="open-interval"/>
                    <XMRef idref="S6.p10.m1.1"/>
                    <XMRef idref="S6.p10.m1.2"/>
                  </XMApp>
                  <XMWrap>
                    <XMTok role="OPEN" stretchy="false">(</XMTok>
                    <XMTok meaning="1" role="NUMBER" xml:id="S6.p10.m1.1">1</XMTok>
                    <XMTok role="PUNCT">,</XMTok>
                    <XMTok meaning="927" role="NUMBER" xml:id="S6.p10.m1.2">927</XMTok>
                    <XMTok role="CLOSE" stretchy="false">)</XMTok>
                  </XMWrap>
                </XMDual>
              </XMApp>
              <XMTok meaning="5.97" role="NUMBER">5.97</XMTok>
            </XMApp>
          </XMath>
        </Math>, <Math mode="inline" tex="p&lt;.05" text="p less .05" xml:id="S6.p10.m2">
          <XMath>
            <XMApp>
              <XMTok meaning="less-than" role="RELOP">&lt;</XMTok>
              <XMTok font="italic" role="UNKNOWN">p</XMTok>
              <XMTok meaning=".05" role="NUMBER">.05</XMTok>
            </XMApp>
          </XMath>
        </Math>, such that VoE predictions were evaluated as good more often than expected and bad less often than expected. These results support our hypothesis that augmenting the Bloom chatbot with VoE reasoning reduces the model’s error in predicting user inputs.
</p>
    </para>
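    <para>
      <p>The test can be reproduced from the observed counts in the table. The following is a minimal sketch in plain Python, not the analysis code used for this paper; the function name <text font="typewriter">chi_square_2x2</text> and its <text font="typewriter">yates</text> flag are illustrative assumptions. With one degree of freedom, the p-value follows from the complementary error function. Under these assumptions, the uncorrected formula yields a statistic of roughly 6.35, while applying Yates’ continuity correction (which some statistics packages enable by default for 2x2 tables) yields roughly 5.97. This suggests, though the text does not state it, that a continuity correction was applied to obtain the reported value. Both variants fall below the .05 threshold.</p>
    </para>
    <para>
      <verbatim font="typewriter">
```python
import math

def chi_square_2x2(table, yates=False):
    """Pearson chi-square statistic and p-value for a 2x2 table (df = 1)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed_count in enumerate(row):
            # Expected frequency under the null hypothesis of independence.
            expected = row_totals[i] * col_totals[j] / grand_total
            diff = abs(observed_count - expected)
            if yates:
                # Yates' continuity correction: shrink each deviation by 0.5.
                diff = max(diff - 0.5, 0.0)
            statistic += diff ** 2 / expected
    # For df = 1 the chi-square survival function equals erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value

# Observed counts from the table: rows are Good/Bad, columns are VoE/Non-VoE.
observed = [[113, 173], [199, 442]]
plain_stat, plain_p = chi_square_2x2(observed)
yates_stat, yates_p = chi_square_2x2(observed, yates=True)
print(f"uncorrected: chi2 = {plain_stat:.2f}, p = {plain_p:.4f}")
print(f"Yates:       chi2 = {yates_stat:.2f}, p = {yates_p:.4f}")
```
</verbatim>
    </para>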
    <para xml:id="S6.p11">
      <p><text font="bold">Reducing Prediction Errors</text>.
The VoE version showed a significant reduction in prediction errors, generating 22.4% fewer “wrong” predictions than the non-VoE version. Overall, the VoE version exhibited a smoothing effect, enhancing the consistency of predictions. Although there was a slight decrease in “very” predictions, a relative increase of 51% in “somewhat” predictions was observed. This shift suggests an improvement in prediction fidelity, with extreme predictions giving way to more moderate ones.</p>
    </para>
    <figure inlist="lof" labels="LABEL:fig:ggplot-line LABEL:sec:considerations" xml:id="S6.F3">
      <tags>
        <tag>Figure 3</tag>
        <tag role="autoref">Figure 3</tag>
        <tag role="refnum">3</tag>
        <tag role="typerefnum">Figure 3</tag>
      </tags>
      <graphics candidates="summary_plot.png" class="ltx_centering" graphic="summary_plot.png" options="width=173.448pt" xml:id="S6.F3.g1"/>
      <toccaption class="ltx_centering"><tag close=" ">3</tag>Plot of results found in Figure <ref labelref="LABEL:fig:results"/>. VoE smooths the distribution of predictions, reducing prediction error by learning from prior generations. This echoes accounts of human learning described in PC and VoE theories.</toccaption>
      <caption class="ltx_centering"><tag close=": ">Figure 3</tag>Plot of results found in Figure <ref labelref="LABEL:fig:results"/>. VoE smooths the distribution of predictions, reducing prediction error by learning from prior generations. This echoes accounts of human learning described in PC and VoE theories.</caption>
    </figure>
    <para xml:id="S6.p12">
      <p><text font="bold">Considerations</text>.

By design, VoE improves over time. As the vector store becomes populated with more data, the accuracy and relevance of VoE’s outputs are expected to increase, enabling more valuable responses for users.</p>
    </para>
    <para xml:id="S6.p13">
      <p>It’s important to note the latency present in VoE Bloom. This likely contributed to VoE Bloom receiving roughly half as many messages as non-VoE Bloom (329 versus 637). Nevertheless, the fact that we observe a statistically significant difference between the groups despite this discrepancy in sample size is noteworthy.</p>
    </para>
    <para xml:id="S6.p14">
      <p>A number of other practical factors in our data might limit our ability to accurately measure the degree to which user prediction error was minimized. We used data from our conversational AI tutor for this study, and that data is subject to issues faced by all consumer-facing AI applications. This technology is new, and people are still learning how to interface with it. Many users ask Bloom to search the internet, perform mathematical computations, or do other things that aren’t well served by the prompting framework around GPT-4.</p>
    </para>
    <para xml:id="S6.p15">
      <p>Finally, it’s of conceptual interest that LLMs can, from prompting alone, reduce prediction errors via mechanisms similar to those posited by PC and VoE theories of human cognition.</p>
    </para>
  </section>
  <section inlist="toc" xml:id="S7" xml:lang="en">
    <tags>
      <tag>7</tag>
      <tag role="autoref">section 7</tag>
      <tag role="refnum">7</tag>
      <tag role="typerefnum">§7</tag>
    </tags>
    <title><tag close=" ">7</tag>Future Work and Beyond</title>
    <subsection inlist="toc" xml:id="S7.SS1">
      <tags>
        <tag>7.1</tag>
        <tag role="autoref">subsection 7.1</tag>
        <tag role="refnum">7.1</tag>
        <tag role="typerefnum">§7.1</tag>
      </tags>
      <title><tag close=" ">7.1</tag>Improvements</title>
      <para xml:id="S7.SS1.p1">
        <p><text font="bold">Retrieval Augmented Generation</text>.
Currently, our VoE fact retrieval scheme is quite naive. The “thought” generation steps are prompted to generate thoughts <text font="italic">and</text> additional data points that would help improve the prediction. Those additional data points are used as a basic semantic similarity query over a vector store of OpenAI embeddings, from which we select the top <text font="italic">k</text> entries. Much could be done to improve this workflow, from training custom embedding models to improving the retrieval method. We also draw inspiration from the FLARE paper <cite class="ltx_citemacro_cite">[<bibref bibrefs="jiang2023active" separator="," yyseparator=","/>]</cite> and note the improved generation results that come from forecasting a conversation and incorporating that forecast into the context window.</p>
      </para>
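      <para>
        <p>The retrieval step described above can be sketched as a cosine-similarity ranking. This is a minimal illustration under stated assumptions, not Bloom’s implementation: the three-dimensional vectors stand in for real OpenAI embeddings, and the stored facts, the query, and the helper names <text font="typewriter">cosine_similarity</text> and <text font="typewriter">top_k_facts</text> are all hypothetical.</p>
      </para>
      <para>
        <verbatim font="typewriter">
```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def top_k_facts(query_embedding, store, k=2):
    """Return the k stored facts whose embeddings are most similar to the query."""
    ranked = sorted(
        store,
        key=lambda item: cosine_similarity(query_embedding, item["embedding"]),
        reverse=True,
    )
    return [item["fact"] for item in ranked[:k]]

# Toy 3-dimensional vectors stand in for real embeddings (hypothetical data).
store = [
    {"fact": "User is preparing for a calculus exam", "embedding": [0.9, 0.1, 0.0]},
    {"fact": "User prefers worked examples", "embedding": [0.7, 0.6, 0.1]},
    {"fact": "User enjoys science fiction", "embedding": [0.0, 0.2, 0.9]},
]
query = [1.0, 0.2, 0.0]  # embedding of a generated "additional data point"
retrieved = top_k_facts(query, store, k=2)
print(retrieved)
```
</verbatim>
      </para>
      <para>
        <p>A production system would likely delegate the ranking to the vector store itself; the exhaustive sort here is only meant to make the top <text font="italic">k</text> selection explicit.</p>
      </para>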
      <para xml:id="S7.SS1.p2">
        <p><text font="bold">Training/Fine-Tuning</text>.
Just as instruction tuning yielded much improved results in decoder-only LLMs, we believe that ToM tuning could yield better psychological models. Following instructions is a sufficiently abstract task, and making ToM predictions falls into the same category.</p>
      </para>
    </subsection>
    <subsection inlist="toc" labels="LABEL:sec:measuring_coherence" xml:id="S7.SS2">
      <tags>
        <tag>7.2</tag>
        <tag role="autoref">subsection 7.2</tag>
        <tag role="refnum">7.2</tag>
        <tag role="typerefnum">§7.2</tag>
      </tags>
      <title><tag close=" ">7.2</tag>Evaluation</title>
      <para xml:id="S7.SS2.p1">
        <p><text font="bold">Assessing Theory of Mind</text>.
The authors of “Clever Hans or Neural Theory of Mind? Stress Testing Social Reasoning in Large Language Models” <cite class="ltx_citemacro_cite">[<bibref bibrefs="shapira2023clever" separator="," yyseparator=","/>]</cite> explicitly state that “the consequences of the success of these tests do not straightforwardly transfer from humans to models” and speak at length to the evolving landscape of datasets and evaluation methods aimed at machines instead of humans. The debate about whether or not LLMs “have” ToM is likely to continue, and more definitional work is needed, but the utility of this capability is undeniable. Of particular interest is improving the ability of LLMs to minimize user prediction error, as much may become possible as a result of gains in that domain.</p>
      </para>
      <para xml:id="S7.SS2.p2">
        <p><text font="bold">Measuring Coherence</text>.

For this paper, we exclusively use OpenAI’s closed-source models behind their API endpoints. Because of this, we are fundamentally limited in how we can measure user prediction error. To remain consistent, we have the same LLM that generates the ToM predictions also generate a naive assessment of their accuracy, as described further in Section <ref labelref="LABEL:sec:experiment"/>.</p>
      </para>
      <para xml:id="S7.SS2.p3">
        <p>Experiments with open source LLMs allow much more granular evaluation, e.g., computing the conditional loss over a sequence of tokens or creating new datasets by employing human labelers to train an evaluation model. Establishing a more rigorous standard for evaluating ToM predictions on multi-turn interpersonal conversation data is also an imperative area of work.</p>
      </para>
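      <para>
        <p>As a rough illustration of the first option, the conditional loss of a candidate continuation can be computed as the mean negative log-likelihood of its tokens given per-step logits. The sketch below assumes access to raw logits, which an open source model would expose; the logits, the three-token vocabulary, and the function names are hypothetical and chosen only to keep the example self-contained.</p>
      </para>
      <para>
        <verbatim font="typewriter">
```python
import math

def log_softmax(logits):
    """Numerically stable log-probabilities for one decoding step."""
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return [v - log_norm for v in logits]

def conditional_loss(step_logits, target_ids):
    """Mean negative log-likelihood of target tokens given per-step logits."""
    total = 0.0
    for logits, target in zip(step_logits, target_ids):
        total -= log_softmax(logits)[target]
    return total / len(target_ids)

# Hypothetical logits over a 3-token vocabulary for a 2-token continuation.
step_logits = [[2.0, 0.5, 0.1], [0.1, 0.2, 3.0]]
likely = conditional_loss(step_logits, [0, 2])    # tokens the model favors
unlikely = conditional_loss(step_logits, [1, 0])  # tokens the model disfavors
print(likely, unlikely)
```
</verbatim>
      </para>
      <para>
        <p>A lower loss on the user’s actual next message than on alternatives would indicate that the model’s prediction was closer to what the user went on to say.</p>
      </para>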
      <para xml:id="S7.SS2.p4">
        <p>The space of open source models is relatively untested with regard to ToM abilities. A comprehensive study of how the stable of open source models performs on existing tasks is a crucial next step.</p>
      </para>
      <para xml:id="S7.SS2.p5">
        <p>Still further challenges exist in establishing reliable evaluation methods for measuring LLM coherence to users. Each user possesses not only unique psychological properties, but varying levels of awareness of that psychological profile. These subjective limitations demand novel approaches, research into which is only now becoming possible.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S7.SS3">
      <tags>
        <tag>7.3</tag>
        <tag role="autoref">subsection 7.3</tag>
        <tag role="refnum">7.3</tag>
        <tag role="typerefnum">§7.3</tag>
      </tags>
      <title><tag close=" ">7.3</tag>Utility</title>
      <para xml:id="S7.SS3.p1">
        <p><text font="bold">Infrastructure</text>.
In a world of abundant synthetic intelligence, if vertical-specific AI applications remain viable, they will seek to outperform foundational models within their narrow purview. Redundantly solving personalization and psychological modeling problems represents unnecessary development and data governance overhead <text font="italic">and</text> risks contaminating datasets. Nor is it in the security or temporal interest of users to share such data. Horizontal frameworks and protocols are needed to safely and efficiently manage these data flows, improve user experience, and align incentives.</p>
      </para>
      <para xml:id="S7.SS3.p2">
        <p><text font="bold">Products</text>.
The ability to robustly model user psychology and make ToM predictions about internal mental states represents a novel opportunity for the frontier of goods and services. Bespoke multi-modal content generation, high-fidelity human social simulation, on-demand disposable software, atomization of services, instant personalization, and more could all become possible. Much work will be needed to explore this design space.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S7.SS4">
      <tags>
        <tag>7.4</tag>
        <tag role="autoref">subsection 7.4</tag>
        <tag role="refnum">7.4</tag>
        <tag role="typerefnum">§7.4</tag>
      </tags>
      <title><tag close=" ">7.4</tag>Security</title>
      <para xml:id="S7.SS4.p1">
        <p>While ToM data holds powerful personalization potential, the management and use of that data entail profound responsibility and pose significant hazards. Such data, rich with insights into internal user identity and future behavior, suggests immense utility. Yet this utility makes it a likely target for misuse or mishandling, all the more so given the remarkable inferential capabilities of LLMs.</p>
      </para>
      <para xml:id="S7.SS4.p2">
        <p>Security implications are far-reaching, from privacy invasion and identity theft to manipulation and discrimination. Moreover, any breach of trust impacts not just individual users, but also the reputation and success of organizations employing such data. Below is a non-exhaustive list of future work needed to secure such data throughout its lifecycle.</p>
      </para>
      <para xml:id="S7.SS4.p3">
        <p><text font="bold">Encryption and Custody</text>.
Due to the sensitive, individual nature of ToM data, encryption is a bare minimum security requirement, and there are strong arguments to be made for direct user key ownership. Formal investigations into appropriate solutions to both are needed.</p>
      </para>
      <para xml:id="S7.SS4.p4">
        <p>Encryption, the process of transforming plaintext into ciphertext, safeguards the data from access without the key. Plausible candidates include symmetric methods such as the Advanced Encryption Standard (AES), which uses the same key for encryption and decryption, and asymmetric methods such as RSA, which uses a public key for encryption and a private key for decryption <cite class="ltx_citemacro_cite">[<bibref bibrefs="rivest1978method" separator="," yyseparator=","/>]</cite>.</p>
      </para>
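      <para>
        <p>To make the asymmetric case concrete, the sketch below walks through textbook RSA with deliberately tiny primes. It is illustrative only: the primes, exponent, and message are arbitrary choices, no padding is applied, and nothing of this size is secure. A real deployment would rely on an audited cryptographic library rather than hand-written arithmetic.</p>
      </para>
      <para>
        <verbatim font="typewriter">
```python
# Toy textbook RSA with tiny primes: illustrative only, never secure in practice.
p, q = 61, 53
n = p * q                      # public modulus
phi = (p - 1) * (q - 1)        # Euler's totient of n
e = 17                         # public exponent, coprime with phi
d = pow(e, -1, phi)            # private exponent: modular inverse of e (Python 3.8+)

def encrypt(message, public_key):
    """Raise the message to the public exponent modulo n."""
    exponent, modulus = public_key
    return pow(message, exponent, modulus)

def decrypt(ciphertext, private_key):
    """Raise the ciphertext to the private exponent modulo n."""
    exponent, modulus = private_key
    return pow(ciphertext, exponent, modulus)

message = 65                   # must be an integer smaller than n
ciphertext = encrypt(message, (e, n))
recovered = decrypt(ciphertext, (d, n))
print(ciphertext, recovered)
```
</verbatim>
      </para>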
      <para xml:id="S7.SS4.p5">
        <p>The key management model will dictate exactly how encryption is applied to the data. A method such as Shamir’s secret sharing can be used to split the decryption key between a user and a trusted platform hosting the data <cite class="ltx_citemacro_cite">[<bibref bibrefs="dawson1994breadth" separator="," yyseparator=","/>]</cite>. However, the intimate nature of the data may still warrant user ownership of the key, preventing even the platform from accessing the data.</p>
      </para>
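      <para>
        <p>A minimal sketch of such a split follows, assuming the decryption key can be represented as an integer smaller than a fixed prime. The field size, function names, and the two-of-two configuration (one share for the user, one for the platform) are illustrative assumptions rather than a proposed design.</p>
      </para>
      <para>
        <verbatim font="typewriter">
```python
import secrets

PRIME = 2 ** 127 - 1  # a Mersenne prime; the field must be larger than the secret

def split_secret(secret, threshold, num_shares):
    """Split an integer into points on a random polynomial of degree threshold - 1."""
    coefficients = [secret] + [secrets.randbelow(PRIME) for _ in range(threshold - 1)]
    shares = []
    for x in range(1, num_shares + 1):
        y = 0
        for coefficient in reversed(coefficients):  # Horner evaluation mod PRIME
            y = (y * x + coefficient) % PRIME
        shares.append((x, y))
    return shares

def recover_secret(shares):
    """Lagrange-interpolate the polynomial at x = 0 to recover the secret."""
    secret = 0
    for i, (x_i, y_i) in enumerate(shares):
        numerator, denominator = 1, 1
        for j, (x_j, _) in enumerate(shares):
            if i != j:
                numerator = (numerator * -x_j) % PRIME
                denominator = (denominator * (x_i - x_j)) % PRIME
        secret = (secret + y_i * numerator * pow(denominator, -1, PRIME)) % PRIME
    return secret

key = secrets.randbelow(PRIME)  # stand-in for a real decryption key
user_share, platform_share = split_secret(key, threshold=2, num_shares=2)
recovered = recover_secret([user_share, platform_share])
print(recovered == key)
```
</verbatim>
      </para>
      <para>
        <p>With a threshold of two, neither share alone reveals anything about the key, so the platform cannot decrypt without the user’s participation.</p>
      </para>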
      <para xml:id="S7.SS4.p6">
        <p><text font="bold">Confidential Computing</text>.
This relatively new technology encrypts data in use (i.e. during processing). Confidential computing is a step beyond traditional methods that encrypt data at rest and in transit, thus providing a more comprehensive data protection framework. It leverages hardware-based Trusted Execution Environments (TEEs) to protect data during computation, enabling sensitive data to be processed in the cloud or third-party environments without exposing it to the rest of the system <cite class="ltx_citemacro_cite">[<bibref bibrefs="confidential2020confidential" separator="," yyseparator=","/>]</cite>.</p>
      </para>
      <para xml:id="S7.SS4.p7">
        <p>Further work can determine architectures for safely mounting user data into TEEs, decrypting it, and then using it to improve interactions between users and LLMs. Work is also needed on scalable and performant designs that do not sacrifice security. Additional consideration must be given to securely using data with third-party LLM APIs such as OpenAI’s GPT-4, as opposed to self-hosted models.</p>
      </para>
      <para xml:id="S7.SS4.p8">
        <p><text font="bold">Policy-Based Access Control</text>.
Policy-Based Access Control (or Attribute-Based Access Control) is a method used to regulate who or what can view or use resources in a computing environment <cite class="ltx_citemacro_cite">[<bibref bibrefs="hu2013guide" separator="," yyseparator=","/>]</cite>. It is based on creating, managing, and enforcing rules that define the conditions under which access to resources is granted or denied.</p>
      </para>
      <para xml:id="S7.SS4.p9">
        <p>Directions for further inquiry include policies that can be applied to the data to enforce the principle of least privilege for client applications and to prevent data leakage. LLM applications could be used to extend these policies with attributes based on the content of the data, such as grouping by topic.</p>
      </para>
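      <para>
        <p>A deny-by-default policy check of this kind might look like the following sketch. The attribute names, the policy table, and the topic labels are hypothetical; in particular, the topic attribute is assumed to have been assigned to each stored fact beforehand, for example by an LLM classifier.</p>
      </para>
      <para>
        <verbatim font="typewriter">
```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Request:
    client_app: str  # which application is asking
    action: str      # "read" or "write"
    topic: str       # content-derived attribute of the requested data

# Hypothetical policy table: each rule lists the attribute values it permits.
POLICIES = [
    {"client_app": "tutor", "action": "read", "topics": {"learning_style", "goals"}},
    {"client_app": "tutor", "action": "write", "topics": {"learning_style"}},
]

def is_allowed(request, policies=POLICIES):
    """Deny by default; grant only when some rule matches every attribute."""
    return any(
        rule["client_app"] == request.client_app
        and rule["action"] == request.action
        and request.topic in rule["topics"]
        for rule in policies
    )

print(is_allowed(Request("tutor", "read", "goals")))   # matches the first rule
print(is_allowed(Request("tutor", "write", "goals")))  # no rule grants this write
```
</verbatim>
      </para>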
      <para xml:id="S7.SS4.p10">
        <p><text font="bold">Frontier Security</text>.
LLMs’ powerful inference abilities place them in a new category of digital actors. New paradigms of protection and security will be required. LLMs themselves might be leveraged to proactively monitor and obfuscate user activity or destroy unwanted statistical relationships. The advent of instant personalization may even make persistent application-side user accounts irrelevant or unsustainably hazardous.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S7.SS5">
      <tags>
        <tag>7.5</tag>
        <tag role="autoref">subsection 7.5</tag>
        <tag role="refnum">7.5</tag>
        <tag role="typerefnum">§7.5</tag>
      </tags>
      <title><tag close=" ">7.5</tag>Philosophy</title>
      <para xml:id="S7.SS5.p1">
        <p><text font="bold">Extended Self</text>.
Clark and Chalmers argued in 1998 that minds can be said to extend into the physical world and still legitimately be considered part of personal cognition <cite class="ltx_citemacro_cite">[<bibref bibrefs="Clark1998-CLATEM" separator="," yyseparator=","/>]</cite>. High-fidelity human psychological renderings in AI agents suggest the potential for human agency and identity to extend in similar ways. Unanswered legal, metaphysical, and ethical questions arise from this prospect.</p>
      </para>
      <para xml:id="S7.SS5.p2">
        <p><text font="bold">Phenomenology</text>.
When humans impute mental states to others, presumably that assignment is grounded in lived personal experience. That is, we can imagine other people having experiences because we have had similar experiences ourselves. Additionally, we share with the objects of our ToM a genetic schema and physical substrate for intelligence and social cognition.</p>
      </para>
      <para xml:id="S7.SS5.p3">
        <p>While LLMs display ToM abilities and may well have access to orders of magnitude more accounts of internal mental states via the massive corpus of their pretraining data, none of that has been experienced first hand. Leaving aside that current LLMs likely have no mechanism for experience as we conceive of it <cite class="ltx_citemacro_cite">[<bibref bibrefs="chalmers2023large" separator="," yyseparator=","/>]</cite>, what are we to make of ToM in such alien minds?</p>
      </para>
      <para xml:id="S7.SS5.p4">
        <p><text font="bold">Game Theory</text>.
Our experiments and testing protocol assume users are unaware of the model’s predictions about them. As users become aware that models are actively predicting their mental states and behavior, those predictions may become harder to make. Similarly, as LLMs take this into account, simulations will become still more complex.</p>
      </para>
    </subsection>
  </section>
  <section inlist="toc" xml:id="S8" xml:lang="en">
    <tags>
      <tag>8</tag>
      <tag role="autoref">section 8</tag>
      <tag role="refnum">8</tag>
      <tag role="typerefnum">§8</tag>
    </tags>
    <title><tag close=" ">8</tag>Discussion</title>
    <para xml:id="S8.p1">
      <p>Principal-agent problems are a set of well understood coordination failures that emerge from interest misalignment and information asymmetry between persons or groups and their proxies. In normal political and economic life, delegating to an agent incurs costs, and efforts to minimize the associated risk reduce the agent’s efficiency.</p>
    </para>
    <para xml:id="S8.p2">
      <p>We view our very early work in modeling user psychology as ultimately in service of eliminating the certitude of principal-agent problems from economic relations. As LLMs or other AI systems become increasingly capable and autonomous, they offer enormous economic potential. However, their alignment to human principals is not a foregone conclusion. On the contrary, we may instead see an <text font="italic">exaggeration</text> of existing asymmetries between principals and agents, as well as the introduction of new concerns around latency, intelligence, and digital nativity.</p>
    </para>
    <para xml:id="S8.p3">
      <p>In order to achieve trustworthy and efficient agentic AI, <text font="italic">individual</text> alignment is required. Human agents and deterministic software are already capable of operating <text font="italic">like</text> their principals. LLMs promise massive reductions in marginal cost along that axis, but they hardly fare better than the status quo (and often much worse) with regard to user alignment. Yet the unique potential here is agents <text font="italic">who are</text> the principals themselves. That is, there would be no meaningful practical or philosophical difference between discrete humans and the psychologically-aligned AIs acting on their behalf.</p>
    </para>
    <para xml:id="S8.p4">
      <p>LLMs are excellent simulators capable of assuming myriad identities <cite class="ltx_citemacro_cite">[<bibref bibrefs="janus_2022" separator="," yyseparator=","/>]</cite>. They also excel at ToM tasks and, as we’ve shown, can passively harvest and reason about user psychological data. These two interrelated qualities may very well make possible high-fidelity renderings of principals capable of flawlessly <text font="italic">originating</text> and executing intent as their proxies with zero marginal agency cost. In this way LLMs may become more augmentation than tool, more appendage than agent.</p>
    </para>
  </section>
  <section inlist="toc" xml:id="S9" xml:lang="en">
    <tags>
      <tag>9</tag>
      <tag role="autoref">section 9</tag>
      <tag role="refnum">9</tag>
      <tag role="typerefnum">§9</tag>
    </tags>
    <title><tag close=" ">9</tag>Acknowledgements</title>
    <para xml:id="S9.p1">
      <p>The authors are grateful to Ayush Paul and Jacob Van Meter for their work on the Bloom development team, Thomas Howell of Forum Education for extensive conceptual review and ideation, and Zach Seward for invaluable advice and mentoring. We are additionally grateful to Ben Bowman for advising the machine learning aspects of this paper and Lee Ahern from the Bellisario College of Communications at Pennsylvania State University for feedback on the statistical tests and results section.</p>
    </para>
  </section>
</document>
