<?xml version="1.0" encoding="UTF-8"?>
<?latexml searchpaths="/home/japhy/scienceReplication.artiswrong.com/paper_files/arxiv/2602.11460/latex_extracted"?>
<?latexml class="acmart" options="sigconf"?>
<?latexml package="makecell"?>
<?latexml package="float"?>
<?latexml RelaxNGSchema="LaTeXML"?>
<document xmlns="http://dlmf.nist.gov/LaTeXML" class="ltx_authors_1line ltx_leqno">
  <resource src="LaTeXML.css" type="text/css"/>
  <resource src="ltx-article.css" type="text/css"/>
  <resource src="ltx-amsart.css" type="text/css"/>
  <title>ADRD-Bench: A Preliminary LLM Benchmark for Alzheimer’s Disease and Related Dementias</title>
  <creator role="author">
    <personname>Guangxin Zhao</personname>
    <contact role="additional_affiliation"><text class="ltx_affiliation_institution">Electrical Engineering<break/>University of Notre Dame</text>
<text class="ltx_affiliation_city"/>
<text class="ltx_affiliation_state"/>
<text class="ltx_affiliation_country"/>
</contact>
    <contact role="email">gzhao23@nd.edu</contact>
  </creator>
  <creator before=", " role="author">
    <personname>Jiahao Zheng</personname>
    <contact role="additional_affiliation"><text class="ltx_affiliation_institution">Computer Science and Engineering<break/>University of Notre Dame</text>
<text class="ltx_affiliation_city"/>
<text class="ltx_affiliation_state"/>
<text class="ltx_affiliation_country"/>
</contact>
    <contact role="email">jzheng7@nd.edu</contact>
  </creator>
  <creator before=", " role="author">
    <personname>Malaz Boustani</personname>
    <contact role="additional_affiliation"><text class="ltx_affiliation_institution">School of Medicine<break/>Indiana University</text>
<text class="ltx_affiliation_city"/>
<text class="ltx_affiliation_state"/>
<text class="ltx_affiliation_country"/>
</contact>
    <contact role="email">mboustan@iu.edu</contact>
  </creator>
  <creator before=", " role="author">
    <personname>Jarek Nabrzyski</personname>
    <contact role="additional_affiliation"><text class="ltx_affiliation_institution">Computer Science and Engineering<break/>University of Notre Dame</text>
<text class="ltx_affiliation_city"/>
<text class="ltx_affiliation_state"/>
<text class="ltx_affiliation_country"/>
</contact>
    <contact role="email">naber@nd.edu</contact>
  </creator>
  <creator before=", " role="author">
    <personname>Meng Jiang</personname>
    <contact role="additional_affiliation"><text class="ltx_affiliation_institution">Computer Science and Engineering<break/>University of Notre Dame</text>
<text class="ltx_affiliation_city"/>
<text class="ltx_affiliation_state"/>
<text class="ltx_affiliation_country"/>
</contact>
    <contact role="email">mjiang2@nd.edu</contact>
  </creator>
  <creator before=", " role="author">
    <personname>Yiyu Shi</personname>
    <contact role="additional_affiliation"><text class="ltx_affiliation_institution">Computer Science and Engineering<break/>University of Notre Dame</text>
<text class="ltx_affiliation_city"/>
<text class="ltx_affiliation_state"/>
<text class="ltx_affiliation_country"/>
</contact>
    <contact role="email">yshi4@nd.edu</contact>
  </creator>
  <creator before=" and " role="author">
    <personname>Zhi Zheng</personname>
    <contact role="additional_affiliation"><text class="ltx_affiliation_institution">Electrical Engineering<break/>University of Notre Dame</text>
<text class="ltx_affiliation_city"/>
<text class="ltx_affiliation_state"/>
<text class="ltx_affiliation_country"/>
</contact>
    <contact role="email">zzheng3@nd.edu</contact>
  </creator>
  <abstract name="Abstract.">
    <p>Large language models (LLMs) have shown great potential for healthcare applications. However, existing evaluation benchmarks provide minimal coverage of Alzheimer’s Disease and Related Dementias (ADRD). To address this gap, we introduce ADRD-Bench, the first ADRD-specific benchmark dataset designed for rigorous evaluation of LLMs. ADRD-Bench has two components: 1) ADRD Unified QA, a synthesis of 1,352 questions consolidated from seven established medical benchmarks, providing a unified assessment of clinical knowledge; and 2) ADRD Caregiving QA, a novel set of 149 questions derived from the Aging Brain Care (ABC) program, a widely used, evidence-based brain health management program. Guided by a program with national expertise in comprehensive ADRD care, this new set was designed to mitigate the lack of practical caregiving context in existing benchmarks. We evaluated 33 state-of-the-art LLMs on the proposed ADRD-Bench. Results showed that the accuracy of open-weight general models ranged from 0.63 to 0.93 (mean: 0.78; std: 0.09). The accuracy of open-weight medical models ranged from 0.48 to 0.93 (mean: 0.82; std: 0.13). The accuracy of closed-source general models ranged from 0.83 to 0.91 (mean: 0.89; std: 0.03). While top-tier models achieved high accuracies (&gt;0.9), case studies revealed that inconsistent reasoning quality and stability limit their reliability, highlighting a critical need for domain-specific improvement to enhance LLMs’ knowledge and reasoning grounded in daily caregiving data. The entire dataset is available at <ref class="ltx_url" font="typewriter" href="https://github.com/IIRL-ND/ADRD-Bench">https://github.com/IIRL-ND/ADRD-Bench</ref>.</p>
  </abstract>
  <keywords>Large Language Models, Alzheimer’s Disease and Related Dementias (ADRD), Medical Benchmark</keywords>
  <note role="copyright">rightsretained</note>
  <section inlist="toc" labels="LABEL:sec:intro" xml:id="S1">
    <tags>
      <tag>1</tag>
      <tag role="autoref">section 1</tag>
      <tag role="refnum">1</tag>
      <tag role="typerefnum">§1</tag>
    </tags>
    <title><tag close=". ">1</tag>Introduction</title>
    <toctitle><tag close=" ">1</tag>Introduction</toctitle>
    <para xml:id="S1.p1">
      <p>Large language models (LLMs) have achieved remarkable progress in healthcare applications. They have been used to support clinical decision-making, provide medical education <cite class="ltx_citemacro_citep">(<bibref bibrefs="zhou_survey_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, improve diagnostic reasoning, and enhance patient care <cite class="ltx_citemacro_citep">(<bibref bibrefs="maity_large_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. For example, the GPT series has demonstrated the ability to answer medical licensure examinations at or near the passing threshold <cite class="ltx_citemacro_citep">(<bibref bibrefs="kung_performance_2023,nori_capabilities_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>; PaLM <cite class="ltx_citemacro_citep">(<bibref bibrefs="chowdhery_palm_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite> and its variants achieved encouraging performance on specialty-specific tasks <cite class="ltx_citemacro_citep">(<bibref bibrefs="singhal_large_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. However, using LLMs in healthcare carries substantial risks due to potential LLM hallucinations <cite class="ltx_citemacro_citep">(<bibref bibrefs="tang_evaluating_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>, which might be harmless in the general domain but can be dangerous in healthcare, where high precision and rigor are required <cite class="ltx_citemacro_citep">(<bibref bibrefs="pal_open_medical_llm_leaderboard_2024" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Research has pointed out that LLMs providing incorrect drug dosages or misdiagnosing conditions may jeopardize patient safety <cite class="ltx_citemacro_citep">(<bibref bibrefs="kim_medical_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Therefore, rigorous evaluations are necessary before LLM deployment. Recent surveys of medical LLMs also highlighted mitigating hallucinations and establishing domain-specific benchmarks as critical steps for safety in clinical practice <cite class="ltx_citemacro_citep">(<bibref bibrefs="karabacak_embracing_2023,singhal_large_2023,zhou_survey_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>.</p>
    </para>
    <para xml:id="S1.p2">
      <p>Alzheimer’s Disease and Related Dementias (ADRD) represent a critical medical domain, with over 57 million people worldwide living with ADRD as of 2021 and nearly 10 million new cases diagnosed each year <cite class="ltx_citemacro_citep">(<bibref bibrefs="who_dementia_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Managing ADRD is not only a medical but also a societal challenge, as it heavily involves caregiving, long-term support, and complex decision-making by families <cite class="ltx_citemacro_citep">(<bibref bibrefs="alzheimers_association_facts_2024" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. LLMs are promising for ADRD care because they can communicate in natural language and provide cognitive support <cite class="ltx_citemacro_citep">(<bibref bibrefs="treder_introduction_2024" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. For example, with proper design, LLM-driven content may boost patients’ cognitive abilities and enrich interactions, while offering opportunities to improve diagnostic strategies and patient-caregiver communication <cite class="ltx_citemacro_citep">(<bibref bibrefs="treder_introduction_2024" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Like any other medical device, LLMs for ADRD care should be evaluated for technical performance before planning for real-world deployment <cite class="ltx_citemacro_citep">(<bibref bibrefs="who_ai_ethics_2024" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. However, to the best of our knowledge, no ADRD-specific LLM benchmark exists for this purpose. Some existing medical benchmarks include ADRD-related items; however, our investigation showed that these items constitute only a tiny proportion of the total content, mostly under 1%, indicating that the ADRD domain is largely absent from current LLM performance evaluations. In addition, we found that existing datasets contain almost no caregiving questions. Because LLMs are often intended to support caregivers, this omission leaves a substantial gap for real-world applications.</p>
    </para>
    <para xml:id="S1.p3">
      <p>To address this gap, we introduce an ADRD-specific benchmark for evaluating LLMs. This benchmark has two parts, the ADRD Unified QA set and the ADRD Caregiving QA set, constructed from two complementary sources. The ADRD Unified QA set includes 1,352 ADRD-related items extracted from seven existing benchmark datasets. By aggregating these scattered items into a single set, we create a concentrated evaluation of how well current models handle ADRD knowledge that was previously only sparsely represented. The ADRD Caregiving QA set includes 149 new questions derived from the well-validated Aging Brain Care (ABC) program, one of the leading evidence-based dementia care models in the US <cite class="ltx_citemacro_citep">(<bibref bibrefs="alz_association_dementia_2024,alder2024indiana,lamantia_aging_2015" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. From de-identified ABC caregiver education materials, we crafted question-answer pairs that reflect realistic caregiving content, including daily care situations (e.g., managing a patient’s aggressive behavior at home). The design of the questions was guided by a senior clinician with national expertise in comprehensive ADRD care to ensure that the questions are grounded in expert-verified domain knowledge and caregiving strategies.</p>
    </para>
    <para xml:id="S1.p4">
      <p>In summary, our work presents the first benchmark tailored to ADRD tasks. It brings together a wide range of question types (e.g., factual Q&amp;A, clinical reasoning, caregiver guidance) under the ADRD umbrella. This resource enables a focused technical assessment of LLM capabilities in a domain where factual accuracy and empathetic reasoning are paramount. We hope that evaluating LLMs on ADRD-specific tasks will help identify their current capabilities and limitations in this domain and drive the development of safer, more reliable medical AI for the science of ADRD care.</p>
    </para>
  </section>
  <section inlist="toc" labels="LABEL:sec:related" xml:id="S2">
    <tags>
      <tag>2</tag>
      <tag role="autoref">section 2</tag>
      <tag role="refnum">2</tag>
      <tag role="typerefnum">§2</tag>
    </tags>
    <title><tag close=". ">2</tag>Related Work</title>
    <toctitle><tag close=" ">2</tag>Related Work</toctitle>
    <subsection inlist="toc" xml:id="S2.SS1">
      <tags>
        <tag>2.1</tag>
        <tag role="autoref">subsection 2.1</tag>
        <tag role="refnum">2.1</tag>
        <tag role="typerefnum">§2.1</tag>
      </tags>
      <title><tag close=". ">2.1</tag>Current Medical Benchmarks</title>
      <toctitle><tag close=" ">2.1</tag>Current Medical Benchmarks</toctitle>
      <para xml:id="S2.SS1.p1">
        <p>The advent of GPT-3 and successor models spurred extensive research into medical applications of LLMs <cite class="ltx_citemacro_citep">(<bibref bibrefs="zhou_survey_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, leading to the development of many benchmarks to evaluate LLMs’ performance. General multitask evaluations such as MMLU assess knowledge across 57 subjects (from math and history to law and medicine), providing a broad measure of academic and professional understanding <cite class="ltx_citemacro_citep">(<bibref bibrefs="hendrycks_measuring_2020" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. Medical-specific benchmarks have also emerged. Researchers used United States Medical Licensing Examination (USMLE) question banks to assess models, where GPT-3.5/ChatGPT achieved approximately 60% accuracy (at or near the passing threshold) without specialized training <cite class="ltx_citemacro_citep">(<bibref bibrefs="kung_performance_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. Medical Q&amp;A datasets such as PubMedQA <cite class="ltx_citemacro_citep">(<bibref bibrefs="jin_pubmedqa_2019" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, HEAD-QA <cite class="ltx_citemacro_citep">(<bibref bibrefs="vilares_head-qa_2019" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, MedQA <cite class="ltx_citemacro_citep">(<bibref bibrefs="jin_what_2021" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, MedMCQA <cite class="ltx_citemacro_citep">(<bibref bibrefs="pal_medmcqa_2022" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, and Medbullets <cite class="ltx_citemacro_citep">(<bibref bibrefs="chen_benchmarking_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> have been used to quantify clinical knowledge recall. More recently, Google’s MultiMedQA benchmark combined several of these resources, spanning professional medical exams, research questions, and consumer health queries, to holistically evaluate medical LLM knowledge <cite class="ltx_citemacro_citep">(<bibref bibrefs="singhal_large_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. OpenAI’s HealthBench <cite class="ltx_citemacro_citep">(<bibref bibrefs="arora_healthbench_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> focuses on open-ended generation tasks, utilizing 48,562 unique rubric criteria to enable more realistic and granular clinical evaluations.</p>
      </para>
      <para xml:id="S2.SS1.p2">
        <p>These benchmarks have driven rapid progress; for instance, an instruction-tuned PaLM model (Flan-PaLM) achieved state-of-the-art (SOTA) performance on multiple-choice questions, reaching 67.6% accuracy on MedQA <cite class="ltx_citemacro_citep">(<bibref bibrefs="singhal_large_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. Despite these gains, high scores on structured, exam-style questions do not necessarily translate to real-world clinical competence <cite class="ltx_citemacro_citep">(<bibref bibrefs="kim_limitations_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. Human evaluations reveal persistent reasoning errors, limited robustness, and a lack of trustworthiness outside narrow formats <cite class="ltx_citemacro_citep">(<bibref bibrefs="singhal_large_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. This has reinforced the consensus that new evaluation frameworks are needed, ones that extend beyond exam accuracy to assess real-world clinical competencies <cite class="ltx_citemacro_citep">(<bibref bibrefs="zhou_survey_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S2.SS1.p3">
        <p>In response, researchers have begun creating specialized evaluation sets targeting particular failure modes. MEDEC <cite class="ltx_citemacro_citep">(<bibref bibrefs="ben_abacha_medec_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> is the first publicly available benchmark for medical error detection and correction in clinical notes. MedHallu <cite class="ltx_citemacro_citep">(<bibref bibrefs="pandit_medhallu_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> focuses on medical hallucination detection. It comprises question-answer pairs from PubMedQA with deliberately fabricated answers and challenges a model to judge their correctness; results show that even top models struggle to reliably identify incorrect medical statements. In addition, as shown in Table <ref labelref="LABEL:tab:1"/>, except for MEDIQA <cite class="ltx_citemacro_citep">(<bibref bibrefs="abacha_overview_2019" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, a small dataset, the proportion of ADRD-related questions is below 2% in every benchmark and under 1% in most. Collectively, these findings indicate that ADRD-specific evaluation remains underdeveloped. We aim to address this gap by introducing a benchmark focused on the ADRD domain, which has been largely overlooked in prior work.</p>
      </para>
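        <para>
          <p>The proportions reported in Table 1 can be re-derived with a few lines of arithmetic. The following is a minimal sketch, assuming the question counts are transcribed exactly as they appear in the table; the names TABLE_1, adrd_share, and group_totals are illustrative and are not part of the released dataset or its tooling.</p>
        </para>

```python
# Counts transcribed from Table 1 as (total questions, ADRD-related questions).
# TABLE_1, adrd_share, and group_totals are hypothetical names used only here.
TABLE_1 = {
    "Non-Generation Tasks": {
        "MEDCALC-BENCH": (11000, 112),
        "MEDEC": (3360, 29),
        "HEAD-QA": (6765, 41),
        "Medbullets": (616, 12),
        "MedQA": (12723, 98),
        "MedMCQA": (193155, 1148),
        "PubMedQA": (1000, 12),
        "MedHallu": (1000, 12),
    },
    "Generation Tasks": {
        "HealthBench": (5000, 46),
        "MedicationQA": (690, 3),
        "MedDialog": (181122, 909),
        "MEDIQA": (383, 50),
    },
}


def adrd_share(total, adrd):
    """Return the ADRD-related share of a benchmark as a percentage."""
    return 100.0 * adrd / total


def group_totals(rows):
    """Sum total and ADRD-related question counts over one task group."""
    total = sum(t for t, _ in rows.values())
    adrd = sum(a for _, a in rows.values())
    return total, adrd


if __name__ == "__main__":
    for group, rows in TABLE_1.items():
        print(group)
        for name, (total, adrd) in rows.items():
            share = adrd_share(total, adrd)
            print(f"  {name:14s} {adrd:6d} / {total:7d} = {share:5.2f}%")
        total, adrd = group_totals(rows)
        share = adrd_share(total, adrd)
        print(f"  {'Total':14s} {adrd:6d} / {total:7d} = {share:5.2f}%")
```

        <para>
          <p>Under these assumptions, the group totals come out to 1,464 of 229,619 questions (0.64%) for non-generation tasks and 1,008 of 187,195 questions (0.54%) for generation tasks, matching the totals in Table 1.</p>
        </para>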
      <table inlist="lot" labels="LABEL:tab:1" placement="t" xml:id="S2.T1">
        <tags>
          <tag><text fontsize="90%">Table 1</text></tag>
          <tag role="autoref"><text fontsize="90%">Table 1</text></tag>
          <tag role="refnum"><text fontsize="90%">1</text></tag>
          <tag role="typerefnum"><text fontsize="90%">Table 1</text></tag>
        </tags>
        <toccaption class="ltx_centering"><tag close=" "><text fontsize="90%">1</text></tag><text fontsize="90%">Number of total and ADRD-related questions in commonly used public healthcare benchmarks.</text></toccaption>
        <caption class="ltx_centering" fontsize="90%"><tag close=". ">Table 1</tag>Number of total and ADRD-related questions in commonly used public healthcare benchmarks.</caption>
        <tabular class="ltx_centering" vattach="middle">
          <tr>
            <td align="left" border="r t"><text font="bold" fontsize="90%">Benchmark Name</text></td>
            <td align="right" border="t"><text font="bold" fontsize="90%">Number of Questions</text></td>
            <td align="right" border="t"><text font="bold" fontsize="90%">Number of ADRD-Related Questions</text></td>
          </tr>
          <tr>
            <td align="left" border="t" colspan="3"><text font="bold" fontsize="90%">Non-Generation Tasks</text></td>
          </tr>
          <tr>
            <td align="left" border="r t"><text fontsize="90%">MEDCALC-BENCH</text></td>
            <td align="right" border="t"><text fontsize="90%">11,000</text></td>
            <td align="right" border="t"><text fontsize="90%">112 (1.02%)</text></td>
          </tr>
          <tr>
            <td align="left" border="r"><text fontsize="90%">MEDEC</text></td>
            <td align="right"><text fontsize="90%">3,360</text></td>
            <td align="right"><text fontsize="90%">29 (0.86%)</text></td>
          </tr>
          <tr>
            <td align="left" border="r"><text fontsize="90%">HEAD-QA</text></td>
            <td align="right"><text fontsize="90%">6,765</text></td>
            <td align="right"><text fontsize="90%">41 (0.61%)</text></td>
          </tr>
          <tr>
            <td align="left" border="r"><text fontsize="90%">Medbullets</text></td>
            <td align="right"><text fontsize="90%">616</text></td>
            <td align="right"><text fontsize="90%">12 (1.95%)</text></td>
          </tr>
          <tr>
            <td align="left" border="r"><text fontsize="90%">MedQA</text></td>
            <td align="right"><text fontsize="90%">12,723</text></td>
            <td align="right"><text fontsize="90%">98 (0.77%)</text></td>
          </tr>
          <tr>
            <td align="left" border="r"><text fontsize="90%">MedMCQA</text></td>
            <td align="right"><text fontsize="90%">193,155</text></td>
            <td align="right"><text fontsize="90%">1,148 (0.59%)</text></td>
          </tr>
          <tr>
            <td align="left" border="r"><text fontsize="90%">PubMedQA</text></td>
            <td align="right"><text fontsize="90%">1,000</text></td>
            <td align="right"><text fontsize="90%">12 (1.20%)</text></td>
          </tr>
          <tr>
            <td align="left" border="r"><text fontsize="90%">MedHallu</text></td>
            <td align="right"><text fontsize="90%">1,000</text></td>
            <td align="right"><text fontsize="90%">12 (1.20%)</text></td>
          </tr>
          <tr>
            <td align="left" border="r"><text font="bold" fontsize="90%">Total</text></td>
            <td align="right"><text font="bold" fontsize="90%">229,619</text></td>
            <td align="right"><text font="bold" fontsize="90%">1,464 (0.64%)</text></td>
          </tr>
          <tr>
            <td align="left" border="t" colspan="3"><text fontsize="90%"> </text><text font="bold" fontsize="90%">Generation Tasks</text></td>
          </tr>
          <tr>
            <td align="left" border="r t"><text fontsize="90%">HealthBench</text></td>
            <td align="right" border="t"><text fontsize="90%">5,000</text></td>
            <td align="right" border="t"><text fontsize="90%">46 (0.92%)</text></td>
          </tr>
          <tr>
            <td align="left" border="r"><text fontsize="90%">MedicationQA</text></td>
            <td align="right"><text fontsize="90%">690</text></td>
            <td align="right"><text fontsize="90%">3 (0.43%)</text></td>
          </tr>
          <tr>
            <td align="left" border="r"><text fontsize="90%">MedDialog</text></td>
            <td align="right"><text fontsize="90%">181,122</text></td>
            <td align="right"><text fontsize="90%">909 (0.50%)</text></td>
          </tr>
          <tr>
            <td align="left" border="r"><text fontsize="90%">MEDIQA</text></td>
            <td align="right"><text fontsize="90%">383</text></td>
            <td align="right"><text fontsize="90%">50 (13.05%)</text></td>
          </tr>
          <tr>
            <td align="left" border="b r"><text font="bold" fontsize="90%">Total</text></td>
            <td align="right" border="b"><text font="bold" fontsize="90%">187,195</text></td>
            <td align="right" border="b"><text font="bold" fontsize="90%">1,008 (0.54%)</text></td>
          </tr>
        </tabular>
      </table>
    </subsection>
    <subsection inlist="toc" xml:id="S2.SS2">
      <tags>
        <tag>2.2</tag>
        <tag role="autoref">subsection 2.2</tag>
        <tag role="refnum">2.2</tag>
        <tag role="typerefnum">§2.2</tag>
      </tags>
      <title><tag close=". ">2.2</tag>LLMs for ADRD Applications</title>
      <toctitle><tag close=" ">2.2</tag>LLMs for ADRD Applications</toctitle>
      <para xml:id="S2.SS2.p1">
        <p>Natural language processing research in the ADRD field has traditionally been dominated by ADRD detection tasks (e.g., automated screening and diagnosis). For example, many works focus on screening for cognitive impairment <cite class="ltx_citemacro_citep">(<bibref bibrefs="peled-cohen_systematic_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, and numerous studies tackle the diagnosis of dementia from patient speech or text <cite class="ltx_citemacro_citep">(<bibref bibrefs="agbavor_predicting_2022" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. A recent systematic review found that about 56% of ADRD-related NLP studies focus on detecting cognitive impairment, such as through classifying transcripts <cite class="ltx_citemacro_citep">(<bibref bibrefs="peled-cohen_systematic_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. In contrast, far fewer studies have explored NLP for ADRD care and support. The same review identifies two emerging task categories beyond detection: caregiver support and patient assistance. However, only <Math mode="inline" tex="\sim" text="similar-to" xml:id="S2.SS2.p1.m1">
            <XMath>
              <XMTok meaning="similar-to" name="sim" role="RELOP">∼</XMTok>
            </XMath>
          </Math>6% of existing works fell under caregiver support, highlighting the nascent state of this area <cite class="ltx_citemacro_citep">(<bibref bibrefs="peled-cohen_systematic_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S2.SS2.p2">
        <p>Nevertheless, there is growing interest in applying NLP and LLMs to ADRD. Recent work has explored caregiver-oriented dialogue systems, such as ADQueryAid <cite class="ltx_citemacro_citep">(<bibref bibrefs="hasan_empowering_2024" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, which leverages ADRD-specific knowledge and retrieval augmentation to provide more relevant and supportive responses than general-purpose models, as well as domain-adapted language models such as the Caregiving Language Model (CaLM) <cite class="ltx_citemacro_citep">(<bibref bibrefs="parmanto_reliable_2024" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, fine-tuned on dementia care data. Beyond direct Q&amp;A, LLMs have also been applied to analyze caregiver narratives and clinical texts, for example to identify overburdened informal caregivers from long-term care notes. Collectively, these studies illustrate the emerging interest in language technologies beyond detection tasks while highlighting the need for rigorous, ADRD-specific evaluation.</p>
      </para>
      <para xml:id="S2.SS2.p3">
        <p>Existing medical benchmarks provide very limited coverage of ADRD-related tasks, and questions that evaluate models in daily caregiving scenarios are scarcer still. As a result, highly relevant ADRD caregiving questions, such as how to respond when a patient becomes agitated at night, are absent from current evaluations, leaving little systematic, quantitative evidence of how LLMs perform in the realities of daily caregiving. To address this gap, we construct an evaluation set that integrates the fragmented, knowledge-oriented questions from seven existing healthcare benchmarks with a complementary set of daily caregiving questions grounded in the ABC model. Overall, this new benchmark dataset provides an initial testbed for assessing LLM performance across both ADRD clinical knowledge and the practical, human-centered aspects of daily care, enabling more realistic, rigorous, and targeted research on ADRD-related LLM applications.</p>
      </para>
    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:construction" xml:id="S3">
    <tags>
      <tag>3</tag>
      <tag role="autoref">section 3</tag>
      <tag role="refnum">3</tag>
      <tag role="typerefnum">§3</tag>
    </tags>
    <title><tag close=". ">3</tag>ADRD Benchmark Construction</title>
    <toctitle><tag close=" ">3</tag>ADRD Benchmark Construction</toctitle>
    <subsection inlist="toc" xml:id="S3.SS1">
      <tags>
        <tag>3.1</tag>
        <tag role="autoref">subsection 3.1</tag>
        <tag role="refnum">3.1</tag>
        <tag role="typerefnum">§3.1</tag>
      </tags>
      <title><tag close=". ">3.1</tag>ADRD Unified QA</title>
      <toctitle><tag close=" ">3.1</tag>ADRD Unified QA</toctitle>
      <para xml:id="S3.SS1.p1">
        <p>To identify ADRD-related questions in existing benchmarks, we searched 12 public benchmark datasets using the ADRD domain keywords “Alzheimer’s”, “Dementia”, “ADRD”, and “AD”. Except for MEDIQA, the proportion of ADRD-related questions in each benchmark is below 2%, and most are under 1%. MEDIQA reaches 13.05%, but because it contains only 383 questions, this amounts to just 50 questions. Across all benchmarks, ADRD-related questions account for 0.64% of non-generation task questions and 0.54% of generation task questions. In addition, a manual check found only seven questions in total about caregiving tasks. Here, a caregiver can be either formal or informal: a family member or close friend who provides paid or unpaid assistance with daily living and cognitive support, or non-nurse, non-medical licensed personnel (e.g., community health workers, certified nurse aides) who deliver such care. These low percentages show that existing medical benchmarks contain only a small proportion of ADRD-relevant content, especially on caregiving tasks, indicating a substantial gap and leaving significant room for future exploration. The results are shown in Table <ref labelref="LABEL:tab:1"/>.</p>
      </para>
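      <para>
        <p>The keyword search described above can be sketched as follows. This is a minimal illustration, not the authors' script: the source names the four keywords but not the matching rules, so the case handling, the whole-word treatment of the abbreviations, and the helper names <text font="typewriter">is_adrd_related</text> and <text font="typewriter">adrd_share</text> are all assumptions.</p>
      </para>

```python
import re

# Keywords named in the text: "Alzheimer's", "Dementia", "ADRD", "AD".
# The matching rules below are assumptions; the source does not specify them.

# Assumption: abbreviations match case-sensitively as whole words,
# so "AD" does not fire on "ad", "Add", or "ADHD".
ABBREVIATIONS = re.compile(r"\b(?:ADRD|AD)\b")

# Assumption: full terms match case-insensitively by prefix, which also
# covers "Alzheimer's", "Alzheimer disease", and "dementias".
FULL_TERMS = re.compile(r"\b(?:alzheimer|dementia)", re.IGNORECASE)


def is_adrd_related(question: str) -> bool:
    """Return True if the question text mentions any ADRD keyword."""
    return bool(ABBREVIATIONS.search(question) or FULL_TERMS.search(question))


def adrd_share(questions: list[str]) -> tuple[int, float]:
    """Return (number of ADRD-related questions, percentage of the dataset)."""
    hits = sum(is_adrd_related(q) for q in questions)
    pct = 100.0 * hits / len(questions) if questions else 0.0
    return hits, pct
```

      <para>
        <p>The percentage is the hit count divided by the dataset size; for MEDIQA, 50 of 383 questions gives 13.05%. A short abbreviation such as “AD” is prone to false positives under any matching rule, so keyword hits may still warrant manual review.</p>
      </para>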
      <para xml:id="S3.SS1.p2">
        <p>We consolidated ADRD questions from seven existing non-generation task benchmarks in Table <ref labelref="LABEL:tab:1"/>, including PubMedQA <cite class="ltx_citemacro_citep">(<bibref bibrefs="jin_pubmedqa_2019" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, HEAD-QA <cite class="ltx_citemacro_citep">(<bibref bibrefs="vilares_head-qa_2019" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, MedBullets <cite class="ltx_citemacro_citep">(<bibref bibrefs="chen_benchmarking_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, MedMCQA <cite class="ltx_citemacro_citep">(<bibref bibrefs="pal_medmcqa_2022" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, MedQA <cite class="ltx_citemacro_citep">(<bibref bibrefs="jin_what_2021" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, MEDEC <cite class="ltx_citemacro_citep">(<bibref bibrefs="ben_abacha_medec_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, and MedHallu <cite class="ltx_citemacro_citep">(<bibref bibrefs="pandit_medhallu_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. These datasets are among the leading healthcare benchmarks used to test state-of-the-art (SOTA) LLMs <cite class="ltx_citemacro_citep">(<bibref bibrefs="kim_questioning_2025,singhal_large_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. The result is a unified question-and-answer set, named ADRD Unified QA. We retain the original benchmark content without modification. MEDCALC-BENCH <cite class="ltx_citemacro_citep">(<bibref bibrefs="khandekar_medcalc-bench_2024" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> is excluded from our benchmark because we focus on evaluating models’ ADRD-related knowledge and reasoning capabilities, whereas this dataset emphasizes numerical calculation and arithmetic reasoning. In addition, ADRD Unified QA does not include generation tasks, as existing evaluation metrics are insufficient to reliably and consistently assess the quality of LLM responses <cite class="ltx_citemacro_citep">(<bibref bibrefs="singhal_large_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>.</p>
      </para>
      <para xml:id="S3.SS1.p3">
        <p>ADRD Unified QA therefore consists of two primary question types. The first comprises multiple-choice questions that evaluate models’ knowledge of ADRD-related concepts, drawn from PubMedQA, HEAD-QA, MedBullets, MedMCQA, and MedQA. The second focuses on error detection, assessing models’ ability to identify hallucinations and factual errors in textual inputs, drawn from MEDEC and MedHallu.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S3.SS2">
      <tags>
        <tag>3.2</tag>
        <tag role="autoref">subsection 3.2</tag>
        <tag role="refnum">3.2</tag>
        <tag role="typerefnum">§3.2</tag>
      </tags>
      <title><tag close=". ">3.2</tag>ADRD Caregiving QA</title>
      <toctitle><tag close=" ">3.2</tag>ADRD Caregiving QA</toctitle>
      <para xml:id="S3.SS2.p1">
        <p>One notable limitation of existing benchmarks is that ADRD questions are almost exclusively framed as clinical knowledge or exam-style reasoning tasks, with virtually no coverage of daily caregiving. For instance, a manual check of the seven datasets used in Section 3.1 yielded only seven caregiving items. However, managing ADRD in real-world settings extends far beyond diagnosis and pharmacological treatment and largely revolves around day-to-day caregiving challenges such as behavioral management, communication with patients, safety supervision, and caregiver decision-making in home and community settings.</p>
      </para>
      <para xml:id="S3.SS2.p2">
        <p>To address this gap, we constructed an additional ADRD Caregiving QA set based on materials from the Aging Brain Care (ABC) program. The ABC program is a comprehensive, evidence-based model of care for Alzheimer’s disease and related dementias (ADRD) that has been continuously developed, evaluated, and implemented for nearly two decades <cite class="ltx_citemacro_citep">(<bibref bibrefs="alz_association_dementia_2024,alder2024indiana,lamantia_aging_2015" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. Supported initially by the U.S. Agency for Healthcare Research and Quality, the ABC model was designed as a longitudinal, biopsychosocial intervention to address the complex clinical, behavioral, and caregiver challenges associated with ADRD. In 2006, the program demonstrated efficacy in a randomized controlled trial, showing significant reductions in behavioral and psychological symptoms of dementia, caregiver burden, and inappropriate medication use among underserved populations with high social frailty <cite class="ltx_citemacro_citep">(<bibref bibrefs="callahan_effectiveness_2006" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. Since then, the ABC program has been continuously operated, refined, and scaled, serving more than 5,000 patient–caregiver dyads <cite class="ltx_citemacro_citep">(<bibref bibrefs="boustani2025effectiveness,lamantia_aging_2015,french_healthy_2014" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. Recognized by the National Academies of Medicine, ABC integrates interdisciplinary team-based care, non-pharmacologic interventions, caregiver support, medication optimization, care coordination, and community partnerships across the full ADRD trajectory. ABC directly informed the development of an alternative payment approach for comprehensive dementia care, later adopted by the U.S. Centers for Medicare &amp; Medicaid Services through the GUIDE Model <cite class="ltx_citemacro_citep">(<bibref bibrefs="haggerty_guide_2025,boustani_alternative_2019" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>. As such, the ABC program represents a mature, rigorously evaluated, and policy-relevant standard for high-quality ADRD care.</p>
      </para>
      <para xml:id="S3.SS2.p3">
        <p>The ABC program therefore serves as a solid foundation from which to extract benchmark questions. The ADRD Caregiving QA set covers essential dimensions of ADRD-related caregiving, such as understanding the condition, effective communication with patients, and behavioral management strategies. We followed a structured abstraction process in which recurring caregiving situations and recommended care strategies were identified from the education content and caregiver handout and transformed into evaluable question-answer pairs. The resulting QA set includes 120 polar (True/False) QAs and 29 multiple-choice QAs. To ensure data quality, all constructed questions were reviewed and refined for accuracy, clarity, and appropriateness by a senior clinician who served as the lead designer of the ABC program and has been involved in its development and refinement for two decades.</p>
      </para>
      <para xml:id="S3.SS2.p4">
        <p>Although modest in size, this question set provides targeted coverage of daily ADRD caregiving scenarios that are largely absent from existing benchmarks. By integrating these questions with the extracted ADRD Unified QA set, ADRD-Bench enables a more comprehensive evaluation of LLMs that spans both clinical knowledge and practical caregiving support for ADRD.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S3.SS3">
      <tags>
        <tag>3.3</tag>
        <tag role="autoref">subsection 3.3</tag>
        <tag role="refnum">3.3</tag>
        <tag role="typerefnum">§3.3</tag>
      </tags>
      <title><tag close=". ">3.3</tag>Quantitative and Qualitative Evaluations</title>
      <toctitle><tag close=" ">3.3</tag>Quantitative and Qualitative Evaluations</toctitle>
      <para xml:id="S3.SS3.p1">
        <p>To assess the performance of LLMs on the proposed ADRD-Bench, we tested 33 LLMs, listed in Table <ref labelref="LABEL:tab:2"/>, including both open-weight and closed-source LLMs. The open-weight LLMs included general and medical models, ranging in size from 3.8B to 235B parameters. Evaluating a diverse set of models with different parameter scales and deployment settings helped establish both a comprehensive picture of model performance and the overall difficulty of the benchmark across model settings.
</p>
      </para>
      <para xml:id="S3.SS3.p2">
        <p>For open-weight models with 14.8B parameters or fewer, all experiments were conducted on a local workstation with an NVIDIA RTX 6000 Ada Generation GPU (48 GB VRAM). For larger open-weight models with 32B parameters and above, experiments were conducted on an Oracle Cloud Infrastructure server with eight NVIDIA H100 GPUs (80 GB HBM3 memory each). All closed-source general models were evaluated via their official APIs/terminals.</p>
      </para>
      <para xml:id="S3.SS3.p3">
        <p>For quantitative evaluation, given that all tasks in our benchmark are formulated as discrete-choice questions, including multiple-choice, True/False, and error-sentence detection formats, we adopt exact-match accuracy as the primary metric. A model’s prediction is considered correct if and only if the selected option exactly matches the ground-truth label. For each model, we report accuracy separately for each set of the benchmark, along with the overall accuracy aggregated across all questions.</p>
      </para>
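      <para>
        <p>The metric above can be sketched as follows, assuming predictions have already been parsed into a single option label per question. The record layout, the <text font="typewriter">normalize</text> step (trimming whitespace and case-folding), and the function names are illustrative assumptions; the source does not describe how raw model outputs are mapped to labels.</p>
      </para>

```python
from collections import defaultdict


def normalize(label: str) -> str:
    """Assumption: labels are compared after trimming whitespace and case-folding."""
    return label.strip().casefold()


def exact_match_accuracy(records):
    """Compute per-subset and overall exact-match accuracy.

    records: iterable of (subset, prediction, gold) tuples (a hypothetical layout).
    Returns (per_subset, overall), where per_subset maps subset name to accuracy
    and overall pools every question before dividing.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for subset, prediction, gold in records:
        total[subset] += 1
        if normalize(prediction) == normalize(gold):
            correct[subset] += 1
    per_subset = {name: correct[name] / total[name] for name in total}
    n = sum(total.values())
    overall = sum(correct.values()) / n if n else 0.0
    return per_subset, overall
```

      <para>
        <p>Pooling all questions before dividing weights each subset by its size. This appears consistent with the Score columns of Table <ref labelref="LABEL:tab:2"/>: for the first listed model, the ADRD Caregiving QA score of 0.8725 equals (0.8833 × 120 + 0.8276 × 29) / 149 up to rounding.</p>
      </para>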
      <para xml:id="S3.SS3.p4">
        <p>For qualitative analysis, we focused on the newly proposed ADRD Caregiving QA set to better understand model failures in daily caregiving scenarios, which have not been covered in the literature. Specifically, we selected questions on which LLMs frequently produced incorrect answers, as indicated by the quantitative results. The selected cases were reviewed by our clinical expert to analyze common error patterns. The qualitative analyses were intended to contextualize the findings and highlight practical limitations of current LLMs in daily caregiving scenarios.</p>
      </para>
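      <para>
        <p>One way to select frequently missed questions for expert review is to rank each question by the fraction of models that answered it incorrectly. The sketch below is illustrative only: the source does not state a selection threshold, so the <text font="typewriter">min_error_rate</text> cutoff, the input layout, and the function name are hypothetical.</p>
      </para>

```python
def frequently_missed(results, min_error_rate=0.5):
    """Rank questions by the fraction of models that answered them wrongly.

    results: dict mapping question id to a dict of model name to bool,
             where True means the model answered correctly (hypothetical layout).
    min_error_rate: hypothetical cutoff; the source does not state one.
    Returns a list of (question id, error rate), highest error rate first.
    """
    ranked = []
    for qid, by_model in results.items():
        if not by_model:
            continue  # skip questions with no recorded answers
        wrong = sum(1 for ok in by_model.values() if not ok)
        rate = wrong / len(by_model)
        if rate >= min_error_rate:
            ranked.append((qid, rate))
    # Highest error rate first; ties broken by question id for determinism.
    ranked.sort(key=lambda item: (-item[1], item[0]))
    return ranked
```

      <para>
        <p>Sorting by error rate puts the questions that most models miss at the top of the review list, which keeps the expert's attention on shared failure patterns and away from idiosyncratic single-model errors.</p>
      </para>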
    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:result" xml:id="S4">
    <tags>
      <tag>4</tag>
      <tag role="autoref">section 4</tag>
      <tag role="refnum">4</tag>
      <tag role="typerefnum">§4</tag>
    </tags>
    <title><tag close=". ">4</tag>Results and Analyses</title>
    <toctitle><tag close=" ">4</tag>Results and Analyses</toctitle>
    <para xml:id="S4.p1">
      <p>Table  <ref labelref="LABEL:tab:2"/> shows the accuracy of 33 LLMs on each subset of ADRD Unified QA and ADRD Caregiving QA. LLMs are grouped into open-weight general, open-weight medical, and closed-source general categories, with model size indicated where applicable. Boldface indicates the best performance for each task, while underlining denotes the second-best result.</p>
    </para>
    <table inlist="lot" labels="LABEL:tab:2" placement="t" xml:id="S4.T2">
      <tags>
        <tag><text fontsize="90%">Table 2</text></tag>
        <tag role="autoref"><text fontsize="90%">Table 2</text></tag>
        <tag role="refnum"><text fontsize="90%">2</text></tag>
        <tag role="typerefnum"><text fontsize="90%">Table 2</text></tag>
      </tags>
      <toccaption class="ltx_centering"><tag close=" "><text fontsize="90%">2</text></tag><text fontsize="90%">Accuracy of LLMs on ADRD-Bench.</text></toccaption>
      <caption class="ltx_centering" fontsize="90%"><tag close=". ">Table 2</tag>Accuracy of LLMs on ADRD-Bench.</caption>
      <tabular class="ltx_centering" colsep="4.0pt" rowsep="1.2pt" vattach="middle">
        <tr>
          <td border="r t"/>
          <td border="r t"/>
          <td align="center" border="r t" colspan="8"><text font="bold" fontsize="90%">ADRD Unified QA</text></td>
          <td align="center" border="t" colspan="3"><text font="bold" fontsize="90%">ADRD Caregiving QA</text></td>
        </tr>
        <tr>
          <td align="center" border="r"><text font="bold" fontsize="90%">Model Name</text></td>
          <td align="center" border="r"><text font="bold" fontsize="90%">Model Size</text></td>
          <td align="center" border="t"><text fontsize="90%">MedMCQA</text></td>
          <td align="center" border="t"><text fontsize="90%">MedQA</text></td>
          <td align="center" border="t"><text fontsize="90%">HEAD-QA</text></td>
          <td align="center" border="t"><text fontsize="90%">MEDEC</text></td>
          <td align="center" border="t"><text fontsize="90%">PubMedQA</text></td>
          <td align="center" border="t"><text fontsize="90%">MedBullets</text></td>
          <td align="center" border="t"><text fontsize="90%">MedHallu</text></td>
          <td align="center" border="r t"><text fontsize="90%">Score</text></td>
          <td align="center" border="t"><text fontsize="90%">True/False</text></td>
          <td align="center" border="t"><text fontsize="90%">Multiple Choice</text></td>
          <td align="center" border="t"><text fontsize="90%">Score</text></td>
        </tr>
        <tr>
          <td border="r"/>
          <td border="r"/>
          <td align="center"><text fontsize="90%">1148 (size)</text></td>
          <td align="center"><text fontsize="90%">98</text></td>
          <td align="center"><text fontsize="90%">41</text></td>
          <td align="center"><text fontsize="90%">29</text></td>
          <td align="center"><text fontsize="90%">12</text></td>
          <td align="center"><text fontsize="90%">12</text></td>
          <td align="center"><text fontsize="90%">12</text></td>
          <td align="center" border="r"><text fontsize="90%">1352</text></td>
          <td align="center"><text fontsize="90%">120</text></td>
          <td align="center"><text fontsize="90%">29</text></td>
          <td align="center"><text fontsize="90%">149</text></td>
        </tr>
        <tr>
          <td align="center" border="t" colspan="13"><text font="bold" fontsize="90%">Open-Weight General Model</text></td>
        </tr>
        <tr>
          <td align="left" border="r t"><text fontsize="90%">phi3:3.8b-mini-128k-instruct-fp16</text></td>
          <td align="center" border="r t"><text fontsize="90%">3.8B</text></td>
          <td align="center" border="t"><text fontsize="90%">0.6455</text></td>
          <td align="center" border="t"><text fontsize="90%">0.6020</text></td>
          <td align="center" border="t"><text fontsize="90%">0.8049</text></td>
          <td align="center" border="t"><text fontsize="90%">0.1034</text></td>
          <td align="center" border="t"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center" border="t"><text fontsize="90%">0.6667</text></td>
          <td align="center" border="t"><text fontsize="90%">0.8333</text></td>
          <td align="center" border="r t"><text fontsize="90%">0.6391</text></td>
          <td align="center" border="t"><text fontsize="90%">0.8833</text></td>
          <td align="center" border="t"><text fontsize="90%">0.8276</text></td>
          <td align="center" border="t"><text fontsize="90%">0.8725</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">qwen3-4b-instruct-2507-fp16</text></td>
          <td align="center" border="r"><text fontsize="90%">4B</text></td>
          <td align="center"><text fontsize="90%">0.7108</text></td>
          <td align="center"><text fontsize="90%">0.6429</text></td>
          <td align="center"><text fontsize="90%">0.8293</text></td>
          <td align="center"><text fontsize="90%">0.1379</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%">0.7500</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center" border="r"><text fontsize="90%">0.7004</text></td>
          <td align="center"><text fontsize="90%">0.8833</text></td>
          <td align="center"><text fontsize="90%">0.7586</text></td>
          <td align="center"><text fontsize="90%">0.8591</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">falcon3.7b-instruct-fp16</text></td>
          <td align="center" border="r"><text fontsize="90%">7.5B</text></td>
          <td align="center"><text fontsize="90%">0.6838</text></td>
          <td align="center"><text fontsize="90%">0.5714</text></td>
          <td align="center"><text fontsize="90%">0.7317</text></td>
          <td align="center"><text fontsize="90%">0.2069</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%">0.6667</text></td>
          <td align="center"><text fontsize="90%">0.5833</text></td>
          <td align="center" border="r"><text fontsize="90%">0.6672</text></td>
          <td align="center"><text fontsize="90%">0.8583</text></td>
          <td align="center"><text fontsize="90%">0.7241</text></td>
          <td align="center"><text fontsize="90%">0.8322</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">qwen2.5-7b-instruct-fp16</text></td>
          <td align="center" border="r"><text fontsize="90%">7.6B</text></td>
          <td align="center"><text fontsize="90%">0.7308</text></td>
          <td align="center"><text fontsize="90%">0.6224</text></td>
          <td align="center"><text fontsize="90%">0.7317</text></td>
          <td align="center"><text fontsize="90%">0.3793</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.7500</text></td>
          <td align="center"><text fontsize="90%">0.6667</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center" border="r"><text fontsize="90%">0.7160</text></td>
          <td align="center"><text fontsize="90%">0.8583</text></td>
          <td align="center"><text fontsize="90%">0.7241</text></td>
          <td align="center"><text fontsize="90%">0.8322</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">llama3.3-8b-instruct-fp16</text></td>
          <td align="center" border="r"><text fontsize="90%">8B</text></td>
          <td align="center"><text fontsize="90%">0.9251</text></td>
          <td align="center"><text fontsize="90%">0.6837</text></td>
          <td align="center"><text fontsize="90%">0.7317</text></td>
          <td align="center"><text fontsize="90%">0.1034</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.7500</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center"><text fontsize="90%">0.6667</text></td>
          <td align="center" border="r"><text fontsize="90%">0.8802</text></td>
          <td align="center"><text fontsize="90%">0.8167</text></td>
          <td align="center"><text fontsize="90%">0.6897</text></td>
          <td align="center"><text fontsize="90%">0.7919</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">llama3.1-8b-instruct-fp16</text></td>
          <td align="center" border="r"><text fontsize="90%">8B</text></td>
          <td align="center"><text fontsize="90%">0.8780</text></td>
          <td align="center"><text fontsize="90%">0.6939</text></td>
          <td align="center"><text fontsize="90%">0.7805</text></td>
          <td align="center"><text fontsize="90%">0.0000</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%">0.7500</text></td>
          <td align="center"><text fontsize="90%">0.7500</text></td>
          <td align="center" border="r"><text fontsize="90%">0.8402</text></td>
          <td align="center"><text fontsize="90%">0.8417</text></td>
          <td align="center"><text fontsize="90%">0.7586</text></td>
          <td align="center"><text fontsize="90%">0.8255</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">yi-9b-chat-v1.5-fp16</text></td>
          <td align="center" border="r"><text fontsize="90%">9B</text></td>
          <td align="center"><text fontsize="90%">0.6185</text></td>
          <td align="center"><text fontsize="90%">0.5408</text></td>
          <td align="center"><text fontsize="90%">0.7561</text></td>
          <td align="center"><text fontsize="90%">0.1034</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%">0.5000</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center" border="r"><text fontsize="90%">0.6095</text></td>
          <td align="center"><text fontsize="90%">0.8583</text></td>
          <td align="center"><text fontsize="90%">0.7241</text></td>
          <td align="center"><text fontsize="90%">0.8322</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">gemma2:9b-instruct-fp16</text></td>
          <td align="center" border="r"><text fontsize="90%">9.2B</text></td>
          <td align="center"><text fontsize="90%">0.7387</text></td>
          <td align="center"><text fontsize="90%">0.6020</text></td>
          <td align="center"><text fontsize="90%">0.7561</text></td>
          <td align="center"><text fontsize="90%">0.2069</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.7500</text></td>
          <td align="center"><text fontsize="90%">0.6667</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center" border="r"><text fontsize="90%">0.7189</text></td>
          <td align="center"><text fontsize="90%">0.8750</text></td>
          <td align="center"><text fontsize="90%">0.7931</text></td>
          <td align="center"><text fontsize="90%">0.8591</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">gemma3:12b-it-fp16</text></td>
          <td align="center" border="r"><text fontsize="90%">12.2B</text></td>
          <td align="center"><text fontsize="90%">0.7439</text></td>
          <td align="center"><text fontsize="90%">0.6327</text></td>
          <td align="center"><text fontsize="90%">0.8049</text></td>
          <td align="center"><text fontsize="90%">0.1379</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.7500</text></td>
          <td align="center"><text fontsize="90%">0.5833</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center" border="r"><text fontsize="90%">0.7241</text></td>
          <td align="center"><text fontsize="90%">0.9083</text></td>
          <td align="center"><text fontsize="90%">0.7931</text></td>
          <td align="center"><text fontsize="90%">0.8859</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">phi3:14b-medium-4k-instruct-fp16</text></td>
          <td align="center" border="r"><text fontsize="90%">14B</text></td>
          <td align="center"><text fontsize="90%">0.7622</text></td>
          <td align="center"><text fontsize="90%">0.7449</text></td>
          <td align="center"><text fontsize="90%">0.8293</text></td>
          <td align="center"><text fontsize="90%">0.2069</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center" border="r"><text fontsize="90%">0.7530</text></td>
          <td align="center"><text fontsize="90%">0.8917</text></td>
          <td align="center"><text fontsize="90%">0.7586</text></td>
          <td align="center"><text fontsize="90%">0.8658</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">qwen2.5:14b-instruct-fp16</text></td>
          <td align="center" border="r"><text fontsize="90%">14.8B</text></td>
          <td align="center"><text fontsize="90%">0.7787</text></td>
          <td align="center"><text fontsize="90%">0.6531</text></td>
          <td align="center"><text fontsize="90%">0.8049</text></td>
          <td align="center"><text fontsize="90%">0.2069</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.7500</text></td>
          <td align="center"><text fontsize="90%">0.6667</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center" border="r"><text fontsize="90%">0.7574</text></td>
          <td align="center"><text fontsize="90%">0.8750</text></td>
          <td align="center"><text fontsize="90%">0.7241</text></td>
          <td align="center"><text fontsize="90%">0.8456</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">Qwen3-30B-A3B-Instruct-2507</text></td>
          <td align="center" border="r"><text fontsize="90%">30B</text></td>
          <td align="center"><text fontsize="90%">0.8467</text></td>
          <td align="center"><text fontsize="90%">0.7551</text></td>
          <td align="center"><text fontsize="90%">0.8537</text></td>
          <td align="center"><text fontsize="90%">0.4483</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.7500</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center" border="r"><text fontsize="90%">0.8306</text></td>
          <td align="center"><text fontsize="90%">0.9083</text></td>
          <td align="center"><text fontsize="90%">0.7586</text></td>
          <td align="center"><text fontsize="90%">0.8792</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">Llama-3.1-70B-Instruct</text></td>
          <td align="center" border="r"><text fontsize="90%">70B</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9495</text></td>
          <td align="center"><text fontsize="90%">0.9286</text></td>
          <td align="center"><text fontsize="90%">0.8537</text></td>
          <td align="center"><text fontsize="90%">0.4483</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center"><text font="bold" fontsize="90%">1.0000</text></td>
          <td align="center" border="r"><text font="bold" fontsize="90%">0.9334</text></td>
          <td align="center"><text fontsize="90%">0.9333</text></td>
          <td align="center"><text fontsize="90%">0.7931</text></td>
          <td align="center"><text fontsize="90%">0.9060</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">Qwen2.5-72B-Instruct</text></td>
          <td align="center" border="r"><text fontsize="90%">72B</text></td>
          <td align="center"><text fontsize="90%">0.8571</text></td>
          <td align="center"><text fontsize="90%">0.7347</text></td>
          <td align="center"><text fontsize="90%">0.8780</text></td>
          <td align="center"><text fontsize="90%">0.2414</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.7500</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center" border="r"><text fontsize="90%">0.8351</text></td>
          <td align="center"><text fontsize="90%">0.8917</text></td>
          <td align="center"><text fontsize="90%">0.8276</text></td>
          <td align="center"><text fontsize="90%">0.8792</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">Qwen3-235B-A22B-Instruct-2507</text></td>
          <td align="center" border="r"><text fontsize="90%">235B</text></td>
          <td align="center"><text fontsize="90%">0.8902</text></td>
          <td align="center"><text fontsize="90%">0.8980</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9024</text></td>
          <td align="center"><text fontsize="90%">0.4828</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.7500</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center" border="r"><text fontsize="90%">0.8817</text></td>
          <td align="center"><text fontsize="90%">0.9167</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.8966</text></td>
          <td align="center"><text fontsize="90%">0.9128</text></td>
        </tr>
        <tr>
          <td align="center" border="t" colspan="13"><text font="bold" fontsize="90%">Open-Weight Medical Model</text></td>
        </tr>
        <tr>
          <td align="left" border="r t"><text fontsize="90%">MedGemma-4b-it</text></td>
          <td align="center" border="r t"><text fontsize="90%">4B</text></td>
          <td align="center" border="t"><text fontsize="90%">0.7343</text></td>
          <td align="center" border="t"><text fontsize="90%">0.6020</text></td>
          <td align="center" border="t"><text fontsize="90%">0.8049</text></td>
          <td align="center" border="t"><text fontsize="90%">0.2069</text></td>
          <td align="center" border="t"><text fontsize="90%">0.6667</text></td>
          <td align="center" border="t"><text fontsize="90%">0.7500</text></td>
          <td align="center" border="t"><text fontsize="90%">0.5833</text></td>
          <td align="center" border="r t"><text fontsize="90%">0.7138</text></td>
          <td align="center" border="t"><text fontsize="90%">0.8250</text></td>
          <td align="center" border="t"><text fontsize="90%">0.5862</text></td>
          <td align="center" border="t"><text fontsize="90%">0.7785</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">Hulu-Med-4B</text></td>
          <td align="center" border="r"><text fontsize="90%">4B</text></td>
          <td align="center"><text fontsize="90%">0.8667</text></td>
          <td align="center"><text fontsize="90%">0.8163</text></td>
          <td align="center"><text fontsize="90%">0.8293</text></td>
          <td align="center"><text fontsize="90%">0.3793</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%">0.6667</text></td>
          <td align="center"><text fontsize="90%">0.7500</text></td>
          <td align="center" border="r"><text fontsize="90%">0.8484</text></td>
          <td align="center"><text fontsize="90%">0.8417</text></td>
          <td align="center"><text fontsize="90%">0.8276</text></td>
          <td align="center"><text fontsize="90%">0.8389</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">Lingshu-7B</text></td>
          <td align="center" border="r"><text fontsize="90%">7B</text></td>
          <td align="center"><text fontsize="90%">0.6977</text></td>
          <td align="center"><text fontsize="90%">0.7143</text></td>
          <td align="center"><text fontsize="90%">0.8293</text></td>
          <td align="center"><text fontsize="90%">0.1724</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.7500</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%">0.7500</text></td>
          <td align="center" border="r"><text fontsize="90%">0.6938</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%">0.6897</text></td>
          <td align="center"><text fontsize="90%">0.8054</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">Hulu-Med-7B</text></td>
          <td align="center" border="r"><text fontsize="90%">7B</text></td>
          <td align="center"><text fontsize="90%">0.9077</text></td>
          <td align="center"><text fontsize="90%">0.8469</text></td>
          <td align="center"><text fontsize="90%">0.8049</text></td>
          <td align="center"><text fontsize="90%">0.2414</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center" border="r"><text fontsize="90%">0.8846</text></td>
          <td align="center"><text fontsize="90%">0.8667</text></td>
          <td align="center"><text fontsize="90%">0.7931</text></td>
          <td align="center"><text fontsize="90%">0.8523</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">BioMistral-7B</text></td>
          <td align="center" border="r"><text fontsize="90%">7B</text></td>
          <td align="center"><text fontsize="90%">0.4634</text></td>
          <td align="center"><text fontsize="90%">0.4388</text></td>
          <td align="center"><text fontsize="90%">0.6585</text></td>
          <td align="center"><text fontsize="90%">0.3103</text></td>
          <td align="center"><text fontsize="90%">0.5000</text></td>
          <td align="center"><text fontsize="90%">0.5833</text></td>
          <td align="center"><text fontsize="90%">0.6667</text></td>
          <td align="center" border="r"><text fontsize="90%">0.4675</text></td>
          <td align="center"><text fontsize="90%">0.6750</text></td>
          <td align="center"><text fontsize="90%">0.4483</text></td>
          <td align="center"><text fontsize="90%">0.6309</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">Hulu-Med-14B</text></td>
          <td align="center" border="r"><text fontsize="90%">14B</text></td>
          <td align="center"><text fontsize="90%">0.9268</text></td>
          <td align="center"><text fontsize="90%">0.9286</text></td>
          <td align="center"><text fontsize="90%">0.8537</text></td>
          <td align="center"><text fontsize="90%">0.3103</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text font="bold" fontsize="90%">1.0000</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center" border="r"><text fontsize="90%">0.9105</text></td>
          <td align="center"><text fontsize="90%">0.9000</text></td>
          <td align="center"><text fontsize="90%">0.7586</text></td>
          <td align="center"><text fontsize="90%">0.8725</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">Hulu-Med-32B</text></td>
          <td align="center" border="r"><text fontsize="90%">32B</text></td>
          <td align="center"><text fontsize="90%">0.9373</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9592</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9024</text></td>
          <td align="center"><text fontsize="90%">0.2759</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center" border="r"><text fontsize="90%">0.9223</text></td>
          <td align="center"><text fontsize="90%">0.9333</text></td>
          <td align="center"><text fontsize="90%">0.7931</text></td>
          <td align="center"><text fontsize="90%">0.9060</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">Lingshu-32B</text></td>
          <td align="center" border="r"><text fontsize="90%">32B</text></td>
          <td align="center"><text fontsize="90%">0.8249</text></td>
          <td align="center"><text fontsize="90%">0.9082</text></td>
          <td align="center"><text fontsize="90%">0.8293</text></td>
          <td align="center"><text fontsize="90%">0.5172</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center" border="r"><text fontsize="90%">0.8254</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.9667</text></td>
          <td align="center"><text fontsize="90%">0.8621</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9463</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">Llama3-OpenBioLLM-70B</text></td>
          <td align="center" border="r"><text fontsize="90%">70B</text></td>
          <td align="center"><text fontsize="90%">0.9460</text></td>
          <td align="center"><text fontsize="90%">0.9082</text></td>
          <td align="center"><text fontsize="90%">0.8293</text></td>
          <td align="center"><text fontsize="90%">0.4483</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.7500</text></td>
          <td align="center"><text font="bold" fontsize="90%">1.0000</text></td>
          <td align="center"><text fontsize="90%">0.7500</text></td>
          <td align="center" border="r"><text fontsize="90%">0.9260</text></td>
          <td align="center"><text fontsize="90%">0.9167</text></td>
          <td align="center"><text fontsize="90%">0.7931</text></td>
          <td align="center"><text fontsize="90%">0.8926</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">ClinicalCamel-70B</text></td>
          <td align="center" border="r"><text fontsize="90%">70B</text></td>
          <td align="center"><text fontsize="90%">0.6568</text></td>
          <td align="center"><text fontsize="90%">0.5816</text></td>
          <td align="center"><text fontsize="90%">0.7561</text></td>
          <td align="center"><text fontsize="90%">0.3793</text></td>
          <td align="center"><text fontsize="90%">0.5833</text></td>
          <td align="center"><text fontsize="90%">0.5833</text></td>
          <td align="center"><text font="bold" fontsize="90%">1.0000</text></td>
          <td align="center" border="r"><text fontsize="90%">0.6501</text></td>
          <td align="center"><text fontsize="90%">0.8667</text></td>
          <td align="center"><text fontsize="90%">0.7586</text></td>
          <td align="center"><text fontsize="90%">0.8456</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">Med42-70B</text></td>
          <td align="center" border="r"><text fontsize="90%">70B</text></td>
          <td align="center"><text fontsize="90%">0.9077</text></td>
          <td align="center"><text fontsize="90%">0.8980</text></td>
          <td align="center"><text fontsize="90%">0.8537</text></td>
          <td align="center"><text fontsize="90%">0.3448</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%">0.6667</text></td>
          <td align="center"><text font="bold" fontsize="90%">1.0000</text></td>
          <td align="center" border="r"><text fontsize="90%">0.8913</text></td>
          <td align="center"><text fontsize="90%">0.8833</text></td>
          <td align="center"><text fontsize="90%">0.7586</text></td>
          <td align="center"><text fontsize="90%">0.8591</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">Meditron3-70B</text></td>
          <td align="center" border="r"><text fontsize="90%">70B</text></td>
          <td align="center"><text fontsize="90%">0.9460</text></td>
          <td align="center"><text fontsize="90%">0.9082</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.9268</text></td>
          <td align="center"><text fontsize="90%">0.4483</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text font="bold" fontsize="90%">1.0000</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center" border="r"><text fontsize="90%" framed="underline">0.9312</text></td>
          <td align="center"><text fontsize="90%">0.9250</text></td>
          <td align="center"><text fontsize="90%">0.7931</text></td>
          <td align="center"><text fontsize="90%">0.8993</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">MMed-Llama3.1-70B</text></td>
          <td align="center" border="r"><text fontsize="90%">70B</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.9564</text></td>
          <td align="center"><text fontsize="90%">0.9184</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.9268</text></td>
          <td align="center"><text fontsize="90%">0.0345</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.7500</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center"><text fontsize="90%">0.5000</text></td>
          <td align="center" border="r"><text fontsize="90%">0.9268</text></td>
          <td align="center"><text fontsize="90%">0.9000</text></td>
          <td align="center"><text fontsize="90%">0.7586</text></td>
          <td align="center"><text fontsize="90%">0.8725</text></td>
        </tr>
        <tr>
          <td align="center" border="t" colspan="13"><text font="bold" fontsize="90%">Closed-Source General Model</text></td>
        </tr>
        <tr>
          <td align="left" border="r t"><text fontsize="90%">Gemini 2.5 Pro</text></td>
          <td align="center" border="r t"><text fontsize="90%">–</text></td>
          <td align="center" border="t"><text fontsize="90%">0.9207</text></td>
          <td align="center" border="t"><text font="bold" fontsize="90%">0.9694</text></td>
          <td align="center" border="t"><text font="bold" fontsize="90%">0.9268</text></td>
          <td align="center" border="t"><text fontsize="90%">0.4828</text></td>
          <td align="center" border="t"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center" border="t"><text font="bold" fontsize="90%">1.0000</text></td>
          <td align="center" border="t"><text fontsize="90%">0.8333</text></td>
          <td align="center" border="r t"><text fontsize="90%">0.9142</text></td>
          <td align="center" border="t"><text fontsize="90%">0.9250</text></td>
          <td align="center" border="t"><text fontsize="90%">0.7931</text></td>
          <td align="center" border="t"><text fontsize="90%">0.8993</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">qwen3-max-2025-09-23</text></td>
          <td align="center" border="r"><text fontsize="90%">–</text></td>
          <td align="center"><text fontsize="90%">0.9024</text></td>
          <td align="center"><text fontsize="90%">0.9388</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.9268</text></td>
          <td align="center"><text fontsize="90%">0.4138</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center" border="r"><text fontsize="90%">0.8950</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9417</text></td>
          <td align="center"><text fontsize="90%">0.8276</text></td>
          <td align="center"><text fontsize="90%">0.9195</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">GPT-5.2</text></td>
          <td align="center" border="r"><text fontsize="90%">–</text></td>
          <td align="center"><text fontsize="90%">0.8981</text></td>
          <td align="center"><text fontsize="90%">0.8878</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.9024</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.9310</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center"><text fontsize="90%">0.7500</text></td>
          <td align="center" border="r"><text fontsize="90%">0.8957</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.9667</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.9655</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.9664</text></td>
        </tr>
        <tr>
          <td align="left" border="r"><text fontsize="90%">claude-opus-4.5-20251101</text></td>
          <td align="center" border="r"><text fontsize="90%">–</text></td>
          <td align="center"><text fontsize="90%">0.9181</text></td>
          <td align="center"><text fontsize="90%">0.9490</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.9268</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.6207</text></td>
          <td align="center"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center"><text font="bold" fontsize="90%">1.0000</text></td>
          <td align="center"><text fontsize="90%">0.8333</text></td>
          <td align="center" border="r"><text fontsize="90%">0.9135</text></td>
          <td align="center"><text fontsize="90%">0.9333</text></td>
          <td align="center"><text fontsize="90%" framed="underline">0.8966</text></td>
          <td align="center"><text fontsize="90%">0.9262</text></td>
        </tr>
        <tr>
          <td align="left" border="b r"><text fontsize="90%">grok-4.1-fast-non-reasoning</text></td>
          <td align="center" border="b r"><text fontsize="90%">–</text></td>
          <td align="center" border="b"><text fontsize="90%">0.8328</text></td>
          <td align="center" border="b"><text fontsize="90%">0.8367</text></td>
          <td align="center" border="b"><text fontsize="90%">0.8780</text></td>
          <td align="center" border="b"><text fontsize="90%">0.3793</text></td>
          <td align="center" border="b"><text font="bold" fontsize="90%">0.8333</text></td>
          <td align="center" border="b"><text fontsize="90%">0.8333</text></td>
          <td align="center" border="b"><text fontsize="90%" framed="underline">0.9167</text></td>
          <td align="center" border="b r"><text fontsize="90%">0.8254</text></td>
          <td align="center" border="b"><text fontsize="90%">0.9000</text></td>
          <td align="center" border="b"><text fontsize="90%">0.8276</text></td>
          <td align="center" border="b"><text fontsize="90%">0.8859</text></td>
        </tr>
      </tabular>
      <p align="center"><text fontsize="80%">Boldface indicates the best performance in each column; underlining indicates the second-best result. Within each QA set, all questions across subcategories are equally weighted; scores are computed over all questions.</text></p>
    </table>
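    <para>
      <p>The equal-weighting convention in the table note corresponds to a micro-average: correct answers are pooled over all questions before dividing, rather than averaging the subcategory scores. The Python sketch below illustrates the difference. The subcategory sizes (120 and 29 questions) and the correct-answer counts are not stated in this section; they are assumptions, chosen only because they are consistent with the fractions reported in the last three columns for Lingshu-32B (0.9667, 0.8621, 0.9463).</p>

```python
def micro_accuracy(correct_counts, totals):
    """Pool all questions: every question carries the same weight."""
    return sum(correct_counts) / sum(totals)


def macro_accuracy(correct_counts, totals):
    """Average the per-subcategory accuracies: every subcategory carries the same weight."""
    per_group = [c / n for c, n in zip(correct_counts, totals)]
    return sum(per_group) / len(per_group)


# ASSUMED subcategory sizes and correct counts (not given in the text);
# they reproduce the reported fractions 0.9667 and 0.8621.
totals = [120, 29]
correct = [116, 25]

micro = micro_accuracy(correct, totals)
macro = macro_accuracy(correct, totals)
print(round(micro, 4), round(macro, 4))  # prints 0.9463 0.9144
```

      <p>Under these assumed counts the pooled score matches the reported overall value (0.9463), whereas an unweighted mean of the two subcategory scores would give 0.9144.</p>
    </para>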
    <subsection inlist="toc" xml:id="S4.SS1">
      <tags>
        <tag>4.1</tag>
        <tag role="autoref">subsection 4.1</tag>
        <tag role="refnum">4.1</tag>
        <tag role="typerefnum">§4.1</tag>
      </tags>
      <title><tag close=". ">4.1</tag>LLM Performance on ADRD Unified QA</title>
      <toctitle><tag close=" ">4.1</tag>LLM Performance on ADRD Unified QA</toctitle>
      <para xml:id="S4.SS1.p1">
        <p>Figure <ref labelref="LABEL:fig:1"/> shows a statistically significant, moderate-to-strong positive correlation (Pearson’s <Math mode="inline" tex="r=0.5486" text="r = 0.5486" xml:id="S4.SS1.p1.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" role="UNKNOWN">r</XMTok>
                <XMTok meaning="0.5486" role="NUMBER">0.5486</XMTok>
              </XMApp>
            </XMath>
          </Math>, <Math mode="inline" tex="p=0.0025" text="p = 0.0025" xml:id="S4.SS1.p1.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" role="UNKNOWN">p</XMTok>
                <XMTok meaning="0.0025" role="NUMBER">0.0025</XMTok>
              </XMApp>
            </XMath>
          </Math>) between model size and accuracy on the ADRD Unified QA set. Most smaller models in the 4–7B parameter range achieved relatively low performance (approximately <Math mode="inline" tex="0.45" text="0.45" xml:id="S4.SS1.p1.m3">
            <XMath>
              <XMTok meaning="0.45" role="NUMBER">0.45</XMTok>
            </XMath>
          </Math>–<Math mode="inline" tex="0.75" text="0.75" xml:id="S4.SS1.p1.m4">
            <XMath>
              <XMTok meaning="0.75" role="NUMBER">0.75</XMTok>
            </XMath>
          </Math>), whereas large models (70B+ parameters) demonstrated substantially higher accuracy, mostly <Math mode="inline" tex="\geq 0.85" text="absent &gt;= 0.85" xml:id="S4.SS1.p1.m5">
            <XMath>
              <XMApp>
                <XMTok meaning="greater-than-or-equals" name="geq" role="RELOP">≥</XMTok>
                <XMTok meaning="absent"/>
                <XMTok meaning="0.85" role="NUMBER">0.85</XMTok>
              </XMApp>
            </XMath>
          </Math>. Moreover, most models with more than 30B parameters performed above the overall mean accuracy (0.7972). The best-performing model was Llama-3.1-70B-Instruct <cite class="ltx_citemacro_citep">(<bibref bibrefs="meta_ai_llama-31-70b-instruct_2024" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, which achieved an accuracy of <Math mode="inline" tex="0.9334" text="0.9334" xml:id="S4.SS1.p1.m6">
            <XMath>
              <XMTok meaning="0.9334" role="NUMBER">0.9334</XMTok>
            </XMath>
          </Math>.</p>
      </para>
      <para xml:id="S4.SS1.p2">
        <p>Across model categories, open-weight medical models (red points) tended to outperform open-weight general models (blue points) of comparable size. Across the parameter scale, open-weight medical models were mostly positioned above open-weight general models, suggesting a benefit of domain-specific fine-tuning for clinical question answering and medical error detection. However, an exception was observed for the top-performing model, Llama-3.1-70B-Instruct <cite class="ltx_citemacro_citep">(<bibref bibrefs="meta_ai_llama-31-70b-instruct_2024" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, which is a general model and slightly outperformed the strongest medical model, Meditron3-70B <cite class="ltx_citemacro_citep">(<bibref bibrefs="openmeditron_meditron3-70b_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> (0.9312). All closed-source general models achieved performance above the overall mean accuracy, with Gemini 2.5 Pro <cite class="ltx_citemacro_citep">(<bibref bibrefs="comanici_gemini_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> performing the best among them, reaching an accuracy of <Math mode="inline" tex="0.9142" text="0.9142" xml:id="S4.SS1.p2.m1">
            <XMath>
              <XMTok meaning="0.9142" role="NUMBER">0.9142</XMTok>
            </XMath>
          </Math>. Nevertheless, it was surpassed by several open-weight models (e.g., Llama-3.1-70B-Instruct and Meditron3-70B) with substantially fewer parameters. In addition, some open-weight medical LLMs exhibited substantially lower accuracy than open-weight general models with similar parameter counts, indicating that domain-specific fine-tuning does not guarantee superior performance.</p>
      </para>
      <para xml:id="S4.SS1.p3">
        <p>Overall, these results suggest that model size is an important contributor to performance on the ADRD Unified QA benchmark; however, it does not fully determine model effectiveness. Domain-specific fine-tuning may provide performance advantages in certain cases, but general models can still outperform specialized models, possibly due to stronger general reasoning capabilities and more robust language representations learned from large-scale and diverse pretraining corpora.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S4.SS2">
      <tags>
        <tag>4.2</tag>
        <tag role="autoref">subsection 4.2</tag>
        <tag role="refnum">4.2</tag>
        <tag role="typerefnum">§4.2</tag>
      </tags>
      <title><tag close=". ">4.2</tag>LLM Performance on ADRD Caregiving QA</title>
      <toctitle><tag close=" ">4.2</tag>LLM Performance on ADRD Caregiving QA</toctitle>
      <para xml:id="S4.SS2.p1">
        <p>On the ADRD Caregiving QA set (Figure <ref labelref="LABEL:fig:2"/>), the performance gap between smaller and larger models was narrower than that observed on the ADRD Unified QA set. This may be attributed to the fact that the questions in this benchmark focus more on daily living and are closer to “common-sense” reasoning, which is generally well covered by large language models. Nevertheless, an upward trend in accuracy with increasing model size remained evident. Pearson’s correlation analysis revealed a significant, moderate-to-strong positive correlation between model size and accuracy (<Math mode="inline" tex="r=0.5277" text="r = 0.5277" xml:id="S4.SS2.p1.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" role="UNKNOWN">r</XMTok>
                <XMTok meaning="0.5277" role="NUMBER">0.5277</XMTok>
              </XMApp>
            </XMath>
          </Math>, <Math mode="inline" tex="p=0.0039" text="p = 0.0039" xml:id="S4.SS2.p1.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" role="UNKNOWN">p</XMTok>
                <XMTok meaning="0.0039" role="NUMBER">0.0039</XMTok>
              </XMApp>
            </XMath>
          </Math>). Most smaller models in the 4B–7B parameter range achieved accuracies between <Math mode="inline" tex="0.75" text="0.75" xml:id="S4.SS2.p1.m3">
            <XMath>
              <XMTok meaning="0.75" role="NUMBER">0.75</XMTok>
            </XMath>
          </Math> and <Math mode="inline" tex="0.85" text="0.85" xml:id="S4.SS2.p1.m4">
            <XMath>
              <XMTok meaning="0.85" role="NUMBER">0.85</XMTok>
            </XMath>
          </Math>, all of which (except for the smallest model, phi3:3.8b-mini-128k-instruct-fp16 <cite class="ltx_citemacro_citep">(<bibref bibrefs="phi_team_phi-3_2024" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>) fell below the overall mean accuracy of <Math mode="inline" tex="0.8629" text="0.8629" xml:id="S4.SS2.p1.m5">
            <XMath>
              <XMTok meaning="0.8629" role="NUMBER">0.8629</XMTok>
            </XMath>
          </Math>. In contrast, the majority of models with more than 14B parameters performed above the average accuracy.</p>
      </para>
      <figure inlist="lof" labels="LABEL:fig:1" xml:id="S4.F1">
        <tags>
          <tag>Figure 1</tag>
          <tag role="autoref">Figure 1</tag>
          <tag role="refnum">1</tag>
          <tag role="typerefnum">Figure 1</tag>
        </tags>
        <graphics candidates="figures/ADRD_Unified_QA_Analysis_Revised.png" class="ltx_centering" graphic="figures/ADRD_Unified_QA_Analysis_Revised.png" options="width=433.62pt" xml:id="S4.F1.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">1</tag>
Model accuracy vs. model size on the ADRD Unified QA. Blue points: open-weight general models; Red points: open-weight medical models; Green points: closed-source general models. Dashed horizontal line: mean accuracy across all models.
</toccaption>
        <caption class="ltx_centering"><tag close=". ">Figure 1</tag>
Model accuracy vs. model size on the ADRD Unified QA. Blue points: open-weight general models; Red points: open-weight medical models; Green points: closed-source general models. Dashed horizontal line: mean accuracy across all models.
</caption>
      </figure>
      <figure inlist="lof" labels="LABEL:fig:2" xml:id="S4.F2">
        <tags>
          <tag>Figure 2</tag>
          <tag role="autoref">Figure 2</tag>
          <tag role="refnum">2</tag>
          <tag role="typerefnum">Figure 2</tag>
        </tags>
        <graphics candidates="figures/ADRD_Daily_Caring_QA_Analysis_Revised.png" class="ltx_centering" graphic="figures/ADRD_Daily_Caring_QA_Analysis_Revised.png" options="width=433.62pt" xml:id="S4.F2.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">2</tag>
Model accuracy vs. model size on the ADRD Caregiving QA. Blue points: open-weight general models; Red points: open-weight medical models; Green points: closed-source general models. Dashed horizontal line: mean accuracy across all models.
</toccaption>
        <caption class="ltx_centering"><tag close=". ">Figure 2</tag>
Model accuracy vs. model size on the ADRD Caregiving QA. Blue points: open-weight general models; Red points: open-weight medical models; Green points: closed-source general models. Dashed horizontal line: mean accuracy across all models.
</caption>
      </figure>
      <para xml:id="S4.SS2.p2">
        <p>Except for grok-4-1-fast-non-reasoning <cite class="ltx_citemacro_citep">(<bibref bibrefs="xai_grok_2024" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> and Gemini 2.5 Pro <cite class="ltx_citemacro_citep">(<bibref bibrefs="comanici_gemini_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite>, all other closed-source general models achieved accuracies above <Math mode="inline" tex="0.9" text="0.9" xml:id="S4.SS2.p2.m1">
            <XMath>
              <XMTok meaning="0.9" role="NUMBER">0.9</XMTok>
            </XMath>
          </Math>, outperforming the majority of open-weight models. This observation suggests that closed-source general models currently exhibit stronger overall capabilities in ADRD daily caregiving tasks. GPT-5.2 <cite class="ltx_citemacro_citep">(<bibref bibrefs="openai_introducing_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> achieved the best performance among all evaluated models, with an accuracy of <Math mode="inline" tex="0.9664" text="0.9664" xml:id="S4.SS2.p2.m2">
            <XMath>
              <XMTok meaning="0.9664" role="NUMBER">0.9664</XMTok>
            </XMath>
          </Math>, slightly surpassing the best-performing open-weight medical model, Lingshu-32B <cite class="ltx_citemacro_citep">(<bibref bibrefs="xu_lingshu_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> (<Math mode="inline" tex="0.9463" text="0.9463" xml:id="S4.SS2.p2.m3">
            <XMath>
              <XMTok meaning="0.9463" role="NUMBER">0.9463</XMTok>
            </XMath>
          </Math>), as well as the best-performing open-weight general model, Qwen3-235B-A22B-Instruct-2507 <cite class="ltx_citemacro_citep">(<bibref bibrefs="yang_qwen3_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
              <bibrefphrase>, </bibrefphrase>
            </bibref>)</cite> (<Math mode="inline" tex="0.9128" text="0.9128" xml:id="S4.SS2.p2.m4">
            <XMath>
              <XMTok meaning="0.9128" role="NUMBER">0.9128</XMTok>
            </XMath>
          </Math>).</p>
      </para>
      <para xml:id="S4.SS2.p3">
        <p>Medical models did not demonstrate a clear performance advantage over general models; in a few cases, they underperformed general models of comparable size. This pattern suggests that answering practical daily caregiving questions relies more on the broad commonsense reasoning typically acquired by large general models than on specialized medical fine-tuning. It may also reflect that current domain-specific fine-tuning efforts focus primarily on medical knowledge and incorporate relatively little information about daily caregiving.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S4.SS3">
      <tags>
        <tag>4.3</tag>
        <tag role="autoref">subsection 4.3</tag>
        <tag role="refnum">4.3</tag>
        <tag role="typerefnum">§4.3</tag>
      </tags>
      <title><tag close=". ">4.3</tag>Correlation Analysis</title>
      <toctitle><tag close=" ">4.3</tag>Correlation Analysis</toctitle>
      <para xml:id="S4.SS3.p1">
        <p>To understand how models’ performance on clinical knowledge relates to their performance on caregiving, we analyzed the correlation between accuracies on the ADRD Caregiving QA and accuracies on the ADRD Unified QA, as shown in Figures <ref labelref="LABEL:fig:3"/> and <ref labelref="LABEL:fig:4"/>. Across all models, Pearson’s correlation analysis revealed a significant and strong positive correlation between the two QA sets (<Math mode="inline" tex="r=0.6801" text="r = 0.6801" xml:id="S4.SS3.p1.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" role="UNKNOWN">r</XMTok>
                <XMTok meaning="0.6801" role="NUMBER">0.6801</XMTok>
              </XMApp>
            </XMath>
          </Math>, <Math mode="inline" tex="p=0.00001" text="p = 0.00001" xml:id="S4.SS3.p1.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" role="UNKNOWN">p</XMTok>
                <XMTok meaning="0.00001" role="NUMBER">0.00001</XMTok>
              </XMApp>
            </XMath>
          </Math>). This indicates that, in aggregate, models that performed well on clinical questions also tended to achieve higher accuracy on the caregiving questions.</p>
      </para>
      <figure inlist="lof" labels="LABEL:fig:3" xml:id="S4.F3">
        <tags>
          <tag>Figure 3</tag>
          <tag role="autoref">Figure 3</tag>
          <tag role="refnum">3</tag>
          <tag role="typerefnum">Figure 3</tag>
        </tags>
        <graphics candidates="figures/Correlation_Analysis_Overall_Revised.png" class="ltx_centering" graphic="figures/Correlation_Analysis_Overall_Revised.png" options="width=346.896pt" xml:id="S4.F3.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">3</tag>
Correlation between accuracies on ADRD Caregiving QA and accuracies on ADRD Unified QA on all models, where larger points mean larger model parameters.
</toccaption>
        <caption class="ltx_centering"><tag close=". ">Figure 3</tag>
Correlation between accuracies on ADRD Caregiving QA and accuracies on ADRD Unified QA on all models, where larger points mean larger model parameters.
</caption>
      </figure>
      <figure inlist="lof" labels="LABEL:fig:4" xml:id="S4.F4">
        <tags>
          <tag>Figure 4</tag>
          <tag role="autoref">Figure 4</tag>
          <tag role="refnum">4</tag>
          <tag role="typerefnum">Figure 4</tag>
        </tags>
        <graphics candidates="figures/Correlation_Analysis_Combined.png" class="ltx_centering" graphic="figures/Correlation_Analysis_Combined.png" options="width=433.62pt" xml:id="S4.F4.g1"/>
        <toccaption class="ltx_centering"><tag close=" ">4</tag>
Correlation between accuracies on ADRD Caregiving QA and accuracies on ADRD Unified QA on (a) general models; (b) medical models; (c) closed-source models.
</toccaption>
        <caption class="ltx_centering"><tag close=". ">Figure 4</tag>
Correlation between accuracies on ADRD Caregiving QA and accuracies on ADRD Unified QA on (a) general models; (b) medical models; (c) closed-source models.
</caption>
      </figure>
      <para xml:id="S4.SS3.p2">
        <p>However, when examining different model categories separately, we observed that only open-weight medical models exhibited this significant and strong correlation (<Math mode="inline" tex="r=0.8416" text="r = 0.8416" xml:id="S4.SS3.p2.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" role="UNKNOWN">r</XMTok>
                <XMTok meaning="0.8416" role="NUMBER">0.8416</XMTok>
              </XMApp>
            </XMath>
          </Math>, <Math mode="inline" tex="p=0.0003" text="p = 0.0003" xml:id="S4.SS3.p2.m2">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" role="UNKNOWN">p</XMTok>
                <XMTok meaning="0.0003" role="NUMBER">0.0003</XMTok>
              </XMApp>
            </XMath>
          </Math>). In contrast, open-weight general models (<Math mode="inline" tex="r=0.2730" text="r = 0.2730" xml:id="S4.SS3.p2.m3">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" role="UNKNOWN">r</XMTok>
                <XMTok meaning="0.2730" role="NUMBER">0.2730</XMTok>
              </XMApp>
            </XMath>
          </Math>, <Math mode="inline" tex="p=0.3248" text="p = 0.3248" xml:id="S4.SS3.p2.m4">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" role="UNKNOWN">p</XMTok>
                <XMTok meaning="0.3248" role="NUMBER">0.3248</XMTok>
              </XMApp>
            </XMath>
          </Math>) and closed-source models (<Math mode="inline" tex="r=0.4678" text="r = 0.4678" xml:id="S4.SS3.p2.m5">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" role="UNKNOWN">r</XMTok>
                <XMTok meaning="0.4678" role="NUMBER">0.4678</XMTok>
              </XMApp>
            </XMath>
          </Math>, <Math mode="inline" tex="p=0.4269" text="p = 0.4269" xml:id="S4.SS3.p2.m6">
            <XMath>
              <XMApp>
                <XMTok meaning="equals" role="RELOP">=</XMTok>
                <XMTok font="italic" role="UNKNOWN">p</XMTok>
                <XMTok meaning="0.4269" role="NUMBER">0.4269</XMTok>
              </XMApp>
            </XMath>
          </Math>) showed only weak-to-moderate, non-significant positive correlations. These results suggest that open-weight medical models, which incorporate medical fine-tuning, perform more consistently across clinical and caregiving questions. One possible explanation for the non-significant correlations among general models is the relatively small share of medical data in their training, so that their caregiving performance depends more on general common sense and social reasoning. Such reasoning may contribute to more socially appropriate responses, which matter in daily caregiving. This analysis is also consistent with our earlier observation that, on the ADRD Caregiving QA, open-weight medical models showed no clear advantage over open-weight general models and, in a few cases, underperformed general models of comparable size. At the same time, these findings imply that developing healthcare LLMs that fit ADRD care needs will require more training data on caregiving tasks to balance the models’ knowledge and reasoning in this area.</p>
      </para>
    </subsection>
    <subsection inlist="toc" xml:id="S4.SS4">
      <tags>
        <tag>4.4</tag>
        <tag role="autoref">subsection 4.4</tag>
        <tag role="refnum">4.4</tag>
        <tag role="typerefnum">§4.4</tag>
      </tags>
      <title><tag close=". ">4.4</tag>Case Analysis</title>
      <toctitle><tag close=" ">4.4</tag>Case Analysis</toctitle>
      <para xml:id="S4.SS4.p1">
        <p>To better understand large language model (LLM) performance beyond accuracy metrics, we examined the models’ reasoning processes on ADRD Caregiving QA questions for which over <Math mode="inline" tex="75\%" text="75percent" xml:id="S4.SS4.p1.m1">
            <XMath>
              <XMApp>
                <XMTok meaning="percent" role="POSTFIX">%</XMTok>
                <XMTok meaning="75" role="NUMBER">75</XMTok>
              </XMApp>
            </XMath>
          </Math> of the evaluated models generated incorrect answers. In reviewing these challenging cases, feedback from a clinical expert was incorporated to ensure that the analysis was grounded in real-world ADRD caregiving principles. Two representative cases are presented to illustrate common failure patterns and model behaviors.</p>
      </para>
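      <para>
        <p>The selection rule described above (questions that more than 75% of the evaluated models answered incorrectly) can be expressed as a simple filter over per-question outcomes. The sketch below is an assumption about how such a filter might look, not the authors’ code; the function name <text font="typewriter">hard_questions</text>, the question identifiers, and the outcomes are hypothetical.</p>
        <verbatim>
```python
def hard_questions(results, threshold=0.75):
    """Return (question id, error rate) pairs whose error rate exceeds the threshold.

    results maps a question id to a list of booleans, one per model,
    where True means the model answered correctly.
    """
    hard = []
    for qid, outcomes in results.items():
        if not outcomes:
            continue  # skip questions with no recorded answers
        error_rate = outcomes.count(False) / len(outcomes)
        if error_rate > threshold:  # strictly "over" the threshold
            hard.append((qid, error_rate))
    # Hardest questions first.
    return sorted(hard, key=lambda item: item[1], reverse=True)


# Hypothetical outcomes for four models on three questions; NOT the paper's data.
example = {
    "q1": [False, False, False, False],
    "q2": [True, False, False, False],
    "q3": [True, True, True, False],
}
print(hard_questions(example))
```
        </verbatim>
        <p>Note the strict comparison: a question missed by exactly 75% of the models is not selected, which matches the wording “over 75%”.</p>
      </para>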
      <subsubsection inlist="toc" xml:id="S4.SS4.SSS1">
        <tags>
          <tag>4.4.1</tag>
          <tag role="autoref">subsubsection 4.4.1</tag>
          <tag role="refnum">4.4.1</tag>
          <tag role="typerefnum">§4.4.1</tag>
        </tags>
        <title><tag close=". ">4.4.1</tag>Case 1: True/False question</title>
        <toctitle><tag close=" ">4.4.1</tag>Case 1: True/False question</toctitle>
        <para xml:id="S4.SS4.SSS1.p1">
          <p>As shown in Figure <ref labelref="LABEL:fig:5"/>, the correct answer is “False,” as the statement overgeneralizes across all stages of Alzheimer’s disease. In the early stages, individuals are often still able to perform most basic activities independently, and the loss of self-care ability typically occurs in moderate to severe ADRD stages. However, <Math mode="inline" tex="85\%" text="85percent" xml:id="S4.SS4.SSS1.p1.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="percent" role="POSTFIX">%</XMTok>
                  <XMTok meaning="85" role="NUMBER">85</XMTok>
                </XMApp>
              </XMath>
            </Math> of the LLMs answered “True,” implying that all individuals with Alzheimer’s disease are unable to manage daily living activities. This result reflects a common failure pattern of overgeneralization.</p>
        </para>
        <para xml:id="S4.SS4.SSS1.p2">
          <p>Figure <ref labelref="LABEL:fig:6"/> shows LLMs’ reasoning steps obtained via Zero-Shot Chain-of-Thought prompting. This zero-shot setting was selected to enable models to explicitly show their reasoning processes, while mitigating the influence of example selection biases. Complete prompts and outputs are in the Appendix.</p>
        </para>
        <para xml:id="S4.SS4.SSS1.p3">
          <p>The output of claude-opus-4-5-20251101 showed another typical type of error. This model achieved an overall accuracy exceeding <Math mode="inline" tex="90\%" text="90percent" xml:id="S4.SS4.SSS1.p3.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="percent" role="POSTFIX">%</XMTok>
                  <XMTok meaning="90" role="NUMBER">90</XMTok>
                </XMApp>
              </XMath>
            </Math> on the ADRD Caregiving QA and was able to distinguish between different stages of Alzheimer’s disease and recognize that patients in the early stage generally retain the ability to perform activities of daily living. However, this LLM based its final judgment on a hallmark of the advanced stage and used it to generalize a conclusion across the entire disease course, producing an incorrect answer. This suggests a tendency to fixate on salient features while overlooking conditional factors such as disease progression. It can also be understood as a bias: when a fact is commonly associated with a condition, some LLMs tend to treat it as a universal truth without adequately reasoning about scope or exceptions.</p>
        </para>
      </subsubsection>
      <subsubsection inlist="toc" xml:id="S4.SS4.SSS2">
        <tags>
          <tag>4.4.2</tag>
          <tag role="autoref">subsubsection 4.4.2</tag>
          <tag role="refnum">4.4.2</tag>
          <tag role="typerefnum">§4.4.2</tag>
        </tags>
        <title><tag close=". ">4.4.2</tag>Case 2: Multiple-Choice question</title>
        <toctitle><tag close=" ">4.4.2</tag>Case 2: Multiple-Choice question</toctitle>
        <para xml:id="S4.SS4.SSS2.p1">
          <p>Figure <ref labelref="LABEL:fig:7"/> shows a multiple-choice question example. In patients living with ADRD, losing eye contact during a conversation often signals cognitive overload, fatigue, or a need to disengage, rather than a visual tracking problem. Thus, the recommended approach is to respect the individual’s need for space and reduce stimulation, rather than attempting to immediately re-engage them. However, <Math mode="inline" tex="79\%" text="79percent" xml:id="S4.SS4.SSS2.p1.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="percent" role="POSTFIX">%</XMTok>
                  <XMTok meaning="79" role="NUMBER">79</XMTok>
                </XMApp>
              </XMath>
            </Math> of the LLMs chose “B”. This reflects a tendency to treat behavioral cues as attention problems to be corrected, prioritizing task or conversation continuity over respecting disengagement and emotion regulation.</p>
        </para>
        <figure inlist="lof" labels="LABEL:fig:5" xml:id="S4.F5">
          <tags>
            <tag>Figure 5</tag>
            <tag role="autoref">Figure 5</tag>
            <tag role="refnum">5</tag>
            <tag role="typerefnum">Figure 5</tag>
          </tags>
          <graphics candidates="figures/Example_1.png" class="ltx_centering" graphic="figures/Example_1.png" options="width=411.939pt" xml:id="S4.F5.g1"/>
          <toccaption class="ltx_centering"><tag close=" ">5</tag>
True/False question example of ADRD Caregiving QA that most LLMs answered incorrectly.
</toccaption>
          <caption class="ltx_centering"><tag close=". ">Figure 5</tag>
True/False question example of ADRD Caregiving QA that most LLMs answered incorrectly.
</caption>
        </figure>
        <figure inlist="lof" labels="LABEL:fig:6" xml:id="S4.F6">
          <tags>
            <tag>Figure 6</tag>
            <tag role="autoref">Figure 6</tag>
            <tag role="refnum">6</tag>
            <tag role="typerefnum">Figure 6</tag>
          </tags>
          <graphics candidates="figures/Example_1_Reasoning.png" class="ltx_centering" graphic="figures/Example_1_Reasoning.png" options="width=411.939pt" xml:id="S4.F6.g1"/>
          <toccaption class="ltx_centering"><tag close=" ">6</tag>
Reasoning by llama3:8b-instruct-fp16 and claude-opus-4-5-20251101 of Case 1.
</toccaption>
          <caption class="ltx_centering"><tag close=". ">Figure 6</tag>
Reasoning by llama3:8b-instruct-fp16 and claude-opus-4-5-20251101 of Case 1.
</caption>
        </figure>
        <para xml:id="S4.SS4.SSS2.p2">
          <p>Figure <ref labelref="LABEL:fig:8"/> presents examples of LLMs’ key reasoning steps. The complete prompts and outputs are in the Appendix. Both qwen3-max-2025-09-23 and claude-opus-4-5-20251101 achieved an overall accuracy of <Math mode="inline" tex="&gt;90\%" text="absent &gt; 90percent" xml:id="S4.SS4.SSS2.p2.m1">
              <XMath>
                <XMApp>
                  <XMTok meaning="greater-than" role="RELOP">&gt;</XMTok>
                  <XMTok meaning="absent"/>
                  <XMApp>
                    <XMTok meaning="percent" role="POSTFIX">%</XMTok>
                    <XMTok meaning="90" role="NUMBER">90</XMTok>
                  </XMApp>
                </XMApp>
              </XMath>
            </Math> on the ADRD Caregiving QA. However, both answered this question incorrectly. The models recognized that a person living with dementia might experience attentional, perceptual, or visual field challenges. Both then treated this partial clinical insight as sufficient justification for active re-engagement, recommending that the caregiver reposition to re-enter the individual’s visual field. This reasoning reflects a form of goal misalignment. Rather than interpreting the loss of eye contact as a cue about the caregiving interaction, the models focused on preserving conversational continuity. In doing so, they implicitly assumed that maintaining engagement is always desirable, failing to recognize that, in ADRD care, averted gaze is often a low-intensity disengagement or overload signal rather than a problem to be corrected. Unlike errors stemming from a lack of medical knowledge, this failure is an inability to correctly interpret contextual qualifiers embedded in dementia care, particularly those related to disengagement and emotion regulation. Although they demonstrated an understanding of dementia-related perceptual limitations, many LLMs emphasized cognitive characteristics over care principles. As a result, disengagement cues were framed as attention deficits, leading to recommended actions that may increase pressure or distress.</p>
        </para>
        <figure inlist="lof" labels="LABEL:fig:7" xml:id="S4.F7">
          <tags>
            <tag>Figure 7</tag>
            <tag role="autoref">Figure 7</tag>
            <tag role="refnum">7</tag>
            <tag role="typerefnum">Figure 7</tag>
          </tags>
          <graphics candidates="figures/Example_3.png" class="ltx_centering" graphic="figures/Example_3.png" options="width=433.62pt" xml:id="S4.F7.g1"/>
          <toccaption class="ltx_centering"><tag close=" ">7</tag>
Multiple-choice example of ADRD Caregiving QA that most models answered incorrectly.
</toccaption>
          <caption class="ltx_centering"><tag close=". ">Figure 7</tag>
Multiple-choice example of ADRD Caregiving QA that most models answered incorrectly.
</caption>
        </figure>
        <figure inlist="lof" labels="LABEL:fig:8" xml:id="S4.F8">
          <tags>
            <tag>Figure 8</tag>
            <tag role="autoref">Figure 8</tag>
            <tag role="refnum">8</tag>
            <tag role="typerefnum">Figure 8</tag>
          </tags>
          <graphics candidates="figures/Example_3_Reasoning.png" class="ltx_centering" graphic="figures/Example_3_Reasoning.png" options="width=433.62pt" xml:id="S4.F8.g1"/>
          <toccaption class="ltx_centering"><tag close=" ">8</tag>
Reasoning by qwen3-max-2025-09-23 and claude-opus-4-5-20251101 of Case 2.
</toccaption>
          <caption class="ltx_centering"><tag close=". ">Figure 8</tag>
Reasoning by qwen3-max-2025-09-23 and claude-opus-4-5-20251101 of Case 2.
</caption>
        </figure>
      </subsubsection>
    </subsection>
  </section>
  <section inlist="toc" labels="LABEL:sec:limit" xml:id="S5">
    <tags>
      <tag>5</tag>
      <tag role="autoref">section 5</tag>
      <tag role="refnum">5</tag>
      <tag role="typerefnum">§5</tag>
    </tags>
    <title><tag close=". ">5</tag>Limitations and Ethical Considerations</title>
    <toctitle><tag close=" ">5</tag>Limitations and Ethical Considerations</toctitle>
    <para xml:id="S5.p1">
      <p>ADRD-Bench does not involve confidential data. All content was derived from publicly available, de-identified resources. The ADRD Caregiving QA items were developed from widely used, clinically validated educational resources and reviewed by a clinician with national expertise in comprehensive ADRD care to ensure quality and relevance. This benchmark is intended for research and evaluation purposes. We caution against the direct use of LLMs’ outputs in real-world clinical or caregiving settings without appropriate human oversight. Responsible use and safety should remain central considerations in downstream applications.</p>
    </para>
    <para xml:id="S5.p2">
      <p>As a first, preliminary benchmark, this work has several limitations. Because the ADRD Unified QA portion was drawn from existing public benchmarks, some of its content may have been included in the training data of certain models. Such exposure could artificially boost performance scores. The current ADRD Caregiving QA questions are in multiple-choice or true/false format, which forces models to choose an answer even when they might be uncertain. This may distort performance scores through educated guesses or the exploitation of answer cues <cite class="ltx_citemacro_citep">(<bibref bibrefs="liu_beyond_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. We did not include open-ended generative tasks, since automatic evaluation of long-form responses remains challenging <cite class="ltx_citemacro_citep">(<bibref bibrefs="singhal_large_2023" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. As a result, this benchmark primarily emphasizes factual correctness and does not measure a model’s ability in conversational interactions, such as proactively demonstrating empathy.</p>
    </para>
    <para xml:id="S5.p3">
      <p>The current questions are single-turn Q&amp;A pairs. Multi-turn dialogues and interactive problem-solving are not yet included, so the benchmark does not test a model’s ability to carry out extended conversations, such as counseling a caregiver through a stressful situation over several exchanges, which is an important aspect of real-world AI assistance. In addition, caregiving decisions are often context-dependent and may vary across patient needs and professional perspectives. At this stage, ADRD-Bench covers only a small portion of ADRD knowledge and should be expanded by the broader community in future iterations for better coverage and precision.</p>
    </para>
    <para xml:id="S5.p4">
      <p>Finally, the benchmark was developed in English and is rooted in the ABC model, focusing primarily on ADRD care in the U.S. ADRD-Bench’s generalizability to other languages and healthcare systems therefore needs further investigation. Note that high accuracy on this benchmark does not guarantee that an LLM will be reliable or effective in real clinical or home-care deployment. The benchmark is a step toward identifying strengths and weaknesses. Users should be careful not to over-interpret the results and should use them in conjunction with other necessary evaluations (e.g., user studies).

<!--  %**** arXiv.tex Line 275 **** --></p>
    </para>
  </section>
  <section inlist="toc" labels="LABEL:sec:conclusion" xml:id="S6">
    <tags>
      <tag>6</tag>
      <tag role="autoref">section 6</tag>
      <tag role="refnum">6</tag>
      <tag role="typerefnum">§6</tag>
    </tags>
    <title><tag close=". ">6</tag>Conclusion, Discussion, and Future Work</title>
    <toctitle><tag close=" ">6</tag>Conclusion, Discussion, and Future Work</toctitle>
    <para xml:id="S6.p1">
      <p>We presented ADRD-Bench, the first benchmark dedicated to ADRD. By curating 1,352 ADRD-focused questions from seven existing medical QA datasets and introducing 149 daily caregiving questions from national, widely adopted caregiver education materials, this work provides an evaluation platform for both factual clinical knowledge and practical caregiving reasoning. The evaluation of SOTA LLMs on ADRD-Bench yielded several key findings. On the one hand, top models achieved high accuracy (the best scores were 0.9334 on ADRD Unified QA and 0.9664 on ADRD Caregiving QA). On the other hand, failure modes beyond numerical accuracy were discovered. For example, LLMs often overgeneralize clinical facts and sometimes fail to identify the major goal in a question <cite class="ltx_citemacro_citep">(<bibref bibrefs="kim_limitations_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Therefore, high overall accuracy does not guarantee true competency in ADRD care <cite class="ltx_citemacro_citep">(<bibref bibrefs="kim_limitations_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Even advanced LLMs may produce misguided responses due to shallow reasoning or misalignment with care goals <cite class="ltx_citemacro_citep">(<bibref bibrefs="umucu_empathy_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. Together, our benchmark and evaluation lay the groundwork for rigorously assessing LLMs in a high-impact yet underdeveloped area of medical AI <cite class="ltx_citemacro_citep">(<bibref bibrefs="steijger_use_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>.</p>
    </para>
    <para xml:id="S6.p2">
      <p>To ensure safer and more effective AI for ADRD care, domain-specific alignment techniques that bridge the gap between clinical knowledge and caregiving practices should be a focus in future research <cite class="ltx_citemacro_citep">(<bibref bibrefs="umucu_empathy_2025" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. This benchmark and our findings reinforce the importance of rigorous, targeted evaluation in high-stakes domains. As with any other medical specialty, ADRD care is a complex and evolving field <cite class="ltx_citemacro_citep">(<bibref bibrefs="alz_association_mapping_2024" separator=";" show="AuthorsPhrase1Year" yyseparator=",">
            <bibrefphrase>, </bibrefphrase>
          </bibref>)</cite>. As a result, benchmark questions and answers need to be updated in a timely manner to reflect state-of-the-art practice and research. We encourage the broader community to build on this work to advance safe, reliable, and compassionate AI in the ADRD domain.

<!--  %“input–7-ethical˝ --></p>
    </para>
<!--  %% 
     %% The acknowledgments section is defined using the ”acks” environment
     %% (and NOT an unnumbered section). This ensures the proper
     %% identification of the section in the article metadata, and the
     %% consistent spelling of the heading.-->    <acknowledgements name="Acknowledgements">
This research was partially supported by the National Science Foundation under Grant No. 2426639. The authors acknowledge Oracle Cloud Infrastructure for providing computing resources that contributed to parts of the research results reported.
</acknowledgements>
<!--  %% 
     %% The next two lines define the bibliography style to be used, and
     %% the bibliography file.-->  </section>
  <bibliography citestyle="authoryear" files="sample-base" xml:id="bib">
    <title>References</title>
  </bibliography>
<!--  %% 
     %% If your work has an appendix, this is the place to put it.-->  <appendix inlist="toc" xml:id="A1">
    <tags>
      <tag>Appendix A</tag>
      <tag role="autoref">Appendix A</tag>
      <tag role="refnum">A</tag>
      <tag role="typerefnum">Appendix A</tag>
    </tags>
    <title><tag close=" ">Appendix A</tag>Appendix</title>
    <toctitle><tag close=" ">A</tag>Appendix</toctitle>
<!--  %**** arXiv.tex Line 300 **** -->    <subsection inlist="toc" xml:id="A1.SS1">
      <tags>
        <tag>A.1</tag>
        <tag role="autoref">subsection A.1</tag>
        <tag role="refnum">A.1</tag>
        <tag role="typerefnum">§A.1</tag>
      </tags>
      <title><tag close=". ">A.1</tag>ADRD-Bench Prompts</title>
      <toctitle><tag close=" ">A.1</tag>ADRD-Bench Prompts</toctitle>
      <para class="ltx_noindent" xml:id="A1.SS1.p1">
        <tabular class="ltx_centering" colsep="8.0pt" rowsep="1.5pt" vattach="middle">
          <tr>
            <td align="justify" border="l r t" width="398.9pt"><text class="ltx_wrap" font="bold">Prompt for PubMedQA</text></td>
          </tr>
          <tr>
            <td align="justify" border="b l r t" width="398.9pt">You are a highly knowledgeable AI assistant specializing in biomedical sciences. Answer A for yes, B for no, or C for maybe. Do not include any explanation or additional text. Output only the letter on a single line. Context: {context} Question: {question} A. yes B. no C. maybe ONLY OUTPUT ONE LETTER. Your answer MUST be exactly one of: A, B, or C. Do NOT include explanation or text after the letter. <!--  %**** arXiv.tex Line 325 **** -->Answer:</td>
          </tr>
        </tabular>
      </para>
      <para class="ltx_noindent" xml:id="A1.SS1.p2">
        <tabular class="ltx_centering" colsep="8.0pt" rowsep="1.5pt" vattach="middle">
          <tr>
            <td align="justify" border="l r t" width="398.9pt"><text class="ltx_wrap" font="bold">Prompt for HEAD-QA (Four Options)</text></td>
          </tr>
          <tr>
            <td align="justify" border="b l r t" width="398.9pt">You are a highly knowledgeable AI assistant specializing in biomedical sciences. Your task is to answer multiple-choice questions accurately based on the options provided. Each question will relate to biomedical concepts, and you will be asked to choose the most appropriate answer. Select the correct answer by outputting only the letter corresponding to your choice (A, B, C, D). Please only use a capital letter, without including any content or explanation. Question: {question} A. options[0] B. options[1] C. options[2] D. options[3] <!--  %**** arXiv.tex Line 350 **** -->ONLY OUTPUT ONE LETTER. Your answer MUST be exactly one of: A, B, C, or D. Do NOT include explanation or text after the letter. Answer:</td>
          </tr>
        </tabular>
      </para>
      <para class="ltx_noindent" xml:id="A1.SS1.p3">
        <tabular class="ltx_centering" colsep="8.0pt" rowsep="1.5pt" vattach="middle">
          <tr>
            <td align="justify" border="l r t" width="398.9pt"><text class="ltx_wrap" font="bold">Prompt for MedBullets (Four Options)</text></td>
          </tr>
          <tr>
            <td align="justify" border="b l r t" width="398.9pt">You are a highly knowledgeable AI assistant specializing in medicine. Your task is to answer medical questions similar to those found on the USMLE Step 2/3 exams. You will be provided with a clinical scenario followed by several multiple-choice options. Select the correct answer by outputting only the letter corresponding to your choice (A, B, C, D). Please only use a capital letter, without including any content. Clinical Scenario: {question} <!--  %**** arXiv.tex Line 375 **** -->A. options[0] B. options[1] C. options[2] D. options[3] ONLY OUTPUT ONE LETTER. Your answer MUST be exactly one of: A, B, C or D. Do NOT include explanation or text after the letter. Answer:</td>
          </tr>
        </tabular>
      </para>
      <para class="ltx_noindent" xml:id="A1.SS1.p4">
        <tabular class="ltx_centering" colsep="8.0pt" rowsep="1.5pt" vattach="middle">
          <tr>
            <td align="justify" border="l r t" width="398.9pt"><text class="ltx_wrap" font="bold">Prompt for MedMCQA and MedQA</text></td>
          </tr>
          <tr>
            <td align="justify" border="b l r t" width="398.9pt">You are a highly knowledgeable AI assistant specializing in medicine. Give a letter answer among A, B, C, or D. Do not include any explanation or additional text. <!--  %**** arXiv.tex Line 400 **** --><break/>Question: {question} A. options[0] B. options[1] C. options[2] D. options[3] ONLY OUTPUT ONE LETTER. Your answer MUST be exactly one of: A, B, C, or D. Do NOT include explanation or text after the letter. Answer:</td>
          </tr>
        </tabular>
      </para>
      <para class="ltx_noindent" xml:id="A1.SS1.p5">
        <tabular class="ltx_centering" colsep="8.0pt" rowsep="1.5pt" vattach="middle">
          <tr>
            <td align="justify" border="l r t" width="398.9pt"><text class="ltx_wrap" font="bold">Prompt for MedHallu</text></td>
          </tr>
          <tr>
            <td align="justify" border="b l r t" width="398.9pt">You are an AI assistant with extensive knowledge in the field of medicine. I want you act as an answer judge. Given a question and an answer, your objective is to determine if the provided answer contains non-factual or hallucinated information. You SHOULD give your judgment based on the following hallucination types and the world knowledge. You are trying to determine if there is a factual contradiction between the answer and the world knowledge. Some information in the answer might be fabricated. Answer ’0’ if the answer is factual and ’1’ if the answer is hallucinated. You can categorize the answer as hallucinated if it falls in either of the categories: Type: Misinterpretation of Question - These are hallucinated answers that misunderstand the question, leading to an off-topic or irrelevant response. Type: Incomplete Information - These are hallucinated answers that point out what is not true without providing correct information. Type: Mechanism and Pathway Misattribution - These are hallucinated answers that falsely attribute biological mechanisms, molecular pathways, or disease processes that contradict established medical knowledge. Type: Methodological and Evidence Fabrication - Inventing false research methods, statistical data, or specific clinical outcomes. Do not return anything else, just the answer. Return just an integer value, ’0’ if the answer is factual and ’1’ if the answer is hallucinated. No letter or word, just the integer value. World Knowledge: {knowledge} Question: {question} Answer: {hallucinated answer} ONLY OUTPUT ONE NUMBER. <!--  %**** arXiv.tex Line 450 **** -->Your answer MUST be exactly one of: 0 or 1. Do NOT include explanation or text after the letter. Your Judgment:</td>
          </tr>
        </tabular>
      </para>
      <para class="ltx_noindent" xml:id="A1.SS1.p6">
        <tabular class="ltx_centering" colsep="8.0pt" rowsep="1.5pt" vattach="middle">
          <tr>
            <td align="justify" border="l r t" width="398.9pt"><text class="ltx_wrap" font="bold">Prompt for MEDEC</text></td>
          </tr>
          <tr>
            <td align="justify" border="b l r t" width="398.9pt">The following is a medical narrative about a patient. You are a skilled medical doctor reviewing the clinical text. The text is either correct or contains only one error. The text has a sentence per line. Each line starts with the sentence ID, followed by a space character then the sentence to check. You should carefully check every sentence of the text. If every sentence in the text is correct then only return the following output: -1. If one sentence in the text has a medical error, only return the sentence ID of the sentence containing the error. <!--  %**** arXiv.tex Line 475 **** --><break/>Make sure not to add any other words or sentences. Clinical Note: {sentences} ONLY OUTPUT ONE NUMBER. Your answer MUST be exactly one number. Do NOT include explanation or text after the number. Answer:</td>
          </tr>
        </tabular>
      </para>
      <para class="ltx_noindent" xml:id="A1.SS1.p7">
        <tabular class="ltx_centering" colsep="8.0pt" rowsep="1.5pt" vattach="middle">
          <tr>
            <td align="justify" border="l r t" width="398.9pt"><text class="ltx_wrap" font="bold">Prompt for ADRD Caregiving Multiple-Choice QA</text></td>
          </tr>
          <tr>
            <td align="justify" border="b l r t" width="398.9pt">You are a highly knowledgeable AI assistant specializing in medicine. Give a letter answer among A, B, C, D, or E. Do not include any explanation or additional text. <!--  %**** arXiv.tex Line 500 **** --><break/>Question: {question} A. options[0] B. options[1] C. options[2] D. options[3] E. options[4] ONLY OUTPUT ONE LETTER. Your answer MUST be exactly one of: A, B, C, D, or E. Do NOT include explanation or text after the letter. Answer:</td>
          </tr>
        </tabular>
      </para>
      <para class="ltx_noindent" xml:id="A1.SS1.p8">
        <tabular class="ltx_centering" colsep="8.0pt" rowsep="1.5pt" vattach="middle">
          <tr>
            <td align="justify" border="l r t" width="398.9pt"><text class="ltx_wrap" font="bold">Prompt for ADRD Caregiving True/False QA</text></td>
          </tr>
          <tr>
            <td align="justify" border="b l r t" width="398.9pt">You are a highly knowledgeable AI assistant specializing in medicine. Answer A for yes, B for no. Do not include any explanation or additional text. Output only the letter on a single line. Question: {question} A. yes B. no ONLY OUTPUT ONE LETTER. Your answer MUST be exactly one of: A or B. Do NOT include explanation or text after the letter. Answer:</td>
          </tr>
        </tabular>
      </para>
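      <para class="ltx_noindent">
        <p>The prompts above share one pattern: a fixed instruction, one or more placeholders, and a demand for a single-character answer. The following is a minimal sketch of how such a prompt could be filled and scored, using the ADRD Caregiving True/False prompt as the example. The line breaks inside the template, the helper names (build_prompt, parse_letter, accuracy), and the choice to count unparseable outputs as wrong are all assumptions made for illustration; they are not taken from the evaluation code used in this work.</p>
      </para>

```python
# Sketch only: line breaks in the template and all helper names are assumptions.
TRUE_FALSE_TEMPLATE = (
    "You are a highly knowledgeable AI assistant specializing in medicine. "
    "Answer A for yes, B for no. "
    "Do not include any explanation or additional text. "
    "Output only the letter on a single line.\n"
    "Question: {question}\n"
    "A. yes\n"
    "B. no\n"
    "ONLY OUTPUT ONE LETTER. Your answer MUST be exactly one of: A or B. "
    "Do NOT include explanation or text after the letter.\n"
    "Answer:"
)


def build_prompt(question: str) -> str:
    """Fill the {question} placeholder of the template."""
    return TRUE_FALSE_TEMPLATE.format(question=question)


def parse_letter(raw_output: str, allowed: str = "AB"):
    """Return the answer letter, or None unless the output is exactly one allowed letter."""
    candidate = raw_output.strip()
    if len(candidate) == 1 and candidate in allowed:
        return candidate
    return None


def accuracy(predictions, gold):
    """Fraction of items answered correctly; unparseable outputs (None) count as wrong."""
    correct = sum(1 for p, g in zip(predictions, gold) if p is not None and p == g)
    return correct / len(gold)
```

      <para class="ltx_noindent">
        <p>A strict parser of this kind rejects any output that adds text around the letter, which matches the instruction in the prompts but may undercount models that answer correctly in a different format.</p>
      </para>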
      <pagination role="newpage"/>
    </subsection>
    <subsection inlist="toc" xml:id="A1.SS2">
      <tags>
        <tag>A.2</tag>
        <tag role="autoref">subsection A.2</tag>
        <tag role="refnum">A.2</tag>
        <tag role="typerefnum">§A.2</tag>
      </tags>
      <title><tag close=". ">A.2</tag>Zero-Shot Chain-of-Thought Prompts for Case Analysis</title>
      <toctitle><tag close=" ">A.2</tag>Zero-Shot Chain-of-Thought Prompts for Case Analysis</toctitle>
      <para class="ltx_noindent" xml:id="A1.SS2.p1">
<!--  %**** arXiv.tex Line 550 **** -->        <tabular class="ltx_centering" colsep="8.0pt" rowsep="1.5pt" vattach="middle">
          <tr>
            <td align="justify" border="l r t" width="398.9pt"><text class="ltx_wrap" font="bold">Prompt for ADRD Caregiving Multiple-Choice QA</text></td>
          </tr>
          <tr>
            <td align="justify" border="b l r t" width="398.9pt">You are a highly knowledgeable AI assistant specializing in medicine. Give a letter answer among A, B, C, D, or E. Reason through the question step by step and explicitly write out your reasoning before giving the final answer. Question: {question} A. options[0] B. options[1] C. options[2] D. options[3] E. options[4] First, output the reasoning process. Then, on a new line, output the final answer as a single letter (A, B, C, D, or E). Format: Reasoning: &lt;your reasoning&gt; Answer: &lt;one letter&gt;</td>
          </tr>
        </tabular>
      </para>
      <para class="ltx_noindent" xml:id="A1.SS2.p2">
        <tabular class="ltx_centering" colsep="8.0pt" rowsep="1.5pt" vattach="middle">
          <tr>
            <td align="justify" border="l r t" width="398.9pt"><text class="ltx_wrap" font="bold">Prompt for ADRD Caregiving True/False QA</text></td>
          </tr>
          <tr>
            <td align="justify" border="b l r t" width="398.9pt">You are a highly knowledgeable AI assistant specializing in medicine. Answer A for yes, B for no. Reason through the question step by step and explicitly write out your reasoning before giving the final answer. Question: {question} A. yes B. no First, output the reasoning process. Then, on a new line, output the final answer as a single letter (A or B). <!--  %**** arXiv.tex Line 600 **** -->Format: Reasoning: &lt;your reasoning&gt; Answer: &lt;one letter&gt;</td>
          </tr>
        </tabular>
      </para>
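      <para class="ltx_noindent">
        <p>With chain-of-thought prompts, the final answer has to be recovered from free text. The sketch below shows one way this could be done: take the token that follows the last occurrence of "Answer:" and accept it if it is an allowed letter. The function name extract_cot_answer, the acceptance of lowercase letters, and the optional mapping from "Yes"/"No" to A/B are assumptions for illustration only. The mapping is included because the true/false outputs reproduced in Appendix A.3 end with "Answer: Yes" rather than with a letter.</p>
      </para>

```python
import re

# Hypothetical mapping for true/false items, where A means yes and B means no.
YES_NO_TO_LETTER = {"yes": "A", "no": "B"}


def extract_cot_answer(output: str, allowed: str = "ABCDE", map_yes_no: bool = False):
    """Return the final answer letter from a 'Reasoning: ... Answer: X' output, or None."""
    # Collect the alphabetic token after every "Answer:" marker in the output.
    matches = re.findall(r"Answer:\s*([A-Za-z]+)", output)
    if not matches:
        return None
    # Use the last marker, in case the reasoning text also contains "Answer:".
    token = matches[-1]
    # Assumption: a single letter is accepted in either case and upper-cased.
    if len(token) == 1 and token.upper() in allowed:
        return token.upper()
    # Optionally map a spelled-out yes/no to the corresponding letter.
    if map_yes_no:
        return YES_NO_TO_LETTER.get(token.lower())
    return None
```

      <para class="ltx_noindent">
        <p>Whether a spelled-out "Yes" should be scored as a valid answer or as a format violation is a design choice; the flag keeps that choice explicit.</p>
      </para>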
      <pagination role="newpage"/>
    </subsection>
    <subsection inlist="toc" xml:id="A1.SS3">
      <tags>
        <tag>A.3</tag>
        <tag role="autoref">subsection A.3</tag>
        <tag role="refnum">A.3</tag>
        <tag role="typerefnum">§A.3</tag>
      </tags>
      <title><tag close=". ">A.3</tag>Complete Reasoning Outputs for Case Analysis Examples</title>
      <toctitle><tag close=" ">A.3</tag>Complete Reasoning Outputs for Case Analysis Examples</toctitle>
      <para class="ltx_noindent" xml:id="A1.SS3.p1">
        <tabular class="ltx_centering" colsep="8.0pt" rowsep="1.5pt" vattach="middle">
          <tr>
            <td align="justify" border="l r t" width="398.9pt"><text class="ltx_wrap" font="bold">Reasoning Output for True/False Case</text></td>
          </tr>
          <tr>
            <td align="justify" border="l r t" width="398.9pt">Reasoning by llama3:8b-instruct-fp16: <text font="typewriter">{</text> Here is the reasoning process: Reasoning: The question states that the Alzheimer’s patient is unable to perform basic activities of daily living. This implies that the patient has difficulty with tasks such as bathing, dressing, eating, and using the bathroom, among others. These are essential skills for independent living, and if someone is unable to perform them, it suggests a significant impairment in their cognitive or physical abilities. <!--  %**** arXiv.tex Line 625 **** --><break/>Answer: Yes <text class="ltx_wrap" font="typewriter">}</text></td>
          </tr>
          <tr>
            <td align="justify" border="b l r t" width="398.9pt">Reasoning by claude-opus-4-5-20251101: <text font="typewriter">{</text> Reasoning: The question asks whether Alzheimer’s patients are unable to perform basic activities of daily living (ADLs). Alzheimer’s disease is a progressive neurodegenerative disorder that affects memory, thinking, and behavior. The impact on daily living activities depends on the stage of the disease: 1. Early stage: Patients may have mild memory problems but can generally still perform most ADLs independently (bathing, dressing, eating, toileting, etc.) 2. Middle stage: Patients begin to have more difficulty with ADLs and may need assistance with some tasks. 3. Late/Advanced stage: Patients typically lose the ability to perform basic ADLs and require full-time care and assistance. The question states “The Alzheimer’s patient” in a general sense, implying the condition overall. In advanced stages of Alzheimer’s disease, patients do become unable to perform basic activities of daily living such as bathing, dressing, eating, and using the toilet independently. Since the question appears to be asking about a characteristic feature of Alzheimer’s disease (particularly in its progression), and inability to perform ADLs is indeed a hallmark of advanced Alzheimer’s disease, the answer would be yes - Alzheimer’s patients (especially in later stages) are unable to perform basic activities of daily living. Answer: Yes <text class="ltx_wrap" font="typewriter">}<!--  %**** arXiv.tex Line 650 **** --></text></td>
          </tr>
        </tabular>
      </para>
      <para class="ltx_noindent" xml:id="A1.SS3.p2">
        <tabular class="ltx_centering" colsep="8.0pt" rowsep="1.5pt" vattach="middle">
          <tr>
            <td align="justify" border="l r t" width="398.9pt"><text class="ltx_wrap" font="bold">Reasoning Output for Multiple-Choice Case</text></td>
          </tr>
          <tr>
            <td align="justify" border="l r t" width="398.9pt">Reasoning by qwen3-max-2025-09-23: { Reasoning: When caring for someone with dementia, maintaining effective communication is crucial. Individuals with dementia often experience difficulties with attention, perception, and processing sensory information. Losing eye contact may indicate that they are distracted, overwhelmed, or simply have shifted their visual focus due to cognitive changes—not necessarily disinterest or defiance. Option A suggests moving closer and increasing body language, which could be perceived as intrusive or overwhelming, especially if the person is already feeling overstimulated. Option B involves following the patient’s gaze and repositioning to stay in their visual field. This is a respectful, non-confrontational way to re-establish connection without demanding attention. It aligns with best practices in dementia care, which emphasize adapting to the person’s current state rather than forcing engagement. Option C assumes the person needs space, but losing eye contact doesn’t always mean they need a break—it might just reflect difficulty maintaining focus. Pausing unnecessarily could disrupt meaningful interaction. <!--  %**** arXiv.tex Line 675 **** -->Option D calling their name firmly and reminding them of the topic—can come across as scolding or pressuring, which may increase anxiety or confusion. Option E assumes the issue is auditory distraction, but the problem described is loss of eye contact (a visual cue), not lack of hearing or response to speech. Therefore, the most appropriate action is to gently reposition yourself within their visual field to support their ability to re-engage naturally. Answer: B }</td>
          </tr>
          <tr>
            <td align="justify" border="b l r t" width="398.9pt">Reasoning by claude-opus-4-5-20251101: { Reasoning: When caring for someone with dementia, maintaining effective communication requires understanding their cognitive and perceptual challenges. Let me analyze each option: A. Moving closer and using increased body language - While body language is helpful, moving closer without repositioning to their visual field may not address the core issue of lost eye contact. This could also feel intrusive. B. Following the patient’s gaze and repositioning yourself to remain in their visual field - This approach acknowledges that people with dementia may have difficulty tracking movement or maintaining focus. By repositioning yourself to where they’re looking, you work with their current attention rather than against it. This is a person-centered, non-confrontational approach. C. Recognizing they need space and pausing - While sometimes appropriate, automatically assuming they need space when eye contact is lost may not be the best first response. The loss of eye contact could simply be due to visual field issues or distraction, not necessarily a need for space. D. Calling their name firmly and reminding them of the topic - This approach can be perceived as confrontational or demanding, which can increase anxiety and agitation in people with dementia. It’s generally not recommended to use firm tones or put pressure on them. <!--  %**** arXiv.tex Line 700 **** -->E. Checking for environmental noise and increasing volume - While environmental factors matter, simply increasing volume doesn’t address the visual attention issue and could be startling or perceived as aggressive. The best practice in dementia care is to adapt to the person’s current state and work within their capabilities. Option B represents a gentle, person-centered approach that maintains connection without being demanding or intrusive. Answer: B }</td>
          </tr>
        </tabular>
      </para>
    </subsection>
  </appendix>
</document>
