KiVA: Kid-inspired Visual Analogies for Testing Large Multimodal Models

Eunice Yiu (1), Maan Qraitem (2), Anisa Noor Majhi (1), Charlie Wong (1), Yutong Bai (1), Shiry Ginosar (3, 4), Alison Gopnik (1), Kate Saenko (2)

(1) University of California, Berkeley
(2) Boston University
(3) Google DeepMind
(4) Toyota Technological Institute at Chicago

This paper investigates visual analogical reasoning in large multimodal models (LMMs) compared to human adults and children. A “visual analogy” is an abstract rule inferred from one image and applied to another. While benchmarks exist for testing visual reasoning in LMMs, they require advanced skills and omit basic visual analogies that even young children can make. Inspired by developmental psychology, we propose a new benchmark of 4,300 visual transformations of everyday objects to test LMMs on visual analogical reasoning and compare them to children (ages three to five) and to adults. We structure the evaluation into three stages: identifying what changed (e.g., color, number, etc.), how it changed (e.g., added one object), and applying the rule to new scenarios. Our findings show that while GPT-o1, GPT-4V, LLaVA-1.5, and MANTIS identify the “what” effectively, they struggle with quantifying the “how” and extrapolating this rule to new objects. In contrast, children and adults exhibit much stronger analogical reasoning at all three stages. Additionally, the strongest tested model, GPT-o1, performs better in tasks involving simple surface-level visual attributes like color and size, correlating with quicker human adult response times. Conversely, more complex tasks such as number, rotation, and reflection, which necessitate extensive cognitive processing and understanding of extrinsic spatial properties in the physical world, present more significant challenges. Altogether, these findings highlight the limitations of training models on data that primarily consists of 2D images and text. The benchmark (code, data, models) is available at: https://github.com/ey242/KiVA


1 Introduction

What is visual cognition? Humans make countless visual inferences every day from observing objects and scenes, quickly detecting even subtle visual changes. We generalize common patterns about changes from different observations and use these insights to solve new problems. If we put a wool sweater in the washing machine and it comes out smaller, we might infer that the wash shrinks wool and avoid washing wool coats in the future. If cookies disappear, we might infer that someone is eating our treats and proceed to hide the chocolate elsewhere. This ability to draw parallels between situations and apply learned patterns to a new scenario is known as analogical reasoning. Formally defined, an analogy is a systematic comparison between structures that uses the properties and relations of objects in a source structure to infer properties and relations of objects in a target structure ( , ). Analogical reasoning is a hallmark of human intelligence and learning ( , ). It is what enables us to be flexible, adaptive and robust learners across a wide variety of settings, finding meaning in patterns and making out-of-distribution generalizations ( , ). Analogical reasoning is already available to young children ( , ), and is crucial for human problem-solving in various contexts, from building scientific models to appreciating metaphors to formulating legal arguments.

Figure 1: KiVA: Kid-inspired Visual Analogies. (a) Visual analogy domains: the 5 visual analogy domains examined in KiVA and KiVA-adults (see Figure for the full task format). Unlike KiVA, the starting color, size, orientation and number of test objects in KiVA-adults further differ from the starting values of the given transformations. (b) Extrapolation accuracy: performance of children, adults and LMMs in extrapolating a transformation rule to a novel object in KiVA (top) and KiVA-adults (bottom).

Today, large multimodal models (LMMs) have made significant progress, but they remain data-hungry and require substantial human effort to adapt to new contexts ( , ). As analogical reasoning is instrumental for general-purpose and adaptive machines, it is crucial to examine whether current models have such capabilities. Critically, examining analogical capabilities does not permit models to “cheat” by merely depending on their training data because it requires context-dependent abstraction beyond general object recognition. In KiVA, the same object may undergo different kinds of transformations, requiring models to combine familiar elements in new, trial-specific ways. Reasoning about analogies involves first classifying relationships between object characteristics, specifying similarities and differences, then extrapolating the same relationship to new objects. This paper focuses on visual analogies, testing models’ ability to reason abstractly about visual observations. See Figure for a summary of the KiVA benchmark and results.

There is a growing body of work examining visual reasoning and generalization capabilities in large multimodal models ( , ). Existing benchmarks of visual analogies include (a) ARC ( , ) and ConceptARC ( , ), (b) variations of Raven’s Progressive Matrices ( , ) and (c) abstract spatial reasoning ( , ) (see prior benchmarks in Figure ). These prior benchmarks all have several critical limitations. First, they rely on abstract shapes and grids, lacking real-world relevance. This abstraction of stimuli neither aligns with the training data of large multimodal models nor effectively mimics the complexity and variability found in everyday visual tasks, making it less suitable for assessing how well AI models can perform analogical reasoning in practical contexts. Second, the transformations examined involve conjunctions of visual concepts such as extracting and transposing pixels according to some arbitrary rule, which do not tap into basic visual cognition. Humans do not require the ability to solve these specific tasks to function effectively in their daily lives nor to demonstrate their capacity for visual analogical reasoning. Third, while we know that models often perform poorly on these benchmarks, where they fail in the reasoning process needs to be clarified since existing evaluations focus solely on prediction accuracy rather than the reasoning approach or what is perceived.

Figure 2: Prior benchmarks versus KiVA for visual analogies. (a) Prior benchmarks like I. ConceptARC, II. Raven’s Progressive Matrices, and III. CCSE Reasoning involve arbitrary changes of abstract shapes and grids. (b) Our benchmark: KiVA examines basic changes that even three-year-olds can solve.

We propose a Kid-inspired Visual Analogies (KiVA) benchmark founded on developmental psychology (Figure (left)) ( , ). We focus our analysis on basic visual analogical capabilities that are present early in human development and are important for understanding the physical world. KiVA isolates the following fundamental capabilities that emerge early in human development: detecting changes in color ( , ) and size ( , ), changes that involve rotation and reflection ( , ), and changes in small numbers of objects ( , ). It is solvable by a three-year-old child. KiVA-adults serves as a more challenging version of KiVA that is solvable by adults but not by young children: it requires deeper generalization from given transformations (the starting values of objects in the given and test transformations are not aligned) and features more variations in the above visual domains (see details in Section ). Refer to Figure for sample test trials of KiVA and KiVA-adults. KiVA stands out in the following ways:

First, our dataset utilizes real-world, physically grounded objects curated from established 3D datasets of common household items ( , ) and toys that are familiar to human children ( , ), which align more closely with the training distribution of computer vision models and the visual data of humans than other visual analogical reasoning datasets (Figure ).

Second, our approach is inspired by developmental psychology, specifically how children learn to perform analogical reasoning not abstractly, but from simple objects in grounded contexts ( , ). We propose a similar approach for large multimodal models, investigating if they can perform like children on basic visual analogical reasoning tasks related to color, size, orientation, and number, as already reported in child development journals ( ). Starting with simple, real-world relevant tasks in child development allows models to develop robust reasoning abilities before tackling more advanced tasks, providing a clearer pathway for evaluating and improving cognitive functions in AI.

Third, we break down our evaluation to examine the different steps involved in analogical reasoning to determine which steps a model can perform and where it may fail: 1) classifying the domain of a visual transformation, 2) specifying the transformation rule, and 3) extrapolating the inferred rule to a new item. This three-stage evaluation (Figure ) gives us insights into models’ reasoning processes beyond simply selecting a correct or incorrect response at the end.

Results from KiVA and KiVA-adults demonstrate that state-of-the-art large multimodal models, i.e., GPT-o1 ( , ), GPT-4V ( , ), LLaVA-1.5 ( , ) and MANTIS ( , ), still cannot solve visual analogies like humans can. These models do not match even the capabilities of a three-year-old child in reasoning about number and reflection (Figure ). While LMMs can categorize some transformations, they still struggle to extrapolate those transformations to new objects. In particular, GPT-o1 and GPT-4V outperform LLaVA-1.5 and MANTIS but also demonstrate weaker performance in orientation and number changes than in size and color changes, which are processed more quickly by humans, at an earlier age ( , ), and in a more primary region of the visual cortex ( , ).

Taken together, KiVA and KiVA-adults not only mirror the natural progression of human cognitive development but also provide a more structured and comprehensive framework for evaluating the capabilities and growth of LMMs. We also release, on our project page, code for KiVA-compositionality, which combines multiple object transformations to probe even more complex compositional reasoning. This serves as the next benchmark for models to surpass after KiVA and KiVA-adults.

2 Related Work

Evaluating human visual analogical reasoning. There is a variety of tasks designed in Developmental Psychology to examine human visual analogical reasoning early on in life. Children are asked to compare simple object and relational matches ( , ) along dimensions such as color ( , ), number ( , ), size ( , ) and spatial orientation ( , ). Older children and adults are evaluated on Raven’s Progressive Matrices (RPMs) ( , ) and Bongard Problems ( , ). Even though they tend to be the most representative and largest testbeds for testing advanced visual analogical reasoning, RPMs and Bongard problems use abstract geometric shapes and test recognition of arbitrary patterns that (1) cannot be solved by children before the age of 6 and (2) are not critical to everyday visual processing. KiVA is the first visual analogical reasoning benchmark that includes common real-world objects and more natural visual cognition skills such as counting and spatial transformations — tasks that even a three-year-old child can handle ( , ). We also examine where people and models fail with more fine-grained evaluation.

Evaluating visuo-linguistic reasoning in AI models. Several proposals for evaluating modern AI systems’ visuo-linguistic reasoning capabilities followed the recent successes of large multimodal models. Many concentrate on a narrow, isolated set of tasks for detecting object properties like size estimation ( , ), color perception ( , ), counting objects ( , ), object viewpoint/pose and chirality ( , ) and visuo-linguistic compositionality ( , ). Typically, the objective of these tasks is to evaluate models’ ability to report a correct property about objects in an image. They lack the depth to probe pattern abstraction and generalization involved in visual analogical reasoning.

Broader benchmarks, such as visual question answering setups ( , ), attempt to investigate the models’ understanding of various visual concepts. One approach taken by ( , ) was to try and push the envelope on various tasks to capture anecdotal and qualitative observations regarding the performance of GPT-4. Perception Test ( , ) proposed a second approach: a visual video-based benchmark including developmentally-inspired tasks such as object permanence, object tracking, spatial relations, etc. Recently, the BLINK benchmark was introduced to show that core visual perception tasks, easily solvable by humans "within a blink," remain challenging for large multimodal models due to their resistance to language-based mediation ( , ). However, all these benchmarks fall short in evaluating the deeper, more complex aspects of visual analogical reasoning and generalization.

Another specific class of benchmarks tests generalization and reasoning within abstract puzzle grids. These include the Abstraction and Reasoning Corpus (ARC) ( , ) and ConceptARC ( , ); a direct translation of RPMs-based human evaluation has previously been applied to models by ( , ) and ( , ) (see these prior benchmarks in Figure ). However, the stimuli are simple, monotonic shapes like squares and circles, lacking real-world complexity and variability. Moreover, they emphasize complex pattern recognition and logical sequencing without real-world context—neglecting basic visual cognition skills even children possess—and this limited scope may render them unsuitable for training data that typically covers a much broader range of real-world visuals.

In summary, although many benchmarks assess advanced visual capabilities in large multimodal models, none evaluate visual cognition that is clearly exhibited by young children—such as predicting simple transformations of real-world objects—or use children as a baseline for comparison.

3 The KiVA Benchmark for Visual Analogical Reasoning

We introduce KiVA, a Kid-inspired Visual Analogies benchmark, wherein real-world objects undergo common transformations necessary for everyday visual cognition. We focus on isolating and testing basic visual transformations that even a three-year-old child understands ( , ). As we show in Figure , we examine noticing color changes ( , ), size changes ( , ), rotation, reflection ( , ), and number changes such as addition and subtraction of a small number of objects ( , ). We then build upon this benchmark by proposing KiVA-adults, which involves a greater variety of transformations and demands more abstract forms of generalization. It is solvable by adults but not by children under five.

Figure 3: An example of a trial in KiVA. Models and humans are first asked to classify a given transformation (left). If the classification is correct (green arrow), humans and models are further evaluated on their verbal specification of the transformation (middle) and then on visual extrapolation (right). Otherwise, humans and models skip to make a visual extrapolation (yellow arrow).

3.1 A Three-Stage Experimental Paradigm

We use our proposed dataset to benchmark computational models’ and human subjects’ visual analogical reasoning capabilities. We utilize the same testing procedure (Figure ) for both kinds of subjects. In each trial, we start by presenting a given transformation of an object that changes by a specific rule, following the experimental paradigm of other analogical reasoning benchmarks for humans and computational models ( , ). Inspired by the component processes model of analogical reasoning ( , ), we evaluate the subject’s ability to determine what changed (Verbal Classification), determine how it changed (Verbal Specification), and apply the same transformation rule to predict the outcome of a new object (Visual Extrapolation). We break the question down into these three steps to test the different cognitive processes involved in analogical reasoning. The first two assess the necessary prerequisites for accurate analogical reasoning, while the last step represents the core visual analogy task. Critically, KiVA retains the core nonverbal extrapolation task (last step) from previous benchmarks, and the verbal questions do not replace the core nonverbal tasks. Even without correct verbal responses, humans and models can still tackle the independently assessed visual extrapolation tasks. Thus, KiVA does not require specific language skills but provides a window into the analogical reasoning process of humans and models in reaching their final solutions. The first two verbal questions were further paraphrased by developmental psychologists so that they are comprehensible to a three-year-old child (Appendix ); models and adults did not benefit from the child-appropriate prompting, so the original prompt in Figure 3 was preserved. We pose all questions in a multiple-choice format for human children, adults and models, which enables automatic scoring. Option labels for correct responses were randomized such that LMMs’ option label bias does not correlate with task accuracy.
Furthermore, we provided the opportunity to select “Doesn’t apply” to accommodate responses that the provided choices may not cover. Excluding the “Doesn’t apply” option, chance level is 25% for Verbal Classification (4 choices) and 33% for Verbal Specification and Visual Extrapolation (3 choices). Refer to Figure for the three-stage query pipeline and Appendix for specific prompts.
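
The multiple-choice setup described above can be sketched as follows. This is a minimal illustration, not the benchmark's own code: the function names, the letter labels, and the choice to keep “Doesn’t apply” as a fixed final option are assumptions made for the example.

```python
import random


def chance_level(num_choices: int) -> float:
    """Probability of guessing correctly among the substantive options."""
    return 1.0 / num_choices


def build_question(correct: str, distractors: list[str], rng: random.Random) -> dict:
    """Shuffle the substantive options so the correct answer's label is random.

    ASSUMPTION: "Doesn't apply" is appended as a fixed last option and is
    excluded from the chance-level computation, as in the text above.
    """
    options = [correct] + list(distractors)
    rng.shuffle(options)
    labels = ["A", "B", "C", "D", "E"][: len(options) + 1]
    labeled = dict(zip(labels, options + ["Doesn't apply"]))
    answer = next(label for label, text in labeled.items() if text == correct)
    return {"options": labeled, "answer": answer, "chance": chance_level(len(options))}
```

With four substantive options (as in Verbal Classification) the chance level is 0.25; with three (Verbal Specification and Visual Extrapolation) it is one third.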

Verbal classification of transformation (“what”). We first evaluate if the model or human can detect what changed in a given transformation and classify it in the correct visual domain, such as size or number (see Figure ). We randomly sample incorrect multiple-choice options from other possible transformation domains. “No change” and “Doesn’t apply” are always included as options to accommodate alternative forms of reasoning that are not covered by the choices. If the model fails to identify basic changes, such as distinguishing a numerical change from a color change, it will be unable to predict how new objects change based on the given transformations. This is an inadequacy of existing visual analogical reasoning benchmarks ( , ), which focus solely on advanced predictions without ensuring fundamental change detection capabilities.

Verbal specification of transformation (“how”). If a subject correctly classifies the transformation, we ask them to further specify the transformation, also in multiple-choice form (see the green arrow in Figure ). This step is crucial because it ensures the subject can accurately specify the rule governing the transformation before extrapolating it to a new object. If they fail to identify the specific change, any attempt at extrapolation is more likely to be incorrect (see Figure in Appendix for evidence in models). By pinpointing where reasoning fails, we can better understand models’ and humans’ limitations and improve their analogical reasoning capabilities.

Visual Extrapolation of transformation. Finally, we proceed to the step captured by other benchmarks: presenting a new image and asking the model to extrapolate how it will change based on the previously identified transformation (see Figure and other extrapolation examples of other visual domains in Appendix ). We ask models to visually extrapolate independent of their performance in verbal change identification to account for the possibility that models may engage in visual analogical reasoning separately from verbal reasoning and can, therefore, perform well in visual tasks even if they struggle with the prior verbal descriptions. This approach helps us determine if a model’s visual reasoning can function independently of its verbal reasoning skills. It provides a more nuanced evaluation of its cognitive capabilities and identifies specific areas for improvement.
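
The control flow of the three stages can be sketched as below. This is only an illustration under stated assumptions: `TrialResult`, `run_trial`, and the `ask` callback are hypothetical names, and a real run would present images and prompts rather than stage names.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class TrialResult:
    classification_correct: bool
    specification_correct: Optional[bool]  # None when the stage is skipped
    extrapolation_correct: bool


def run_trial(ask: Callable[[str], str], answers: dict) -> TrialResult:
    """Run one trial; `ask` maps a stage name to the subject's chosen option."""
    cls_ok = ask("classification") == answers["classification"]
    # Verbal specification is only asked after a correct classification.
    spec_ok = None
    if cls_ok:
        spec_ok = ask("specification") == answers["specification"]
    # Visual extrapolation is always asked, regardless of the verbal stages.
    extra_ok = ask("extrapolation") == answers["extrapolation"]
    return TrialResult(cls_ok, spec_ok, extra_ok)
```

Recording the skipped stage as `None` rather than as incorrect keeps specification accuracy conditional on a correct classification, which matches the description above.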

3.2 A Dataset of Visual Analogies

We create a dataset of stimuli using everyday objects that better represent real-world visual data and better match the training data of computer vision models (and humans). We take 3D models of household objects from ( ) and objects commonly encountered by infants and children from ( ). To set up the dataset, we apply transformations from five basic visual domains: changing the size, color, and number of objects, and rotating and reflecting the objects along different axes (see Figure for the transformation domains examined). Our project page includes code allowing users to perform these transformations on any object image, enabling infinite expansion of the benchmark. Our five types of object transformations are crucial for object and scene recognition (e.g., ( )), scene segmentation (e.g., ( )), and detecting significant changes in the environment ( , ). Other visual properties, such as depth ( , ), spatial compositionality ( , ), and physical affordances ( , ) are also crucial for such purposes, but we prioritized these five domains for our benchmark in particular because young children can solve these visual analogies, as already shown in developmental psychology literature ( , ). Below, we outline the five visual transformation domains. There are 100 object transformations for each subdomain of transformation, totaling 1,400 object transformations in KiVA and 2,900 in KiVA-adults.
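
As a sanity check on these totals, the sketch below tallies one possible enumeration of the subdomains listed in the following paragraphs. The enumeration is an assumption, not the authors' specification: in particular, it counts stretching as four variants (taller, shorter, wider, narrower) and counts the 45-degree and 135-degree rotations in both directions. Under that reading the counts come to 14 and 29 subdomains, matching 1,400 and 2,900.

```python
# Subdomains named in the text for KiVA.
KIVA = {
    "color": ["red", "green", "blue"],
    "size": ["bigger", "smaller"],
    "number": ["+1", "+2", "-1", "-2"],
    "rotation": ["90 cw", "90 ccw", "180"],
    "reflection": ["x-axis", "y-axis"],
}

# ASSUMPTION: one plausible reading of the extra KiVA-adults subdomains.
KIVA_ADULTS_EXTRA = {
    "color": ["yellow", "grey"],
    "size": ["taller", "shorter", "wider", "narrower"],
    "number": ["x2", "x3", "/2", "/3"],
    "rotation": ["45 cw", "45 ccw", "135 cw", "135 ccw"],
    "reflection": ["both axes"],
}

PER_SUBDOMAIN = 100  # object transformations per subdomain, as stated above


def total(subdomains: dict) -> int:
    """Total transformations given a mapping of domain -> list of subdomains."""
    return PER_SUBDOMAIN * sum(len(v) for v in subdomains.values())


kiva_adults = {d: KIVA[d] + KIVA_ADULTS_EXTRA[d] for d in KIVA}
kiva_total = total(KIVA)
adults_total = total(kiva_adults)
```

The two totals sum to the 4,300 transformations reported for the full benchmark, although agreement in the counts does not confirm that this is the exact breakdown used.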

Color changes. Noticing color changes can signal alterations in an object’s state or presence, which is essential for tasks like identifying ripe fruit or detecting hazards ( , ). The general transformation rule for color is that input objects change to a single color ( , ), namely red, green and blue. KiVA-adults also includes yellow and grey.

Size changes. Size perception allows individuals to understand and interact with their environment accurately, guiding tasks like identifying objects, planning actions, navigating spaces, and avoiding obstacles ( , ). In KiVA, objects undergo transformations in two subdomains: they turn bigger or smaller (in both height and width) as in ( , ) by a factor of 2. KiVA-adults also includes object stretching (changing height or width independently by a factor of 2).

Number changes. Accurately monitoring and comparing quantities is essential in economics and science; it is also important in daily life activities like shopping, cooking, caching and rationing ( , ). Transformations in this domain reflect basic mathematical operations over the number of objects in an image. KiVA contains object addition $(+1, +2)$ and subtraction $(-1, -2)$, whereas KiVA-adults includes multiplication $(\times 2, \times 3)$ and division $(\div 2, \div 3)$ as well. We restrict the number of objects in an input or output image to under 8.

Rotation. Mental rotation is the ability to recognize and map different views of the same object ( , ). This is essential for object manipulation, spatial orientation and navigation ( , ). KiVA adapts from human psychometric studies (e.g., ( , )), featuring 2D rotation by $90$ degrees (clockwise or counterclockwise) or $180$ degrees. KiVA-adults also includes $45$-degree and $135$-degree rotations.

Reflection. Reflection aids in appreciating object symmetry and chirality, essential for distinguishing left and right shoes or gloves, etc. Chiral objects cannot be rotated or translated to align with their reflections, making them non-superimposable ( , ). Chiral objects are reflected along the x-axis or y-axis ( , ) in KiVA and along both in KiVA-adults.
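
The distinction between rotation and reflection can be illustrated on a toy binary grid. This is a simplified sketch, not the benchmark's image pipeline: it only considers rotations by multiples of 90 degrees, and the helper names are hypothetical.

```python
Grid = tuple[tuple[int, ...], ...]


def rotate90(g: Grid) -> Grid:
    """Rotate a grid 90 degrees clockwise."""
    return tuple(zip(*g[::-1]))


def reflect_y(g: Grid) -> Grid:
    """Mirror a grid left-to-right (reflection about the vertical axis)."""
    return tuple(row[::-1] for row in g)


def rotations(g: Grid) -> list:
    """All four rotations of a grid by multiples of 90 degrees."""
    out = []
    for _ in range(4):
        out.append(g)
        g = rotate90(g)
    return out


def is_chiral(g: Grid) -> bool:
    """True if no 90-degree-multiple rotation reproduces the mirror image."""
    return reflect_y(g) not in rotations(g)


L_SHAPE = ((1, 0), (1, 0), (1, 1))  # mirror image is a J shape
T_SHAPE = ((1, 1, 1), (0, 1, 0))    # left-right symmetric
```

Here `is_chiral(L_SHAPE)` is true because no rotation turns the L into its mirror image, while `is_chiral(T_SHAPE)` is false because the T equals its own reflection.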

4 Comparing Analogical Reasoning in LMMs and Humans

Evaluating Large Multimodal Models. We test several LMMs: 1) GPT-o1 (o1-2024-12-17), 2) GPT-4V (gpt-4-vision-preview) ( , ), 3) LLaVA-1.5 ( , ), an open-source model that integrates a vision encoder with a language model, specifically designed to enhance general-purpose visual and language understanding, and 4) MANTIS ( , ), which builds on modified architectures from notable models like LLaVA to support interleaved multi-image input. We combine the given transformation with the three choices of new object transformations at the extrapolation step into a single composite image for LLaVA-1.5 (limited to processing a single image), but present the given transformation and three choice transformations as four separate images to MANTIS and the GPT models as proposed in ( ) to reduce the chance of visual binding errors. For all models, the temperature is set to 1 and the maximum output length is set to 300 tokens (no cap for GPT-o1). We randomize each experiment over three seeds and run each trial (Figure ) on a model three times with test choices shuffled in order. We score correct choices as 1 and incorrect choices as 0. For each trial, we calculate the mean score across its three seeds. To evaluate the performance per transformation domain, we calculate the overall mean and standard error for the average scores of all trials. GPT-o1, GPT-4V, LLaVA-1.5, and MANTIS complete the entire KiVA and KiVA-adults benchmarks. Open-source models ran on a single A6000 48 GB GPU for under 12 hours.
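
The scoring procedure described above can be sketched as follows. The function names are hypothetical, and the use of the sample standard deviation (n - 1 in the denominator) for the standard error is an assumption, since the text does not specify the estimator.

```python
from math import sqrt
from statistics import mean, stdev


def trial_score(seed_scores: list) -> float:
    """Mean of the 0/1 scores a trial receives across its three seeds."""
    return mean(seed_scores)


def domain_summary(trials: list) -> tuple:
    """Mean and standard error of per-trial scores within one domain.

    `trials` is a list of per-trial score lists, e.g. [[1, 0, 1], [1, 1, 1]].
    ASSUMPTION: the standard error uses the sample standard deviation.
    """
    per_trial = [trial_score(scores) for scores in trials]
    m = mean(per_trial)
    se = stdev(per_trial) / sqrt(len(per_trial))
    return m, se
```

For example, `domain_summary([[1, 1, 1], [0, 0, 0], [1, 0, 1], [1, 1, 0]])` averages the per-trial scores 1, 0, 2/3 and 2/3 before computing the standard error across the four trials.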

Evaluating Humans. A corresponding visual analogies task, developed using JsPsych ( , ), was administered to two groups of human participants. All methods were approved by IRB (protocol 2020-10-13755) prior to testing both child and adult participants. We recruited 250 adults (21 to 40 years old) on Prolific ( , ) to complete the benchmark such that every trial was annotated by 3 to 13 adults. We recruited 42 children (aged 3 to 5 years, mean = 4.07 years, SE = 0.11 years) from early childhood centers and ChildrenHelpingScience ( , ) to complete a random subset of 10 trials (2 trials per transformation domain), totaling 420 responses. We evaluated an additional 10 children and 40 adults on KiVA-adults and found that none of the children performed better than chance. All participants completed a practice trial with an “unrelated” transformation (adding a dot to geometric shapes) and received feedback to ensure understanding. Participants who failed within three attempts were excluded. Those who succeeded proceeded to test trials without feedback, and were told that rewards depended on their performance. Adults were paid at least $12/hour with a bonus of $0.01 per correct response, while children received stickers based on their performance.

4.1 Results

Figure 4: Human and model performance in KiVA sorted by Transformation Domain and color coded by Question Type in samples annotated by children (top figure) and in the full benchmark annotated by adults (bottom figure). Error bars represent standard errors across object variations. Chance level is 25% for Verbal Classification and 33% for Verbal Specification and Visual Extrapolation.

Figure 5: Adult and model performance in KiVA-adults sorted by Transformation Domain and color coded by Question Type. Error bars and chance levels are as described in Figure .

Models get worse with increasing reasoning complexity from verbal description to visual extrapolation, unlike humans. Overall, LMMs can detect transformations and identify the general visual domain of the transformations (e.g., color vs. size), as indicated by the blue bars labeled “Verbal Classification” in Figure for KiVA and in Figure for KiVA-adults. In KiVA, GPT-o1, GPT-4V and MANTIS even outperform children in categorizing rotation and color changes. However, performance generally declines when the models are asked to further specify the transformation within the correctly identified visual domain (e.g., rotating 90 degrees or 180 degrees if spatial orientation is the correctly identified domain), as reflected by the orange bars labeled “Verbal Specification.” Performance for visual extrapolation declines even more, as illustrated by the green bars labeled “Visual Extrapolation.” In other words, models’ success in verbally describing transformations does not guarantee their success in extrapolation. Part of the models’ failure in analogical reasoning is an inability to correctly recognize the given transformation. Another part of the model’s failure lies in extrapolating the correctly identified transformation to a novel object and predicting the corresponding outcome. Even when given the correct verbal specification of the transformation, models still fail to solve extrapolation in different visual domains (Appendix ). By contrast, even young children tested in KiVA can verbally describe the transformations as reflected by their significantly-above-chance performance in verbal classification and verbal specification, and can then use their selected verbal descriptions to extrapolate the visual transformations to new objects. Adults show near-perfect performance from verbal classification to visual extrapolation in both KiVA and KiVA-adults.

Figure 6: Positive correlations between mean error scores of GPT-o1 and mean error scores of children (left) and mean response times of adults (right) in KiVA visual extrapolation.

Model performance depends on the visual domain and correlates with human performance. Overall, models are better at classifying and describing color and size transformations, which involve more discrete and local processing ( , ), than transformations in other domains (Figures and ). In KiVA-adults, the best-performing model, GPT-o1, nears adult performance only in the color domain (Figure ). Models are less able to specify what changed within the visual domains of rotation, reflection, and number, and consequently do not perform well in extrapolations for those domains. In contrast, children and adults generally show similar performance across visual domains, with children performing slightly worse on rotation compared to other domains. Children’s error scores (1 - accuracy) and adults’ response times correlate with GPT-o1’s error scores in the visual extrapolation of KiVA, as demonstrated in Figure . What is cognitively demanding to humans is also more computationally challenging for GPT-o1.

Models hallucinate where there is no change. For each type of transformation, we randomly sample 10% of the positive transformation trials and replace their transformations with ones that involve no change. Only GPT-o1 correctly selects "no change" in both classification and specification across all visual domains, though it struggles to extrapolate this to new objects when distractors involve reflection or number change (Figure ). GPT-4V accurately identifies "no change" in the verbal classification stage only in the size domain. That said, when it does classify a trial as having no change, it consistently specifies that no change is involved (as reflected by the tall orange bars). In contrast, LLaVA-1.5 and MANTIS "hallucinate" a change in 100% of the no-change trials during verbal classification; although they can visually extrapolate the absence of change to some new objects, they are no better than chance.
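The sampling step for no-change trials could be sketched as follows. This is only an illustration under assumptions: the trial representation (dictionaries with hypothetical `domain` and `transformation` keys), the function name, and the fixed seed are not taken from the benchmark code.

```python
import random


def reassign_no_change(trials, fraction=0.1, seed=0):
    """Return a copy of `trials` where a random `fraction` of the trials in
    each domain has its transformation replaced by "no_change"."""
    rng = random.Random(seed)

    # Group trial indices by visual domain so sampling happens per domain.
    indices_by_domain = {}
    for index, trial in enumerate(trials):
        indices_by_domain.setdefault(trial["domain"], []).append(index)

    chosen = set()
    for indices in indices_by_domain.values():
        sample_size = round(len(indices) * fraction)
        chosen.update(rng.sample(indices, sample_size))

    # Copy every trial so the input list is left untouched.
    return [
        dict(trial, transformation="no_change") if index in chosen else dict(trial)
        for index, trial in enumerate(trials)
    ]


# Hypothetical toy trials: 20 per domain, so 2 per domain become no-change.
toy_trials = [
    {"domain": domain, "transformation": "changed"}
    for domain in ("color", "size")
    for _ in range(20)
]
toy_with_no_change = reassign_no_change(toy_trials)
```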

Figure 7: Model performance on trials involving no change. Error bars and chance levels are as described in Figure .

Models are inconsistent within the same trials and across reasoning steps. We measured model choice inconsistency by quantifying how often a model selects different responses in identical repeated trials in KiVA (Figure ). Models are the most consistent in Verbal Classification and least consistent in Visual Extrapolation, particularly when reasoning about number, rotation and reflection. GPT-o1 and GPT-4V, but not LLaVA-1.5 and MANTIS, show higher extrapolation performance when they can verbally identify the transformation (Appendix ). When models are given the correct verbal specification in their weaker domains (number, rotation and reflection), they still fail visual extrapolation (Appendix ). This underscores a key limitation in the visual analogical reasoning of LMMs: knowing the correct rule does not reliably translate to extending that rule to a new context.
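One way to quantify choice inconsistency across repeated trials is sketched below. It assumes that a set of repeated presentations counts as inconsistent whenever the responses are not all identical; the exact scoring used for the heat map may differ, and the function name is hypothetical.

```python
def inconsistency_rate(repeated_responses):
    """Fraction of trials whose repeated responses are not all identical.

    `repeated_responses` is a list in which each element holds the choices a
    model made on identical repetitions of one trial, e.g. ["A", "A", "B"].
    """
    if not repeated_responses:
        raise ValueError("need at least one trial")
    inconsistent = sum(
        1 for responses in repeated_responses if len(set(responses)) > 1
    )
    return inconsistent / len(repeated_responses)


# Toy example: the second trial received two different choices.
toy_rate = inconsistency_rate([["A", "A", "A"], ["A", "B", "A"]])
```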

Figure 8: Proportion of model consistent responses within repeated trials. Each model was evaluated on how many times out of the three repeated trials they did not select the same choice. The heat map shows choice inconsistency broken down by model, visual domain, and question type.

Verbal questions facilitate visual extrapolation in humans but the effects are less clear in GPT-o1. We included verbal questions to reveal the step in the reasoning process where models might fail when making visual analogies. To assess the effects of verbal questions on subsequent visual extrapolation, we tested another 200 adults, 20 children and the best-performing model, GPT-o1, on a visual-extrapolation-only version of KiVA, removing the verbal questions to replicate previous visual analogy benchmarks. We focused on the three more challenging visual domains of KiVA: number, rotation and reflection. Without verbal questions, adults demonstrated similar accuracy but significantly slower response times, whereas children performed worse in extrapolation. The effects are less clear for GPT-o1: when asked beforehand to reason about what changed and how it changed, it performed equally well in the number domain, better at extrapolating object rotations, but worse at extrapolating reflections (Figure ). While our verbal questions facilitate humans’ visual extrapolation performance, it is possible that reasoning models like GPT-o1 already reason about “what changed” and “how it changed” independently of our verbal queries. Future work should further explore the effects of guiding questions and chain-of-thought on reasoning models. At the same time, it may also be possible to solve KiVA without language, as in the case of a Large Vision Model ( , ) that is trained in the complete absence of linguistic data (see Section ).

Figure 9: Adults’ Mean Response Times, Children’s and GPT-o1’s Mean Accuracy in Visual Extrapolation with and without the three-step query.

In-context learning and prompt engineering did not improve model performance. We explore whether model performance improves through careful prompt engineering (Appendix ), which has shown promising results on various tasks ( , ). We consider four different prompt engineering methods: 1) Reasoning through code ( , ): We first prompt the model to generate code snippets describing each transformation in the task, then rephrase the task question to incorporate the generated code. 2) Reasoning after Reflection ( , ): We ask the model to reflect on its answers two times for each question in the task. 3) Reasoning through instruction: inspired by ( ) , which shows that chain-of-thought reasoning is more effective on several benchmarks, we prompt the model to generate step-by-step instructions on how to answer each question, then use the instructions to generate an answer. 4) In-Context Learning ( , ): We give the model two randomly sampled examples with solutions for each concept before displaying the task. Apart from text prompt engineering, we experiment with different visual prompting for LLaVA-1.5. Recent works ( , ) show that visual model performance is sensitive to the alterations in color and size of the visual input. We apply two visual prompting approaches: 1) Color: we alter the image background color (initially transparent) into black and white ( , ). 2) Size: we apply a center crop to the images, varying the image size between 0.9 and 1. None of these approaches improve performance, which points to the challenging nature of our benchmark.
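The size-based visual prompting can be illustrated by computing a center-crop box for a given scale between 0.9 and 1. This is a minimal sketch with a hypothetical function name and rounding choice; the box is returned in (left, upper, right, lower) order so that it could be handed to an image library's crop routine.

```python
def center_crop_box(width, height, scale):
    """Return the (left, upper, right, lower) box of a centered crop that
    keeps `scale` of the width and height of a `width` x `height` image."""
    if not 0 < scale <= 1:
        raise ValueError("scale must be in (0, 1]")
    new_width = round(width * scale)
    new_height = round(height * scale)
    left = (width - new_width) // 2
    upper = (height - new_height) // 2
    return (left, upper, left + new_width, upper + new_height)


# A 100 x 100 image cropped to 90% of its size keeps the central 90 x 90 area.
toy_box = center_crop_box(100, 100, 0.9)
```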

4.2 Discussion

Despite extensive training on image and text data, GPT-o1, GPT-4V, LLaVA-1.5, and MANTIS still cannot reason about spatial and numerical visual analogies like young children can. Although GPT-o1 outperforms the other models, it falls short of child performance in the reflection and number domains in KiVA and remains far from adult performance, except in the color domain of KiVA-adults. Moreover, unlike human performance, model performance declines markedly from verbal description to visual extrapolation: even correct recognition of a transformation does not guarantee successful extrapolation to a new object. This points to a fundamental challenge: mapping a transformation from a source object to a target while preserving relational structure ( , ). Future research should explore how vision and language each contribute to visual analogical reasoning.

Human perception of feature-level changes like color or size is relatively straightforward, whereas appreciating reflection, rotation, and numerical changes requires active engagement, sequential tracking, and mental manipulation. Our findings align with prior studies showing that LMMs struggle with spatial reasoning ( , ) and counting ( , ). For humans, size and color changes are processed earlier in the visual pathway and in development ( , ), while spatial and numerical changes are more cognitively demanding. Although this convergence in performance between LMMs and human children does not imply that they are built or function identically, it is intriguing that similar trends emerge from such fundamentally different systems.

KiVA is designed to assess visual change detection and analogical reasoning—the kinds of skills that children as young as three demonstrate. Our results show that LMMs underperform compared to humans, even with in-context learning and prompting, and future improvements may require approaches such as symbolic visual vocabularies and Bayesian inference ( , ).

5 Conclusion

Overall, large multimodal models remain less capable than humans at visual analogical reasoning. They can classify changes in images, but their ability to specify and extrapolate these changes to novel objects diminishes sharply. GPT‑o1 performs best—especially for color and size, which are surface features—but struggles with spatial and numerical analogies that likely require a deeper understanding of the 3D world. In contrast, humans excel at interpreting diverse object relations and transformations ( , ).

As models improve, our extended benchmarks (KiVA‑adults and KiVA‑compositionality) will probe more advanced analogical reasoning. So far, GPT‑o1 only reaches adult-level reasoning in the color domain, highlighting the need for further research into the complexities of visual cognition.

Acknowledgments

We are grateful to Joe Heyward, Viorica Patraucean, Mariel Goddu, Jefferson Ortega and the participants of the AI, Psychology, and Neuroscience workshop at the Simons Institute for discussion, to participants and their parents, local early childhood centers, and UC Berkeley undergrads who assisted in human data collection: Alexis Davis, Janna Umagat, Kate Choi, Kaydee Manikhong, Linda Marie Trevino, Nitya Sriram, Nora Chen, Ray Huang, Shivalika Jhabua and Weiyin Gao. This project was supported by Meta-BAIR Commons, CIFAR Catalyst Award: Causally guided exploration in children & AI, and ONR MURI Self-Learning Perception Through Real-World Interaction.

References

Appendix A Visual Analogical Reasoning Prompts

A.1 Stitched visual extrapolation examples for each domain

Visual Extrapolation. As the final step of the querying process, we present an image of a new object and ask the model to predict what the object will look like if it goes through the same change as the given transformation.

Figure 10: Examples of visual extrapolation trials for different transformations. (a) Example of a visual extrapolation trial involving a reflection. (b) Example of a visual extrapolation trial involving an angular rotation. (c) Example of a visual extrapolation trial involving a size change. (d) Example of a visual extrapolation trial involving a number change. (e) Example of a visual extrapolation trial involving a color change.

A.2 Prompting of Models and Human Adults

We first include a system prompt to orient the models for visual analogical reasoning:

“You are an excellent visual puzzle solver! You will be given a visual puzzle that requires using visual analogical reasoning.”

For models, we include a chain-of-thought prompt:

“You will think "step-by-step" and carefully examine the visual evidence before providing an answer.”

For human adults, we additionally include the following prompt to motivate their participation:

“At the end of the experiment, you will see the total number of correct answers you provided. Each correct answer will convert to $0.01 additional compensation for your study participation.”

Then we provide an initial instruction prompt:

“You are given a visual puzzle. The puzzle features a left-to-right transformation of an object on top and three left-to-right transformations of a different object on the bottom marked by (A) or (B) or (C). The transformations involve a change of either the size, orientation, number, or color of an object.”

1. Verbal Classification (“what”).

“Which one of the following rules best describes the left-to-right transformation on top of the puzzle where the picture on the left transforms to the picture on the right? Answer with the correct rule number surrounded by parentheses, then provide a "step-by-step" reasoning for your choice."

2. Verbal Specification (“how”).

“Which one of the following rules best describes the left-to-right transformation in the top of the puzzle where the picture on the left transforms to the picture on the right?. Answer with the correct rule number surrounded by parentheses. Then provide a "step-by-step" reasoning for your choice."

3. Visual Extrapolation.

“Which one of the three left-to-right object transformations (marked by either (A), (B) or (C)) on the bottom of the puzzle is the same as the left-to-right transformation on the top of the puzzle? Answer with the correct letter surrounded by parentheses (or (D) if none of the options apply), then provide a "step-by-step" reasoning for your choice."
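Because the prompts above ask for the answer to be surrounded by parentheses, a response can be scored by extracting the first parenthesized token that is a valid option. The following is a minimal sketch of such a parser; the function name and the handling of unparseable responses are assumptions, and the benchmark's own answer extraction may differ.

```python
import re


def parse_choice(response, valid=("A", "B", "C", "D")):
    """Return the first parenthesized token in `response` that is a valid
    option (case-insensitive), or None if no valid option is found."""
    for match in re.finditer(r"\(([^()\s]+)\)", response):
        token = match.group(1).upper()
        if token in valid:
            return token
    return None


# Letter options for visual extrapolation.
toy_letter = parse_choice("(B) The object on top is rotated by 90 degrees.")
# Numbered rules for the verbal questions use a different set of options.
toy_rule = parse_choice("(2) The number of objects increases.", valid=("1", "2", "3"))
```

Skipping parenthesized tokens that are not valid options avoids misreading asides such as "(rotation)" as an answer.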

A.3 Prompting Human Children

All verbal instructions are read out loud to children by a human experimenter. We first provide a context to motivate children’s participation in the experiment:

“You are on a mission as a picture detective. You will see how different pictures change. Your job as a picture detective is to figure out how the pictures change, and to guess how a new picture would change based on that. These pictures can change in size, where they face, number, or color. Every time you answer correctly, you will get a coin. You won’t find out how many coins you get until the end of the game. At the end of the game, you will see the total number of coins you win. The more coins you get, the more stickers you win.”

1. Verbal Classification (“what”).

“Here are two pictures separated by a black line in the middle. The picture on the left turns into the picture on the right. Do you think there is a change? What do you think the change is?"

2. Verbal Specification (“how”).

“Can you say more about the change from the left to the right?"

3. Visual Extrapolation.

“Here is another picture that goes through the same change from the left to right. Can you find the box that shows the same change?"

Note that the prompt used for children did not improve model or human adult performance.

A.4 Prompting models through reflection and self-critique

1. Verbal Classification (“what”).

“Which one of the following rules best describes the left-to-right transformation on top of the puzzle where the picture on the left transforms to the picture on the right? Answer with the correct rule number surrounded by parentheses, then provide a "step-by-step" reasoning for your choice. Please reflect on your answer and provide a revised response if necessary."

(repeat three times following model output) Start your response with your updated answer.

2. Verbal Specification (“how”).

“Which one of the following rules best describes the left-to-right transformation in the top of the puzzle where the picture on the left transforms to the picture on the right?. Answer with the correct rule number surrounded by parentheses, then provide a "step-by-step" reasoning for your choice. Please reflect on your answer and provide a revised response if necessary."

(repeat three times following model output) Start your response with your updated answer.

3. Visual Extrapolation.

“Which one of three left-to-right object transformations (marked by either (A), (B) or (C)) on the bottom of the puzzle is the same as the left-to-right transformation on the top of the puzzle? Answer with the correct letter surrounded by parentheses (or (D) if none of the options apply), then provide a "step-by-step" reasoning for your choice. Please reflect on your answer and provide a revised response if necessary."

(repeat three times following model output) Start your response with your updated answer.
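The reflection procedure can be sketched as a short conversation loop. This is an illustration only: `ask` is a hypothetical callable standing in for whatever model interface is used, the message format is an assumption, and the number of reflection rounds is left as a parameter.

```python
def reflect_and_revise(ask, question, n_rounds=3):
    """Ask `question`, then request `n_rounds` reflections on the answer.

    `ask` is a hypothetical callable that takes the conversation so far, as a
    list of (role, text) pairs, and returns the model's next reply as a string.
    Returns the final answer and the full conversation history.
    """
    follow_up = "Start your response with your updated answer."
    history = [("user", question)]
    answer = ask(history)
    history.append(("assistant", answer))
    for _ in range(n_rounds):
        history.append(("user", follow_up))
        answer = ask(history)
        history.append(("assistant", answer))
    return answer, history


def toy_ask(history):
    """Stand-in model that reports how many user turns it has seen."""
    user_turns = sum(1 for role, _ in history if role == "user")
    return f"(A) reply after {user_turns} user turn(s)"


toy_answer, toy_history = reflect_and_revise(toy_ask, "Which rule applies?")
```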

A.5 Prompting models through instructions

1. Verbal Classification (“what”).

“Which one of the following rules best describes the left-to-right transformation on top of the puzzle where the picture on the left transforms to the picture on the right? Answer with the correct rule number surrounded by parentheses, then provide a “step-by-step” reasoning for your choice."

2. Verbal Specification (“how”).

“Provide brief instructions on how to establish if a transformation involves an object rotates 90 degrees or 180 degrees. Use the instructions from before to answer the following question: Which one of the following rules best describes the transformation in the top of the puzzle where the picture on the left transforms to the picture on the right?. Answer with the correct rule number surrounded by parentheses, then provide a “step-by-step” reasoning for your choice."

3. Visual Extrapolation.

“Provide brief instructions on how to determine which one of three left-to-right object transformations (marked by either (A), (B) or (C) ) on the bottom of the puzzle is the same as the left-to-right transformation on the top of the puzzle? Use the instructions from before to determine which one of three left-to-right object transformations (marked by either (A), (B) or (C) ) on the bottom of the puzzle is the same as the left-to-right transformation on the top of the puzzle? Answer with the correct letter surrounded by parentheses (or (D) if none of the options apply), then provide a step-by-step reasoning for your choice."

A.6 Prompting models through code

1. Verbal Classification (“what”).

“Which one of the following rules best describes the left-to-right transformation on top of the puzzle where the picture on the left transforms to the picture on the right? Answer with the correct rule number surrounded by parentheses, then provide a "step-by-step" reasoning for your choice."

2. Verbal Specification (“how”).

“Generate python code using the package pillow that takes in the left image in the left-to-right transformation on top and outputs the right image. Denote this snippet as training snippet using the insights from the training code snippet, which one of the following rules best describes the left-to-right transformation in the top of the puzzle where the picture on the left transforms to the picture on the right?. Answer with the correct rule number surrounded by parentheses, then provide a "step-by-step" reasoning for your choice."

3. Visual Extrapolation.

“Generate a brief code snippet using python and the pillow package for each left-to-right transformation in the bottom. Each snippet takes in the left picture of the transformation and outputs the right one. Now Which one of three code snippets is the same as the training code snippet you have produced before. Answer with the correct snippet letter ((A) or (B) or (C)) surrounded by parentheses (or (D) if none of the options apply), then provide a "step-by-step" reasoning for your choice."

Appendix B Additional model analyses

B.1 Effects of Multi-image versus Single-image presentation on GPT-o1’s Visual Extrapolation

We evaluate whether GPT-o1 benefits from multi-image presentation, in which the given transformation and the three test transformation options are provided to the model as four separate images, as opposed to combining everything into a single image, as described in ( , ). GPT-o1 shows significantly better performance in visual extrapolation of color, size and number, but not for rotation and reflection (Figure ), suggesting that the challenge in the latter two domains goes beyond the visual binding problem described in ( ) .

Figure 11: Visual Extrapolation performance of GPT-o1 under Multi- versus Single-image presentations.

B.2 Models’ extrapolation performance based on previous verbal reasoning

Furthermore, we report models’ extrapolation performance conditional on succeeding (green) or failing (red) at the previous steps of verbal reasoning in Figure . GPT-o1 exhibits significantly higher visual extrapolation accuracy when its preceding verbal reasoning is correct across all transformation domains, whereas GPT-4V shows this benefit only in the color and size domains. In other words, when models solve KiVA above chance level, successful visual extrapolation is contingent on solving verbal classification or specification correctly. Meanwhile, in LLaVA-1.5 and MANTIS, visual extrapolation shows no dependence on prior verbal reasoning, and both models perform no better than chance level on KiVA.
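The conditional analysis above amounts to splitting extrapolation trials by whether the preceding verbal step was correct and computing accuracy within each group. A minimal sketch follows, with a hypothetical function name and input format of (verbal_correct, extrapolation_correct) pairs.

```python
def conditional_accuracy(trials):
    """Extrapolation accuracy split by correctness of the preceding verbal step.

    `trials` is an iterable of (verbal_correct, extrapolation_correct) pairs of
    booleans. A group with no trials is reported as None.
    """
    outcomes = {True: [], False: []}
    for verbal_correct, extrapolation_correct in trials:
        outcomes[bool(verbal_correct)].append(bool(extrapolation_correct))

    def mean_or_none(values):
        return sum(values) / len(values) if values else None

    return {
        "after_correct": mean_or_none(outcomes[True]),
        "after_incorrect": mean_or_none(outcomes[False]),
    }


# Toy example, not data from the paper.
toy_conditional = conditional_accuracy(
    [(True, True), (True, False), (False, False), (False, False)]
)
```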

Figure 12: Visual Extrapolation performance of models following Correct and Incorrect verbal classification / specification, sorted by transformation domain. Standard errors are in parentheses. (Note that verbal specification is only asked if verbal classification is correct.)

B.3 Models’ performance when given correct previous verbal reasoning step

10% of transformation trials were randomly sampled to evaluate if model performance across the three weaker performing visual domains (number, rotation and reflection) would improve when given the correct answer to the previous reasoning step. In one experiment, we provided the correct verbal classification answer and evaluated models’ verbal specification (Figure ). In another experiment, we provided the correct verbal specification answer and evaluated models’ visual extrapolation (Figure ). Overall, having the ground truth for the preceding verbal reasoning step did not guarantee much success in the subsequent verbal specification or visual extrapolation tasks.

Figure 13: Subsequent performance of models when given correct verbal details in KiVA, sorted by transformation domain. (a) Subsequent Verbal Specification performance when given Correct Verbal Classification. (b) Subsequent Visual Extrapolation performance when given Correct Verbal Specification.

B.4 A Large Vision Model’s Visual Extrapolation Performance on KiVA

We further examined whether a large vision model ( , ), trained in the absence of any linguistic data, can solve KiVA. Since the large vision model does not contain text descriptions, we stitch object transformations by adopting the framework described in Section 5.3 of ( ) and prompt the model to generate the missing part in the bottom right corner (see Figure for an example of the image prompt). The prediction with the lowest perplexity is taken as the model’s answer. Even in the absence of any language to reason about what changed, how it changed, and how to extend the change to a new object, the large vision model can solve some visual analogies (Figure ). Interestingly, like large multimodal models, the large vision model is more capable of reasoning analogically in terms of color and size than in number and space.
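Selecting the lowest-perplexity prediction can be sketched as follows. The sketch assumes each candidate completion comes with per-token natural-log probabilities and uses the standard definition of perplexity as the exponential of the mean negative log-likelihood; how the large vision model scores its visual tokens is not described here, so the input format and function names are hypothetical.

```python
from math import exp


def perplexity(token_log_probs):
    """Perplexity of a sequence given its per-token natural-log probabilities."""
    if not token_log_probs:
        raise ValueError("need at least one token")
    return exp(-sum(token_log_probs) / len(token_log_probs))


def pick_lowest_perplexity(candidates):
    """Return the label of the candidate whose tokens have the lowest perplexity.

    `candidates` maps an option label to the list of log probabilities the
    model assigned to that option's tokens.
    """
    return min(candidates, key=lambda label: perplexity(candidates[label]))


# Toy log probabilities, not model outputs.
toy_pick = pick_lowest_perplexity(
    {"A": [-2.0, -2.0], "B": [-0.5, -0.5], "C": [-1.0, -3.0]}
)
```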

Figure 14: Testing Large Vision Model on KiVA. (a) Example of a KiVA trial input for the Large Vision Model. (b) Visual Extrapolation Performance of the Large Vision Model across Transformation Domains.

Future work may examine whether longer visual prompts with more training examples (in-context learning) or further instruction tuning improve the performance of the large vision model.