X-VoE: Measuring eXplanatory Violation of Expectation in Physical Events

Bo Dai 1,2, Linge Wang 3, Baoxiong Jia 2, Zeyu Zhang 2, Song-Chun Zhu 1,2,3, Chi Zhang 2,✉, Yixin Zhu 4,✉

1 School of Intelligence Science and Technology, Peking University

2 Beijing Institute for General Artificial Intelligence

3 Department of Automation, Tsinghua University

4 Institute for Artificial Intelligence, Peking University

✉ zhangchi@bigai.ai, yixin.zhu@pku.edu.cn

https://github.com/daibopku/X-VoE

Intuitive physics is pivotal for human understanding of the physical world, enabling prediction and interpretation of events even in infancy. Nonetheless, replicating this level of intuitive physics in artificial intelligence (AI) remains a formidable challenge. This study introduces X-VoE, a comprehensive benchmark dataset, to assess AI agents’ grasp of intuitive physics. Built on the developmental psychology-rooted Violation of Expectation (VoE) paradigm, X-VoE establishes a higher bar for the explanatory capacities of intuitive physics models. Each scenario within X-VoE encompasses three distinct settings, probing models’ comprehension of events and their underlying explanations. Beyond model evaluation, we present an explanation-based learning system that captures physics dynamics and infers occluded object states solely from visual sequences, without explicit occlusion labels. Experimental outcomes highlight our model’s alignment with human commonsense when tested against X-VoE. A remarkable feature is our model’s ability to visually explain events by reconstructing concealed scenes. We conclude by discussing the implications of the findings and outlining future research directions. Through X-VoE, we aim to catalyze the advancement of AI endowed with human-like intuitive physics capabilities.

Abbreviations: VoE, Violation of Expectation; XPL, eXplanation-based Physics Learner.

Figure 1: Evaluation settings in the ball blocking exemplar scenario of X-VoE. The explanation video illustrates potential hidden dynamics. Circles denote no surprise, and exclamation marks indicate surprise. In the predictive setup (S1), a solvable pair is presented without requiring explanation: predicting observed entities’ dynamics suffices to reason about the outcome. In the hypothetical setup (S2), perceiving the direction of outgoing balls might lead to surprise, yet alternate explanations exist, e.g., a hidden blocker behind the wall causing the ball to rebound. However, a random agent’s scores show negligible disparity, necessitating the explicative setup (S3) to discern surprises, demanding explanatory ability absent in predictive-only or random agents.

1 Introduction

Humans possess a profound understanding of the physical world, enabling them to predict the outcomes of physical interactions and events []. From infancy, humans demonstrate intuitive physics, comprehending actions and consequences even in unfamiliar scenarios. For the machine learning community, the challenge lies in emulating this level of intuitive physics understanding. This study introduces X-VoE, a comprehensive benchmark dataset designed to assess and push the limits of AI agents’ intuitive physics comprehension.

The notion of intuitive physics, observed even in young infants, has been foundational in cognitive science and developmental psychology []. Infants show surprise when physical events violate their expectations, indicating an understanding of fundamental physical principles []. Explanation-based learning has been proposed as a mechanism contributing to the development and refinement of intuitive physics understanding []. However, recent advances in this field have primarily resulted in predictive models, lacking the explanatory capacity and falling short of capturing even infant-level intuitive physics comprehension [].

Central to our work is the Violation of Expectation (VoE) paradigm, widely employed in psychological studies to evaluate infants’ intuitive physics understanding []. In this paradigm, participants are shown events that either follow or violate intuitive physics laws and exhibit surprise, indicated by prolonged attention, at the violating events. Inspired by the effectiveness of this paradigm, we adopt it to evaluate AI agents’ intuitive physics comprehension. In each trial, models encounter experiments adhering to or contravening intuitive physics laws. Models succeed in the VoE test if they display high surprise scores for physics-violating experiments and lower scores for compliant ones.

Existing works within the machine learning and computer vision community have embraced the VoE paradigm []. However, most of these efforts focus on predictive abilities and disregard the explanatory component []. This perspective neglects a fundamental aspect of VoE: the act of explaining observed events. In psychological studies, human participants express surprise not at the moment a physics-violating event occurs, but upon learning of its outcome. This observation underscores the significance of explanation within VoE.

Motivated by these insights, we introduce X-VoE, an intuitive physics evaluation dataset designed specifically to incorporate explanation within VoE. Distinct from previous efforts that concentrated on predictive scenarios, our dataset encompasses setups that require explaining observed events in diverse situations. We establish three settings for each of the four scenarios: ball collision, blocking, object permanence, and continuity (see Figure 2). Each scenario features predictive, hypothetical, and explicative setups. Notably, the three setups within the ball-blocking scenario distinguish explanatory agents from predictive and random ones.

Furthermore, we propose the eXplanation-based Physics Learner (XPL) model to emulate the explanation-based process, inspired by findings in human studies []. While XPL is adaptable to diverse deep architectures, we specifically build it upon PLATO [] due to its robust performance. Our model incorporates three self-supervised modules: perception for image encoding, Transformer reasoning for occluded object prediction, and dynamic reasoning for simulating physical dynamics. Importantly, our model introduces a reasoning sub-component to update representations of occluded objects, akin to infants’ explanation-based learning when confronted with unexpected outcomes [].

In summary, our work makes three significant contributions:

• Introduction of X-VoE, a comprehensive intuitive physics evaluation dataset that challenges AI agents not only in predictive capabilities but also in their capacity to explain. The dataset covers four distinct scenarios, each with predictive, hypothetical, and explicative setups. This allows for a more comprehensive assessment of intuitive physics understanding within VoE.

• Proposition of the XPL model, enhancing existing approaches with an explanatory module that improves evaluation. Our model comprises three modules (perception, reasoning, and dynamics learning) for holistic comprehension and simulation of physical dynamics.

• Experimental demonstration of XPL’s enhanced performance in alignment with human commonsense compared to other baselines in X-VoE. Additionally, XPL offers insights into hidden factors, as depicted in .

Figure 2: Testing scenarios in X-VoE: ball collision, blocking, object permanence, and object continuity. Within each scenario, frames in a testing video are linked by the same setup identification number (e.g., S1). Black links denote non-surprising videos, while red links indicate surprising ones. Notably, certain videos require explanation to become non-surprising. For example, in the right S2 branch of the object permanence scenario, three cubes on the floor become non-surprising due to preceding observation of two cubes dropping, suggesting a hidden cube behind the wall.

2 Related work

Intuitive physics

Intuitive physics forms a cornerstone of human cognition, enabling rapid and accurate predictions about moving object trajectories []. To evaluate machine understanding in this realm, benchmark datasets have emerged, often focusing on predicting future states [] or inferring object properties []. These methods predominantly gauge model performance by comparing generated predictions to ground truth.

More recently, the Violation of Expectation (VoE) paradigm has garnered attention within the machine learning and computer vision community []. Rooted in developmental psychology, the VoE paradigm quantifies model surprise when presented with events that challenge intuitive physics laws. This perspective provides an alternative angle for assessing intuitive physics understanding. Notably, the IntPhys dataset [] pioneered this VoE-based benchmarking approach. ADEPT [] introduced a model combining re-rendering and object tracking. PLATO [] decomposed the learning process into perception and dynamics prediction. Differing from conventional intuitive physics learning, the VoE paradigm does not rely on absolute ground truth. Instead, it hinges on relative measures of surprise, akin to developmental studies that assume higher responses indicate increased surprise. This emphasizes the role of explanation in VoE, as demonstrated in . In contrast to prior works that often neglected this vital component, our X-VoE includes scenarios that demand both traditional prediction-based understanding and explanation-based comprehension. Additionally, we propose an explanation-enhanced physics learner, XPL, which achieves improved performance and interpretability by incorporating explanations.

Video prediction

The challenge of comprehending videos and making plausible predictions of future states from current observations has been a longstanding problem within computer vision [], closely connected to the VoE paradigm. Solving VoE problems frequently involves predicting future frames for inference and evaluation. However, this prediction task is intricate due to the inherent complexity of modeling real-world dynamics and conditional image synthesis []. Within the computer vision community, various architectures have been explored to address these challenges and enhance the quality of generated images []. The task is further complicated by the need to model relationships between frames, leading to approaches that integrate spatial transformations over time []. Disentanglement of motion and content has also been pursued []. More recent efforts involve learning physics-based dynamics from videos and reasoning about unknown factors []. Within X-VoE, we assess the performance of these video prediction models as baseline methods.

Object-centric dynamics

The “vision-as-inverse-graphics” framework and the versatility of physics simulation have led to models based on physics simulation, which offer notable advantages in terms of accuracy and generality []. However, these models are often heavily reliant on specific physics engines, limiting their flexibility. In response, recent works have leveraged graph neural networks and object-centric representations to mitigate this dependence []. By abstracting irrelevant signals and focusing on objects, these models establish a tighter mapping between visual inputs and physics engines. Further, some models can directly simulate real physics engines []. These object-centric dynamics models have demonstrated the ability to capture intricate dynamics. Our approach in X-VoE aligns with this framework, using object-centric representations for downstream computation and reasoning.

3 Generating X-VoE

Our X-VoE dataset encompasses four distinct scenarios, covering ball collision, ball blocking, object permanence, and object continuity. To evaluate various intuitive physics principles, each scenario, except object permanence, comprises three distinct settings: predictive, hypothetical, and explicative, as illustrated in Figure 2. Within each setting, we create 1,000 procedurally generated scene pairs using Unreal Engine 4. Importantly, X-VoE primarily serves as a test suite for evaluating intuitive physics understanding, with no constraints on model training data.
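The size of the test suite follows from this description. The short Python sketch below tallies it under one assumption that is ours rather than a number stated in the text: object permanence contributes only the predictive and hypothetical settings, while the other three scenarios contribute all three. The dictionary layout and names are illustrative and do not reflect the dataset's actual file structure.

```python
# Illustrative tally of X-VoE test pairs; names and layout are hypothetical.
SETTINGS_PER_SCENARIO = {
    "collision": ["predictive", "hypothetical", "explicative"],
    "blocking": ["predictive", "hypothetical", "explicative"],
    # Assumption: object permanence has no explicative setting.
    "permanence": ["predictive", "hypothetical"],
    "continuity": ["predictive", "hypothetical", "explicative"],
}
PAIRS_PER_SETTING = 1000  # stated in the text


def count_test_pairs(settings, pairs_per_setting):
    """Total number of scene pairs across all scenario/setting combinations."""
    n_settings = sum(len(names) for names in settings.values())
    return n_settings * pairs_per_setting


total_pairs = count_test_pairs(SETTINGS_PER_SCENARIO, PAIRS_PER_SETTING)
print(total_pairs)  # 11 settings x 1,000 pairs = 11000
```

Under this assumption the suite holds 11,000 scene pairs; if object permanence also had an explicative setting, the same function would return 12,000.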

3.1 Testing data

We generate testing videos that span four key aspects of object dynamics: ball collision, ball blocking, object permanence, and object continuity. Refer to Figure 2 for a visual overview.

Collision

In this scenario, a ball traverses the scene, while an occlusion wall is positioned centrally. In the predictive setting (S1), we design a scenario where a ball of differing color but identical mass stands behind a wall. The incoming ball collides with this hidden ball, resulting in the incoming ball coming to a halt and the concealed ball continuing its trajectory. To introduce VoE effects, we enable the incoming ball to pass through the hidden ball. In the hypothetical setting (S2), we create a scene featuring a central wall concealing objects behind it. An incoming ball enters the scene from the left and rolls behind the wall. In some cases, an additional ball appears to pass through the wall, while in others, the incoming ball does so. This distinction hinges on whether an unseen ball is situated behind the wall. The explicative setting (S3) closely mirrors the hypothetical setting, but we lift the wall to reveal the concealed scene’s contents.

Figure 3: Overview of the XPL model for explanation-based physics learning. The model comprises three key modules: (i) the perception module, responsible for extracting object-centric representation from RGBD videos and segmentation masks; (ii) the reasoning module, utilizing two Transformer networks to infer representations of occluded objects; (iii) the dynamics module, which acquires intuitive physical knowledge and refines reasoning outcomes to align with intuitive physics. Additionally, the inferred object representation can be visualized using the decoder from the perception module, offering a visual explanation of events occurring behind the wall. Wavy curves indicate masking. Refer to the text for comprehensive details.

Blocking

The blocking scenario is conceptually similar to the collision scenario, substituting the hidden ball with a stationary cube. The incoming ball rebounds upon colliding with the cube.

Object permanence

Drawing inspiration from developmental psychology literature, we recreate a scenario involving cubes falling to the ground and becoming occluded by a wall. In the predictive setting (S1), we devise a case where a wall descends to an initially vacant ground, followed by three cubes falling behind the wall. To elicit VoE effects, we raise the wall, revealing fewer than three objects. In the hypothetical setting (S2), the scenario begins with a wall positioned centrally, obscuring objects behind it. Three or two cubes fall behind the wall. When the wall is lifted, the scene consistently features three cubes, even when only two cubes initially fell. This reflects the possibility of one cube being hidden behind the wall from the outset.

Object continuity

Motivated by psychology studies [], we introduce a wall with a lower-half window. This setup allows a ball to traverse the scene from one side to the other. The ball becomes occluded when behind the wall, emerges through the window, disappears, and subsequently reappears from the opposite end. The three distinct settings mirror the collision and blocking scenarios. Plausible and implausible scenes differ in whether the ball remains visible upon passing through the window. In the predictive setting (S1), all relevant information is presented at the video’s outset and conclusion, ruling out hidden objects. In the hypothetical setting (S2), information is deliberately withheld at the video’s start and finish, so a model must, like infants [], explain the observation by positing the existence of two balls. In the explicative setting (S3), the wall is lifted, verifying the absence of an additional ball behind the wall.

3.2 Training data

Though we do not impose constraints on the training data, for this study we generate data adhering to the same structure as the test scenarios but without VoE effects. As shown in Figure 4, the training set consists of 100,000 procedurally generated scenes, closely mirroring the scale used for training PLATO []. During training, we exclusively present videos following intuitive physics laws, raising the wall at the beginning and end of each video. This approach reduces reasoning complexity, simulating the developmental process where only non-surprising physical events are observed. Consequently, models must learn without supervision from video sequences depicting ordinary scenes, developing the intuitive physics understanding necessary for VoE. Furthermore, for the collision and blocking scenarios, we create videos depicting balls passing behind the wall without collision or obstruction, demonstrating the unimpeded path behind the wall as shown in Figure 4(a). We also generate scenes similar to the previously described settings but devoid of occlusion walls.

Figure 4: Training scenarios for X-VoE. The timeline progresses from left to right, where each row represents the control, collision, blocking, object permanence, and object continuity groups from top to bottom. Please refer to Section 3.2 for additional details.

4 eXplanation-based Physics Learner (XPL)

4.1 Framework

Our proposed eXplanation-based Physics Learner (XPL) model draws inspiration from developmental psychology theories concerning infancy. As depicted in Figure 3, the model comprises three key components: (1) a perception module responsible for extracting object-centric representations to facilitate downstream processing, (2) a reasoning module tasked with inferring occluded object states by considering both spatial and temporal contexts, and (3) a dynamics module designed to acquire physical insights and evaluate inference outcomes for occluded objects.

Perception

The perception module is designed to process input RGBD video sequences, represented as $\langle x_0, x_1, \dots, x_T \rangle$, alongside their corresponding segmentation masks, denoted as $\langle m_0, m_1, \dots, m_T \rangle$. The masks are generated using a pre-trained segmentation model. Notably, the simplicity of the scenes allows for direct use of ground truth segmentation, as observed in PLATO []. For each frame, the perception module employs a Component Variational Autoencoder (Component VAE) [] to transform each input image into a latent vector representation $\langle z_0^{1:K}, z_1^{1:K}, \dots, z_T^{1:K} \rangle$, where $K$ represents the object count per frame.

Reasoning

The reasoning module takes the object embeddings from the perception module as input and enhances scene comprehension by inferring the attributes of occluded objects, whose masks remain empty due to occlusion. It employs two Transformer models to refine object embeddings and recover hidden objects. Both Transformers operate on flattened spatial-temporal embeddings and apply global attention to contextualize information. The first Transformer refines input features of occluded objects to align with a learned dynamics module, producing $\tilde{z}$. The second Transformer is responsible for recovering objects hidden within observation sequences of both original and refined features. Object recovery mirrors Masked Autoencoding []: a random object is treated as absent and must be reconstructed from contextual cues. Accordingly, we train the second Transformer similarly to Masked Autoencoders (MAE).
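The MAE-style objective for the second Transformer can be illustrated with a minimal Python sketch of the masking step alone. It assumes, as a simplification of ours, that one object index is chosen at random and replaced by a fixed mask token in every frame; the function name `mask_one_object`, the mask token, and the toy embeddings are hypothetical and not taken from the released code.

```python
import random


def mask_one_object(slots, mask_token, rng):
    """Hide one randomly chosen object slot in every frame.

    slots: list over frames; each frame is a list of K embeddings
    (lists of floats). Returns (masked_slots, hidden_index). The original
    embeddings at hidden_index serve as reconstruction targets.
    Assumption: the same object index is hidden across all frames.
    """
    num_objects = len(slots[0])
    hidden_index = rng.randrange(num_objects)
    masked = [
        [list(mask_token) if k == hidden_index else list(emb)
         for k, emb in enumerate(frame)]
        for frame in slots
    ]
    return masked, hidden_index


# Two frames, three objects, 2-D embeddings (toy values).
slots = [
    [[0.0, 0.1], [1.0, 1.1], [2.0, 2.1]],
    [[0.5, 0.6], [1.5, 1.6], [2.5, 2.6]],
]
masked, hidden = mask_one_object(slots, mask_token=[0.0, 0.0], rng=random.Random(0))
targets = [frame[hidden] for frame in slots]  # what the Transformer must recover
```

A model trained on such inputs would receive `masked` and be penalized for the distance between its output at `hidden` and `targets`; the network itself is omitted here.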

Dynamics

The dynamics module predicts object embeddings $\hat{z}^{1:K}_{t+1}$ in the succeeding frame based on the refined object embeddings $\tilde{z}^{1:K}_{1:t}$ of the preceding frames. This involves employing the interaction dynamics module introduced in PLATO [], supplemented by a residual module. Unlike PLATO, we use the object embeddings produced by the reasoning module and train the modules jointly.
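One way to read the residual module is that the network predicts a change that is added to the most recent embedding, rather than predicting the next embedding from scratch. The Python sketch below illustrates that reading only; the exact form of the residual is an assumption on our part, and `finite_difference` is a toy stand-in for the learned interaction network.

```python
def residual_step(delta_fn, history):
    """Predict next-frame embeddings as the last frame plus a predicted change.

    history: list over frames; each frame is a list of K embeddings.
    delta_fn: hypothetical stand-in for the learned interaction network;
    it maps the history to one change vector per object.
    """
    last = history[-1]
    delta = delta_fn(history)
    return [[z + d for z, d in zip(obj, dz)] for obj, dz in zip(last, delta)]


def finite_difference(history):
    """Toy delta: repeat the most recent per-object change (zero for one frame)."""
    if len(history) < 2:
        return [[0.0] * len(obj) for obj in history[-1]]
    prev, last = history[-2], history[-1]
    return [[a - b for a, b in zip(lo, po)] for lo, po in zip(last, prev)]


history = [[[0.0, 0.0]], [[1.0, 0.5]]]  # one object moving at constant velocity
z_hat = residual_step(finite_difference, history)
print(z_hat)  # [[2.0, 1.0]]
```

With this formulation the network only has to model how embeddings change between consecutive frames, which ties each prediction directly to the preceding frame.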

Figure 5: (a) Performance of different models on X-VoE under the holistic metric. The red line denotes the ideal performance. (b) PCA with or without residual connection. The first ten principal components are shown. (c) Results from each score component.

4.2 Model training

Initially, we pre-train the perception module to equip the system with foundational visual capabilities. Specifically, the perception module is pre-trained on RGBD images and segmentation masks. Throughout this phase, we segment objects and employ masked images for VAE training. During image reconstruction, depth information assists in calculating object mask details.

We then train the first Transformer and the dynamics module end-to-end, with the latent codes from the perception module frozen, using the following loss:

$$\tilde{z} = f_{\text{inf}}(z), \qquad \mathcal{L} = \left\| f_{\text{dyn}}\big(\tilde{z}^{1:K}_{0:t}\big) - \tilde{z}^{1:K}_{t+1} \right\|^{2}, \tag{1}$$

Here, the Transformer employs the architecture featured in Aloe [] ($f_{\text{inf}}(\cdot)$), while the dynamics prediction module aligns with PLATO [] ($f_{\text{dyn}}(\cdot)$). The second Transformer is trained independently using MAE.
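The loss in Eq. (1) compares the dynamics prediction from frames $0$ to $t$ with the refined embeddings of frame $t+1$. The following Python sketch evaluates it for a single time step on toy data. It is not the training code: `repeat_last` is a deliberately naive, hypothetical stand-in for $f_{\text{dyn}}$, and the embeddings are hand-written lists rather than Transformer outputs.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two embedding vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def dynamics_loss(f_dyn, z_tilde, t):
    """Squared error between f_dyn's prediction from frames 0..t and frame t+1.

    z_tilde: list over frames; each frame is a list of K refined embeddings.
    """
    predicted = f_dyn(z_tilde[: t + 1])
    target = z_tilde[t + 1]
    return sum(squared_l2(p, q) for p, q in zip(predicted, target))


def repeat_last(history):
    """Toy dynamics (assumption for illustration): nothing moves."""
    return [list(obj) for obj in history[-1]]


# One object whose first coordinate jumps from 1.0 to 3.0 at the last frame.
z_tilde = [[[0.0, 0.0]], [[1.0, 0.0]], [[3.0, 0.0]]]
loss = dynamics_loss(repeat_last, z_tilde, t=1)
print(loss)  # (1.0 - 3.0)^2 = 4.0
```

In joint training, the gradient of this quantity would flow into both the dynamics module and the first Transformer that produces $\tilde{z}$, since the target itself is a refined embedding.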

5 Experiments

In this section, we thoroughly evaluate the performance of XPL using our X-VoE dataset across different experimental configurations: predicting future phenomena (predictive setup), interpreting existing phenomena (hypothetical setup), and understanding past occurrences given future conditions (explicative setup). We compare against PhyDNet [], a video prediction model, and PLATO [] on our X-VoE dataset. These models are evaluated under two different metrics.

5.1 Defining accuracy and surprise

Before delving into different evaluative configurations, we first introduce how accuracy and surprise are formally defined.

In developmental psychology experiments on VoE, surprise is defined by comparing infants’ responses to normal scenes with those that violate expectations. Similar to existing works [], we borrow the idea and define the model accuracy from the relative scores of two videos, one that violates intuitive physics laws and another that does not:

$$\text{Accuracy} = \frac{1}{N} \sum \mathbb{1}\left[s_{\text{nor}} < s_{\text{sur}}\right], \tag{2}$$

where $N$ denotes the total number of such pairs, and $s_{\text{nor}}$ and $s_{\text{sur}}$ are the scores of a normal physics video and one that violates physics, respectively. Each score sums two mismatches: one between the observation and the decoded inferred representation, and one between the inferred representation and the dynamics module’s prediction, i.e.,

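$$s = s_{\text{img}} + s_{\text{dyn}}, \tag{3}$$

where

$$s_{\text{img}} = \sum_{t=1}^{T} \ell\Big(I_t, \sum_i f_{\text{dec}}\big(\tilde{z}_t^{i}\big)\Big), \tag{4}$$

and

$$s_{\text{dyn}} = \sum_{t=2}^{T} \ell\Big(\sum_i f_{\text{dec}}\big(\tilde{z}_t^{i}\big), f_{\text{dec}}\big(f_{\text{dyn}}(\tilde{z}^{1:K}_{0:t-1})\big)\Big). \tag{5}$$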

Here, $f_{\text{dec}}(\cdot)$ denotes the learned decoder in our VAE, and we use the MSE loss for $\ell(\cdot)$.
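To make the two score components concrete, the Python sketch below computes $s_{\text{img}}$, $s_{\text{dyn}}$, and their sum from already-decoded frames. It assumes the decoding has been done elsewhere: `observed`, `reconstructed`, and `predicted` are hypothetical names for flattened frames $I_t$, the summed decoded objects, and the decoded dynamics predictions, and the first frame carries no prediction.

```python
def mse(a, b):
    """Mean squared error between two flattened images of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def surprise_score(observed, reconstructed, predicted):
    """Return (s, s_img, s_dyn) for one video.

    observed[t]:      flattened input frame I_t
    reconstructed[t]: sum over objects of the decoded inferred embeddings
    predicted[t]:     decoded dynamics prediction for frame t
                      (ignored for the first frame, which has no history)
    """
    s_img = sum(mse(i, r) for i, r in zip(observed, reconstructed))
    s_dyn = sum(mse(r, p) for r, p in zip(reconstructed[1:], predicted[1:]))
    return s_img + s_dyn, s_img, s_dyn


# Toy two-frame video with two pixels per frame.
observed = [[0.0, 0.0], [1.0, 1.0]]
reconstructed = [[0.0, 0.0], [1.0, 1.0]]  # perfect reconstruction
predicted = [None, [1.0, 3.0]]            # dynamics misses the second pixel
s, s_img, s_dyn = surprise_score(observed, reconstructed, predicted)
print(s, s_img, s_dyn)  # 2.0 0.0 2.0
```

In this toy case the observation is reconstructed perfectly, so all of the surprise comes from the dynamics term, which is the situation one would expect when a video looks ordinary frame by frame but evolves in a physically unexpected way.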

Figure 6: Performance of different models on X-VoE under the comparative metric. The red line denotes the ideal performance. The top part shows the absolute comparative values and the bottom part shows the difference from the ideal.

5.2 The holistic metric

Similar to Smith et al. [], we adopt the holistic metric to evaluate VoE effects in all pairs of unexpected and normal event videos. Ideally, an intuitive physics model should produce higher surprise scores for unexpected events. Formally, the holistic metric is defined as

$$\frac{1}{n_s n_c} \sum_{i,j} \mathbb{1}\left[s(x_i^{+}) > s(x_j^{-})\right], \tag{6}$$

where $x_i^{+}$ and $x_j^{-}$ denote the unexpected and normal videos, and $n_s$ and $n_c$ are the numbers of unexpected and normal videos, respectively. This metric aggregates results across all confounding factors, including interference from colors, shapes, scene complexity, etc. Therefore, it provides a holistic view of models’ understanding of intuitive physics events; models need to judge the unexpectedness of outcomes from the intuitive physics perspective, disentangling all other confounding factors.
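As a small worked example of Eq. (6), the Python sketch below compares every unexpected video against every normal video and reports the fraction of comparisons in which the unexpected one scores higher. The score lists are made-up toy values, and `holistic_metric` is an illustrative name rather than a function from the released code.

```python
def holistic_metric(unexpected_scores, normal_scores):
    """Fraction of (unexpected, normal) video pairs ranked correctly.

    Every unexpected video is compared with every normal video,
    regardless of which scene pair either one came from.
    """
    wins = sum(
        1
        for s_plus in unexpected_scores
        for s_minus in normal_scores
        if s_plus > s_minus
    )
    return wins / (len(unexpected_scores) * len(normal_scores))


unexpected_scores = [3.0, 1.0]  # toy surprise scores
normal_scores = [2.0, 0.0]
value = holistic_metric(unexpected_scores, normal_scores)
print(value)  # 3 of the 4 comparisons are correct: 0.75
```

Because the comparison crosses scene pairs, a video with a visually complex but physically normal scene can outrank an unexpected video from a simpler scene, which is how confounding factors lower this metric.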

As shown in Figure 5(a), we measure the holistic value of different models on X-VoE. Both XPL and PLATO show better performance in all four testing scenarios, though with a notable gap from perfection. XPL is significantly better than PLATO in collision, blocking, and permanence, but less so in continuity. We also compare different dynamics modules, with or without the residual, in Figure 5. The results show that the residual connection in the dynamics module plays a critical role in our system, as evidenced by the results for collision and blocking. An in-depth analysis using Principal Component Analysis (PCA) in Figure 5(b) shows that after adding the residual connection, the standard deviation in different principal components is notably reduced, making learning easier.

To investigate the contribution of each of the two surprise components in Eq. (3), we compute the holistic metric from each of them separately. As shown in Figure 5(c), the performance of $s_{\text{dyn}}$ is superior to that of $s_{\text{img}}$ in the collision and blocking scenarios, whereas the performance of $s_{\text{img}}$ is better in permanence and continuity. This result implies that the violation of physical knowledge plays a more important role in collision and blocking. In contrast, the mismatch from the observation is a more crucial factor for permanence and continuity. Thus, the residuals in XPL, explicitly taking earlier information into computation, could exert a greater influence on the dynamics module and its impact in the collision and blocking scenarios, as shown in Figure 5(a).

The holistic metric only provides a global view of how a model understands intuitive physics. To paint a more complete landscape of a model, we look deeper into the comparative metric in the next section.

5.3 The comparative metric

The comparative metric, similar to ones proposed in the literature [], is calculated over pairs of unexpected and normal events within one specific setting in each scenario,

$$\frac{1}{n} \sum_{i} \mathbb{1}\left[s(x_i^{+}) > s(x_i^{-})\right], \tag{7}$$

where $x_i^{+}$ and $x_i^{-}$ are the two paired videos in each setting and $n$ is the number of such pairs. The comparative metric is also most commonly used in evaluating infants’ intuitive physics knowledge in developmental psychology [].

Whereas the holistic metric describes whether an observation sequence is absolutely surprising from a holistic perspective, the comparative metric assesses whether one observation sequence is more surprising than another from a comparative perspective. Although the holistic metric provides an overall perspective, it lacks the detailed results of the three specific cases the comparative metric provides; see Figure 6. In each scenario in X-VoE, both videos in the hypothetical setting are likely to occur, while only one of the two videos in the predictive and explicative settings is likely to occur. Therefore, the comparative metric in the hypothetical setting should be ideally 50%, while the metric in the predictive and explicative settings should be ideally 100%.
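The comparative metric of Eq. (7) and the ideal values just stated can be sketched in a few lines of Python. The score pairs are toy numbers, the names are illustrative, and the signed difference from the ideal is our reading of how a deviation from the target would be reported.

```python
# Ideal comparative values per setting, as stated in the text.
IDEAL = {"predictive": 1.0, "hypothetical": 0.5, "explicative": 1.0}


def comparative_metric(score_pairs):
    """Fraction of scene pairs whose first video scores higher than its partner.

    score_pairs: list of (s_plus, s_minus) tuples, one per scene pair.
    """
    wins = sum(1 for s_plus, s_minus in score_pairs if s_plus > s_minus)
    return wins / len(score_pairs)


def difference_from_ideal(value, setting):
    """Signed gap between a measured comparative value and its ideal."""
    return value - IDEAL[setting]


# Toy hypothetical-setting scores: both videos of a pair are plausible,
# so neither should be systematically more surprising.
pairs = [(2.0, 1.0), (1.0, 2.0), (3.0, 0.0), (0.0, 1.0)]
value = comparative_metric(pairs)
gap = difference_from_ideal(value, "hypothetical")
print(value, gap)  # 0.5 0.0
```

A value of 0.5 is ideal only in the hypothetical setting; the same value in the predictive or explicative setting would sit 0.5 below the target, which is why the three settings have to be read together.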

Figure 6 shows the comparative values of different models. The results in the predictive setting indicate that current AI systems, even ones as simple as general video prediction models, can accurately predict future outcomes for such a simple task. However, when it comes to the setting that requires reasoning and explanation (i.e., explicative), only XPL can consistently achieve over 50%. Whereas common predictive models can only predict future occurrences based on past conditions, XPL can reason about the past conditions that lead to the observation, a critical ability necessary for successfully solving the explicative setting.

Figure 7: Training: Visualization of the internal representation in PLATO and XPL during training.

Of these, the hypothetical setting is where we notice the most performance volatility. For the hypothetical setting, both a random-answering human subject and an ideal human subject with perfect understanding would reach 50% accuracy. However, this is exactly why this problem is intriguing for psychologists. From this perspective, a model achieving 50% could mean it is either the worst or the best. While PhyDNet achieves nearly 50% in the hypothetical setup, it reaches only random-level performance in the explicative setting, showing that the model does not understand the different possibilities behind the wall. This is why the explicative setting is so important. The explicative setting provides more new information in the video follow-up than the hypothetical setting. As shown in , the new information will change a possible scene to an impossible scene in the hypothetical setting. The metric gap between the hypothetical setting and the explicative setting shows the power of explanatory abilities. XPL demonstrates this property on both collision and blocking scenarios, especially on the collision scenario, where this gap reaches close to 90%.

Although XPL with and without a residual module both include the reasoning module, they differ in their explanatory abilities in the hypothetical and explicative settings. In the collision and blocking tasks, the presence of residuals improves performance in the explicative setting but not in the hypothetical setting. The residual module strengthens the connection between two consecutive frames, allowing the reasoning module to better infer the previous state from the subsequent state. The main difference between the hypothetical and explicative settings is the inclusion of follow-up information. In the explicative setting, the follow-up information enhances the performance of the reasoning module (with the residual module) because more subsequent state information is available. In the hypothetical setting, however, the absence of follow-up information negatively impacts the module’s performance.

Overall, XPL improves over the previous state of the art but still fares worse on collision and continuity. While developmental psychology experiments have found this ability in infants [], it remains a challenge for AI systems.

5.4 Visualization results

Figure 8: Testing: Visualization of the inferred internal representation in XPL during testing. This example corresponds to the settings in .

The challenge of visual occlusion persists in computer vision. Unless the ground-truth value is given directly, it is difficult to characterize occluded objects by vision alone, especially in the case of complete occlusion. However, humans can deduce occluded objects and corresponding physical phenomena intuitively, even under complete occlusion. We investigate through visualization whether XPL can reason about occluded objects.

We visualize occluded objects within the learned representation. Specifically, we mask the token associated with the wall and decode the resulting features to assess the model’s ability to reconstruct hidden objects. Training visualization results are presented in Figure 7. Notably, PLATO lacks a dedicated reasoning module for occluded objects and therefore cannot recover occluded factors. Conversely, XPL gradually learns to infer the presence of occluded objects behind the wall to explain observations. Crucially, we never provide ground-truth occluded object representations during training, emphasizing the importance of synchronized training of the inference and dynamics modules. This approach allows XPL to achieve improved occluded object restoration, though it still falls short of ground-truth results ().

For test visualization, detailed results corresponding to are showcased in Figure 8. The predictive setting demonstrates XPL’s accurate reconstruction of observed objects. In the hypothetical setting, XPL provides coherent explanations involving hidden object interactions. In the explicative setting, the occluder is lifted toward the end of the videos, resulting in surprising outcomes.

To conclude, XPL proficiently reconstructs occluded objects and provides visual explanations for various events, underscoring its capacity to reason about hidden factors in the context of intuitive physics.

6 Conclusion and discussion

In this paper, we introduced X-VoE, a novel explanation-based Violation of Expectation (VoE) dataset consisting of four distinct scenarios, each encompassing three unique settings: predictive, hypothetical, and explicative. While the predictive setting aligns with conventional VoE tasks, the other two settings focus on evaluating a model’s explanatory capacity. Our proposed XPL combines reasoning and explanation processes to address occluded objects, offering enhanced performance within the X-VoE settings. Our experiments revealed that XPL excels in scenarios requiring explicit explanations for occluded objects, positioning it ahead of other methodologies. Notably, the decoded representation from XPL offers visual explanations for occluded events, highlighting its ability to reason about hidden factors.

Our work underscores the pivotal role of explanations in VoE tasks, particularly concerning occluded objects and their contribution to video comprehension. Even when objects are obscured by walls, the possibility of underlying physical events remains, and a model equipped with explanation capabilities performs more adeptly in such situations. The capacity to reason about occluded objects extends the model’s scope beyond mere video prediction, enabling it to capture intuitive physics principles more effectively.

However, certain challenges persist. Notably, XPL encounters difficulties in scenarios that demand high-level explanations, such as the hypothetical setting in collision or continuity (). These limitations underscore the need for further advancements in the reasoning aspect of our model, paving the way for future research. The ability to handle complex interactions and provide meaningful explanations remains a challenging aspect that requires careful consideration in model design.

In conclusion, while our model’s reasoning capabilities are still a work in progress, our study sheds light on the integration of explanations into VoE tasks, aiming to develop models with a level of intuitive physics comprehension akin to that of infants. The focus on occluded objects and their explanatory potential broadens the scope of VoE tasks and encourages the development of AI systems with deeper understanding.

6.1 Limitations

Method

Despite its strengths, XPL faces certain limitations. It struggles in some experiments, particularly the hypothetical setting in collision or continuity (), where its performance falls short of human-like comprehension. Furthermore, our explanation process employs a basic Transformer module, lacking physics-related inductive biases that could enhance performance. A promising direction for future research lies in incorporating domain-specific inductive biases that exploit physical principles to improve reasoning and explanatory capabilities.

Accuracy metric

Although our accuracy metrics draw inspiration from developmental psychology experiments and prior works, they rely on video comparisons to evaluate violations of intuitive physics. This approach, while effective, assumes that one of the videos violates intuitive physics laws, even if the difference in surprise values is marginal. As a result, the method might struggle to achieve the desired metrics in scenarios like the hypothetical setting. Exploring metrics that focus on higher-level concepts and the detection of fundamental violations could yield insights into the underlying mechanisms that drive these evaluations.

Dataset

X-VoE pioneers the evaluation of physical explanatory abilities in VoE tasks. However, our test scenarios could be more diverse and comprehensive. Future efforts will expand and diversify these scenarios to create a more robust framework for testing intuitive physics understanding in VoE. By incorporating a wider range of physical phenomena and interactions, future datasets can challenge AI systems with greater complexity.

6.2 Future Directions

Future research should focus on refining XPL’s reasoning capabilities, enhancing its performance in scenarios demanding higher-order explanations. Introducing more sophisticated physics-based inductive biases could contribute to better occluded object reasoning. Additionally, exploring hybrid approaches that combine neural networks with symbolic reasoning could lead to more advanced models with enhanced explanatory capabilities.

Additionally, X-VoE can serve as a stepping stone for designing more intricate and varied scenarios. Incorporating more complex physical interactions, occlusions, and multiple objects would lead to a richer and more challenging testbed for evaluating AI systems’ intuitive physics comprehension. Diverse scenarios can provide comprehensive evaluation of models’ understanding across a wide range of intuitive physics principles.

In summary, our study provides insights into the integration of explanations in VoE tasks and sets the stage for future advancements in both model design and dataset development. The intersection of explanations and intuitive physics comprehension holds promise for creating AI systems that not only predict events but also understand the underlying physical principles that govern them.

Acknowledgment

The authors would like to thank the four anonymous reviewers for their constructive feedback, Huiyin Li (BIGAI) for designing the figures, and NVIDIA for their generous support of GPUs and hardware. This work is supported in part by the National Key R&D Program of China (2022ZD0114900) and the Beijing Nova Program.

References

Appendix A Dataset

A.1 Test data

For the VoE task, we divided the four scenarios into 11 groups, each with two comparison cases. The setups in the testing data are very similar to the ones in the training data except for the behavior of the wall. All scenarios except Permanence contain predictive, hypothetical, and explicative settings. The predictive and explicative settings contain both plausible and implausible events, while the hypothetical setting contains two plausible events. In the predictive setting, the wall is moved away at the beginning and end of the video, so all information is shown at the beginning and end of the video. In the hypothetical setting, the wall always stays in the middle of the scene. In the explicative setting, the wall is moved away only at the end of the video, so new information is shown to the model at the end of the video.

Collision

The Collision scenario is shown in Figure A1. Collision contains predictive, hypothetical, and explicative settings. In the predictive setting, the wall is moved away at the beginning and end of the video, so two balls are visible to the model. We can easily tell from intuitive physics that the case in the first row is possible while the case in the second row is not, because the red ball cannot pass through the blue ball without collision. In the hypothetical setting, the wall always stays in the middle of the scene, so we cannot tell how many balls there are in the scene. As we cannot infer whether a blue ball is hidden behind the wall at the beginning of the video, both cases in the setting are possible. In the explicative setting, the wall is moved away at the end of the video, so additional information is given. We can infer that a blue ball must be hidden behind the wall, so the case in the first row is possible, while the case in the second row is not.

Blocking

The Blocking scenario is shown in Figure A2. The Blocking scenarios are similar to the Collision scenarios, except that the ball hidden behind the wall is replaced by a fixed cube. In the predictive setting, the wall is moved away at the beginning and end of the video, so the cube is visible to the model. As in Collision, we can easily tell that the case in the first row is possible while the case in the second row is not, because the blue ball cannot pass through the green cube without collision. In the hypothetical setting, the wall always stays in the middle of the scene, so we cannot tell whether there is a cube behind the wall. Therefore, both cases in the setting are possible. In the explicative setting, the wall is moved away at the end of the video, so we can infer that a cube must be hidden behind the wall. Furthermore, we can tell that the case in the first row is possible while the case in the second row is not.

Permanence

The Permanence scenario is shown in Figure A3. In the Permanence scenarios, three cubes are randomly divided into two groups (allowing empty groups): cubes in the first group are dropped to the ground, while those in the second rest on the floor. We do not have an explicative setting for this scenario, as there is no new evidence at the end of the video. In the predictive setting, the wall is moved away at the beginning of the video, so we can infer that there is no object on the ground at the beginning. The case in the second row is therefore impossible, while the case in the first row is possible. In the hypothetical setting, the wall stays in the middle of the scene at the beginning, so we cannot tell whether there are cubes on the ground at the beginning, and both cases are possible.

Continuity

The Continuity scenario is shown in Figure A4. In the Continuity scenarios, we create a window on the lower half of the wall. In the case with the wall, the ball rolls across the scene. When the ball passes behind the wall, it can be seen through the window going from one side to the other. In the predictive setting, the wall is moved away at the beginning of the video, so we can infer that only one ball is in the scene. We can tell that the case in the second row is impossible while the case in the first row is possible. In the hypothetical setting, the wall always stays in the middle of the scene, and we can easily infer that the case in the first row is possible. For the case in the second row, we cannot tell whether there are two balls with the same appearance in the scene, one visible at the beginning and the other hidden by the right part of the wall. If that is true, the case in the second row is also possible, so both cases are possible. In the explicative setting, the wall is moved away at the end of the video, so we can infer that there is only one ball in the scene. Thus we can tell that the case in the first row is possible while the case in the second row is not.

A.2 Train data

For four scenarios, we created 5 groups for training. Each of Permanence and Continuity contains 1 group, while Collision and Blocking in total contain 3 groups. Each group contains 2 kinds of cases: cases with a wall and ones without a wall. In the case with a wall, a movable wall stands in the middle of the scene and will be moved away at the beginning and the end of the video. In the case without the wall, everything stays the same except that the wall does not exist, showing that the wall won’t interact with other objects physically. Each row in the corresponds to one sampled video in a specific case. See for all training groups.

Control group

In the control group, a ball rolls across the scene without interacting with other objects, indicating that the environment follows basic physics.

Collision group

In the Collision scenario with the wall, a ball rolls across the scene. Another ball with the same mass but a different color is hidden behind the wall and collides with the incoming ball, causing the incoming ball to stop and the hidden ball to continue across the scene. In the setting without a wall, the second ball is always visible.

Blocking group

The Blocking scenarios are similar to the Collision scenario, except that the ball hidden behind the wall is replaced by a fixed cube. A ball rolls across the scene in the blocking setting with the wall. A fixed cube is hidden behind the wall and will collide with the incoming ball, causing the incoming ball to turn around. In the setting without a wall, everything stays the same except that the wall doesn’t exist, and the cube will always be visible.

Permanence group

In the Permanence scenario, three cubes are randomly divided into two groups (allowing empty groups): cubes in the first group are dropped to the ground, while those in the second rest on the floor. In the setting with the wall, the wall is moved away at the end of the video, showing that all of the cubes still exist. In the setting without the wall, the cubes are always visible.

Continuity group

In the Continuity scenario, we create a window on the lower half of the wall. In the setting with the wall, the ball rolls across the scene. When the ball passes through the wall, it can be seen going from one side to the other, especially visible from the window. In the setting without the wall, the ball will always be visible.

A.3 Environment

Our X-VoE dataset comprises 22K+100K scenes procedurally generated with Unreal Engine 4. In addition to the floors and the backgrounds, there are four different object types: balls, cubes, walls, and windowed walls. In all videos, the ball and the cube have the same size, while the sizes of the walls, with or without windows, vary randomly. The positions of objects are set randomly in the videos, except for the walls in the Permanence scenes, where the wall is placed in the middle. All objects, including the floor and the background, are assigned random colors.

Table A1: Spatial broadcast decoder architecture (from top to bottom).
Type	Size	Activation	Comment
Spatial Broadcast	8 × 8	-	-
Position Embedding	-	-	-
Conv 5 × 5	64	ReLU	stride: 2
Conv 5 × 5	64	ReLU	stride: 2
Conv 5 × 5	64	ReLU	stride: 2
Conv 5 × 5	64	ReLU	stride: 2
Conv 5 × 5	64	ReLU	stride: 1
Conv 3 × 3	4	-	stride: 1
Channels	RGBD(4)	Softmax (on depth channel)	softmax(depth × abs( $θ$ ) × -1000.0)

Table A2: The Transformer architecture (from top to bottom). [M] is a learnable mask token for the Transformer.
Type	Size	Activation	Comment
LP (1)	256	-	-
Mask (1)	-	× mask + [M] × (1-mask)	mask : (size F × N × 1), (value 0 or 1)
Position Embedding	-	-	-
Transformer	256, 256 (MLP)	ReLU (MLP)	head=8,key=32,layers=6
LP (2)	256	-	-
Mask (2)	-	× (1-mask) + inputs × mask	mask : (size F × N × 1), (value 0 or 1)

Appendix B Model

B.1 Perception

The perception module in XPL is similar to the Component Variational Autoencoder (ComponentVAE) in the PLATO model []. For each object $k$ in an image, we take as input a 128 × 128 RGBD image $x_k$ (0-255 for each channel) that is masked except around the object. We then use a Vision Transformer [] encoder $\Phi$ to encode the single-object image into a 32-dimensional Gaussian posterior distribution $q_\Phi(z_k \mid x_k)$. The sample from this distribution, $z_k$, is decoded by a spatial broadcast decoder [] into an RGBD image. To address occlusion, we use the depth channel of the decoded images to combine all objects in the image by multiplying them with softmaxed depth values. We first pretrained the perception module by optimizing the variational objective defined in []. We set $σ$ to 0.05, $β$ to 0.5, and $γ$ to 0 to ensure that the model reconstructs object masks without segmentation information in the loss function.

ViT encoder

We first reshape the 128 × 128 × 4 images into a sequence of flattened 16 × 16 × 256 patches, followed by a linear layer with 256 dimensions. Next, we add 2D position embeddings and learnable embeddings, flatten the result, and send it to a Transformer. The Transformer [] uses 8 attention heads, a key dimension of 32, an MLP layer dimension of 1024, and 6 Transformer layers. Finally, we apply an MLP with layer sizes [512, 64] and a leaky-ReLU activation function to the Transformer output to obtain a 32-dimensional Gaussian posterior distribution for each object.
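The patch extraction step can be sketched as follows. This is a minimal illustration that reads "16 × 16 × 256" as a 16 × 16 grid of patches, each flattened to 256 values, which implies 8 × 8 pixel patches (8 · 8 · 4 = 256); that reading, the function name `patchify`, and the random projection matrix standing in for the learned linear layer are all assumptions.

```python
import numpy as np


def patchify(image: np.ndarray, patch: int = 8) -> np.ndarray:
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch * patch * C), with
    patches ordered row by row over the grid.
    """
    h, w, c = image.shape
    gh, gw = h // patch, w // patch
    x = image.reshape(gh, patch, gw, patch, c)
    x = x.transpose(0, 2, 1, 3, 4)  # (gh, gw, patch, patch, c)
    return x.reshape(gh * gw, patch * patch * c)


rng = np.random.default_rng(0)
image = rng.random((128, 128, 4))        # stand-in for one masked RGBD image
tokens = patchify(image)                 # 16 * 16 = 256 patches of 256 values
print(tokens.shape)                      # (256, 256)

# Placeholder for the 256-dimensional linear layer (weights are random here).
projection = rng.normal(scale=0.02, size=(256, 256))
embedded = tokens @ projection
print(embedded.shape)                    # (256, 256)
```

Position embeddings and the Transformer itself are omitted; the sketch only fixes the token layout that they would operate on.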

Spatial broadcast decoder

Our spatial broadcast decoder is similar to that in []. As shown in Table A1, we use position embeddings and a CNN to decode the object embeddings. The parameter $θ$ in the softmax layer is learnable, so the object mask is represented in terms of depth.
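The depth-based combination in the last row of Table A1 can be illustrated with a small NumPy sketch. It assumes that the softmax is taken across objects at each pixel and that the resulting weights blend the per-object RGB channels; the function name `composite_by_depth`, the toy inputs, and the value of `theta` are hypothetical.

```python
import numpy as np


def composite_by_depth(rgbd: np.ndarray, theta: float) -> np.ndarray:
    """Blend K per-object RGBD images into one RGB image.

    rgbd has shape (K, H, W, 4). Weights are a per-pixel softmax over
    objects of depth * abs(theta) * -1000.0, so the object with the
    smallest depth value dominates each pixel.
    """
    rgb, depth = rgbd[..., :3], rgbd[..., 3]
    logits = depth * abs(theta) * -1000.0                 # (K, H, W)
    logits = logits - logits.max(axis=0, keepdims=True)   # numerical stability
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=0, keepdims=True)
    return (weights[..., None] * rgb).sum(axis=0)         # (H, W, 3)


# Two toy objects on a 2 x 2 canvas: red with depth 0.2, blue with depth 0.8.
rgbd = np.zeros((2, 2, 2, 4))
rgbd[0, ..., 0] = 1.0
rgbd[0, ..., 3] = 0.2
rgbd[1, ..., 2] = 1.0
rgbd[1, ..., 3] = 0.8

image = composite_by_depth(rgbd, theta=1.0)
print(image[0, 0])  # dominated by the red object, which has the smaller depth
```

Because of the large factor of 1000, the softmax behaves almost like a hard arg-min over depth, which is what lets a soft, differentiable operation act as an occlusion mask.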

B.2 Reasoning

In the reasoning module, we use two Transformer modules to reason about hidden objects that are occluded in some or all of the frames. All objects in a video can be arranged as F × N × D embeddings, where F is 15 frames, N is 8 objects, and D is 32 dimensions in our work. As shown in Table A2, we use a Transformer model to reason about the masked objects in a video, similar to the self-supervised learning module in Aloe []; the parameter [M] in the Mask (1) part is learnable.
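The two masking operations listed in Table A2 can be sketched in NumPy as below. This is only an illustration of the mask arithmetic: the linear projections and the Transformer are replaced by a random placeholder, the masks are applied directly at the embedding dimension D, and the names `apply_input_mask` and `apply_output_mask` are hypothetical.

```python
import numpy as np

F, N, D = 15, 8, 32                      # frames, object slots, embedding size
rng = np.random.default_rng(0)

inputs = rng.normal(size=(F, N, D))      # object embeddings for one video
mask = np.ones((F, N, 1))                # 1 = observed, 0 = to be inferred
mask[5:10, 2] = 0.0                      # slot 2 is occluded in frames 5..9
mask_token = rng.normal(size=(D,))       # stands in for the learnable [M]


def apply_input_mask(inputs, mask, mask_token):
    """Mask (1): keep observed embeddings, replace the rest with [M]."""
    return inputs * mask + mask_token * (1.0 - mask)


def apply_output_mask(outputs, inputs, mask):
    """Mask (2): keep model outputs only where the input was masked."""
    return outputs * (1.0 - mask) + inputs * mask


masked_inputs = apply_input_mask(inputs, mask, mask_token)
model_outputs = rng.normal(size=(F, N, D))   # placeholder for the Transformer
result = apply_output_mask(model_outputs, inputs, mask)

print(np.allclose(masked_inputs[5, 2], mask_token))  # True
print(np.allclose(result[0], inputs[0]))             # True
```

With this arrangement the model can only change embeddings at masked positions, so observed object states pass through unchanged.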

First Transformer

We set the mask to 0 for objects that are temporarily occluded in some frames, and to 1 for all others. As shown in , we can use the Transformer model to infer new object embeddings wherever the mask equals 0. We use it in both the training and testing steps to obtain better object embeddings for the whole video.

Second Transformer

In our test dataset, there may be cases where an object is obscured in all frames. In the training step, we therefore set the mask to 0 for one random object (including an empty object) in all frames, which lets us train the second Transformer model in a self-supervised manner. In the test step, we set the mask to 0 for one object that is not visible in any frame. We can then reason about the occluded object to explain the whole video.
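The masks used by the two Transformers can be constructed as in the following sketch. It assumes a boolean visibility array of shape F × N is available from the perception stage; the function names and the rule of picking the first never-visible slot at test time are assumptions for illustration.

```python
import numpy as np


def first_transformer_mask(visible: np.ndarray) -> np.ndarray:
    """Mask for the first Transformer: 0 where an object is occluded."""
    return visible.astype(np.float64)[..., None]          # (F, N, 1)


def second_transformer_train_mask(num_frames, num_objects, rng):
    """Training: hide one randomly chosen slot in every frame."""
    mask = np.ones((num_frames, num_objects, 1))
    slot = int(rng.integers(num_objects))
    mask[:, slot] = 0.0
    return mask, slot


def second_transformer_test_mask(visible: np.ndarray) -> np.ndarray:
    """Testing: hide one slot that is not visible in any frame, if any."""
    mask = np.ones(visible.shape + (1,))
    never_visible = np.flatnonzero(~visible.any(axis=0))
    if never_visible.size > 0:
        mask[:, never_visible[0]] = 0.0                   # assumed selection rule
    return mask


F, N = 15, 8
visible = np.ones((F, N), dtype=bool)
visible[4:9, 1] = False      # slot 1 is temporarily occluded
visible[:, 7] = False        # slot 7 is never observed (e.g., an empty slot)

m1 = first_transformer_mask(visible)
m2_train, hidden_slot = second_transformer_train_mask(F, N, np.random.default_rng(0))
m2_test = second_transformer_test_mask(visible)
print(int(m1.sum()), int(m2_train.sum()), int(m2_test.sum()))  # 100 105 105
```

Hiding a whole slot during training gives the second Transformer a supervised target at no labeling cost, since the hidden slot's embeddings are known from the unmasked video.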

B.3 Dynamics

In fact, the occluded objects are never directly seen by the Transformer model. After the first reasoning module, we obtain reasonable video object embeddings based on experience. In the dynamics module, we predict the incremental change of the object embeddings at each time step using the same dynamics module as PLATO [], the only difference being the object embedding dimension (32 instead of 16). We refer the readers to [] for architectural details.

Table A3: Training parameters. The pre-processed video features are computed by the pre-trained Perception module.
Model	batch size	training step	optimizer	learning rate	warm step	delay step
Perception module (in XPL, PLATO)	300 (images)	472000	Adam	0.0004	2000	100000
XPL	500 (pre-processed video features)	32000	Adam	0.0004	1000	10000
PLATO	500 (pre-processed video features)	32000	Adam	0.0004	1000	10000
PhyDNet	100 (videos)	70000	Adam	0.001	-	-

Appendix C Training

C.1 Training detail

In a scene with occlusion, we cannot obtain the representation of the occluded object directly by observation. Therefore, we first apply the dynamics loss to the object embeddings produced by the first Transformer to train the first Transformer and the dynamics model. Then, we use the object embeddings produced by the first Transformer to train the second Transformer model. We randomly mask one object across all video frames and use the model to predict the representations of the objects throughout the video, enabling the model to infer whether there is a fully hidden object in the test task.

C.2 Training parameters

We first pre-train the perception module and use it for both PLATO and XPL. Then we train our model XPL, PLATO, and PhyDNet with the parameters shown in Table A3.

C.3 Training steps

During the development of the model, we explored how the size of the training dataset affects the pixel loss of the dynamics module. We use the expected video in the predictive setting of all scenarios as the test dataset to calculate the average pixel loss. Figure A5 shows that more training data improves the performance of the dynamics module.

Appendix D Visualize supplementary

In the main text, we visualize the reasoning results by our model in the Blocking scenario. Here, we visualize the reasoning results for the rest of the scenarios.

D.1 Collision

As shown in Figure A6, in the predictive setting, XPL has no problem accurately reconstructing the objects, and the surprising video can be identified directly. In the hypothetical setting, the possible explanation for the first video is that another ball collides with the incoming ball, whereas no such ball is present in the second video; this explains both cases. This result also shows a limitation of XPL, as the incoming ball did not stop behind the wall. In the explicative setting, the occluder is moved away only at the end of the videos. Unlike in the hypothetical setting, once a hidden ball is revealed behind the occluder, it is impossible for the incoming ball to pass through, causing surprise.

D.2 Permanence

As shown in Figure A7, in the predictive setting, XPL can reconstruct the objects behind the wall, and the surprising video can be found by comparing the reconstruction with the original image. The visual quality of the reconstructed objects is not very good, which remains a limitation of XPL. In the hypothetical setting, the possible explanation for the second video is that another object exists behind the wall, and XPL can reason about this object.

D.3 Continuity

As shown in Figure A8, the visualization results of XPL are the same in all settings. Although the visualization results can reveal surprise in the predictive and explicative settings when compared with the original videos, XPL still cannot handle the hypothetical setting due to the limitation discussed in the main text. XPL requires given masks and identification of objects. Therefore, it cannot reason about the hypothetical setting in continuity by changing the identification of objects and positing two identical objects, as infants do [].

Figure A1: Collision test groups.

Figure A2: Blocking test groups.

Figure A3: Permanence test groups.

Figure A4: Continuity test groups.

Figure A5: Average pixel loss of test data for different sizes of training data.

Figure A6: Visualization of the inferred internal representation in XPL during testing in collision scenarios.

Figure A7: Visualization of the inferred internal representation in XPL during testing in permanence scenarios.

Figure A8: Visualization of the inferred internal representation in XPL during testing in continuity scenarios.