Emulating Human Developmental Stages with Bayesian Neural Networks

Marcel Binz (binz@staff.uni-marburg.de)
Department of Psychology, Theoretical Neuroscience Group, Philipps-Universität Marburg

Dominik Endres (dominik.endres@staff.uni-marburg.de)
Department of Psychology, Theoretical Neuroscience Group, Philipps-Universität Marburg

We compare the acquisition of knowledge in humans and machines. Research in developmental psychology indicates that the hypotheses humans employ are initially guided by simple rules before evolving into more complex theories. This observation is shared across many tasks and domains. We investigate whether stages of development in artificial learning systems are based on the same characteristics. We operationalize developmental stages as the size of the data-set on which the artificial system is trained. For our analysis we look at the developmental progress of Bayesian Neural Networks on three different data-sets, covering occlusion, support and quantity comparison tasks. We compare the results with prior research from developmental psychology and find agreement between the family of optimized models and the pattern of development observed in infants and children on all three tasks, indicating common principles for the acquisition of knowledge.

Keywords: core knowledge; developmental psychology; intuitive physics; approximate number system; machine learning; deep learning; variational inference; normative models

1 Introduction

The theory of core knowledge in developmental psychology identifies several domains that build the foundations of human cognition ( ). Typically, physics, actions, numbers, space and social interactions are listed among the core domains. Knowledge in these areas is present from early stages of childhood and serves as the basis for learning during later life. Research in developmental psychology over the past decades has equipped us with a solid understanding of how such knowledge is acquired. Different stages of development have been identified for a wide range of phenomena. Insights across studies suggest that established rules generally start as a simple hypothesis before becoming more sophisticated over time ( ).

We investigate whether current machine learning systems show generalization behavior reminiscent of human infants and children at different stages of development. For this purpose, we assume that the amount of data available to the learning algorithm is proportional to human age. The class of models we focus on is Bayesian Neural Networks (BNNs), which are trained through variational inference. Neural networks have the desirable property of being able to approximate an arbitrarily complex mapping given enough capacity, while Bayesian inference captures normative principles of how to update an initial belief in the light of new evidence. The specific choice of variational inference and neural networks is mainly a matter of convenience, and we hypothesize that many different combinations of universal function approximators and Bayesian learning would lead to comparable results.

Our experiments focus on two of the established core domains: physics and numbers. We consider two experiments involving intuitive reasoning about the laws of physics ( ) and one examining the approximate number system, which is responsible for forming fast but imprecise representations of quantities ( ). In all three cases, as we increase the data-set size, we observe patterns in our models that share similarities with developmental progress during childhood.

There has been recent interest in replicating reasoning capabilities from the core domains in artificial systems. Prior work on intuitive physics has considered generative ( ) as well as discriminative models ( ). Both classes of models are often able to reach performance levels comparable to those of adults on specific tasks. Another core domain that has received some attention within the computational modelling community is intuitive psychology. Here, for example, Baker et al. (2009) suggest employing Bayesian inverse planning for inferring the mental states of other agents. In contrast to the aforementioned prior work, we are interested in the differences between optimal models for varying data-set sizes and how these differences compare to observations made in developmental psychology. Existing work on modelling the development of intuitive physics is limited to descriptive models, such as lists of rules ( ) or decision trees ( ). In contrast, our approach is based on normative principles, and we ask whether the observed stages emerge naturally in complex artificial learning systems.

In the next section we provide a short technical overview of neural networks and variational inference. This is followed by a description of the three experiments under examination. For each experiment we outline the given task, the empirical observations made in the developmental psychology literature and how we construct an artificial data-set. Finally, we compare the developmental progress of children at different ages with that of optimized models for different data-set sizes. We conclude the article with a discussion of the obtained results and an outlook on future interaction between the areas of machine learning and developmental psychology.

2 Methods

2.1 Deep Learning

Neural networks are parametric function approximators that combine linear transformations $W$ and non-linear activation functions $f$ in alternating fashion:

$$h_l = f_l\left(W_l^\top h_{l-1}\right)$$

where $l \in \{1, \dots, L\}$. $h_0$ corresponds to the input $x$ and $h_L$ to an estimate of the target $\hat{y}$. Parameters of the model are commonly updated via gradient descent on a loss function, usually some form of negative log-likelihood. The power of neural networks stems from their ability to approximate any continuous function on a compact subset of $\mathbb{R}^n$.
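The layer recursion above can be sketched in a few lines of plain Python. This is only an illustration of the formula: the helper names (`elu`, `affine`, `forward`, `init_weights`), the Gaussian initialisation and the tiny layer sizes are assumptions made for the example and are not taken from the authors' code.

```python
import math
import random


def elu(v):
    """Exponential linear unit applied element-wise to a list of floats."""
    return [x if x > 0 else math.exp(x) - 1.0 for x in v]


def identity(v):
    """Linear output activation."""
    return list(v)


def affine(W, h):
    """Compute W^T h, with W stored as a list of rows of shape (inputs x outputs)."""
    n_out = len(W[0])
    return [sum(W[i][j] * h[i] for i in range(len(h))) for j in range(n_out)]


def forward(x, weights, activations):
    """Apply h_l = f_l(W_l^T h_{l-1}) for l = 1..L, starting from h_0 = x."""
    h = list(x)
    for W, f in zip(weights, activations):
        h = f(affine(W, h))
    return h


def init_weights(sizes, seed=0):
    """Hypothetical Gaussian initialisation, used only to make the sketch runnable."""
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 0.1) for _ in range(n_out)] for _ in range(n_in)]
        for n_in, n_out in zip(sizes[:-1], sizes[1:])
    ]


# Three layers (L = 3) on a toy 4-dimensional input.
weights = init_weights([4, 8, 8, 1])
y_hat = forward([0.5, -1.0, 2.0, 0.0], weights, [elu, elu, identity])
```

Here `y_hat` plays the role of $h_L$, the network's estimate of the target.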

2.2 Variational Inference

The task of learning model parameters $W$ can also be stated as a Bayesian inference problem:

$$\underbrace{p(W \mid D)}_{\text{posterior}} = \frac{\overbrace{p(y \mid X, W)}^{\text{likelihood}} \; \overbrace{p(W)}^{\text{prior}}}{\underbrace{p(y \mid X)}_{\text{evidence}}} \tag{1}$$

for a given data set $D = \{(x_i, y_i)\}_{i=1}^N$, with inputs $X = \{x_i\}_{i=1}^N$ and targets $y = \{y_i\}_{i=1}^N$. Bayes’ theorem defines how we should update our beliefs as more information becomes available. As $N \to \infty$ the influence of the prior vanishes, while for $N \to 0$ we can only rely on prior assumptions. In our context we assume that experience (i.e. $N$) increases with age, and hence we use an approximation to Equation 1 with data-sets of varying size to represent agents of different ages.
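The shrinking influence of the prior can be illustrated with a toy conjugate model. The sketch below uses a Beta-Bernoulli model, which is not the model used in this work; the prior parameters and the assumed success rate of 0.7 are arbitrary choices for illustration.

```python
def posterior_mean(successes, n, prior_a=1.0, prior_b=9.0):
    """Posterior mean of a Bernoulli rate under a Beta(prior_a, prior_b) prior."""
    return (prior_a + successes) / (prior_a + prior_b + n)


# With N = 0 the estimate equals the prior mean (0.1); as N grows it
# approaches the empirical rate (here 7 successes out of every 10 trials).
means = {n: posterior_mean(7 * n // 10, n) for n in (0, 10, 100, 10000)}
```

For small $N$ the estimate stays close to the prior mean, while for large $N$ it is determined almost entirely by the data.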

Equation 1 is in general hard to compute for models of useful complexity. Variational inference offers a tractable approximation to Equation 1 ( ). Let $q_\phi(W)$ be a distribution with parameters $\phi$ that approximates the true posterior $p(W \mid D)$. Formulating the problem as a minimization of the Kullback-Leibler (KL) divergence between $q_\phi(W)$ and $p(W \mid D)$ leads to the evidence lower bound (ELBO):

$$\mathcal{L}(\phi) = \mathbb{E}_{q_\phi(W)}\left[\log p(y \mid X, W)\right] - \mathrm{KL}\left(q_\phi(W) \,\|\, p(W)\right) \tag{2}$$

which can be maximized with respect to $\phi$ using standard optimization techniques.
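As a rough illustration of the two ELBO terms, the following sketch computes a single-sample estimate for a linear model with a fully factorized Gaussian $q_\phi(W)$. Two simplifications are assumed here and differ from the setup in this work: the prior is a standard normal (so the KL term has a closed form) rather than a horseshoe prior, and the likelihood is Gaussian with unit noise. All function names are hypothetical.

```python
import math
import random


def gaussian_log_lik(y, mean, noise_std=1.0):
    """Log-density of y under N(mean, noise_std^2)."""
    return (-0.5 * math.log(2 * math.pi * noise_std ** 2)
            - 0.5 * ((y - mean) / noise_std) ** 2)


def kl_to_standard_normal(mu, sigma):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over independent weights."""
    return sum(0.5 * (s * s + m * m - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))


def elbo_estimate(X, y, mu, sigma, rng):
    """Single-sample Monte Carlo estimate of the ELBO for a linear model."""
    # Reparameterised weight sample: w = mu + sigma * eps, eps ~ N(0, 1).
    w = [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
    log_lik = sum(
        gaussian_log_lik(y_i, sum(w_j * x_j for w_j, x_j in zip(w, x_i)))
        for x_i, y_i in zip(X, y)
    )
    return log_lik - kl_to_standard_normal(mu, sigma)


value = elbo_estimate([[1.0], [2.0]], [1.0, 2.0], [1.0], [0.1], random.Random(0))
```

In practice the parameters `mu` and `sigma` would be updated by gradient ascent on this quantity with an automatic differentiation library.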

In order to scale to large data-sets, Equation 2 is often approximated using batches $B \subseteq D$ of size $M$, with the log-likelihood term being scaled appropriately:

$$\log p(y \mid X, W) \approx \frac{N}{M} \sum_{i \in B} \log p(y_i \mid x_i, W) \tag{3}$$

Note that only the first term of the ELBO depends on the data-set size $N$, while the second term is independent of it. Hence the divergence term will dominate for small data-sets, leading to models that closely reflect our prior assumptions. In this work we employ priors that promote simple functions. As a result, our models capture successively more complex functions as the data-set size increases.
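The effect of the $N/M$ scaling can be made concrete with a small numerical sketch. The per-example log-likelihood of $-0.5$ and the KL value of $100$ are made-up numbers chosen only to show how the balance between the two terms shifts with $N$; the function names are hypothetical.

```python
def scaled_batch_log_lik(batch_log_liks, n_total):
    """Rescale the summed batch log-likelihood by N / M (Equation 3)."""
    m = len(batch_log_liks)
    return n_total / m * sum(batch_log_liks)


def minibatch_elbo(batch_log_liks, n_total, kl):
    """Mini-batch ELBO: scaled log-likelihood minus the (unscaled) KL term."""
    return scaled_batch_log_lik(batch_log_liks, n_total) - kl


batch = [-0.5] * 64   # illustrative per-example log-likelihoods, M = 64
kl = 100.0            # illustrative KL value, independent of N

small = scaled_batch_log_lik(batch, 256)    # -128.0, same order as the KL term
large = scaled_batch_log_lik(batch, 8192)   # -4096.0, dwarfs the KL term
```

With these numbers the likelihood term has roughly the same magnitude as the KL term for $N = 256$, but is about forty times larger for $N = 8192$, so the prior matters far less for the larger data-set.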

2.3 Implementation Details

All models consist of $L = 3$ layers with hidden layer sizes $|h_l| = 256$ and ELU activation functions ( ), unless otherwise mentioned. Inputs $x$ correspond to flattened images of the scene and targets $y$ depend on the current task. We place a group horseshoe prior, which can be viewed as a continuous relaxation of a spike-and-slab prior ( ), over all parameters:

$$s \sim \mathcal{C}^+(0, \tau_0); \quad \tilde{z}_i \sim \mathcal{C}^+(0, 1);$$

$$\tilde{w}_{ij} \sim \mathcal{N}(0, 1); \quad w_{ij} = \tilde{w}_{ij} \, \tilde{z}_i \, s$$

and represent the approximate posterior $q_\phi(W)$ through a fully factorized distribution as proposed in ( ). The sparsity hyperparameter of the horseshoe prior is fixed to $\tau_0 = 10^{-5}$. During training we approximate the expectation of the log-likelihood term with a single sample from $q_\phi(W)$ and make use of the local reparametrization trick ( ). Gradient-based optimization is performed using Adam ( ) with batches consisting of $64$ samples. Results reported after training correspond to a Monte-Carlo estimate using 100 samples from $q_\phi(W)$.
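To make the generative structure of the prior concrete, the following sketch draws one weight matrix from it. It assumes that the group index $i$ runs over input units (one local scale per row of the matrix); the function names are hypothetical, and the sketch only samples from the prior rather than performing inference.

```python
import math
import random


def half_cauchy(scale, rng):
    """Draw from C+(0, scale) via the Cauchy inverse CDF, folded at zero."""
    return abs(scale * math.tan(math.pi * (rng.random() - 0.5)))


def sample_group_horseshoe(n_in, n_out, tau0=1e-5, seed=0):
    """Sample an n_in x n_out weight matrix from the group horseshoe prior."""
    rng = random.Random(seed)
    s = half_cauchy(tau0, rng)               # global scale s ~ C+(0, tau0)
    weights = []
    for _ in range(n_in):
        z_i = half_cauchy(1.0, rng)          # local scale z_i ~ C+(0, 1), shared by row i
        weights.append([rng.gauss(0.0, 1.0) * z_i * s for _ in range(n_out)])
    return weights


W = sample_group_horseshoe(3, 4)
```

Because $\tau_0$ is small, most sampled rows are close to zero, while the heavy tails of the half-Cauchy scales occasionally allow whole groups of weights to be large.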

3 Experiments

In this section we present an analysis of the proposed model on three different tasks adopted from the developmental psychology literature. The first two tasks involve reasoning about physical events (occlusion and support), while the last is concerned with the intuitive representation of quantities. For each task we include a summary of empirical observations made in children, alongside a comparison between these results and our models. Code for performing all experiments and generating artificial data-sets is publicly available at https://github.com/marcelbinz/Developmental-Stages-of-BNNs.

3.1 Occlusion Events

Figure 1 (panels: Original Task, Artificial Data, Results): Visualization of the occlusion event experiment. Left: Schematic illustration of the original setup. Figure adopted from ( ). Center: Examples from the artificial data-set. Right: Performance for models with access to different amounts of data in the different conditions. Baseline indicates the chance for guessing randomly. Models discover solutions in order of their difficulty (green → blue → red), which is in accordance with observations made in the developmental psychology literature.

The first task under investigation is concerned with occlusion events. It is based on an experiment conducted by Baillargeon (2002). Each scene consists of a cylinder and a screen in the form of a rectangular plane. During experimental manipulation the cylinder is moved back and forth behind the screen, while parts of the screen are removed. There are three different experimental conditions, which differ in the part of the screen that is removed (top, bottom or everything removed). A depiction of the setup is shown in Figure 1 (left). We are interested in infants’ ability to judge whether the cylinder remains visible as it moves behind the screen, which is measured via violation-of-expectation methods ( ). In violation-of-expectation methods, gazing times for physically implausible events are measured. High gazing times for such events indicate that children are surprised by the observation, which is interpreted as a violation of their expectation of what should have happened.

Empirical evidence indicates that infants form an initial concept based on a behind/not-behind distinction starting from the age of 2.5 months. At this stage they have learned that the cylinder remains visible while moving past the screen if the entire middle section of the screen is removed. They do not expect the cylinder to be visible in the other two conditions. This initial concept is refined during later stages of development. At the age of three months they also expect to see parts of the cylinder if only the bottom part of the screen is removed. The knowledge of 3.5-month-olds additionally extends to screens with removed top fractions. Note that the last condition is more challenging than the other two, as it involves a comparison of heights between the cylinder and the lower connection of the screen (if the connection height is lower than the cylinder, the cylinder will be visible, otherwise it will not be visible). Baillargeon et al. conclude from these observations that infants start with initially simple rules about the laws of physics (behind/not-behind distinction), which become successively more sophisticated over time (reasoning about relative heights).

Figure 2 (panels: Original Task, Artificial Data, Results): Visualization of the support event experiment. Left: Schematic illustration of the original setup. Figure adopted from ( ). Center: Examples from the artificial data-set alongside the name of the corresponding condition (top, side, amount, prop.). The left column shows stable configurations, while those in the right column are unstable. Right: Result from training models with access to different amounts of data. Baseline indicates the performance for guessing randomly. Models discover solutions to the easier conditions (two upper rows in the left figure) first, and solutions to the harder conditions (two bottom rows in the left figure) later. Note the slight increase in the number of samples required for learning between the third and fourth condition.

We construct an artificial version of this task as follows. Each input $x$ corresponds to a $24 \times 24$ image with three channels, containing segregated information about the floor, the cylinder and the screen respectively. Input images show an initial configuration of the scene, in which the cylinder is located on the left side of the screen. Each target $y$ indicates the visibility of the cylinder when passing behind the screen, as it is moved towards the right side of the image. Cylinder height and position, screen position and size, and floor level are determined randomly. We generate a training and a test set, each consisting of 10000 data-points. Figure 1 (center) shows examples for each of the conditions. Both sets include 2000 images for each of the three conditions, as well as 4000 baseline images (nothing of the screen is removed). This ensures that both $y = 0$ (cylinder is visible) and $y = 1$ (cylinder is not visible) are represented in equal fractions, i.e. the chance of guessing correctly is 50 percent.

We train otherwise identical models for $N \in \{256, 512, 1024, 2048, 8192\}$ until convergence and report results averaged over ten random seeds. The resulting performances on the artificial data-set are visualized in Figure 1 (right). We observe that the percentage of incorrect predictions, similar to the developmental progress in infants, decreases first in the easier conditions. The network is able to predict visibility of the cylinder reliably (with less than ten percent errors) if the entire middle section of the screen is removed, for $N$ larger than 512. For $N$ larger than 4096 it is additionally able to predict the correct targets if the bottom part is removed (corresponding to the knowledge of a three-month-old). This extends to the condition in which the top part is removed for $N = 8192$ (corresponding to the knowledge of a 3.5-month-old). Hence we conclude that in this task the family of BNNs recovers the order of developmental stages observed in infants.

3.2 Support Events

Next we take a closer look at infants’ knowledge of support events, for which we adopt another experiment of Baillargeon (2002). In this task a scene consists of a box and a platform. The box is presented in different positions relative to the platform and the experimenter measures (via violation-of-expectation methods) whether infants are able to predict whether the given configuration is stable or not. Four different conditions are investigated. In the first, the box is positioned either on top of the platform or some distance away from it. This condition requires a simple contact/no-contact distinction to make reliable predictions about the stability of the configuration. The second condition involves a distinction between different types of contact. Here the box connects with the platform either on the top (as before) or on the side. The third condition requires judgments based on the amount of contact, i.e. the box is only partially positioned on the platform. The final condition adds a further layer of complexity, as it involves reasoning about non-rectangular shapes. The different conditions are summarized in Figure 2 (left).

According to Baillargeon (2002), from three months onward infants’ knowledge about the stability of a configuration is captured through a contact/no-contact distinction. At this stage they expect the box to be stable if and only if it touches the platform in some way. This initial hypothesis is then refined as they grow older. At the age of around five months infants begin to distinguish between different types of contact. They realize that the box will only be stable if it is positioned above the platform, but not if it touches it on the side. Starting at an age of 6.5 months they are able to take into account the center of mass of simple objects (rectangular boxes) when reasoning about stability. This is extended to more complex, asymmetrical shapes at an age of roughly twelve months. As in the occlusion task, infants start with an initially simple hypothesis of how the laws of physics work, which is subsequently refined to better fit the observed data.

In our artificial version of this task, inputs $x$ are represented as $24 \times 24$ images with three channels (one each for platform, box and floor) and targets $y$ are a binary indicator of the stability of the given configuration. Floor level as well as the size and position of the platform and the box are randomized. Again we generate a training set of 10000 samples and an equally large test set. In both sets 2500 data-points belong to each of the described conditions. Within each condition the number of stable and unstable configurations is balanced, leading to a chance of 50 percent for guessing correctly. Example configurations are shown in Figure 2 (center).

Figure 3 (panels: Original Task, Artificial Data, Results): Visualization of the approximate number system experiment. Left: Schematic illustration of the original setup. Figure adopted from ( ). Center: Examples from the artificial data-set. This pair corresponds to a ratio of 6:5. Right: Weber ratios of the experimental data. Red coloring corresponds to estimates from human participants, while blue coloring corresponds to estimates from optimized BNNs. The progression of Weber ratios follows a Weber-Fechner Law and is well described through exponential functions in both cases.

As before, we train otherwise identical models for $N \in \{256, 512, 1024, 2048, 8192\}$ until convergence and report results averaged over ten random seeds. Inspecting Figure 2 (right), we observe that the family of BNNs first discovers solutions to the easier conditions (those where the box is positioned fully on the platform or on the side of it). If we increase $N$ to $4096$ or more, they are also able to reason reliably about stability in both of the center-of-mass conditions. Note that the error rate decreases more slowly for the more complex, L-shaped objects, although this difference is only marginal. We conclude that, as we increase the data-set size, the models show a pattern similar to the developmental progress of infants, akin to the observation from experiment 1, although not as pronounced.

3.3 Approximate Number System

Moving away from probing knowledge about physical laws, we next inspect a different domain: children’s intuitive counting abilities ( ). A single trial consists of two images, each containing between 1 and 10 items, and the goal is to determine quickly, i.e. without too much deliberation time, which of the two images contains the larger quantity of items. The display time is adjusted depending on age, such that it is short enough to prevent serial counting. Objects within a trial are identical, but are selected from a set of different objects across trials. Two conditions control for either the average item size or the summed continuous area. An example trial from the original task is shown in Figure 3 (left).

Experimental data for this task has been obtained for three- to six-year-olds, as well as for adults ( ). Human perception is sensitive to the ratio between the numbers of objects in the two images and not to their difference, i.e. it follows a Weber-Fechner law. Levels of accuracy are measured for different ratios of objects, ranging from 1.11 (10:9) to 2.0 (2:1), from which Weber ratios are estimated. The Weber ratio is the smallest ratio at which a participant is able to identify the correct stimulus in more than 75 percent of the cases. Empirical results indicate that the Weber ratio decreases during childhood, i.e. our ability to distinguish similar stimuli improves over time. Prior work ( ) estimates Weber ratios of around 1.53 in three-year-olds, which improve to 1.38 in four-year-olds, to 1.23 in five-year-olds, to 1.18 in six-year-olds and to 1.11 in adults. Overall, the decrease of Weber ratios is well described through an exponential function of age (see Figure 3, right).

We created an artificial version of this task, with inputs $x$ corresponding to two $24 \times 24$ images. For simplicity, images contain only rectangular objects and the difference in quantity within a pair of images is always one. Targets $y$ are the number of objects in each image, and the final prediction is obtained by comparing expected values between the two estimates. Object sizes are randomly drawn from $\{1, 2, 3\}$ and their positions are randomized such that neighbouring objects do not overlap. We controlled for an equal expected total area between both images, as done in the second condition of the original task. Both training and test set contain 6000 samples for each ratio from $\{$10:9, 9:8, 8:7, 7:6, 6:5, 5:4, 4:3, 3:2, 2:1$\}$. Examples are provided in Figure 3 (center).
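The comparison of expected values can be sketched as follows. The sketch assumes that the model outputs, for each image, a categorical distribution over the counts 1 to 10; this output format and the function names are assumptions for illustration and are not confirmed by the text.

```python
def expected_count(probs, counts=range(1, 11)):
    """Expected number of objects under a categorical distribution over counts."""
    return sum(p * c for p, c in zip(probs, counts))


def larger_image(probs_left, probs_right):
    """Return which image is predicted to contain more objects (ties go right)."""
    if expected_count(probs_left) > expected_count(probs_right):
        return "left"
    return "right"


def one_hot(count, n=10):
    """Hypothetical helper: a distribution that puts all mass on `count`."""
    return [1.0 if c == count else 0.0 for c in range(1, n + 1)]


prediction = larger_image(one_hot(6), one_hot(5))
```

A trial then counts as correct if the returned side matches the image that actually contains more objects.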

We train one model for each value of $N \in \{4, \dots, 15\} \cdot 2048$. Hidden layers have 512 units in this experiment and we found it helpful to initialize weights for all models from a pretrained network. For the resulting models we calculate Weber ratios after estimating accuracy levels of 75 percent via linear interpolation and visualize the results in Figure 3 (right). We observe an improvement of Weber ratios as the data-set size $N$ increases and conclude that our models also follow a Weber-Fechner law. The resulting progression is well described through an exponential function of data-set size (footnote 2: $y = a e^{-bx + c} + d$, with $x$ corresponding to data-set size or age), a characteristic shared with the curve obtained from human participants. There is, however, a small gap between the overall model performance in the artificial task and that of human participants in the original one, even for large data-sets.
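One way to obtain a Weber ratio from measured accuracies is sketched below: the accuracies are interpolated linearly between neighbouring tested ratios and the ratio at which the curve first reaches 75 percent is returned. The accuracy values are invented for illustration, and the exact interpolation procedure used for the reported results may differ.

```python
def weber_ratio(ratios, accuracies, threshold=0.75):
    """Interpolate linearly to the ratio at which accuracy first reaches `threshold`."""
    pairs = sorted(zip(ratios, accuracies))
    if pairs[0][1] >= threshold:
        return pairs[0][0]
    for (r0, a0), (r1, a1) in zip(pairs, pairs[1:]):
        if a0 < threshold <= a1:
            return r0 + (threshold - a0) * (r1 - r0) / (a1 - a0)
    return None  # threshold never reached within the tested range


# Ratios used in the task, from 10:9 up to 2:1.
RATIOS = [10 / 9, 9 / 8, 8 / 7, 7 / 6, 6 / 5, 5 / 4, 4 / 3, 3 / 2, 2 / 1]
# Made-up accuracies for one hypothetical model, in the same order.
ACCURACIES = [0.62, 0.66, 0.70, 0.73, 0.78, 0.84, 0.90, 0.95, 0.99]

w = weber_ratio(RATIOS, ACCURACIES)
```

For these invented accuracies the threshold is crossed between the ratios 7:6 and 6:5, so the estimated Weber ratio lies between those two values.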

4 Discussion

We investigated the progress of Bayesian Neural Networks with access to increasingly large data-sets on three different tasks. In all three examples we find at least partial agreement between the development of our artificial learning systems and findings from the developmental psychology literature. However, we also observe some considerable differences. The best performing BNNs in the quantity comparison task, for example, do not reach the level of human adults. We attribute this effect to the difficulty standard neural network architectures have with relational reasoning tasks and hypothesize that recent advances in visual relational reasoning ( ) could close this gap. In general we interpret our results as additional evidence for Bayesian theories of cognition and learning ( ).

The discriminative models employed in this work require large amounts of input-target pairs to obtain the desired result. Children, on the other hand, have to operate in a much more data-efficient manner, as they do not have constant access to a teacher providing correct targets. One approach to the question of which sources children use for learning is offered by generative models, which are able to discover underlying structure without explicitly provided targets. Whether our results transfer to generative models remains to be seen. Indeed, applying generative models in this context would enable us to measure performance in artificial systems directly via violation-of-expectation methods, as was done with infants in two of our examples. In the future it would be natural to extend our work to more realistic settings and apply different architectures, such as recurrent or convolutional networks.

We believe there are exciting opportunities for research at the intersection of machine learning and developmental psychology. On the one hand, insights from developmental psychology can provide guidelines for how to build more intelligent, human-like systems. On the other hand, the machine learning framework enables researchers to formulate normative theories that can be empirically verified. We already see some progress in these areas. Examples include the use of violation-of-expectation methods for probing the knowledge of deep networks ( ) or the proposal to select training curricula for machine learning systems based on how children obtain samples ( ).

5 Acknowledgments

This work was supported by the DFG GRK-RTG 2271 ’Breaking Expectations’.

References

Baillargeon, R. (2002). The acquisition of physical knowledge in infancy: A summary in eight lessons. Blackwell handbook of childhood cognitive development, 1(46-83), 1.

Baillargeon, R., Li, J., Ng, W., & Yuan, S. (2009). An account of infants’ physical reasoning. Learning and the infant mind, 66–116.

Baillargeon, R., Spelke, E. S., & Wasserman, S. (1985). Object permanence in five-month-old infants. Cognition, 20(3), 191–208.

Baker, C. L., Saxe, R., & Tenenbaum, J. B. (2009). Action understanding as inverse planning. Cognition, 113(3), 329–349.

Battaglia, P. W., Hamrick, J. B., & Tenenbaum, J. B. (2013). Simulation as an engine of physical scene understanding. Proceedings of the National Academy of Sciences, 201306572.

Chang, M. B., Ullman, T., Torralba, A., & Tenenbaum, J. B. (2016). A compositional object-based approach to learning physical dynamics. arXiv preprint arXiv:1612.00341.

Clevert, D.-A., Unterthiner, T., & Hochreiter, S. (2015). Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289.

Griffiths, T. L., Kemp, C., & Tenenbaum, J. B. (2008). Bayesian models of cognition.

Halberda, J., & Feigenson, L. (2008). Developmental change in the acuity of the "Number Sense": The Approximate Number System in 3-, 4-, 5-, and 6-year-olds and adults. Developmental Psychology, 44(5), 1457.

Hinton, G. E., & Van Camp, D. (1993). Keeping the neural networks simple by minimizing the description length of the weights. In Proceedings of the sixth annual conference on Computational learning theory (pp. 5–13).

Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Kingma, D. P., Salimans, T., & Welling, M. (2015). Variational dropout and the local reparameterization trick. In Advances in Neural Information Processing Systems (pp. 2575–2583).

Lake, B. M., Ullman, T. D., Tenenbaum, J. B., & Gershman, S. J. (2017). Building machines that learn and think like people. Behavioral and Brain Sciences, 40.

Lerer, A., Gross, S., & Fergus, R. (2016). Learning physical intuition of block towers by example. arXiv preprint arXiv:1603.01312.

Louizos, C., Ullrich, K., & Welling, M. (2017). Bayesian compression for deep learning. In Advances in Neural Information Processing Systems (pp. 3288–3298).

Mitchell, T. J., & Beauchamp, J. J. (1988). Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404), 1023–1032.

Piloto, L., Weinstein, A., TB, D., Ahuja, A., Mirza, M., Wayne, G., … Botvinick, M. M. (2018). Probing physics knowledge using tools from developmental psychology.

Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., & Lillicrap, T. (2017). A simple neural network module for relational reasoning. In Advances in Neural Information Processing Systems (pp. 4967–4976).

Siegler, R. S., & Chen, Z. (1998).
\BBOQ\APACrefatitleDevelopmental differences in rule learning: A microgenetic analysis Developmental differences in rule learning: A microgenetic analysis.\BBCQ \APACjournalVolNumPagesCognitive psychology363273–310. \PrintBackRefs\CurrentBib 20 \APACyear2017 Smith \BBA Slone Smith \BBA Slone Smith \BBA Slone (\APACyear2017) 10.3389/fpsyg.2017.02124 \APACinsertmetastar10.3389/fpsyg.2017.02124{APACrefauthors}Smith, L\BPBIB.\BCBT \BBA Slone, L\BPBIK. \APACrefYearMonthDay2017. \BBOQ\APACrefatitleA Developmental Approach to Machine Learning? A developmental approach to machine learning?\BBCQ \APACjournalVolNumPagesFrontiers in Psychology82124. \PrintBackRefs\CurrentBib 21 \APACyear2007 Spelke \BBA Kinzler Spelke \BBA Kinzler Spelke \BBA Kinzler (\APACyear2007) spelke2007core \APACinsertmetastarspelke2007core{APACrefauthors}Spelke, E\BPBIS.\BCBT \BBA Kinzler, K\BPBID. \APACrefYearMonthDay2007. \BBOQ\APACrefatitleCore knowledge Core knowledge.\BBCQ \APACjournalVolNumPagesDevelopmental science10189–96. \PrintBackRefs\CurrentBib