Evaluating Robots Like Human Infants: A Case Study of Learned Bipedal Locomotion

Devin Crowley¹, Whitney G. Cole², Christina M. Hospodar³, Ruiting Shen⁴, Karen E. Adolph⁵, and Alan Fern⁶

*This work was supported under NSF grant number 2321851.

¹Devin Crowley is with the Department of Electrical Engineering and Computer Sciences, Oregon State University, crowleyd@oregonstate.edu

²Whitney G. Cole is with the Department of Psychology, New York University, wgcole@nyu.edu

³Christina M. Hospodar is with the Department of Psychology, New York University, christina.hospodar@nyu.edu

⁴Ruiting Shen is with the Department of Psychology, New York University, rs8422@nyu.edu

⁵Karen E. Adolph is with the Department of Psychology, New York University, karen.adolph@nyu.edu

⁶Alan Fern is with the Department of Electrical Engineering and Computer Sciences, Oregon State University, alan.fern@oregonstate.edu

Typically, learned robot controllers are trained via relatively unsystematic regimens and evaluated with coarse-grained outcome measures such as average cumulative reward. The typical approach is useful to compare learning algorithms but provides limited insight into the effects of different training regimens and little understanding about the richness and complexity of learned behaviors. Likewise, human infants and other animals are “trained” via unsystematic regimens, but in contrast, developmental psychologists evaluate their performance in highly-controlled experiments with fine-grained measures such as success, speed of walking, and prospective adjustments. However, the study of learned behavior in human infants is limited by the practical constraints of training and testing babies. Here, we present a case study that applies methods from developmental psychology to study the learned behavior of the simulated bipedal robot Cassie. Following research on infant walking, we systematically designed reinforcement learning training regimens and tested the resulting controllers in simulated environments analogous to those used for babies—but without the practical constraints. Results reveal new insights into the behavioral impact of different training regimens and the development of Cassie’s learned behaviors relative to infants who are learning to walk. This interdisciplinary baby-robot approach provides inspiration for future research designed to systematically test effects of training on the development of complex learned robot behaviors.

Fig. 1: Test environments with infants in the real world and the bipedal robot Cassie in simulation. Left-to-right: slopes, drop-offs, gaps, and bridges.

I. Introduction

Consider a visually-guided bipedal robot trained in simulation to walk over challenging terrain with varied elevations and obstacles in the path. Typically, training regimens and test situations are not manipulated systematically, and performance evaluations report only crude metrics (e.g., average cumulative reward over multiple test environments). Albeit useful to roughly rank learned controllers, the typical approach to training, testing, and evaluation cannot reveal how different training regimens affect the complex details of learned behaviors like gait modifications while approaching and navigating varied terrain. For example, two different controllers with similar reward functions may nonetheless behave very differently when walking down a steep slope. One controller might leverage its vision and adjust its gait prospectively while approaching the slope, whereas the other controller may adjust reactively after stepping on the slope.

Inspired by developmental research with human infants, we advocate for a more systematic approach to training and testing, and a more detailed evaluation of the learned behaviors. To promote this approach, this paper presents a case study that adapts developmental research on infant locomotion to learned, visually-guided locomotion controllers for the Cassie bipedal robot. In particular, we used simulated test environments with the experimental apparatuses from infant studies (slopes, drop-offs, gaps, and bridges) and conducted similar experiments and analyses [] (see Fig. 1).

Testing in simulation removes the constraints posed by real-world robot experiments, allowing fine-grained comparisons of different training regimens. Simulation-based work is often accompanied by a sim-to-real transfer method to produce a controller that functions in the real world. However, addressing the reality gap is orthogonal to the investigations of this work. We test in simulation without transfer to the real world because our objective is to study the development of locomotion capabilities, not to produce a working real-world controller.

This robot evaluation approach can advance behavioral research with robots by guiding robot training regimens and informing decisions about which controller works best in real-world operating environments. This approach may also inform behavioral research with infants by revealing the benefits and constraints of reinforcement learning (RL) under various training regimens, including regimens that would be unethical or impractical for human babies.

II. Background

Our case study follows prior work on learning locomotion controllers for the Cassie robot and analyzes behavior using established methods from research with walking infants.

II-A. Learning in the Development of Infant Locomotion

How do babies learn to navigate varied terrain such as steep slopes, high drop-offs, wide gaps, and narrow bridges, as illustrated in Fig. 1? Infants learn to walk amidst continual changes in their environments and skills. Features of the environment are variable (ground surfaces can be flat or sloping, rigid or deformable, high-traction or slippery; the path can be clear or cluttered with obstacles and elevations) and infants’ walking skill improves dramatically over the first several months after walk onset []. During natural locomotion, infants’ walking paths are curved, not straight; their steps are omnidirectional, not forward; and their locomotion is intermittent rather than continuous []. Infants’ everyday life creates a natural training regimen, with toys scattered on the ground, laundry heaps in the corner, furniture obstructing the path, and so on. For infants, learning to walk means learning to modify gait from step to step to navigate a varied environment.

Developmental researchers test novice and experienced walking infants on challenging terrain, such as slopes, drop-offs, gaps, and bridges of varying difficulties, because the obstacles are novel (no baby encounters steep slopes, narrow bridges, etc. during everyday life). Thus, researchers can test generalization from everyday experience to novel situations based on whether babies modify their gait prospectively while approaching the obstacle or reactively after stepping onto the obstacle. In all cases, the best solution is to modify gait prospectively on approach. Reactive adjustments while traversing the obstacles are less optimal because gravity takes over and gait modifications entail fighting to keep the body over the moving base of support. The strongest evidence for prospective control is changes in foot placement, step length, and walking speed before stepping onto the obstacle—rather than modifying gait after stepping onto the obstacle.

Prior work suggests the ability to modify gait develops with walking experience []. Novice infants walk blithely over the brink of impossibly steep slopes, high drop-offs, wide gaps, and narrow bridges (requiring rescue from an experimenter). After several months of everyday walking experience—notably, with no prior experience on the test obstacles—infants modify their gait prospectively. For every obstacle, experienced infants slow down and shorten their steps during approach. For slopes, their initial steps on the slope are short and slow, they widen their base of support, and brake forward momentum from step to step []. For drop-offs, they place their stance foot close to the brink so that their moving foot can stretch down to the bottom of the precipice all the while keeping their body upright as it drops vertically onto their moving foot []. For gaps, they place their stance foot close to the brink of the gap, increase step length with the moving foot, and place their moving foot close to the far side of the gap. For bridges, infants place their body at the near side of the bridge, and take short, slow, narrow steps [].

Developmental researchers hypothesize that infants learn to modify gait prospectively as they accumulate everyday experience navigating an ever-changing environment. Practice doing the same thing repeatedly (e.g., stepping on a treadmill) and training to the test (e.g., repeated practice walking down slopes) do not contribute to the development of prospective gait modifications []. However, developmental researchers cannot control infants’ natural, everyday training regimens and must accommodate practical limitations for testing infants. Babies can complete only a few dozen trials before becoming tired or fussy, and often learn gait modifications during testing. Such limitations do not exist for simulated robots. Thus, simulated robots provide a powerful, highly-controlled platform for testing effects of experience on motor learning, with the potential to serve as a surrogate for studying how infants learn to walk over varied terrain.

II-B. RL for Bipedal Locomotion

Controller Architecture. We tested the behavior of controllers for the Cassie bipedal robot trained with RL using a learning framework from prior work [], lightly adapted for purely simulation-based experimentation. This framework layers a visually-guided residual controller on top of the outputs of a frozen blind controller trained only on flat terrain. The controllers are represented as recurrent long short-term memory (LSTM) neural networks. Their outputs are proportional-derivative (PD) control targets, set at 50 Hz for 10 actuated joints centered around a standing pose. A static PD controller uses these targets to set the motor torques at 2000 Hz. A schematic of this learning framework is shown in Fig. 2. Training a slower controller to set targets for a faster PD controller is common in RL for locomotion [] and has strong precedent on the Cassie platform [].
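The two-rate structure described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the 50 Hz and 2000 Hz rates and the 10 joints come from the text, while the gains `KP` and `KD`, the all-zero `STANDING_POSE`, the zero target velocity, and the toy `unit_mass_plant` standing in for the simulator are all assumptions made only so the example runs.

```python
import numpy as np

POLICY_HZ = 50        # learned controllers set PD targets at this rate (from the text)
PD_HZ = 2000          # static PD controller sets motor torques at this rate (from the text)
NUM_JOINTS = 10       # actuated joints (from the text)
PD_STEPS_PER_POLICY_STEP = PD_HZ // POLICY_HZ  # 40 torque updates per target update

# Hypothetical gains and standing pose; the paper does not report these values.
KP = np.full(NUM_JOINTS, 80.0)
KD = np.full(NUM_JOINTS, 8.0)
STANDING_POSE = np.zeros(NUM_JOINTS)


def pd_targets(blind_output, residual_output):
    """Joint targets: standing pose plus blind output plus visual residual."""
    return STANDING_POSE + blind_output + residual_output


def pd_torque(target, q, qd):
    """PD law with zero target velocity (an assumption of this sketch)."""
    return KP * (target - q) - KD * qd


def unit_mass_plant(q, qd, tau, dt=1.0 / PD_HZ):
    """Toy stand-in for the simulator: each joint is an independent unit mass."""
    qd = qd + tau * dt
    q = q + qd * dt
    return q, qd


def control_step(blind_output, residual_output, q, qd, plant_step=unit_mass_plant):
    """Hold one set of targets fixed while the PD loop runs 40 times."""
    target = pd_targets(blind_output, residual_output)
    for _ in range(PD_STEPS_PER_POLICY_STEP):
        q, qd = plant_step(q, qd, pd_torque(target, q, qd))
    return q, qd


# One 50 Hz policy step: the joints start at rest and move toward a 0.1 rad target.
q, qd = control_step(
    blind_output=np.full(NUM_JOINTS, 0.1),
    residual_output=np.zeros(NUM_JOINTS),
    q=np.zeros(NUM_JOINTS),
    qd=np.zeros(NUM_JOINTS),
)
```

Because the blind controller is frozen, a residual of zero reproduces the blind behavior exactly, which is one reason a residual layout is a convenient way to add vision to an existing controller.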

The input to the blind controller includes: (1) a 35-dimensional vector containing the positions and velocities of all joints, plus the pelvis orientation and rotational velocity; (2) a clock signal that dictates the cadence of the footsteps; (3) gait parameters that modulate the clock signal; and (4) commands for forward speed, lateral speed, and turn rate. A grid of ground-truth terrain heights taken from the simulated terrain serves as additional input to the visually-guided controller. The grid is a 1m wide by 1.5m long rectangle of 20 by 30 values respectively in front of Cassie.
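As a rough illustration of the terrain input, the sketch below samples a 30 by 20 grid of heights over a 1.5 m by 1 m rectangle in front of the robot. The grid dimensions come from the text; the placement of samples at cell centers, the grid starting at the robot's position, and the example `drop_off` terrain function are assumptions, not details from the paper.

```python
import numpy as np

GRID_WIDTH_M = 1.0    # lateral extent (from the text)
GRID_LENGTH_M = 1.5   # forward extent (from the text)
GRID_COLS = 20        # samples across the width (from the text)
GRID_ROWS = 30        # samples along the length (from the text)


def heightmap_points(x, y, yaw):
    """World-frame sample locations for the grid in front of a robot at (x, y, yaw)."""
    # Assumption: samples sit at cell centers and the grid begins at the robot.
    forward = (np.arange(GRID_ROWS) + 0.5) * (GRID_LENGTH_M / GRID_ROWS)
    lateral = (np.arange(GRID_COLS) + 0.5) * (GRID_WIDTH_M / GRID_COLS) - GRID_WIDTH_M / 2
    fwd, lat = np.meshgrid(forward, lateral, indexing="ij")  # each of shape (30, 20)
    c, s = np.cos(yaw), np.sin(yaw)
    world_x = x + c * fwd - s * lat
    world_y = y + s * fwd + c * lat
    return world_x, world_y


def sample_heightmap(terrain_height, x, y, yaw):
    """Query ground-truth terrain heights at every grid point."""
    world_x, world_y = heightmap_points(x, y, yaw)
    return terrain_height(world_x, world_y)


def drop_off(world_x, world_y, edge_x=1.0, drop_m=0.5):
    """Hypothetical terrain: flat ground that steps down by drop_m at x = edge_x."""
    return np.where(world_x < edge_x, 0.0, -drop_m)


heights = sample_heightmap(drop_off, x=0.0, y=0.0, yaw=0.0)
```

With these assumptions the controller receives 600 height values per step, and the nearest rows of the grid show flat ground while the farthest rows show the lower level beyond the edge.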

In addition to PD target residuals, the visually-guided controller outputs the clock progression speed and the phase-offset between the feet. This allows the controller to modulate the frequency of footsteps and adjust the left-right cadence. In principle, this means the controller can learn to produce asymmetric 2-beat patterns like skipping rather than performing only the basic, symmetrical, left-right gait pattern. Both blind and visually-guided controllers use the same reward function. It encourages adherence to the commands, minimized motor torque, and footsteps in accordance with the clock signal. See [] for further details.
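To make the role of the two clock outputs concrete, the following sketch models the clock as a phase variable that advances at a controllable speed, with a controllable offset between the feet. The sinusoidal form, the function names `advance_phase` and `foot_clocks`, and the units (cycles per second, fractions of a cycle) are assumptions for illustration; the actual clock formulation is given in the prior work the text refers to.

```python
import math

POLICY_DT = 1.0 / 50  # controller update period in seconds (50 Hz, from the text)


def advance_phase(phase, progression_speed, dt=POLICY_DT):
    """Advance the gait phase in [0, 1); progression_speed is in cycles per second."""
    return (phase + progression_speed * dt) % 1.0


def foot_clocks(phase, phase_offset):
    """Return (left, right) clock values; the right foot is shifted by phase_offset cycles."""
    left = math.sin(2.0 * math.pi * phase)
    right = math.sin(2.0 * math.pi * (phase + phase_offset))
    return left, right


# An offset of half a cycle gives the symmetric left-right pattern.
symmetric = foot_clocks(phase=0.25, phase_offset=0.5)

# Any other offset gives an asymmetric two-beat pattern, as in skipping.
asymmetric = foot_clocks(phase=0.25, phase_offset=0.25)

# A larger progression speed means the same phase distance is covered in fewer steps.
next_phase = advance_phase(phase=0.9, progression_speed=10.0)
```

Under this sketch, raising the progression speed increases step frequency, and moving the offset away from half a cycle breaks left-right symmetry.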

Fig. 2: Controller schematic. The controller consists of a visually-guided component, trained on varied terrain, which modulates the output of a blind component trained on only flat ground.

Simulation Training. The controllers are trained using the actor-critic proximal policy optimization (PPO) algorithm with gradient clipping, a standard model-free RL algorithm [23]. The simulator used for both training and testing is the MuJoCo physics engine [24] using a model of the Cassie robot. Because we examined behavior only in simulation, we did not use dynamics randomization to aid in sim-to-real transfer as in prior work with Cassie [].
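For readers unfamiliar with PPO, the sketch below computes the clipped surrogate objective from the original PPO paper on a batch of samples. It is not the training code used here: the clip range of 0.2 is a common default rather than a value reported in this paper, and gradient-norm clipping, the critic loss, and the LSTM policies are all omitted.

```python
import numpy as np


def ppo_clip_objective(logp_new, logp_old, advantages, clip_eps=0.2):
    """Clipped surrogate objective (to be maximized) over a batch of samples.

    clip_eps=0.2 is an assumed default, not a value taken from this paper.
    """
    ratio = np.exp(logp_new - logp_old)            # pi_new(a|s) / pi_old(a|s)
    unclipped = ratio * advantages
    clipped = np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    # The element-wise minimum removes the incentive to move the ratio
    # outside the clip range in the direction that improves the objective.
    return float(np.mean(np.minimum(unclipped, clipped)))


# Identical policies: the objective reduces to the mean advantage.
same_policy = ppo_clip_objective(np.zeros(3), np.zeros(3), np.array([1.0, -2.0, 3.0]))

# The new policy doubles an action's probability: a positive advantage is capped
# at 1.2, but a negative advantage is penalized at the full ratio of 2.
capped_gain = ppo_clip_objective(np.log([2.0]), np.zeros(1), np.array([1.0]))
full_penalty = ppo_clip_objective(np.log([2.0]), np.zeros(1), np.array([-1.0]))
```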

III. Experimental Setup

Our analyses evaluate the performance and behavior of several visually-guided controllers, differing in the distribution of training terrains. All controllers output residual PD targets added to the outputs of the same pre-trained blind controller described in Section II-B.

III-A. Training Regimens

We trained Cassie on standard, multi-test obstacle, combined standard and multi-test, and single-test obstacle terrains. The standard training terrains replicate prior robotics work [] on which our learning framework is based: flat, hills, ridges, blocks, and stairs, shown in Fig. 3. The multi-test obstacle terrains are recreations of test apparatuses used in experiments with infants: slopes, drop-offs, gaps, and bridges, shown in Fig. 1 for infants in the real world and Cassie in simulation.

Fig. 3: Standard terrains. Left-to-right: flat, hills, ridges, blocks, and stairs.

Table I shows the frequency of each terrain for the standard, multi-test obstacle, combined standard and multi-test, and single-test obstacle training regimens. Cassie received 20k iterations for the single-test obstacle regimen and 110k iterations for each of the others. The standard regimen uses a range of commands for forward speed, lateral speed, and turn rate. The test-obstacle regimens use only one command: straight forward at 0.8 m/s.

TABLE I: Terrain frequencies by training regimen

Terrain set	Standard					Test Obstacles
Terrain	Flat	Hills	Ridges	Blocks	Stairs	Slopes	Drop-offs	Bridges	Gaps
Standard	3%	7%	20%	35%	35%	0%	0%	0%	0%
Multi-test	0%	0%	0%	0%	0%	25%	25%	25%	25%
Combined	1.5%	3.5%	10%	18.5%	18.5%	12.5%	12.5%	12.5%	12.5%
Single-test (slopes)	0%	0%	0%	0%	0%	100%	0%	0%	0%
Single-test (drop-offs)	0%	0%	0%	0%	0%	0%	100%	0%	0%
Single-test (bridges)	0%	0%	0%	0%	0%	0%	0%	100%	0%
Single-test (gaps)	0%	0%	0%	0%	0%	0%	0%	0%	100%
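One way to read Table I is as sampling weights over terrains. The sketch below encodes the table and draws a terrain for each training episode, and also computes how a regimen's total iteration budget divides across terrains. The per-episode independent sampling, the `REGIMENS` dictionary layout, and the function names are assumptions; the percentages and the 110k iteration budget come from the text.

```python
import random

TERRAINS = ["flat", "hills", "ridges", "blocks", "stairs",
            "slopes", "drop-offs", "bridges", "gaps"]

# Terrain frequencies in percent, copied from Table I.
# As printed, the Combined row sums to 102%, so the weights below are
# normalized rather than treated as exact probabilities.
REGIMENS = {
    "standard":                [3, 7, 20, 35, 35, 0, 0, 0, 0],
    "multi-test":              [0, 0, 0, 0, 0, 25, 25, 25, 25],
    "combined":                [1.5, 3.5, 10, 18.5, 18.5, 12.5, 12.5, 12.5, 12.5],
    "single-test (slopes)":    [0, 0, 0, 0, 0, 100, 0, 0, 0],
    "single-test (drop-offs)": [0, 0, 0, 0, 0, 0, 100, 0, 0],
    "single-test (bridges)":   [0, 0, 0, 0, 0, 0, 0, 100, 0],
    "single-test (gaps)":      [0, 0, 0, 0, 0, 0, 0, 0, 100],
}


def sample_terrain(regimen, rng):
    """Draw one terrain according to the regimen's (normalized) frequencies."""
    return rng.choices(TERRAINS, weights=REGIMENS[regimen], k=1)[0]


def expected_iterations(regimen, total_iterations):
    """Expected share of the iteration budget spent on each terrain."""
    weights = REGIMENS[regimen]
    total_weight = sum(weights)
    return {t: total_iterations * w / total_weight for t, w in zip(TERRAINS, weights)}


rng = random.Random(0)
episode_terrain = sample_terrain("single-test (slopes)", rng)
multi_test_budget = expected_iterations("multi-test", 110_000)
```

For the multi-test regimen this gives 27,500 expected iterations per test obstacle, compared with 20,000 iterations on a single obstacle in each single-test regimen.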

III-B. Testing Setup

Our controller evaluation mirrored tests with infants, so we used only the four obstacle terrains. Thus, terrains used in the obstacle training regimens are identical to those used in testing.

Cassie began each test trial facing the obstacle (slope, drop-off, gap, or bridge) at a distance of 3-3.5m, with a lateral offset of 0-0.25m. Each obstacle had a difficulty parameter, ranging from 0-1, randomized in training, that linearly adjusts the relevant property for each terrain. Cassie received 50 trials at each of 101 difficulty levels, for a total of 5050 trials per obstacle. The downward angle of slopes (1.5m long) ranged from 0-90°, the step-down height of drop-offs ranged from 0-1.5m, bridge width (1.5m long) ranged from 0.02-1.02m, and gap width ranged from 0-1m.
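The test protocol above can be written down compactly. The sketch below maps the difficulty parameter to each obstacle's physical property and enumerates the 5050 trials per obstacle. The ranges, trial counts, and starting distances come from the text; the assumption that difficulty 0 corresponds to the widest bridge (1.02m), the uniform sampling of start positions, and the names `obstacle_property` and `make_trials` are illustrative choices, not details from the paper.

```python
import random

TRIALS_PER_LEVEL = 50
DIFFICULTIES = [i / 100 for i in range(101)]  # 101 levels from 0.0 to 1.0


def obstacle_property(obstacle, difficulty):
    """Linearly map a difficulty in [0, 1] to the obstacle's defining property."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0, 1]")
    if obstacle == "slope":
        return 90.0 * difficulty          # downward angle in degrees
    if obstacle == "drop-off":
        return 1.5 * difficulty           # step-down height in meters
    if obstacle == "gap":
        return 1.0 * difficulty           # gap width in meters
    if obstacle == "bridge":
        # Assumption: narrower bridges are harder, so width shrinks with difficulty.
        return 1.02 - 1.0 * difficulty    # bridge width in meters
    raise ValueError(f"unknown obstacle: {obstacle}")


def make_trials(obstacle, seed=0):
    """Enumerate 50 trials at each of 101 difficulty levels with random start poses."""
    rng = random.Random(seed)
    trials = []
    for difficulty in DIFFICULTIES:
        for _ in range(TRIALS_PER_LEVEL):
            trials.append({
                "obstacle": obstacle,
                "difficulty": difficulty,
                "property": obstacle_property(obstacle, difficulty),
                "start_distance_m": rng.uniform(3.0, 3.5),    # assumed uniform
                "lateral_offset_m": rng.uniform(0.0, 0.25),   # assumed uniform
            })
    return trials


slope_trials = make_trials("slope")
```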

IV. Behavioral Analysis Results

Fig. 4: Evaluation on slopes, drop-offs, gaps, and bridges. Each column shows behavioral results for each test obstacle across continual, systematic increase in difficulty. Curves show blind, standard, multi-test obstacle, combined standard and multi-test obstacle, and single-test obstacle training regimens. Top row: Success rates for navigating the obstacles. Second row: Average speed of walking on or over the obstacle. Third row: Average speed of last two steps prior to the obstacle. Bottom row: Placement of last step relative to the edge of the obstacle.

IV-A. Success Rate

Training improved Cassie’s success at navigating obstacles, but performance varied depending on the obstacle (Fig. 4, top row). Notably, the single-test obstacle regimens (green curves) ensured greater success on every obstacle compared with the other regimens. On slopes and drop-offs, every training regimen improved success relative to the blind controller (gray curves), and on the drop-offs, the multi-test and combined regimens improved performance relative to the standard regimen (red curve). However, on gaps and bridges, results were mixed. On gaps, the standard, multi-test, and combined regimens performed equivalently to the blind controller. On bridges, even at the lowest difficulty level, the blind, standard, multi-test, and combined regimens produced success rates of <78%. Yet, training to the test in the single-test obstacle regimen demonstrates that the gap and bridge obstacles were learnable.
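A success-rate curve of the kind discussed above is a per-difficulty average over trial outcomes. The following sketch shows one way to compute it, assuming each trial is recorded as a `(difficulty, success)` pair; that record format, the function name, and the toy data are hypothetical, not taken from the paper.

```python
from collections import defaultdict


def success_rate_by_difficulty(trials):
    """Map each difficulty level to the fraction of successful trials at that level."""
    counts = defaultdict(lambda: [0, 0])  # difficulty -> [successes, total]
    for difficulty, success in trials:
        counts[difficulty][0] += int(success)
        counts[difficulty][1] += 1
    return {d: s / n for d, (s, n) in sorted(counts.items())}


# Hypothetical outcomes at three difficulty levels, four trials each.
toy_trials = [
    (0.0, True), (0.0, True), (0.0, True), (0.0, True),
    (0.5, True), (0.5, True), (0.5, False), (0.5, True),
    (1.0, False), (1.0, False), (1.0, True), (1.0, False),
]
curve = success_rate_by_difficulty(toy_trials)
```

Plotting such a curve for each training regimen against difficulty reproduces the layout of a success-rate panel, with one curve per regimen.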

IV-B. Gait Modifications Mid-Obstacle

Successful walking on more difficult obstacles was achieved in part by modifying gait after stepping on or over the obstacle (Fig. 4, second row), that is, in the multiple steps on the slope or bridge and the single step to cross the drop-off or gap. On slopes, for example, the speed of the blind controller (gray curve) increased with difficulty as gravity pulled Cassie down steeper slopes; speed peaked at ~30% difficulty and Cassie failed thereafter. The standard, multi-test, and combined training controllers also increased speed with difficulty, but speed peaked at ~22% difficulty (before the peak of the blind controller) and then decreased on steeper slopes as Cassie began using a braking strategy, resulting in success on steeper slopes than the blind controller could manage. The single-test training controller initially peaked even earlier at ~11% difficulty, before implementing a braking strategy. The second peak for the single-test controller resulted from Cassie slipping down the slope, and speed decreased as Cassie began jumping down the slope at even more difficult increments.

On drop-offs and gaps, speed increased with difficulty. For drop-offs, increased speed likely reflects effects of gravity pulling the body down, but for gaps, it likely reflects Cassie launching its body to span larger gaps. Consistent with the poor success rate on bridges, Cassie increased speed on narrower bridges—making narrower bridges more challenging.

Although mid-obstacle gait modifications indicate that Cassie modified its gait to cope with more difficult obstacles, we cannot definitively categorize such adjustments as prospective. Increased speed to launch over a gap is produced before crossing (i.e., prospective), but decreased speed after stepping onto the slope could be in reaction to feeling the slant (i.e., reactive), and increased speed on the drop-off may be entirely out of Cassie’s control (i.e., neither prospective nor reactive). The best test of planning, therefore, is gait modifications prior to encountering the obstacle.

IV-C. Gait Modifications Prior to the Obstacle

Cassie did not show compelling evidence of prospective speed adjustments prior to obstacles (Fig. 4, row 3). Speed and step length in the preceding two steps were constant or decreased only slightly across difficulty levels (e.g., from ~0.83 m/s at 0% difficulty to ~0.76 m/s at 50% difficulty).

Cassie did, however, show evidence of prospective gait modifications based on foot placement (Fig. 4, row 4). Cassie placed its last step prior to the obstacle closer to the edge as difficulty increased for the single-test regimen on slopes, drop-offs, and gaps, and for the standard, multi-test, and combined regimens on drop-offs and gaps. Placing the foot close to the edge is crucial, especially for drop-offs and gaps, because it shortens the step needed to cross. For drop-offs, the last step of the standard, multi-test, and combined regimens landed ~0.15m from the edge at difficulty 0, but this distance dropped to ~0.08m at difficulties of ~25-100%.
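The approach measures discussed in this subsection can be derived from a sequence of footstep touchdowns. The sketch below is one plausible way to compute them, assuming footsteps are recorded as `(time_s, x_m)` pairs along the walking direction and the obstacle edge sits at a known `edge_x`; the record format, the function name, and the exact definitions (for example, speed over the last two steps as distance over elapsed time) are assumptions, since the paper does not spell out its computation.

```python
def approach_measures(footsteps, edge_x):
    """Summarize gait just before the obstacle from (time_s, x_m) touchdowns."""
    before = [(t, x) for t, x in footsteps if x <= edge_x]
    if len(before) < 3:
        raise ValueError("need at least three touchdowns before the obstacle edge")
    (t0, x0), (_, x1), (t2, x2) = before[-3:]
    return {
        # Placement of the last step relative to the obstacle edge.
        "last_step_to_edge_m": edge_x - x2,
        # Average speed across the last two steps prior to the obstacle.
        "approach_speed_mps": (x2 - x0) / (t2 - t0),
        # Length of the final step before the obstacle.
        "last_step_length_m": x2 - x1,
    }


# Hypothetical trial: steady 0.4 m steps every 0.5 s, obstacle edge at x = 1.3 m.
toy_footsteps = [(0.0, 0.0), (0.5, 0.4), (1.0, 0.8), (1.5, 1.2), (2.0, 1.6)]
measures = approach_measures(toy_footsteps, edge_x=1.3)
```

Comparing these values across difficulty levels is what distinguishes prospective adjustment (values that change before the edge) from purely reactive adjustment (values that stay constant until after the first step onto the obstacle).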

V. Discussion

We applied experimental methods from research with human infants to the simulated bipedal robot Cassie. Systematic manipulation of Cassie’s training regimens and tests of performance outcomes revealed differential effects on learning based on a detailed characterization of behavior. Training specifically to the test (single-test obstacle regimen) resulted in higher success rates and more prospective gait modifications than training on a variety of non-test terrains (standard regimen), a variety of test obstacles (multi-test obstacle regimen), or a combination of non-test and test obstacles (combined regimen).

V-A. Effects of Training

Cassie showed markedly superior performance on each test obstacle when trained exclusively on that obstacle compared to the other training regimens. This indicates limited generalizability between these terrains, despite qualitative similarities. Hills are akin to slopes; ridges, blocks, and stairs are akin to drop-offs. Even the test obstacles approximate each other at some levels. The steepest slope is identical to the highest drop-off, and the widest gap approximates the gap around the bridge. The bridge obstacle is the most distinctive, requiring Cassie not only to handle the terrain in front of it, but also to adjust navigation and guide foot placement to keep from falling off. We see from the poor single-test performance on bridges that it is also the most difficult. This may be partly explained by the reward function encouraging an uncompromising heading.

Interestingly, the success rates are scarcely improved from the standard regimen to the multi-test and combined regimens where Cassie is exposed to the test obstacles. The single-test controllers were trained for 20k iterations on the obstacle they would be evaluated on, whereas the controllers trained on the standard, multi-test obstacle, and combined standard and multi-test regimens were trained for 110k iterations spread across various terrains. The multi-test regimen therefore received 27.5k iterations on each test obstacle (25% of 110k), but still shows inferior performance to single-test. This suggests either that the variety of experience hinders Cassie’s ability to learn the best strategies for each obstacle, or that longer training runs or larger models are required.

This interpretation is corroborated by the speeds mid-obstacle in Fig. 4, row 2. The single-test (green) curves stand apart from the others, indicating a different strategy was learned. This is most easily seen on slopes (column 1), where a braking behavior is adopted earliest (at a lower difficulty) by the single-test controller. Even more telling is the shift where it speeds up again, indicating a change in strategy not seen in the other controllers.

As discussed in Section IV-C, Cassie does not modulate its speed preceding an obstacle, but it does adjust foot placement, as illustrated in the last two rows of Fig. 4. The more the speed deviates from the commanded speed the more reward is lost, so an inflexible reward function may account for the consistent speeds. However, the intentional foot placement is most clearly seen on the gap obstacle in the bottom row of Fig. 4. Curiously, the single-test step lengths are larger and more similar to the blind controller’s, whereas the other regimens produce shorter preceding footsteps, possibly indicating a reluctance to step off the ledge. This combined with their low success rate indicates that they haven’t discovered an effective strategy (stepping across the gap) and are falling back on optimized failure modes.

V-B. Differences Between Babies and Robots

Cassie demonstrated motor skills more advanced than any infant—jumping down high drop-offs and recovering from near-catastrophic falls. But we tested simulated Cassie. Real Cassie—with a physical body—would have required repairs after such feats. Babies also must deal with the consequences of errors, but cannot be taken to the shop for repairs. Instead, their body is built to cope with frequent errors in learning to walk: infants are small, low to the ground, and move slowly, decreasing impact forces produced by a fall [].

Moreover, every baby learns things Cassie did not learn: Infants use a wide range of prospective gait modifications, invent and use alternative strategies for obstacles where modifications are insufficient, and avoid crossing impossible obstacles []. Most critical, babies generalize learning. No infant experiences single-test training. To acquire behavioral flexibility and prospective control of locomotion, infants must generalize from everyday experiences. In this regard, any 18-month-old can run circles around Cassie. Infants’ greater flexibility and adaptability may result from more powerful learning mechanisms than pure RL. One hypothesis is that infants are “learning to learn”. That is, they learn to generate and gather the relevant information and use it to guide their actions from moment to moment.

VI. Limitations and Future Work

Our case study tested controller behavior only at a single point in training with a given regimen. Future work should investigate the development of learned behaviors at varied points in training to understand the trajectory of learning.

In contrast to infants, Cassie did not prospectively modify speed while approaching obstacles. Apparently, infants’ natural training regimen teaches them to prospectively modify their speed to better cope with obstacles even as they are learning to walk. Decreased speed is useful for walking down steep slopes, high drop-offs, and narrow bridges, and increased speed is useful to leap over wide gaps. However, Cassie received an explicit, fixed speed command, and was encouraged to abide by it in the reward function. Thus, speed adjustments must yield greater improvements in the reward than the reward lost due to the speed error, or else those behaviors won’t be learned. Our fixed speed command may have precluded Cassie’s discovery of speed adjustment. To improve robot learning and to better understand infant walking, future work should consider relaxing the speed command to give the controller greater flexibility to modify gait and adopt novel strategies.
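The trade-off described above can be illustrated with a toy speed-tracking term. The sketch below is purely hypothetical: the exponential form, the `sharpness` constant, and the `tolerance` band are assumptions chosen to show how relaxing a speed command could remove the penalty for slowing down, and they are not the reward function used in this work.

```python
import math


def speed_reward(speed, commanded, tolerance=0.0, sharpness=5.0):
    """Hypothetical speed-tracking reward in (0, 1].

    Deviations within `tolerance` (m/s) of the command are not penalized;
    larger deviations reduce the reward exponentially.
    """
    error = max(0.0, abs(speed - commanded) - tolerance)
    return math.exp(-sharpness * error)


COMMANDED_SPEED = 0.8  # m/s, the fixed command used for the test-obstacle regimens

# Strict tracking: slowing to 0.5 m/s before an obstacle costs most of this term.
strict = speed_reward(0.5, COMMANDED_SPEED)

# Relaxed tracking: the same slowdown falls inside a 0.3 m/s tolerance band.
relaxed = speed_reward(0.5, COMMANDED_SPEED, tolerance=0.3)
```

Under the strict version, a slowdown is only worth learning if it recovers more reward elsewhere than it loses here; under the relaxed version, the controller can explore slower approaches at no cost to this term.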

VII. Conclusions

The simulated bipedal robot Cassie learned to modify its gait by placing its foot precisely just prior to the obstacle and by modifying speed while walking on or stepping over the obstacle. However, systematic training and testing based on methods from developmental research with human infants revealed that every training regimen produced more limited generalization and less adaptive behavioral modifications than expected, but a greater ability to jump, launch, and recover balance after large-amplitude movements and near falls. Likely, babies beat bots because they are “learning to learn” rather than responding solely to rewards.

References

[1] K. E. Adolph and J. E. Hoch, “Motor development: Embodied, embedded, enculturated, and enabling,” Annual Review of Psychology, vol. 70, pp. 141–164, 2019.
[2] K. Adolph, B. Kaplan, and K. Kretch, “Infants on the edge: Beyond the visual cliff,” in Revisiting the classic studies: Developmental psychology. Sage Publications, 2021.
[3] K. E. Adolph, J. E. Hoch, and W. G. Cole, “Development (of walking): 15 suggestions,” Trends in Cognitive Sciences, vol. 22, no. 8, pp. 699–711, 2018.
[4] C. M. Hospodar and K. E. Adolph, “The development of gait and mobility: Form and function in infant locomotion,” Wiley Interdisciplinary Reviews: Cognitive Science, p. e1677, 2024.
[5] W. G. Cole, S. R. Robinson, and K. E. Adolph, “Bouts of steps: The organization of infant exploration,” Developmental Psychobiology, vol. 58, no. 3, pp. 341–354, 2016.
[6] D. K. Lee, W. G. Cole, L. Golenia, and K. E. Adolph, “The cost of simplifying complex developmental phenomena: A new perspective on learning to walk,” Developmental Science, vol. 21, no. 4, p. e12615, 2018.
[7] C. M. Hospodar, J. E. Hoch, D. K. Lee, P. E. Shrout, and K. E. Adolph, “Practice and proficiency: Factors that facilitate infant walking skill,” Developmental Psychobiology, vol. 63, no. 7, p. e22187, 2021.
[8] K. E. Adolph, W. G. Cole, M. Komati, J. S. Garciaguirre, D. Badaly, J. M. Lingeman, G. L. Chan, and R. B. Sotsky, “How do you learn to walk? Thousands of steps and dozens of falls per day,” Psychological Science, vol. 23, no. 11, pp. 1387–1394, 2012.
[9] K. E. Adolph, B. I. Bertenthal, S. M. Boker, E. C. Goldfield, and E. J. Gibson, “Learning in the development of infant locomotion,” Monographs of the Society for Research in Child Development, pp. i–162, 1997.
[10] S. V. Gill, K. E. Adolph, and B. Vereijken, “Change in action: How infants learn to walk down slopes,” Developmental Science, vol. 12, no. 6, pp. 888–902, 2009.
[11] K. S. Kretch and K. E. Adolph, “Cliff or step? Posture-specific learning at the edge of a drop-off,” Child Development, vol. 84, no. 1, pp. 226–240, 2013.
[12] K. S. Kretch and K. E. Adolph, “No bridge too high: Infants decide whether to cross based on the probability of falling not the severity of the potential fall,” Developmental Science, vol. 16, no. 3, pp. 336–351, 2013.
[13] K. S. Kretch and K. E. Adolph, “The organization of exploratory behaviors in infant locomotor planning,” Developmental Science, vol. 20, no. 4, p. e12421, 2017.
[14] H. Duan, B. Pandit, M. S. Gadde, B. van Marum, J. Dao, C. Kim, and A. Fern, “Learning vision-based bipedal locomotion for challenging terrain,” 2023.
[15] X. B. Peng and M. van de Panne, “Learning locomotion skills using deeprl: Does the choice of action space matter?” in Proceedings of the ACM SIGGRAPH/Eurographics Symposium on Computer Animation. ACM, 2017, p. 12.
[16] Z. Xie, G. Berseth, P. Clary, J. Hurst, and M. van de Panne, “Feedback control for cassie with deep reinforcement learning,” in 2018 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS). IEEE, 2018, pp. 1241–1246.
[17] J. Tan, T. Zhang, E. Coumans, A. Iscen, Y. Bai, D. Hafner, S. Bohez, and V. Vanhoucke, “Sim-to-real: Learning agile locomotion for quadruped robots,” in Proc. of Robotics: Science and Systems XIV. Pittsburgh, Pennsylvania: Robotics: Science and Systems Foundation, 6 2018. [Online]. Available: http://www.roboticsproceedings.org/rss14/p10.html
[18] J. Hwangbo, J. Lee, A. Dosovitskiy, D. Bellicoso, V. Tsounis, V. Koltun, and M. Hutter, “Learning agile and dynamic motor skills for legged robots,” Science Robotics, vol. 4, no. 26, 2019. [Online]. Available: https://robotics.sciencemag.org/content/4/26/eaau5872
[19] V. Tsounis, M. Alge, J. Lee, F. Farshidian, and M. Hutter, “Deepgait: Planning and control of quadrupedal gaits using deep reinforcement learning,” IEEE Robotics and Automation Letters, vol. 5, no. 2, pp. 3699–3706, 2020.
[20] J. Siekmann, S. Valluri, J. Dao, L. Bermillo, H. Duan, A. Fern, and J. Hurst, “Learning memory-based control for human-scale bipedal locomotion,” in Proceedings of Robotics: Science and Systems, 7 2020.
[21] J. Siekmann, Y. Godse, A. Fern, and J. Hurst, “Sim-to-Real Learning of All Common Bipedal Gaits via Periodic Reward Composition,” in IEEE International Conference on Robotics and Automation (ICRA), 2021.
[22] J. Siekmann, K. Green, J. Warila, A. Fern, and J. Hurst, “Blind Bipedal Stair Traversal via Sim-to-Real Reinforcement Learning,” in Proceedings of Robotics: Science and Systems, vol. abs/2105.08328, Virtual, 7 2021. [Online]. Available: https://arxiv.org/abs/2105.08328
[23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov, “Proximal Policy Optimization Algorithms,” 2017. [Online]. Available: https://arxiv.org/abs/1707.06347
[24] E. Todorov, T. Erez, and Y. Tassa, “MuJoCo: A physics engine for model-based control,” in 2012 IEEE/RSJ International Conference on Intelligent Robots and Systems. IEEE, 2012, pp. 5026–5033.
[25] F. Yu, R. Batke, J. Dao, J. Hurst, K. Green, and A. Fern, “Dynamic bipedal maneuvers through sim-to-real reinforcement learning,” 2022.
[26] J. Dao, H. Duan, and A. Fern, “Sim-to-real learning for humanoid box loco-manipulation,” 2023.
[27] D. Crowley, J. Dao, H. Duan, K. Green, J. Hurst, and A. Fern, “Optimizing bipedal locomotion for the 100m dash with comparison to human running,” in 2023 IEEE International Conference on Robotics and Automation (ICRA), 2023, pp. 12205–12211.
[28] D. Han and K. E. Adolph, “The impact of errors in infant development: Falling like a baby,” Developmental Science, vol. 24, no. 5, p. e13069, 2021.