Open X-Embodiment: Robotic Learning Datasets and RT-X Models
Abstract
Large, high-capacity models trained on diverse datasets have shown remarkable success in efficiently tackling downstream applications. In domains from NLP to computer vision, this has led to a consolidation of pretrained models, with general pretrained backbones serving as a starting point for many applications. Can such a consolidation happen in robotics? Conventionally, robotic learning methods train a separate model for every application, every robot, and even every environment. Can we instead train a generalist X-robot policy that can be adapted efficiently to new robots, tasks, and environments? In this paper, we provide datasets in standardized data formats and models to make it possible to explore this possibility in the context of robotic manipulation, alongside experimental results that provide an example of effective X-robot policies. We assemble a dataset from 22 different robots collected through a collaboration between 21 institutions, demonstrating 527 skills (160,266 tasks). We show that a high-capacity model trained on this data, which we call RT-X, exhibits positive transfer and improves the capabilities of multiple robots by leveraging experience from other platforms. More details can be found on the project website: https://robotics-transformer-x.github.io.
Community
Introduces the RT-X models: policies trained on a large corpus of 160k tasks (the Open X-Embodiment dataset) for generalist robotic manipulation, i.e. X-embodiment training on data pooled from multiple robotic platforms. RT-1 and RT-2 models trained on this mixture outperform the same models trained only on data from the evaluation domain. The dataset contains scenes and trajectories from different robots (Franka, xArm, Sawyer, Google robot, etc.), covering multiple manipulation skills (pick-and-place, moving, and other actions) over many object shapes and types (materials, appliances, etc.), and is released in RLDS format (serialized TFRecord episodes). Both RT models take text instructions and images as input: RT-1 uses a FiLM-conditioned EfficientNet followed by a Transformer, while RT-2 uses a VLM (PaLI: a ViT plus a UL2 language model) with a de-tokenizer; both predict discrete action tokens for the robots and are trained with a categorical cross-entropy objective in the output space (see the sketch below). RT-1-X is trained on the mixed robot dataset, while RT-2-X is co-fine-tuned on it alongside the VLM's original web data. RT-1-X (combined dataset) outperforms RT-1 trained on the individual robot-specific datasets, and RT-2-X (a larger, 55B-parameter model) performs better still. RT-2-X can generalize to tasks and objects unseen on a given platform, e.g. performing Bridge-dataset tasks, originally collected on WidowX, on the Google Robot. Conditioning on a short history (instead of no history) further improves RT-2-X generalization (tested at the 5B size), and the larger 55B model is better. More experiments with different robot sensing and actuation modalities are still needed (all robots here are arms). From CMU, ETH Zurich, Google, NYU, SJTU, Stanford, UC Berkeley, UIUC, etc. (21 institutions).
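To make the shared output space concrete, here is a minimal NumPy sketch of per-dimension action discretization and the categorical cross-entropy objective described above. The 256-bin choice follows the RT papers; the normalized action range, the 7-DoF action layout, and all function names here are illustrative assumptions, not the authors' code.

```python
import numpy as np

# Number of discrete bins per action dimension; RT-1/RT-2 use 256 bins.
# The normalized action range below is an assumption for illustration.
NUM_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0

def tokenize_action(action: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to an integer token in [0, NUM_BINS)."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    scaled = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)  # -> [0, 1]
    return np.minimum((scaled * NUM_BINS).astype(np.int32), NUM_BINS - 1)

def detokenize_action(tokens: np.ndarray) -> np.ndarray:
    """Map integer tokens back to bin-center continuous values."""
    centers = (tokens.astype(np.float32) + 0.5) / NUM_BINS
    return centers * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

def cross_entropy_loss(logits: np.ndarray, target_tokens: np.ndarray) -> float:
    """Categorical cross-entropy over per-dimension token distributions.

    logits: (action_dims, NUM_BINS) unnormalized policy scores.
    target_tokens: (action_dims,) ground-truth tokens from a demonstration.
    """
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    return float(-log_probs[np.arange(len(target_tokens)), target_tokens].mean())

# Example: a hypothetical 7-DoF end-effector action (xyz delta, rotation delta, gripper).
action = np.array([0.03, -0.10, 0.25, 0.0, 0.1, -0.2, 1.0])
tokens = tokenize_action(action)
print(tokens, np.abs(action - detokenize_action(tokens)).max())
```

Discretizing actions this way is what lets heterogeneous robots share one categorical output head: each platform's continuous commands are mapped into the same token vocabulary, so a single policy can be trained across embodiments with one objective.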
Links: website, DeepMind Blog, arXiv, GitHub