Overview
We introduce several novel contributions to the paradigm of learning latent actions from unlabeled observation data:
- Contrary to prior work, we demonstrate that latent actions should remain continuous rather than be discretized in order to be effective for fine-grained control tasks in locomotion and manipulation.
- We propose to jointly train the latent action model and action decoder for grounding latent actions to the environment. We demonstrate
that a jointly trained latent action space improves downstream policy performance on both locomotion and manipulation tasks.
- We learn a performant control policy without ever training on labeled expert demonstrations, instead leveraging labels from
random or play data.
Continuous Latent Action Models
CLAM consists of two stages. In Stage 1, we train a latent action model
(LAM). We then use this LAM in Stage 2 to train a latent action policy.
Stage 1 is the pretraining stage of CLAM. We assume access to a large unlabeled dataset of observations for training the latent action model.
A LAM consists of two models: a forward dynamics model (FDM) that predicts the dynamics of the environment, and an inverse dynamics model (IDM) that inverts this process by inferring the latent action performed between two consecutive observations.
The FDM learns to predict the next observation given the current observation and the latent action taken in that state.
To ground the learned latent actions, we additionally learn a latent action decoder which predicts the environment action from the latent action.
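To make Stage 1 concrete, below is a minimal PyTorch-style sketch of the joint objective. The class name `LatentActionModel`, the MLP architectures, the MSE losses, and the `decoder_weight` coefficient are illustrative assumptions, not the exact design used in CLAM.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LatentActionModel(nn.Module):
    """Sketch of CLAM Stage 1: IDM, FDM, and action decoder trained jointly."""

    def __init__(self, obs_dim, action_dim, latent_dim=8, hidden=256):
        super().__init__()
        # IDM: (o_t, o_{t+1}) -> continuous latent action z_t
        self.idm = nn.Sequential(
            nn.Linear(2 * obs_dim, hidden), nn.ReLU(), nn.Linear(hidden, latent_dim))
        # FDM: (o_t, z_t) -> predicted o_{t+1}
        self.fdm = nn.Sequential(
            nn.Linear(obs_dim + latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, obs_dim))
        # Action decoder: z_t -> environment action a_t, grounding the latent space
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, hidden), nn.ReLU(), nn.Linear(hidden, action_dim))

    def loss(self, obs, next_obs, action=None, decoder_weight=1.0):
        # Infer the latent action from consecutive observations.
        z = self.idm(torch.cat([obs, next_obs], dim=-1))
        # Reconstruct the next observation from (o_t, z_t).
        total = F.mse_loss(self.fdm(torch.cat([obs, z], dim=-1)), next_obs)
        # Grounding term, applied only to the small action-labeled (non-expert) subset.
        if action is not None:
            total = total + decoder_weight * F.mse_loss(self.decoder(z), action)
        return total
```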
After CLAM pretraining, we use the latent IDM to annotate the entire observation dataset with latent actions.
We then train a latent action policy on this relabeled data using imitation learning.
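Continuing the assumptions above, the relabeling and imitation learning step might look like the following sketch; the dataset format (consecutive observation pairs) and the MSE behavior-cloning loss are illustrative choices rather than the exact training recipe.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def relabel_with_latent_actions(lam, obs_pairs):
    """Annotate consecutive observation pairs with latent actions from the IDM."""
    return [(obs, lam.idm(torch.cat([obs, next_obs], dim=-1)))
            for obs, next_obs in obs_pairs]

def train_latent_policy(policy, relabeled, optimizer, epochs=10):
    """Behavior cloning on (observation, latent action) pairs."""
    for _ in range(epochs):
        for obs, z in relabeled:
            loss = F.mse_loss(policy(obs), z)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy
```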
At inference time, our learned policy predicts a latent action given an observation, which the action decoder then decodes into an action executable in the environment.
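Finally, a sketch of the inference loop under the same assumptions; the `rollout` helper is hypothetical and a classic Gym-style `env.reset`/`env.step` interface is assumed.

```python
import torch

@torch.no_grad()
def rollout(env, policy, lam, max_steps=500):
    """Inference sketch: the policy outputs a latent action, the decoder grounds it."""
    obs = env.reset()
    for _ in range(max_steps):
        obs_t = torch.as_tensor(obs, dtype=torch.float32)
        z = policy(obs_t)                   # latent action from the learned policy
        action = lam.decoder(z).numpy()     # executable environment action
        obs, reward, done, info = env.step(action)
        if done:
            break
    return obs
```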
Experimental Results
We compare CLAM to several state-of-the-art methods on both state- and image-based observations.
We show quantitative results across multiple simulated environments including locomotion tasks in DMControl (Todorov et al. 2012) and
robot manipulation tasks in the MetaWorld (Yu et al. 2020) and CALVIN (Mees et al. 2022) benchmarks shown below.
Since each baseline uses a different neural architecture, and some utilize pre-trained off-the-shelf models, we normalize the general architecture across baselines and omit pre-trained and language-conditioned components.
FINDING 1: CLAM outperforms all baselines and nearly matches the performance of BC with expert data in both state- and image-based experiments.
CLAM improves upon the best baseline, VPT, by more than 2× in average normalized return on the DMControl (locomotion) tasks and around 2-3× in success rate on the MetaWorld (manipulation) tasks.
These results are consistent even in the image-based experiments shown in the table below.
FINDING 2: Continuous latent actions and a jointly trained action decoder greatly improve performance on continuous control problems.
FINDING 3: CLAM successfully learns without ever accessing labeled expert data.
CLAM Design Choices
(Left) Latent action dimension directly affects the model's expressivity for policy learning. Up to a latent dimension of 4, the learned latent action space fails to be useful for imitation learning; a latent dimension of 8, however, has sufficient capacity, achieving a 57% success rate on the Assembly task.
(Right) CLAM scales with the amount of unlabeled target-task data. The performance of the downstream policy improves as we annotate more trajectories with the pretrained CLAM.
(Left) Increasing the amount of non-expert, action-labeled data improves action decoder performance. We vary the number of labeled trajectories used to train the action decoder: while BC struggles to learn from non-expert data, our method improves with more data.
(Right) We also evaluate the robustness of CLAM to the expertise level of the labeled data. CLAM learns a better policy than BC with the same amount of labeled random trajectories. Unsurprisingly, with expert data, our method recovers an optimal policy.