DisDP: Robust Imitation Learning via Disentangled Diffusion Policies

Pankhuri Vanjani¹, Paul Mattes¹, Kevin Daniel Kuryshev¹, Xiaogang Jia¹, Vedant Dave², Rudolf Lioutikov¹

Accepted to RLC 2025 (Reinforcement Learning Conference)
Also accepted at the following RSS 2025 Workshops:
  • Workshop on Reliable Robotics
  • Workshop on Learned Robot Representations
  • ⭐ R3: Reasoning for Robust Robot Manipulation in the Open World (Spotlight talk)

Abstract

This work introduces the Disentangled Diffusion Policy (DisDP), an Imitation Learning (IL) method that enhances robustness. Robot policies must be robust against different perturbations, including sensor noise, complete sensor dropout, and environmental variations. Existing IL methods struggle to generalize under such conditions, as they typically assume consistent, noise-free inputs. To address this limitation, DisDP structures sensor inputs into shared and private representations, preserving global features while retaining details from individual sensors. Additionally, Disentangled Behavior Cloning (DisBC), a disentangled Behavior Cloning (BC) policy, is introduced to demonstrate the general applicability of disentanglement for IL. This structured representation improves resilience against sensor dropouts and perturbations. Evaluations on The Colosseum and Libero benchmarks demonstrate that disentangled policies achieve better overall performance and exhibit greater robustness to perturbations than their baseline counterparts.

Method Overview

The model processes multi-view image inputs by separating them into shared and private representations. The language instruction is encoded with CLIP, and each camera input is encoded with a ResNet-18, followed by disentanglement modules that extract shared embeddings across all views and private embeddings for individual views. These embeddings are processed by a multimodal transformer encoder and serve as conditioning inputs to the denoising transformer decoder for action prediction. The model is trained with a combination of diffusion loss, multi-view disentanglement loss, and orthogonality loss to enforce representation separation. This structured representation learning enhances robustness to sensor noise, sensor failures, and environmental variations.
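The orthogonality loss mentioned above can be sketched as follows. This is a minimal illustration, not the paper's exact implementation: the function name and the batch-wise cross-correlation formulation are assumptions, but the idea of penalizing overlap between shared and private embeddings matches the description.

```python
import torch
import torch.nn.functional as F

def orthogonality_loss(shared: torch.Tensor, private: torch.Tensor) -> torch.Tensor:
    """Penalize overlap between shared and private embeddings.

    Both tensors have shape (batch, dim). The loss is the squared
    Frobenius norm of the batch-wise cross-correlation matrix, which
    vanishes when the two embedding subspaces are orthogonal.
    """
    s = F.normalize(shared, dim=-1)
    p = F.normalize(private, dim=-1)
    corr = s.transpose(0, 1) @ p  # (dim, dim) cross-correlation
    return (corr ** 2).sum() / shared.shape[0]
```

In training, this term would be added to the diffusion and disentanglement losses with a weighting coefficient.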

Method

Overview of DisDP architecture

Experiments and Results

We evaluate DisDP on two challenging robot imitation learning benchmarks, The Colosseum and Libero, to systematically assess robustness, generalization, and interpretability.

Baselines

We compare DisDP and its behavior cloning variant (DisBC) against:

  • BC: Standard Behavior Cloning (Transformer-based)
  • BESO-ACT: Diffusion-based policy with action chunking
  • BESO-ACT-Dropout: BESO-ACT with random sensor dropout during training
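The sensor-dropout augmentation used by the BESO-ACT-Dropout baseline can be sketched as below. This is a hypothetical illustration under assumed names: the function, the per-view dropout probability, and the "keep at least one view" safeguard are our assumptions, not the baseline's documented implementation.

```python
import random
import torch

def random_sensor_dropout(views: dict[str, torch.Tensor],
                          drop_prob: float = 0.2) -> dict[str, torch.Tensor]:
    """Zero out each camera view independently with probability drop_prob,
    always keeping at least one view intact so the policy has some input."""
    names = list(views.keys())
    dropped = [n for n in names if random.random() < drop_prob]
    if len(dropped) == len(names):  # never drop every view
        dropped.remove(random.choice(dropped))
    return {n: torch.zeros_like(v) if n in dropped else v
            for n, v in views.items()}
```

Applied to each training batch, this teaches the policy to tolerate missing camera streams at test time.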

Disentanglement improves overall performance and demonstrates higher resilience to sensor noise and dropouts


Colosseum no-variation dataset evaluation with noisy and masked camera views. The numbers in the View(s) column correspond to specific cameras: 0 left view, 1 right view, 2 wrist view, and 3 front view. Dual-camera dropouts are only reported for the combinations 0 1 and 1 2, since all other combinations yield low success rates for every method. The evaluation examines how noisy sensors and sensor failures affect task success rates and assesses the resilience of the different methods under these conditions. The disentangled methods perform considerably better than their baseline implementations; DisBC in particular shows only a small drop in performance when noise is added.
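The evaluation-time perturbations described above, adding noise to a camera view or masking it entirely, can be sketched as follows. The function name and the Gaussian-noise parameterization are illustrative assumptions; the benchmark's actual perturbation protocol may differ.

```python
import torch

def perturb_view(view: torch.Tensor, mode: str, noise_std: float = 0.1) -> torch.Tensor:
    """Apply an evaluation-time perturbation to a single camera view.

    mode='noise' adds Gaussian noise to the image tensor;
    mode='mask' zeroes it out, simulating a complete sensor dropout.
    """
    if mode == "noise":
        return view + noise_std * torch.randn_like(view)
    if mode == "mask":
        return torch.zeros_like(view)
    raise ValueError(f"unknown perturbation mode: {mode}")
```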

Better generalization to unseen environmental variations


Colosseum results under environmental variations, comparing BESO-ACT and DisDP. The no-variation condition serves as a baseline, showing the highest performance. Spatial, textural, and lighting variations significantly impact success rates, with camera pose and table texture masking causing the most degradation.

Interpretable latent space


Saliency maps for disentangled embeddings. In the close-box task, the shared embeddings capture the box edges, which are crucial for task completion and visible across different views. In contrast, the private embeddings focus on view-specific details: some, such as robot joints and table shadows, contribute to task execution, while others capture unique but less relevant scene elements.
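A simple way to produce such maps is gradient-based saliency: the magnitude of the embedding's gradient with respect to each input pixel. The sketch below is one common formulation under assumed names (`saliency_map`, using the embedding norm as the scalar target); the paper's exact visualization method is not specified here.

```python
import torch

def saliency_map(encoder: torch.nn.Module, image: torch.Tensor) -> torch.Tensor:
    """Gradient-based saliency for a single image.

    image: (C, H, W). Backpropagates the norm of the embedding to the
    input pixels and returns an (H, W) map of max-over-channel gradients.
    """
    img = image.clone().unsqueeze(0).requires_grad_(True)
    embedding = encoder(img)
    embedding.norm().backward()
    return img.grad.abs().squeeze(0).max(dim=0).values
```

Running this separately on the shared and private disentanglement heads would highlight which image regions each embedding attends to.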

Get in Touch

Interested in collaborating or trying DisDP?
Email: pankhuri.vanjani@kit.edu
Code coming soon

Citation

@inproceedings{vanjanidisdp,
  title={DisDP: Robust Imitation Learning via Disentangled Diffusion Policies},
  author={Vanjani, Pankhuri and Mattes, Paul and Kuryshev, Kevin Daniel and Jia, Xiaogang and Dave, Vedant and Lioutikov, Rudolf},
  booktitle={Reinforcement Learning Conference},
  year={2025}
}