Abstract
DisDP (Disentangled Diffusion Policy) is an imitation learning method that separates multi-view observations into shared and private representations, making diffusion policies more robust to sensor noise, sensor failures, and environmental variations.

Method Overview
The model processes multi-view image inputs by separating them into shared and private representations. The language instruction is encoded with CLIP, and each camera input is encoded with a ResNet-18, followed by disentanglement modules that extract embeddings shared across all views and embeddings private to individual views. These embeddings are processed by a multimodal transformer encoder and condition the denoising transformer decoder for action prediction. The model is trained with a combination of a diffusion loss, a multi-view disentanglement loss, and an orthogonality loss that enforces the separation of the two representation types. This structured representation learning improves robustness to sensor noise, sensor failures, and environmental variations.
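A minimal sketch of the shared/private split and the orthogonality regularizer, assuming hypothetical module names, feature dimensions, and a cosine-based penalty; this is an illustration of the idea, not the released DisDP implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ViewDisentangler(nn.Module):
    """Splits each view's ResNet-18 feature into a shared and a private embedding."""
    def __init__(self, feat_dim: int = 512, emb_dim: int = 128, num_views: int = 4):
        super().__init__()
        self.shared_head = nn.Linear(feat_dim, emb_dim)  # one head shared by all views
        self.private_heads = nn.ModuleList(
            [nn.Linear(feat_dim, emb_dim) for _ in range(num_views)]  # one head per view
        )

    def forward(self, view_feats: torch.Tensor):
        # view_feats: (batch, num_views, feat_dim) from the per-view ResNet-18 encoders
        shared = self.shared_head(view_feats)  # (B, V, emb_dim)
        private = torch.stack(
            [head(view_feats[:, i]) for i, head in enumerate(self.private_heads)], dim=1
        )  # (B, V, emb_dim)
        return shared, private

def orthogonality_loss(shared: torch.Tensor, private: torch.Tensor) -> torch.Tensor:
    # Penalize alignment between a view's shared and private embeddings so the
    # two subspaces carry complementary information.
    cos = F.cosine_similarity(shared, private, dim=-1)  # (B, V)
    return (cos ** 2).mean()

# The total objective then combines the three terms, with assumed weights:
# loss = l_diffusion + lambda_dis * l_disentangle + lambda_orth * orthogonality_loss(s, p)
```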

Overview of the DisDP architecture
Experiments and Results
We evaluate DisDP on two challenging robot imitation learning benchmarks, The Colosseum and LIBERO, to systematically assess robustness, generalization, and interpretability.
Baselines
We compare DisDP and its behavior cloning variant (DisBC) against:
- BC: Standard Behavior Cloning (Transformer-based)
- BESO-ACT: Diffusion-based policy with action chunking
- BESO-ACT-Dropout: BESO-ACT with random sensor dropout during training (a minimal sketch follows this list)
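The sketch below illustrates random sensor dropout as used for the BESO-ACT-Dropout baseline; the dropout probability and zero-masking scheme are assumptions, not the benchmark's exact settings.

```python
import torch

def random_sensor_dropout(views: torch.Tensor, p_drop: float = 0.2) -> torch.Tensor:
    """Zero out each camera view independently with probability p_drop.

    views: (batch, num_views, C, H, W) image batch.
    """
    keep = (torch.rand(views.shape[:2], device=views.device) > p_drop).float()
    return views * keep[:, :, None, None, None]  # broadcast mask over C, H, W
```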
Disentanglement improves overall performance and increases resilience to sensor noise and dropouts

Colosseum no-variation dataset evaluation with noisy and masked camera views. The numbers in the View(s) column identify the cameras: 0 left view, 1 right view, 2 wrist view, and 3 front view. Dual-camera dropouts are reported only for (0, 1) and (1, 2), because all other combinations yield low success rates for every method. The evaluation examines how noisy sensors and sensor failures affect task success rates and assesses the resilience of each method under these conditions. The disentangled methods perform considerably better than their baseline counterparts; DisBC in particular shows only a small performance drop when noise is added.
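A hedged sketch of the noisy/masked-view evaluation protocol described in the caption above; the Gaussian noise level and the zero-masking choice are assumptions about the corruption model.

```python
import torch

def corrupt_views(views: torch.Tensor, noisy=(), masked=(), sigma: float = 0.1) -> torch.Tensor:
    """Add Gaussian noise to the `noisy` view indices and zero out the `masked` ones.

    views: (num_views, C, H, W) observation at one timestep.
    View indices follow the table: 0 left, 1 right, 2 wrist, 3 front.
    """
    views = views.clone()
    for i in noisy:
        views[i] = views[i] + sigma * torch.randn_like(views[i])
    for i in masked:
        views[i] = torch.zeros_like(views[i])
    return views

# e.g., dual-camera dropout of the left and right views:
# obs = corrupt_views(obs, masked=(0, 1))
```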
Better generalization to unseen environmental variations

Colosseum results under environmental variations, comparing BESO-ACT and DisDP. The no-variation condition serves as a baseline and shows the highest performance. Spatial, textural, and lighting variations significantly reduce success rates, with camera-pose changes and table-texture masking causing the most degradation.
Interpretable latent space

Saliency maps for the disentangled embeddings. In the close-box task, the shared embeddings capture the box edges, which are crucial for task completion and visible across different views. In contrast, the private embeddings focus on view-specific details such as robot joints and table shadows; some of these contribute to task execution, while others capture unique but less relevant scene elements.
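One way to produce such maps is input-gradient saliency with respect to a chosen embedding; the sketch below assumes the hypothetical `encoder` and `ViewDisentangler` modules from above and a norm-based target, which may differ from the attribution method used in the paper.

```python
import torch

def embedding_saliency(encoder, disentangler, views: torch.Tensor,
                       view_idx: int, shared: bool = True) -> torch.Tensor:
    """Gradient magnitude of one view's embedding norm w.r.t. the input pixels."""
    views = views.clone().requires_grad_(True)          # (B, V, C, H, W)
    B, V = views.shape[:2]
    feats = encoder(views.flatten(0, 1)).view(B, V, -1)  # per-view ResNet-18 features
    s, p = disentangler(feats)
    target = (s if shared else p)[:, view_idx].norm()    # scalar objective
    target.backward()
    # Per-pixel saliency for the chosen view: max over channels of |grad|
    return views.grad[:, view_idx].abs().amax(dim=1)     # (B, H, W)
```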
Get in Touch
Interested in collaborating or trying DisDP?
Email: pankhuri.vanjani@kit.edu
Code coming soon
Citation
@inproceedings{vanjanidisdp,
  title     = {DisDP: Robust Imitation Learning via Disentangled Diffusion Policies},
  author    = {Vanjani, Pankhuri and Mattes, Paul and Jia, Xiaogang and Dave, Vedant and Lioutikov, Rudolf},
  booktitle = {Reinforcement Learning Conference}
}