Decoupled Asynchronous Multimodal Vision Language Action Model

Pankhuri Vanjani¹, Zhuoyue Li¹, Jakub Suliga¹, Moritz Reuss², Gianluca Geraci¹, Xinkai Jiang¹, Rudolf Lioutikov¹˒³
¹Intuitive Robots Lab (IRL), Karlsruhe Institute of Technology (KIT), ²NVIDIA, ³Robotics Institute Germany (RIG)

Synchronous versus asynchronous multimodal processing: a synchronous VLA forces every modality onto a single slow clock, whereas DAM-VLA processes each modality at its own native rate.

Abstract

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each modality update and retain information at its own sensor rate, yields stronger representations and more robust control. We present Decoupled Asynchronous Multimodal Vision Language Action (DAM-VLA), which maintains per-modality latent buffers that are refreshed at sensor rates and read continuously by the action head. New high-frequency modalities are integrated through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control.
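The oversampling and undersampling argument can be made concrete with a little arithmetic. The sketch below is illustrative only: the 25 Hz clock matches the standard VLA regime listed in the results, while the 500 Hz force rate and 10 Hz vision rate are assumed values chosen for the example, and `sync_mismatch` is a hypothetical helper, not part of DAM-VLA. It assumes the shared clock always reads the latest available sample.

```python
from fractions import Fraction

def sync_mismatch(clock_hz, sensor_hz):
    """Compare one shared processing clock against a sensor's native rate.

    Returns (fresh_fraction, seen_fraction):
      fresh_fraction: share of clock ticks that see a new sensor sample
      seen_fraction:  share of sensor samples the clock ever reads
    Assumes the clock reads the most recent sample at every tick.
    """
    clock, sensor = Fraction(clock_hz), Fraction(sensor_hz)
    fresh_fraction = min(Fraction(1), sensor / clock)
    seen_fraction = min(Fraction(1), clock / sensor)
    return fresh_fraction, seen_fraction

# Assumed rates for illustration: 500 Hz force/torque, 10 Hz vision.
force = sync_mismatch(25, 500)   # fast sensor: most samples are never read
vision = sync_mismatch(25, 10)   # slow sensor: most reads repeat an old frame
```

Under these assumed rates, the shared clock reads only one in twenty force samples, while three out of five vision reads re-process a frame that has not changed.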

Method

DAM-VLA architecture. Each modality stream encodes tokens into an independent latent buffer at its sensor rate: vision periodically, proprioception and force/torque at high frequency. The action expert reads all buffers continuously via parallel gated cross-attention (GCA) pathways: a global-gate pathway for visual memory and an input-dependent gate pathway for force/torque. New modalities are added through dedicated cross-attention modules that preserve the pretrained self-attention structure.
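The two gating pathways can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the DAM-VLA implementation: the attention has a single head and no learned projections, the global gate is assumed to be a scalar passed through `tanh`, and the input-dependent gate is assumed to be a sigmoid of a linear function of the hidden state. All function and parameter names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]

def cross_attend(query, buffer):
    """Dot-product attention of one query vector over the tokens in a buffer."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, tok)) / math.sqrt(d) for tok in buffer]
    weights = softmax(scores)
    return [sum(w * tok[i] for w, tok in zip(weights, buffer)) for i in range(d)]

def global_gated(hidden, buffer, alpha):
    """Residual update scaled by tanh(alpha), one scalar shared by all inputs (assumed form)."""
    ctx = cross_attend(hidden, buffer)
    gate = math.tanh(alpha)
    return [h + gate * c for h, c in zip(hidden, ctx)]

def input_gated(hidden, buffer, w, b):
    """Residual update scaled by sigmoid(w . hidden + b), recomputed for each input (assumed form)."""
    ctx = cross_attend(hidden, buffer)
    logit = sum(wi * hi for wi, hi in zip(w, hidden)) + b
    gate = 1.0 / (1.0 + math.exp(-logit))
    return [h + gate * c for h, c in zip(hidden, ctx)]
```

With `alpha = 0.0` the global gate is closed and `global_gated` returns the hidden state unchanged, which is one simple way a new pathway can start out without disturbing a pretrained backbone.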


Contributions

  1. Asynchronous multimodal architecture. A decoupled design in which each modality stream updates independently at its natural frequency, with per-modality temporal context windows sized to the meaningful horizon of that signal.
  2. Improved performance through asynchronous representations. By preserving the natural information structure of each sensor, DAM-VLA learns multimodal representations that lead to higher task success rates than synchronous baselines.
  3. Reduced inference latency. Decoupling action generation from slow modality update cycles (such as periodic VLM re-encoding) lets the policy act continuously at the control frequency, reducing end-to-end latency and increasing effective control rate.
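The decoupling described in these contributions can be sketched as a small scheduling simulation. The example is a toy under stated assumptions: only the 100 Hz control rate comes from the results, whereas the 5 Hz vision rate, the 500 Hz force rate, and the window sizes are made-up values, and `LatentBuffer` and `run` are hypothetical names. Encoders are replaced by a stand-in that records the modality name and timestamp.

```python
from collections import deque

class LatentBuffer:
    """Fixed-size window of latent tokens, refreshed at the modality's own period."""

    def __init__(self, period_ms, window):
        self.period_ms = period_ms
        self.tokens = deque(maxlen=window)  # oldest tokens fall out automatically
        self.writes = 0

    def maybe_update(self, t_ms, encode):
        # Write a new token only when this modality's sensor would have fired.
        if t_ms % self.period_ms == 0:
            self.tokens.append(encode(t_ms))
            self.writes += 1

    def read(self):
        return list(self.tokens)

def run(duration_ms=1000, control_period_ms=10):
    # Assumed rates: vision every 200 ms (5 Hz), force every 2 ms (500 Hz).
    buffers = {"vision": LatentBuffer(200, window=2),
               "force": LatentBuffer(2, window=16)}
    actions = []
    for t in range(duration_ms):  # 1 ms simulation tick
        for name, buf in buffers.items():
            buf.maybe_update(t, lambda t_ms, n=name: (n, t_ms))
        if t % control_period_ms == 0:
            # The action step never waits for a slow encoder: it reads whatever
            # each buffer currently holds.
            context = {n: b.read() for n, b in buffers.items()}
            actions.append((t, context))
    return buffers, actions
```

Calling `run()` simulates one second. The control step fires at every 10 ms tick regardless of how rarely the vision buffer is refreshed, and each buffer keeps only the window that is meaningful for its signal.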

Results

On seven contact-rich tasks with a Franka Emika Panda, DAM-VLA reaches 95.2% average success, more than double the strongest synchronous baseline (40.95%), while running reactively at a 100 Hz control frequency.
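For reference, the "more than double" claim follows directly from the two reported averages; the ratio and the absolute gap below are computed from those figures only.

```latex
\frac{95.2}{40.95} \approx 2.32,
\qquad
95.2 - 40.95 = 54.25 \text{ percentage points}
```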

Configuration | Isolates | Async | Force | Mem. | Integ.
Baselines
X-VLA25 | std. VLA regime (25 Hz) | | | |
X-VLA100 | naive high-freq. (100 Hz) | | | |
Ours
X-VLAAFM | concat. baseline | | | | concat.
DAM-VLA/F/M | async. alone | | | |
DAM-VLA/F | memory contribution | | | | GCA
DAM-VLA/M | force contribution | | | | GCA
DAM-VLA (Ours) | full model | | | | GCA

Figure: Average Success Rate (%).

Figure: Per-Task Success Rate (%) by model and task (Scarf, Whiteboard, Button, Handwash, Lego, Socket, Sweep). Light = low, dark blue = high. 15 trials per task.

Figure: per-task bar charts. Blue: clean success; orange: partial success. Values shown (%):

Model | Button | Handwash | Lego | Socket
X-VLA25 | 13.3 | 0 | 0 | 6.7
X-VLA100 | 6.7 | 0 | 0 | 0
X-VLAAFM | 13.3 | 86.7 | 0 | 6.7
DAM-VLA/F/M | 40 | 20 | 0 | 6.7
DAM-VLA/F | 86.7 | 40 | 0 | 6.7
DAM-VLA/M | 86.7 | 80 | 13.3 | 13.3
DAM-VLA (Ours) | 93.3 | 100 | 93.3 | 80

1. Synchronous processing hits a performance ceiling despite frequency scaling


2. Synchronous baselines vs. the asynchronous decoupled baseline


3. Integration mechanism for new modalities matters: gated cross-attention preserves pretrained representations


BibTeX

Coming soon.