DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action Model

Abstract

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining and process every input at one rate. This is misaligned with physical interaction, where high-frequency modalities change at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, so that each modality updates and retains information at its own sensor rate, yields stronger representations and more robust control. We present Decoupled Asynchronous Multimodal Vision Language Action (DAM-VLA), which maintains per-modality latent buffers that are refreshed at sensor rates and read continuously by the action head. New high-frequency modalities are integrated through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control.
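The idea of per-modality buffers that refresh at their own rates while the action head reads on every control tick can be illustrated with a small timing simulation. This is a minimal sketch, not the authors' implementation: the `LatentBuffer` class, the `simulate` function, the stand-in encoder, and the specific rates (vision at 25 Hz, force at 100 Hz on a 100 Hz control clock) are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional


@dataclass
class LatentBuffer:
    """Holds the most recent latent of one modality (hypothetical helper)."""
    name: str
    period: Optional[int]  # refresh every `period` ticks; None = once, at tick 0
    latent: Any = None
    refreshes: int = 0

    def maybe_refresh(self, tick: int, encode: Callable[[int], Any]) -> None:
        due = (tick == 0) if self.period is None else (tick % self.period == 0)
        if due:
            self.latent = encode(tick)
            self.refreshes += 1


def simulate(ticks: int = 100):
    # Base clock: one tick = 10 ms, i.e. a 100 Hz control loop.
    # The per-modality rates below are illustrative assumptions.
    buffers = [
        LatentBuffer("language", period=None),  # constant across the episode
        LatentBuffer("vision", period=4),       # 25 Hz
        LatentBuffer("force", period=1),        # 100 Hz
    ]
    vision_age = []
    for tick in range(ticks):
        for buf in buffers:
            # Stand-in encoder: the "latent" is just the tick it was encoded at.
            buf.maybe_refresh(tick, encode=lambda t: t)
        # The action head reads every buffer on every tick, fresh or not.
        snapshot = {b.name: b.latent for b in buffers}
        vision_age.append(tick - snapshot["vision"])
    return {b.name: b.refreshes for b in buffers}, max(vision_age)


counts, worst_vision_age = simulate()
print(counts, worst_vision_age)
```

Under these assumed rates, one second of control encodes language once, vision 25 times, and force 100 times, and the action head never waits on the slower streams: it simply reads a visual latent that is at most three ticks old.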

Method

DAM-VLA architecture. Each modality stream encodes tokens into its own latent buffer at its sensor rate: vision periodically, proprioception and force/torque at high frequency. The action expert reads all buffers continuously via parallel gated cross-attention (GCA) pathways: a global-gate pathway for visual memory and an input-dependent-gate pathway for force/torque. New modalities are added through dedicated cross-attention modules that preserve the pretrained self-attention structure.
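The two gated pathways described above can be sketched in NumPy as residual cross-attention updates. This is a sketch under stated assumptions rather than the paper's exact formulation: the source does not specify the gate functions, so the global gate is taken to be a single learned scalar passed through `tanh` (zero at initialization), and the input-dependent gate a per-token sigmoid of the action-expert hidden states. All function names, shapes, and parameters here are hypothetical.

```python
import numpy as np


def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def cross_attend(h, memory, proj):
    """Single-head cross-attention: queries from h, keys/values from a buffer."""
    Wq, Wk, Wv = proj
    q, k, v = h @ Wq, memory @ Wk, memory @ Wv
    return softmax(q @ k.T / np.sqrt(q.shape[-1])) @ v


def global_gate(alpha):
    # Assumed form: one learned scalar shared by all tokens, zero when alpha = 0.
    return np.tanh(alpha)


def input_gate(h, w_g, b_g):
    # Assumed form: one gate value per token, computed from the hidden state.
    return 1.0 / (1.0 + np.exp(-(h @ w_g + b_g)))  # shape (tokens, 1)


def parallel_gca(h, vision_buf, force_buf, p):
    """Residual update with two parallel gated pathways."""
    vis = global_gate(p["alpha"]) * cross_attend(h, vision_buf, p["vis_proj"])
    ft = input_gate(h, p["w_g"], p["b_g"]) * cross_attend(h, force_buf, p["ft_proj"])
    return h + vis + ft


rng = np.random.default_rng(0)
d = 8  # illustrative feature width


def make_proj():
    return tuple(rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))


params = {
    "alpha": 0.5,
    "vis_proj": make_proj(),
    "ft_proj": make_proj(),
    "w_g": rng.normal(size=(d, 1)),
    "b_g": 0.0,
}
h = rng.normal(size=(4, d))            # action-expert hidden states, 4 tokens
vision_buf = rng.normal(size=(6, d))   # visual memory buffer
force_buf = rng.normal(size=(10, d))   # force/torque buffer

out = parallel_gca(h, vision_buf, force_buf, params)
ft_gate = input_gate(h, params["w_g"], params["b_g"])

# With the global gate at zero, the visual pathway contributes nothing.
vision_only_at_init = h + global_gate(0.0) * cross_attend(
    h, vision_buf, params["vis_proj"]
)
```

Because each pathway only adds a gated term to the residual stream, a gate value of zero reduces that pathway to the identity. This is one way a new modality can be attached without disturbing the pretrained self-attention weights.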

Configuration	Isolates	Async	Force	Mem.	Integ.
Baselines
X-VLA₂₅	std. VLA regime (25 Hz)	✗	✗	✗	—
X-VLA₁₀₀	naive high-freq. (100 Hz)	✗	✗	✗	—
Ours
X-VLA_AFM	concat. baseline	✓	✓	✓	concat.
DAM-VLA_/F/M	async. alone	✓	✗	✗	—
DAM-VLA_/F	memory contribution	✓	✗	✓	GCA
DAM-VLA_/M	force contribution	✓	✓	✗	GCA
DAM-VLA (Ours)	full model	✓	✓	✓	GCA
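To make the "Isolates" column easier to check, the table can be encoded as data and each variant compared with the full model. The sketch below is illustrative only: the `Variant` class and field names are hypothetical, the table's dash in the "Integ." column is written as `"none"`, and the 25 Hz versus 100 Hz distinction between the two baselines is not captured by these flags.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class Variant:
    name: str
    asynchronous: bool
    force: bool
    memory: bool
    integration: str  # "GCA", "concat", or "none"


# Rows transcribed from the table above.
VARIANTS = [
    Variant("X-VLA_25", False, False, False, "none"),
    Variant("X-VLA_100", False, False, False, "none"),
    Variant("X-VLA_AFM", True, True, True, "concat"),
    Variant("DAM-VLA_/F/M", True, False, False, "none"),
    Variant("DAM-VLA_/F", True, False, True, "GCA"),
    Variant("DAM-VLA_/M", True, True, False, "GCA"),
    Variant("DAM-VLA", True, True, True, "GCA"),
]


def diff(a: Variant, b: Variant) -> list:
    """Names of the design factors on which two variants disagree."""
    return [
        f.name
        for f in fields(a)
        if f.name != "name" and getattr(a, f.name) != getattr(b, f.name)
    ]


full = VARIANTS[-1]
# Variants that differ from the full model in exactly one factor.
single_factor = {
    v.name: diff(v, full)[0] for v in VARIANTS[:-1] if len(diff(v, full)) == 1
}
print(single_factor)
```

Read this way, three variants are single-factor ablations of the full model: X-VLA_AFM changes only the integration mechanism, DAM-VLA_/F removes only force, and DAM-VLA_/M removes only memory.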

BibTeX

@misc{vanjani2026damvladecoupledasynchronousmultimodal,
  title={DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model},
  author={Pankhuri Vanjani and Zhuoyue Li and Jakub Suliga and Moritz Reuss and Gianluca Geraci and Xinkai Jiang and Rudolf Lioutikov},
  year={2026},
  eprint={2606.12105},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2606.12105},
}
