Synchronous versus asynchronous multimodal processing: a synchronous VLA forces every modality onto a single slow clock, whereas DAM-VLA processes each modality at its own native rate.
Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, so that each modality updates and retains information at its own sensor rate, yields stronger representations and more robust control. We present Decoupled Asynchronous Multimodal Vision Language Action (DAM-VLA), which maintains per-modality latent buffers that are refreshed at sensor rates and read continuously by the action head. New high-frequency modalities are integrated through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control.
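The idea of per-modality buffers refreshed at sensor rates and read on every control step can be illustrated with a minimal scheduling sketch. Everything named here is hypothetical and not taken from the DAM-VLA code: the `LatentBuffer` class, the `toy_encode` stand-in for a modality encoder, and in particular the 10 Hz vision rate, which is an assumed value chosen only to make the counts easy to follow.

```python
from dataclasses import dataclass


@dataclass
class LatentBuffer:
    """Holds the most recent latent for one modality (hypothetical helper)."""
    name: str
    period_ticks: int      # refresh every N control ticks; 0 means "set once at tick 0"
    latent: object = None
    refreshes: int = 0

    def maybe_refresh(self, tick, encode):
        if self.period_ticks == 0:
            due = tick == 0
        else:
            due = tick % self.period_ticks == 0
        if due:
            self.latent = encode(self.name, tick)
            self.refreshes += 1


def toy_encode(name, tick):
    # Stand-in for a real encoder: records which tick produced the latent.
    return (name, tick)


def run_episode(n_ticks, buffers, encode):
    """Run a control loop; the action head reads every buffer on every tick."""
    reads = 0
    for tick in range(n_ticks):
        for buf in buffers:
            buf.maybe_refresh(tick, encode)
        _latents = [buf.latent for buf in buffers]  # possibly stale, never missing
        reads += 1
    return reads


# One second of 100 Hz control. The vision rate is an ASSUMPTION (10 Hz).
buffers = [
    LatentBuffer("language", 0),       # constant across the episode
    LatentBuffer("vision", 10),        # assumed: every 10th tick
    LatentBuffer("force_torque", 1),   # every tick
]
reads = run_episode(100, buffers, toy_encode)
counts = {b.name: b.refreshes for b in buffers}
print(reads, counts)
```

In this toy setup the action head reads on all 100 ticks while vision is encoded only 10 times and language once, which is the decoupling the abstract describes: slow modalities are not re-encoded on every step, and fast ones are not held back to the slowest clock.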
DAM-VLA architecture. Each modality stream encodes tokens into independent latent buffers at its sensor rate: vision periodically, proprioception and force/torque at high frequency. The action expert reads all buffers continuously via parallel gated cross-attention (GCA) pathways: a global-gate pathway for visual memory and an input-dependent gate pathway for force/torque. New modalities are added through dedicated cross-attention modules that preserve the pretrained self-attention structure.
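As a rough illustration of the two gate types named in the caption, the sketch below applies single-head cross-attention from one action-expert query vector to a buffer and adds the result back through a gate. The exact gate parameterization is not given here, so both forms are assumptions: a single learned scalar passed through `tanh` for the global gate, and a sigmoid of a linear function of the query for the input-dependent gate. Key and value projections are omitted, and plain Python lists are used to keep the example dependency-free.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(query, memory):
    """Single-head attention of one query over buffer tokens (no projections)."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, tok) / scale for tok in memory])
    dim = len(memory[0])
    return [sum(w * tok[i] for w, tok in zip(weights, memory)) for i in range(dim)]


def global_gate_update(x, memory, alpha):
    """Assumed form: x + tanh(alpha) * attn(x, memory), with one scalar alpha."""
    g = math.tanh(alpha)
    ctx = cross_attend(x, memory)
    return [xi + g * ci for xi, ci in zip(x, ctx)]


def input_gate_update(x, memory, w, b):
    """Assumed form: x + sigmoid(w . x + b) * attn(x, memory)."""
    g = 1.0 / (1.0 + math.exp(-(dot(w, x) + b)))
    ctx = cross_attend(x, memory)
    return [xi + g * ci for xi, ci in zip(x, ctx)]


x = [1.0, 0.0]                              # one action-expert token
vision_buffer = [[1.0, 0.0], [0.0, 1.0]]    # toy visual-memory tokens
force_buffer = [[0.5, 0.5]]                 # toy force/torque token

# With alpha = 0 the global gate is closed and the token passes through unchanged.
unchanged = global_gate_update(x, vision_buffer, alpha=0.0)

# With zero weights and bias the input-dependent gate is 0.5.
mixed = input_gate_update(x, force_buffer, w=[0.0, 0.0], b=0.0)
print(unchanged, mixed)
```

A gate that starts closed leaves the residual stream identical to the pretrained model's, which is one way a new cross-attention module can be added without disturbing the existing self-attention structure. How the parallel pathways are combined is not specified in this section, so the two updates are shown separately.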
On seven contact-rich tasks with a Franka Emika Panda, DAM-VLA reaches 95.2% average success, more than double the strongest synchronous baseline (40.95%), while running reactively at a 100 Hz control frequency.
| Configuration | Isolates | Async | Force | Memory | Integration |
|---|---|---|---|---|---|
| **Baselines** | | | | | |
| X-VLA25 | standard VLA regime (25 Hz) | ✗ | ✗ | ✗ | n/a |
| X-VLA100 | naive high-frequency (100 Hz) | ✗ | ✗ | ✗ | n/a |
| **Ours** | | | | | |
| X-VLAAFM | concatenation baseline | ✓ | ✓ | ✓ | concat. |
| DAM-VLA/F/M | asynchrony alone | ✓ | ✗ | ✗ | n/a |
| DAM-VLA/F | memory contribution | ✓ | ✗ | ✓ | GCA |
| DAM-VLA/M | force contribution | ✓ | ✓ | ✗ | GCA |
| DAM-VLA (Ours) | full model | ✓ | ✓ | ✓ | GCA |
Figure: average success rate (%) by model.
Figure: per-task success rate (%) by model on Scarf, Whiteboard, Button, Handwash, Lego, Socket, and Sweep. Light = low, dark blue = high. 15 trials per task.
Figure legend: blue = clean success; orange = partial success.