FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies

CoRL 2025

Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers performance competitive with larger VLAs across 190 tasks spanning ten simulation and real-world benchmarks and demonstrates robustness across diverse robotic embodiments. In addition, FLOWER achieves a new SoTA of 4.53 on the CALVIN ABC benchmark.

Summary


Figure: FLOWER Architecture. Our VLA combines half of a Florence-2 VLM with a Flow Transformer architecture featuring action-specific Global-AdaLN-Zero conditioning and individual encoders and decoders for different action spaces. FLOWER achieves state-of-the-art performance on CALVIN and LIBERO benchmarks with only 950M parameters and just 4 hours of fine-tuning on 4 GPUs.

FLOWER demonstrates that efficient vision-language-action (VLA) policies can be achieved by strategically pruning 30-50% of a pretrained vision-language model’s final layers and conditioning the action generation module on intermediate embeddings instead. This “intermediate fusion” approach reallocates model capacity from less relevant final layers to the policy head, which is more critical for robotic tasks. The model incorporates a novel “Action-Space Global-AdaLN” normalization technique that reduces parameters by 20% while effectively handling diverse robotic action spaces through shared modulation weights and lightweight LoRA adapters. At just 950 million parameters, FLOWER achieves superior performance compared to much larger baselines while requiring significantly fewer computational resources: only 200 H100 GPU hours for pretraining versus the massive clusters needed by multi-billion-parameter models. This approach makes high-performing VLA policies more accessible and practical for real-world deployment on commodity hardware.
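As a concrete illustration, here is a minimal sketch of the intermediate-fusion idea, assuming a PyTorch VLM whose language model exposes its transformer blocks as a `layers` list and a generic flow-based policy head. `FlowTransformerHead`, the attribute names, and the call signatures are illustrative assumptions, not FLOWER's actual implementation.

```python
# Minimal sketch of intermediate-modality fusion (illustrative, not the official code).
# Assumption: the VLM's language model exposes its transformer blocks as `.layers`
# (an nn.ModuleList), and `flow_head` is any flow-matching action head.
import torch
import torch.nn as nn

class IntermediateFusionVLA(nn.Module):
    def __init__(self, vlm, flow_head, keep_ratio=0.5):
        super().__init__()
        # Prune the final LLM layers: the policy head is conditioned on
        # intermediate embeddings, so the later layers can be dropped and their
        # capacity reallocated to the action head.
        blocks = vlm.language_model.layers
        vlm.language_model.layers = blocks[: int(len(blocks) * keep_ratio)]
        self.vlm = vlm
        self.flow_head = flow_head

    def forward(self, images, input_ids, noisy_actions, timestep):
        # Intermediate VLM embeddings serve as conditioning for the flow head
        # (the exact forward signature of the VLM is an assumption here).
        context = self.vlm(images=images, input_ids=input_ids).last_hidden_state
        return self.flow_head(noisy_actions, timestep, context)
```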


Experimental Results

CALVIN

FLOWER achieves new state-of-the-art results on CALVIN, reaching a 99.3% success rate on the first task of the ABC→D split, and also sets new SoTA on the D→D and ABCD→D variants. It significantly improves upon the previous best-performing methods while requiring only 4 hours of finetuning on 4 GPUs.

Train→Test Method 1 2 3 4 5 Avg. Len.
ABC→D Diff-P-CNN 63.5% 35.3% 19.4% 10.7% 6.4% 1.35±0.05
  MDT 63.1% 42.9% 24.7% 15.1% 9.1% 1.55
  RoboFlamingo 82.4% 61.9% 46.6% 33.1% 23.5% 2.47
  SuSIE 87.0% 69.0% 49.0% 38.0% 26.0% 2.69
  DeerVLA 84.8% 72.3% 54.9% 44.6% 33.5% 2.90
  GR-1 85.4% 71.2% 59.6% 49.7% 40.1% 3.06
  OpenVLA 91.3% 77.8% 62.0% 52.1% 43.5% 3.27
  3DDA 93.8% 80.3% 66.2% 53.3% 41.2% 3.35
  RoboDual 94.4% 82.7% 72.1% 62.4% 54.4% 3.66
  MoDE 96.2% 88.9% 81.1% 71.8% 63.5% 4.01±0.04
  Seer 96.3% 91.6% 86.1% 80.3% 74.0% 4.28
  VPP 95.7% 91.2% 86.3% 81.0% 75.0% 4.29
  FLOWER (ours) 99.3% 96.0% 90.3% 82.3% 75.5% 4.44±0.04
  FLOWER (ours) w/ PrT 99.4% 95.8% 90.7% 84.9% 77.8% 4.53±0.04
ABCD→D Diff-P-CNN 86.3% 72.7% 60.1% 51.2% 41.7% 3.16±0.06
  RoboFlamingo 96.4% 89.6% 82.4% 74.0% 66.0% 4.09
  DeerVLA 99.1% 93.3% 82.1% 74.6% 63.8% 4.13
  GR-1 94.9% 89.6% 84.4% 78.9% 73.1% 4.21
  MoDE 97.1% 92.5% 87.9% 83.5% 77.9% 4.39±0.04
  MDT 98.6% 95.8% 91.6% 86.2% 80.1% 4.52±0.02
  FLOWER (ours) 98.9% 96.7% 93.9% 90.2% 85.5% 4.62±0.03
  FLOWER (ours) w/ PrT 99.2% 96.9% 96.9% 92.3% 88.3% 4.67±0.04
D→D MDT 93.7% 84.5% 74.1% 64.4% 55.6% 3.72±0.05
  RoboUniView 96.2% 88.8% 77.6% 66.6% 56.3% 3.85
  FLOWER (ours) w/ PrT 97.4% 92.4% 86.9% 81.3% 74.9% 4.35±0.02

Table: Experimental Results for the CALVIN ABC→D, ABCD→D, and D→D settings. Columns 1-5 report success rates for completing chains of 1-5 consecutive tasks; Avg. Len. is the average number of consecutively completed tasks (out of 5).

LIBERO

FLOWER consistently achieves very strong results across all LIBERO variants. Notably, it attains a 94.9% success rate on the challenging LIBERO-Long variant, matching the performance of models pretrained with orders of magnitude more compute, such as $\pi_{0.5}$. FLOWER requires only 4 hours of finetuning on 4 GPUs to reach this performance.

Method Spatial Object Goal Long 90 Avg
Diff-P-CNN 78.3 ± 1.1% 92.5 ± 0.7% 68.3 ± 1.2% 50.5 ± 1.3% - 72.4 ± 0.7%
Octo 78.9 ± 1.0% 85.7 ± 0.9% 84.6 ± 0.9% 51.1 ± 1.3% - 75.1 ± 0.6%
OpenVLA 84.7 ± 0.9% 88.4 ± 0.8% 79.2 ± 1.0% 53.7 ± 1.3% - 76.5 ± 0.6%
OpenVLA-OFT 97.6% 98.4% 97.9% 94.5% - 97.1%
CoA-VLA 85.3 ± 0.9% 93.1 ± 1.0% 85.8 ± 0.9% 55.0 ± 1.2% - 79.8 ± 0.5%
Baku - - - 86.0% 90.0% -
MiniVLA - - - - 86.0% -
MoDE - - - 94.0% 95.0% -
π₀ 96.8% 98.8% 95.8% 85.2% - 94.2%
π₀-FAST 96.4% 96.8% 88.6% 60.2% - 85.5%
π₀.₅-ki (from scratch) 96.6% 97.2% 94.6% 84.8% 92.7% 93.3%
π₀.₅-ki (from generalist) 98.0% 97.8% 95.6% 85.8% 96.0% 94.3%
FLOWER 97.5 ± 0.8% 99.1 ± 0.4% 96.1 ± 0.6% 94.9 ± 1.2% 94.7 ± 1.0% 96.9 ± 0.7%

Table: Experimental Results for the LIBERO benchmark (success rates). The Avg column is computed without LIBERO-90.

SIMPLER Benchmark

FLOWER outperforms Octo and OpenVLA across both the Bridge and Google Robot benchmarks with only 200 GPU hours of pretraining, achieving notably stronger performance on the Bridge benchmark while slightly trailing RT-1-X on Google Robot.

Method Open/Close Drawer Move Near Open Top Drawer and Place Apple Pick Coke Can Average
RT-1-X 59.7 31.7 21.3 56.7 42.4
Octo 22.7 4.2 0.0 17.0 11.0
CrossFormer 0.5 4.6 0.0 0.0 1.3
OpenVLA 35.6 46.2 0.0 16.3 24.5
FLOWER Cross-X Pret 27.8 43.3 0.0 56.3 31.9

Table: Experimental Results for the SIMPLER Google Robot Benchmark (success rates in %).

Method Put Carrot on Plate Spoon on Towel Stack the Blocks Eggplant in Yellow Basket Average
RT-1-X 4 0 0 0 1.1
Octo 8 12 0 43 16
CrossFormer 15 15 0 92 30
OpenVLA 0 0 0 4 1.0
FLOWER 13 71 8 88 45

Table: Experimental Results for the SIMPLER Bridge Benchmark (success rates in %).

Bi-Manual Aloha

FLOWER demonstrates strong performance in high-frequency bi-manual tasks, notably outperforming the specialist Action Chunking Transformer (ACT) on the challenging insertion task by a substantial margin. This validates its efficacy in handling diverse and complex robot action spaces.


Figure: Experimental Results for the Aloha Simulation Tasks.

Real-World Generalization

FLOWER shows strong generalization in real-world experiments, consistently surpassing the baseline OpenVLA by 28% on average across all scenarios tested, including flashlight-only lighting, environments with distracting objects, and interactions with novel items. However, it leaves room for improvement on certain complex tasks, underlining potential avenues for future research.

Move the banana from right stove to sink. (FLOWER w/ bg distractors)

Move the banana from right stove to sink. (OpenVLA w/ bg distractors)

Open the oven door. (FLOWER w/ bg distractors)

Open the oven door. (OpenVLA w/ bg distractors)

Move the banana from right stove to sink. (FLOWER w/ flashlight)

Move the banana from right stove to sink. (OpenVLA w/ flashlight)

Move the pot from the right stove to sink. (FLOWER w/ flashlight)

Move the pot from the right stove to sink. (OpenVLA w/ flashlight)

Real-World Single-Task and Multi-Task Long-Horizon Experiments

Representative evaluation videos of FLOWER on all 20 real-world tasks and 3 multi-task long-horizon sequences.

Single Task

Move banana from right stove to oven tray

Move banana from right stove to sink

Move banana from sink to right stove

Move banana from tray to right stove (FAILURE)

Close the ice box

Close the microwave

Close the oven

Open the ice box

Open the microwave

Open the oven

Pick up toast and put it in the sink

Move pot from left stove to sink

Move pot from left to right stove

Move pot from right stove to sink

Move pot from right to left stove

Move pot from sink to left stove (FAILURE)

Move pot from sink to right stove

Pull the oven tray

Push the oven tray

Push down the toaster lever (FAILURE)

Multi-Task Long-Horizon

See the appendix in the paper for a listing of individual tasks in each sequence.

Open/Close All sequence, video shows 6/6 evaluation.

Oven sequence, video shows 0/5 evaluation.

Stovetop Sink sequence, video shows 3/5 evaluation.

Cross-Action Space Flow Transformer

FLOWER handles heterogeneous robot action spaces with action-specific encoder and decoder modules around a shared Transformer backbone, enabling efficient weight sharing together with dimension-specific adaptation. It uses an action-specific Global-AdaLN-Zero approach that injects global conditioning signals (e.g., the flow timestep) with fewer parameters by sharing the modulation signals across all layers. Rotary position embeddings handle variable-length sequences without adding parameters. Finally, FLOWER employs RMSNorm and SwiGLU activations with action-specific normalization layers in a sandwich structure, coupled with Q-normalization in the attention layers, ensuring stability, computational efficiency, and strong performance.
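A rough sketch of how such action-specific Global-AdaLN-Zero conditioning could look is given below, based only on the description above: one zero-initialized modulation projection is shared by all transformer blocks, and small per-action-space low-rank adapters specialize it. The module names, the rank, and the 6-way split (shift/scale/gate for attention and MLP, as in standard AdaLN-Zero) are assumptions rather than FLOWER's exact implementation.

```python
# Sketch of action-specific Global-AdaLN-Zero conditioning (my reading of the
# description above; not the official implementation).
import torch
import torch.nn as nn

class GlobalAdaLNZero(nn.Module):
    def __init__(self, cond_dim, hidden_dim, action_spaces, rank=8):
        super().__init__()
        # One shared projection produces shift/scale/gate for attention and MLP
        # (6 vectors, as in AdaLN-Zero) and is reused by every transformer block.
        self.shared = nn.Linear(cond_dim, 6 * hidden_dim)
        nn.init.zeros_(self.shared.weight)  # zero-init: blocks start as identity
        nn.init.zeros_(self.shared.bias)
        # Lightweight low-rank adapters, one pair per action space.
        self.lora_down = nn.ModuleDict(
            {name: nn.Linear(cond_dim, rank, bias=False) for name in action_spaces})
        self.lora_up = nn.ModuleDict(
            {name: nn.Linear(rank, 6 * hidden_dim, bias=False) for name in action_spaces})
        for name in action_spaces:
            nn.init.zeros_(self.lora_up[name].weight)  # adapters also start at zero

    def forward(self, cond, action_space):
        # Shared modulation plus the action-specific low-rank correction.
        mod = self.shared(cond) + self.lora_up[action_space](self.lora_down[action_space](cond))
        # Computed once per forward pass and shared across all layers.
        return mod.chunk(6, dim=-1)
```

Each transformer block would then apply the usual AdaLN-Zero pattern, e.g. x = x + gate_attn * Attn(RMSNorm(x) * (1 + scale_attn) + shift_attn), reusing the same six vectors instead of computing per-layer modulation parameters.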


Figure: Cross-Action Space Flow Transformer
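For additional context, the sketch below shows how a cross-action-space flow policy of this kind could generate an action chunk at inference time: a generic flow-matching Euler sampler with per-action-space encoder and decoder modules around the shared transformer. The sampler, step count, and module interfaces are generic assumptions, not FLOWER's documented inference procedure.

```python
# Generic flow-matching sampler around a shared transformer with per-action-space
# encoders/decoders (illustrative assumption, not FLOWER's exact inference code).
import torch

@torch.no_grad()
def sample_action_chunk(flow_transformer, encoder, decoder, context,
                        chunk_len, act_dim, num_steps=10):
    """Integrate the learned velocity field from Gaussian noise to an action chunk."""
    batch = context.shape[0]
    x = torch.randn(batch, chunk_len, act_dim, device=context.device)  # start from noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((batch,), i * dt, device=context.device)
        tokens = encoder(x)                                 # action-specific encoder -> shared token space
        v = decoder(flow_transformer(tokens, t, context))   # predicted velocity, mapped back to action space
        x = x + dt * v                                      # Euler step along the flow
    return x
```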

BibTeX

@inproceedings{reuss2025flower,
  title     = {{FLOWER}: Democratizing Generalist Robot Policies with Efficient Vision-Language-Flow Models},
  author    = {Moritz Reuss and Hongyi Zhou and Marcel R{\"u}hle and {\"O}mer Erdin{\c{c}} Ya{\u{g}}murlu and Fabian Otto and Rudolf Lioutikov},
  booktitle = {9th Annual Conference on Robot Learning},
  year      = {2025},
  url       = {https://openreview.net/forum?id=JeppaebLRD}
}

Acknowledgements

The work presented here was funded by the German Research Foundation (DFG) – 448648559.

Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals

Figure: MDT-V overview.

The Multimodal Diffusion Transformer (MDT) is a novel framework that learns versatile behaviors from multimodal goals with minimal language annotations. Leveraging a transformer backbone, MDT aligns image and language-based goal embeddings through two self-supervised objectives, enabling it to tackle long-horizon manipulation tasks. In benchmark tests like CALVIN and LIBERO, MDT outperforms prior methods by 15% while using fewer parameters. Its effectiveness is demonstrated in both simulated and real-world environments, highlighting its potential in settings with sparse language data.

Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models

Figure: NILS overview.

Using pre-trained vision-language models, NILS detects objects, identifies changes, segments tasks, and annotates behavior datasets. Evaluations on the BridgeV2 and kitchen play datasets demonstrate its effectiveness in annotating diverse, unstructured robot demonstrations while addressing the limitations of traditional human labeling methods.