FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies

Abstract

Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers performance competitive with larger VLAs across 190 tasks spanning ten simulation and real-world benchmarks, and demonstrates robustness across diverse robotic embodiments. In addition, FLOWER sets a new state-of-the-art average sequence length of 4.53 on the CALVIN ABC benchmark.
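As a rough illustration of the intermediate-modality fusion idea, the sketch below conditions a small flow-matching action head on hidden states taken from an intermediate layer of the VLM, with the layers above that point pruned away. Everything here (the HuggingFace-style VLM interface, the `FlowHead` module, the pooling choice, and the linear interpolation path) is an illustrative assumption in standard flow-matching style, not the released FLOWER implementation.

```python
# Minimal sketch of intermediate-modality fusion with a flow-matching action head.
# Assumptions (not the released FLOWER code): a HuggingFace-style VLM that exposes
# per-layer hidden states, and a toy `FlowHead` velocity predictor.
import torch
import torch.nn as nn


class FlowHead(nn.Module):
    """Hypothetical velocity predictor conditioned on fused VLM features."""

    def __init__(self, act_dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, noisy_actions, t, cond):
        # Concatenate the noisy action, pooled condition, and flow time.
        return self.net(torch.cat([noisy_actions, cond, t.unsqueeze(-1)], dim=-1))


def flow_matching_loss(vlm, flow_head, pixel_values, input_ids, actions, fusion_layer=12):
    """Condition on an intermediate LLM layer; the layers above it are pruned away."""
    out = vlm(pixel_values=pixel_values, input_ids=input_ids, output_hidden_states=True)
    cond = out.hidden_states[fusion_layer].mean(dim=1)       # pooled intermediate features
    x0 = torch.randn_like(actions)                           # noise sample
    t = torch.rand(actions.shape[0], device=actions.device)  # flow time in [0, 1]
    xt = (1 - t)[:, None] * x0 + t[:, None] * actions        # linear interpolation path
    v_target = actions - x0                                  # rectified-flow velocity target
    v_pred = flow_head(xt, t, cond)
    return ((v_pred - v_target) ** 2).mean()
```

Because the conditioning is read out mid-network, the upper LLM layers can be dropped and their parameter budget reallocated to the flow head.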

Experimental Results


Figure: FLOWER Architecture.

CALVIN

FLOWER achieves new state-of-the-art results, reaching an exceptional first-task success rate of 99.3% on CALVIN ABC. It significantly improves upon previous best-performing methods, attaining the highest mean sequence lengths on both CALVIN ABC and ABCD benchmarks, demonstrating superior long-horizon generalization and efficiency.

Train→Test Method 1 2 3 4 5 Avg. Len.
ABC→D Diff-P-CNN 63.5% 35.3% 19.4% 10.7% 6.4% 1.35±0.05
  MDT 63.1% 42.9% 24.7% 15.1% 9.1% 1.55
  RoboFlamingo 82.4% 61.9% 46.6% 33.1% 23.5% 2.47
  SuSIE 87.0% 69.0% 49.0% 38.0% 26.0% 2.69
  DeerVLA 84.8% 72.3% 54.9% 44.6% 33.5% 2.90
  GR-1 85.4% 71.2% 59.6% 49.7% 40.1% 3.06
  OpenVLA 91.3% 77.8% 62.0% 52.1% 43.5% 3.27
  3DDA 93.8% 80.3% 66.2% 53.3% 41.2% 3.35
  RoboDual 94.4% 82.7% 72.1% 62.4% 54.4% 3.66
  MoDE 96.2% 88.9% 81.1% 71.8% 63.5% 4.01±0.04
  Seer 96.3% 91.6% 86.1% 80.3% 74.0% 4.28
  VPP 95.7% 91.2% 86.3% 81.0% 75.0% 4.29
  FLOWER (ours) 99.3% 96.0% 90.3% 82.3% 75.5% 4.44±0.04
  FLOWER (ours) w/ PrT 99.4% 95.8% 90.7% 84.9% 77.8% 4.53±0.04
ABCD→D Diff-P-CNN 86.3% 72.7% 60.1% 51.2% 41.7% 3.16±0.06
  RoboFlamingo 96.4% 89.6% 82.4% 74.0% 66.0% 4.09
  DeerVLA 99.1% 93.3% 82.1% 74.6% 63.8% 4.13
  GR-1 94.9% 89.6% 84.4% 78.9% 73.1% 4.21
  MoDE 97.1% 92.5% 87.9% 83.5% 77.9% 4.39±0.04
  MDT 98.6% 95.8% 91.6% 86.2% 80.1% 4.52±0.02
  FLOWER (ours) 98.9% 96.7% 93.9% 90.2% 85.5% 4.62±0.03
  FLOWER (ours) w/ PrT 99.2% 96.9% 96.9% 92.3% 88.3% 4.67±0.04
D→D MDT 93.7% 84.5% 74.1% 64.4% 55.6% 3.72±0.05
  RoboUniView 96.2% 88.8% 77.6% 66.6% 56.3% 3.85
  FLOWER (ours) w/ PrT 97.4% 92.4% 86.9% 81.3% 74.9% 4.35±0.02

Table: Success rates for completing 1–5 instructions in a row and average rollout length (Avg. Len.) on the CALVIN ABC→D, ABCD→D, and D→D settings.

SIMPLER Benchmark

With only 200 GPU-hours of pretraining, FLOWER outperforms Octo and OpenVLA on both the SIMPLER Bridge and Google Robot benchmarks, achieving notably stronger performance on Bridge while trailing RT-1-X on the Google Robot tasks.

Method Open/Close Drawer Move Near Open Top Drawer and Place Apple Pick Coke Can Average
RT-1-X 59.7 31.7 21.3 56.7 42.4
Octo 22.7 4.2 0.0 17.0 11.0
CrossFormer 0.5 4.6 0.0 0.0 1.3
OpenVLA 35.6 46.2 0.0 16.3 24.5
FLOWER Cross-X Pret 27.8 43.3 0.0 56.3 31.9

Table: Success rates (%) on the SIMPLER Google Robot benchmark.

Method Put Carrot on Plate Put Spoon on Towel Stack the Blocks Put Eggplant in Yellow Basket Average
RT-1-X 4 0 0 0 1.1
Octo 8 12 0 43 16
CrossFormer 15 15 0 92 30
OpenVLA 0 0 0 4 1.0
FLOWER 13 71 8 88 45

Table: Success rates (%) on the SIMPLER Bridge benchmark.

LIBERO

FLOWER achieves consistently strong results across all LIBERO variants, exceeding 93% success on every suite, including LIBERO-90. Notably, it reaches 94.9% on the challenging LIBERO-Long variant, markedly outperforming earlier generalist policies such as Octo and OpenVLA (roughly 51–54%) and highlighting robust long-horizon capability.

Method Spatial Object Goal Long 90 Avg
Diff-P-CNN 78.3 ± 1.1% 92.5 ± 0.7% 68.3 ± 1.2% 50.5 ± 1.3% - 72.4 ± 0.7%
Octo 78.9 ± 1.0% 85.7 ± 0.9% 84.6 ± 0.9% 51.1 ± 1.3% - 75.1 ± 0.6%
OpenVLA 84.7 ± 0.9% 88.4 ± 0.8% 79.2 ± 1.0% 53.7 ± 1.3% - 76.5 ± 0.6%
OpenVLA-OFT 97.6% 98.4% 97.9% 94.5% - 97.1%
CoA-VLA 85.3 ± 0.9% 93.1 ± 1.0% 85.8 ± 0.9% 55.0 ± 1.2% - 79.8 ± 0.5%
Baku - - - 86.0% 90.0% -
MiniVLA - - - - 86.0% -
MoDE - - - 94.0% 95.0% -
π₀ 96.8% 98.8% 95.8% 85.2% - 94.2%
π₀-FAST 96.4% 96.8% 88.6% 60.2% - 85.5%
π₀.₅-ki (from scratch) 96.6% 97.2% 94.6% 84.8% 92.7% 93.3%
π₀.₅-ki (from generalist) 98.0% 97.8% 95.6% 85.8% 96.0% 94.3%
FLOWER 97.5 ± 0.8% 99.1 ± 0.4% 96.1 ± 0.6% 94.9 ± 1.2% 94.7 ± 1.0% 96.9 ± 0.7%

Table: Success rates on the LIBERO benchmark. The Avg column excludes LIBERO-90.

Bi-Manual Aloha

FLOWER demonstrates strong performance in high-frequency bi-manual tasks, notably outperforming the specialist Action Chunking Transformer (ACT) on the challenging insertion task by a substantial margin. This validates its efficacy in handling diverse and complex robot action spaces.


Figure: Experimental Results for the Aloha Simulation Tasks.

Real-World Generalization

FLOWER excels in real-world generalization experiments, surpassing the OpenVLA baseline by 28% on average across all tested scenarios, including flashlight-only lighting, scenes with background distractors, and interactions with novel objects. It still leaves room for improvement on certain complex tasks, pointing to avenues for future work.

Move the banana from right stove to sink. (FLOWER w/ bg distractors)

Move the banana from right stove to sink. (OpenVLA w/ bg distractors)

Open the oven door. (FLOWER w/ bg distractors)

Open the oven door. (OpenVLA w/ bg distractors)

Move the banana from right stove to sink. (FLOWER w/ flashlight)

Move the banana from right stove to sink. (OpenVLA w/ flashlight)

Move the pot from the right stove to sink. (FLOWER w/ flashlight)

Move the pot from the right stove to sink. (OpenVLA w/ flashlight)

Real-World Single-Task and Multi-Task Long Horizon Experiments

Representative evaluation videos of FLOWER on all 20 real-world tasks and 3 multi-task long-horizon sequences.

Single Task

Move banana from right stove to oven tray

Move banana from right stove to sink

Move banana from sink to right stove

Move banana from tray to right stove (FAILURE)

Close the ice box

Close the microwave

Close the oven

Open the ice box

Open the microwave

Open the oven

Pick up toast and put it in the sink

Move pot from left stove to sink

Move pot from left to right stove

Move pot from right stove to sink

Move pot from right to left stove

Move pot from sink to left stove (FAILURE)

Move pot from sink to right stove

Pull the oven tray

Push the oven tray

Push down the toaster lever (FAILURE)

Multi-Task Long Horizon

See the appendix in the paper for a listing of individual tasks in each sequence.

Open/Close All sequence, video shows 6/6 evaluation.

Oven sequence, video shows 0/5 evaluation.

Stovetop Sink sequence, video shows 3/5 evaluation.

Cross-Action Space Flow Transformer

FLOWER handles heterogeneous robot action spaces through action-specific encoder/decoder modules attached to a shared Transformer backbone, enabling efficient weight sharing while adapting to each embodiment's action dimensionality. An action-specific Global-AdaLN-Zero scheme injects global conditioning signals (e.g., timesteps) with fewer parameters by sharing the modulation signals across all layers. Rotary position embeddings provide positional information and handle variable-length sequences without adding parameters. Finally, FLOWER combines RMSNorm and SwiGLU activations with action-specific normalization layers in a sandwich configuration, together with Q-normalization in the attention layers, for stability, computational efficiency, and strong performance.
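To make the weight sharing concrete, below is a minimal PyTorch sketch of how action-specific encoders/decoders and a Global-AdaLN-Zero modulation network could be wired together: one small conditioning MLP per action space emits scale/shift/gate signals that every transformer layer reuses, instead of each layer carrying its own AdaLN projection. The class name, the choice of six modulation signals, and the zero initialization follow the common DiT-style convention and are illustrative assumptions rather than the released FLOWER implementation.

```python
# Hedged sketch of action-specific Global-AdaLN-Zero conditioning (PyTorch).
# One modulation MLP per action space produces scale/shift/gate signals that every
# transformer layer reuses; module names and dimensions are illustrative only.
import torch
import torch.nn as nn


class CrossActionConditioning(nn.Module):
    def __init__(self, dim: int, action_spaces: dict[str, int]):
        super().__init__()
        # Action-specific encoders/decoders map each embodiment's action dimension
        # to and from the shared transformer width.
        self.encoders = nn.ModuleDict({k: nn.Linear(d, dim) for k, d in action_spaces.items()})
        self.decoders = nn.ModuleDict({k: nn.Linear(dim, d) for k, d in action_spaces.items()})
        # One global modulation network per action space: six signals
        # (shift/scale/gate for the attention and MLP sublayers).
        self.modulation = nn.ModuleDict({
            k: nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim)) for k in action_spaces
        })
        # Zero-init so the conditioning pathway starts as an identity ("-Zero").
        for m in self.modulation.values():
            nn.init.zeros_(m[1].weight)
            nn.init.zeros_(m[1].bias)

    def forward(self, noisy_actions, cond, action_space: str):
        tokens = self.encoders[action_space](noisy_actions)
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = \
            self.modulation[action_space](cond).chunk(6, dim=-1)
        # Every transformer block reuses these six tensors, e.g.:
        #   h = h + gate_a * attn(norm(h) * (1 + scale_a) + shift_a)
        #   h = h + gate_m * mlp(norm(h) * (1 + scale_m) + shift_m)
        return tokens, (shift_a, scale_a, gate_a, shift_m, scale_m, gate_m)

    def decode(self, tokens, action_space: str):
        # Map transformer outputs back to the embodiment's action dimensionality.
        return self.decoders[action_space](tokens)
```

Sharing a single modulation projection across all layers is what removes the per-layer AdaLN parameters, and the zero-initialized output keeps the conditioning pathway inert at the start of training.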


Figure: Cross-Action Space Flow Transformer

Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals

Figure: MDT-V Overview.

The Multimodal Diffusion Transformer (MDT) is a novel framework that learns versatile behaviors from multimodal goals with minimal language annotations. Leveraging a transformer backbone, MDT aligns image and language-based goal embeddings through two self-supervised objectives, enabling it to tackle long-horizon manipulation tasks. In benchmark tests like CALVIN and LIBERO, MDT outperforms prior methods by 15% while using fewer parameters. Its effectiveness is demonstrated in both simulated and real-world environments, highlighting its potential in settings with sparse language data.
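As a rough illustration of the goal-alignment idea, the sketch below aligns image-goal and language-goal embeddings of the same demonstration with a symmetric InfoNCE-style contrastive loss. This is only a generic stand-in for the alignment concept; MDT's actual two self-supervised objectives are described in its paper.

```python
# Generic sketch of contrastive alignment between image-goal and language-goal
# embeddings of the same demonstration (a stand-in, not MDT's exact objectives).
import torch
import torch.nn.functional as F


def goal_alignment_loss(img_goal_emb, lang_goal_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matching image/language goal pairs are positives."""
    img = F.normalize(img_goal_emb, dim=-1)
    lang = F.normalize(lang_goal_emb, dim=-1)
    logits = img @ lang.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(img.shape[0], device=img.device)
    loss_i2l = F.cross_entropy(logits, targets)      # image -> language
    loss_l2i = F.cross_entropy(logits.t(), targets)  # language -> image
    return 0.5 * (loss_i2l + loss_l2i)
```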

Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models

Figure: NILS Overview.

Using pre-trained vision-language models, NILS detects objects, identifies changes, segments tasks, and annotates behavior datasets. Evaluations on the BridgeV2 and kitchen play datasets demonstrate its effectiveness in annotating diverse, unstructured robot demonstrations while addressing the limitations of traditional human labeling methods.