FLOWER: Democratizing Generalist Robot Policies with Efficient Vision-Language-Action Flow Policies

Abstract

Developing efficient Vision-Language-Action (VLA) policies is crucial for practical robotics deployment, yet current approaches face prohibitive computational costs and resource requirements. Existing diffusion-based VLA policies require multi-billion-parameter models and massive datasets to achieve strong performance. We tackle this efficiency challenge with two contributions: intermediate-modality fusion, which reallocates capacity to the diffusion head by pruning up to 50% of LLM layers, and action-specific Global-AdaLN conditioning, which cuts parameters by 20% through modular adaptation. We integrate these advances into a novel 950M-parameter VLA called FLOWER. Pretrained in just 200 H100 GPU hours, FLOWER delivers performance competitive with larger VLAs across 190 tasks spanning ten simulation and real-world benchmarks, and demonstrates robustness across diverse robotic embodiments. In addition, FLOWER sets a new state-of-the-art average sequence length of 4.53 on the CALVIN ABC benchmark.
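As a rough illustration of the intermediate-modality fusion idea, the sketch below conditions a small flow-matching action head on hidden states taken from an intermediate layer of the VLM, with the layers above that point pruned away. Everything here (the HuggingFace-style VLM interface, the `FlowHead` module, the pooling choice, and the linear interpolation path) is an illustrative assumption in standard flow-matching style, not the released FLOWER implementation.

```python
# Minimal sketch of intermediate-modality fusion with a flow-matching action head.
# Assumptions (not the released FLOWER code): a HuggingFace-style VLM that exposes
# per-layer hidden states, and a toy `FlowHead` velocity predictor.
import torch
import torch.nn as nn


class FlowHead(nn.Module):
    """Hypothetical velocity predictor conditioned on fused VLM features."""

    def __init__(self, act_dim: int, cond_dim: int, hidden: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(act_dim + cond_dim + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, act_dim),
        )

    def forward(self, noisy_actions, t, cond):
        # Concatenate the noisy action, pooled condition, and flow time.
        return self.net(torch.cat([noisy_actions, cond, t.unsqueeze(-1)], dim=-1))


def flow_matching_loss(vlm, flow_head, pixel_values, input_ids, actions, fusion_layer=12):
    """Condition on an intermediate LLM layer; the layers above it are pruned away."""
    out = vlm(pixel_values=pixel_values, input_ids=input_ids, output_hidden_states=True)
    cond = out.hidden_states[fusion_layer].mean(dim=1)       # pooled intermediate features
    x0 = torch.randn_like(actions)                           # noise sample
    t = torch.rand(actions.shape[0], device=actions.device)  # flow time in [0, 1]
    xt = (1 - t)[:, None] * x0 + t[:, None] * actions        # linear interpolation path
    v_target = actions - x0                                  # rectified-flow velocity target
    v_pred = flow_head(xt, t, cond)
    return ((v_pred - v_target) ** 2).mean()
```

Because the conditioning is read out mid-network, the upper LLM layers can be dropped and their parameter budget reallocated to the flow head.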

Experimental Results


Figure: FLOWER Architecture.

CALVIN

FLOWER achieves new state-of-the-art results, reaching an exceptional first-task success rate of 99.3% on CALVIN ABC. It significantly improves upon previous best-performing methods, attaining the highest mean sequence lengths on both CALVIN ABC and ABCD benchmarks, demonstrating superior long-horizon generalization and efficiency.

Train→Test Method 1 2 3 4 5 Avg. Len.
ABC→D Diff-P-CNN 63.5% 35.3% 19.4% 10.7% 6.4% 1.35±0.05
  MDT 63.1% 42.9% 24.7% 15.1% 9.1% 1.55
  RoboFlamingo 82.4% 61.9% 46.6% 33.1% 23.5% 2.47
  SuSIE 87.0% 69.0% 49.0% 38.0% 26.0% 2.69
  DeerVLA 84.8% 72.3% 54.9% 44.6% 33.5% 2.90
  GR-1 85.4% 71.2% 59.6% 49.7% 40.1% 3.06
  OpenVLA 91.3% 77.8% 62.0% 52.1% 43.5% 3.27
  3DDA 93.8% 80.3% 66.2% 53.3% 41.2% 3.35
  RoboDual 94.4% 82.7% 72.1% 62.4% 54.4% 3.66
  MoDE 96.2% 88.9% 81.1% 71.8% 63.5% 4.01±0.04
  Seer 96.3% 91.6% 86.1% 80.3% 74.0% 4.28
  VPP 95.7% 91.2% 86.3% 81.0% 75.0% 4.29
  FLOWER (ours) 99.3% 96.0% 90.3% 82.3% 75.5% 4.44±0.04
  FLOWER (ours) w/ PrT 99.4% 95.8% 90.7% 84.9% 77.8% 4.53±0.04
ABCD→D Diff-P-CNN 86.3% 72.7% 60.1% 51.2% 41.7% 3.16±0.06
  RoboFlamingo 96.4% 89.6% 82.4% 74.0% 66.0% 4.09
  DeerVLA 99.1% 93.3% 82.1% 74.6% 63.8% 4.13
  GR-1 94.9% 89.6% 84.4% 78.9% 73.1% 4.21
  MoDE 97.1% 92.5% 87.9% 83.5% 77.9% 4.39±0.04
  MDT 98.6% 95.8% 91.6% 86.2% 80.1% 4.52±0.02
  FLOWER (ours) 98.9% 96.7% 93.9% 90.2% 85.5% 4.62±0.03
  FLOWER (ours) w/ PrT 99.2% 96.9% 96.9% 92.3% 88.3% 4.67±0.04
D→D MDT 93.7% 84.5% 74.1% 64.4% 55.6% 3.72±0.05
  RoboUniView 96.2% 88.8% 77.6% 66.6% 56.3% 3.85
  FLOWER (ours) w/ PrT 97.4% 92.4% 86.9% 81.3% 74.9% 4.35±0.02

Table: Success rates for completing 1–5 instructions in a row and average rollout length (Avg. Len.) on the CALVIN ABC→D, ABCD→D, and D→D settings.

SIMPLER Benchmark

With only 200 GPU-hours of pretraining, FLOWER outperforms Octo and OpenVLA on both the SIMPLER Bridge and Google Robot benchmarks, achieving notably stronger performance on Bridge while trailing RT-1-X on the Google Robot tasks.

Method Open/Close Drawer Move Near Open Top Drawer and Place Apple Pick Coke Can Average
RT-1-X 59.7 31.7 21.3 56.7 42.4
Octo 22.7 4.2 0.0 17.0 11.0
CrossFormer 0.5 4.6 0.0 0.0 1.3
OpenVLA 35.6 46.2 0.0 16.3 24.5
FLOWER Cross-X Pret 27.8 43.3 0.0 56.3 31.9

Table: Success rates (%) on the SIMPLER Google Robot benchmark.

Method Put Carrot on Plate Put Spoon on Towel Stack the Blocks Put Eggplant in Yellow Basket Average
RT-1-X 4 0 0 0 1.1
Octo 8 12 0 43 16
CrossFormer 15 15 0 92 30
OpenVLA 0 0 0 4 1.0
FLOWER 13 71 8 88 45

Table: Success rates (%) on the SIMPLER Bridge benchmark.

LIBERO

FLOWER achieves consistently strong results across all LIBERO variants, exceeding 93% success on every suite, including LIBERO-90. Notably, it reaches 94.9% on the challenging LIBERO-Long variant, markedly outperforming earlier generalist policies such as Octo and OpenVLA (roughly 51–54%) and highlighting robust long-horizon capability.

Method Spatial Object Goal Long 90 Avg
Diff-P-CNN 78.3 ± 1.1% 92.5 ± 0.7% 68.3 ± 1.2% 50.5 ± 1.3% - 72.4 ± 0.7%
Octo 78.9 ± 1.0% 85.7 ± 0.9% 84.6 ± 0.9% 51.1 ± 1.3% - 75.1 ± 0.6%
OpenVLA 84.7 ± 0.9% 88.4 ± 0.8% 79.2 ± 1.0% 53.7 ± 1.3% - 76.5 ± 0.6%
OpenVLA-OFT 97.6% 98.4% 97.9% 94.5% - 97.1%
CoA-VLA 85.3 ± 0.9% 93.1 ± 1.0% 85.8 ± 0.9% 55.0 ± 1.2% - 79.8 ± 0.5%
Baku - - - 86.0% 90.0% -
MiniVLA - - - - 86.0% -
MoDE - - - 94.0% 95.0% -
π₀ 96.8% 98.8% 95.8% 85.2% - 94.2%
π₀-FAST 96.4% 96.8% 88.6% 60.2% - 85.5%
π₀.₅-ki (from scratch) 96.6% 97.2% 94.6% 84.8% 92.7% 93.3%
π₀.₅-ki (from generalist) 98.0% 97.8% 95.6% 85.8% 96.0% 94.3%
FLOWER 97.5 ± 0.8% 99.1 ± 0.4% 96.1 ± 0.6% 94.9 ± 1.2% 94.7 ± 1.0% 96.9 ± 0.7%

Table: Success rates on the LIBERO benchmark. The Avg column excludes LIBERO-90.

Bi-Manual Aloha

FLOWER demonstrates strong performance in high-frequency bi-manual tasks, notably outperforming the specialist Action Chunking Transformer (ACT) on the challenging insertion task by a substantial margin. This validates its efficacy in handling diverse and complex robot action spaces.


Figure: Experimental Results for the Aloha Simulation Tasks.

Real-World Generalization

FLOWER excels in real-world generalization experiments, surpassing the OpenVLA baseline by 28% on average across all tested scenarios, including flashlight-only lighting, scenes with background distractors, and interactions with novel objects. It still leaves room for improvement on certain complex tasks, pointing to avenues for future work.

Move the banana from right stove to sink. (FLOWER w/ bg distractors)

Move the banana from right stove to sink. (OpenVLA w/ bg distractors)

Open the oven door. (FLOWER w/ bg distractors)

Open the oven door. (OpenVLA w/ bg distractors)

Move the banana from right stove to sink. (FLOWER w/ flashlight)

Move the banana from right stove to sink. (OpenVLA w/ flashlight)

Move the pot from the right stove to sink. (FLOWER w/ flashlight)

Move the pot from the right stove to sink. (OpenVLA w/ flashlight)

Real-World Single-Task and Multi-Task Long Horizon Experiments

Representative evaluation videos of FLOWER on all 20 real-world tasks and 3 multi-task long-horizon sequences.

Single Task

Move banana from right stove to oven tray

Move banana from right stove to sink

Move banana from sink to right stove

Move banana from tray to right stove (FAILURE)

Close the ice box

Close the microwave

Close the oven

Open the ice box

Open the microwave

Open the oven

Pick up toast and put it in the sink

Move pot from left stove to sink

Move pot from left to right stove

Move pot from right stove to sink

Move pot from right to left stove

Move pot from sink to left stove (FAILURE)

Move pot from sink to right stove

Pull the oven tray

Push the oven tray

Push down the toaster lever (FAILURE)

Multi-Task Long Horizon

See the appendix in the paper for a listing of individual tasks in each sequence.

Open/Close All sequence, video shows 6/6 evaluation.

Oven sequence, video shows 0/5 evaluation.

Stovetop Sink sequence, video shows 3/5 evaluation.

Cross-Action Space Flow Transformer

FLOWER handles heterogeneous robot action spaces through action-specific encoder/decoder modules attached to a shared Transformer backbone, enabling efficient weight sharing while adapting to each embodiment's action dimensionality. An action-specific Global-AdaLN-Zero scheme injects global conditioning signals (e.g., timesteps) with fewer parameters by sharing the modulation signals across all layers. Rotary position embeddings provide positional information and handle variable-length sequences without adding parameters. Finally, FLOWER combines RMSNorm and SwiGLU activations with action-specific normalization layers in a sandwich configuration, together with Q-normalization in the attention layers, for stability, computational efficiency, and strong performance.
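To make the weight sharing concrete, below is a minimal PyTorch sketch of how action-specific encoders/decoders and a Global-AdaLN-Zero modulation network could be wired together: one small conditioning MLP per action space emits scale/shift/gate signals that every transformer layer reuses, instead of each layer carrying its own AdaLN projection. The class name, the choice of six modulation signals, and the zero initialization follow the common DiT-style convention and are illustrative assumptions rather than the released FLOWER implementation.

```python
# Hedged sketch of action-specific Global-AdaLN-Zero conditioning (PyTorch).
# One modulation MLP per action space produces scale/shift/gate signals that every
# transformer layer reuses; module names and dimensions are illustrative only.
import torch
import torch.nn as nn


class CrossActionConditioning(nn.Module):
    def __init__(self, dim: int, action_spaces: dict[str, int]):
        super().__init__()
        # Action-specific encoders/decoders map each embodiment's action dimension
        # to and from the shared transformer width.
        self.encoders = nn.ModuleDict({k: nn.Linear(d, dim) for k, d in action_spaces.items()})
        self.decoders = nn.ModuleDict({k: nn.Linear(dim, d) for k, d in action_spaces.items()})
        # One global modulation network per action space: six signals
        # (shift/scale/gate for the attention and MLP sublayers).
        self.modulation = nn.ModuleDict({
            k: nn.Sequential(nn.SiLU(), nn.Linear(dim, 6 * dim)) for k in action_spaces
        })
        # Zero-init so the conditioning pathway starts as an identity ("-Zero").
        for m in self.modulation.values():
            nn.init.zeros_(m[1].weight)
            nn.init.zeros_(m[1].bias)

    def forward(self, noisy_actions, cond, action_space: str):
        tokens = self.encoders[action_space](noisy_actions)
        shift_a, scale_a, gate_a, shift_m, scale_m, gate_m = \
            self.modulation[action_space](cond).chunk(6, dim=-1)
        # Every transformer block reuses these six tensors, e.g.:
        #   h = h + gate_a * attn(norm(h) * (1 + scale_a) + shift_a)
        #   h = h + gate_m * mlp(norm(h) * (1 + scale_m) + shift_m)
        return tokens, (shift_a, scale_a, gate_a, shift_m, scale_m, gate_m)

    def decode(self, tokens, action_space: str):
        # Map transformer outputs back to the embodiment's action dimensionality.
        return self.decoders[action_space](tokens)
```

Sharing a single modulation projection across all layers is what removes the per-layer AdaLN parameters, and the zero-initialized output keeps the conditioning pathway inert at the start of training.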


Figure: Cross-Action Space Flow Transformer

Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals

Figure: MDT-V Overview.

The Multimodal Diffusion Transformer (MDT) is a novel framework that learns versatile behaviors from multimodal goals with minimal language annotations. Leveraging a transformer backbone, MDT aligns image and language-based goal embeddings through two self-supervised objectives, enabling it to tackle long-horizon manipulation tasks. In benchmark tests like CALVIN and LIBERO, MDT outperforms prior methods by 15% while using fewer parameters. Its effectiveness is demonstrated in both simulated and real-world environments, highlighting its potential in settings with sparse language data.
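As a rough illustration of the goal-alignment idea, the sketch below aligns image-goal and language-goal embeddings of the same demonstration with a symmetric InfoNCE-style contrastive loss. This is only a generic stand-in for the alignment concept; MDT's actual two self-supervised objectives are described in its paper.

```python
# Generic sketch of contrastive alignment between image-goal and language-goal
# embeddings of the same demonstration (a stand-in, not MDT's exact objectives).
import torch
import torch.nn.functional as F


def goal_alignment_loss(img_goal_emb, lang_goal_emb, temperature=0.07):
    """Symmetric InfoNCE loss: matching image/language goal pairs are positives."""
    img = F.normalize(img_goal_emb, dim=-1)
    lang = F.normalize(lang_goal_emb, dim=-1)
    logits = img @ lang.t() / temperature            # (B, B) similarity matrix
    targets = torch.arange(img.shape[0], device=img.device)
    loss_i2l = F.cross_entropy(logits, targets)      # image -> language
    loss_l2i = F.cross_entropy(logits.t(), targets)  # language -> image
    return 0.5 * (loss_i2l + loss_l2i)
```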

Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models

Figure: NILS Overview.

Using pre-trained vision-language models, NILS detects objects, identifies changes, segments tasks, and annotates behavior datasets. Evaluations on the BridgeV2 and kitchen play datasets demonstrate its effectiveness in annotating diverse, unstructured robot demonstrations while addressing the limitations of traditional human labeling methods.