Experimental Results

Figure: FLOWER Architecture.
CALVIN
FLOWER sets a new state of the art on CALVIN, reaching a first-task success rate of 99.3% on the ABC→D split. It improves upon the previous best-performing methods, attaining the highest mean sequence length on both the ABC→D and ABCD→D benchmarks and demonstrating strong long-horizon generalization and efficiency.
| Train→Test | Method | 1 | 2 | 3 | 4 | 5 | Avg. Len. |
|---|---|---|---|---|---|---|---|
| ABC→D | Diff-P-CNN | 63.5% | 35.3% | 19.4% | 10.7% | 6.4% | 1.35±0.05 |
| | MDT | 63.1% | 42.9% | 24.7% | 15.1% | 9.1% | 1.55 |
| | RoboFlamingo | 82.4% | 61.9% | 46.6% | 33.1% | 23.5% | 2.47 |
| | SuSIE | 87.0% | 69.0% | 49.0% | 38.0% | 26.0% | 2.69 |
| | DeerVLA | 84.8% | 72.3% | 54.9% | 44.6% | 33.5% | 2.90 |
| | GR-1 | 85.4% | 71.2% | 59.6% | 49.7% | 40.1% | 3.06 |
| | OpenVLA | 91.3% | 77.8% | 62.0% | 52.1% | 43.5% | 3.27 |
| | 3DDA | 93.8% | 80.3% | 66.2% | 53.3% | 41.2% | 3.35 |
| | RoboDual | 94.4% | 82.7% | 72.1% | 62.4% | 54.4% | 3.66 |
| | MoDE | 96.2% | 88.9% | 81.1% | 71.8% | 63.5% | 4.01±0.04 |
| | Seer | 96.3% | 91.6% | 86.1% | 80.3% | 74.0% | 4.28 |
| | VPP | 95.7% | 91.2% | 86.3% | 81.0% | 75.0% | 4.29 |
| | FLOWER (ours) | 99.3% | 96.0% | 90.3% | 82.3% | 75.5% | 4.44±0.04 |
| | FLOWER (ours) w/ PrT | 99.4% | 95.8% | 90.7% | 84.9% | 77.8% | 4.53±0.04 |
| ABCD→D | Diff-P-CNN | 86.3% | 72.7% | 60.1% | 51.2% | 41.7% | 3.16±0.06 |
| | RoboFlamingo | 96.4% | 89.6% | 82.4% | 74.0% | 66.0% | 4.09 |
| | DeerVLA | 99.1% | 93.3% | 82.1% | 74.6% | 63.8% | 4.13 |
| | GR-1 | 94.9% | 89.6% | 84.4% | 78.9% | 73.1% | 4.21 |
| | MoDE | 97.1% | 92.5% | 87.9% | 83.5% | 77.9% | 4.39±0.04 |
| | MDT | 98.6% | 95.8% | 91.6% | 86.2% | 80.1% | 4.52±0.02 |
| | FLOWER (ours) | 98.9% | 96.7% | 93.9% | 90.2% | 85.5% | 4.62±0.03 |
| | FLOWER (ours) w/ PrT | 99.2% | 96.9% | 96.9% | 92.3% | 88.3% | 4.67±0.04 |
| D→D | MDT | 93.7% | 84.5% | 74.1% | 64.4% | 55.6% | 3.72±0.05 |
| | RoboUniView | 96.2% | 88.8% | 77.6% | 66.6% | 56.3% | 3.85 |
| | FLOWER (ours) w/ PrT | 97.4% | 92.4% | 86.9% | 81.3% | 74.9% | 4.35±0.02 |
Table: Experimental Results for the CALVIN ABC→D, ABCD→D, and D→D settings.
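
For readers less familiar with the CALVIN long-horizon protocol: columns 1–5 report the success rate of completing the 1st through 5th instruction of a five-task chain, and Avg. Len. is the mean number of consecutively completed tasks per rollout. Up to rounding and averaging over seeds, that mean equals the sum of the five per-position success rates; the small Python check below (an illustrative sketch, not evaluation code) verifies this against one row of the table.

```python
# The expected number of consecutively solved tasks in a 5-task chain is the
# sum of the per-position success rates, i.e. sum over i of P(complete >= i).
def mean_sequence_length(success_rates):
    return sum(success_rates)

# FLOWER (ours), ABC→D row of the table above:
print(mean_sequence_length([0.993, 0.960, 0.903, 0.823, 0.755]))  # 4.434 ≈ reported 4.44
```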
SIMPLER Benchmark
With only 200 GPU-hours of pretraining, FLOWER outperforms Octo and OpenVLA on both the Bridge and Google Robot benchmarks. Its advantage is most pronounced on the Bridge benchmark, while it trails RT-1-X on the Google Robot tasks.
Method | Open/Close Drawer | Move Near | Open Top Drawer and Place Apple | Pick Coke Can | Average |
---|---|---|---|---|---|
RT-1-X | 59.7 | 31.7 | 21.3 | 56.7 | 42.4 |
Octo | 22.7 | 4.2 | 0.0 | 17.0 | 11.0 |
CrossFormer | 0.5 | 4.6 | 0.0 | 0.0 | 1.3 |
OpenVLA | 35.6 | 46.2 | 0.0 | 16.3 | 24.5 |
FLOWER Cross-X Pret | 27.8 | 43.3 | 0.0 | 56.3 | 31.9 |
Table: Experimental Results for the SIMPLER Google Robot Benchmark.
Method | Put Carrot on Plate | Spoon on Towel | Stack the Blocks | Eggplant in Yellow Basket | Average |
---|---|---|---|---|---|
RT-1-X | 4 | 0 | 0 | 0 | 1.1 |
Octo | 8 | 12 | 0 | 43 | 16 |
CrossFormer | 15 | 15 | 0 | 92 | 30 |
OpenVLA | 0 | 0 | 0 | 4 | 1.0 |
FLOWER | 13 | 71 | 8 | 88 | 45 |
Table: Experimental Results for the SIMPLER Bridge Benchmark.
LIBERO
FLOWER achieves strong results across all LIBERO variants and is the only method in the comparison to surpass a 93% success rate on every suite, including LIBERO-90. Notably, it attains a 94.9% success rate on the challenging LIBERO-Long variant, markedly outperforming generalist baselines such as Octo and OpenVLA (approx. 51–54% on LIBERO-Long) and highlighting robust long-horizon capability.
Method | Spatial | Object | Goal | Long | 90 | Avg |
---|---|---|---|---|---|---|
Diff-P-CNN | 78.3 ± 1.1% | 92.5 ± 0.7% | 68.3 ± 1.2% | 50.5 ± 1.3% | - | 72.4 ± 0.7% |
Octo | 78.9 ± 1.0% | 85.7 ± 0.9% | 84.6 ± 0.9% | 51.1 ± 1.3% | - | 75.1 ± 0.6% |
OpenVLA | 84.7 ± 0.9% | 88.4 ± 0.8% | 79.2 ± 1.0% | 53.7 ± 1.3% | - | 76.5 ± 0.6% |
OpenVLA-OFT | 97.6% | 98.4% | 97.9% | 94.5% | - | 97.1% |
CoA-VLA | 85.3 ± 0.9% | 93.1 ± 1.0% | 85.8 ± 0.9% | 55.0 ± 1.2% | - | 79.8 ± 0.5% |
Baku | - | - | - | 86.0% | 90.0% | - |
MiniVLA | - | - | - | - | 86.0% | - |
MoDE | - | - | - | 94.0% | 95.0% | - |
π₀ | 96.8% | 98.8% | 95.8% | 85.2% | - | 94.2% |
π₀-FAST | 96.4% | 96.8% | 88.6% | 60.2% | - | 85.5% |
π₀.₅-ki (from scratch) | 96.6% | 97.2% | 94.6% | 84.8% | 92.7% | 93.3% |
π₀.₅-ki (from generalist) | 98.0% | 97.8% | 95.6% | 85.8% | 96.0% | 94.3% |
FLOWER | 97.5 ± 0.8% | 99.1 ± 0.4% | 96.1 ± 0.6% | 94.9 ± 1.2% | 94.7 ± 1.0% | 96.9 ± 0.7% |
Table: Experimental Results for the LIBERO benchmark. Average without LIBERO-90.
Bi-Manual Aloha
FLOWER demonstrates strong performance in high-frequency bi-manual tasks, notably outperforming the specialist Action Chunking Transformer (ACT) on the challenging insertion task by a substantial margin. This validates its efficacy in handling diverse and complex robot action spaces.

Figure: Experimental Results for the Aloha Simulation Tasks.
Real-World Generalization
FLOWER excels in generalization experiments conducted in real-world settings, consistently surpassing the baseline OpenVLA by 28% on average across all scenarios tested, including flashlight-only conditions, environments with distracting objects, and interactions with novel items. However, it shows room for improvement on certain complex tasks, underlining potential avenues for future research.
Move the banana from right stove to sink. (FLOWER w/ bg distractors)
Move the banana from right stove to sink. (OpenVLA w/ bg distractors)
Open the oven door. (FLOWER w/ bg distractors)
Open the oven door. (OpenVLA w/ bg distractors)
Move the banana from right stove to sink. (FLOWER w/ flashlight)
Move the banana from right stove to sink. (OpenVLA w/ flashlight)
Move the pot from the right stove to sink. (FLOWER w/ flashlight)
Move the pot from the right stove to sink. (OpenVLA w/ flashlight)
Real-World Single-Task and Multi-Task Long Horizon Experiments
Representative evaluation videos of FLOWER on all 20 real-world tasks and 3 multi-task long horizon sequences.
Single Task
Move banana from right stove to oven tray
Move banana from right stove to sink
Move banana from sink to right stove
Move banana from tray to right stove (FAILURE)
Close the ice box
Close the microwave
Close the oven
Open the ice box
Open the microwave
Open the oven
Pick up toast and put it in the sink
Move pot from left stove to sink
Move pot from left to right stove
Move pot from right stove to sink
Move pot from right to left stove
Move pot from sink to left stove (FAILURE)
Move pot from sink to right stove
Pull the oven tray
Push the oven tray
Push down the toaster lever (FAILURE)
Multi-Task Long Horizon
See the appendix in the paper for a listing of individual tasks in each sequence.
Open/Close All sequence, video shows 6/6 evaluation.
Oven sequence, video shows 0/5 evaluation.
Stovetop Sink sequence, video shows 3/5 evaluation.
Cross-Action Space Flow Transformer
FLOWER handles heterogeneous robot action spaces through action-specific encoder and decoder modules attached to a shared Transformer trunk, combining efficient weight sharing with dimension-specific adaptation. An action-specific global AdaLN-Zero scheme injects global conditioning signals (e.g., timesteps) with few additional parameters by sharing one set of modulation signals across all layers. Positional information is encoded with rotary embeddings (RoPE), which handle variable-length sequences without adding parameters. Finally, FLOWER combines RMSNorm and SwiGLU activations with action-specific normalization layers in a sandwich arrangement, together with Q-normalization in the attention layers, for stability, computational efficiency, and strong performance.

Figure: Cross-Action Space Flow Transformer
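
To make the weight-sharing and conditioning scheme concrete, the sketch below illustrates these ideas in PyTorch. It is an approximation under stated assumptions rather than the released FLOWER implementation: the class names (`CrossActionSpaceTransformer`, `GlobalAdaLNZero`) and all dimensions are invented for the example, and rotary embeddings, Q-normalization, and the sandwich/per-action-space normalization layers are omitted for brevity.

```python
# Minimal sketch of the cross-action-space design: per-action-space
# encoder/decoder heads around a shared transformer trunk, plus an
# action-specific *global* AdaLN-Zero module whose modulation output is
# reused by every layer instead of learning separate AdaLN weights per layer.
import torch
import torch.nn as nn
import torch.nn.functional as F


class RMSNorm(nn.Module):
    def __init__(self, dim, eps=1e-6):
        super().__init__()
        self.weight = nn.Parameter(torch.ones(dim))
        self.eps = eps

    def forward(self, x):
        return self.weight * x * torch.rsqrt(x.pow(2).mean(-1, keepdim=True) + self.eps)


class SwiGLU(nn.Module):
    """Gated SiLU feed-forward block."""
    def __init__(self, dim, hidden):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x):
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))


class GlobalAdaLNZero(nn.Module):
    """One modulation network per action space, shared by all layers: maps the
    global conditioning vector (e.g. a timestep embedding) to shift/scale/gate terms."""
    def __init__(self, cond_dim, dim):
        super().__init__()
        self.proj = nn.Linear(cond_dim, 6 * dim)
        nn.init.zeros_(self.proj.weight)  # zero init: every block starts as identity
        nn.init.zeros_(self.proj.bias)

    def forward(self, cond):
        return self.proj(F.silu(cond)).chunk(6, dim=-1)


class Block(nn.Module):
    def __init__(self, dim, heads):
        super().__init__()
        self.norm1 = RMSNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = RMSNorm(dim)
        self.ff = SwiGLU(dim, 4 * dim)

    def forward(self, x, mod):
        sa_shift, sa_scale, sa_gate, ff_shift, ff_scale, ff_gate = mod
        h = self.norm1(x) * (1 + sa_scale) + sa_shift
        x = x + sa_gate * self.attn(h, h, h, need_weights=False)[0]
        h = self.norm2(x) * (1 + ff_scale) + ff_shift
        return x + ff_gate * self.ff(h)


class CrossActionSpaceTransformer(nn.Module):
    """Shared trunk with one encoder/decoder head (and AdaLN module) per action space."""
    def __init__(self, action_dims, dim=256, heads=8, depth=6, cond_dim=256):
        super().__init__()
        self.encoders = nn.ModuleDict({k: nn.Linear(d, dim) for k, d in action_dims.items()})
        self.decoders = nn.ModuleDict({k: nn.Linear(dim, d) for k, d in action_dims.items()})
        self.adaln = nn.ModuleDict({k: GlobalAdaLNZero(cond_dim, dim) for k in action_dims})
        self.blocks = nn.ModuleList([Block(dim, heads) for _ in range(depth)])

    def forward(self, noisy_actions, cond, action_space):
        x = self.encoders[action_space](noisy_actions)                  # route to the matching head
        mod = [m.unsqueeze(1) for m in self.adaln[action_space](cond)]  # broadcast over time
        for block in self.blocks:                                       # same modulation in every layer
            x = block(x, mod)
        return self.decoders[action_space](x)                           # per-action-space output


# One trunk serving a 7-DoF single arm and a 14-DoF bi-manual setup:
model = CrossActionSpaceTransformer({"single_arm": 7, "bimanual": 14})
out = model(torch.randn(2, 10, 14), torch.randn(2, 256), "bimanual")
print(out.shape)  # torch.Size([2, 10, 14])
```

Only the small encoder, decoder, and AdaLN modules are duplicated per action space; the attention and feed-forward weights of the trunk are shared across all embodiments, which is what keeps the parameter count low as new action spaces are added.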
Related Projects
Multimodal Diffusion Transformer: Learning Versatile Behavior from Multimodal Goals

The Multimodal Diffusion Transformer (MDT) is a novel framework that learns versatile behaviors from multimodal goals with minimal language annotations. Leveraging a transformer backbone, MDT aligns image and language-based goal embeddings through two self-supervised objectives, enabling it to tackle long-horizon manipulation tasks. In benchmark tests like CALVIN and LIBERO, MDT outperforms prior methods by 15% while using fewer parameters. Its effectiveness is demonstrated in both simulated and real-world environments, highlighting its potential in settings with sparse language data.
Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models

Using pre-trained vision-language models, NILS detects objects, identifies changes, segments tasks, and annotates behavior datasets. Evaluations on the BridgeV2 and kitchen play datasets demonstrate its effectiveness in annotating diverse, unstructured robot demonstrations while addressing the limitations of traditional human labeling methods.