This work introduces SPARC (Spatial Annotations from Robot Demonstrations with Reliability Calibration), a risk-aware framework that automatically labels robot demonstrations with structured spatial annotations and assigns each annotation a reliability score. Structured spatial annotations — bounding boxes, object trajectories, and manipulation phase labels — benefit a broad range of robotics applications, from training grounded robot policies and embodied foundation models to motion planning and hierarchical task composition.
Existing automated pipelines generate such annotations at scale but provide no reliable quality signal: detector confidence is poorly calibrated for annotation correctness, forcing a choice between accepting noisy labels and discarding useful samples. SPARC instead leverages the spatio-temporal structure inherent to robot tasks to generate a reliability signal, reducing noisy labels while retaining more useful samples.
We further introduce IA-Bench, a benchmark that measures model accuracy in grounding the locations of interacted objects in robot demonstrations. On 1.7k human-annotated demonstrations spanning diverse embodiments and scenarios, SPARC significantly outperforms detection-only baselines in object localization accuracy while also retaining three times more samples at high-precision operating points. Our experiments demonstrate that models fine-tuned on our annotations achieve state-of-the-art results on object-grounding and pointing benchmarks among similarly sized models, while remaining competitive on broader spatial-reasoning suites without any manually verified or annotated training data. Furthermore, policies trained on SPARC-generated annotations significantly outperform baselines in cluttered, visually ambiguous real-world scenes.
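To make the notion of "samples retained at a high-precision operating point" concrete, the following is a minimal sketch of one way such a number could be computed. It is not taken from the SPARC codebase: the function name `retention_at_precision`, its inputs, and the thresholding rule are illustrative assumptions. Given a per-annotation score (detector confidence or a reliability score) and a flag saying whether each annotation is correct, it finds the largest score-thresholded subset whose precision meets a target and reports the fraction of samples kept.

```python
from typing import Sequence


def retention_at_precision(
    scores: Sequence[float],
    correct: Sequence[bool],
    target_precision: float,
) -> float:
    """Fraction of samples kept by the most permissive score threshold
    whose retained set still reaches `target_precision`.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    if not scores:
        return 0.0

    # Rank annotations from most to least confident.
    ranked = sorted(zip(scores, correct), key=lambda p: p[0], reverse=True)

    best_kept = 0
    hits = 0
    for k, (score, ok) in enumerate(ranked, start=1):
        hits += int(ok)
        # A threshold cannot separate tied scores, so only evaluate a cut
        # where the next score differs (or at the end of the list).
        is_boundary = k == len(ranked) or ranked[k][0] != score
        if is_boundary and hits / k >= target_precision:
            best_kept = k
    return best_kept / len(ranked)
```

Running the same function on two scoring methods at the same target precision gives two retention fractions whose ratio can be compared directly, which is the kind of comparison the "three times more samples" statement describes.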
Detector baseline (left) vs. SPARC (right) on the same demonstration.
We train reasoning VLAs on 250 demonstrations across 10 cluttered tasks. The quality of the reasoning annotations drives the performance gap in visually ambiguous scenes.
In cluttered scenes with visually similar objects, detector-based annotation selects the wrong object. SPARC uses interaction evidence to localize the correct one.
SPARC automatically generates diverse VQA samples from robot demonstrations. Each group shows annotation types from the same trajectory.
Failure cases in which SPARC selects the wrong object (IoU < 0.5), including both high-confidence failures (R > 0.6) and low-confidence failures. Gray boxes mark the ground-truth annotation.
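The failure criteria in this caption can be sketched in a few lines. The snippet below is an illustrative assumption rather than the authors' evaluation code: it assumes axis-aligned boxes in `(x_min, y_min, x_max, y_max)` form, assumes R denotes the per-annotation reliability score, and treats any failure with R at or below 0.6 as low-confidence. The names `Box`, `iou`, and `categorize` are hypothetical.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box; the coordinate convention is an assumption."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    inter_w = max(0.0, min(a.x_max, b.x_max) - max(a.x_min, b.x_min))
    inter_h = max(0.0, min(a.y_max, b.y_max) - max(a.y_min, b.y_min))
    inter = inter_w * inter_h
    area_a = (a.x_max - a.x_min) * (a.y_max - a.y_min)
    area_b = (b.x_max - b.x_min) * (b.y_max - b.y_min)
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def categorize(
    pred: Box,
    gt: Box,
    reliability: float,
    iou_threshold: float = 0.5,   # caption: IoU < 0.5 counts as a failure
    high_conf: float = 0.6,       # caption: R > 0.6 is a high-confidence failure
) -> str:
    """Label a prediction as correct or as one of the two failure types."""
    if iou(pred, gt) >= iou_threshold:
        return "correct"
    if reliability > high_conf:
        return "high_confidence_failure"
    return "low_confidence_failure"
```

For example, a prediction `Box(0, 0, 2, 2)` against ground truth `Box(1, 1, 3, 3)` overlaps in a unit square, giving an IoU of 1/7, so it counts as a failure regardless of its reliability score.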
@inproceedings{sparc2026,
title = {{SPARC}: Reliable Spatial Annotations from Robot Demonstrations at Scale},
author = {Blank, Nils and Mattes, Paul and Li, Maximilian Xiling and Suliga, Jakub and Roth, Thomas and Reuss, Moritz and Vanjani, Pankhuri and Lioutikov, Rudolf},
year = {2026}
}