SPARC: Reliable Spatial Annotations from Robot Demonstrations at Scale

This work introduces SPARC (Spatial Annotations from Robot Demonstrations with Reliability Calibration), a risk-aware framework that automatically labels robot demonstrations with structured spatial annotations and assigns each annotation a reliability score. Structured spatial annotations (bounding boxes, object trajectories, and manipulation phase labels) benefit a broad range of robotics applications, from training grounded robot policies and embodied foundation models to motion planning and hierarchical task composition.

Existing automated pipelines generate such annotations at scale but provide no reliable quality signal: detector confidence is poorly calibrated for annotation correctness, forcing a choice between accepting noisy labels and discarding useful samples. SPARC instead leverages the spatio-temporal structure inherent to robot tasks to generate a reliability signal, which reduces noisy labels while retaining more useful samples.

We further introduce IA-Bench, a benchmark that measures model accuracy in grounding the locations of interacted objects in robot demonstrations. On 1.7k human-annotated demonstrations spanning diverse embodiments and scenarios, SPARC significantly outperforms detection-only baselines in object localization accuracy while also retaining three times more samples at high-precision operating points. Models fine-tuned on our annotations, without any manually verified or annotated training data, achieve state-of-the-art results on object-grounding and pointing benchmarks among similarly sized models while remaining competitive on broader spatial-reasoning suites. Furthermore, policies trained on SPARC-generated annotations significantly outperform baselines in cluttered, visually ambiguous real-world scenes.
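To make the notion of sample retention at a high-precision operating point concrete, the following is a minimal sketch and not SPARC's evaluation code. It assumes each annotation carries a scalar score (a reliability score or a detector confidence) and a binary correctness label from human annotation; the function name `retention_at_precision` and the toy data are hypothetical and chosen only for illustration.

```python
from typing import List, Optional, Tuple


def retention_at_precision(
    scores: List[float],
    correct: List[bool],
    target_precision: float,
) -> Tuple[Optional[float], float]:
    """Find the lowest score threshold whose kept set meets the target precision.

    Returns (threshold, retained_fraction). The threshold is None and the
    fraction is 0.0 if no threshold reaches the target.
    """
    if len(scores) != len(correct):
        raise ValueError("scores and correct must have the same length")
    if not scores:
        return None, 0.0

    # Rank annotations from most to least trusted.
    ranked = sorted(zip(scores, correct), key=lambda pair: pair[0], reverse=True)

    best_threshold: Optional[float] = None
    best_kept = 0
    hits = 0
    for k, (score, ok) in enumerate(ranked, start=1):
        hits += int(ok)
        # Only cut between distinct scores, so tied samples are kept together.
        at_boundary = k == len(ranked) or ranked[k][0] != score
        if at_boundary and hits / k >= target_precision:
            best_threshold, best_kept = score, k

    return best_threshold, best_kept / len(ranked)


# Toy data (hypothetical): the same six annotations scored by two signals.
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]
correct_under_signal_a = [True, True, True, False, True, False]
correct_under_signal_b = [True, True, False, True, True, False]

kept_a = retention_at_precision(scores, correct_under_signal_a, 1.0)
kept_b = retention_at_precision(scores, correct_under_signal_b, 1.0)
print(kept_a, kept_b)
```

A score that ranks correct annotations ahead of incorrect ones lets the threshold sit lower before precision drops, so more samples survive filtering at the same precision target. In the toy data, signal A keeps half of the samples at perfect precision, while signal B keeps a third.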