Abstract
Datasets providing high-quality visual annotations in manipulation-relevant scenes remain scarce. We introduce MOPS, a dataset generation framework that combines 3D assets from PartNet-Mobility and RoboCasa with a zero-shot LLM-based augmentation pipeline to automatically normalize object scale and generate part-level affordance annotations, describing how an object part can be manipulated (e.g., a mug handle is “graspable,” a drawer is “pullable”). Built on ManiSkill3, MOPS produces photorealistic indoor scenes with pixel-perfect ground truth for class, part, and instance segmentation, multi-label affordances, depth, surface normals, and 6D poses, spanning 54 affordance types across 137 object categories. Human verification confirms 97.3% accuracy of the zero-shot affordance labels. We validate MOPS on three vision benchmarks of increasing scene complexity and show that ground-truth affordance masks improve imitation learning success rates on 24 RoboCasa manipulation tasks by 7.9 percentage points over RGB-only baselines, with predicted affordances still yielding measurable gains. The dataset and framework are publicly available.
Annotation Modalities
Rich, multi-modal ground truth for every scene




Key Features
Photorealistic Simulation
High-quality rendering via ManiSkill3 & SAPIEN, built on a normalized asset pipeline with automatic part-level annotation across multiple 3D libraries.
LLM-Powered Annotation
Zero-shot asset augmentation using large language models for automatic part-level labeling, scale normalization, and semantic understanding — 97.3% accurate against human verification.
Multi-Modal Ground Truth
RGB, depth, surface normals, part segmentation, affordance maps (graspable, pushable, …), and 6D pose — all pixel-aligned.
Diverse Environments
Kitchen environments, cluttered tabletops, and isolated object scenarios spanning 137 object categories and 54 affordance labels.
Results
Dataset Comparison
Taxonomic coverage vs. existing affordance datasets
| Dataset | Aff. | Cat. | Obj. |
|---|---|---|---|
| RGB-D Part | 7 | 17 | 105 |
| 3D-AffNet | 16 | 23 | 22,949 |
| MOPS (Total) | 54 | 137 | 3,353 |
MOPS leads on affordance label and category breadth; 3D-AffNet has more raw instances.
Robot Manipulation
Imitation learning on 24 RoboCasa tasks · 10 seeds each
| Policy Inputs | Success | Gain |
|---|---|---|
| RGB only | 13.33% | — |
| + MOPS Affordances | 21.25% | +7.92 |
MOPS affordances provide a consistent boost across all 24 tasks.
Getting Started
Citation
If you use MOPS in your research, please cite our work
@article{li2026mops,
title = {Multi-Objective Photoreal Simulation (MOPS) Dataset
for Computer Vision in Robot Manipulation},
author = {Maximilian Xiling Li and Paul Mattes and
Nils Blank and Rudolf Lioutikov},
year = {2026}
}