MOPS: Multi-Objective Photoreal Simulation Dataset for Computer Vision in Robot Manipulation

Abstract

Datasets providing high-quality visual annotations in manipulation-relevant scenes remain scarce. We introduce MOPS, a dataset generation framework that combines 3D assets from PartNet-Mobility and RoboCasa with a zero-shot LLM-based augmentation pipeline to automatically normalize object scale and generate part-level affordance annotations, describing how an object part can be manipulated (e.g., a mug handle is “graspable,” a drawer is “pullable”). Built on ManiSkill3, MOPS produces photorealistic indoor scenes with pixel-perfect ground truth for class, part, and instance segmentation, multi-label affordances, depth, surface normals, and 6D poses, spanning 54 affordance types across 137 object categories. Human verification confirms 97.3% accuracy of the zero-shot affordance labels. We validate MOPS on three vision benchmarks of increasing scene complexity and show that ground-truth affordance masks improve imitation learning success rates on 24 RoboCasa manipulation tasks by 7.9 percentage points over RGB-only baselines, with predicted affordances still yielding measurable gains. The dataset and framework are publicly available.

Annotation Modalities

Rich, multi-modal ground truth for every scene

RGB

Depth

Surface Normals

Segmentation

Key Features

🎨

Photorealistic Simulation

High-quality rendering via ManiSkill3 & SAPIEN, built on a normalized asset pipeline with automatic part-level annotation across multiple 3D libraries.

🤖

LLM-Powered Annotation

Zero-shot asset augmentation using large language models for automatic part-level labeling, scale normalization, and semantic understanding — 97.3% accurate against human verification.

🏷️

Multi-Modal Ground Truth

RGB, depth, surface normals, part segmentation, affordance maps (graspable, pushable, …), and 6D pose — all pixel-aligned.

🏠

Diverse Environments

Kitchen environments, cluttered tabletops, and isolated object scenarios spanning 137 object categories and 54 affordance labels.

Results

Dataset Comparison

Taxonomic coverage vs. existing affordance datasets

Dataset	Aff.	Cat.	Obj.
RGB-D Part	7	17	105
3D-AffNet	16	23	22,949
MOPS (Total)	54	137	3,353

MOPS leads on affordance label and category breadth; 3D-AffNet has more raw instances.

Robot Manipulation

Imitation learning on 24 RoboCasa tasks · 10 seeds each

21.25%

Success Rate

RGB + MOPS Affordances

+7.92pp

Absolute Gain

over RGB-only baseline

Policy Inputs	Success	Gain
RGB only	13.33%	—
+ MOPS Affordances	21.25%	+7.92

MOPS affordances provide a consistent boost across all 24 tasks.

Getting Started

Alpha

Early release — actively developed, API may change. Code and dataset are now publicly available:

⚙️ mops-repo — Generation framework & benchmarks Available 🤗 HuggingFace — MOPS dataset collection Available

📖 Setup & Installation Guide →

Citation

If you use MOPS in your research, please cite our work

@article{li2026mops,
  title   = {Multi-Objective Photoreal Simulation (MOPS) Dataset
             for Computer Vision in Robot Manipulation},
  author  = {Maximilian Xiling Li and Paul Mattes and
             Nils Blank and Rudolf Lioutikov},
  year    = {2026}
}