MOPS: Multi-Objective Photoreal Simulation Dataset
for Computer Vision in Robot Manipulation

Maximilian X. Li, Paul Mattes, Nils Blank, Rudolf Lioutikov
Intuitive Robots Lab, Karlsruhe Institute of Technology, Germany

Abstract

Datasets providing high-quality visual annotations in manipulation-relevant scenes remain scarce. We introduce MOPS, a dataset generation framework that combines 3D assets from PartNet-Mobility and RoboCasa with a zero-shot LLM-based augmentation pipeline to automatically normalize object scale and generate part-level affordance annotations, describing how an object part can be manipulated (e.g., a mug handle is “graspable,” a drawer is “pullable”). Built on ManiSkill3, MOPS produces photorealistic indoor scenes with pixel-perfect ground truth for class, part, and instance segmentation, multi-label affordances, depth, surface normals, and 6D poses, spanning 54 affordance types across 137 object categories. Human verification confirms 97.3% accuracy of the zero-shot affordance labels. We validate MOPS on three vision benchmarks of increasing scene complexity and show that ground-truth affordance masks improve imitation learning success rates on 24 RoboCasa manipulation tasks by 7.9 percentage points over RGB-only baselines, with predicted affordances still yielding measurable gains. The dataset and framework are publicly available.


Annotation Modalities

Rich, multi-modal ground truth for every scene


Key Features

🎨

Photorealistic Simulation

High-quality rendering via ManiSkill3 & SAPIEN, built on a normalized asset pipeline with automatic part-level annotation across multiple 3D libraries.

🤖

LLM-Powered Annotation

Zero-shot asset augmentation using large language models for automatic part-level labeling, scale normalization, and semantic understanding — 97.3% accurate against human verification.

🏷️

Multi-Modal Ground Truth

RGB, depth, surface normals, part segmentation, affordance maps (graspable, pushable, …), and 6D pose — all pixel-aligned.

🏠

Diverse Environments

Kitchen environments, cluttered tabletops, and isolated object scenarios spanning 137 object categories and 54 affordance labels.


Results

Dataset Comparison

Taxonomic coverage vs. existing affordance datasets

Dataset Aff. Cat. Obj.
RGB-D Part 7 17 105
3D-AffNet 16 23 22,949
MOPS (Total) 54 137 3,353

MOPS leads on affordance label and category breadth; 3D-AffNet has more raw instances.

Robot Manipulation

Imitation learning on 24 RoboCasa tasks · 10 seeds each

21.25%
Success Rate
RGB + MOPS Affordances
+7.92pp
Absolute Gain
over RGB-only baseline
Policy Inputs Success Gain
RGB only 13.33%
+ MOPS Affordances 21.25% +7.92

MOPS affordances provide a consistent boost across all 24 tasks.


Getting Started


Citation

If you use MOPS in your research, please cite our work

@article{li2026mops,
  title   = {Multi-Objective Photoreal Simulation (MOPS) Dataset
             for Computer Vision in Robot Manipulation},
  author  = {Maximilian Xiling Li and Paul Mattes and
             Nils Blank and Rudolf Lioutikov},
  year    = {2026}
}