Accepted to CVPR 2026
Abstract
Figure: SIR extracts initial node embeddings from the image and connects all objects in a fully connected graph. Afterwards the sparsification module reduces the fully connected graph to an important sub-graph, which is embedded using a GNN with attention layers. This embedding serves as the scene representation for a downstream action generator.
SIR demonstrates that explainable and robust robot policies can be achieved by leveraging Scene Graphs (SGs) as an intermediate representation: a trainable module, learned end-to-end, sparsifies a fully-connected visual graph into a minimal, task-relevant sub-graph. This sparsification makes the model intrinsically explainable by explicitly revealing which objects and interactions the policy considers critical for action generation. Crucially, the learned sub-graphs act as a powerful diagnostic tool: by analyzing when a sub-graph deviates from human expectations, such as relying on distractor nodes or omitting key target objects, SIR uncovers hidden dataset flaws like spurious correlations and positional biases. Evaluated on RoboCasa, the sparse graph policies achieve a higher success rate than opaque image-based baselines (19.50% vs. 14.81%) and demonstrate remarkable robustness, showing almost no performance degradation when novel distractor objects are introduced. This approach provides a transparent alternative to black-box embeddings, enabling both high-performing control and deep insights for model debugging.
Experiments and Results
The main experiments were conducted on the 24 atomic tasks of RoboCasa, using the Multimodal Diffusion Transformer (MDT) as the action generator. We investigated four experimental questions, all compared against a standard image-based model:
- Can scene graph observations be used as scene representations for action generation models?
- What initial node features yield the best results?
- Are scene graphs inherently more robust to distractor objects?
- What information can be extracted from the generated sub-graphs during rollouts?
1. Scene Graphs as Structured Scene Representations
| Observation | Pick/Place (8) | Doors (4) | Drawers (2) | Knobs (2) | Levers (3) | Buttons (3) | Insert (2) | Avg (24) |
|---|---|---|---|---|---|---|---|---|
| Image | 1.19 | 25.13 | 49.75 | 7.25 | 23.67 | 17.00 | 4.75 | 14.81 |
| FCG | 0.06 | 28.62 | 39.25 | 14.00 | 40.00 | 18.83 | 4.75 | 16.98 |
| SIR (Ours) | 0.13 | 30.25 | 46.25 | 16.50 | 48.50 | 21.83 | 4.75 | 19.50 |
The average success rate of MDT using the left and right camera images on RoboCasa was 14.81 percent. In comparison, the fully-connected graph (FCG) achieved 16.98 percent, and our SIR model, which uses sparsified sub-graphs, reached 19.50 percent. We therefore conclude that scene graphs are a valid scene representation for goal-conditioned imitation learning, outperforming raw images.
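For intuition, the fully-connected graph over detected objects that both the FCG baseline and SIR start from can be sketched as below. This is a toy standalone version: in the actual pipeline the nodes carry detector embeddings and the graph is consumed by a GNN, neither of which is reproduced here.

```python
# Sketch: a fully-connected graph (FCG) over N detected objects,
# represented as a list of directed edges without self-loops.
def fully_connected_edges(num_nodes):
    """Return all directed edges (i, j) with i != j."""
    return [(i, j)
            for i in range(num_nodes)
            for j in range(num_nodes)
            if i != j]

edges = fully_connected_edges(4)
print(len(edges))  # a 4-node FCG has 4 * 3 = 12 directed edges
```

The quadratic growth of this edge set is what motivates the sparsification step: most of these edges are irrelevant for any single task.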
2. Initial Node Features
We experimented with different initial node features and their combinations, where combining means concatenating all information into a single node vector. We achieved the best performance when combining bounding-box coordinates with cropped image features, as together they capture both the positional information of objects and their visual structure.
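The best-performing combination, concatenating bounding-box coordinates with a visual crop embedding, can be sketched as follows. The embedding values here are toy placeholders; in the real model they would come from a vision encoder applied to the object crop.

```python
def make_node_feature(bbox, crop_feature):
    """Concatenate normalized bounding-box coordinates with a visual
    crop embedding to form one initial node feature vector."""
    return list(bbox) + list(crop_feature)

bbox = [0.12, 0.30, 0.45, 0.62]   # (x1, y1, x2, y2), normalized to [0, 1]
crop = [0.8, -0.1, 0.3]           # toy crop embedding (stand-in for encoder output)
feat = make_node_feature(bbox, crop)
print(len(feat))  # 4 bbox coordinates + 3 embedding dims = 7
```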
3. Scene Graphs and Distractor Objects
During rollouts we added between 3 and 9 distractor objects to the scene. This reduced the performance of image-based models by more than 3 percentage points across all 24 RoboCasa tasks. Graph-based models, in contrast, showed no decrease at all, except for one ablation in which the sub-graph is constructed via TopK with a fixed k for all tasks. This again demonstrates the advantage of graph-based methods over raw vision-based models.
4. Explainable Sub-Graphs
SIR introduces different ways of creating sub-graphs, which are embedded to provide scene understanding for the action generation model. SIR itself sparsifies the fully-connected graph by selecting the TopK nodes, where k is adaptive per task and the target nodes are predefined during training. This pushes the model towards nodes that should be important for task execution while leaving it flexible enough to adapt its decisions during rollouts. Three different kinds of sub-graphs can therefore be observed:
- The expected sub-graph, which the model was trained to predict
- The distractor-node sub-graph, where the model adds nodes that should not be task-relevant
- The missing-node sub-graph, where the model omits nodes that should carry important information for task execution
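The TopK sparsification step can be sketched as below. This is a simplified standalone version: the node importance scores here are hardcoded, whereas in SIR they are produced by the learned, end-to-end-trained sparsification module, and k is chosen per task.

```python
def topk_subgraph(node_scores, k):
    """Keep the k highest-scoring nodes and return the induced
    fully-connected sub-graph (kept node indices + directed edges)."""
    keep = sorted(range(len(node_scores)),
                  key=lambda i: node_scores[i],
                  reverse=True)[:k]
    edges = [(i, j) for i in keep for j in keep if i != j]
    return sorted(keep), edges

scores = [0.1, 0.9, 0.4, 0.7, 0.05]   # hypothetical learned importance scores
nodes, edges = topk_subgraph(scores, k=3)
print(nodes)       # [1, 2, 3]
print(len(edges))  # 3 * 2 = 6 directed edges
```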
We observed all three sub-graph variants with SIR, which leads to the following observations for the RoboCasa dataset:
- When the model adds nodes that seem unimportant for the task itself, this can indicate spurious correlations that the model exploits
- When the model removes seemingly important nodes, this can mean the task can be solved through positional bias, since information about the interaction objects is apparently not required
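The diagnostic comparison behind these observations reduces to a set difference between the expected and the predicted node sets. A minimal sketch, with hypothetical object names for illustration:

```python
def diagnose_subgraph(expected, predicted):
    """Compare the expected (annotated) node set against the nodes the
    model actually selected, yielding distractor and missing nodes."""
    expected, predicted = set(expected), set(predicted)
    return {
        "distractor_nodes": sorted(predicted - expected),  # selected, not expected
        "missing_nodes": sorted(expected - predicted),     # expected, not selected
    }

report = diagnose_subgraph(expected=["cup", "drawer"],
                           predicted=["cup", "sponge"])
print(report)
# {'distractor_nodes': ['sponge'], 'missing_nodes': ['drawer']}
```

A non-empty `distractor_nodes` entry hints at spurious correlations, while a non-empty `missing_nodes` entry hints at positional bias, matching the two cases above.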
Conclusion
In this paper, we introduced SIR, a method to generate and use learned, sparsified SGs as an intermediate representation for robot policy learning in GCIL. Our investigation shows that graph-based representations achieve higher average success rates than image-based baselines and are a highly effective architecture for integrating diverse modalities like point clouds. Furthermore, graph-based policies are significantly more robust to distractor objects, showing almost no performance degradation where image-based policies fail. Our most critical finding is that the learned, sparsified sub-graphs serve as a powerful tool for model and dataset debugging. By analyzing when the model's graph deviates from human intuition, such as by including distractor nodes or excluding key task-relevant nodes, we successfully identified significant spurious correlations and positional biases in the dataset. This demonstrates that an end-to-end learned, explainable representation like SIR can uncover flaws in training data. Such insights would be completely masked by non-end-to-end methods, like Vision Language Model (VLM) pre-filtering, which would always select the "correct" objects and hide these biases. For future work, we plan to extend our node selection method to allow the model to learn how many nodes are important, rather than relying on heuristics. We also aim to further leverage our explanation-based analysis to evaluate and improve other models and datasets.
Citation
TBD