Accepted to CVPR 2026
Abstract
Figure: SIR extracts initial node embeddings from the image and connects all objects in a fully connected graph. Afterwards the sparsification module reduces the fully connected graph to an important sub-graph, which is embedded using a GNN with attention layers. This embedding serves as the scene representation for a downstream action generator.
SIR demonstrates that explainable and robust robot policies can be achieved by leveraging Scene Graphs (SGs) as an intermediate representation: a trainable module, learned end-to-end, sparsifies a fully-connected visual graph into a minimal, task-relevant sub-graph. This sparsification makes the model intrinsically explainable by explicitly revealing which objects and interactions the policy considers critical for action generation. Crucially, the learned sub-graphs act as a powerful diagnostic tool: by analyzing when a sub-graph deviates from human expectations, such as relying on distractor nodes or omitting key target objects, SIR uncovers hidden dataset flaws like spurious correlations and positional biases. Evaluated on RoboCasa, the sparse graph policies achieve a higher success rate than opaque image-based baselines (19.50% vs. 14.81%) and demonstrate remarkable robustness, showing almost no performance degradation when novel distractor objects are introduced. This approach provides a transparent alternative to black-box embeddings, enabling both high-performing control and deep insights for model debugging.
Experiments and Results
The main experiments were conducted on the 24 atomic tasks of RoboCasa, using the Multimodal Diffusion Transformer (MDT) as the action generator. We investigated four experimental questions, all compared against a standard image-based model:
- Can scene graph observations be used as scene representations for action generation models?
- What initial node features yield the best results?
- Are scene graphs inherently more robust to distractor objects?
- What information can be extracted from the generated sub-graphs during rollouts?
1. Scene Graphs as Structured Scene Representations
| Observation | Pick/Place (8) | Doors (4) | Drawers (2) | Knobs (2) | Levers (3) | Buttons (3) | Insert (2) | Avg (24) |
|---|---|---|---|---|---|---|---|---|
| Image | 1.19 | 25.13 | 49.75 | 7.25 | 23.67 | 17.00 | 4.75 | 14.81 |
| FCG | 0.06 | 28.62 | 39.25 | 14.00 | 40.00 | 18.83 | 4.75 | 16.98 |
| SIR (Ours) | 0.13 | 30.25 | 46.25 | 16.50 | 48.50 | 21.83 | 4.75 | 19.50 |
The average success rate of MDT using the left and right camera images on RoboCasa was 14.81 percent. In comparison, the fully-connected graph (FCG) achieved 16.98 percent, and our SIR model, which uses sparsified sub-graphs, reached 19.50 percent. We therefore conclude that scene graphs are a valid scene representation for goal-conditioned imitation learning, outperforming raw images.
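For intuition, the fully-connected graph over detected objects that both the FCG baseline and SIR start from can be sketched as below. This is a toy standalone version: in the actual pipeline the nodes carry detector embeddings and the graph is consumed by a GNN, neither of which is reproduced here.

```python
# Sketch: a fully-connected graph (FCG) over N detected objects,
# represented as a list of directed edges without self-loops.
def fully_connected_edges(num_nodes):
    """Return all directed edges (i, j) with i != j."""
    return [(i, j)
            for i in range(num_nodes)
            for j in range(num_nodes)
            if i != j]

edges = fully_connected_edges(4)
print(len(edges))  # a 4-node FCG has 4 * 3 = 12 directed edges
```

The quadratic growth of this edge set is what motivates the sparsification step: most of these edges are irrelevant for any single task.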
2. Initial Node Features
We experimented with different initial node features and their combinations, where combining means concatenating all information into a single node vector. We achieved the best performance when combining bounding-box coordinates with cropped image features, as together they capture both the positional information of objects and their visual structure.
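The best-performing combination, concatenating bounding-box coordinates with a visual crop embedding, can be sketched as follows. The embedding values here are toy placeholders; in the real model they would come from a vision encoder applied to the object crop.

```python
def make_node_feature(bbox, crop_feature):
    """Concatenate normalized bounding-box coordinates with a visual
    crop embedding to form one initial node feature vector."""
    return list(bbox) + list(crop_feature)

bbox = [0.12, 0.30, 0.45, 0.62]   # (x1, y1, x2, y2), normalized to [0, 1]
crop = [0.8, -0.1, 0.3]           # toy crop embedding (stand-in for encoder output)
feat = make_node_feature(bbox, crop)
print(len(feat))  # 4 bbox coordinates + 3 embedding dims = 7
```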
3. Scene Graphs and Distractor Objects
During rollouts we added between 3 and 9 distractor objects to the scene. This reduced the performance of image-based models by more than 3 percentage points across all 24 RoboCasa tasks. Graph-based models, in contrast, showed no decrease at all, except for one ablation in which the sub-graph is constructed via TopK with a fixed k for all tasks. This again demonstrates the advantage of graph-based methods over raw vision-based models.
4. Explainable Sub-Graphs
SIR introduces different ways of creating sub-graphs, which are embedded to provide scene understanding for the action generation model. SIR itself sparsifies the fully-connected graph by selecting the TopK nodes, where k is adaptive per task and the target nodes are predefined during training. This pushes the model towards nodes that should be important for task execution while leaving it flexible enough to adapt its decisions during rollouts. Three different kinds of sub-graphs can therefore be observed:
- The expected sub-graph, which the model was trained to predict
- The distractor-node sub-graph, where the model adds nodes that should not be task-relevant
- The missing-node sub-graph, where the model omits nodes that should carry important information for task execution
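The TopK sparsification step can be sketched as below. This is a simplified standalone version: the node importance scores here are hardcoded, whereas in SIR they are produced by the learned, end-to-end-trained sparsification module, and k is chosen per task.

```python
def topk_subgraph(node_scores, k):
    """Keep the k highest-scoring nodes and return the induced
    fully-connected sub-graph (kept node indices + directed edges)."""
    keep = sorted(range(len(node_scores)),
                  key=lambda i: node_scores[i],
                  reverse=True)[:k]
    edges = [(i, j) for i in keep for j in keep if i != j]
    return sorted(keep), edges

scores = [0.1, 0.9, 0.4, 0.7, 0.05]   # hypothetical learned importance scores
nodes, edges = topk_subgraph(scores, k=3)
print(nodes)       # [1, 2, 3]
print(len(edges))  # 3 * 2 = 6 directed edges
```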
We observed all three sub-graph variants with SIR, which leads to the following observations for the RoboCasa dataset:
- When the model adds nodes that seem unimportant for the task itself, this can indicate spurious correlations that the model exploits
- When the model removes seemingly important nodes, this can mean the task can be solved through positional bias, since information about the interaction objects is apparently not required
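The diagnostic comparison behind these observations reduces to a set difference between the expected and the predicted node sets. A minimal sketch, with hypothetical object names for illustration:

```python
def diagnose_subgraph(expected, predicted):
    """Compare the expected (annotated) node set against the nodes the
    model actually selected, yielding distractor and missing nodes."""
    expected, predicted = set(expected), set(predicted)
    return {
        "distractor_nodes": sorted(predicted - expected),  # selected, not expected
        "missing_nodes": sorted(expected - predicted),     # expected, not selected
    }

report = diagnose_subgraph(expected=["cup", "drawer"],
                           predicted=["cup", "sponge"])
print(report)
# {'distractor_nodes': ['sponge'], 'missing_nodes': ['drawer']}
```

A non-empty `distractor_nodes` entry hints at spurious correlations, while a non-empty `missing_nodes` entry hints at positional bias, matching the two cases above.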
Conclusion
In this paper, we introduced SIR, a method to generate and use learned, sparsified SGs as an intermediate representation for robot policy learning in GCIL. Our investigation shows that graph-based representations achieve higher average success rates than image-based baselines and are a highly effective architecture for integrating diverse modalities like point clouds. Furthermore, graph-based policies are significantly more robust to distractor objects, showing almost no performance degradation where image-based policies fail. Our most critical finding is that the learned, sparsified sub-graphs serve as a powerful tool for model and dataset debugging. By analyzing when the model's graph deviates from human intuition, such as by including distractor nodes or excluding key task-relevant nodes, we successfully identified significant spurious correlations and positional biases in the dataset. This demonstrates that an end-to-end learned, explainable representation like SIR can uncover flaws in training data. Such insights would be completely masked by non-end-to-end methods, like Vision Language Model (VLM) pre-filtering, which would always select the "correct" objects and hide these biases. For future work, we plan to extend our node selection method to allow the model to learn how many nodes are important, rather than relying on heuristics. We also aim to further leverage our explanation-based analysis to evaluate and improve other models and datasets.
Citation
TBD