Tenure-track Professor, Machine Learning & Robotics
Karlsruhe Institute of Technology (KIT) · Intuitive Robots Lab
Emmy Noether Research Group
RIG Cluster on Learning and Multimodal AI for Robotics
Abstract
Behavior foundation models, and the broader path from large language models to vision-language and vision-language-action models, are reshaping robot learning. Designing, training, and running them involve a series of practical choices that strongly shape how these models behave. This hands-on tutorial works through those choices directly by building and dissecting a minimal behavior foundation model live, in code.
We start with a concise framing: what sets behavior foundation models apart from classical approaches, and how architectures evolve from language to vision-language to action. We then move into code, building a compact and deliberately minimal model that lets us demonstrate different design and architecture choices and their effects.
From there, we explore the decisions that most shape these models by adding components to the minimal model and observing the consequences. We focus on action representation and spatio-temporal reasoning, along with the mechanisms used to condition models on observations and task parameters. We also cover common practical challenges and close with an outlook on where the field is heading.
The goal is to see past the high-level architecture diagrams and understand the concrete decisions behind these models. Attendees will leave ready to start building and contributing in this area.
Topics Covered
Code repository, slides, and companion materials will appear here in the coming days.