Comma.ai reading syllabus
arXiv:2504.19077 — comma.ai

Learning to Drive from a World Model
Full reading syllabus

Every paper cited in the comma.ai paper, ordered by dependency. Read bottom-up — each layer assumes you understand the ones below it. Each entry lists what to focus on, what to skip, and exactly how it maps to a section or equation in the final paper.
6 layers · 22 papers · ~42 hrs estimated reading · 3–4 wks suggested pace
1

ML foundations

Imitation learning, policy gradient, and distributed RL — the primitives everything builds on

Start here
Imitation Learning [1] in paper
Easy
A Framework for Behavioural Cloning
Bain & Sammut · 1995 · Machine Intelligence 15
The baseline the paper explicitly improves on. §2 cites [1] as the i.i.d. method that fails under compounding errors. Before reading anything else, understand why BC fails — it's the motivation for the entire paper.
~2 hrs
What to read
Read Chapter 3 (behavioural cloning formulation) and Chapter 5 (failure analysis). The key result is that BC minimises one-step loss but compounding errors cause quadratic degradation over a horizon — a policy that makes ε error per step makes O(εT²) total error.
What to skip
Chapter 2 (game-theoretic framing) and Chapter 6 (symbolic AI discussion) are dated and not relevant. The paper is from 1995 — treat it as a clean statement of the problem, not a state-of-the-art solution.
Key concept
The covariate shift problem: training distribution p(s) assumes expert states, but at test time the learner visits its own states q(s) ≠ p(s). As soon as one mistake is made, future states diverge. This is why comma.ai needs on-policy training.
E[loss] ∝ ε × T²
  where ε = one-step error, T = episode horizon
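The quadratic blow-up can be sanity-checked with a toy model (my own sketch, not from the paper): assume the learner errs with probability ε per step, and after its first mistake it is off-distribution and pays cost 1 at every remaining step.

```python
def expected_cost(eps, T):
    # Once the first mistake occurs (prob eps per step), the learner is
    # off the expert's distribution and pays cost 1 every remaining step.
    total = 0.0
    for t in range(T):
        p_first_mistake_at_t = (1 - eps) ** t * eps
        total += p_first_mistake_at_t * (T - t)
    return total

# Doubling the horizon roughly quadruples the expected cost: O(eps * T^2)
ratio = expected_cost(0.001, 200) / expected_cost(0.001, 100)
```

For small ε the sum is ≈ ε·T(T+1)/2, so the ratio comes out close to 4.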
Comma.ai connection
  • §5 "off-policy learning" = exactly BC on expert demonstrations
  • Table 2: off-policy policy passes 5/24 convergence tests despite lowest trajectory MAE — this is the BC failure mode made empirical
  • The entire motivation for §5.1 on-policy IMPALA training comes from this paper's failure analysis
Imitation Learning prerequisite
Medium
A Reduction of Imitation Learning to No-Regret Online Learning (DAgger)
Ross, Gordon & Bagnell · AISTATS 2011
The canonical fix for BC's compounding error problem. DAgger iteratively queries the expert on states the learner actually visits. The comma.ai paper replaces the expert oracle with the Plan Model — same architecture, different supervision source.
~3 hrs
What to read
§2 (problem formulation), §3 (DAgger algorithm), Theorem 2 (linear regret bound). DAgger achieves O(εT) error instead of BC's O(εT²) by training on the learner's own state distribution.
What to skip
The SEARN comparison (§4) and the structured prediction framing. Focus on the core DAgger loop: run policy → collect states → query expert on those states → add to dataset → retrain.
Key algorithm
for i = 1..N:
    π_i = (1-β_i)·π* + β_i·π̂_{i-1}
    collect D_i = {s : rollout with π_i}
    query expert: label D_i with π*(s)
    retrain π̂_i on D_1 ∪ ... ∪ D_i
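A toy instantiation of the loop (my own sketch, not comma.ai's setup): a 1-D lane-keeping task where the expert steers proportionally back to centre and the learner is a linear fit through the origin, retrained on the aggregated dataset each round.

```python
import random

def expert(s):
    # Expert policy: steer proportionally back to the lane centre
    return -0.5 * s

def rollout(policy, T=20, noise=0.05, seed=0):
    rng = random.Random(seed)
    s, states = 1.0, []
    for _ in range(T):
        states.append(s)
        s = s + policy(s) + rng.gauss(0.0, noise)
    return states

def fit(dataset):
    # 1-D least squares through the origin: a ≈ k·s
    num = sum(s * a for s, a in dataset)
    den = sum(s * s for s, _ in dataset) or 1.0
    k = num / den
    return lambda s, k=k: k * s

# Round 0 is plain BC on expert states; then DAgger: roll out the current
# learner, label the *visited* states with the expert, aggregate, refit.
dataset = [(s, expert(s)) for s in rollout(expert, seed=1)]
learner = fit(dataset)
for i in range(5):
    visited = rollout(learner, seed=i + 2)
    dataset += [(s, expert(s)) for s in visited]
    learner = fit(dataset)
```

The point of the sketch is the data flow, not the learner: the expert is only ever queried on states the learner itself reached.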
Comma.ai connection
  • §5.1 IMPALA is DAgger with the Plan Model replacing π* (the expert oracle)
  • IMPALA actors = DAgger's rollout step
  • World Model Plan Head = DAgger's expert query step
  • Central learner = DAgger's retraining step
Reinforcement Learning prerequisite
Medium
Proximal Policy Optimization Algorithms
Schulman, Wolski, Dhariwal, Radford, Klimov · OpenAI 2017
Background for IMPALA. You need the actor-critic setup, the policy gradient theorem, and the surrogate objective before §5.1 makes sense. Read this before IMPALA.
~3 hrs
What to read
§2 (background — policy gradient and actor-critic), §3 (clipped surrogate objective), §4 (algorithm). The clipped objective is the key innovation — it prevents destructively large policy updates by limiting the ratio π_new/π_old.
What to skip
§5 (experiments on Mujoco) and §6 (Atari comparison). These are application-specific. The algorithm section is what matters.
Key equation
L_CLIP(θ) = E[min(r_t(θ)·Â_t, clip(r_t(θ), 1-ε, 1+ε)·Â_t)]
  where r_t(θ) = π_θ(a|s) / π_θ_old(a|s)
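A one-sample sketch of the clipped objective (illustrative, not OpenAI's implementation), showing the asymmetry that makes it safe:

```python
def l_clip(ratio, adv, eps=0.2):
    # PPO clipped surrogate for a single sample:
    # min(r * A, clip(r, 1-eps, 1+eps) * A)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * adv, clipped * adv)
```

With a positive advantage, pushing the ratio past 1+ε earns nothing extra (the clip caps the objective); with a negative advantage, a large ratio is still fully penalised, so the update can never profit from moving too far from π_old.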
Comma.ai connection
  • IMPALA ([7]) uses the same actor-critic setup that PPO builds on
  • Comma.ai uses imitation loss instead of PPO's surrogate — but the distributed actor/learner split is identical
  • The V-trace correction in IMPALA serves the same role as the clipped ratio in PPO: correcting for off-policyness
Distributed RL [7] in paper
Hard
IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures
Espeholt et al. · DeepMind · ICML 2018
Directly cited as [7]. The exact architecture used in §5.1 — N actors generate rollouts in parallel, one central learner updates the policy. This is not background reading; the paper's on-policy training is a literal implementation of IMPALA.
~4 hrs
What to read
§2 (IMPALA architecture), §3 (V-trace algorithm), §3.1 (off-policy correction). The V-trace correction is crucial: because actors use a slightly stale policy snapshot, the data they generate is technically off-policy. V-trace corrects for this with importance weights clipped at ρ̄ and c̄.
What to skip
§5 (multi-task Atari results) and §6 (DMLab comparison). The architecture section is what you need — not the game-specific results.
Key architecture
Actors (×N):
    run latest policy π_old
    collect rollout h^{π,wp}
    send to learner via queue
Learner (×1):
    receive rollouts from actors
    compute V-trace targets
    update π with imitation loss
    broadcast new π to actors
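The V-trace targets from §3 can be sketched as a backward recursion (my simplification over scalar lists; comma.ai's imitation setting sidesteps V-trace, but it is worth seeing once):

```python
def vtrace_targets(rewards, values, next_value, rhos,
                   gamma=0.99, rho_bar=1.0, c_bar=1.0):
    # rhos[t] = pi(a_t|s_t) / mu(a_t|s_t): importance ratios between the
    # learner policy pi and the (stale) actor policy mu.
    T = len(rewards)
    vs = list(values) + [next_value]
    targets = [0.0] * T
    acc = 0.0
    for t in reversed(range(T)):
        # Clipped TD error, weighted by the truncated ratio rho_bar
        delta = min(rho_bar, rhos[t]) * (
            rewards[t] + gamma * vs[t + 1] - vs[t])
        # c_bar truncates how far corrections propagate backwards
        acc = delta + gamma * min(c_bar, rhos[t]) * acc
        targets[t] = vs[t] + acc
    return targets
```

When all ratios are 1 (data is exactly on-policy), the targets reduce to ordinary n-step bootstrapped returns.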
Comma.ai connection
  • §5.1 adopts IMPALA's actor-learner split exactly
  • Each actor rolls out h^{π,wp} = {(o_t, a_t, â^wp_t)} where â^wp comes from the Plan Model
  • The parameter server pattern (learner broadcasts updated policy to actors) is explicit in §5.1
  • Comma.ai replaces the RL reward signal with imitation loss on â^wp — so no V-trace correction is needed
2

Localisation stack — pose p_t

The GPS + vision fusion system that produces the 6-DOF pose conditioning signal for the World Model

Often skipped — read it
Localisation [17] in paper
Hard
A Multi-State Constraint Kalman Filter for Vision-Aided Inertial Navigation
Mourikis & Roumeliotis · ICRA 2007
This is what produces p_t — the pose that conditions the World Model. The World Model is fundamentally different from pose-free world models precisely because of MSCKF. Understanding the pose signal is essential to understanding why the architecture works the way it does.
~5 hrs
What to read
§II (system overview — IMU model, camera model), §III (filter formulation — state vector, prediction step, update step), §IV (camera measurement model). Focus on what the output is: a 6-DOF pose p_t = (x,y,z,φ,θ,ψ) at every camera frame, fused from IMU integration and visual feature tracks.
What to skip
§V (observability analysis) and §VI (complexity analysis). These are theoretical results that matter for filter design but not for understanding how comma.ai uses the output. You can treat MSCKF as a black box that outputs SE(3) poses.
Key insight
MSCKF tracks camera poses across multiple frames as part of the filter state. Feature observations constrain the relative pose between frames — this is what gives drift-free visual odometry without needing a map. The pose output is smooth, globally consistent, and metric-scale (because IMU provides scale).
Comma.ai connection
  • §2.2 explicitly states pose p_t is from a "tightly coupled GPS/Vision MSCKF [14,17,26]"
  • The pose is the transition signal for the World Model (§2.4): w maps (images, poses) → next image
  • Using pose instead of raw actions means the WM is Vehicle Model-independent — you can augment VM parameters without retraining the WM
  • In comma2k19 dataset, poses are pre-computed — you don't need to run MSCKF yourself
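For intuition about what a 6-DOF pose output actually is, here is a sketch of turning p_t = (x, y, z, roll, pitch, yaw) into an SE(3) homogeneous matrix. The ZYX Euler convention here is an assumption for illustration; comma.ai's own frame conventions are defined in the comma2k19 tooling.

```python
import math

def pose_to_se3(x, y, z, roll, pitch, yaw):
    # ZYX convention: yaw about z, then pitch about y, then roll about x
    cr, sr = math.cos(roll), math.sin(roll)
    cp, sp = math.cos(pitch), math.sin(pitch)
    cy, sy = math.cos(yaw), math.sin(yaw)
    R = [
        [cy * cp, cy * sp * sr - sy * cr, cy * sp * cr + sy * sr],
        [sy * cp, sy * sp * sr + cy * cr, sy * sp * cr - cy * sr],
        [-sp,     cp * sr,                cp * cr],
    ]
    # 4x4 homogeneous transform: rotation block plus translation column
    return [R[0] + [x], R[1] + [y], R[2] + [z], [0.0, 0.0, 0.0, 1.0]]
```

Chaining these matrices frame-to-frame is exactly how a pose trajectory becomes the World Model's transition signal.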
Localisation [14] in paper
Hard
High-Precision, Consistent EKF-Based Visual-Inertial Odometry
Li & Mourikis · IJRR 2013
The tightly-coupled GPS+Vision extension of MSCKF. Cited as [14]. This is the specific variant comma.ai uses — GPS pseudoranges are fused directly into the filter state rather than loosely post-processed.
~4 hrs
What to read
§III (tightly-coupled GPS measurement model) and §IV (consistency analysis). The key improvement over MSCKF 2007: GPS pseudoranges constrain absolute scale and drift, while camera feature tracks constrain relative pose. The combination gives globally consistent trajectory estimates.
What to skip
The observability proofs (Appendix A). These are mathematically important for filter design correctness but not needed to understand how comma.ai uses the pose output.
Practical significance
Without GPS fusion, visual odometry drifts — after 1 km of driving, position error might be 10–20 m. With tight GPS fusion, drift is centimetric. This matters for the World Model: if p_t drifts, the conditioning signal becomes inconsistent and generated frames won't match the commanded trajectory.
Comma.ai connection
  • The comma2k19 dataset provides these poses pre-computed at every frame
  • p_t = (x,y,z,φ,θ,ψ) ∈ ℝ⁶ is directly from this filter's output
  • In urban canyons (GPS degraded), the visual part takes over — the filter is robust by design
Dataset [26] in paper
Easy
A Commute in Data: The comma2k19 Dataset
Schäfer, Santana, Haden, Biasini · comma.ai 2018
The dataset format comma.ai uses in the paper. §4.5 says "400k segments, 1 minute each, 5 Hz" — all from comma2k19 format. Read this before trying to use any comma.ai open-source components so you understand the data pipeline.
~1.5 hrs
What to read
§2 (data collection setup), §3 (data format — segment structure, sensor modalities), §4 (MSCKF pose pipeline). Pay attention to the segment format: each segment is a fixed-length clip of driving with synchronised camera frames, IMU readings, GPS, and pre-computed MSCKF poses.
Key format details
Segments are 1-minute clips. Video at 20 Hz (downsampled to 5 Hz for World Model training). Pose in SE(3) comma coordinate frame. Two cameras: wide (120°) and narrow (~28°). The narrow camera is what most models use for forward driving.
Dataset structure
segment/
    video.hevc              (compressed video)
    processed_log/
        CAN/                (speed, steering)
        GPS/                (raw + corrected)
        IMU/                (accel, gyro)
    frame_times.npy         (frame timestamps)
    global_poses.npy        (6-DOF MSCKF poses)
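The 20 Hz → 5 Hz downsampling plus timestamp alignment mentioned above can be sketched as follows (my own sketch; check the comma2k19 README for the exact arrays and rates):

```python
import bisect

def align_poses(frame_times, pose_times, poses):
    # Nearest-timestamp match of pose samples to (downsampled) camera frames
    out = []
    for t in frame_times:
        i = bisect.bisect_left(pose_times, t)
        if i == len(pose_times) or (
            i > 0 and t - pose_times[i - 1] <= pose_times[i] - t
        ):
            i -= 1
        out.append(poses[i])
    return out

frame_times_20hz = [k / 20.0 for k in range(40)]  # 2 s of 20 Hz timestamps
frame_times_5hz = frame_times_20hz[::4]           # 20 Hz -> 5 Hz: every 4th
```

Because the MSCKF poses are already synchronised to camera frames in comma2k19, this alignment is usually exact rather than approximate.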
Comma.ai connection
  • §4.5: "400k segments, each 1 minute" — this is the exact comma2k19 segment format scaled up
  • The 5 Hz downsampling in §4.5 matches the native 5-Hz MSCKF pose output
  • Download: github.com/commaai/comma2k19
  • The full internal dataset has millions of segments — comma2k19 itself is a small public sample (2,019 one-minute segments, ~33 hours)
3

Generative models — VAE + diffusion

The image encoder (§4.1), the DiT architecture (§4.2), and the Rectified Flow objective (§4.2.2) all come from here

Heaviest layer
Generative Models prerequisite
Medium
Auto-Encoding Variational Bayes
Kingma & Welling · ICLR 2014
The World Model operates in VAE latent space, not pixel space. You need to understand what a VAE is — specifically the reparameterisation trick and the ELBO — before the VAE compression in §4.1 makes sense.
~3 hrs
What to read
§2 (variational lower bound), §2.4 (reparameterisation trick), §3 (SGVB estimator). The core idea: instead of directly maximising p(x), maximise the ELBO = E[log p(x|z)] - KL(q(z|x) || p(z)). The first term is reconstruction, the second is a regulariser that keeps the latent space smooth.
Key equations
ELBO = E_q[log p(x|z)] - KL(q(z|x) || p(z))
Reparameterisation: z = μ(x) + σ(x) ⊙ ε,  ε ~ N(0,I)
  (makes gradients flow through the sampling step)
Intuition
The encoder maps an image to a distribution in latent space (not a point). During training, you sample from this distribution and try to reconstruct the image. The KL term forces the distribution to stay close to a Gaussian prior — this makes the latent space smooth and interpolatable.
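The two terms can be made concrete with a diagonal-Gaussian sketch in pure Python (illustrative only; real VAEs do this with tensors):

```python
import math
import random

def kl_to_std_normal(mu, sigma):
    # Closed form KL( N(mu, diag(sigma^2)) || N(0, I) ), summed over dims
    return sum(0.5 * (m * m + s * s - 1.0) - math.log(s)
               for m, s in zip(mu, sigma))

def reparameterise(mu, sigma, rng=random):
    # z = mu + sigma * eps: the randomness lives in eps, so gradients
    # can flow through mu and sigma
    return [m + s * rng.gauss(0.0, 1.0) for m, s in zip(mu, sigma)]
```

The KL term is zero exactly when the encoder outputs the prior N(0, I), which is what pins the latent space to a smooth, known shape.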
Comma.ai connection
  • §4.1 uses the Stable Diffusion VAE as a fixed codec — 8× spatial compression into 4-channel latents
  • Scale factor 0.18215 (from LDM paper) normalises the latent distribution to unit variance
  • The World Model never touches pixels — it operates entirely in latent space
  • The policy, by contrast, operates on raw pixels — only the WM uses latents
Generative Models [23] in paper
Medium
High-Resolution Image Synthesis with Latent Diffusion Models
Rombach, Blattmann, Lorenz, Esser, Ommer · CVPR 2022
Directly cited as [23] — the exact VAE comma.ai uses off the shelf. The scale factor 0.18215 in §4.1 comes directly from this paper. More importantly, Stable Diffusion's encoder/decoder is the video encoder for the entire World Model.
~4 hrs
What to read
§3 (LDM formulation — perceptual compression + diffusion in latent space), Appendix A (autoencoder architecture). The key design decision: train a powerful VAE first, then train a diffusion model in the latent space. This is 4–8× cheaper than pixel-space diffusion.
The VAE you're using
Model:           vae-ft-mse-840000-ema-pruned
Compression:     8× spatial (64 px → 8 px)
Latent channels: 4
Scale factor:    0.18215
HuggingFace:     stabilityai/sd-vae-ft-mse
Why latent space
Pixel-space diffusion at 64×64 has 64×64×3 = 12,288 dimensions. Latent space has 8×8×4 = 256 dimensions — 48× smaller. The World Model (DiT) operates on latents, so each training step is 48× cheaper. This is why the paper can afford to train a large DiT at all.
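The shape arithmetic above is worth writing down once, since the 8×8 figure only holds for 64×64 inputs (a small helper of my own):

```python
def latent_shape(h, w, downsample=8, channels=4):
    # The SD VAE always compresses 8x spatially into 4 latent channels
    assert h % downsample == 0 and w % downsample == 0
    return (channels, h // downsample, w // downsample)

def dim_ratio(h, w):
    # (pixel dimensions) / (latent dimensions)
    c, lh, lw = latent_shape(h, w)
    return (h * w * 3) / (c * lh * lw)
```

At comma.ai's 128×256 input resolution the latents are 4×16×32, and the dimensionality reduction is the same 48× as in the 64×64 example.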
Comma.ai connection
  • §4.1: "we use the pretrained Stable Diffusion image VAE [23]"
  • The VAE is frozen — it is never fine-tuned on driving data
  • Frames are downscaled to 128×256 before the VAE, so latents are 16×32 (the 8×8 figure applies only to 64×64 inputs; the compression is always 8× spatially)
  • The scale factor 0.18215 normalises latent variance to ~1.0 for stable diffusion training
Diffusion [16] in paper
Medium
Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow
Liu, Gong, Liu · ICLR 2023
This is L_RF — the core training objective of the World Model (§4.2.2, Eq.5). The noise interpolation equation o_τ = τε + (1-τ)o is Eq.4 in the paper. Understanding Rectified Flow is necessary to understand what the DiT is actually learning.
~4 hrs
What to read
§2 (rectified flow formulation), §3 (training objective), §4 (reflow for straighter paths). The key insight: instead of DDPM's complex noise schedule, interpolate linearly between data (τ=0) and noise (τ=1). The model learns to predict the velocity field (o - ε), which is the direction from noise to data.
Key equations — match to §4.2.2
Interpolation (Eq.4 in paper):
    o_τ = τ·ε + (1-τ)·o
Training objective (L_RF in §4.2.2):
    L_RF = ||w(o_τ, p, τ) - (o - ε)||²   (model predicts the "data direction")
Euler sampling (§4.3, Eq.6):
    o_{τ+Δτ} = o_τ + Δτ·w(o_τ, p, τ+Δτ)
Why 15 steps work
RF produces straighter probability flow ODE paths than DDPM — the optimal transport coupling makes paths nearly linear. Straight paths mean fewer Euler steps are needed. With DDPM you'd need 50–1000 steps; with RF, 15 steps is sufficient for driving video quality.
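A 1-D toy check of the straight-path claim (my own sketch): for a single data point the true velocity o − ε is constant along the interpolation path, so Euler integration from noise recovers the data regardless of step count.

```python
def interpolate(o, eps, tau):
    # Eq.4: o_tau = tau*eps + (1 - tau)*o   (tau=0 is data, tau=1 is noise)
    return tau * eps + (1 - tau) * o

def euler_sample(o_noise, velocity, steps=15):
    # Integrate from noise (tau=1) toward data (tau=0). With the true
    # velocity (o - eps) the path is a straight line, so few steps suffice.
    o, dtau = o_noise, 1.0 / steps
    for k in range(steps):
        o = o + dtau * velocity(o, 1.0 - k * dtau)
    return o

o_data, eps = 3.0, -1.0
oracle = lambda o_tau, tau: o_data - eps  # perfect model for this one sample
```

With a real model the velocity field is only approximately constant, which is why 15 steps (rather than 1) are used in §4.3.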
Comma.ai connection
  • §4.2.2: "we adopt the Rectified Flow (RF) objective [16]"
  • τ ~ LogitNormal(0.0, 1.0) [8] concentrates training at mid-noise levels
  • §4.3: 15 Euler steps Δτ = 1/15 for sequential sampling
  • The model predicts (o - ε), not o directly — this is the velocity parameterisation
Architecture [19] in paper
Hard
Scalable Diffusion Models with Transformers
Peebles & Xie · ICCV 2023
The transformer backbone of the World Model. AdaLN conditioning, patch tokenisation, and the scaling laws all come from here. This is the single most important architecture paper in the stack — the World Model is a direct extension of DiT to 3D video sequences.
~5 hrs
What to read
§3 (DiT architecture — patch embeddings, conditioning strategies), §4 (AdaLN-Zero), §5 (scaling experiments). Read Figure 3 carefully — it shows all conditioning variants. The comma.ai paper uses AdaLN (Figure 3c) where conditioning vectors modulate LayerNorm scale and shift.
AdaLN — the key block
AdaLN(x, c):
    x_norm = LayerNorm(x)        # no learnable γ, β
    γ, β = Linear(c).chunk(2)
    return x_norm * (1 + γ) + β

c = sum(pose_embed, τ_embed, world_t_embed)
(comma.ai adds world-timestep conditioning)
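The modulation step can be sketched in pure Python over a single feature vector (illustrative; γ and β would come from a linear map of the conditioning vector c):

```python
def ada_layer_norm(x, gamma, beta, eps=1e-5):
    # Plain LayerNorm with no learnable affine, then conditioning-dependent
    # scale and shift: x_norm * (1 + gamma) + beta
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    x_norm = [(v - mu) / (var + eps) ** 0.5 for v in x]
    return [xn * (1.0 + g) + b for xn, g, b in zip(x_norm, gamma, beta)]
```

Note that with γ = β = 0 (the AdaLN-Zero initialisation) this reduces to plain LayerNorm, so conditioning has no effect at the start of training.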
Comma.ai extensions to DiT
  • 3D input: patch table extended to (frame × height × width) then flattened
  • Causal mask: frame-wise triangular mask enables KV-caching for autoregressive sampling
  • Multi-conditioning: pose + τ + world-timestep all summed before AdaLN
  • Plan Head added: residual FF blocks on pooled context tokens → trajectory
  • Three sizes: 250M (GPT-2), 500M (GPT-medium), 1B (GPT-large)
Comma.ai connection
  • §4.2.1: "we use the DiT architecture [19], adapted to 3D inputs"
  • The scaling results in Fig.5 of the paper mirror DiT's scaling law: more params + more data → lower LPIPS
  • GPT-2 model sizes [21] are used to define the three DiT variants
Diffusion [8] in paper
Medium
Scaling Rectified Flow Transformers for High-Resolution Image Synthesis (SD3)
Esser, Kulal, Blattmann et al. · ICML 2024
Cited as [8] for a single specific thing: the LogitNormal(0.0, 1.0) noise schedule for τ. §4.2.2 uses this exact distribution to sample noise timesteps during training. Also introduces MM-DiT which is architecturally related to multi-modal conditioning.
~3 hrs
What to read
§3.1 (noise schedule analysis), §3.2 (LogitNormal distribution), §4 (MM-DiT architecture for background understanding). The LogitNormal distribution for τ concentrates training samples at mid-noise levels (neither fully clean nor fully noisy), which empirically gives better perceptual quality than uniform sampling.
LogitNormal schedule
u ~ Normal(0, 1)
τ = sigmoid(u) ∈ (0, 1)
This concentrates τ around 0.5 — the model trains more on mid-denoising steps where most of the perceptual information is recovered.
Why this matters
Uniform τ ~ Uniform(0,1) wastes capacity on near-clean (τ≈0) and near-noise (τ≈1) levels that contribute little to sample quality. LogitNormal focuses 70% of training on the 0.2–0.8 range where perceptual quality is determined. §4.4 uses LogitNormal(0.0, 0.25) for the noise augmentation on context frames.
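The sampler is two lines; a sketch (the concentration effect is easy to verify empirically with a narrower std, as in §4.4's context-frame schedule):

```python
import math
import random

def sample_tau(mean=0.0, std=1.0, rng=random):
    # LogitNormal(mean, std): push a Gaussian sample through a sigmoid
    u = rng.gauss(mean, std)
    return 1.0 / (1.0 + math.exp(-u))
```

With std = 1.0 the samples are centred on 0.5 with wide spread; with std = 0.25 nearly all mass sits in the mid-noise range, i.e. the context frames in §4.4 receive only mild noise.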
Comma.ai connection
  • §4.2.2: "we sample the noise timestep τ ~ Logit-Normal(0.0, 1.0) [8]"
  • §4.4 noise augmentation: context frames noised at τ ~ LogitNormal(0.0, 0.25) — narrower distribution = less noise on context
Architecture [32] in paper
Easy
Understanding and Improving Layer Normalization
Xu, Sun, Zhang, Zhao, Lin · NeurIPS 2019
Cited as [32] for the AdaLN formulation used in §4.2.1. A short paper that's worth reading to understand the theoretical grounding of the conditioning mechanism — why you can steer a transformer's activations by modulating LayerNorm's affine parameters.
~1.5 hrs
What to read
§2 (LayerNorm formulation), §4 (adaptive variants). The key result: the γ (scale) and β (shift) parameters of LayerNorm have an outsized effect on the network's internal representations. By making these parameters functions of an external conditioning signal, you can steer the network's behaviour without touching the attention weights.
AdaLN formulation
Standard LN: y = (x-μ)/σ · γ + β     (γ, β are learned params)
AdaLN:       γ, β = MLP(c)           (γ, β are functions of conditioning c)
AdaLN-Zero:  init MLP to output γ=0, β=0 → identity at init, stable training
Comma.ai connection
  • §4.2.1: "conditioning signals ... passed to the Adaptive Layer Norm layer (AdaLN) [32]"
  • The conditioning vector c = sum(pose_embed, τ_embed, world_t_embed)
  • AdaLN-Zero initialisation (DiT paper) means the model starts as a pure transformer and gradually learns to use the conditioning
Practical note
In implementation, you initialise the final linear layer of the MLP (that produces γ, β) to zero. This means at the start of training, AdaLN is an identity function — the conditioning has no effect and the model trains stably. As training progresses, the MLP learns to use the conditioning signal.
4

World models for control

The conceptual lineage of using a learned simulator to train a policy — from Ha & Schmidhuber to GAIA-1

Read after Layer 3
World Models [10] in paper
Medium
Recurrent World Models Facilitate Policy Evolution
Ha & Schmidhuber · NeurIPS 2018
The paper that named the paradigm and gave it a framework. The core idea — train a generative model of the environment, train a policy inside the model's imagination — is the direct ancestor of everything comma.ai does.
~4 hrs
What to read
§2 (world model components: V model, M model, C controller), §3 (dream training — train policy entirely in the world model), §4 (results on Car Racing and VizDoom). The MDN-RNN architecture predicts a distribution over next latents, not a deterministic next state — this models uncertainty explicitly.
The architecture lineage
Ha & Schmidhuber 2018:
    V = VAE (compress frames)
    M = MDN-RNN (predict next latent)
    C = CMA-ES linear controller
Comma.ai 2025:
    V = SD VAE (same role)
    M = DiT with RF (replaces MDN-RNN)
    C = FastViT + Transformer (replaces CMA-ES)
Key insight
The policy C never sees the real environment during training — it trains entirely on imagined experiences from M. Transfer from dream to reality works because V compresses real observations and M's predictions are in the same latent space. This is exactly what comma.ai does with the World Model.
Comma.ai connection
  • [10] is one of only 3 world model papers cited in the introduction (§1)
  • The "dream training" concept = comma.ai's on-policy training inside the WM simulator
  • Main upgrade: DiT produces photorealistic frames; MDN-RNN produces abstract latent distributions
  • Future anchoring (§2.5) is not in this paper — it's comma.ai's novel contribution
World Models [2] in paper
Medium
Video PreTraining (VPT): Learning to Act by Watching Unlabeled Online Videos
Baker, Akkaya et al. · OpenAI · NeurIPS 2022
Cited as [2] for the concept behind Future-Anchored World Models (§2.5). VPT trains non-causal models conditioned on future observations. This is the intellectual origin of "future anchoring" — conditioning on what happens later to generate consistent trajectories now.
~3 hrs
What to read
§2 (inverse dynamics model — IDM), §3 (behavioural cloning from IDM labels), §4 (fine-tuning with RL). The IDM is trained to predict what action was taken between two observed frames. This labels unlabeled video with pseudo-actions — analogous to how comma.ai uses future frames to label current-frame trajectories.
The future-conditioning idea
VPT IDM:
    a_t = IDM(o_t, o_{t+1})   ← uses future frame
Comma.ai Future Anchoring:
    w: h^w_{T,F} → p(o_T | h^w_{T,F})
    where h^w_{T,F} includes future anchor F
    (uses future observations to guide generation)
Recovery pressure
Without future anchoring, the World Model doesn't know where the episode ends up. With future anchoring (showing the model frames from a few seconds ahead), the model generates trajectories that converge toward that future state — even from a bad current state. This is "recovery pressure" (§2.5).
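The index arithmetic behind an anchored context can be sketched as follows (my own sketch; the n_ctx, f_s, f_e values are made-up placeholders, only F = (f_s, f_e) is from the paper):

```python
def build_context(frames, t, n_ctx=4, f_s=8, f_e=12):
    # Causal context: the last n_ctx frames up to and including t
    past = frames[max(0, t - n_ctx + 1): t + 1]
    # Future anchor: frames f_s..f_e steps ahead, clipped at episode end
    anchor = frames[t + f_s: t + f_e + 1]
    return past, anchor
```

The generated frame at step t is conditioned on both lists, so even when `past` represents a bad state, `anchor` pulls the generation back toward where the episode actually ends up.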
Comma.ai connection
  • §2.5: "we can train non-causal World Models similar to [2] conditioned on future observations"
  • Future anchoring is the key mechanism that makes the Plan Model work — without it, the Plan Model doesn't know what "good" looks like
  • F = (f_s, f_e) defines the future horizon: f_s is when anchoring starts, f_e is when it ends
Driving WM [11] in paper
Medium
GAIA-1: A Generative World Model for Autonomous Driving
Hu, Russell, Yeo et al. · Wayve 2023
The most direct predecessor to comma.ai's approach — a driving-specific generative world model. Read it to understand what comma.ai improved on: GAIA-1 uses discrete tokens and GPT-style autoregression; comma.ai uses continuous latents and diffusion.
~3 hrs
What to read
§3 (model architecture — tokeniser, world model, decoder), §4 (training), §5 (video generation results). Focus on the differences in design choices vs comma.ai: GAIA uses VQVAE discrete tokens, GPT-style next-token prediction, and text+action conditioning. No diffusion, no pose conditioning.
Key differences from comma.ai
GAIA-1:
    Tokeniser:    VQVAE (discrete tokens)
    WM:           GPT autoregressive transformer
    Conditioning: text prompts + actions
    Policy:       NOT trained inside WM
    Evaluation:   video quality only
Comma.ai:
    Tokeniser:    SD VAE (continuous latents)
    WM:           DiT + Rectified Flow
    Conditioning: 6-DOF pose
    Policy:       trained inside WM (§5.1)
    Evaluation:   real ADAS deployment
Comma.ai connection
  • [11] cited in the introduction as the driving world model precedent
  • GAIA-1 shows that video generation on driving data is tractable at scale
  • Comma.ai adds the crucial step: actually training a policy inside the WM and deploying it
  • GAIA-1 doesn't use future anchoring — it can't produce recovery-pressure trajectories
Why comma.ai chose diffusion
VQVAE tokenisation loses fine spatial detail (important for detecting lane markings, traffic lights). Continuous latent + diffusion preserves this detail. Also, diffusion is more naturally amenable to conditioning on continuous signals like 6-DOF pose — discrete tokens need special handling for continuous conditions.
Driving WM [3] in paper
Medium
Navigation World Models
Bar, Zhou, Tran, Darrell, LeCun · Meta 2024
Cited as [3]. The closest architectural predecessor to comma.ai's pose-conditioned video generation. NWM uses camera pose (not actions) as the conditioning signal for a video world model — this is exactly the design choice comma.ai makes in §2.4.
~3 hrs
What to read
§3 (architecture — pose-conditioned video generation), §4 (planning with NWM). Focus on how camera pose is used as the transition signal: the model generates the next frame conditioned on where the camera will be next, not what action was taken. This is the key design idea shared with comma.ai.
Pose as control signal
NWM / Comma.ai design:
    w: (frames, poses, next_pose) → next_frame
vs. action-conditioned design:
    w: (frames, actions) → next_frame
Benefit: the WM is independent of the vehicle model.
Augment VM parameters → no WM retraining needed.
Comma.ai connection
  • §4.2.1: conditioning signals include "vehicle poses" — directly from NWM design
  • §2.4: "using the pose as the transition signal ... enables augmenting the Vehicle Model's parameters without needing to retrain the World Model"
  • Bar et al. condition on future pose for planning — comma.ai adds future anchoring on top of this
What NWM doesn't have
  • No future anchoring — cannot produce recovery-pressure trajectories
  • No Plan Head — NWM is a world model only, not a plan model
  • No on-policy training inside the WM
  • Not deployed on real hardware — evaluation is video quality only
World Models [29] in paper
Medium
Diffusion Models Are Real-Time Game Engines (GameNGen)
Valevski, Leviathan, Arar, Fruchter · Google DeepMind · ICLR 2025
Cited as [29] specifically for the noise augmentation technique (§4.4). GameNGen proposes adding noise to context frames during training to make the model robust to its own autoregressive drift — comma.ai adopts this technique directly.
~4 hrs
What to read
§3.2 (autoregressive drift problem), §3.3 (noise augmentation fix), §4 (results on Doom). The core problem: when you generate frame T+1 from generated frame T (not ground-truth frame T), small errors in frame T compound into large errors in frame T+1 and beyond. The fix: during training, randomly add noise to context frames so the model learns to be robust to imperfect inputs.
Noise augmentation technique
For 30% of training samples:
    sample τ_ctx ~ LogitNormal(0, 0.25)
    add noise to context frames t = 1..T-1:
        o_t → τ_ctx·ε + (1-τ_ctx)·o_t
    don't noise: future anchor frames
    only compute loss on target frame T
Effect: the model learns to denoise the target frame even when context frames are imperfect.
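A minimal sketch of the augmentation step, treating frames as scalars for brevity (my own sketch; real frames are latent tensors and ε is per-element noise):

```python
import math
import random

def noise_augment(context_frames, rng, p=0.3, std=0.25):
    # With probability p, noise every context frame at one shared tau_ctx
    # drawn from LogitNormal(0, std). Anchor and target frames stay clean.
    if rng.random() >= p:
        return list(context_frames)
    tau = 1.0 / (1.0 + math.exp(-rng.gauss(0.0, std)))
    return [tau * rng.gauss(0.0, 1.0) + (1.0 - tau) * o
            for o in context_frames]
```

Because the loss is computed only on the target frame, the model sees corrupted context at training time exactly as it will during autoregressive rollout, when the context is its own imperfect output.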
Why this is critical
Without noise augmentation, a World Model trained on clean frames fails catastrophically in autoregressive rollout — after 10 steps the generated frames look nothing like real driving. With noise augmentation, the model stays coherent for 40+ frames (see Fig.6 in the paper — LPIPS stays stable across the rollout).
Comma.ai connection
  • §4.4: "we use a noise level augmentation technique ... A similar technique was proposed in [29]"
  • Comma.ai difference: they don't discretise noise levels (GameNGen used a discrete set)
  • This is what makes Fig.6 (left) in the paper show stable LPIPS across 40 simulated frames
  • Aug prob = 0.3, σ = 0.25 for context frames (anchor frames are never noised)
5

Policy architecture + planning

FastViT extractor, temporal transformer, MHP loss, and the information bottleneck — the actual driving policy

Read after Layer 1
Architecture [31] in paper
Medium
Attention Is All You Need
Vaswani, Shazeer, Parmar et al. · NeurIPS 2017
Cited as [31]. The temporal model in §5 is a Transformer encoder that reads the last 2 seconds of FastViT features and outputs action + trajectory. You need to understand self-attention, positional encoding, and encoder-only architecture before reading §5.
~4 hrs
What to read
§3 (model architecture — scaled dot-product attention, multi-head, positional encoding), §3.1 (encoder-decoder structure). For comma.ai's use case, focus on the encoder-only path. The temporal model is a stack of self-attention + FFN blocks reading a sequence of feature vectors, not a sequence of tokens.
Key equations
Attention(Q,K,V) = softmax(QKᵀ/√d_k)·V
Multi-head: concat(head_1,...,head_h)·W^O
  where head_i = Attention(Q·W_i^Q, K·W_i^K, V·W_i^V)
Positional encoding:
  PE(pos,2i)   = sin(pos/10000^{2i/d})
  PE(pos,2i+1) = cos(pos/10000^{2i/d})
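The sinusoidal positional encoding is short enough to write out directly (a sketch; comma.ai's temporal model may well use a learned encoding instead):

```python
import math

def positional_encoding(pos, d_model):
    # Interleaved sin/cos at geometrically spaced wavelengths, so each
    # position gets a unique, smoothly varying fingerprint
    pe = []
    for i in range(0, d_model, 2):
        angle = pos / (10000 ** (i / d_model))
        pe.append(math.sin(angle))
        pe.append(math.cos(angle))
    return pe[:d_model]
```

For a 2-second window of FastViT features, this is what lets self-attention distinguish "0.1 s ago" from "1.9 s ago" despite attention itself being permutation-invariant.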
Comma.ai connection
  • §5: "a small Transformer [31] based temporal model"
  • Input: FastViT features over last 2 seconds (at 20 Hz = ~40 frames)
  • Output: action logits + 5-hypothesis trajectory plan
  • Frozen during on-policy training — only the temporal model is updated (§5.1)
What to skip
The decoder (§3.3) and the full encoder-decoder translation results. Comma.ai uses an encoder-only architecture (like BERT, not GPT). The multi-head attention reading is the essential part.
Architecture [30] in paper
Hard
FastViT: A Fast Hybrid Vision Transformer Using Structural Reparameterization
Vasu, Gabriel, Zhu, Tuzel, Ranjan · Apple · ICCV 2023
Directly cited as [30] — the exact feature extractor architecture. At inference, FastViT reparameterises all conv branches into a single conv, enabling mobile-speed inference. This is why the policy runs real-time on the comma.ai device hardware.
~4 hrs
What to read
§3 (RepMixer block — the core innovation), §4 (model variants), §5 (latency benchmarks). The key idea: during training, use a multi-branch architecture (depthwise conv + 1×1 conv + identity). At inference, reparameterise all branches into a single depthwise conv with no computational overhead.
Reparameterisation
Training:
    y = DW_3x3(x) + DW_1x1(x) + x    (3 branches, expensive)
Inference (reparameterised):
    y = DW_3x3_merged(x)             (single conv, same result, 3× faster)
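A 1-D analogue of the merge (my own sketch, not FastViT's actual kernels): the 1×1 branch and the identity skip both fold into the centre tap of the 3-tap kernel, so one convolution reproduces the three-branch sum exactly.

```python
def conv3(x, k):
    # Length-3 kernel with zero ('same') padding
    pad = [0.0] + list(x) + [0.0]
    return [k[0] * pad[i] + k[1] * pad[i + 1] + k[2] * pad[i + 2]
            for i in range(len(x))]

def merge_branches(k3, k1):
    # Fold the 1x1 kernel and the identity skip into the centre tap
    return [k3[0], k3[1] + k1 + 1.0, k3[2]]

x, k3, k1 = [1.0, 2.0, 3.0], [0.1, 0.2, 0.3], 0.5
multi = [a + k1 * xi + xi for a, xi in zip(conv3(x, k3), x)]
single = conv3(x, merge_branches(k3, k1))
```

In the real network the merge also has to absorb the per-branch BatchNorm statistics, but the linearity argument is the same.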
Why it matters for comma.ai
The policy runs at 20 Hz on a Snapdragon chip in a comma.ai device. Standard ViT would be too slow. FastViT's reparameterisation gives ViT-level accuracy with CNN-level inference speed. The extractor processes two camera streams simultaneously.
Comma.ai connection
  • §5: "a supervised feature extractor based on the FastViT architecture [30]"
  • Trained jointly on lane lines, road edges, lead car, ego trajectory — all as auxiliary heads
  • Frozen during on-policy training — only the temporal Transformer is updated
  • The information bottleneck (§5.2) is applied to FastViT's output before the temporal model
Planning [5] in paper
Medium
Multimodal Trajectory Predictions for Autonomous Driving Using Deep Convolutional Networks
Cui, Radosavljevic, Chou et al. · ICRA 2019
Cited as [5]. The MHP (Multi-Hypothesis Planning) loss is used in both the Plan Head (§4.2.2) and the policy's trajectory output (§5). 5 hypotheses, winner-takes-all, heteroscedastic Laplace NLL. Understanding this loss is essential to understanding what the plan head is doing.
~3 hrs
What to read
§3 (multi-hypothesis framework), §3.2 (loss function — winner-takes-all), §3.3 (Laplace prior). The core problem: driving is multimodal (turn left vs turn right are both valid). A unimodal Gaussian loss averages the modes and predicts a straight path. MHP with winner-takes-all lets each hypothesis specialise on one mode.
MHP loss — what comma.ai uses
n_hyp = 5 hypotheses, each predicts:
  μ_k      (trajectory mean, 2D per step)
  log_σ_k  (log scale — heteroscedastic)
  log_w_k  (hypothesis weight)
Laplace NLL per hypothesis k:
  NLL_k = log σ_k + |x − μ_k| / σ_k
Winner-takes-all:
  loss = NLL_{k*} where k* = argmin_k NLL_k
  (only the best hypothesis gets gradient)
Why Laplace not Gaussian
Laplace prior: NLL = log σ + |x-μ|/σ (L1 in the residual). Gaussian prior: NLL = log σ + (x-μ)²/2σ² (L2 in the residual). Laplace is more robust to outliers — a single frame where the expert makes an unusual manoeuvre doesn't blow up the loss. Also, real trajectory distributions have heavier tails than Gaussian.
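The winner-takes-all Laplace NLL above is short enough to sketch directly. A minimal numpy version (the log-weight head log_w_k and the gradient plumbing of a real training loop are omitted):

```python
import numpy as np

def mhp_laplace_loss(x, mu, log_sigma):
    """Winner-takes-all Laplace NLL over K hypotheses.

    x:         (T, 2) ground-truth trajectory
    mu:        (K, T, 2) hypothesis trajectory means
    log_sigma: (K, T, 2) hypothesis log scales (heteroscedastic)
    """
    sigma = np.exp(log_sigma)
    # per-hypothesis NLL, summed over time steps and x/y coordinates
    nll = (log_sigma + np.abs(x[None] - mu) / sigma).sum(axis=(1, 2))  # (K,)
    k_star = int(np.argmin(nll))  # only the best hypothesis receives gradient
    return nll[k_star], k_star

# two hypotheses: one goes straight, one matches the (curving) ground truth
x = np.array([[1.0, 0.0], [2.0, 0.5], [3.0, 1.5]])
mu = np.stack([np.array([[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]), x])
log_sigma = np.zeros_like(mu)
loss, k = mhp_laplace_loss(x, mu, log_sigma)
```

Because only the winning hypothesis is penalised, the straight-path hypothesis is free to stay straight while another specialises on the turn — exactly the mode-separation argument in §3.2 of the Cui et al. paper.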
Comma.ai connection
  • §4.2.2: "The Plan Head output T uses a Multi-hypothesis Planning loss (MHP) [5] with 5 hypotheses"
  • §5: policy trajectory head also uses MHP with 5 hypotheses + Laplace prior
  • At inference: pick hypothesis with highest log-weight; take its mean as the predicted trajectory
  • During IMPALA rollout: the Plan Model's best hypothesis provides â^wp for the learner
Autonomous Driving [4] in paper
Easy
End to End Learning for Self-Driving Cars (DAVE-2)
Bojarski, Del Testa et al. · NVIDIA 2016
Cited as [4] as the founding end-to-end driving paper. Read it in 1 hour to understand where the field started — raw pixels to steering angle, nothing else. Then read comma.ai's paper to see how much further it goes.
~1.5 hrs
What to read
The whole paper — it's only 9 pages. A 9-layer CNN maps a single dashcam frame to a steering angle. Trained on human driving data with MSE loss. No temporal model, no plan head, no world model, no uncertainty estimation. The simplest possible E2E policy — and it works on highways.
The gap to comma.ai
DAVE-2 (2016):
  CNN(frame_t) → steering_t
  Trained: supervised on expert frames
  Inference: single frame, no history
Comma.ai (2025):
  FastViT(frame_{t-2s..t}) → features
  Transformer(features) → action + 5 trajectories
  Trained: on-policy in world model
  Inference: 2s history, two cameras
Comma.ai connection
  • §1: "End-to-End (E2E) learning ... [4]" — DAVE-2 is the starting point of the E2E tradition
  • Comma.ai claims [§1]: "to our knowledge, this is the first work to show how E2E training, without handcrafted features, can be used in a real-world ADAS"
  • DAVE-2 demonstrated E2E driving on real roads, but as a research prototype: it was never deployed as a production ADAS, which is the scope of comma.ai's claim
Why still worth reading
DAVE-2 crystallises the core E2E argument: don't decompose into perception + planning + control; learn the mapping end-to-end and let the network figure out what intermediate representations are useful. Every subsequent E2E paper (including comma.ai's) is a refinement of this core bet.
6

The comma.ai paper itself

arXiv:2504.19077 — now every citation maps to something you've read. Read in section order.

Final destination
§2 — Formulation Read 1st
Easy
§2 Formulation — the three equations that define the entire system
Goff, Hogan, Hotz et al. · comma.ai · arXiv 2025
Skim §2.1–2.3 on first read. The three equations (policy, world model, future-anchored WM) set all notation. Everything else is implementation. The Vehicle Model (§2.3) is the bridge between the WM's pose output and the policy's action input.
~30 min
The three core equations
Eq.1 — Policy:
  π: h^π_T → p(a_{T+1} | h^π_T)
  h^π_T = {(o_1, a_1), ..., (o_T, a_T)}
Eq.2 — World Model:
  w: h^w_T → p(o_T | h^w_T)
  h^w_T = {(p_1, o_1), ..., (p_T, ·)}    (o_T is what is being predicted)
Eq.3 — Future-Anchored WM:
  w: h^w_{T,F} → p(o_T | h^w_{T,F})
  h^w_{T,F} = {anchor F} + {context T}
What to focus on
  • The distinction between state space S and observation space O (§2.1) — policy only sees images, not full state
  • The Vehicle Model (§2.3) forward and inverse: forward gives next pose from action; inverse gives action from trajectory
  • Future anchoring (§2.5): F = (f_s, f_e) where f_s > T — anchor is always in the future
  • "Recovery pressure" (§2.5) — with future anchoring the model learns to recover from bad states
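The paper's actual Vehicle Model is not reproduced in this syllabus, but the forward/inverse relationship from the bullet list can be illustrated with a hypothetical unicycle model, where the action is (speed, yaw rate) and a pose is (x, y, heading):

```python
import numpy as np

def vm_forward(pose, action, dt=0.05):
    # hypothetical unicycle stand-in for the paper's Vehicle Model:
    # next pose from the current pose and an action (speed, yaw rate)
    x, y, th = pose
    v, w = action
    return (x + v * np.cos(th) * dt, y + v * np.sin(th) * dt, th + w * dt)

def vm_inverse(pose, next_pose, dt=0.05):
    # recover the action that maps pose → next_pose under the same model
    x, y, th = pose
    nx, ny, nth = next_pose
    w = (nth - th) / dt
    v = np.hypot(nx - x, ny - y) / dt
    return (v, w)

p0 = (0.0, 0.0, 0.1)
a = (10.0, 0.2)          # 10 m/s, 0.2 rad/s
p1 = vm_forward(p0, a)
v, w = vm_inverse(p0, p1)
```

This is the bridge role §2.3 describes: the forward direction drives the world model's pose input, and the inverse direction turns a predicted trajectory back into actions.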
Key architectural insight
The Plan Model can be a separate model or trained jointly with the World Model. Comma.ai trains them jointly to leverage shared representations. The Plan Model can also work with any simulator — reprojective or WM — because it only uses the image + pose history, not the simulator internals.
What to skip on first read
§2.6 in depth — it makes more sense after reading §4 (the DiT). Come back to §2.6 after you understand how the Plan Head is implemented in the DiT.
§3 — Reprojective Sim Read 2nd
Easy
§3 Reprojective Simulation — six limitations that motivate §4
Goff, Hogan, Hotz et al. · comma.ai · arXiv 2025
Read §3.1 carefully. Each of the six limitations is an explicit reason the World Model is needed. The most important is shortcut learning: artefacts correlated with Δpose let the policy cheat without learning real driving behaviour.
~30 min
The six limitations (§3.1)
1. Static scene assumption (counterfactual problem)
   Other agents don't react to the ego vehicle
2. Depth estimation inaccuracies
   Noisy depth → geometric artefacts
3. Occlusions → inpainting artefacts
   Regions that become visible must be hallucinated
4. Reflections and lighting
   No physics of light → night driving fails
5. Limited range (< 4 m translation)
   Larger Δpose → more artefacts
6. Shortcut learning [KEY]
   Artefacts ∝ |Δpose| → the policy exploits the correlation to predict the action without learning any real visual understanding
Why shortcut learning is the killer
The information bottleneck (§5.2) is specifically introduced to fight this. By adding Gaussian noise to the features, the bottleneck forces the policy to learn features that generalise — not artefact-specific patterns. Without §5.2, a policy trained in the reprojective simulator would fail immediately in the real world.
Table 2 context
  • Reprojective simulator: 24/24 convergence (good) — but this is partly because of shortcut learning
  • WM simulator: 24/24 convergence too — but without the cheating
  • Field results: WM policy has 52.49% engaged distance vs reprojective's 48.10%
  • The WM's advantage grows with deployment time — no shortcut features to exploit
Novel view synthesis reference
§3 cites [27] (Seitz & Dyer 1996 — View Morphing) as the technique. Read it if you want to understand the geometric details of reprojection. Not essential for understanding the comma.ai paper — the limitations are more important than the technique.
§4 — World Model Read 3rd
Hard
§4 World Model Simulation — DiT, Rectified Flow, Plan Head, noise augmentation
Goff, Hogan, Hotz et al. · comma.ai · arXiv 2025
The core technical contribution. §4.2.2 (Eq.4–5) is the heart. Read with the DiT paper and Rectified Flow paper open side-by-side. §4.4 (noise augmentation) is subtle but critical — it's what makes autoregressive rollout stable.
~3 hrs
§4.2.2 — the training objective (read carefully)
Eq.4 (noise):      o_τ = τ·ε + (1−τ)·o
Eq.5 (total loss): L = L_RF + α·L_T
  L_RF = ||w(o_τ, p, ε, τ) − (ε − o)||²   (velocity target d o_τ/dτ; matches the Euler update in §4.3)
  L_T  = MHP(w(o_τ, p, ε, τ), T)
  τ ~ LogitNormal(0.0, 1.0) [8]
  α = 1.0 (equal weighting)
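A numpy sketch of the noising step and the rectified-flow regression target, assuming the velocity convention v = ε − o (the one consistent with the Euler update o_τ -= Δτ·v in §4.3) and treating LogitNormal(0, 1) as the sigmoid of a standard normal draw:

```python
import numpy as np

rng = np.random.default_rng(0)
o = rng.standard_normal((16, 32, 4))  # clean VAE latent (16x32x4 at comma.ai resolution)
eps = rng.standard_normal(o.shape)    # Gaussian noise

# tau ~ LogitNormal(0, 1): squash a standard normal through a sigmoid
tau = 1.0 / (1.0 + np.exp(-rng.standard_normal()))

o_tau = tau * eps + (1.0 - tau) * o   # Eq.4: interpolate between data and noise
v_target = eps - o                    # rectified-flow velocity d o_tau / d tau

def l_rf(v_pred):
    # L_RF: mean squared error to the velocity target
    return np.mean((v_pred - v_target) ** 2)
```

The full Eq.5 loss would add α·L_T, the MHP plan-head term, on top of this regression.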
§4.4 — noise augmentation (read carefully)
For 30% of training samples:
  context frames 1..T−1: add noise with τ_ctx ~ LogitNormal(0, 0.25)
  anchor frames f_s..f_e: no noise (τ = 0)
  the model is still told τ = 0 for all non-target frames, so it must treat corrupted context as if it were clean
  loss is computed only on the target frame T
Effect: the model becomes robust to its own autoregressive errors
§4.3 — sequential sampling
  • 15 Euler steps, Δτ = 1/15 (τ goes 1→0)
  • Each step: predict velocity v = w(o_τ, p, τ); update o_τ -= Δτ·v
  • After sampling o_T: shift context window, append o_T, repeat for T+1
  • KV-caching enabled by the frame-wise causal mask
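The sampling loop above can be sketched in a few lines of numpy. With an oracle that returns the true rectified-flow velocity ε − o (which is constant in τ), 15 Euler steps recover the clean latent exactly:

```python
import numpy as np

def euler_sample(v_fn, eps, n_steps=15):
    # integrate from tau=1 (pure noise) down to tau=0 (clean latent)
    o_tau = eps.copy()
    tau = 1.0
    dtau = 1.0 / n_steps
    for _ in range(n_steps):
        v = v_fn(o_tau, tau)   # model predicts the velocity d o_tau / d tau
        o_tau -= dtau * v
        tau -= dtau
    return o_tau

rng = np.random.default_rng(0)
o = rng.standard_normal((4, 4))      # "clean" latent
eps = rng.standard_normal((4, 4))    # starting noise

# oracle stand-in for the trained DiT: the true velocity is eps - o
v_true = lambda o_tau, tau: eps - o
o_hat = euler_sample(v_true, eps)
```

In the real rollout, v_fn is the DiT conditioned on pose and context, and after each sampled frame the context window shifts forward as described above.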
Scaling results (Fig.5)
  • 250M → 500M → 1B: LPIPS improves (lower = better, baseline 0.148 from VAE compression)
  • 100k → 200k → 400k segments: LPIPS improves — both scale directions matter
  • 500M on 400k is the default for all experiments
§5 — Policy Read 4th
Medium
§5 Driving Policy Training — the payoff: Table 2 and real-world deployment
Goff, Hogan, Hotz et al. · comma.ai · arXiv 2025
Table 2 is the payoff. Off-policy: 5/24 convergence tests despite best trajectory MAE. On-policy WM: 24/24. This is the BC failure mode from Layer 1 made empirical. §5.2 (information bottleneck) is the practical trick that makes real-world transfer work.
~2 hrs
Table 2 — the key result
               Off-policy   Reprojective   WM
Lane center:   5/24         24/24          24/24
Lane change:   8/20         20/20          19/20
Off-pol. MAE:  0.361        0.369          0.394

Off-policy has the LOWEST MAE but FAILS the on-policy tests. This is the compounding-error argument, empirically.
§5.2 Information bottleneck
Add white Gaussian noise to the FastViT output:
  z_noised = z + ε,  ε ~ N(0, 1/SNR)
  SNR = 10 → capacity ≈ 700 bits
Prevents the policy from exploiting:
  - reprojective-sim artefacts (§3.1)
  - simulator-specific pixel patterns
Forces learning of real visual features
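A sketch of the bottleneck, assuming 1/SNR refers to the noise-to-signal variance ratio; the per-dimension capacity then follows the standard Gaussian-channel formula ½·log₂(1 + SNR):

```python
import math
import numpy as np

def bottleneck(z, snr=10.0, rng=None):
    # additive white Gaussian noise channel:
    # noise variance = signal variance / SNR (assumed interpretation)
    rng = rng or np.random.default_rng(0)
    noise_std = math.sqrt(z.var() / snr)
    return z + rng.normal(0.0, noise_std, size=z.shape)

# Shannon capacity of a Gaussian channel, per feature dimension
bits_per_dim = 0.5 * math.log2(1.0 + 10.0)  # at SNR = 10
```

At SNR = 10 each feature dimension carries roughly 1.7 bits, which is how a fixed-size feature vector ends up with a total capacity on the order of the ~700 bits quoted above.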
Table 3 — field results (500 users, 2 months)
                     Reprojective   WM
Number of trips:     47,047         40,026
Engaged % time:      27.63%         29.92%
Engaged % distance:  48.10%         52.49%

Users engage the WM policy 4.4 percentage points more by distance — meaningful in an ADAS context.
Comma.ai novel claims (§1)
  • "First work to show E2E training, without handcrafted features, used in a real-world ADAS"
  • "First use of a world model simulator for on-policy training of a policy deployed in the real world"
  • Both claims are about real deployment, not just simulation results

Reuse guide — comma.ai components

What is open, where to find it, and how to use it in your own project

Dataset — comma2k19
Download: github.com/commaai/comma2k19
2,000+ hours commute driving, fully open
Includes: 20 Hz video, MSCKF poses, IMU, GPS, CAN
Segment format: 1-minute clips, pre-processed poses
Licence: MIT — free for research and commercial use
Larger internal dataset: billions of frames, not public
VAE — Stable Diffusion encoder
Model: stabilityai/sd-vae-ft-mse on HuggingFace
64×64 → 8×8×4 latents, scale factor 0.18215
For 128×256 (comma.ai resolution): latents are 16×32×4
Load with diffusers: AutoencoderKL.from_pretrained(...)
Freeze encoder — do not fine-tune on driving data
Licence: CreativeML Open RAIL-M — check before commercial use
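The latent shapes in the list follow from the SD VAE's fixed 8× spatial downsampling and 4 latent channels. A hypothetical helper (not part of diffusers) makes the arithmetic explicit:

```python
def latent_shape(h, w, downsample=8, channels=4):
    # SD VAE: each spatial dimension shrinks by 8x, output has 4 latent channels
    assert h % downsample == 0 and w % downsample == 0
    return (h // downsample, w // downsample, channels)
```

So 64×64 crops give 8×8×4 latents, and comma.ai's 128×256 frames give 16×32×4 — the shape the DiT patchifies.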
openpilot — open ADAS
github.com/commaai/openpilot — MIT licence
Policy runs as selfdrive/modeld/ service
Model weights: supercombo.onnx (publicly downloadable)
Input: 2 cameras × 12 frames × 128×256, 20 Hz
Output: trajectory (33 pts × 3D), lane lines, lead vehicle
Training framework: tinygrad (not PyTorch)
Policy model weights
ONNX export: selfdrive/modeld/models/supercombo.onnx
Load in Python: onnxruntime.InferenceSession
Input normalisation: pixels / 128.0 - 1.0 (same as paper)
FastViT weights included — no separate download needed
Temporal model weights also included in the ONNX
Can run inference on CPU: ~50 ms/frame on a modern laptop
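The input normalisation from the list, as a small numpy helper. The onnxruntime session itself needs supercombo.onnx on disk, so only the preprocessing step is sketched here:

```python
import numpy as np

def normalise_frames(frames_u8):
    # openpilot/paper convention: uint8 pixels [0, 255] -> roughly [-1, 1]
    return frames_u8.astype(np.float32) / 128.0 - 1.0

x = np.array([0, 128, 255], dtype=np.uint8)
y = normalise_frames(x)
```

The normalised array is what you would feed (with the expected camera/frame layout) into onnxruntime.InferenceSession.run.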
MSCKF pose pipeline
Comma.ai's implementation: github.com/commaai/laika
For comma2k19: poses are pre-computed — skip running MSCKF
Output format: 6-DOF SE(3) in comma coordinate frame
Coordinate frame: x=forward, y=left, z=up (ISO 8855)
For custom data: run laika or use open_vins as alternative
GPS required for metric scale — pure VIO will drift
Build your own DiT WM
Start from: facebookresearch/DiT (official PyTorch)
Add: 3D patchify — flatten the (frame, h, w) patch grid into one token sequence
Add: frame-wise causal mask for KV-caching during rollout
Add: pose embedding (linear → SiLU → linear) summed into AdaLN
Add: world-timestep embedding (nn.Embedding) summed in
Add: Plan Head — 3× residual FF → (n_hyp × traj_len × 5)
Loss: RF + α·MHP, τ ~ LogitNormal(0, 1), α = 1.0
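The conditioning items in the list all feed into AdaLN modulation. A minimal numpy sketch, assuming the DiT-style (1 + scale) parameterisation; in a real block, the pose and world-timestep embeddings would be summed into the conditioning vector whose MLP produces scale and shift:

```python
import numpy as np

def adaln(x, scale, shift, eps=1e-6):
    # adaptive LayerNorm: normalise over the feature dim, then modulate
    # with conditioning-derived scale and shift
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mu) / np.sqrt(var + eps)
    return x_norm * (1.0 + scale) + shift

tokens = np.random.default_rng(0).standard_normal((10, 64))  # (tokens, dim)
cond_scale = np.zeros(64)  # would come from the pose + timestep embedding MLP
cond_shift = np.zeros(64)
out = adaln(tokens, cond_scale, cond_shift)
```

With zero conditioning this reduces to a plain LayerNorm, which is why DiT initialises the modulation MLP to output zeros.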

Suggested 4-week reading schedule

~10 hrs/week. Designed so each week ends with actionable understanding you can implement.

Week 1
Foundations + Localisation
Layer 1 (BC, DAgger, PPO, IMPALA) + Layer 2 (MSCKF, VIO, comma2k19)
Goal: understand why BC fails and how IMPALA fixes it; understand what p_t is and where it comes from. Implementation: code a simple BC baseline on comma2k19 poses.
Week 2
Generative models — VAE through DiT
Layer 3 (VAE, LDM/SD, Rectified Flow, DiT, SD3, AdaLN)
Goal: understand the full stack from raw frames → VAE latents → DiT denoising → reconstructed frames. Implementation: load the SD VAE, encode a frame from comma2k19, decode it back, measure LPIPS.
Week 3
World models lineage + Policy architecture
Layer 4 (Ha & Schmidhuber, VPT, GAIA-1, NWM, GameNGen) + Layer 5 (Attention, FastViT, MHP, DAVE-2)
Goal: understand how comma.ai's architecture differs from its predecessors; understand MHP loss implementation. Implementation: build a minimal DiT with pose conditioning and RF loss; verify forward pass shapes.
Week 4
The comma.ai paper — section by section
Layer 6 (§2 → §3 → §4 → §5) + reread key sections with implementation
Goal: full paper comprehension — every equation maps to code, every citation maps to a paper you've read, every experiment result is interpretable. Implementation: run the wm_gridworld.py notebook and trace each component back to its paper section.