Find the failed layer first
Look at where the failure happened: materialization, meshing, solving, post-processing, or reporting. Each layer has its own evidence (CaseSpec validation errors, mesh log, solver log, function-object output, report build log). Diagnose from the right layer — not from the chat transcript.
Use the hypothesis ledger
ADR 0016 requires a recorded hypothesis before any rerun. The agent will write a RecoveryDecisionRecord with the suspected cause, the change it intends to make, and how it will know the fix worked. This prevents blind-retries that mask the real bug.
Recognize common patterns
- Divergence — usually solver-stiffness, Courant number, or under-relaxation. Check residual history, Co-number, and physics regime.
- Floating-point failure — bad initial condition, missing reference cell, malformed BC.
- Mesh quality — failed mesh-quality gate before the run even starts. Solver pack will refuse to launch.
- Marker mismatch (SU2) — boundary marker tags in config don't match the mesh.
- Missing function-object — a QoI extraction asked for runtime sampling that wasn't declared at run time.
Fix the spec, not the file
If you change a generated dictionary by hand, the next materialization will wipe it out. Always trace the fix back to the CaseSpec, fix it there, and re-materialize.
Spawn a diagnose subagent
For thorny failures, the kernel can spawn a focused diagnose subagent with elevated read tools (solver log, parser output, knowledge search, web search) but no write tools. It returns a typed finding; the parent agent decides whether to act on it.