Error Recovery

Evaluator-optimizer diagnosis, targeted fixes, and retry-aware recovery for simulation failures.

Simulation errors are common. SimPilot handles them with an evaluator-optimizer loop: run a command, evaluate the outcome, gather targeted context, apply a narrow fix, and retry with full awareness of what has already been tried.

Command evaluation

Every recovery cycle starts from the actual command result, not a canned pattern. Each result is classified as one of:

Explicit failure: the command returned a non-zero exitCode
Silent solver failure: the command exited cleanly but solver-aware log analysis still detected a failed run
Success: no failure markers are present, so recovery is skipped

This keeps recovery grounded in concrete evidence from the workspace instead of speculative rewrites.
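The three-way classification above can be sketched as follows. This is a minimal illustration, not SimPilot's implementation: the function name, the enum, and the list of log markers are assumptions made for the example, and the real solver-aware log analysis is presumably richer than a substring match.

```python
from enum import Enum


class Outcome(Enum):
    EXPLICIT_FAILURE = "explicit_failure"
    SILENT_SOLVER_FAILURE = "silent_solver_failure"
    SUCCESS = "success"


# Illustrative markers only (assumption): strings that commonly appear in
# OpenFOAM logs when a run has failed. The real marker set is not documented here.
FAILURE_MARKERS = (
    "FOAM FATAL ERROR",
    "FOAM FATAL IO ERROR",
    "Floating point exception",
)


def evaluate_command(exit_code: int, log_text: str) -> Outcome:
    """Classify a finished command from its exit code and its log text."""
    if exit_code != 0:
        # Non-zero exit code: the failure is explicit.
        return Outcome.EXPLICIT_FAILURE
    if any(marker in log_text for marker in FAILURE_MARKERS):
        # Clean exit, but the log still shows a failed run.
        return Outcome.SILENT_SOLVER_FAILURE
    # No failure markers: recovery is skipped.
    return Outcome.SUCCESS
```

Checking the exit code first keeps the cheap, unambiguous signal ahead of log inspection; the log scan only matters for commands that exited cleanly.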

Recovery chain

When a command fails, the agent works through the following steps in a fixed order. Context gathering stops as soon as there is enough evidence to support a fix:

1. Skills and local references

Load the relevant skill, then inspect agent_resources/ for the exact tutorial, reference dictionary, or reusable script that matches the failed command.

2. Knowledge, memory, and docs

Search the internal knowledge base, query personal or organization memory when prior fixes matter, and consult external docs through searchDocs for authoritative syntax or solver behavior.

3. Targeted fix

Edit only the specific file or setting implicated by the evidence. Recovery explicitly avoids regenerating the entire case when a narrow correction is sufficient.

4. Retry and compare

Rerun the command and compare the new result against the previous failure. Retries are tracked as fixed, still_failing, or different_error.

5. Late escalation

If local sources are exhausted, the agent can escalate to webSearch / retrieveUrl or delegate focused troubleshooting to agent("error-diagnostician").
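The ordering of context sources, with early stopping, can be sketched as below. The source names, the callable-based interface, and the `enough` threshold are hypothetical; they only illustrate "consult sources in a fixed order and stop once there is enough evidence", with escalation sources placed last.

```python
from typing import Callable, List, Tuple

# A source is a (name, lookup) pair; lookup returns evidence snippets for a failure.
Source = Tuple[str, Callable[[str], List[str]]]


def gather_context(
    failure: str, sources: List[Source], enough: int = 1
) -> Tuple[List[str], List[str]]:
    """Consult sources in order and stop once `enough` snippets are collected."""
    consulted: List[str] = []
    evidence: List[str] = []
    for name, lookup in sources:
        consulted.append(name)
        evidence.extend(lookup(failure))
        if len(evidence) >= enough:
            break  # later (more expensive) sources are never touched
    return consulted, evidence
```

With sources listed as local skills, then knowledge and docs, then web search, a hit in the second source means web search is never consulted.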

Retry history

Recovery is retry-aware. Every completed repair cycle records:

The failure summary
The fix that was attempted
The retry outcome (fixed, still_failing, or different_error)

This prevents the agent from repeating the same losing edit and makes later retries more deliberate.
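A retry record of this kind might look like the sketch below. The class and function names are assumptions for illustration, and comparing failures by normalized text is a deliberate simplification; only the three outcome labels come from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional


def _normalize(message: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(message.lower().split())


def classify_retry(previous_failure: str, new_failure: Optional[str]) -> str:
    """Compare a retry against the previous failure."""
    if new_failure is None:
        return "fixed"
    if _normalize(new_failure) == _normalize(previous_failure):
        return "still_failing"
    return "different_error"


@dataclass
class RepairCycle:
    failure_summary: str
    fix_attempted: str
    outcome: str  # "fixed", "still_failing", or "different_error"


@dataclass
class RetryHistory:
    cycles: List[RepairCycle] = field(default_factory=list)

    def record(
        self, failure_summary: str, fix_attempted: str, new_failure: Optional[str]
    ) -> RepairCycle:
        """Store one completed repair cycle and return it."""
        outcome = classify_retry(failure_summary, new_failure)
        cycle = RepairCycle(failure_summary, fix_attempted, outcome)
        self.cycles.append(cycle)
        return cycle

    def already_failed(self, fix_attempted: str) -> bool:
        """True if the same fix was tried before and did not resolve the failure."""
        wanted = _normalize(fix_attempted)
        return any(
            _normalize(c.fix_attempted) == wanted and c.outcome != "fixed"
            for c in self.cycles
        )
```

Calling `already_failed` before applying an edit is what keeps the same losing fix from being attempted twice.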

Error-diagnostician subagent

For repeated or ambiguous failures, SimPilot can delegate to a dedicated error-diagnostician subagent. That diagnostician can:

Inspect logs and workspace files with read-only runCommand
Search internal knowledge, organization memory, and personal memory
Consult searchDocs, webSearch, and retrieveUrl
Return a focused diagnosis with the next targeted fix to try

The diagnostician is prompt-guided, not mandatory. It is used when deeper investigation is warranted, not for every routine failure.

Web search as late fallback

Web search is available during recovery, but it is intentionally late in the chain. SimPilot first exhausts local skills, workspace references, internal knowledge, memory, and official docs. Only then does it search the broader web for edge cases, version-specific behavior, or missing documentation.

When web search is used, the sources are surfaced in the chat so you can see exactly what informed the fix.

Debugging protocols

The error recovery system follows disciplined debugging protocols that prevent guesswork:

Pre-simulation inspection

Before running any solver or utility (simpleFoam, pimpleFoam, blockMesh, snappyHexMesh, etc.), the agent must complete a mandatory checklist:

Mesh verification -- Run checkMesh, verify non-orthogonality < 70 degrees, max skewness < 4, aspect ratio < 100, and confirm all expected boundary patches exist
Field file consistency -- Verify dimensions match the solver type, patch names in 0/ files match constant/polyMesh/boundary, all required turbulence fields exist, and initial values are physically plausible
Scheme and solver consistency -- Confirm fvSchemes time scheme matches solver type, every div(phi,X) term has an explicit entry, fvSolution covers all solved fields, and the algorithm block name matches the solver
controlDict validation -- Confirm application keyword matches the intended solver, endTime is appropriate, and writeInterval/purgeWrite are set

The solver only runs after all checks pass.
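Two of the checklist items, the mesh quality limits and the patch-name consistency check, lend themselves to a short sketch. The thresholds are the ones listed above; the function names and the dictionary keys are hypothetical, and the metrics are assumed to have been extracted from checkMesh output already (parsing is not shown). Exact name matching is a simplification, since field files can also refer to patches through groups or patterns.

```python
from typing import Dict, Iterable, List

# Limits from the checklist: each metric must stay strictly below its limit.
MESH_LIMITS = {
    "max_non_orthogonality": 70.0,  # degrees
    "max_skewness": 4.0,
    "max_aspect_ratio": 100.0,
}


def mesh_violations(metrics: Dict[str, float]) -> List[str]:
    """Return a message for every limit that is missing or not satisfied."""
    problems = []
    for key, limit in MESH_LIMITS.items():
        value = metrics.get(key)
        if value is None:
            problems.append(f"{key}: missing")
        elif value >= limit:
            problems.append(f"{key}: {value} is not below {limit}")
    return problems


def patch_mismatches(
    mesh_patches: Iterable[str], field_patches: Iterable[str]
) -> List[str]:
    """Patch names present in the mesh boundary or in a field file, but not both."""
    return sorted(set(mesh_patches) ^ set(field_patches))
```

An empty result from both functions corresponds to those two checks passing; any returned entry is a concrete reason not to start the solver yet.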

Investigation-before-edit

When a simulation error occurs, the agent must investigate before modifying any files:

Read the actual error output (tail -n 50 log.<solver>)
Check mesh state (checkMesh)
Inspect residuals to identify divergence patterns
Check boundary conditions against the mesh
Only after diagnosing the root cause does the agent edit case files

This prevents the agent from rewriting entire cases when a targeted fix would suffice.
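As a rough sketch of this protocol, the snippet below reads the end of a log file (the equivalent of tail -n 50) and gates edits on the investigation being complete. The step names and both function names are assumptions chosen for the example.

```python
from collections import deque
from typing import Iterable, List, Optional

# Hypothetical identifiers for the four investigation steps listed above.
REQUIRED_STEPS = (
    "read_error_output",
    "check_mesh",
    "inspect_residuals",
    "check_boundary_conditions",
)


def tail_lines(path: str, n: int = 50) -> List[str]:
    """Return the last n lines of a text file without keeping the whole file."""
    with open(path, "r", errors="replace") as handle:
        return list(deque((line.rstrip("\n") for line in handle), maxlen=n))


def may_edit(completed_steps: Iterable[str], root_cause: Optional[str]) -> bool:
    """Allow edits only after every step is done and a root cause is stated."""
    done = set(completed_steps)
    return all(step in done for step in REQUIRED_STEPS) and bool(root_cause)
```

The gate returns False when any step is missing or the root cause is empty, which mirrors the rule that case files are edited only after diagnosis.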

Forensic debugging

For persistent or complex failures, the agent follows a structured backward trace:

Isolate the symptom -- Identify exactly what failed, which field diverged, and at what iteration
Backward trace -- Follow the computation chain backward using actual numerical values
Quantitative physicality test -- Compare actual values against physics expectations
Classify the originating error -- Trace to mesh, boundary condition, numerical scheme, solver configuration, or physical setup
Prove before fix -- State specific evidence, explain the causal chain, predict what the fix will change, then implement

The agent is prohibited from making edits that are not backed by evidence.
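The first step, isolating which field diverged and at what iteration, can be illustrated with a small helper. This is a sketch under stated assumptions: residual histories are taken to be available as per-field lists (extraction from the log is not shown), and the `limit` value is an arbitrary illustrative threshold, not a documented one.

```python
import math
from typing import Dict, List, Optional, Tuple


def first_divergence(
    residuals: Dict[str, List[float]], limit: float = 1.0
) -> Optional[Tuple[str, int]]:
    """Find the earliest (field, iteration) whose residual is non-finite or above limit.

    Iterations are counted from 1. Returns None if no field crosses the limit.
    """
    earliest: Optional[Tuple[str, int]] = None
    for field_name, series in residuals.items():
        for iteration, value in enumerate(series, start=1):
            if not math.isfinite(value) or value > limit:
                if earliest is None or iteration < earliest[1]:
                    earliest = (field_name, iteration)
                break  # only the first offending iteration per field matters
    return earliest
```

Knowing that, say, pressure went bad two iterations before velocity gives the backward trace a concrete starting point instead of the last line of the log.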

Post-success protocol

When a simulation converges with physically plausible results:

Stop making changes -- The agent does not optimize or retune a working simulation unless explicitly asked
Report clearly -- Final residuals, iteration count, key quantities, and any warnings
State the validation level -- Level 1 (case setup complete), Level 2 (solver converged), or Level 3 (physics validated)
Ask before proceeding -- No unsolicited parameter tuning, mesh refinement, or physics additions
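The report described above could be assembled along these lines. The three validation levels are taken from the list; the enum, the function name, and the report layout are assumptions made for illustration.

```python
from enum import IntEnum
from typing import Dict, List


class ValidationLevel(IntEnum):
    CASE_SETUP_COMPLETE = 1
    SOLVER_CONVERGED = 2
    PHYSICS_VALIDATED = 3


def format_report(
    final_residuals: Dict[str, float],
    iterations: int,
    key_quantities: Dict[str, float],
    warnings: List[str],
    level: ValidationLevel,
) -> List[str]:
    """Build the lines of a post-success report; nothing in the case is modified."""
    label = level.name.replace("_", " ").lower()
    lines = [
        f"Validation level: Level {int(level)} ({label})",
        f"Iterations: {iterations}",
    ]
    lines += [f"Final residual {name}: {value:.2e}" for name, value in final_residuals.items()]
    lines += [f"{name}: {value}" for name, value in key_quantities.items()]
    lines += [f"Warning: {text}" for text in warnings]
    return lines
```

The function only reports; any follow-up such as mesh refinement stays a separate, explicitly requested action.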
