Why AI Review Needs Stop Conditions
In the previous post, I described how structured manifests and browser verification make execution deterministic.
That solves execution. It does not solve review.
Deterministic execution without deterministic review is incomplete.
The Review Loop Problem
Before the review protocol was formalized and automated, the process was iterative but informal. At different stages, one model generated or updated step documents and another reviewed them. Revisions were applied and resubmitted manually.
Role separation existed. Governance was not yet formalized.
The pattern was predictable. Early passes found real issues. Later passes produced smaller findings, speculative concerns, or contradictions of earlier feedback.
There was no convergence guarantee. The document grew. Cost grew. Signal declined.
Structured Findings
The first change was requiring structured output.
Instead of prose feedback, the reviewer must return JSON with stable finding IDs, explicit severity, and a gate justification describing a concrete failure scenario.
Blocking and warning findings require a falsifiable claim about what would break during execution.
If the justification is wrong, it can be disproven. If correct, the fix is clear.
Review shifts from opinion to evidence.
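A minimal sketch of what validating that structure might look like. The field names (`id`, `severity`, `justification`) and the `validate_finding` helper are illustrative assumptions, not the project's actual schema:

```python
# Severity tiers the reviewer is allowed to use.
SEVERITIES = {"blocking", "warning", "suggestion"}

def validate_finding(finding: dict) -> list[str]:
    """Return a list of schema violations; an empty list means the finding is valid."""
    errors = []
    if not finding.get("id"):
        errors.append("missing stable finding ID")
    severity = finding.get("severity")
    if severity not in SEVERITIES:
        errors.append(f"unknown severity: {severity!r}")
    # Blocking and warning findings must commit to a concrete,
    # falsifiable failure scenario in the gate justification.
    if severity in {"blocking", "warning"} and not finding.get("justification"):
        errors.append("gate justification required for blocking/warning")
    return errors
```

A finding that passes this check has committed to something disprovable; one that fails it is rejected before any human looks at it.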
Severity as a Gate
Findings are classified into three tiers.
Blocking findings must violate governance, break executability, and require a document change.
Warning findings must describe a concrete failure scenario for this specific step.
Suggestions never fail the gate.
Without strict definitions, severity escalates. With defined criteria, the model cannot fail the gate without committing to a verifiable breakage.
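The gate itself reduces to a very small check, sketched here under the same assumed severity names:

```python
def gate_passes(findings: list[dict]) -> bool:
    """The gate fails only on blocking or warning findings.

    Suggestions are recorded but never fail the gate.
    """
    return not any(f["severity"] in {"blocking", "warning"} for f in findings)
```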
Delta Mode
Fresh reviews send the full evidence bundle. Subsequent passes send only the document, a diff, and the prior review JSON.
The reviewer must produce dispositions for prior findings and report only net new issues.
Cost drops. Focus improves. Convergence accelerates.
When Review Stops
Review does not end when it feels complete. It ends when objective conditions are satisfied.
- Fresh review returns PASS with zero blocking and zero warning findings
- Executability review confirms governance compliance
- Suggestions have been evaluated for structural impact
Each cycle either improves the document or confirms stability.
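The stop conditions above can be expressed as one boolean check. The function and field names here are assumptions for illustration:

```python
def review_complete(review: dict, executability_ok: bool,
                    suggestions_evaluated: bool) -> bool:
    """Objective stop condition, not a feeling of completeness."""
    no_gate_findings = not any(
        f["severity"] in {"blocking", "warning"} for f in review["findings"]
    )
    return (review.get("verdict") == "PASS"   # fresh review returned PASS
            and no_gate_findings              # zero blocking, zero warning
            and executability_ok              # governance compliance confirmed
            and suggestions_evaluated)        # structural impact assessed
```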
Escape Hatch
Models misread evidence and occasionally contradict prior guidance.
If two consecutive fresh reviews produce only factually incorrect or marginal blocking findings, and the document remains executable and compliant, the gate passes.
Evidence is required. Dismissal without proof is not allowed.
This prevents infinite loops while preserving rigor.
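A sketch of how that escape hatch could be checked mechanically. The `disproven_with_evidence` flag is a hypothetical field standing in for whatever record of proof the real process keeps:

```python
def only_disproven_blocking(review: dict) -> bool:
    """True when every blocking finding was dismissed with recorded evidence."""
    blocking = [f for f in review["findings"] if f["severity"] == "blocking"]
    # Dismissal without proof is not allowed: each dismissal carries evidence.
    return bool(blocking) and all(f.get("disproven_with_evidence") for f in blocking)

def escape_hatch(fresh_reviews: list[dict], executable: bool, compliant: bool) -> bool:
    """Pass the gate after two consecutive fresh reviews whose blocking
    findings were all factually disproven, on a still-compliant document."""
    if len(fresh_reviews) < 2 or not (executable and compliant):
        return False
    return all(only_disproven_blocking(r) for r in fresh_reviews[-2:])
```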
Mechanical Enforcement
The review script validates the reviewer’s output.
Inconsistent structural assessments, missing justifications, or nonconverging severity counts trigger warnings.
Structured review defines the rules. Mechanical checks enforce them.
The final boundary is role separation.
Separation of Roles
In the current implementation, Claude Code performs execution and an OpenAI model performs review.
The providers are interchangeable. The boundary is not.
Both roles review, but they review for different reasons.
The reviewer evaluates the document cold against explicit criteria and produces structured findings.
After the reviewer-driven passes converge, the executor performs an executability review. That review is not a second opinion on style. It is a compatibility check that the step can still be executed exactly as written after the reviewer-driven edits.
Different roles. Different failure classes. Reduced bias.
The Principle
Unstructured AI review is nondeterministic. It has no convergence guarantee and no principled stop condition.
Structured review with explicit criteria, severity gates, delta mode, convergence rules, and mechanical enforcement transforms review into a bounded verification process.
The same principle that applies to execution applies to review.
Declare the criteria.
Declare the failure conditions.
Evaluate mechanically.
Decide based on evidence.
Governance matters more than model selection.