<?xml version="1.0" encoding="utf-8" standalone="yes"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom">
  <channel>
    <title>Review on Tim Butterfield&#39;s Eclectic Blather</title>
    <link>https://timbutterfield.com/tags/review/</link>
    <description>Recent content in Review on Tim Butterfield&#39;s Eclectic Blather</description>
    <generator>Hugo -- gohugo.io</generator>
    <language>en</language>
    <managingEditor>tim@timbutterfield.com (Tim Butterfield)</managingEditor>
    <webMaster>tim@timbutterfield.com (Tim Butterfield)</webMaster>
    <lastBuildDate>Mon, 04 May 2026 00:00:00 +0000</lastBuildDate>
    
        <atom:link href="https://timbutterfield.com/tags/review/index.xml" rel="self" type="application/rss+xml" />
    
    
    <item>
      <title>Why AI Review Needs More Than One Model</title>
      <link>https://timbutterfield.com/post/why-ai-review-needs-more-than-one-model/</link>
      <pubDate>Mon, 04 May 2026 00:00:00 +0000</pubDate>
      <author>tim@timbutterfield.com (Tim Butterfield)</author>
      <guid>https://timbutterfield.com/post/why-ai-review-needs-more-than-one-model/</guid>
      
        <description>&lt;p&gt;In earlier posts, I wrote about deterministic execution, browser verification, and stop conditions for AI driven workflows. Those addressed specific failure modes around execution and validation. Over time, another category of problems started to appear during review.&lt;/p&gt;
&lt;p&gt;Using one AI to review the output of another was not a recent change in my workflow. Before cross-ai-review became a released workflow, I was already routinely having one model review implementation plans, prompts, architecture notes, and generated artifacts produced by another model before execution continued. In many cases, I would send the same instructions through multiple systems first to see where the interpretations diverged.&lt;/p&gt;
&lt;p&gt;What changed was the degree of structure around it. The workflows moved from ad hoc experimentation into something more repeatable and governed, driven by recurring observations: different models surfaced different classes of issues, fixes introduced regressions, and ambiguity propagated downstream into later artifacts. The released cross-ai-review workflow was simply a formalization of patterns that had emerged through repeated use.&lt;/p&gt;
&lt;h2 id=&#34;different-models-miss-different-things&#34;&gt;Different Models Miss Different Things&lt;/h2&gt;
&lt;p&gt;One of the first patterns that became impossible to ignore was that different models consistently caught different categories of issues. A document or implementation could pass several review cycles with one model and still produce entirely new findings when reviewed by another.&lt;/p&gt;
&lt;p&gt;In one case, Codex iterated through multiple review passes on an artifact and the output appeared stable. Gemini then identified additional downstream integration inconsistencies and documentation drift that had not been surfaced earlier. After those fixes were applied, a later verify pass caught regressions introduced during the remediation process itself.&lt;/p&gt;
&lt;p&gt;The interesting observation was not that one model was better. The interesting observation was that they behaved more like reviewers with different blind spots. That changes the engineering question. The question stops being &amp;ldquo;Which model is best?&amp;rdquo; and becomes &amp;ldquo;What kind of workflow emerges when models review each other?&amp;rdquo;&lt;/p&gt;
&lt;h2 id=&#34;execution-was-easier-to-structure-than-review&#34;&gt;Execution Was Easier To Structure Than Review&lt;/h2&gt;
&lt;p&gt;Execution became fairly deterministic. Commands could be declared in manifests, browser checks could produce structured evidence, and stop conditions could terminate invalid execution paths mechanically instead of conversationally. Review behaved differently. Review drifted.&lt;/p&gt;
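&lt;p&gt;As a rough illustration of that execution side, the sketch below shows a manifest-driven runner with a mechanical stop condition. It is a minimal Python sketch with illustrative step names such as pytest and hugo, not the actual tooling described in the earlier posts.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Minimal sketch of manifest-driven execution with a mechanical stop condition.
# Step names and structure are illustrative, not the tooling from the earlier posts.
import subprocess

MANIFEST = [
    {&#34;name&#34;: &#34;unit-tests&#34;, &#34;cmd&#34;: [&#34;pytest&#34;, &#34;-q&#34;], &#34;stop_on_failure&#34;: True},
    {&#34;name&#34;: &#34;build-site&#34;, &#34;cmd&#34;: [&#34;hugo&#34;, &#34;--minify&#34;], &#34;stop_on_failure&#34;: True},
]

def run_manifest(steps):
    evidence = []  # structured evidence instead of conversational memory
    for step in steps:
        result = subprocess.run(step[&#34;cmd&#34;], capture_output=True, text=True)
        evidence.append({&#34;step&#34;: step[&#34;name&#34;], &#34;exit_code&#34;: result.returncode})
        if result.returncode != 0 and step[&#34;stop_on_failure&#34;]:
            break  # stop condition: terminate mechanically, not conversationally
    return evidence
&lt;/code&gt;&lt;/pre&gt;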
&lt;p&gt;One pass would identify architectural inconsistencies. Another would identify documentation gaps. A third would introduce a change that accidentally created a downstream mismatch elsewhere in the system. Semantically correct fixes could still produce integration inconsistencies.&lt;/p&gt;
&lt;p&gt;That was the point where prompting alone stopped being sufficient. The problem was no longer generating reviews. The problem became structuring review so that the process itself could converge instead of endlessly cycling between incomplete corrections.&lt;/p&gt;
&lt;h2 id=&#34;clarification-and-verification-are-different-problems&#34;&gt;Clarification And Verification Are Different Problems&lt;/h2&gt;
&lt;p&gt;One of the more important discoveries was realizing that clarification and verification are not the same activity. Early on, the two activities blended together in my workflows. The model would review an artifact, identify concerns, and sometimes silently invent missing assumptions in order to continue. That became dangerous quickly.&lt;/p&gt;
&lt;p&gt;A vague requirement like &amp;ldquo;support offline mode&amp;rdquo; can expand into completely different architectural implementations depending on intent. Does offline mode mean cached reads only, queued writes, full synchronization, conflict resolution, temporary local sessions, or peer-to-peer replication? The phrase sounds simple until implementation begins.&lt;/p&gt;
&lt;p&gt;This led to separating the workflows into two different stages. The first stage focuses on clarification. The purpose is not implementation. It is ambiguity reduction. The model asks questions, presents options, explains tradeoffs, and recommends approaches while leaving final decisions to the human.&lt;/p&gt;
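&lt;p&gt;To make that concrete, here is one hypothetical shape such a clarification artifact could take for the offline mode example above. The field names are mine and purely illustrative, not a schema from the released workflow; the important part is that options, tradeoffs, and the final human decision are recorded explicitly rather than assumed.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Hypothetical clarification record for the &#34;support offline mode&#34; requirement.
# Field names are illustrative, not a schema from the released workflow.
clarification = {
    &#34;requirement&#34;: &#34;support offline mode&#34;,
    &#34;question&#34;: &#34;What should offline mode actually cover?&#34;,
    &#34;options&#34;: [
        {&#34;id&#34;: &#34;cached-reads&#34;, &#34;tradeoff&#34;: &#34;simple, but writes fail while offline&#34;},
        {&#34;id&#34;: &#34;queued-writes&#34;, &#34;tradeoff&#34;: &#34;needs retry ordering and idempotency&#34;},
        {&#34;id&#34;: &#34;full-sync&#34;, &#34;tradeoff&#34;: &#34;requires conflict resolution rules&#34;},
    ],
    &#34;model_recommendation&#34;: &#34;queued-writes&#34;,
    &#34;human_decision&#34;: None,  # stays open until a person decides
}
&lt;/code&gt;&lt;/pre&gt;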
&lt;p&gt;Only after intent becomes sufficiently clear does verification begin. Cross model verification of unresolved ambiguity only amplifies the ambiguity.&lt;/p&gt;
&lt;h2 id=&#34;verify-passes-became-necessary&#34;&gt;Verify Passes Became Necessary&lt;/h2&gt;
&lt;p&gt;Another unexpected discovery was that review fixes themselves often introduced regressions. A reviewer could correctly identify a problem while still introducing new inconsistencies elsewhere in the system during remediation. That forced the introduction of verify passes.&lt;/p&gt;
&lt;p&gt;A verify pass is not simply another review iteration. Its purpose is narrower. It validates that prior fixes did not create downstream inconsistencies, related artifacts remain aligned, terminology did not drift, governance rules are still satisfied, and execution instructions remain valid. Without verify passes, the system could repeatedly reintroduce inconsistencies while appearing locally correct.&lt;/p&gt;
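&lt;p&gt;A minimal sketch of what a verify pass could look like, assuming a hypothetical list of fixed finding records. The function and field names are placeholders of my own, not the cross-ai-review implementation; the point is the narrow scope, which is re-checking prior fixes rather than hunting for new issues.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Minimal sketch of a verify pass. It does not look for new findings; it only
# confirms that each fix from the previous pass still holds downstream.
def verify_pass(fixed_findings, still_holds):
    # still_holds(finding) is assumed to re-run the narrow check behind one fix,
    # e.g. alignment of related artifacts, terminology, governance rules.
    regressions = [f[&#34;id&#34;] for f in fixed_findings if not still_holds(f)]
    return {&#34;converged&#34;: not regressions, &#34;regressions&#34;: regressions}

# Example: two fixes from the last pass, one of which no longer holds.
fixes = [{&#34;id&#34;: &#34;R2-004&#34;}, {&#34;id&#34;: &#34;R2-007&#34;}]
verify_pass(fixes, lambda f: f[&#34;id&#34;] != &#34;R2-007&#34;)
# {&#39;converged&#39;: False, &#39;regressions&#39;: [&#39;R2-007&#39;]}
&lt;/code&gt;&lt;/pre&gt;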
&lt;p&gt;This also reinforced another lesson. Review without convergence rules eventually degrades into churn.&lt;/p&gt;
&lt;h2 id=&#34;audit-trails-matter-more-than-chat-history&#34;&gt;Audit Trails Matter More Than Chat History&lt;/h2&gt;
&lt;p&gt;Another major shift was moving away from conversational memory as the primary source of truth. Chat history is useful while actively working, but it becomes difficult to inspect later. Important reasoning disappears into long conversational chains that are hard to replay or verify independently.&lt;/p&gt;
&lt;p&gt;The workflows evolved toward durable review artifacts instead. Structured findings and disposition tracking became more important than preserving raw conversational context. In several cases, older findings had to be revisited later after regressions appeared downstream. Having structured review artifacts made it possible to trace where the inconsistency entered the system and how the reasoning changed across iterations.&lt;/p&gt;
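&lt;p&gt;As an illustration, a durable finding record might look something like the sketch below. The fields are my own illustrative choices rather than the cross-ai-review schema, but they capture what made these artifacts useful in practice: where a finding came from, what was decided about it, and which verify pass confirmed the fix held.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Hypothetical structured finding with disposition tracking.
# Field names are illustrative, not the released workflow&#39;s schema.
finding = {
    &#34;id&#34;: &#34;R2-007&#34;,
    &#34;pass&#34;: &#34;review-2&#34;,
    &#34;reviewer&#34;: &#34;model-b&#34;,            # which reviewer raised it
    &#34;category&#34;: &#34;integration-inconsistency&#34;,
    &#34;summary&#34;: &#34;plan and client retry table disagree on error codes&#34;,
    &#34;disposition&#34;: &#34;fixed&#34;,           # open | fixed | rejected | deferred
    &#34;verified_in&#34;: &#34;verify-3&#34;,        # verify pass that confirmed the fix held
}
&lt;/code&gt;&lt;/pre&gt;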
&lt;p&gt;The review history became an engineering artifact rather than a temporary conversation.&lt;/p&gt;
&lt;h2 id=&#34;the-workflow-became-more-important-than-the-prompt&#34;&gt;The Workflow Became More Important Than The Prompt&lt;/h2&gt;
&lt;p&gt;This was probably the largest overall shift. At the beginning, most of the effort focused on improving prompts. Over time, the prompts mattered less than the structure around them.&lt;/p&gt;
&lt;p&gt;The important pieces became role separation, structured findings, verify passes, convergence rules, deterministic validation, ambiguity reduction, auditability, and governance boundaries. The orchestration layer mattered more than any individual model because it defined how findings propagated, how regressions were verified, and how convergence was measured across iterations.&lt;/p&gt;
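&lt;p&gt;As a rough sketch of that orchestration layer, the loop below shows how role separation, verify passes, and a convergence rule can fit together. The function names are hypothetical placeholders for whichever models or tools fill each role; this is not the cross-ai-review implementation.&lt;/p&gt;
&lt;pre&gt;&lt;code class=&#34;language-python&#34;&gt;# Hypothetical orchestration loop. reviewer, fixer, and verifier are placeholders
# for whichever models or tools fill each role; this is not the released workflow.
MAX_ROUNDS = 5  # convergence rule: stop rather than churn indefinitely

def orchestrate(artifact, reviewer, fixer, verifier):
    history = []  # durable audit trail kept outside any single chat session
    for round_number in range(1, MAX_ROUNDS + 1):
        findings = reviewer(artifact)                # one model reviews
        if not findings:
            return {&#34;converged&#34;: True, &#34;rounds&#34;: round_number, &#34;history&#34;: history}
        artifact = fixer(artifact, findings)         # another model or a human remediates
        regressions = verifier(artifact, findings)   # narrow verify pass on those fixes
        history.append({&#34;round&#34;: round_number, &#34;findings&#34;: findings, &#34;regressions&#34;: regressions})
    return {&#34;converged&#34;: False, &#34;rounds&#34;: MAX_ROUNDS, &#34;history&#34;: history}
&lt;/code&gt;&lt;/pre&gt;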
&lt;p&gt;That does not make the models interchangeable. Different systems still have different strengths. What changed was the realization that the workflow around the models often determines the quality and stability of the final result more than any single prompt does.&lt;/p&gt;
&lt;p&gt;There is also a practical limit to how much rigor makes sense. Running multi-model verification loops for trivial edits, typo fixes, or simple mechanical changes can quickly become counterproductive. The governance overhead only becomes worthwhile once the complexity and ambiguity reach a level where review drift and regression risk become meaningful.&lt;/p&gt;
&lt;h2 id=&#34;closing-thoughts&#34;&gt;Closing Thoughts&lt;/h2&gt;
&lt;p&gt;AI-assisted development is still evolving rapidly, and I do not think these workflows are finished or solved. But one thing has become increasingly clear through experimentation.&lt;/p&gt;
&lt;p&gt;The long term problem is probably not discovering the single best model. The more important problem is designing trustworthy workflows where multiple models, deterministic tooling, structured verification, and human judgment work together inside governed engineering processes.&lt;/p&gt;
&lt;p&gt;Generation is only one layer of the system. The workflow around the generation matters just as much.&lt;/p&gt;
&lt;h2 id=&#34;addendum&#34;&gt;Addendum&lt;/h2&gt;
&lt;p&gt;The released workflow discussed in this post grew out of these review patterns and experiments:&lt;/p&gt;
&lt;p&gt;&lt;a href=&#34;https://github.com/Tim-Butterfield/cross-ai-review&#34;&gt;https://github.com/Tim-Butterfield/cross-ai-review&lt;/a&gt;&lt;/p&gt;</description>
      
    </item>
    
  </channel>
</rss>
