Stronger Models Help Most After the Problem Has Been Shaped

One of the more subtle failures in AI-assisted development is not that the model gives a bad answer. It is that the model gives a good-looking answer before the team has found the right question. That failure is especially easy to miss in architecture work, where a polished decomposition can make unresolved boundaries feel settled, a confident recommendation can make tradeoffs look evaluated, and a clean diagram can make a design look real before its responsibilities have survived allocation.

The risk is not simply that AI might be wrong. The more interesting risk is that its output can resemble judgment while quietly skipping the work that judgment requires. In earlier posts, I wrote about stop conditions, multi-model review, and the way AI makes missing judgment more expensive. This post continues that thread by looking at a related pattern: stronger models can be extremely useful, but their usefulness depends heavily on when they enter the workflow and what role they are allowed to play.

That is why I am increasingly skeptical of automatically reaching for the strongest and most expensive AI model first for complex software design work. Model capability matters, but capability applied too early can produce persuasive structure around an under-shaped problem. Cost also matters, although not in the simplistic sense of always preferring the cheaper option. If a model is substantially more expensive to use, it should be brought in where its additional reasoning capacity is likely to change the outcome, not merely where it can produce the first plausible draft. At the same time, underinvesting in reasoning for decisions that will shape a long-lived system can be far more expensive than the model cost being avoided. In design work, the better question is how to match model capability, model cost, workflow maturity, and decision consequence.

The Real Problem Was Not Model Choice

The case involved the architecture of a SaaS product with significant long-term variation. Different deployments would need different rules, terminology, configuration, workflows, and reporting behavior, while the core platform needed to remain stable. The system also had to support auditability and AI-assisted work without hardcoding those variations into the core design.

That makes the architecture problem larger than a simple decomposition exercise. The question was not merely where to put a few services or classes. The system needed stable boundaries that could survive future variation. It needed to distinguish between product concepts that would change frequently and operational mechanisms that should remain stable. It also needed to avoid turning every business concept into a central dependency.

The early architecture had concrete problems. Some parts of the system reached across boundaries in ways that would have created hidden coupling. Some responsibilities were mutually dependent. Some boundaries looked reasonable at the diagram level but broke down when tested against actual component ownership. I used a multi-model workflow to test the design against the standards of Juval Löwy's Righting Software and the IDesign method it describes. In this context, that meant evaluating the architecture against principles such as identifying volatility, assigning responsibilities carefully, respecting explicit boundaries, treating subsystems as cohesive vertical slices, distinguishing business behavior from supporting mechanisms, and validating the design through interaction analysis.
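The "mutually dependent responsibilities" defect is one of the few in this list that can be checked mechanically. The following is a minimal sketch of a dependency-cycle search, not the tooling used in this workflow; the component names are hypothetical and exist only for illustration.

```python
from typing import Dict, List, Optional, Set


def find_cycle(depends_on: Dict[str, Set[str]]) -> Optional[List[str]]:
    """Return one dependency cycle as a list of names, or None if there is none."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current path, fully explored
    nodes = set(depends_on) | {d for deps in depends_on.values() for d in deps}
    color = {n: WHITE for n in nodes}
    path: List[str] = []

    def visit(node: str) -> Optional[List[str]]:
        color[node] = GRAY
        path.append(node)
        for dep in sorted(depends_on.get(node, ())):
            if color[dep] == GRAY:
                # dep is already on the current path, so we have closed a loop.
                return path[path.index(dep):] + [dep]
            if color[dep] == WHITE:
                found = visit(dep)
                if found:
                    return found
        path.pop()
        color[node] = BLACK
        return None

    for node in sorted(nodes):
        if color[node] == WHITE:
            found = visit(node)
            if found:
                return found
    return None


# Hypothetical components, for illustration only.
coupled = {"Billing": {"Reporting"}, "Reporting": {"Billing"}, "Audit": set()}
layered = {"Billing": {"Audit"}, "Reporting": {"Audit"}, "Audit": set()}

print(find_cycle(coupled))  # ['Billing', 'Reporting', 'Billing']
print(find_cycle(layered))  # None
```

A check like this only finds declared dependencies. It says nothing about whether a boundary is in the right place, which is the harder question the rest of the workflow had to answer.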

This was exactly the kind of problem where AI can be useful and risky at the same time. A model can generate plausible architectures quickly. It can name subsystems, propose flows, identify patterns, and produce diagrams. But plausible architecture is not the same thing as durable architecture. A design can sound correct while hiding boundary violations, ownership confusion, or infrastructure choices that compensate for unclear business seams.

The important question was not which model could produce the most impressive answer. The important question was how to use different model capabilities without letting any model prematurely become the architect of record.

The First Need Was Breadth, Not Closure

The initial workflow used multiple agents to explore alternatives independently. That mattered because the early goal was breadth rather than convergence. Different models and harnesses surfaced different interpretations of the same architectural problem. Some emphasized subsystem topology. Some focused on volatility seams. Some were more sensitive to command and query ownership. Others were better at identifying cross-boundary communication risks. None of those perspectives was sufficient by itself, but together they created a wider map of the problem.

That is an important distinction in AI-assisted design. The first useful output from a model is not always an answer. Sometimes it is a disagreement worth preserving. A weak workflow would have forced the early results into an average, taken the apparent consensus, smoothed over the conflicts, and moved forward with a design that looked settled. That would have been easier, but it would also have destroyed useful information. In this case, the disagreement was not noise. It was evidence that the architecture problem still contained an unresolved seam.
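To make "preserve the disagreement" concrete, here is a small sketch of collating findings by question and position instead of taking a majority vote. The agent names, questions, and positions are invented for illustration and are not the actual results from this workflow.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Each finding is (agent, question, position). All values are hypothetical.
Finding = Tuple[str, str, str]


def collate(findings: List[Finding]) -> Dict[str, Dict[str, List[str]]]:
    """Group findings by question, then by position, keeping who said what."""
    grouped: Dict[str, Dict[str, List[str]]] = defaultdict(lambda: defaultdict(list))
    for agent, question, position in findings:
        grouped[question][position].append(agent)
    return {q: dict(positions) for q, positions in grouped.items()}


def open_seams(findings: List[Finding]) -> List[str]:
    """Return the questions on which agents took more than one position."""
    return sorted(q for q, positions in collate(findings).items() if len(positions) > 1)


findings = [
    ("agent_a", "central domain: one volatility area or two?", "one"),
    ("agent_b", "central domain: one volatility area or two?", "two"),
    ("agent_c", "central domain: one volatility area or two?", "two"),
    ("agent_a", "delivery mechanics: utility or subsystem?", "utility"),
    ("agent_b", "delivery mechanics: utility or subsystem?", "utility"),
]

print(open_seams(findings))
# ['central domain: one volatility area or two?']
```

The design choice is that the minority position stays attached to the question. A majority vote would have reported "two" and discarded the fact that the seam was contested at all.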

The early passes narrowed the candidate space, but they did not settle the design. They helped expose the shape of the uncertainty. The work eventually narrowed from a broad question (“Which architecture is best?”) to a more specific one: whether a central part of the domain represented one volatility area or two. That smaller question was far more useful because it had context, competing candidates, known defects, design standards, and prior critiques attached to it. By the time the stronger model entered, the problem had changed from a blank-page architecture request into a bounded decision problem.

A Stronger Model Needs a Job, Not a Blank Page

Fable 5 entered the workflow later, after other passes had already explored the space and after the unresolved issue had become more explicit. In this case, that timing was not a controlled experiment in deliberately withholding the strongest model from the first pass. Fable 5 became available to me later in the workflow. Even if it had been available earlier, however, its substantially higher cost would have made it difficult to justify as the default tool for broad initial exploration. That does not mean the cheaper path is always the prudent path. For architecture decisions with long-lived consequences, the apparent savings from using a less capable model can disappear quickly if the result is avoidable coupling, unclear ownership, or expensive rework. The practical question was not whether the stronger model was useful, or whether cost should be minimized. It was where the additional capability was likely to justify its cost by reducing meaningful design risk.

Fable 5 was not asked to solve the architecture from scratch. That distinction matters because a powerful model used too early can become another generator of plausible structure, while a powerful model used later can become a reviewer of concrete alternatives.

The model was given the original constraints, the design standards being applied, prior candidates, validation findings, known coupling defects, and specific questions about boundary ownership, interaction classification, and subsystem responsibility. It was asked to perform senior-architect diagnosis, adversarial review, component decomposition, and dependency analysis. That is a very different role from free-form generation.

The value of the stronger model showed up because it had something concrete to attack. It did not merely choose which previous answer it liked. It found that the apparent leading candidate could not survive component-level allocation. A proposed split between policy-like behavior and operational state looked plausible at a higher level, but became unstable once real responsibilities had to be assigned to real components.

That finding was valuable because it challenged the prior direction rather than simply reinforcing it. The earlier multi-agent exploration had favored one candidate. The later stronger-model pass found that a repaired variant of another direction held together better under the constrained analysis. This did not mean the earlier agents were wrong or useless. They had performed a different role. They created the candidate space, exposed disagreement, and helped identify where the real uncertainty lived. Without that earlier work, the stronger model would have had less to inspect and fewer constraints to obey.

This is the central distinction. The stronger model was useful because the workflow gave it evidence, constraints, candidates, and open questions. It was acting less like an oracle and more like a senior reviewer entering a design review after the team had already done serious preparation.

What Changed After the Problem Was Shaped

Once the problem had been narrowed, the workflow could ask more useful questions. It was no longer asking AI to invent a plausible architecture. It was asking whether the remaining candidates survived concrete tests. That changed the value of the output. The useful findings were not new names, broader possibilities, or a more polished diagram. The useful findings were defects, dependencies, ripple effects, and proof obligations.

One of the more practical lessons was that diagrams should not come too early. Diagrams are persuasive. Boxes and arrows create visual authority. Once an architecture is drawn cleanly, it becomes easier to believe that the design is more settled than it really is. A diagram can hide unresolved ownership problems because a box can contain almost anything. The hard question is not whether a box can be drawn. The hard question is whether the responsibilities inside that box can survive allocation.

For that reason, the next pass forced the candidates into a text-first component decomposition before producing visual diagrams. Each part of the system had to justify what it owned: orchestration, business rules, access to external resources, supporting utilities, public boundaries, use cases, command surfaces, query surfaces, event surfaces, and cross-boundary communication. That textual discipline exposed defects before a diagram could hide them behind a clean box.
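A text-first decomposition can be checked with very little machinery. The sketch below assumes a simplified model in which every responsibility should have exactly one owning component. The component names, kinds, and responsibilities are hypothetical, and the real decomposition covered more surfaces than this.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class Component:
    name: str
    kind: str  # e.g. "orchestration", "business rules", "resource access", "utility"
    owns: Set[str] = field(default_factory=set)


def allocation_defects(
    components: List[Component], responsibilities: Set[str]
) -> Dict[str, List[str]]:
    """Report responsibilities that are unowned, contested, or not in the agreed list."""
    owners: Dict[str, List[str]] = {r: [] for r in responsibilities}
    unlisted: List[str] = []
    for component in components:
        for responsibility in sorted(component.owns):
            if responsibility in owners:
                owners[responsibility].append(component.name)
            else:
                unlisted.append(f"{component.name} claims unlisted: {responsibility}")
    return {
        "unowned": sorted(r for r, o in owners.items() if not o),
        "contested": sorted(r for r, o in owners.items() if len(o) > 1),
        "unlisted": unlisted,
    }


# Hypothetical candidate, for illustration only.
responsibilities = {"approve request", "resolve effective context", "dispatch events"}
components = [
    Component("RequestManager", "orchestration", {"approve request"}),
    Component("PolicyEngine", "business rules", {"approve request"}),
    Component("ContextAccess", "resource access", {"resolve effective context"}),
]

report = allocation_defects(components, responsibilities)
print(report["unowned"])    # ['dispatch events']
print(report["contested"])  # ['approve request']
```

In this made-up candidate, event dispatch has no owner and one use case has two. Both defects would be invisible in a diagram that simply drew three boxes.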

One candidate could not allocate its seam cleanly without splitting real components. Another candidate had the right topology but incomplete ownership details. The refined candidate was stronger, but some proposed refinements still required dependency analysis. Some capabilities that looked like generic support mechanisms turned out to own enough business behavior to justify thin subsystem boundaries. Others really were supporting utilities. An explicit event-dispatch mechanism had to be named as a supporting utility instead of being left as an implicit assumption. Several supporting mechanisms still needed analysis because placing them incorrectly would have changed the ownership model.

The lesson is broadly applicable. Before asking AI to draw the architecture, make it prove the architecture can be described in concrete ownership terms. If the design cannot survive a text decomposition, a diagram will not fix it. It will only make the unresolved design look finished.

The Hard Part Was the Ripple Effects

The stronger model was most useful not when it generated new names or more options, but when it found ripple effects. At one stage, the architecture had a list of proposed refinements. It would have been tempting to treat those refinements as independent toggles: accept this one, reject that one, keep the next one open for later. That would have been a mistake.

Some architectural decisions are not independent. One move changes the legal and useful moves available later. A refinement that looks harmless by itself can recreate a boundary problem when combined with another decision. A proposed split can solve one ownership issue while introducing a new coupling path somewhere else.

The dependency analysis found that some refinements were mandatory, some were conditional, and some had to move together. Some, as originally stated, created conflicts. Within the constrained set of alternatives under review, the later pass produced one preferred variant and one legal but weaker fallback. The outcome was a smaller, clearer decision space rather than a larger menu of possibilities.
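The idea that refinements are not independent toggles can be illustrated with a small constraint filter. This is a sketch under stated assumptions: the refinement labels R1 to R5 and every constraint below are hypothetical, and brute-force enumeration is only reasonable for a short list like this one.

```python
from itertools import combinations
from typing import FrozenSet, List, Set, Tuple


def legal_combinations(
    refinements: List[str],
    mandatory: Set[str],
    requires: List[Tuple[str, str]],   # (a, b): a is only legal if b is also accepted
    together: List[Set[str]],          # groups accepted or rejected as a unit
    conflicts: List[Tuple[str, str]],  # pairs that cannot both be accepted
) -> List[FrozenSet[str]]:
    """Enumerate every subset of refinements that satisfies all constraints."""
    legal: List[FrozenSet[str]] = []
    for size in range(len(refinements) + 1):
        for combo in combinations(refinements, size):
            chosen = frozenset(combo)
            if not mandatory <= chosen:
                continue
            if any(a in chosen and b not in chosen for a, b in requires):
                continue
            if any(0 < len(group & chosen) < len(group) for group in together):
                continue
            if any(a in chosen and b in chosen for a, b in conflicts):
                continue
            legal.append(chosen)
    return legal


# Hypothetical refinements and constraints, for illustration only.
result = legal_combinations(
    refinements=["R1", "R2", "R3", "R4", "R5"],
    mandatory={"R1"},
    requires=[("R3", "R2")],
    together=[{"R2", "R4"}],
    conflicts=[("R3", "R5")],
)

print(len(result))  # 5 of the 32 possible subsets survive
```

Four simple constraints remove 27 of 32 combinations here. The point is the shape of the result: constraints shrink the decision space, and choosing among the survivors is still a judgment call.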

That is a high-value use of a stronger model in a governed workflow. The useful output was not more imagination, but the removal of bad possibilities. The model helped identify which combinations were internally inconsistent, which ones violated the design rules being applied, and which ones looked valid in isolation but created future coupling or ownership debt. That kind of analysis is difficult to get from a single early prompt because the prompt usually has not yet exposed enough structure for the model to reason against.

Do Not Confuse Architecture Seams With Infrastructure

Another practical issue surfaced repeatedly: AI architecture proposals can jump too quickly from business boundaries to infrastructure. That is not surprising. Many architecture discussions are full of brokers, workflow engines, CQRS frameworks, event buses, distributed services, and other infrastructure-heavy patterns. Those tools can be useful. They can also distract from the actual architectural question.

The principle that emerged was to avoid confusing architectural truth with infrastructure heaviness. The architecture should know whether something is a command, query, or event. It should know which part of the system owns a business decision. It should know whether message delivery is a supporting mechanism or a subsystem-level responsibility. It should know whether a component owns a use case or merely supports one. Those are architectural decisions, and they should not be postponed just because the first implementation will be simple.

At the same time, the first implementation does not need to start with a full external broker, distributed services, a workflow engine, or a heavy CQRS framework. The architecture can model the event boundary explicitly while the first implementation remains simple behind that boundary. That distinction is easy to lose when AI is allowed to produce architecture freely. A model may solve a boundary problem by adding infrastructure, which can look sophisticated while avoiding the harder question of ownership. A governed workflow should force the distinction: first decide the seam, then decide how much infrastructure the current implementation actually needs.
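As a rough sketch of "explicit boundary, simple implementation", the event seam can be expressed as an interface with an in-process implementation behind it. The names `Event`, `EventDispatcher`, and `InProcessDispatcher` are assumptions made for this example and do not come from the system described here.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, DefaultDict, List, Protocol


@dataclass(frozen=True)
class Event:
    name: str
    payload: dict


Handler = Callable[[Event], None]


class EventDispatcher(Protocol):
    """The architectural seam: publishers and subscribers depend only on this."""

    def subscribe(self, event_name: str, handler: Handler) -> None: ...

    def publish(self, event: Event) -> None: ...


class InProcessDispatcher:
    """First implementation: synchronous, in-memory delivery with no broker."""

    def __init__(self) -> None:
        self._handlers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, event_name: str, handler: Handler) -> None:
        self._handlers[event_name].append(handler)

    def publish(self, event: Event) -> None:
        # Copy the handler list so a handler may subscribe during delivery.
        for handler in list(self._handlers.get(event.name, [])):
            handler(event)


received: List[Event] = []
dispatcher: EventDispatcher = InProcessDispatcher()
dispatcher.subscribe("ReportRequested", received.append)
dispatcher.publish(Event("ReportRequested", {"report_id": 1}))
dispatcher.publish(Event("Unrelated", {}))

print([e.name for e in received])  # ['ReportRequested']
```

If an external broker is ever justified, it would be a second implementation of the same interface. The ownership question, which subsystem decides that an event should be published, is unaffected by that swap.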

Utility or Subsystem Is Not a Cosmetic Choice

A surprisingly useful test was whether a thing owned a use case or merely supported one. That question helped classify several ambiguous parts of the architecture.

In one case, an AI-related capability owned enough business policy to be treated as a thin business subsystem rather than a generic utility. Its responsibilities were not just mechanical invocation. It had to govern lifecycle, usage, provenance, and constraints around how AI-assisted work could be used. The low-level mechanics of provider calls still belonged elsewhere, but the business policy could not be treated as a generic helper.

In another case, a context-resolution capability was more than data access, but it should not become a behavior-heavy subsystem either. It needed to answer governed business questions about effective context. It should not initiate workflows or become a catch-all configuration owner.

A third case went the other direction. Delivery mechanics were better treated as a utility because the business decisions belonged to the subsystems that triggered them. The supporting mechanism did not own the reason something should happen. It only handled the mechanics after the owning part of the system made the decision.

These classifications matter because the wrong label becomes long-term semantic debt. If a business-governance concept is demoted to a utility, ownership becomes vague. If a supporting mechanism is promoted to a subsystem, it can start attracting behavior it should not own. If a context authority becomes a catch-all configuration owner, it can quietly become a dependency magnet. AI can help with these classifications, but only if the workflow asks the right question. “Where should this thing go?” is too vague. “Does this own a business use case, or does it support use cases owned elsewhere?” is a better architectural question.

Review Has to Mature With the Artifact

The workflow also reinforced that “AI review” is not one activity. Early review checked whether exploration was broad and fair. Later review checked whether reasoning was grounded in the actual constraints. Then it checked whether component allocation was coherent. For the diagram stage, the review target changed again: whether the rendered diagrams actually reflected the intended architecture and visual grammar rather than merely looking polished.

A review process that asks the same question at every stage will eventually miss the point. The review criteria for exploration are not the same as the review criteria for synthesis. The review criteria for textual architecture are not the same as the review criteria for rendered diagrams. The review criteria for diagrams are not the same as the review criteria for implementation.

This is one reason simple “ask another model to review it” workflows can disappoint. They treat review as a generic afterthought. The reviewing model may find something useful, but the process is not grounded enough to know what kind of review is needed. A more mature workflow changes the review question as the artifact changes. At the exploration stage, disagreement may be the desired output. At the synthesis stage, contradiction may be the target. At the decomposition stage, ownership and communication surfaces become central. At the diagram stage, the rendered artifact has to be checked, not just the source text that generated it. The review has to match the artifact’s maturity.
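One lightweight way to keep the review question tied to the artifact is to make the stage an explicit input. The sketch below paraphrases the stage-specific questions from this section; the `Stage` enum, the `review_prompt` helper, and the artifact file name are hypothetical.

```python
from enum import Enum


class Stage(Enum):
    EXPLORATION = "exploration"
    SYNTHESIS = "synthesis"
    DECOMPOSITION = "decomposition"
    DIAGRAM = "diagram"


# One review focus per stage, paraphrased from the text above.
REVIEW_FOCUS = {
    Stage.EXPLORATION: "Was the exploration broad and fair, and is disagreement preserved?",
    Stage.SYNTHESIS: "Is the reasoning grounded in the actual constraints and free of contradiction?",
    Stage.DECOMPOSITION: "Are ownership and communication surfaces coherently allocated?",
    Stage.DIAGRAM: "Does the rendered diagram reflect the intended architecture and visual grammar?",
}


def review_prompt(stage: Stage, artifact_name: str) -> str:
    """Build a review instruction whose question depends on the artifact's stage."""
    return f"[{stage.value}] Review {artifact_name}: {REVIEW_FOCUS[stage]}"


print(review_prompt(Stage.DECOMPOSITION, "candidate-b.md"))
# [decomposition] Review candidate-b.md: Are ownership and communication surfaces coherently allocated?
```

Because the lookup fails for a stage with no defined focus, a new stage cannot silently inherit a generic "please review this" prompt.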

Tools, Models, and Roles Are Different Things

This workflow also made it important to separate models from the harnesses that run them. Codex and agy were not models. They were CLIs, adapters, or harnesses. Codex CLI used GPT-5.5 High for independent review and analysis. agy CLI used Gemini 3.1 Pro High for another independent perspective. Claude Code was the harness I used with Opus 4.8 as the normal Claude baseline for many architecture batches, and later with Fable 5 for bounded synthesis, pressure testing, component decomposition, and dependency analysis.

The model supplies reasoning behavior. The CLI or harness supplies execution shape, file access, prompts, repeatability, capture, and review protocol. That distinction is not pedantic. The same model in a loose chat does not play the same role as a model running in a repeatable harness with captured artifacts, explicit constraints, and defined stop conditions. The model may be the same, but the workflow is not.
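The separation between harness, model, and role can be written down as plain data. The pairings below are the ones described above; the `Assignment` type and the helper function are only an illustrative way to record them, not something the workflow actually used.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class Assignment:
    harness: str             # supplies execution shape, file access, capture
    model: str               # supplies reasoning behavior
    roles: Tuple[str, ...]   # the jobs this pairing was given


ASSIGNMENTS = (
    Assignment("Codex CLI", "GPT-5.5 High", ("independent review", "analysis")),
    Assignment("agy CLI", "Gemini 3.1 Pro High", ("independent perspective",)),
    Assignment("Claude Code", "Opus 4.8", ("baseline architecture batches",)),
    Assignment(
        "Claude Code",
        "Fable 5",
        ("bounded synthesis", "pressure testing",
         "component decomposition", "dependency analysis"),
    ),
)


def models_by_harness(assignments: Iterable[Assignment]) -> Dict[str, List[str]]:
    """Show that one harness can host several models, each with its own roles."""
    result: Dict[str, List[str]] = {}
    for assignment in assignments:
        result.setdefault(assignment.harness, []).append(assignment.model)
    return result


print(models_by_harness(ASSIGNMENTS)["Claude Code"])  # ['Opus 4.8', 'Fable 5']
```

Keeping the three fields separate makes it harder to say "the model reviewed it" without also saying through which harness and in which role.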

This is why model comparison alone is a weak way to think about AI-assisted development. A model’s capability matters, but the role assigned to it matters too. Exploration, critique, synthesis, dependency analysis, diagram validation, and implementation support are different jobs. A strong model can do several of them, but it should not be allowed to blur them together without process boundaries.

The Model Still Did Not Decide

The most important governance rule remained in place even after the strongest pass produced a stronger recommendation within the bounded analysis. The model did not choose the architecture. No durable architecture files were changed automatically. No diagrams were treated as approved before review. Implementation remained unauthorized. The model produced decision support, not a decision.

This distinction matters because a stronger model can make a stronger argument. That makes the output more useful, but also more dangerous if the process treats persuasiveness as authority. The output reduced ambiguity and exposed consequences. It did not own the decision.

That boundary becomes more important as models improve. A more capable model can produce a more coherent story, explain tradeoffs better, and make its recommendation sound more complete. That is useful, but it is also dangerous because persuasive output can create the feeling that judgment has already happened. In this workflow, the stronger model’s job was to reduce ambiguity, expose consequences, identify proof obligations, and challenge shallow consensus. It helped make the decision better formed, but it did not remove the need for validation, rendered diagrams, cross-review, and human ownership of the final decision.

A More Mature Pattern

The larger pattern was not that every design problem needs this many passes, this many models, or this much ceremony. Most do not. The useful lesson is that model power should be assigned a role.

In this case, the workflow moved through independent breadth, preserved disagreement, collation, bounded senior-model synthesis, text-first decomposition, dependency analysis, rendered diagram validation, and human decision. That sequence was appropriate because the architecture problem had real variation, coupling, and long-term ownership consequences. For a smaller problem, the workflow could be lighter. But the roles still matter.

Sometimes the right use of AI is exploration. Sometimes it is criticism. Sometimes it is synthesis. Sometimes it is checking whether a proposed refinement creates a downstream contradiction. Sometimes it is validating that a diagram or implementation actually matches a previously approved design. Treating all model calls as equivalent “answers” misses most of the value.

The economic version of the same pattern is not “use cheaper models first” and not “use the strongest model everywhere.” Both can be poor defaults. The right balance depends on the consequence of the decision. A low-consequence exploratory pass may not justify the most expensive reasoning available. A high-consequence architectural decision may justify it easily if the workflow has narrowed the problem enough for that reasoning to matter.

The question is not only, “Which model is strongest?” The better question is, “What uncertainty exists at this point in the workflow, what would it cost to apply more model capability here, what long-term cost might remain if that uncertainty is not resolved, and what role should a model be allowed to play in reducing it?” That question leads to better AI use than simply reaching for the most powerful model first or trying to minimize model cost in isolation.

Closing Thoughts

This experience did not convince me that the strongest model should never be used first. It made me more skeptical that “use the strongest model first” is a good default for complex engineering judgment. For this kind of work, the stronger model may be most valuable after the unresolved parts of the problem have already been made explicit. Earlier passes create breadth, preserve disagreement, expose uncertainty, and generate candidates. Validation turns vague disagreement into sharper questions. Only then does a stronger model have a bounded, evidence-rich decision problem to work against.

That does not make model capability irrelevant. Model power matters, but the timing, role, cost, and consequence of that power matter as well. A stronger model can make a better argument, which is useful, but it also makes process discipline more important. For long-lived systems, the cost of reasoning should be considered alongside the cost of being wrong. Saving money on early model calls is not useful if it produces an architecture that carries years of avoidable complexity. Spending the most expensive reasoning on every stage is not useful either if the problem is still too vague for that reasoning to be well directed.

The goal is to create a workflow where each model is used in the role where it can reduce meaningful uncertainty without taking ownership of the decision. In that sense, the stronger model helped because it came after the workflow had found the real question and because the remaining problem was important enough to justify using it.