A month ago I finished a piece I was proud of, and did the responsible thing before publishing it: I sent it for review. The reviewer was a capable model, from the same family as the one that had helped me write it. It read the whole thing and told me it was ready. Clean, accurate, ship it.

Out of habit, I passed the identical draft to a second model from a different family — one whose entire job, in the way I work, is to attack what I have made. It came back with seven errors. Not taste, not tone: seven specific, checkable mistakes. A claim that overstated what the technology could do. A number that was simply wrong. A confident sentence that dissolved the moment you traced it back to its source. Every one of them real. Every one of them something I had been about to publish under my own name.

Both reviews were competent. Only one was useful. And the difference had nothing to do with which model was smarter. It had to do with how I had set the work up.

[Figure: two panels. "The same family, twice": you and a twin, missed; same lineage, the same gaps. "Two different families": you and a stranger, caught; different lineages, they cover for each other.]
Fig — a reviewer raised elsewhere is blind in different places than you. That is the whole value of it.

Not one assistant — a team

Here is what I have come to believe, after a year of doing most of my work this way: serious work with AI is no longer about choosing the best model. It is about designing the work — deciding what gets done, who does it, who checks it, and who, in the end, decides.

I do not use AI as a single brilliant assistant. I run it as a small team with assigned jobs. There is a driver, an adversary, and a reader. Each is a different model family. Each is good at something the others are not — not because of any permanent law of model nature, but because that is the role I have given it, and the role I trust it in today. Next quarter the assignments may change; treating them as assignments rather than facts is the whole point.

It sounds elaborate. It is actually the opposite. It is simply what you do the moment you stop expecting one tool to be excellent at everything, which no tool is.

If you have ever watched a newsroom work, you already understand the shape of it. A story is not one person. There is a writer, who gets too close to their own copy to see its faults. There is a fact-checker, whose whole temperament is suspicion — paid to assume the writer is wrong until shown otherwise. And there is an editor, who has read the whole archive and can see how this piece fits, or doesn’t. Nobody confuses these jobs. You would never ask the writer to fact-check themselves; you know exactly what you’d get.

That is the team I run. The only unusual part is that the three colleagues are three different AI model families, and I am the one holding the desk.

The three jobs

Let me be concrete about the roles, because the framework is the credibility. The vagueness — “just use a different model” — is what makes the whole idea sound like a slogan.

The driver — Claude. I run my sessions in Claude Code, and this is where I sit. It plans, holds the thread across a long task, does the bulk of the writing and the thinking, and — most importantly — decides what gets handed to whom. What I ask it to do: the work itself, and the judgment about how to break it up. What I ask it to catch: almost nothing — it is too close to the work to review it, for the same reason a writer cannot proofread their own copy. They read what they meant, not what they wrote. Where I don’t trust it alone: anywhere the cost of being confidently wrong is high. Which is precisely why it does not get the last word.

The adversary — Codex. A second family, OpenAI’s, reached from the driver’s seat as a sub-task. Its job is to attack. What I ask it to do: the scoped, self-contained tasks where I want a fresh, skeptical pass, and the research — the going-and-checking of claims against the world. What I ask it to catch: the confident error. The number that’s wrong, the citation that doesn’t actually say what I claimed, the reasoning that only looks sound because I wanted it to. This is the family that found my seven mistakes, and the one that, more than once, has stopped me publishing something a friendlier reviewer waved straight through. Where I don’t trust it alone: judgment and taste — it is a fine skeptic and a poor author.

The reader — Gemini, reached through Antigravity. A third family, built to hold enormous amounts of context at once. What I ask it to do: read the whole haystack — the entire codebase, the long document, the sprawl no short-context model can keep in its head — and tell me what is genuinely in there. What I ask it to catch: the inconsistency across the whole, the thing in chapter two that quietly contradicts chapter nine, the pattern you only see from altitude. Where I don’t trust it alone: the fine-grained call — it sees the forest, sometimes at the cost of the tree.

Three families, three jobs, two modes each: a thing they do, and a thing they catch. The catching is the half most people skip, and it is the half that matters.
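For readers who think in code, the three role descriptions above can be written down as plain data. This is only a sketch: the `Role` type, the `TEAM` table and the `reviewers_for` helper are names invented here for illustration, and the assignments are today's, not permanent facts.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Role:
    """One job on the desk (hypothetical structure, for illustration only)."""
    name: str             # the job, e.g. "adversary"
    family: str           # the model family currently assigned to it
    does: str             # what it is asked to do
    catches: str          # what it is asked to catch
    not_trusted_for: str  # where it is not trusted alone


# Assignments as described in the text; expect them to change over time.
TEAM = {
    "driver": Role("driver", "Claude",
                   "plans, writes, decides what gets handed to whom",
                   "almost nothing (too close to the work)",
                   "anything where a confident error is costly"),
    "adversary": Role("adversary", "Codex",
                      "scoped skeptical passes and research",
                      "the confident error",
                      "judgment and taste"),
    "reader": Role("reader", "Gemini",
                   "reads the whole haystack",
                   "inconsistency across the whole",
                   "the fine-grained call"),
}


def reviewers_for(author_role: str) -> list[str]:
    """Return the roles whose family differs from the author's family."""
    author = TEAM[author_role]
    return [r.name for r in TEAM.values() if r.family != author.family]
```

Calling `reviewers_for("driver")` returns the adversary and the reader, which is the point of the table: whoever wrote the work never appears in its own list of reviewers.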

Mechanically this is simpler than it sounds. The driver is where I work; the adversary and the reader are reached from that one seat, dispatched as sub-tasks with a clean brief, their answers brought back for me to weigh. There is a fourth hand for a narrower job: when a piece genuinely needs an image, that goes to a different specialist again, Google’s image model, Nano Banana, reached through Google Antigravity. I am not flipping between windows and pasting between them; I am running one desk that can call on its specialists without leaving it.

And they are not three strangers shouting across a gap. They work over the same local files. And, separately and more to the point, each is handed the same persistent memory as it starts: a second brain that holds the accumulated context of the work, including my decisions, the patterns I have settled on, the domain, the brand. The workspace is local to the task; the second brain is global, and it accumulates across every project. That persistent memory is the part that compounds. The models are interchangeable, and more so every month; the second brain they all read from is not. It is the one piece of this that is mine. The plumbing changes every few months, and the plumbing is not the point. The shape is:

[Figure: two layers. You set the task and make the final call. This task (shared workspace + skills, local): Claude Code, the driver, plans, routes and synthesises as the hub; Codex (GPT-5.5), the adversary, handles review and web research; Gemini via Antigravity, the reader, holds whole-codebase context; Nano Banana (Google, via Google Antigravity) makes images when needed. Right model for the right job: reliable intelligence, often at lower cost than asking one model to do everything. The second brain (global, persistent): your Obsidian wiki, accumulated across every project and read by every model, with notes such as [[decisions]], [[brand]], [[domain]], [[framework]], [[patterns]], [[entities]], [[preferences]], [[skills]], [[knowledge]], [[evolution]].]
Fig — two layers. A global, persistent second brain — my Obsidian graph of linked notes — feeds context into every model. Below it, a separate local workspace where the driver routes each job to the model built for it. You make the final call.
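One way to picture "a clean brief" that carries the shared memory with it is a small function that puts the relevant notes in front of the task before it is dispatched. This is a sketch under loud assumptions: `build_brief` is a hypothetical helper, and the second brain is stood in for by a plain dictionary of note titles to note text rather than a real Obsidian vault.

```python
def build_brief(task: str, role: str, memory: dict[str, str],
                topics: list[str]) -> str:
    """Assemble a self-contained brief: role, shared context, then the task.

    `memory` is a stand-in for the persistent notes (title -> text).
    Topics with no matching note are skipped rather than invented.
    """
    context = [f"[[{t}]]\n{memory[t]}" for t in topics if t in memory]
    parts = [f"Role: {role}", "Shared context:", *context, f"Task: {task}"]
    return "\n\n".join(parts)


# Hypothetical notes, purely for illustration.
notes = {
    "decisions": "Every factual claim is traced to its source.",
    "brand": "Plain, direct, first person.",
}

brief = build_brief("Find what is wrong with this draft.",
                    "adversary", notes, ["decisions", "brand"])
```

The design choice worth noticing is that the same `notes` go to every role; only the role line and the task change from one dispatch to the next.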

The agents do not decide. I do.

Which brings me to the part that actually matters, and the part the breathless “AI agents will run your company” version always skips.

What I have built is not a machine that thinks for me. It is a machine that makes it harder for me to think badly — that surfaces the disagreement I would otherwise never have seen, and forces me to resolve it. When the adversary flags ten problems and three are imagined, it has not done my work for me. I still have to go and decide which seven are real, and that is genuine labour. But it is labour done with my eyes open, on a shortlist I would never have generated alone.

The decision logic is mundane, and it is mine. Some work routes straight to a specialist. Some gets escalated — when two families disagree, I read both and adjudicate. Some gets rejected outright. And the consequential calls — what ships, what I put my name to — come back to me, always, by design. The agents sharpen the judgment. They never hold it.
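The decision logic above is small enough to sketch. The version below is one possible reading, not a record of an actual system: the `Action` names, the `next_action` function and the order in which the checks run are all assumptions made for illustration.

```python
from enum import Enum


class Action(Enum):
    ROUTE = "route to a specialist"
    ESCALATE = "human reads both sides and adjudicates"
    REJECT = "reject outright"
    HUMAN_DECIDES = "human makes the final call"


def next_action(consequential: bool, verdicts: dict[str, bool]) -> Action:
    """Pick the next step for a piece of work.

    `verdicts` maps a model family to True (approved) or False (objected).
    An empty dict means nothing has been reviewed yet.
    """
    # What ships, or carries a name, always comes back to the human.
    if consequential:
        return Action.HUMAN_DECIDES
    opinions = set(verdicts.values())
    # Two families disagree: escalate for adjudication.
    if opinions == {True, False}:
        return Action.ESCALATE
    # Every reviewer objected: assumed here to mean rejection.
    if opinions == {False}:
        return Action.REJECT
    # No review yet, or no objections: hand on to the right specialist.
    return Action.ROUTE
```

Note that no branch lets a model's approval stand in for the human's: in this sketch the agents can trigger an escalation or a rejection, but a consequential call is never settled by `verdicts` alone.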

What this actually is

I have described this as a personal habit, but it stops being personal the moment you notice what it actually is. Strip away the detail that the three colleagues are AI models, and what remains is something any operating-model designer would recognise on sight: separation of duties, independent review, an audit trail of who checked what, a clear escalation path when reviewers disagree, and a named human who carries the final accountability.
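Those governance terms translate almost directly into a record you could keep per piece of work. The sketch below is an illustration only: `AuditRecord` and its fields are hypothetical names, and the single rule it enforces (a reviewer may not share the author's family) is the simplest possible form of separation of duties.

```python
from dataclasses import dataclass, field


@dataclass
class AuditRecord:
    """Who wrote it, who checked it, and who answers for it."""
    artifact: str
    author_family: str
    accountable_human: str
    reviews: list[tuple[str, str]] = field(default_factory=list)

    def add_review(self, family: str, finding: str) -> None:
        """Log a review, refusing one from the author's own family."""
        if family == self.author_family:
            raise ValueError(
                "separation of duties: reviewer shares the author's family")
        self.reviews.append((family, finding))

    def independently_reviewed(self) -> bool:
        """True once at least one different family has reviewed the work."""
        return len(self.reviews) > 0


record = AuditRecord("draft.md", "Claude", "the author")
record.add_review("Codex", "the figure in paragraph three does not match its source")
```

A same-family review is rejected outright here rather than logged with a warning; a gentler design might record it but not count it toward `independently_reviewed()`.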

That is not a workflow tip. That is governance — the ordinary, hard-won governance every serious organisation already applies to money, to safety, to anything where a confident mistake is expensive. We spent a century learning not to let the person who writes the cheque also sign it off. The genuinely strange thing about 2026 is how willing we are to let a single AI model write the analysis, check the analysis, and brief the board on the analysis — three jobs we would never let one person hold.

The monoculture you didn’t decide to buy

Which is the real reason this matters past my desk.

When an organisation standardises on a single model — one provider, one contract, one integration to secure — it almost never frames it as a strategy. It frames it as simplification, and all of that genuinely is simpler. But a single model family is a single way of seeing, and applying it to everything means every draft, every summary, every triaged decision passes through one set of blind spots, with nothing of a different lineage positioned to catch what that set cannot see. The organisation believes it is buying a capability. It is also, quietly, inheriting a monoculture — and monocultures fail the way my first reviewer failed me: not loudly, but everywhere at once, in the places the system was built unable to look.

The deeper cost is the one that never shows up as a risk at all. A single model is a generalist asked to do every job: to draft, to check and to read at scale. A generalist held to that breadth is rarely the best at any of it. An organisation running one model where it could run a designed team is not only more fragile. It is underusing capability it already pays for, every day, by giving one tool the work that wanted three.

Put the other way around, it stops sounding like extra effort and starts sounding like the plain economy it is: the right model on the right job is reliable intelligence, often at a lower cost than asking one model to do everything. You frequently get better answers and a smaller bill at once, because you are no longer paying a frontier generalist for work a specialist tends to do both better and more cheaply.

[Figure: two panels. "One model, every job": one generalist that writes, checks, reads and makes; passable at all, best at none; one way of seeing, one blind spot; a frontier price for every task. "A designed team": Claude writes, Codex checks, Gemini reads, Nano makes; expert at each, and they cross-check; the right rate for each job.]
Fig — same work, one way of seeing or several. The team is usually the more reliable answer, and often the cheaper one.

You can see the market arriving at the same conclusion. Sakana’s Fugu, reported in late June, points the same way: learned orchestration of several frontier models behind a single interface, with performance and independence from any single vendor among the goals it sells. I would not build a strategy on a launch that new, and I would treat its claims as claims. But the direction is the signal: the friction that has kept multi-model a discipline for the few is exactly the friction someone has now decided is worth selling you out of. When a capability starts moving into the infrastructure, it is usually about to become ordinary.

The cheapest reliability there is

For most of the short history of this technology, the instinct has been that the way to trust the output more is to wait for a better model. Sometimes that is right. But the most reliable improvement I have found is not a smarter assistant. It is a designed one — a second pair of eyes that does not share the first one’s blind spots, a third that can hold the whole when I can only carry a part, and a desk in the middle that keeps the final call human.

It begins with something small enough to do this afternoon. Take the thing you are most sure of — the draft your assistant told you was ready — and hand it to a colleague who was trained somewhere else entirely, with one instruction: find what is wrong. Then read the seven things you were about to miss. The rest of it — the roles, the routing, the governance — is just that one habit, taken seriously enough to build around.