In more organisations than I can politely count, the AI capability lives in a document.

A shared file of prompts. Lovingly maintained, carefully worded, guarded by the one person who knows why the third paragraph says “meticulously” instead of “carefully”. Teams trade prompts the way an earlier generation traded Excel macros. Somewhere in the building, a slide says the organisation is doing AI, and what it means is that the document exists.

I understand how it happens. Prompting is the part of this technology that feels like control. You change a word, the output changes, and the feedback arrives in seconds. It rewards craft. It costs nothing but attention. For a while, it genuinely is the highest-leverage thing a team can do.

But the document is carrying more weight than it can hold. A prompt is configuration written in natural language. It has no schema on either side of it, no test that runs when something changes, and — unless someone deliberately pinned a dated model snapshot, which the shared document never did — no guarantee that it means the same thing tomorrow. It is an interface contract with none of the properties of one.

The program was never in the prompt. It only looked that way while nothing moved.

The update arrives like weather

The studies underneath this are easy to summarise, and I know them the way the field mostly does — by their reported findings rather than their appendices.

One line of work looked at what happens when you change a prompt’s formatting and nothing else. Spacing, separators, casing; the meaning untouched. The reported swings run to tens of accuracy points on the same task with the same model. Another widely-cited result found that simply reordering the same few-shot examples could move a model from near state-of-the-art to close to chance. Not different examples. The same ones, in a different order.
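The ordering result is cheap to check against your own prompts. What follows is a minimal sketch, not a reproduction of any study: it assumes a hypothetical `model(prompt, question)` callable, and `recency_biased_model` is a toy stand-in invented for illustration, not a claim about how any real model behaves.

```python
from itertools import permutations
from typing import Callable, Sequence

Example = tuple[str, str]  # (input text, expected label)


def build_prompt(examples: Sequence[Example]) -> str:
    """Render the few-shot examples in the order given."""
    return "\n".join(f"Q: {text}\nA: {label}" for text, label in examples)


def order_spread(
    examples: Sequence[Example],
    questions: Sequence[Example],
    model: Callable[[str, str], str],
) -> tuple[float, float]:
    """Return (worst, best) accuracy across every ordering of the same examples."""
    scores = []
    for ordering in permutations(examples):
        prompt = build_prompt(ordering)
        correct = sum(model(prompt, text) == label for text, label in questions)
        scores.append(correct / len(questions))
    return min(scores), max(scores)


def recency_biased_model(prompt: str, question: str) -> str:
    """Toy stand-in: always repeats the label of the last example it saw."""
    return prompt.splitlines()[-1].removeprefix("A: ")


examples = [("good", "pos"), ("bad", "neg")]
questions = [("great", "pos"), ("lovely", "pos")]
worst, best = order_spread(examples, questions, recency_biased_model)
print(worst, best)
```

The number of orderings grows factorially, so with more than a handful of examples you would sample orderings rather than enumerate them.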

Then there is the version problem. The comparison this conversation always reaches for is the reported drop in GPT-4’s accuracy on a prime-number task between its March and June 2023 versions — 97.6 per cent falling to 2.4 in the first version of the study, softened on revision to 84 against 51, with part of the change traced to whether the model still followed its chain-of-thought instruction. Same prompt, same task, different behaviour underneath. The revision was the honest kind — less cliff, more drift — and it makes the operating point sharper, not softer: the words stayed still and the system they were written against moved.

[Figure: a single text block labelled “The prompt”. Inside it: the role and the tone; fifteen worked examples; the edge cases someone remembered; the safety rules; the output format, described in prose; the business logic, held together by phrasing. Around it: the model update arrives like weather. No version, no schema, no test.]
Fig 1 — Everything on one text block; nothing guaranteed.

None of this means prompts are useless, and the newer generations of models are reported to be less twitchy than the systems those studies measured. The direction is what matters. A prompt is sensitive to things no specification would ever treat as meaningful (whitespace, ordering, an adjective’s neighbourhood), and it sits on top of a foundation its owner does not control and, outside deliberately pinned API snapshots, has not even pinned. When the foundation shifts, nothing errors. The system keeps producing fluent output. It just quietly means something slightly different from what it meant on Friday.

Software engineering spent fifty years learning what to do about dependencies that behave this way. None of the answers were “choose your words more carefully.”

Why the document felt like the work

Before the industry’s answer, it is worth being fair about why the prompt document exists everywhere.

Prompting has the fastest feedback loop this technology offers. It needs no environment, no deployment, no budget line. A business analyst can do it, which — in organisations where engineering capacity is the scarcest resource — is most of the appeal. And in the early period it really was where the leverage lived: the models were raw, the tooling did not exist, and a well-built prompt was the difference between a demo and a mess.

So the craft grew a culture. Prompt libraries. Prompt marketplaces. Job titles. Training courses that teach the magic words. An entire genre of advice built on the premise that the sentence is the system.

The premise wore out faster than the culture did. That gap — culture still investing in phrasing, platform quietly moving the logic elsewhere — is where a lot of organisations are standing right now, holding the document.

Where the logic went

Watch what the serious end of the industry actually shipped over the last two years, and a pattern is hard to miss. Almost none of it makes prompts better. Almost all of it makes prompts matter less.

Structured outputs clamp the shape of the model’s answer to a schema — the format stops being a paragraph of pleading (“respond ONLY in valid JSON”) and becomes a constraint the platform enforces. Tool calling moves the computing out of the model: the model decides that a calculation is needed — and can still reach for the wrong tool — but code does the arithmetic, the lookup, the date maths. Evaluation suites turn “the prompt seems to work” into a regression test that runs when anything changes — the model version most of all. And routing returns to what it always was: deterministic code deciding what runs when, with the model doing bounded reasoning steps inside a structure it does not control.
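A small sketch of three of those moves in one place: a routing table, a schema check on the model's answer, and a tool that does the date maths. Every name here (`INVOICE_SCHEMA`, `ROUTES`, `due_date`) is hypothetical. The check validates after the fact, which is weaker than a platform constraining the output as it is generated, but it fails in the same loud way.

```python
import json
from datetime import date, timedelta

# Hypothetical schema for illustration: field name -> exact Python type.
INVOICE_SCHEMA = {"customer": str, "days_until_due": int}

# Deterministic routing table: code, not the model, decides what runs.
ROUTES = {"invoice": "extract_invoice", "refund": "escalate_to_human"}


def route(request_kind: str) -> str:
    """Pick the handler by lookup; an unknown kind fails loudly."""
    if request_kind not in ROUTES:
        raise KeyError(f"no route for {request_kind!r}")
    return ROUTES[request_kind]


def validate(raw: str, schema: dict[str, type]) -> dict:
    """Parse model output and raise ValueError unless it matches the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a ValueError subclass
    if not isinstance(data, dict) or set(data) != set(schema):
        raise ValueError(f"unexpected shape: {data!r}")
    for field, expected in schema.items():
        if type(data[field]) is not expected:
            raise ValueError(f"{field!r} should be {expected.__name__}")
    return data


def add_days(start: date, days: int) -> date:
    """The tool: code does the date maths, not the model."""
    return start + timedelta(days=days)


def due_date(raw_model_output: str, today: date) -> date:
    """The model supplies fields; validation and arithmetic happen in code."""
    fields = validate(raw_model_output, INVOICE_SCHEMA)
    return add_days(today, fields["days_until_due"])


print(route("invoice"))
print(due_date('{"customer": "Acme", "days_until_due": 30}', date(2024, 1, 15)))
```

A chatty reply such as `Sure! Here is the JSON you asked for` never reaches the arithmetic: `json.loads` rejects it and the caller gets an exception instead of a plausible wrong date.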

[Figure: the request (messy, human) enters a router, where deterministic code decides what runs. The router hands a thin, versioned prompt to the model, which performs bounded reasoning steps inside a structure it doesn't own. A schema clamps the output; tools let code do the computing. Beneath everything sit evals: every change is regression-tested and drift is caught before users catch it. They run when the prompt changes and when the model version changes. These are the same responsibilities as Fig 1, moved to parts that can be pinned, tested, and reviewed.]
Fig 2 — The same responsibilities, redistributed to parts that can be tested.

Notice what happened to the prompt in that picture. It did not disappear. It became one thin layer — the part that genuinely is instruction — and it acquired the properties the document never had: a version, a place in source control, tests that run against it, and a blast radius small enough that a bad change shows up as a failed eval rather than a quarter of quietly degraded decisions.
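One way to picture those properties, as a sketch under stated assumptions: treat the prompt and the pinned model snapshot as a single release, fingerprint the pair, and refuse to ship if the eval score falls below a threshold. `PromptRelease`, `EVAL_CASES`, the snapshot strings and both stub models are hypothetical; the prime-number cases are a nod to the study above, not its data.

```python
import hashlib
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class PromptRelease:
    prompt: str
    model_snapshot: str  # hypothetical pinned identifier, not a real model name

    @property
    def fingerprint(self) -> str:
        """Short hash that changes when either the prompt or the model changes."""
        payload = f"{self.model_snapshot}\n{self.prompt}".encode("utf-8")
        return hashlib.sha256(payload).hexdigest()[:12]


EVAL_CASES = [("Is 7 prime?", "yes"), ("Is 8 prime?", "no"), ("Is 13 prime?", "yes")]


def run_evals(
    release: PromptRelease,
    model: Callable[[str, str], str],
    threshold: float = 1.0,
) -> float:
    """Score the release; raise instead of letting a regression through quietly."""
    passed = sum(
        model(release.prompt, question).strip().lower() == expected
        for question, expected in EVAL_CASES
    )
    score = passed / len(EVAL_CASES)
    if score < threshold:
        raise AssertionError(
            f"release {release.fingerprint}: eval score {score:.2f} is below {threshold:.2f}"
        )
    return score


def steady_model(prompt: str, question: str) -> str:
    """Stub that answers correctly, standing in for the pinned snapshot."""
    n = int(question.split()[1])
    is_prime = n > 1 and all(n % d for d in range(2, int(n ** 0.5) + 1))
    return "yes" if is_prime else "no"


def drifted_model(prompt: str, question: str) -> str:
    """Stub that stopped reasoning after an update and always says no."""
    return "no"


release = PromptRelease("Answer yes or no.", "vendor-model-2023-03")
print(release.fingerprint, run_evals(release, steady_model))
```

Swapping `steady_model` for `drifted_model` raises an `AssertionError` that names the release, which is the failed eval standing in for the quarter of quietly degraded decisions.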

The industry did not abandon the sentence. It stopped asking the sentence to be the system.

The eras, compressed

Seen from a distance, the whole short history of applied AI is the logic migrating outward.

[Figure: four eras on a timeline. Prompt-craft: logic in the phrasing (the sentence is the system). Retrieval: logic in what you fetch (the sentence plus a library). Schemas & tools: logic in contracts + code (the sentence inside a contract). Agentic systems: logic in the architecture (the sentence is one component). The axis under the timeline runs from “a model update breaks everything, silently” on the left to “a model update breaks less, and loudly” on the right.]
Fig 3 — Each era moved the logic further out of the sentence.

The eras overlap, and none of them retired the ones before. There are workloads today that are rightly just a prompt, and there will be next year too. The axis under the timeline is the honest way to read it: what changes across the eras is not whether sentences exist, but what a model update can silently break. At the left end, everything, invisibly. At the right end, less — and when something does break, a schema violation or a failed eval says so out loud.

Silent versus loud is the whole game. Organisations can manage things that fail loudly. It is the quiet drift that ends up in front of a customer, a regulator, or a board.

I watched this happen on my own machine

I can date my own version of this migration fairly precisely.

My early setup was a folder of prompts I was proud of. Long ones — role, rules, examples, the accumulated scar tissue of every failure I had patched with another sentence. When the models updated, I would notice things subtly off, and I would patch again. The folder grew. My confidence in it did not.

What I run now barely resembles it. The instructions live in versioned skill files with change history. The context the models read is a set of maintained memory documents, not a heroic preamble. Which model handles which task is a routing rule, written down, not a vibe. Reviews run across model families precisely because I stopped trusting any single system’s account of its own work. The prompts that remain are short, and almost boring.

[Figure: two panels. THEN, a folder of prompts: prompts-FINAL.txt, prompts-FINAL-v2.txt, prompts-FINAL-really.txt; role, rules, examples, every patch that ever worked; no version history, just versions of the filename. NOW, the same job, structured: skills (versioned instructions, with change history); memory (the context every model reads, maintained); routing (which model gets which task, written down); review (a different model family checks the work); the prompts (short, and almost boring).]
Fig 4 — The same machine, two years apart.

Running more than one model family sharpened the lesson. When work is dispatched to a second model (a review to one family, a research pass to another), the dispatch prompt is written fresh each time, and it is deliberately boring: read these files, do this task, return this shape, stay under this length. The substance travels in the files, which are versioned. The prompt is the routing slip. What actually protects the work is not phrasing at all: it is the contract at the boundary (what the agent must read, what it must return, in what form) and verification the agent cannot talk its way around. Did the run take as long as real work takes? Did it write the files it claims? Does a different family reach the same conclusion?
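The contract and two of those checks fit in a few lines. This is a sketch of the idea rather than my actual tooling: `Dispatch`, `routing_slip` and `verify` are names made up for the example, and the thresholds are arbitrary.

```python
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass(frozen=True)
class Dispatch:
    read: tuple[str, ...]        # files the agent must read
    task: str
    must_write: tuple[str, ...]  # files the agent must produce
    max_words: int               # length limit on the returned summary
    min_seconds: float           # below this, the run is suspiciously fast


def routing_slip(d: Dispatch) -> str:
    """The deliberately boring prompt: the substance lives in the files."""
    return (
        f"Read: {', '.join(d.read)}\n"
        f"Task: {d.task}\n"
        f"Write: {', '.join(d.must_write)}\n"
        f"Summary limit: {d.max_words} words"
    )


def verify(d: Dispatch, workdir: Path, summary: str, elapsed_seconds: float) -> list[str]:
    """Check the run against the contract; an empty list means it passed."""
    problems = []
    for name in d.must_write:
        path = workdir / name
        if not path.is_file() or path.stat().st_size == 0:
            problems.append(f"missing or empty: {name}")
    if len(summary.split()) > d.max_words:
        problems.append("summary over length")
    if elapsed_seconds < d.min_seconds:
        problems.append("finished faster than real work takes")
    return problems


slip = Dispatch(
    read=("spec.md",),
    task="Review the spec",
    must_write=("review.md",),
    max_words=50,
    min_seconds=5.0,
)
print(routing_slip(slip))

with tempfile.TemporaryDirectory() as tmp:
    workdir = Path(tmp)
    # The agent claims success but wrote nothing and returned almost instantly.
    print(verify(slip, workdir, "Looks fine.", elapsed_seconds=0.4))
    # A run that actually produced the file and took a plausible amount of time.
    (workdir / "review.md").write_text("Two issues found.", encoding="utf-8")
    print(verify(slip, workdir, "Two issues found.", elapsed_seconds=42.0))
```

None of the checks reads the agent's own account of what it did, which is the point: they look at the filesystem and the clock.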

The most useful test I have found is portability. If a task only succeeds with one model’s magic phrasing, the phrasing is not craft. It is fragility that has not been invoiced yet.
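The portability test can be written down as well. A minimal sketch, assuming each model family is reachable as a plain `str -> str` callable; `family_a` and `family_b` are toy stand-ins invented for illustration, and the “magic phrase” is made up.

```python
from typing import Callable

Model = Callable[[str], str]


def portability(
    task: str,
    models: dict[str, Model],
    check: Callable[[str], bool],
) -> dict[str, bool]:
    """Run one plainly worded task against every family and record pass or fail."""
    return {name: check(model(task)) for name, model in models.items()}


def fragile(results: dict[str, bool]) -> bool:
    """Fragile means the task passes somewhere but not everywhere."""
    return any(results.values()) and not all(results.values())


def family_a(task: str) -> str:
    """Toy stand-in that answers regardless of phrasing."""
    return "42"


def family_b(task: str) -> str:
    """Toy stand-in that only behaves when the prompt contains its magic phrase."""
    return "42" if "step by step" in task else "I am not sure."


models = {"family_a": family_a, "family_b": family_b}
plain = portability(
    "What is 6 * 7? Reply with the number only.",
    models,
    lambda out: out.strip() == "42",
)
print(plain, fragile(plain))
```

A task that comes back fragile is a candidate for moving its logic out of the wording and into a schema, a tool, or a file the model is told to read.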

The thing I did not expect: the system got better as the prompts got worse. Less clever, less load-bearing, less mine. The craft moved up a level — from choosing the words to designing the structure the words sit inside. I suspect that is the same move most organisations are about to make, whether they have named it yet or not.

Where this lands for the organisation

From the consulting seat, the pattern is easy to spot because it always presents the same way. The team is proud of the prompt library. Nobody can say which model version the prompts were written against. There is no eval suite; there is a person who “checks it still works” — and that person’s continued employment is, in a real sense, the organisation’s AI control environment.

That is not an AI capability. It is unversioned business logic in a dependency nobody pinned, with key-person risk where the test suite should be — invisible right up until a vendor ships an update on their schedule, not yours.

The maturity question is not whether your people write good prompts. Some of them do, and in open-ended work the words still steer more than architects like to admit. The question is where the logic lives. What is pinned. What is tested. What fails loudly. What a model update can and cannot silently change about decisions your organisation is accountable for.

The prompt was a fine place to start. It was never going to be the place where the thinking of an institution could safely live. The organisations that internalise that early will spend the next few years building systems that survive their vendors’ release notes.

The ones that do not will keep the document.