The tax on every other language — Rajneesh Kambhatla

Write a sentence in English. Write the same sentence in Hindi. A person reads both in about the same breath — same meaning, same information, no real difference in effort.

Now hand both to the part of a language model that does the reading. The English sentence costs seven tokens. The Hindi one costs twenty-eight.

Fig 1 — Same meaning, four times the pieces.

Four times the work to take in the same thought. (Those counts are from GPT-4’s tokenizer; the exact number depends on which one you use — a newer, more multilingual tokenizer narrows the gap, which turns out to be the whole point.) Not because Hindi is longer, or vaguer, or harder to think in. Because of a decision made years ago, in another language, about whose words the machine would be allowed to treat as whole.

I came at this the way I suspect a lot of people do. I saw that seven-versus-twenty-eight gap, assumed it was a rounding quirk, and went looking for why. The answer turned out to be more deliberate than I expected — not a bug anyone introduced, but the direct consequence of a choice no one framed as being about fairness. It is worth walking through slowly, because once you see where the number comes from it stops looking like a quirk and starts looking like a decision — and the places undoing it are not the ones you might expect.

How a machine reads a sentence

Strip a language model down and the figure everyone quotes — the billions of parameters — sits in the middle of the process, not the start. Before any of that, your text has to be turned into something the model can do arithmetic on. That job belongs to a small, unglamorous component called the tokenizer.

Fig 2 — Tokenizing is step one. Everything downstream inherits its choices.

The tokenizer chops your text into tokens — chunks that might be a whole word, part of a word, or a single character — and hands the model a list of numbers. The model then does the only thing it really does: predict the next token, billions of times over, until it is good at guessing the pattern. The parameters are simply the knobs it tunes while learning to guess.

Which means the very first thing that happens to your sentence is that it gets cut into pieces. If that cut is clean for your language, everything after it is cheaper. If it is clumsy, you pay for the clumsiness at every step. So the real question is how the tokenizer decides where to cut.

The dictionary was written in English

Many of today’s models build their tokenizer with a method called byte-pair encoding, or a close cousin of it, and the idea is simple enough to explain over dinner. Take an enormous pile of text and break it into its smallest pieces: single characters or bytes. Find the pair of adjacent pieces that occurs most often. Glue that pair into a single unit. Repeat, thousands of times, gluing ever-larger fragments, until you have a fixed dictionary of chunks the model recognises.
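
The gluing loop is short enough to sketch. What follows is a toy, assuming a character-level start and a corpus that is just a list of words; the names train_bpe and merge_pair are made up for illustration, and production tokenizers add byte fallback, pre-tokenization rules and vastly more text.

```python
from collections import Counter

def merge_pair(symbols, pair):
    """Replace each adjacent occurrence of `pair` in `symbols` with one glued symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merge rules from a list of words (toy BPE)."""
    # Every distinct word starts as a tuple of single characters, with its count.
    words = Counter(tuple(w) for w in corpus)
    merges = []
    for _ in range(num_merges):
        # Count each adjacent pair of symbols, weighted by how often the word occurs.
        pairs = Counter()
        for symbols, freq in words.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken by pair order so runs are repeatable.
        best = max(pairs, key=lambda p: (pairs[p], p))
        merges.append(best)
        # Glue the winning pair everywhere it occurs.
        merged = Counter()
        for symbols, freq in words.items():
            merged[merge_pair(symbols, best)] += freq
        words = merged
    return merges

corpus = "the the the then there".split()
print(train_bpe(corpus, 2))
```

On this tiny corpus the first two merges build "th" and then "the": the most common word earns a whole slot first, which is the whole mechanism in miniature.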

Do this on mostly-English text and the dictionary fills with the shapes of English. Common English words survive whole. The, and, ing, tion — each earns a single-token slot, because each showed up often enough to deserve one.

Fig 3 — The vocabulary is a record of which language built it.

Now run a Hindi word through that same English-shaped dictionary. None of its pieces were frequent in the training pile, so none earned a whole-token slot. The word shatters into fragments: sometimes single characters, sometimes pieces smaller than a letter, the individual bytes that encode one character. The dictionary is not neutral. It is a record of which language it was built from, and it spends its efficiency on that language first.

This is the quiet mechanism behind the four-times number. The model was never told Hindi mattered less. It was fed a diet that made English cheap and everything else expensive, and the tokenizer faithfully wrote that diet into its bones.

Fertility — the tax nobody voted for

There is a word for this: fertility. It measures how many tokens it takes, on average, to represent one word. A perfect score is one token per word. English, on most models, sits close to it. In published tokenizer audits, major languages of South Asia, Southeast Asia and much of Africa commonly land far higher: four, six, sometimes eight tokens per word.
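
As a measurement, fertility is just a ratio. A minimal sketch, assuming words can be counted by splitting on whitespace (reasonable for English or Hindi, not for scripts such as Thai that do not put spaces between words); toy_tokenize and VOCAB are hypothetical stand-ins, not any real model's tokenizer.

```python
def fertility(texts, tokenize):
    """Average number of tokens per whitespace-separated word."""
    n_words = sum(len(t.split()) for t in texts)
    n_tokens = sum(len(tokenize(t)) for t in texts)
    return n_tokens / n_words

# Hypothetical stand-in tokenizer: words in the vocabulary stay whole,
# everything else falls apart into single characters.
VOCAB = {"the", "cat", "sat"}

def toy_tokenize(text):
    tokens = []
    for word in text.split():
        tokens.extend([word] if word in VOCAB else list(word))
    return tokens

print(fertility(["the cat sat"], toy_tokenize))  # every word is in the vocabulary
print(fertility(["the dog ran"], toy_tokenize))  # two words are not
```

Swap in a real tokenizer's encode function for toy_tokenize and the same ratio gives the figure the audits quote.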

Fig 4 — The distance between the orange bar and the line is the tax.

Every one of those extra tokens is a charge. On a per-token bill, four times the tokens is roughly four times the cost to process the same request. The extra tokens also eat the context window, the model’s short-term memory, so a Hindi document fills the model’s desk about four times faster than its English equivalent. And because that window is finite, more tokens for the same meaning leaves less room for everything else: the instructions, the documents, the reasoning the model is supposed to do.
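
The arithmetic is worth seeing once. This sketch uses the seven and twenty-eight token counts quoted above; the price and the window size are invented round numbers, and it assumes every sentence in a document costs the same as the example sentence.

```python
# Token counts quoted in this piece (GPT-4's tokenizer).
ENGLISH_TOKENS = 7
HINDI_TOKENS = 28

# Hypothetical figures, chosen only to make the arithmetic concrete.
PRICE_PER_1K_TOKENS = 0.01  # assumed price, arbitrary currency
CONTEXT_WINDOW = 8192       # assumed window size, in tokens

def cost(tokens, price_per_1k=PRICE_PER_1K_TOKENS):
    """Cost of processing `tokens` tokens on a per-token bill."""
    return tokens / 1000 * price_per_1k

def sentences_that_fit(tokens_per_sentence, window=CONTEXT_WINDOW):
    """How many whole sentences of this size fit before the window is full."""
    return window // tokens_per_sentence

cost_ratio = cost(HINDI_TOKENS) / cost(ENGLISH_TOKENS)
print(f"cost ratio: {cost_ratio:.1f}x")
print("English sentences per window:", sentences_that_fit(ENGLISH_TOKENS))
print("Hindi sentences per window:  ", sentences_that_fit(HINDI_TOKENS))
```

Whatever price or window you substitute, the ratio between the two languages stays the same, because it comes from the token counts alone.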

So speakers of a high-fertility language pay three times over: more money, less memory, worse thinking, all for the same sentence. None of them agreed to it. It was set by the composition of a training corpus they never saw.

Who actually pays

It is tempting to read this as an edge case. It is the opposite. English is, computationally, one of the easiest languages in the world to model: a single script, an enormous written record, and users comfortable typing. Most of the planet does not look like that.

Indonesia has more than seven hundred living languages. Nigeria has over five hundred. Across South and Southeast Asia, Sub-Saharan Africa and the Middle East, the primary way people reach these systems is not the keyboard but the voice. Treating these languages as a localisation problem — ship the English model, sprinkle on translations — gets the shape of the thing exactly backwards. The hard cases are not the exception to be handled later. They are most of the world.

Building the dictionary the other way

If the tokenizer is where the bias enters, it is also where it can be removed. The fix is almost embarrassingly direct: build the dictionary from the languages you actually care about.

Weight the training pile toward Devanagari, Tamil, Bengali — or Thai, or Arabic, or Yoruba — and byte-pair encoding does exactly what it did for English, only now in service of those scripts. It learns their common pairings. Their frequent words earn whole-token slots. The same Hindi word that shattered into five fragments becomes one or two clean tokens.
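
The effect of reweighting shows up even at toy scale. The sketch below is a compact character-level BPE written for illustration (train, encode and apply_merge are made-up names), run on two invented ten-word corpora with a budget of only three merges; it is not how any of the tokenizers named in this piece were built.

```python
from collections import Counter

def apply_merge(symbols, pair):
    """Glue each adjacent occurrence of `pair` into a single symbol."""
    out, i = [], 0
    while i < len(symbols):
        if tuple(symbols[i:i + 2]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return out

def train(words, num_merges):
    """Learn merge rules by repeatedly gluing the most frequent adjacent pair."""
    seqs = [list(w) for w in words]
    merges = []
    for _ in range(num_merges):
        pairs = Counter(p for s in seqs for p in zip(s, s[1:]))
        if not pairs:
            break
        best = max(pairs, key=lambda p: (pairs[p], p))  # deterministic tie-break
        merges.append(best)
        seqs = [apply_merge(s, best) for s in seqs]
    return merges

def encode(word, merges):
    """Tokenize a word by replaying the learned merges in order."""
    symbols = list(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return symbols

# Same two words, same merge budget; only the mix changes.
english_heavy = ["water"] * 9 + ["पानी"] * 1
hindi_heavy = ["water"] * 1 + ["पानी"] * 9

for name, corpus in [("English-heavy", english_heavy), ("Hindi-heavy", hindi_heavy)]:
    merges = train(corpus, num_merges=3)
    print(name, encode("water", merges), encode("पानी", merges))
```

With the English-heavy pile the merge budget is spent entirely on "water" and the Hindi word stays in single characters; flip the weights and the Hindi word becomes one token while "water" falls apart instead. A fixed budget spent differently is all that changes.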

Fig 5 — The tax is not reduced. It is repealed.

The results are not marginal. Teams that rebuild tokenizers this way report fertilities for these languages closer to one-and-a-half or two than to eight, within touching distance of English. And it takes nothing exotic. No new science; just a different answer to the question of whose language the dictionary is built from.

A model shaped for a phone

A good tokenizer is half the answer. The other half is the model you put on top of it — because the people who most need these systems are the least likely to be holding the hardware those systems quietly assume.

The important use cases here are government services, financial access and rural healthcare, running on mid-range phones and thin connections rather than gigabit fibre. A model for that world has to be capable and light at the same time, and those usually pull against each other. The architecture that threads the needle is called a mixture of experts.

Fig 6 — Capability on the shelf; only a slice pulled down for each word.

A conventional dense model fires every one of its parameters for every token it processes, however trivial the request. A mixture-of-experts model adds a router that wakes only the handful of specialists each token actually needs. The model can hold an enormous amount of capability in total while using a small slice of it at any moment, which means competitive quality for far less compute per token. (The full model still has to live somewhere with enough memory to hold it; what falls is the work done per token, not the size of the thing.)
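
The routing step can be sketched in a few lines. This is a simplified illustration, assuming top-k routing with a softmax over the chosen experts' scores, which is one common gating variant; the eight toy experts, the router weights and the function names are all hypothetical, and real experts are feed-forward networks rather than a single multiplication.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route(token_vec, router_weights, top_k):
    """Score every expert, keep the top_k, and return [(expert_index, gate)]."""
    scores = [sum(w * x for w, x in zip(row, token_vec)) for row in router_weights]
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:top_k]
    gates = softmax([scores[i] for i in top])
    return list(zip(top, gates))

def moe_layer(token_vec, router_weights, experts, top_k=2):
    """Run only the selected experts and blend their outputs by gate weight."""
    out = [0.0] * len(token_vec)
    used = []
    for idx, gate in route(token_vec, router_weights, top_k):
        expert_out = experts[idx](token_vec)  # the other experts never execute
        out = [o + gate * v for o, v in zip(out, expert_out)]
        used.append(idx)
    return out, used

def make_expert(scale):
    """Hypothetical expert: a stand-in that just scales its input."""
    return lambda vec: [scale * x for x in vec]

NUM_EXPERTS, TOP_K = 8, 2
experts = [make_expert(s) for s in range(NUM_EXPERTS)]
router_weights = [[0.1 * i, 0.0] for i in range(NUM_EXPERTS)]  # made-up weights

output, used = moe_layer([1.0, 0.0], router_weights, experts, TOP_K)
print("experts run:", used, "of", NUM_EXPERTS)
print("output:", output)
```

All eight experts sit in memory, but only two do any work for this token, which is the trade described above: total capacity stays large while the compute per token shrinks.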

Put the two together — a tokenizer that reads the language efficiently, and an architecture that is cheap to run at scale — and serving a billion people stops being priced out of reach. Where the model has to run on the handset itself, the same instinct points the other way, toward deliberately small models. Either path starts from the same refusal: to treat these languages, and these users, as an afterthought.

One team did this — and they are not alone

This is no longer theoretical. A team in Bangalore, Sarvam AI, built exactly this: a tokenizer weighted toward Indian scripts that it reports brings fertility to roughly 1.4 to 2.1 tokens per word — several times more efficient than the multilingual models it competes with — paired with a mixture-of-experts model that holds 105 billion parameters but activates only around ten billion for any given token.

What makes it a story rather than a press release is that they are not alone. The same pattern is repeating across nearly every region that the largest labs treat as an afterthought.

In India, AI4Bharat out of IIT Madras has spent years building Indic tokenizers, translation systems and benchmarks across the country’s twenty-two scheduled languages; Krutrim, backed by Ola, trained its own Indic tokenizer from scratch. In Thailand, Typhoon rebuilt tokenization for Thai and reports it running more than two-and-a-half times more efficiently. AI Singapore’s SEA-LION and the Sailor models carry the same regional fight into the languages of Southeast Asia. In the Gulf, the Jais models centre Arabic rather than bolt it on. And in Africa — where the compute is scarcest and the linguistic diversity highest — Lelapa AI’s InkubaLM is a deliberately tiny model for African languages, while the Masakhane community keeps surfacing exactly where the standard tokenizers break.

There is even a lane that proposes to skip the dictionary entirely. Research efforts like Google’s ByT5 and Meta’s MEGABYTE and Byte Latent Transformer drop the fixed vocabulary altogether and work closer to the raw bytes, so there is no learned dictionary to be biased toward one language in the first place. If that approach matures, the bias has nowhere left to live, even if scripts whose characters take more bytes to encode still cost more to read.
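
Working at the byte level is easy to picture. The sketch below shows only the input side, plain UTF-8 bytes used as token ids, and makes no claim about how any of those research models process the bytes afterwards; byte_tokens is a name invented for the example.

```python
def byte_tokens(text):
    """Tokenize by UTF-8 byte: every id is in range(256), with no learned vocabulary."""
    return list(text.encode("utf-8"))

for word in ["water", "पानी"]:
    ids = byte_tokens(word)
    # len(word) counts Unicode code points; len(ids) counts bytes.
    print(word, "->", len(word), "code points,", len(ids), "byte tokens")

# The mapping is lossless: the original text comes straight back.
assert bytes(byte_tokens("पानी")).decode("utf-8") == "पानी"
```

Nothing here was learned from any corpus, so nothing favours one language's words. The residual gap is the encoding itself: each Devanagari code point takes three bytes in UTF-8 where an ASCII letter takes one.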

What the market does next

Lined up together, these are not nine local curiosities. They are a set of differentiators pointing the same way. A tokenizer tuned to a region’s languages is a structural cost advantage no general-purpose model can match on that turf. An efficient architecture gives reach to users a data-centre-shaped model cannot economically serve. And a model built on local languages and local data is something governments and regulators can be persuaded to trust, which is the argument now travelling under the banner of sovereign AI.

The likely shape is layered rather than winner-take-all: a few global base models underneath, wrapped in regional tokenizer and data layers on top, with a smaller number of genuinely ground-up national models where language, policy or procurement justify the cost. The open question is distribution. A regional model can be measurably better on its home languages and still lose to a Big Tech model that is merely good enough but arrives on every phone by default.

Which returns us to the sentence we started with — seven tokens in English, twenty-eight in Hindi. It looked at first like a technical quirk. It is really a question about who the foundation was poured for. For most of the short history of these systems the answer was: not you, not here, not in your language. That is finally starting to change — and, fittingly, the change is coming from the places that were never given the option of treating it as someone else’s problem.