What Is a Transformer?
The Transformer is the architecture behind GPT, Claude, Gemini, Llama, and most modern language models. If you understand the Transformer, you understand why today's models can track context, write fluently, and still fail in very specific ways.
1. The short version
A Transformer is a neural-network design built to process a whole sequence of tokens while constantly asking: which other tokens matter most right now?
Older language systems struggled with long-range relationships. Transformers improved that by using attention: a mechanism that lets each token look across the rest of the sequence and weigh what is relevant.
Key idea
A Transformer is not "one big thought." It is a stack of repeated blocks. Each block refines the model's internal representation of the text, passes it upward, and helps the model predict the next token.
2. Why the Transformer mattered
Before 2017, most language models were recurrent networks that processed text one token at a time. They could work, but they were slow to train and worse at relating distant pieces of information. The 2017 paper Attention Is All You Need changed that.
The breakthrough was that the model could learn relationships between tokens in parallel, which made larger training runs practical and made long-context behaviour much stronger. That is why the Transformer sits under nearly every frontier language model today.
3. A simple diagram
Tokens — text becomes token IDs
↓
Embeddings — tokens become vectors
↓
Transformer Blocks — attention + FFN + residual paths
↓
Output Head — scores every next-token option
↓
Next Token — one choice is sampled
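The flow above can be sketched end to end in a few lines. Everything here is an illustrative assumption: the vocabulary size, model width, and weights are made up and untrained, and the block body is a stand-in for the real attention and feed-forward machinery.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE, D_MODEL = 50, 16  # toy sizes, assumed for illustration

embedding = rng.normal(size=(VOCAB_SIZE, D_MODEL))    # token ID -> vector
output_head = rng.normal(size=(D_MODEL, VOCAB_SIZE))  # vector -> per-token scores

def transformer_block(x):
    # Placeholder for attention + FFN + residuals: just nudges the vectors.
    return x + np.tanh(x @ rng.normal(size=(D_MODEL, D_MODEL)) * 0.1)

def next_token(token_ids, n_blocks=4):
    x = embedding[token_ids]                # Tokens -> Embeddings
    for _ in range(n_blocks):               # stack of Transformer Blocks
        x = transformer_block(x)
    logits = x[-1] @ output_head            # Output Head scores every option
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    return rng.choice(VOCAB_SIZE, p=probs)  # Next Token: one choice is sampled

token = next_token([3, 14, 7])
```

Note that only the last position's vector is scored: a decoder-style model predicts the token that follows the sequence it has seen so far.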
4. What happens inside one block
A modern decoder-style model such as GPT or Claude repeats the same broad pattern many times. A single Transformer block usually contains:
Attention
Each token compares itself to other tokens and decides what to focus on. This is where pronouns, references, syntax, and long-range relationships get stitched together.
Feed-forward network
After attention mixes information across tokens, a per-token network transforms that information into a richer representation.
Residual connections
The block keeps a shortcut path so it can add the new transformation without losing the original signal.
Layer normalization
The activations are stabilised so the stack trains and runs more reliably across many layers.
5. Why attention is so powerful
Suppose you read: "The trophy would not fit in the suitcase because it was too small." What does "it" refer to? Humans infer that it means the suitcase. Attention lets the model build similar links by scoring how strongly one token should attend to each of the others.
In practice, models do this with many attention heads at once. Different heads can specialise in different kinds of relationships: syntax, position, code structure, quotation boundaries, or topic continuity.
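A minimal sketch of that multi-head idea, with assumed sizes and random untrained weights: each head gets its own projections, attends to the sequence independently, and the heads' outputs are concatenated back to the full model width.

```python
import numpy as np

rng = np.random.default_rng(0)
D, N_HEADS = 16, 4       # assumed model width and head count
D_HEAD = D // N_HEADS    # each head works in a smaller subspace

def head_attend(x, Wq, Wk, Wv):
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(D_HEAD)
    w = np.exp(scores - scores.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)  # each row: where this token looks, summing to 1
    return w @ v

def multi_head(x):
    outputs = []
    for _ in range(N_HEADS):  # in training, each head's weights specialise differently
        Wq, Wk, Wv = (rng.normal(size=(D, D_HEAD)) * 0.1 for _ in range(3))
        outputs.append(head_attend(x, Wq, Wk, Wv))
    return np.concatenate(outputs, axis=-1)  # heads rejoined to width D

y = multi_head(rng.normal(size=(6, D)))  # 6 tokens -> 6 mixed vectors
```

Because the heads run in parallel on the same sequence, one can track syntax while another tracks position or quotation boundaries, and the model combines their views.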
6. What a Transformer is not
A Transformer is not a human brain, not a database of facts, and not the whole product experience. It is the core model architecture. Once you wrap that model in routing, tools, prompts, retrieval, sandboxes, and verifiers, you get the real system users feel.
That is why a model can appear to get "better" or "worse" even if the underlying base model is unchanged: the surrounding harness changes as well. The next deep dive on this is What Is a Harness?.