Model Layers Explained
People often say a model has "dozens of layers" or "96 transformer blocks" without explaining what those layers actually do. This page walks through the stack from bottom to top in plain English.
1. Think of the model as a tower
Each stage refines the representation from the stage below. Lower layers often capture simpler structure, while higher layers tend to carry more abstract task-relevant information.
2. Input tokens become embeddings
The model does not reason over raw text characters. It starts by splitting text into tokens (often subword pieces), then maps each token to a dense numeric vector called an embedding.
These vectors are where the model's internal geometry begins. Similar meanings or usages can land near one another in this vector space, which makes later operations possible.
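The lookup itself is simple. Here is a minimal NumPy sketch; the three-word vocabulary, token IDs, and random 4-dimensional vectors are made-up stand-ins, not values from any real model.

```python
import numpy as np

# Toy vocabulary and embedding table; real models have tens of thousands
# of tokens and hundreds or thousands of dimensions per vector.
vocab = {"the": 0, "dog": 1, "barks": 2}
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(len(vocab), 4))  # one row per token

def embed(tokens):
    """Map a list of token strings to their embedding vectors."""
    ids = [vocab[t] for t in tokens]
    return embedding_table[ids]  # shape: (num_tokens, 4)

vectors = embed(["the", "dog", "barks"])
print(vectors.shape)  # (3, 4)
```

The key point is that embedding is just a table lookup: the "meaning" lives in how training has arranged the rows of that table.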
3. Position still matters
If the model only saw bags of token vectors, it would not know the difference between "dog bites man" and "man bites dog". So the stack also injects position information. Different model families handle this in slightly different ways, but the purpose is the same: preserve order.
Position is part of why long-context behaviour is hard. Models must preserve not just which tokens exist, but how they relate over long distances.
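One classic scheme, from the original Transformer paper, gives each position a fixed pattern of sine and cosine values that is added to the token embeddings. The sketch below shows that scheme; many newer model families use learned or rotary variants instead.

```python
import numpy as np

def sinusoidal_positions(num_positions, dim):
    """Classic sinusoidal position encoding: each position gets a unique
    sine/cosine pattern, so word order survives into the vector space."""
    positions = np.arange(num_positions)[:, None]          # (P, 1)
    freqs = 1.0 / (10000 ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = positions * freqs                             # (P, dim/2)
    enc = np.zeros((num_positions, dim))
    enc[:, 0::2] = np.sin(angles)
    enc[:, 1::2] = np.cos(angles)
    return enc

pos = sinusoidal_positions(8, 16)
# Adding pos to the token embeddings makes "dog bites man" and
# "man bites dog" produce different inputs to the first block.
print(pos.shape)  # (8, 16)
```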
4. Inside the repeated transformer blocks
Multi-head attention
The block computes attention using several heads in parallel. A single attention head is one learned way of relating tokens to other tokens. Multiple heads let the model capture several kinds of relationships at once.
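The mechanics can be sketched in a few lines of NumPy. The random matrices below stand in for the learned query, key, value, and output projections; the sizes are illustrative, and this omits details like causal masking.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def multi_head_attention(x, num_heads, rng):
    """Scaled dot-product attention with several heads in parallel."""
    seq_len, dim = x.shape
    head_dim = dim // num_heads
    # Random stand-ins for the learned Q/K/V/output projection matrices.
    Wq, Wk, Wv, Wo = (rng.normal(size=(dim, dim)) * dim**-0.5
                      for _ in range(4))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    # Split the last dimension into heads: (heads, seq, head_dim).
    split = lambda t: t.reshape(seq_len, num_heads, head_dim).transpose(1, 0, 2)
    q, k, v = split(q), split(k), split(v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(head_dim)  # (heads, seq, seq)
    out = softmax(scores) @ v            # each head mixes token info its own way
    out = out.transpose(1, 0, 2).reshape(seq_len, dim)     # re-join the heads
    return out @ Wo

rng = np.random.default_rng(0)
y = multi_head_attention(rng.normal(size=(5, 8)), num_heads=2, rng=rng)
print(y.shape)  # (5, 8)
```

Note how each head gets its own slice of the vector: that is what lets different heads track different relationships in the same pass.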
Feed-forward network
After information is mixed across tokens, each token passes through a feed-forward network. This gives the model capacity to transform and sharpen what it has learned from attention.
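A minimal sketch of that step, with illustrative random weights: the network expands each token vector, applies a nonlinearity, and projects back, independently per token.

```python
import numpy as np

def feed_forward(x, rng):
    """Position-wise feed-forward net: expand, apply a nonlinearity,
    project back. Each token vector is transformed independently."""
    dim = x.shape[-1]
    W1 = rng.normal(size=(dim, 4 * dim)) * dim**-0.5        # ~4x expansion is common
    W2 = rng.normal(size=(4 * dim, dim)) * (4 * dim)**-0.5
    hidden = np.maximum(x @ W1, 0.0)  # ReLU here; real models vary (GELU, SwiGLU, ...)
    return hidden @ W2

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))   # 5 token vectors, 8 dims each
y = feed_forward(x, rng)
print(y.shape)  # (5, 8)
```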
Residual connections
A residual connection adds the original signal back into a later computation so the network learns refinements instead of replacing the whole signal.
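In code, this is literally one addition. The toy sublayer below is a made-up stand-in for attention or the feed-forward net.

```python
import numpy as np

def residual(x, sublayer):
    """Add the sublayer's output back onto its input: the block learns a
    small refinement (a delta) rather than replacing the whole signal."""
    return x + sublayer(x)

x = np.ones(4)
y = residual(x, lambda t: 0.1 * t)  # toy sublayer: a small refinement of x
print(y)  # [1.1 1.1 1.1 1.1]
```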
Layer normalization
Layer normalization keeps activations well-scaled and makes deep stacks more stable to train and run.
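The core computation is straightforward: normalize each token vector to zero mean and unit variance. This sketch omits the learned scale and shift parameters that real implementations include.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize each token vector to zero mean, unit variance.
    Real models add learned scale/shift parameters; omitted here."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

x = np.array([[1.0, 2.0, 3.0, 4.0]])
normed = layer_norm(x)
print(normed.mean(), normed.std())  # ~0.0, ~1.0
```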
5. What "more layers" usually means
More layers generally means the model can transform the signal more times before making a final prediction. That often increases capacity, but it also increases memory needs, latency, and serving cost.
More layers does not automatically mean better outputs. Architecture choices, data quality, training recipe, inference stack, and the surrounding harness all matter too.
6. Where the final answer comes from
Once the top layer finishes, the model uses an output projection head to score every possible next token in its vocabulary. Those raw scores (logits) are turned into a probability distribution, typically with a softmax, and one token is selected according to the decoding settings.
Then the whole loop repeats with the new token appended. This is why generation feels fluid but is still fundamentally next-token prediction.
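One step of that loop can be sketched as follows. The shapes here are tiny and hypothetical (a 4-dimensional hidden state, a 6-token vocabulary), and the random projection stands in for the learned output head.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
W_out = rng.normal(size=(4, 6))     # stand-in for the output projection head
hidden = rng.normal(size=4)         # top-layer vector for the last token

logits = hidden @ W_out             # one raw score per vocabulary token
probs = softmax(logits)             # scores -> probability distribution
next_token = int(np.argmax(probs))  # greedy decoding; sampling is also common
print(probs.sum())
```

In a real model this chosen token is appended to the input and the whole stack runs again, which is the loop the paragraph above describes.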