Beginner 8 min read Part 2 of 5

How Large Language Models Work

In Part 1, we learned what AI is at a high level. Now let's look under the hood of the technology that powers ChatGPT, Claude, Gemini, and every other AI chatbot you've used. You don't need a computer science degree -- just curiosity.

1. What Is a Large Language Model?

A Large Language Model (LLM) is a type of AI that has been trained on enormous amounts of text to understand and generate human language. When you type a message into ChatGPT, Claude, Gemini, or any similar tool, you're talking to an LLM.

The "large" in LLM refers to two things: the massive amount of text data the model was trained on (often most of the publicly available internet), and the sheer number of internal parameters -- the numerical values the model has learned, which can number in the hundreds of billions.

At its core, an LLM does something deceptively simple: it predicts the next word. Given the text that comes before, the model calculates which word (or piece of a word) is most likely to come next. Then it takes that output, appends it to the input, and predicts the next word again. And again. And again. That's how it writes entire paragraphs, essays, and even code -- one word at a time.
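That loop can be sketched in a few lines of Python. Everything here is illustrative: `predict_next_token` is a made-up stand-in for the real model, which scores every token in a vocabulary of tens of thousands rather than looking up a tiny table.

```python
# Illustrative autoregressive loop. `predict_next_token` is a
# hypothetical stand-in: a real LLM returns a probability distribution
# over its entire vocabulary at every step.
def predict_next_token(tokens):
    continuations = {
        ("The", "cat", "sat"): "on",
        ("cat", "sat", "on"): "the",
        ("sat", "on", "the"): "mat",
    }
    return continuations.get(tuple(tokens[-3:]), "<end>")

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)  # 1. predict the next token
        if next_token == "<end>":
            break
        tokens.append(next_token)                # 2. append it to the input
    return " ".join(tokens)                      # 3. repeat

print(generate(["The", "cat", "sat"]))  # → The cat sat on the mat
```

The structure is the whole point: predict, append, repeat. Nothing else changes between steps.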

Key Concept

LLMs are next-word prediction machines. The sentence "The cat sat on the ___" has a highly predictable next word ("mat", "floor", "chair"). LLMs make exactly this kind of prediction, but across billions of patterns learned from training data. The results are so good they can feel like understanding -- but the underlying mechanism is statistical prediction, not reasoning the way humans do.

Well-known LLMs include OpenAI's GPT-4o, Anthropic's Claude, Google's Gemini, and Meta's openly released Llama models. They all work on this same fundamental principle, though they differ in their training data, size, architecture details, and fine-tuning.

2. Tokens and Tokenisation

Before an LLM can process your message, it needs to break the text down into smaller pieces called tokens. This process is called tokenisation, and it's the very first step in everything an LLM does.

A token is not the same as a word. Tokens are chunks of text that the model has learned to treat as units. Short, common words like "I" or "the" are usually a single token. Longer or less common words get split into multiple tokens. Punctuation, spaces, and even parts of words can each be their own token.

Tokenisation Example

The sentence "I love programming" gets split into four tokens:

"I" + " love" + " program" + "ming"

Notice: "love" keeps a leading space as part of the token. "programming" is split into "program" and "ming" because the whole word isn't a single entry in this tokeniser's vocabulary, while those two chunks are. The exact split depends on the model's vocabulary.

Key Concept

A rough rule of thumb: 1 token is approximately 4 characters in English, or about three-quarters of a word. A 1,000-word essay is roughly 1,300 to 1,500 tokens. This matters because AI services charge per token -- both for your input (the prompt) and the model's output (the response). Understanding tokens is essential for understanding AI pricing.
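That rule of thumb is easy to turn into a rough estimator. This is only the approximation described above, not a real tokeniser:

```python
def estimate_tokens_from_chars(text):
    """Rough estimate: about 4 characters per token in English."""
    return max(1, len(text) // 4)

def estimate_tokens_from_words(word_count):
    """Rough estimate: a token is about three-quarters of a word."""
    return round(word_count / 0.75)

print(estimate_tokens_from_words(1000))  # → 1333, within the range above
```

Estimates like these are fine for budgeting; for exact counts you need the specific model's own tokeniser.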

Different models use different tokenisers, which means the same sentence might be split slightly differently by GPT-4 versus Claude versus Gemini. But the core idea is universal: text goes in as raw characters and comes out as a sequence of token IDs that the model can work with mathematically.
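To make the idea concrete, here is a toy greedy longest-match tokeniser with a hand-picked vocabulary. Real tokenisers (for example, byte-pair encoding) build their vocabularies automatically from data and map each token to a numeric ID, but the matching step is similar in spirit:

```python
def tokenise(text, vocab):
    """Greedy longest-match: repeatedly take the longest vocabulary
    entry that matches the start of the remaining text."""
    tokens = []
    while text:
        match = max((v for v in vocab if text.startswith(v)),
                    key=len, default=None)
        if match is None:
            match = text[0]          # unknown: fall back to one character
        tokens.append(match)
        text = text[len(match):]
    return tokens

# Hand-picked vocabulary chosen to reproduce the split shown earlier.
vocab = {"I", " love", " program", "ming", "love", "pro"}
print(tokenise("I love programming", vocab))
# → ['I', ' love', ' program', 'ming']
```

Note how " love" (with its leading space) wins over "love", and " program" wins over "pro": the longest matching entry is always preferred.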

3. How Models Are Trained

Training an LLM happens in two major phases. Understanding these phases helps explain both why LLMs are so capable and why they have the specific limitations they do.

Phase 1: Pre-training

In the pre-training phase, the model reads an astronomical amount of text -- books, websites, articles, forums, code repositories, scientific papers, and much more. We're talking about trillions of words. During this process, the model is given a simple task over and over: predict the next word. It reads a sequence of tokens, predicts what comes next, checks if it was right, and adjusts its internal parameters to do better next time.
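The objective can be illustrated with a toy model that simply counts continuations in a tiny corpus. Real pre-training adjusts billions of parameters by gradient descent rather than counting, but the task -- predict what comes next -- is the same:

```python
from collections import Counter, defaultdict

# Toy "pre-training": tally which word follows which in a tiny corpus,
# then predict the most frequent continuation.
corpus = "the cat sat on the mat the cat sat on the chair".split()

follows = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    follows[current][nxt] += 1       # count each observed continuation

def predict_next(word):
    return follows[word].most_common(1)[0][0]

print(predict_next("the"))  # → cat  ("cat" follows "the" most often)
```

Scale this idea up from one preceding word to thousands, and from counting to learned parameters, and you have the essence of what pre-training optimises.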

This process requires enormous computing power -- thousands of specialised processors (GPUs or TPUs) running for weeks or months. The cost of pre-training a frontier model can run into tens or even hundreds of millions of dollars. The result is a "base model" that is very good at predicting text but not yet good at following instructions or having a conversation.

Phase 2: Fine-tuning and RLHF

A base model that can predict the next word is impressive, but it's not particularly useful as an assistant. If you ask it a question, it might just continue generating text in the style of a web page rather than answering you directly. This is where fine-tuning comes in.

During fine-tuning, the model is trained on carefully curated examples of good conversations: a human asks a question, and an ideal response is provided. The model learns to follow instructions, be helpful, and format its answers clearly.

An important technique in this phase is RLHF (Reinforcement Learning from Human Feedback). Human reviewers rank different model outputs from best to worst. The model then learns to produce responses that humans prefer -- not just text that is statistically likely, but text that is helpful, honest, and safe. This is a large part of what makes modern chatbots feel so conversational.

Key Concept

Modern LLMs have billions of parameters -- the numerical values that encode everything the model has learned. GPT-4 is estimated to have over a trillion parameters, though its exact size is not public; the same is true for Claude and Gemini. Meta's Llama family, whose sizes are published, ranges from a few billion to hundreds of billions. More parameters generally means the model can capture more nuance, but also requires more computing power to run.

4. The Transformer Architecture

Every modern LLM is built on an architecture called the Transformer, introduced in a landmark 2017 research paper from Google titled "Attention Is All You Need." This paper changed the entire field of AI, and its core innovation -- the attention mechanism -- is what makes today's language models possible.

What came before: processing words one at a time

Before Transformers, language models processed text sequentially -- one word at a time, left to right, like reading a sentence aloud. This made them slow and forgetful. By the time the model reached the end of a long paragraph, it had often "forgotten" details from the beginning. Long-range context was a fundamental problem.

The breakthrough: attention

The Transformer's key innovation is that the model can look at all the words at once, not just left-to-right. When generating each new word, the model uses its attention mechanism to decide which parts of the input are most relevant to what it's producing right now.

Think of it like reading an exam question. You don't read it one word at a time and forget the beginning by the end. You scan the whole thing, focus on the key parts, and then formulate your answer. That's essentially what the attention mechanism does -- it lets the model "attend to" (focus on) different parts of the input simultaneously, weighting them by relevance.
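Here is a stripped-down sketch of that weighting step. The relevance scores are made up for illustration (a real model learns them); the softmax function, which turns arbitrary scores into positive weights that sum to 1, is the genuine ingredient:

```python
import math

def softmax(scores):
    """Turn arbitrary scores into positive weights that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

words  = ["The", "dog", "didn't", "cross", "the", "street"]
scores = [0.1, 0.8, 0.1, 0.4, 0.1, 2.5]   # made-up relevance to "it"

weights = softmax(scores)
for word, weight in zip(words, weights):
    print(f"{word:>8}  {weight:.2f}")
# "street" receives by far the largest weight, so it dominates the mix
```

In a real Transformer, this happens in many attention "heads" at once, each with its own learned notion of relevance.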

A Simple Example

Consider the sentence: "The dog didn't cross the street because it was too wide."

What does "it" refer to -- the dog or the street? You know it's the street (streets are wide, not dogs). The attention mechanism allows the model to make the same connection by directly linking "it" to "street" based on patterns learned during training, even though the two words are far apart in the sentence.

Why This Was Revolutionary

The Transformer architecture solved two problems at once. First, it could handle long-range dependencies -- connecting information across thousands of words. Second, because it processes all words in parallel rather than one at a time, it could be trained much faster on much more data. This is what enabled the jump from modest language models to the massive, capable LLMs we have today. Every major model -- GPT, Claude, Gemini, Llama -- is a Transformer.

5. Why Models Hallucinate

If there is one thing you take away from this entire course, let it be this: LLMs do not look up facts. They predict likely text.

When you ask a model "Who wrote Hamlet?", it doesn't search a database. Instead, it has seen the pattern "Hamlet was written by William Shakespeare" so many times during training that "William Shakespeare" is the overwhelmingly likely next set of tokens. It gets the answer right -- but not because it "knows" the fact. It gets it right because that pattern was extremely common in its training data.

The problem arises when you ask something that doesn't have a dominant pattern in the training data. The model will still generate confident, fluent text -- because that's what it was trained to do. But the content may be completely fabricated. This is called hallucination.

Critical Limitation

A hallucinating model doesn't signal that it's unsure. It presents made-up information with exactly the same confident, authoritative tone it uses for accurate information. It will cite research papers that don't exist, invent statistics, fabricate quotes, and describe events that never happened -- all while sounding completely certain.

This is not a bug that will be easily fixed. It's a fundamental consequence of how these models work: they generate plausible text, and plausible is not the same as true.

How to Protect Yourself

  • Verify claims independently -- especially statistics, dates, names, and citations.
  • Ask the model to show its reasoning -- if it can't explain how it arrived at an answer, be suspicious.
  • Be extra cautious in high-stakes domains -- medical, legal, and financial information should always be checked by a professional.
  • Tell the model it's OK to say "I don't know" -- this can reduce (but not eliminate) hallucination.

6. Context Windows: The Model's Working Memory

An LLM doesn't have memory in the way you do. It doesn't remember your conversation from yesterday or learn from past interactions. Instead, every time you send a message, the model receives your entire conversation so far as input and generates a response based on that. The maximum amount of text it can receive at once is called the context window.

Think of the context window as the model's working memory -- everything it can "see" at one time. Anything outside this window doesn't exist to the model. If your conversation grows longer than the context window, the oldest messages get dropped and the model simply cannot reference them any more.
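The dropping behaviour can be sketched as follows. `count_tokens` reuses the rough 4-characters-per-token estimate from earlier, and the budget is kept tiny for demonstration; real chat interfaces apply the same idea with real token counts and much larger budgets:

```python
def count_tokens(message):
    """Rough estimate from earlier: about 4 characters per token."""
    return max(1, len(message) // 4)

def fit_to_context(messages, max_tokens):
    """Keep the newest messages that fit in the budget; drop the rest."""
    kept, total = [], 0
    for message in reversed(messages):   # walk from newest to oldest
        cost = count_tokens(message)
        if total + cost > max_tokens:
            break                        # this message and all older ones go
        kept.append(message)
        total += cost
    return list(reversed(kept))          # restore chronological order

history = [
    "Hello there, how are you today?",   # oldest
    "I am fine, thanks for asking!",
    "Great. What is a context window?",  # newest
]
print(fit_to_context(history, max_tokens=15))
# the oldest message no longer fits and is dropped
```

From the model's point of view, the dropped message never existed: only what survives truncation is part of its input.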

Context Window Sizes (Examples)

  • GPT-4o: 128K tokens
  • Claude (Anthropic): 200K tokens
  • Gemini 1.5 Pro: 1M+ tokens

128K tokens is roughly equivalent to a 300-page book. 1M tokens can hold multiple books. Context windows have grown dramatically -- the original GPT-3 had only about 2K tokens.

Key Concept

Context windows are measured in tokens, not words. Remember: 1 token is roughly 4 characters. A larger context window means the model can process longer documents, maintain longer conversations, and consider more information at once. But larger context windows also cost more (you pay per token) and can be slower. There's always a trade-off.

Context window size is one of the most important factors when choosing a model. For a detailed side-by-side comparison of context windows across all major models, see our Context Window Comparison page.

Part 2 Summary

  • LLMs predict the next token -- that's the fundamental mechanism behind GPT, Claude, Gemini, and all similar models.
  • Text is split into tokens, not words. Roughly 1 token = 4 characters. Tokens determine both what the model can process and what you pay.
  • Training has two phases: pre-training (reading the internet) gives broad knowledge; fine-tuning and RLHF make the model helpful and safe.
  • The Transformer architecture (2017) is the breakthrough behind all modern LLMs. Its attention mechanism lets models consider all words at once.
  • Models hallucinate because they predict likely text, not verified facts. Always verify important information independently.
  • Context windows define how much text the model can see at once -- its working memory. Bigger isn't always better due to cost and speed trade-offs.