How Large Language Models Work
In Part 1, we learned what AI is at a high level. Now let's look under the hood of the technology that powers ChatGPT, Claude, Gemini, and every other AI chatbot you've used. You don't need a computer science degree - just curiosity.
1. What Is a Large Language Model?
A Large Language Model (LLM) is a type of AI trained on enormous amounts of text so it can understand and generate language. When you use ChatGPT, Claude, Gemini, or a similar assistant, you are interacting with an LLM.
The word "large" refers both to the training data and to the size of the model itself. Modern models contain huge numbers of learned parameters, often in the billions. Those parameters are not facts stored in neat boxes; they are learned numerical relationships that shape how the model predicts the next token.
At the most basic level, an LLM is a next-token prediction engine. It looks at the text so far, assigns a probability to every token that could come next, picks one (usually a highly likely one), appends it, and repeats the process until the response is complete.
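That loop can be sketched in a few lines of Python. Everything here is illustrative: `predict_next_token` is a hypothetical stand-in for the real neural network, which would score every token in a vocabulary of tens of thousands.

```python
def predict_next_token(tokens):
    # Stand-in for the real model: a trained LLM would assign a
    # probability to every token in its vocabulary; here we just
    # look up a hand-written continuation for the last token.
    continuations = {"The": "cat", "cat": "sat", "sat": "down"}
    return continuations.get(tokens[-1], "<end>")

def generate(prompt_tokens, max_new_tokens=10):
    tokens = list(prompt_tokens)
    for _ in range(max_new_tokens):
        next_token = predict_next_token(tokens)
        if next_token == "<end>":   # a stop token ends the response
            break
        tokens.append(next_token)   # append the prediction and repeat
    return tokens

print(generate(["The"]))  # ['The', 'cat', 'sat', 'down']
```

The important part is the shape of the loop, not the lookup table: predict, append, repeat.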
2. Tokens and Tokenisation
Before the model can work with your text, the text is broken into smaller pieces called tokens. This process is called tokenisation.
Tokens are not always the same as words. Common words may be single tokens, while longer words may split into several pieces. Punctuation and leading spaces can also form part of tokens.
Tokenisation Example
The sentence "I love programming" might be split like this: "I", " love", " program", "ming" - four tokens, with the leading spaces attached to the tokens themselves.
Exact splits depend on the model's tokeniser. That is why token counts for the same text differ across providers.
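A toy sketch can show the idea. The vocabulary below is hand-picked for the example; real tokenisers (for instance byte-pair encoding) learn their vocabulary from data, but the greedy longest-match spirit is similar.

```python
# Hand-picked toy vocabulary; a real tokeniser learns tens of
# thousands of entries from data (e.g. via byte-pair encoding).
VOCAB = {"I", " love", " program", "ming"}

def tokenise(text):
    tokens = []
    while text:
        # Greedily take the longest vocabulary entry at the front of
        # the text; fall back to a single character if nothing matches.
        matches = [v for v in VOCAB if text.startswith(v)]
        piece = max(matches, key=len) if matches else text[0]
        tokens.append(piece)
        text = text[len(piece):]
    return tokens

print(tokenise("I love programming"))  # ['I', ' love', ' program', 'ming']
```

Notice that the spaces travel inside the tokens - that is why token counts rarely match word counts.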
3. How Models Are Trained
Training usually happens in two broad stages. First comes pre-training, where the model reads a huge amount of text and learns broad language patterns through next-token prediction. Then comes instruction tuning and preference tuning, where the model is shaped into something more useful, safer, and easier to interact with.
This is why a base model and a chat model can feel very different even if they share the same core architecture. The second stage changes the behaviour you see at the surface.
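At its core, pre-training is statistics at enormous scale: learning which tokens tend to follow which. A drastically simplified, hypothetical stand-in is a bigram count model - real LLMs learn billions of parameters by gradient descent, but the flavour is recognisable:

```python
from collections import Counter, defaultdict

# Tiny toy "corpus"; pre-training uses trillions of tokens instead.
corpus = "the cat sat on the mat and the cat slept".split()

# "Training": count which token follows which.
following = defaultdict(Counter)
for current, nxt in zip(corpus, corpus[1:]):
    following[current][nxt] += 1

def predict_next(token):
    # "Inference": return the most common successor seen in training.
    return following[token].most_common(1)[0][0]

print(predict_next("the"))  # 'cat' ("the" was followed by "cat" twice, "mat" once)
```

Instruction and preference tuning have no analogue in this sketch; they reshape an already-trained model's behaviour rather than its basic statistics.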
Key Concept
Parameter count matters, but it is not the whole story. Data quality, training recipe, alignment work, and the runtime harness also shape the final experience.
4. The Transformer Architecture
Virtually every major LLM today is built on the Transformer architecture. The Transformer made modern language models possible by introducing the attention mechanism, which lets each token focus on the other tokens in the sequence that are most relevant to it.
This helped solve two major problems at once: long-range context and efficient parallel training. That is why the 2017 paper "Attention Is All You Need" is one of the most important documents in modern AI.
Simple example
In the sentence "The dog didn't cross the street because it was too wide.", attention helps the model relate "it" back to "street" rather than "dog".
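A minimal sketch of scaled dot-product attention shows the mechanism. The two-dimensional vectors below are hand-picked, hypothetical values chosen so that a query for "it" matches "street" more than "dog"; real models use learned, high-dimensional query/key/value projections.

```python
import math

def attention(query, keys, values):
    d = len(query)
    # Similarity of the query to each key, scaled by sqrt(dimension).
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
              for key in keys]
    # Softmax turns raw scores into weights that sum to 1.
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Output is the weighted blend of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values))
              for i in range(len(values[0]))]
    return weights, output

# Query vector for "it"; keys/values for "dog" then "street".
weights, _ = attention(query=[1.0, 0.0],
                       keys=[[0.2, 1.0], [0.9, 0.1]],
                       values=[[1, 0], [0, 1]])
print(weights)  # the second weight ("street") is the larger one
```

The weights are what "focus" means concretely: a probability distribution over the other tokens, computed fresh for every token at every layer.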
5. Why Models Hallucinate
Models do not retrieve truth from a verified source by default. They generate plausible continuations of the text they have been given. If the most plausible continuation is wrong, you get a hallucination.
Critical limitation
Hallucinations often sound confident. That is why you should verify important claims independently, especially in medical, legal, financial, or research-heavy work.
6. Context Windows
A model can only "see" a limited amount of information in one request. That limit is called the context window. The larger the window, the more text the model can handle at once.
Bigger context windows can help with long documents and extended tasks, but they also affect cost, latency, and sometimes quality. Bigger is not automatically better.
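What happens when a conversation outgrows the window? A common, simple strategy is to drop the oldest tokens so the request fits - sketched below with an arbitrary window size (real windows run to many thousands of tokens, and production systems often summarise rather than simply truncate):

```python
# Arbitrary window size for illustration only.
CONTEXT_WINDOW = 8

def fit_to_window(tokens, window=CONTEXT_WINDOW):
    # Keep only the most recent tokens that fit in the window.
    return tokens[-window:]

history = [f"tok{i}" for i in range(12)]
visible = fit_to_window(history)
print(len(visible))  # 8 - the model never "sees" the first 4 tokens
```

This is why very long chats can "forget" their beginnings: the early tokens are no longer in the window at all.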
Part 2 Summary
- LLMs predict the next token - that is the core mechanism behind GPT, Claude, Gemini, and similar models.
- Text becomes tokens before the model can process it.
- Training is staged - broad pre-training first, then instruction shaping and preference tuning.
- Transformers enabled modern AI because attention made longer-context reasoning and large-scale training practical.
- Hallucinations are structural - models generate plausible text, not guaranteed truth.
- Context windows act like working memory and come with cost and speed trade-offs.