Explainer 24 Feb 2026 6 min read

Temperature, Top-p, and the Inference
Settings Nobody Explains

Every AI API has a set of parameters you can tweak when you send a request — temperature, top-p, frequency penalty, max tokens. Most people either leave them at their defaults or change them randomly. Here is what they actually do.

Temperature

Temperature controls how "random" the model's output is. At each step, the model calculates a probability for every possible next token. Temperature reshapes that distribution: under the hood, the model's raw scores (logits) are divided by the temperature before being converted to probabilities.

  • Temperature 0: The model always picks the most probable next token. Output is deterministic (or near-deterministic). Good for factual tasks, code generation, data extraction.
  • Temperature 0.5–0.7: A moderate amount of randomness. The model mostly picks high-probability tokens but occasionally tries something less obvious. This is the sweet spot for most general tasks.
  • Temperature 1.0: The model uses its raw probability distribution as-is. More varied, more creative, more likely to surprise you — and more likely to say something odd.
  • Temperature > 1.0: The probability distribution gets flattened. Even low-probability tokens become viable. Useful for brainstorming. Unreliable for anything that needs to be correct.
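The scaling above can be sketched in a few lines of pure Python. This is a toy illustration with three candidate tokens (real models do this over a vocabulary of tens of thousands), and the temperature-0 case is handled as a greedy argmax, matching what most APIs do:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Convert raw logits to probabilities, scaled by temperature.

    Lower temperature sharpens the distribution (favouring the top
    token); higher temperature flattens it. Temperature 0 is treated
    as greedy argmax rather than division by zero.
    """
    if temperature == 0:
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Toy logits for three candidate tokens
logits = [2.0, 1.0, 0.1]
print(softmax_with_temperature(logits, 0.5))  # sharper: top token dominates
print(softmax_with_temperature(logits, 1.0))  # the raw distribution
print(softmax_with_temperature(logits, 1.5))  # flatter: others become viable
```

Running this shows the effect directly: the same logits give the top token a bigger share at 0.5 and a smaller share at 1.5.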

Rule of thumb:

Use temperature 0 for tasks where there is one right answer (code, maths, classification). Use 0.5–0.7 for general-purpose tasks. Use 0.8–1.0 for creative writing or brainstorming.

Top-p (Nucleus Sampling)

Top-p is another way to control randomness, but it works differently from temperature. Instead of scaling all probabilities, it sets a cumulative probability threshold.

With top-p = 0.9, the model only considers the smallest set of most-probable tokens whose cumulative probability reaches 90%. Everything else is excluded before sampling. This means that when the model is confident (one token has 95% probability), sampling is effectively greedy. When it is uncertain, more tokens stay in play.

Most APIs recommend using temperature or top-p, not both at once. Stacking them can produce unpredictable results.

Frequency Penalty & Presence Penalty

These two settings discourage the model from repeating itself.

  • Frequency penalty reduces the probability of a token proportionally to how many times it has already appeared. A word that has been used 5 times gets penalised more than one that has been used once.
  • Presence penalty applies a flat penalty to any token that has appeared at least once, regardless of how often. It encourages the model to introduce new topics rather than revisit old ones.

These are mainly useful for long-form generation. For short tasks (a few sentences), the defaults are fine and changing them is unlikely to help.
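As a sketch of the arithmetic (the exact formula varies by provider; this follows the additive form OpenAI documents, where penalties are subtracted from the logits before sampling):

```python
def apply_repetition_penalties(logits, generated_counts,
                               frequency_penalty=0.0, presence_penalty=0.0):
    """Adjust next-token logits: subtract frequency_penalty once per
    prior occurrence of a token, and presence_penalty once if the
    token has appeared at all."""
    adjusted = []
    for token_id, logit in enumerate(logits):
        count = generated_counts.get(token_id, 0)
        logit -= frequency_penalty * count          # scales with repetition
        if count > 0:
            logit -= presence_penalty               # flat, once per token
        adjusted.append(logit)
    return adjusted

# Token 0 has appeared 5 times, token 1 once, token 2 never
counts = {0: 5, 1: 1}
print(apply_repetition_penalties([2.0, 2.0, 2.0], counts,
                                 frequency_penalty=0.1, presence_penalty=0.5))
# token 0 is penalised most; token 2 is untouched
```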

Max Tokens

This sets the maximum number of tokens the model will generate in its response. It does not make the response longer — it just puts a hard cap on it. If the model would naturally finish in 200 tokens, setting max_tokens to 4000 will not force it to write more.

Set this to a reasonable upper bound for your use case. Too low and the response gets cut off. Too high and you risk paying for an unnecessarily long output if the model rambles.

Stop Sequences

A list of strings that, when generated, cause the model to immediately stop. Useful for structured outputs — for example, if you want the model to generate a single JSON object, you can set a stop sequence of "\n\n" to prevent it from generating commentary after the JSON.
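A simplified sketch of how a server might apply stop sequences to generated text (real implementations check token by token during generation; note the stop string itself is typically not included in the output):

```python
def apply_stop_sequences(text, stop_sequences):
    """Truncate text at the earliest occurrence of any stop sequence,
    excluding the stop sequence itself. Returns text unchanged if no
    stop sequence appears."""
    earliest = len(text)
    for stop in stop_sequences:
        idx = text.find(stop)
        if idx != -1 and idx < earliest:
            earliest = idx
    return text[:earliest]

raw = '{"name": "Ada"}\n\nHere is some commentary about the JSON...'
print(apply_stop_sequences(raw, ["\n\n"]))  # → '{"name": "Ada"}'
```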

Practical Recommendations

Use case            Temperature   Top-p   Notes
Code generation     0             —       You want the most probable (correct) code
Data extraction     0             —       Structured output, no creativity needed
General chat        0.7           0.9     Balanced: helpful but not robotic
Creative writing    0.9–1.0       0.95    More variety, more surprises
Brainstorming       1.0–1.2       1.0     Maximum diversity, expect some odd ones
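One convenient way to use these recommendations is as named presets merged into each request. The preset names and the `request_params` helper below are hypothetical; the parameter keys follow the common OpenAI-style naming, so check your provider's documentation:

```python
# Hypothetical presets based on the table above
PRESETS = {
    "code_generation":  {"temperature": 0.0},
    "data_extraction":  {"temperature": 0.0},
    "general_chat":     {"temperature": 0.7, "top_p": 0.9},
    "creative_writing": {"temperature": 1.0, "top_p": 0.95},
    "brainstorming":    {"temperature": 1.2, "top_p": 1.0},
}

def request_params(use_case, **overrides):
    """Look up a preset and merge in per-request overrides."""
    params = dict(PRESETS[use_case])
    params.update(overrides)
    return params

print(request_params("general_chat", max_tokens=500))
```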

Try It Yourself

Use our pricing calculator to estimate costs at different token lengths, or compare models to find the best fit for your workload.