Temperature and top-p are the two parameters people most often adjust when using language models, yet they are frequently adjusted wrong because what these numbers actually control is misunderstood. They are not creativity dials; they are probability distribution shapers.

Foundation: How Models Choose Words

At each step of generation, the model produces a probability distribution over its entire vocabulary — typically 32,000 to 128,000 tokens. Each token gets a probability. The model then samples from this distribution to choose the next token.

Without any modification, the distribution might look like: "the" (15%), "a" (8%), "this" (6%), "our" (4%), and so on through thousands of tokens with tiny probabilities. Sampling from this raw distribution produces reasonable but slightly random text.
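This two-step process, scores in, probabilities out, then a weighted draw, can be sketched in a few lines. The vocabulary and logit values below are invented for illustration; a real model produces scores over tens of thousands of tokens.

```python
import math
import random

# Toy logits (pre-softmax scores) for a tiny made-up vocabulary
logits = {"the": 2.0, "a": 1.4, "this": 1.1, "our": 0.7, "cat": -1.0}

# Softmax: exponentiate each score and normalize so they sum to 1
exps = {tok: math.exp(score) for tok, score in logits.items()}
total = sum(exps.values())
probs = {tok: e / total for tok, e in exps.items()}

# Sample the next token from the resulting distribution
tokens, weights = zip(*probs.items())
next_token = random.choices(tokens, weights=weights, k=1)[0]
print(next_token)
```

Each run can print a different token: "the" comes up most often, but lower-probability tokens are drawn occasionally, which is the "slightly random" behavior described above.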

Temperature: Sharpening or Flattening

Temperature divides the raw logits (the pre-softmax scores) before the probability calculation. Values below 1 magnify the gaps between scores, sharpening the distribution so probability concentrates on the most likely tokens; values above 1 shrink those gaps, flattening the distribution and pushing probability toward the tail.
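A minimal sketch of that division, using invented logits. The top token's probability should fall as temperature rises:

```python
import math

def softmax(scores):
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # toy pre-softmax scores, not from a real model

for temperature in (0.5, 1.0, 2.0):
    # Temperature divides every logit before the softmax
    scaled = [s / temperature for s in logits]
    probs = softmax(scaled)
    print(temperature, [round(p, 3) for p in probs])
```

With these numbers the top token gets roughly 87% of the mass at temperature 0.5, 67% at 1.0, and 51% at 2.0: the same model output, reshaped three different ways.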

Top-P (Nucleus Sampling): Cutting the Tail

Top-p takes a different approach. Instead of reshaping the entire distribution, it sorts tokens by probability and keeps only the smallest set whose cumulative probability reaches the threshold p. That set, the "nucleus," is renormalized and sampled from; everything in the tail gets zero probability.
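Here is a sketch of that filtering step, again with a made-up distribution. The sort, cumulative cutoff, and renormalization are the whole algorithm:

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability reaches p,
    then renormalize so the kept probabilities sum to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        cumulative += prob
        if cumulative >= p:
            break  # the nucleus is complete; the remaining tail is dropped
    total = sum(prob for _, prob in kept)
    return {token: prob / total for token, prob in kept}

# Toy distribution (illustrative numbers)
probs = {"the": 0.50, "a": 0.25, "this": 0.15, "our": 0.07, "cat": 0.03}
nucleus = top_p_filter(probs, p=0.8)
print(nucleus)  # "our" and "cat" fall outside the 0.8 nucleus
```

Note that the nucleus size adapts to the distribution: a confident model might need only one or two tokens to reach p, while an uncertain one keeps dozens. That adaptivity is the main argument for top-p over a fixed top-k cutoff.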

Practical Application

For most applications, set one parameter and leave the other at its default.

Common Misconceptions

"Higher temperature means more creative"

Higher temperature means more random, not more creative. Creativity requires coherent novelty; temperature above 1.0 typically produces incoherent novelty, which is not creative, just noisy.

Frequently Asked Questions

Should I adjust both temperature and top-p?

Generally no. They interact in complex ways. Adjust one and leave the other at default. If you must adjust both, use lower values for each than you would individually.

Why does temperature 0 sometimes give different outputs?

Floating-point rounding and batch-dependent computation order can cause near-ties between top tokens to resolve differently across runs. True temperature-0 (greedy) decoding should be deterministic, but these implementation details sometimes introduce minor variation.
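The root cause is that floating-point addition is not associative: summing the same numbers in a different order, which batching and parallel reduction routinely do, can change the result by a rounding error. When two top logits are within that error of each other, the argmax can flip. A standalone illustration:

```python
# Floating-point addition is not associative: the same three numbers
# summed in a different order round differently.
a, b, c = 0.1, 0.2, 0.3
left = (a + b) + c
right = a + (b + c)
print(left == right)   # False
print(left, right)
```

Here `left` is 0.6000000000000001 and `right` is exactly 0.6. A difference this small is harmless in isolation, but if it separates the top two logits in a greedy decode, the "deterministic" output changes.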