LLMs aren't just black boxes that output text. They have settings you can adjust to control their behavior. Understanding these settings helps you get better, more reliable results.
Temperature: creativity vs predictability
Temperature is the most important setting. It controls how "creative" or "random" the model's responses are.
How it works
Remember: the model generates text by choosing the next token based on probability.
Temperature modifies those probabilities:
- Low temperature (0-0.3): sharpens the distribution, so the highest-probability tokens dominate (at 0, the model effectively always picks the top token)
- High temperature (0.7-1.0+): flattens the distribution, so lower-probability tokens get sampled more often
Next token probabilities for "The cat":
"sat": 60% "jumped": 25% "meowed": 10% "flew": 5%
Temperature = 0.1 (low):
→ Almost always picks "sat"
Temperature = 0.8 (high):
→ Might pick "jumped" or even "meowed"
Temperature = 1.5 (very high):
→ Might pick "flew" (nonsense but creative!)
When to use different temperatures
| Temperature | Best For | Why |
|---|---|---|
| 0.0-0.2 | Coding, math, fact-checking | Consistent, predictable answers |
| 0.3-0.5 | Technical writing, analysis | Balanced accuracy and variety |
| 0.6-0.8 | General conversation, brainstorming | Natural, varied responses |
| 0.9-1.2 | Creative writing, fiction | Unexpected, imaginative outputs |
| 1.3+ | Artistic experimentation | Often incoherent but surprising |
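The reshaping temperature performs can be sketched in a few lines of Python. This is a toy model of the math, not any provider's actual implementation:

```python
import math

def apply_temperature(probs, temperature):
    """Rescale a next-token distribution by temperature.

    Converts probabilities to logits, divides by temperature, and
    re-applies softmax. Low temperature sharpens the distribution;
    high temperature flattens it.
    """
    logits = [math.log(p) / temperature for p in probs.values()]
    z = sum(math.exp(l) for l in logits)
    return {tok: math.exp(l) / z for tok, l in zip(probs, logits)}

# The "The cat" example from above
probs = {"sat": 0.60, "jumped": 0.25, "meowed": 0.10, "flew": 0.05}

cold = apply_temperature(probs, 0.1)  # "sat" gets nearly all the mass
hot = apply_temperature(probs, 1.5)   # mass spreads toward "flew"
```

At temperature 0.1, "sat" ends up with well over 99% of the probability; at 1.5, "flew" becomes noticeably more likely than its original 5%.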
Examples
Prompt: "Write a haiku about programming"
Temperature 0.2:
Code runs through the night
Functions call and logic flows
Program executes
Temperature 0.8:
Semicolons dance
Through loops of midnight coffee
Bugs become features
Temperature 1.5:
Quantum bytes sing songs
Neon dragons debug stars
Syntax tastes purple
Top-p (Nucleus Sampling)
Top-p is an alternative to temperature. Instead of adjusting probabilities, it restricts which tokens can be chosen.
How it works
- Keep the smallest set of top tokens whose cumulative probability reaches p
- Then sample from those tokens
Top-p = 0.9:
Token probabilities:
"sat": 50% (cumulative: 50%) ✓
"jumped": 30% (cumulative: 80%) ✓
"meowed": 15% (cumulative: 95%) ✓
"flew": 5% (cumulative: 100%) ✗
Only consider "sat", "jumped", "meowed" (95% of probability mass)
Top-p vs Temperature
Top-p advantages:
- More intuitive ("use the top 90% of likely words")
- Automatically adapts to the confidence of the model
- Often produces more coherent results
Common combination:
- Temperature: 0.7
- Top-p: 0.9
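The nucleus-selection step described above can be sketched directly (a simplified illustration, not a production sampler):

```python
def nucleus(probs, top_p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches top_p, then renormalize so the kept set sums to 1."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p
        cumulative += p
        if cumulative >= top_p:
            break
    total = sum(kept.values())
    return {token: p / total for token, p in kept.items()}

# The "The cat" example from above, with top_p = 0.9
probs = {"sat": 0.50, "jumped": 0.30, "meowed": 0.15, "flew": 0.05}
print(nucleus(probs, 0.9))  # "flew" is cut; the rest are renormalized
```

Note how the cutoff adapts: if the model were very confident ("sat" at 95%), the nucleus would contain just one token, which is why top-p tracks model confidence automatically.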
Max tokens (max length)
Max tokens limits how long the response can be.
Why use it?
- Cost control: You're charged per token
- Prevent runaway: Some prompts can trigger infinite loops
- Force conciseness: "Explain in max 100 tokens"
- API limits: Some endpoints have maximum response sizes
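Since billing and limits are counted in tokens, a rough budget check helps before sending a request. This uses the ~3/4-word-per-token rule of thumb mentioned earlier; for exact counts you'd use the provider's tokenizer (e.g. OpenAI's tiktoken library):

```python
def estimate_tokens(text):
    """Ballpark token estimate: 1 token is roughly 0.75 words.
    Only an approximation -- use the provider's tokenizer for
    exact counts before relying on this for billing."""
    return round(len(text.split()) / 0.75)

prompt = "Summarize this article in three short bullet points."
print(estimate_tokens(prompt))  # 8 words -> ~11 tokens
```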
Examples
Prompt: "Summarize this article"
Max tokens: 150
→ Forces a brief summary
Prompt: "Explain quantum computing"
Max tokens: 2000
→ Allows comprehensive explanation
Penalty settings
Two settings discourage certain behaviors:
Frequency penalty
Reduces repetition of the same words/phrases.
- 0.0: No penalty (can repeat freely)
- 0.5-1.0: Moderate penalty (reduces repetition)
- 2.0+: Strong penalty (forces variety)
Use case: Long-form writing where you don't want the model to keep using the same phrases.
Presence penalty
Encourages the model to talk about new topics.
- 0.0: No penalty (can stay on topic)
- 0.5-1.0: Encourages topic shifts
- 2.0+: Forces constant topic changes
Use case: Brainstorming sessions where you want diverse ideas.
Difference between them
| Penalty | What it punishes | Effect |
|---|---|---|
| Frequency | Scales with how OFTEN a token has already appeared | Strongly discourages "very very very"-style repetition |
| Presence | Flat, one-time penalty once a token has appeared at ALL | Nudges the model toward new words and topics |
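The difference is easiest to see as a logit adjustment. This is a simplified sketch of the formula OpenAI documents for its penalties (logit minus count-scaled frequency term minus one-time presence term):

```python
def penalize(logit, count, frequency_penalty, presence_penalty):
    """Adjust a token's logit based on its appearances so far.

    - frequency_penalty scales with the COUNT of prior appearances
    - presence_penalty is a flat, one-time cost once count > 0
    """
    return (logit
            - count * frequency_penalty
            - (1 if count > 0 else 0) * presence_penalty)

# A token that already appeared 3 times is hit much harder by the
# frequency term than one that appeared once:
print(penalize(2.0, 3, 0.5, 0.3))  # 2.0 - 1.5 - 0.3 = 0.2
print(penalize(2.0, 1, 0.5, 0.3))  # 2.0 - 0.5 - 0.3 = 1.2
```

An unseen token (count 0) is untouched by both penalties, which is why presence penalty only matters once a word has shown up at all.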
Recommended settings by task
Code generation
Temperature: 0.1-0.2
Top-p: 0.95
Max tokens: 2000
Frequency penalty: 0
Presence penalty: 0
- Low temp for consistency
- High top-p to allow necessary syntax variations
Technical documentation
Temperature: 0.3-0.4
Top-p: 0.9
Max tokens: 1500
Frequency penalty: 0.2
Presence penalty: 0.1
- Balanced accuracy and readability
- Slight penalties to avoid repetitive language
Creative writing
Temperature: 0.8-1.0
Top-p: 0.9
Max tokens: 2000
Frequency penalty: 0.5
Presence penalty: 0.3
- Higher temp for creativity
- Penalties to maintain varied language
Brainstorming
Temperature: 0.9-1.1
Top-p: 0.95
Max tokens: 1000
Frequency penalty: 0.3
Presence penalty: 0.6
- High temp and presence penalty for diverse ideas
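The task presets above can be collected into a small lookup table so request code picks settings by task name. Values are taken from the recommendations above (using the midpoint where a range is given); tune them for your model:

```python
PRESETS = {
    "code": {"temperature": 0.2, "top_p": 0.95, "max_tokens": 2000,
             "frequency_penalty": 0.0, "presence_penalty": 0.0},
    "docs": {"temperature": 0.35, "top_p": 0.9, "max_tokens": 1500,
             "frequency_penalty": 0.2, "presence_penalty": 0.1},
    "creative": {"temperature": 0.9, "top_p": 0.9, "max_tokens": 2000,
                 "frequency_penalty": 0.5, "presence_penalty": 0.3},
    "brainstorm": {"temperature": 1.0, "top_p": 0.95, "max_tokens": 1000,
                   "frequency_penalty": 0.3, "presence_penalty": 0.6},
}

def settings_for(task):
    """Return a copy of the preset, ready to splat into an API call."""
    return dict(PRESETS[task])

# e.g.: client.chat.completions.create(..., **settings_for("code"))
```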
Where to set these
OpenAI API:
response = client.chat.completions.create(
model="your-model-id", # check docs for current models
messages=[{"role": "user", "content": "Explain Python"}],
temperature=0.3,
max_tokens=500,
top_p=0.9,
frequency_penalty=0.2,
presence_penalty=0.1
)
Claude API:
response = client.messages.create(
model="your-model-id", # e.g. "claude-sonnet-4-6", check docs for current models
messages=[{"role": "user", "content": "Explain Python"}],
temperature=0.3,
max_tokens=500
)
ChatGPT web interface:
- Click your profile
- "Customize ChatGPT"
- Limited options available
These settings give you fine-grained control over LLM behavior. Start with the defaults, then experiment. For most tasks, a temperature of 0.3-0.7 and a top-p of 0.9-0.95 work well. Adjust based on whether you need precision (lower) or creativity (higher).