What are Logits in Machine Learning and Why They Matter
TLDR: Logits are the raw, unnormalized scores produced by the final layer of a neural network, before any probability transformation. Softmax converts them to probabilities. Temperature scales them before Softmax to control output randomness.
The Confidence Meter Before Calibration
A doctor looks at an X-ray and assigns a raw "gut score": Cancer likelihood = 8.2, Normal = 1.4. These numbers are not percentages; they haven't been normalized yet. To get percentages, she runs them through a calibration formula.
Logits are those raw gut scores. The Softmax function is the calibration formula.
From Network Output to Prediction
A classification network predicting image labels outputs raw scores:
Raw output (logits):
Cat: 4.5
Dog: -1.2
Bird: 0.8
These numbers have no fixed scale. A "4.5" just means "more confident Cat than the others."
Softmax converts logits to probabilities:
$$P_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$
Applied to the example:
| Class | Logit | e^z | Probability |
|---|---|---|---|
| Cat | 4.5 | 90.02 | ~97.3% |
| Dog | -1.2 | 0.30 | ~0.3% |
| Bird | 0.8 | 2.23 | ~2.4% |
The exponentials sum to 92.55, so Cat gets 90.02 / 92.55 ≈ 97.3%. The model is about 97% confident the image is a cat.
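The arithmetic above can be checked with a few lines of plain Python (the logits are the example values from this section):

```python
import math

def softmax(logits):
    """Convert raw logits to a probability distribution summing to 1."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = {"Cat": 4.5, "Dog": -1.2, "Bird": 0.8}
probs = softmax(list(logits.values()))
for name, p in zip(logits, probs):
    print(f"{name}: {p:.1%}")
# Cat ~97.3%, Dog ~0.3%, Bird ~2.4%
```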
Why Not Output Probabilities Directly?
- Training stability: Working with logits and applying log-Softmax is numerically more stable than computing probabilities first, then taking log for cross-entropy loss.
- Flexibility: The same logits can be processed differently (Softmax for classification, sigmoid for multi-label, raw for regression).
- Temperature scaling: You can reshape the distribution from the logits before applying Softmax, which you couldn't do if the network directly output probabilities.
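A minimal sketch of the stability point: computing probabilities first and then taking the log can overflow or underflow, while the log-sum-exp trick (the idea behind log-Softmax implementations) stays finite. The helper name here is illustrative, not any particular library's API:

```python
import math

def log_softmax_stable(logits):
    """Stable log-Softmax: shift by the max logit so exp() never sees a huge argument."""
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - log_sum for z in logits]

logits = [1000.0, 0.0]

# Naive route: exp(1000) overflows before we ever reach the log().
try:
    naive = [math.exp(z) for z in logits]
except OverflowError:
    naive = None  # the direct path fails on extreme logits

stable = log_softmax_stable(logits)
print(naive)   # None -> overflow
print(stable)  # [0.0, -1000.0] -> the stable path is fine
```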
Temperature: Reshaping the Logit Distribution
In language models, the vocabulary logits are scaled by a temperature $T$ before Softmax:
$$P_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
Effect of temperature on the cat/dog/bird example (logits: 4.5, -1.2, 0.8):
| Temperature | Cat P | Dog P | Bird P | Interpretation |
|---|---|---|---|---|
| T = 1.0 | ~97% | ~0.3% | ~2.4% | Standard |
| T = 0.5 | ~99.9% | ~0% | ~0.1% | Even more peaked on Cat |
| T = 2.0 | ~82% | ~5% | ~13% | Flatter: more uncertainty expressed |
| T → 0 | 100% | 0% | 0% | Greedy (argmax) |
Low T: Sharp distribution. High confidence. Repetitive in language models.
High T: Flat distribution. Diverse/random. Creative in language models.
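The temperature table can be reproduced with a short sketch (same example logits; Softmax re-implemented inline with the usual max-subtraction for stability):

```python
import math

def softmax_with_temperature(logits, T):
    """Divide logits by T, then apply Softmax."""
    scaled = [z / T for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.5, -1.2, 0.8]  # Cat, Dog, Bird
for T in (0.5, 1.0, 2.0):
    cat, dog, bird = softmax_with_temperature(logits, T)
    print(f"T={T}: Cat={cat:.1%} Dog={dog:.1%} Bird={bird:.1%}")
# Lower T concentrates mass on Cat; higher T spreads it out.
```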
```mermaid
flowchart LR
    Hidden["Hidden Layer Output"] --> Logits["Logit Layer\n(raw z_i scores)"]
    Logits --> Temp["÷ Temperature T"]
    Temp --> Softmax["Softmax\n→ probabilities P_i"]
    Softmax --> Sample["Sample or Argmax\n→ predicted class / next token"]
```
Logits in Different Contexts
| Context | Output Layer | Applied After | Purpose |
|---|---|---|---|
| Multi-class classification | Logit vector (num classes × 1) | Softmax | Pick one class |
| Multi-label classification | Logit vector | Sigmoid (per logit) | Multiple binary predictions |
| Language model | Logit vector (one per vocab token) | Softmax + Temperature | Sample next token |
| Binary classification | Single logit | Sigmoid | P(positive class) |
| Regression | Raw value (no normalization) | None | Continuous output |
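The multi-label row differs from the multi-class row in that each logit is squashed independently, so the outputs need not sum to 1. A sketch (the tag names and logit values are made up for illustration):

```python
import math

def sigmoid(z):
    """Squash a single logit into (0, 1) independently of the others."""
    return 1.0 / (1.0 + math.exp(-z))

# Multi-label: one independent yes/no decision per tag.
tag_logits = {"outdoor": 2.0, "animal": 3.1, "vehicle": -1.5}
tag_probs = {tag: sigmoid(z) for tag, z in tag_logits.items()}
predicted = [tag for tag, p in tag_probs.items() if p > 0.5]
print(predicted)  # -> ['outdoor', 'animal'] (several tags can be active at once)
```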
Summary
- Logits = raw, unnormalized scores from the final linear layer of a neural network.
- Softmax converts logits to a probability distribution summing to 1.
- Temperature divides logits before Softmax: low T → peaked (confident), high T → flat (diverse).
- `CrossEntropyLoss` in PyTorch (and losses with `from_logits=True` in TensorFlow) takes logits directly: don't apply Softmax before passing to the loss function; it's computed internally for numerical stability.
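What such a loss does internally can be sketched as: apply a stable log-Softmax to the raw logits, then take the negative log-probability of the true class. This mirrors the idea, not the actual PyTorch source:

```python
import math

def cross_entropy_from_logits(logits, target):
    """Negative log-probability of the target class, computed from raw logits."""
    m = max(logits)  # log-sum-exp trick for numerical stability
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[target]  # == -log_softmax(logits)[target]

loss = cross_entropy_from_logits([4.5, -1.2, 0.8], target=0)
print(round(loss, 4))  # small loss (~0.028): the model already favors class 0 (Cat)
```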
Practice Quiz
A model outputs logits [4.5, -1.2, 0.8] for classes Cat, Dog, Bird. After Softmax, which class has the highest probability?
- A) Dog, because -1.2 is the outlier.
- B) Cat: the highest logit (4.5) maps to the highest Softmax probability.
- C) Bird: the middle logit is most "normal."
Answer: B
You apply temperature T=0.1 to language model logits before Softmax. What effect does this have?
- A) The model becomes more creative and unpredictable.
- B) The distribution becomes very peaked: the highest-logit token gets nearly all probability mass; output is near-deterministic.
- C) It has no effect when T < 1.
Answer: B
Why does PyTorch's `nn.CrossEntropyLoss` expect raw logits rather than Softmax-normalized probabilities?
- A) Logits require less memory.
- B) Computing log(Softmax(logits)) directly (LogSoftmax) is numerically more stable than computing Softmax first and then log, avoiding floating-point underflow.
- C) CrossEntropyLoss cannot process probabilities between 0 and 1.
Answer: B

Written by
Abstract Algorithms
@abstractalgorithms