A Guide to Pre-training Large Language Models
TLDR: Pre-training is the phase where an LLM learns "Language" and "World Knowledge" by reading petabytes of text. It uses Self-Supervised Learning to predict the next word in a sentence. This creates the "Base Model" which is later fine-tuned.
The Library Metaphor: What Pre-training Actually Does
Imagine teaching a child to read.
- Pre-training: You lock the child in a library for ten years. They read every book: grammar, history, math, code, recipes. They absorb the structure and content of human knowledge. But they have no social skills; they can't follow instructions or hold a polite conversation.
- Fine-tuning (next step): You hire a tutor to teach manners ("don't say harmful things") and specific tasks ("summarize this document").
Pre-training creates the Base Model, a powerful but raw artifact. Fine-tuning shapes it into a product (ChatGPT, Claude, Gemini).
Next-Token Prediction: The Self-Supervised Training Signal
The entire pre-training game is one question repeated billions of times:
"Given the text so far, what is the most likely next token?"
Input: "The capital of France is"
Target: "Paris"
This is self-supervised because the labels are already in the data: the target at each position is simply the next token in the text. No human annotation is needed.
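A minimal sketch of how these "free" labels are built, using a toy token list: input and target are the same sequence, shifted by one position.

```python
def make_training_pairs(tokens):
    """Build (context token, correct next token) pairs by shifting one position.

    No human labels are involved: the targets come from the text itself.
    """
    inputs = tokens[:-1]   # everything except the last token
    targets = tokens[1:]   # everything except the first token
    return list(zip(inputs, targets))

tokens = ["The", "capital", "of", "France", "is", "Paris"]
pairs = make_training_pairs(tokens)
# The final pair asks the model to predict "Paris" after seeing "is"
```

Real training uses token IDs over batches rather than word strings, but the shift-by-one construction is the same.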
The Loss Function: Cross-Entropy
$$L = -\sum_{t} \log P(x_t \mid x_{<t})$$
- $x_t$: the correct next token at step $t$
- $x_{<t}$: all preceding tokens
- $P(x_t \mid x_{<t})$: the model's probability for the correct token
The model is penalized for assigning low probability to the correct next word. Minimizing $L$ over trillions of tokens forces the model to learn grammar, facts, and reasoning patterns.
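To make the penalty concrete, here is a toy calculation (hypothetical vocabulary and probabilities) of the per-token loss $-\log P(x_t \mid x_{<t})$:

```python
import math

def token_loss(probs, correct_token):
    """Cross-entropy for one step: negative log probability of the correct token."""
    return -math.log(probs[correct_token])

# Hypothetical model distribution over a 3-word vocabulary
probs = {"Paris": 0.7, "Lyon": 0.2, "London": 0.1}

confident = token_loss(probs, "Paris")   # model put 0.7 on the right answer
wrong = token_loss(probs, "London")      # loss if "London" had been correct
# confident is small (~0.36); wrong is large (~2.30)
```

Assigning high probability to the correct token drives the loss toward zero; assigning low probability makes it blow up, which is exactly the gradient signal pre-training runs on.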
The Data Pipeline: From the Web to a Training Run
```mermaid
flowchart LR
    A[Common Crawl\nBooks3 / GitHub / arXiv] --> B[Deduplication]
    B --> C[Quality Filtering]
    C --> D[Tokenization]
    D --> E[Packed Sequences\n2K–128K tokens]
    E --> F[Training Shards\non Object Storage]
    F --> G[GPU Cluster]
```
| Stage | What happens | Why it matters |
| --- | --- | --- |
| Deduplication | Remove near-duplicate pages | Prevents memorization of repeated text |
| Quality filter | Remove boilerplate, low-quality HTML | Improves token efficiency |
| Tokenization | BPE / SentencePiece | Compresses text; handles rare words |
| Packing | Fill context windows to capacity | Maximizes GPU utilization |
Training data typically includes Common Crawl (web text), Books3, GitHub code, arXiv papers, and Wikipedia. The mix ratio shapes model strengths.
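The packing stage in the table above can be sketched as a greedy packer, assuming documents have already been tokenized to ID lists and the context length is fixed:

```python
def pack_sequences(docs, context_len):
    """Concatenate tokenized docs into fixed-size training windows.

    Windows are filled to exactly `context_len` tokens so no GPU compute
    is wasted on padding; leftover tokens in the buffer are dropped here
    (real pipelines carry them into the next shard).
    """
    buffer, windows = [], []
    for doc in docs:
        buffer.extend(doc)
        while len(buffer) >= context_len:
            windows.append(buffer[:context_len])
            buffer = buffer[context_len:]
    return windows

docs = [[1, 2, 3], [4, 5], [6, 7, 8, 9]]
windows = pack_sequences(docs, context_len=4)
# Two full windows; the trailing token 9 stays in the buffer
```

Production packers also insert document-separator tokens between docs so the model learns where one document ends and the next begins; that detail is omitted here for brevity.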
Inside the Training Loop: Loss, Gradients, and Checkpoints
The training loop looks like this in pseudocode:
```python
for step, batch in enumerate(training_data):
    tokens = tokenize(batch)
    logits = model(tokens[:-1])               # predict all positions
    loss = cross_entropy(logits, tokens[1:])  # compare to true next tokens
    loss.backward()                           # compute gradients
    optimizer.step()                          # update weights
    optimizer.zero_grad()                     # reset gradients for the next batch
    if step % checkpoint_interval == 0:
        save_checkpoint(model)
```
In practice, training runs on thousands of GPUs or TPUs for weeks to months, using advanced parallelism strategies (data parallelism, tensor parallelism, pipeline parallelism).
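Data parallelism, the simplest of the strategies mentioned above, replicates the model on every worker, gives each worker a different shard of the batch, and averages the gradients before each weight update. A toy sketch of that averaging step (hypothetical gradient values, no real framework):

```python
def average_gradients(per_worker_grads):
    """The all-reduce step of data parallelism: elementwise mean of gradients.

    Each inner list holds one worker's gradient for the same parameters,
    so averaging position-by-position yields the gradient of the full batch.
    """
    n = len(per_worker_grads)
    return [sum(g) / n for g in zip(*per_worker_grads)]

# Hypothetical gradients for 3 parameters, computed on 2 workers
worker_a = [0.2, -0.4, 0.1]
worker_b = [0.4, -0.2, 0.3]
avg = average_gradients([worker_a, worker_b])
# Every worker then applies the same averaged update, keeping replicas in sync
```

Tensor and pipeline parallelism instead split the model itself across devices; they are needed once a single model no longer fits in one GPU's memory.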
What a Base Model Can and Cannot Do
| Can do | Cannot do |
| --- | --- |
| Complete text in context | Follow instructions reliably |
| Summarize if prompted cleverly | Refuse harmful requests |
| Write code that is syntactically plausible | Admit when it doesn't know |
| Translate languages | Have a consistent helpful persona |
A base model will happily continue any text you give it, including harmful content. Fine-tuning with SFT or RLHF shapes it into a helpful, harmless assistant.
Cost, Carbon, and the Scaling Trap
Training a frontier-scale LLM (GPT-4, LLaMA 3 70B) requires:
- Compute: thousands of H100 GPUs running for months
- Cost: $50M–$100M+ per frontier run
- Energy: significant carbon footprint
The key trade-offs:
- Data scale vs data quality: more tokens help, but noisy corpora have diminishing returns
- Larger model vs smaller but high-quality: a well-filtered 7B model can outperform a poorly trained 70B
- Pre-training breadth vs fine-tuning depth: broad pre-training creates a flexible base; fine-tuning sharpens it for specific tasks
Few organizations can afford to pre-train from scratch. Most practitioners work with open base models (LLaMA, Mistral, Qwen) and apply LoRA fine-tuning.
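LoRA, mentioned above, freezes the pre-trained weight matrix W and learns only a low-rank update BA, so the effective weight is W + BA. A minimal numerical sketch with toy matrices (plain Python, not a real training setup):

```python
def matmul(X, Y):
    """Multiply two matrices given as nested lists."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*Y)] for row in X]

def add(X, Y):
    """Elementwise sum of two same-shaped matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(X, Y)]

W = [[1.0, 0.0], [0.0, 1.0]]   # frozen base weight (2x2), never updated
B = [[0.5], [0.0]]             # trainable, 2x1 (rank r = 1)
A = [[0.0, 1.0]]               # trainable, 1x2

delta = matmul(B, A)           # low-rank update BA (2x2, but only r*(2+2) params)
W_eff = add(W, delta)          # weight actually used at inference: W + BA
```

The point of the rank bottleneck: for a d×d weight, full fine-tuning trains d² parameters, while LoRA trains only r·2d, which is tiny when r ≪ d.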
Key Takeaways
- Pre-training = self-supervised learning on massive text corpora using next-token prediction.
- The loss is cross-entropy; minimizing it forces the model to learn grammar, facts, and reasoning.
- The result is a Base Model: capable but unaligned. Fine-tuning is required for product use.
- The data pipeline (dedup → filter → tokenize → pack) is as important as the model architecture.
- Most practitioners never pre-train from scratch; they fine-tune existing open models.
Test Your Understanding
- Why is next-token prediction called "self-supervised"?
- What does the cross-entropy loss penalize the model for?
- Why is deduplication important before training?
- What is the difference between a base model and a fine-tuned assistant?
Written by Abstract Algorithms (@abstractalgorithms)