Why Embeddings Matter: Solving Key Issues in Data Representation
How do computers understand that 'King' - 'Man' + 'Woman' = 'Queen'? Embeddings convert words into dense numerical vectors whose geometry captures meaning.
TLDR: Embeddings convert words (and images, users, products) into dense numerical vectors in a geometric space where semantic similarity = geometric proximity. "King - Man + Woman ≈ Queen" is not magic — it is an arithmetic property of well-trained embeddings.
📖 The One-Hot Problem: Numbers That Know Nothing
Before embeddings, machines represented words as one-hot vectors:
Vocabulary: [cat, dog, fish, car, truck]
"cat" = [1, 0, 0, 0, 0]
"dog" = [0, 1, 0, 0, 0]
"fish" = [0, 0, 1, 0, 0]
Problems:
- Sparse: 50,000-word vocab = 50,000-dimensional vectors that are 99.998% zeros.
- No similarity: The machine sees "cat" and "dog" as equally distant as "cat" and "car". Every pair of distinct one-hot vectors is orthogonal, so nothing in the representation captures that cats and dogs are both pets.
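You can verify the orthogonality problem directly. A minimal NumPy sketch using the 5-word vocabulary above:

```python
import numpy as np

# One-hot vectors for the vocabulary [cat, dog, fish, car, truck]
cat = np.array([1, 0, 0, 0, 0], dtype=float)
dog = np.array([0, 1, 0, 0, 0], dtype=float)
car = np.array([0, 0, 0, 1, 0], dtype=float)

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Every pair of distinct one-hot vectors is orthogonal:
print(cosine(cat, dog))  # 0.0 — "cat" is no closer to "dog"...
print(cosine(cat, car))  # 0.0 — ...than it is to "car"
```

The dot product of any two distinct one-hot vectors is zero, so similarity is identically zero for every word pair: the representation carries no semantic signal at all.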
Embeddings solve both problems.
🔢 Dense Vectors: Coordinates in Meaning Space
An embedding represents a word as a dense low-dimensional vector (e.g., 300 dimensions):
"cat" → [0.8, -0.3, 0.6, 0.1, ...] (300 values, all non-zero)
"dog" → [0.7, -0.2, 0.5, 0.2, ...] (similar to cat)
"car" → [-0.1, 0.9, -0.4, 0.8, ...] (different region of space)
Cosine similarity between cat and dog: ~0.92 (very close).
Cosine similarity between cat and car: ~0.1 (far apart).
The model has learned that cats and dogs live in the same region of the 300D semantic space.
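The same cosine computation on dense vectors shows the contrast. This sketch reuses the first four (illustrative) components of the hypothetical 300-d vectors above; the exact similarity values depend on the full vectors, so treat the numbers as directional, not exact:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Illustrative 4-d slices of the hypothetical 300-d embeddings above
cat = np.array([0.8, -0.3, 0.6, 0.1])
dog = np.array([0.7, -0.2, 0.5, 0.2])
car = np.array([-0.1, 0.9, -0.4, 0.8])

print(round(cosine(cat, dog), 2))  # high: same region of meaning space
print(round(cosine(cat, car), 2))  # low/negative: different region
```

Unlike one-hot vectors, dense embeddings give a graded notion of similarity: every pair of words gets a meaningful score, not just 0 or 1.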
⚙️ The Learning Principle: "You Shall Know a Word by the Company It Keeps"
This is Firth's distributional hypothesis (1957), and it is the foundation of word2vec, GloVe, and modern LLM embeddings.
Training signal: Predict surrounding words.
Context window: "... feeds the ___ every morning ..."
Target word: "cat"
Words that appear in similar contexts get similar representations. Cat and dog both appear near "pet," "feed," "vet," "collar" → their vectors converge in the training process.
Word2Vec (skip-gram) objective: Given a word $w$, maximize the probability of observing its context words $c_i$:
$$\max \sum_{(w, c) \in \text{corpus}} \log P(c | w)$$
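One common way to parameterize $P(c \mid w)$ in skip-gram is a softmax over dot products between two learned matrices: "input" embeddings $v_w$ for target words and "output" embeddings $u_c$ for context words. A minimal sketch with a hypothetical toy vocabulary and random (untrained) weights:

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["cat", "dog", "feeds", "the", "morning"]  # toy vocabulary (assumption)
V, d = len(vocab), 8                               # vocab size, embedding dim

W_in = rng.normal(size=(V, d))    # target-word embeddings v_w
W_out = rng.normal(size=(V, d))   # context-word embeddings u_c

def p_context_given_word(w_idx):
    """Skip-gram softmax: P(c|w) = exp(u_c . v_w) / sum_c' exp(u_c' . v_w)."""
    scores = W_out @ W_in[w_idx]
    exp = np.exp(scores - scores.max())   # numerically stable softmax
    return exp / exp.sum()

probs = p_context_given_word(vocab.index("cat"))
print(np.round(probs, 3))  # a probability distribution over context words
```

Training adjusts `W_in` and `W_out` so that observed (word, context) pairs get high probability; in practice the full softmax is replaced by negative sampling or hierarchical softmax for efficiency.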
🧠 Vector Arithmetic: Why "King - Man + Woman ≈ Queen"
Because semantically coherent relationships are encoded as directions in embedding space:
direction("Man" → "King") ≈ direction("Woman" → "Queen")
= vector("King") - vector("Man") ≈ vector("Queen") - vector("Woman")
Rearranging: $$\text{vector("Queen")} \approx \text{vector("King")} - \text{vector("Man")} + \text{vector("Woman")}$$
The "royalty" concept is a direction. The "gender" flip is another direction. These directions stay geometrically consistent across thousands of analogies because they are all learned from the same co-occurrence statistics.
```mermaid
flowchart LR
    King["King"] -->|subtract| ManAxis["— Man direction"]
    ManAxis -->|add| WomanAxis["+ Woman direction"]
    WomanAxis --> Queen["≈ Queen<br/>(nearest neighbor in embedding space)"]
```
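The analogy can be demonstrated end-to-end with hand-built toy vectors. The values below are an assumption chosen so that dimension 0 acts as a "royalty" axis and dimension 1 as a "gender" axis; real embeddings distribute these concepts across many dimensions:

```python
import numpy as np

# Hypothetical 3-d embeddings: dim 0 ~ "royalty", dim 1 ~ "gender" (assumption)
emb = {
    "king":  np.array([0.9,  0.8, 0.1]),
    "queen": np.array([0.9, -0.8, 0.1]),
    "man":   np.array([0.1,  0.8, 0.3]),
    "woman": np.array([0.1, -0.8, 0.3]),
    "car":   np.array([-0.5, 0.0, 0.9]),
}

def nearest(vec, exclude=()):
    """Return the vocabulary word whose embedding has highest cosine similarity."""
    def cos(a, b):
        return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
    return max((w for w in emb if w not in exclude), key=lambda w: cos(vec, emb[w]))

target = emb["king"] - emb["man"] + emb["woman"]
print(nearest(target, exclude={"king", "man", "woman"}))  # queen
```

Note the `exclude` set: in practice the analogy query words themselves are removed from the candidate pool, since "king" is often the raw nearest neighbor of the target vector.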
⚙️ Embeddings in Production: Not Just Words
Modern embeddings go far beyond words:
| Input Type | Embedding Model | Application |
| --- | --- | --- |
| Text | BERT, Sentence-BERT, OpenAI text-embedding-3 | Semantic search, RAG, classification |
| Images | CLIP, ViT | Image search, visual Q&A, multimodal retrieval |
| Users | Collaborative filtering embeddings | Recommendation systems (Netflix, Spotify) |
| Products | Catalog embeddings | "Customers who bought X also bought Y" |
| Code | OpenAI Codex embeddings | Semantic code search |
Vector databases (Pinecone, Weaviate, Milvus, pgvector) store billions of embedding vectors and support approximate nearest neighbor (ANN) search — the "find the most semantically similar documents" query at scale.
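At small scale, nearest-neighbor search is just a matrix-vector product over normalized embeddings. The brute-force sketch below is the exact baseline that ANN indexes (e.g., HNSW or IVF, as used inside the vector databases named above) approximate in sublinear time; the corpus here is random data standing in for real document embeddings:

```python
import numpy as np

rng = np.random.default_rng(42)
d = 128
corpus = rng.normal(size=(10_000, d))                     # stand-in doc embeddings
corpus /= np.linalg.norm(corpus, axis=1, keepdims=True)   # normalize once, up front

def top_k(query, k=10):
    """Exact top-k by cosine similarity: O(N*d) per query."""
    q = query / np.linalg.norm(query)
    scores = corpus @ q                # cosine similarity via dot product
    return np.argsort(-scores)[:k]    # indices of the k most similar docs

hits = top_k(rng.normal(size=d))
print(hits.shape)  # (10,)
```

Brute force is fine up to a few million vectors; beyond that, the linear scan per query is what ANN indexes exist to avoid, trading a small amount of recall for orders-of-magnitude faster queries.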
⚖️ One-Hot vs. Dense Embeddings: The Full Picture
| Property | One-Hot | Dense Embedding |
| --- | --- | --- |
| Dimensionality | Equal to vocabulary size (50K+) | Fixed low dimension (128–1536) |
| Sparsity | ~100% sparse | Dense (all values non-zero) |
| Semantic similarity | Not captured | Captured via geometric distance |
| Computation | High (huge sparse vectors) | Efficient (small dense vectors) |
| Supports analogies (King-Man+Woman) | ❌ | ✅ |
| Requires training | ❌ (constructed) | ✅ (learned from data) |
📌 Summary
- One-hot encoding is sparse and captures no semantic similarity.
- Embeddings are dense vectors learned from co-occurrence statistics; semantically similar items are geometrically close.
- Firth's hypothesis: "You shall know a word by the company it keeps" — words that appear in similar contexts end up with similar embeddings.
- Vector arithmetic (King - Man + Woman ≈ Queen) works because semantic relationships are consistent directions in the embedding space.
- Vector databases (Pinecone, pgvector) serve nearest-neighbor queries over billions of embeddings for RAG, recommendation, and semantic search.
📝 Practice Quiz
Two words have a cosine similarity of 0.95 in embedding space. What does this indicate?
- A) Their one-hot vectors overlap in 95% of positions.
- B) The words appear in very similar contexts and are semantically close (e.g., "cat" and "kitten").
- C) One word contains 95% of the letters of the other.
Answer: B
Why does "King - Man + Woman ≈ Queen" work with word embeddings?
- A) It's a hard-coded rule in the embedding model.
- B) The model learned consistent semantic directions from co-occurrence data — "royalty" and "gender" are separate geometric directions in the embedding space.
- C) It only works for those four words specifically.
Answer: B
You need to find the 10 most semantically similar documents to a query in a corpus of 100 million documents. Which tool is designed for this?
- A) A relational database with LIKE '%query%' search.
- B) A vector database (e.g., Pinecone, Weaviate, pgvector) with approximate nearest-neighbor (ANN) search over embedding vectors.
- C) An inverted index with TF-IDF ranking.
Answer: B

Written by
Abstract Algorithms
@abstractalgorithms