ANN Index Types Explained: When to Choose Flat, HNSW, IVF, or IVF-PQ
A production-first guide to picking the right vector index for recall, latency, memory, and update patterns.
Executive TLDR
- TLDR: If your dataset is small and correctness is critical, use Flat.
- If you need high recall with low latency and enough RAM, use HNSW.
- If your corpus is huge and memory is your bottleneck, use IVF-PQ.
- Start with HNSW for most RAG systems, then move to IVF or IVF-PQ when scale and cost force compression.
TLDR Summary: The Fastest Path to the Right ANN Choice
Most teams do not fail at embeddings. They fail at index selection. The same embedding model can look amazing or terrible depending on whether your index matches your workload.
Here is the practical recommendation order:
- Use Flat for baseline truth and small corpora.
- Use HNSW for most online RAG systems where recall matters.
- Use IVF when corpus size grows and brute-force is too slow.
- Use IVF-PQ when memory and cost dominate, and you can tolerate some recall loss.
One-sentence rule of thumb:
- Need exactness: Flat.
- Need speed plus quality: HNSW.
- Need scale efficiency: IVF.
- Need extreme memory savings: IVF-PQ.
Decision Matrix: Situation to Index Mapping
| Situation | Recommended Index | Why |
| --- | --- | --- |
| Up to a few million vectors, strong quality requirements | Flat (exact) | Maximum recall, simplest behavior, perfect for evaluation baselines |
| Real-time retrieval with strict p95 latency and high relevance | HNSW | Excellent latency-recall tradeoff for online serving |
| Tens to hundreds of millions of vectors, moderate memory budget | IVF | Coarse partitioning reduces search cost significantly |
| Very large corpus with tight RAM and cloud-cost pressure | IVF-PQ | Vector compression reduces memory footprint drastically |
| Heavy write/update workload with frequent re-indexing | HNSW or dynamic IVF (carefully tuned) | Better practical update behavior than heavily compressed pipelines |
| Offline reranking pipeline where latency is relaxed | IVF-PQ + reranker | Compression for candidate generation, model reranker restores precision |
If you are unsure, start with HNSW and benchmark against Flat on a labeled evaluation set. That gives you a realistic recall and latency floor before you add compression complexity.
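The matrix above can be read as a small decision function. The following is a minimal sketch, not a definitive rule: the `Workload` fields and the `LARGE_CORPUS` cutoff are hypothetical names and values chosen for illustration, and real cutoffs should come from your own benchmarks.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    num_vectors: int
    needs_exact: bool = False      # correctness dominates everything else
    ram_constrained: bool = False  # memory or cloud cost is the bottleneck


# Hypothetical cutoff for "tens of millions of vectors"; calibrate on your data.
LARGE_CORPUS = 10_000_000


def recommend_index(w: Workload) -> str:
    """Map a workload to a starting index family, following the matrix."""
    if w.needs_exact:
        return "Flat"
    if w.num_vectors >= LARGE_CORPUS and w.ram_constrained:
        return "IVF-PQ"
    if w.num_vectors >= LARGE_CORPUS:
        return "IVF"
    return "HNSW"  # default for online retrieval when RAM allows


print(recommend_index(Workload(num_vectors=3_000_000)))
print(recommend_index(Workload(num_vectors=200_000_000, ram_constrained=True)))
```

Treat the result as the first candidate to benchmark against Flat, not as a final answer.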
Why ANN Index Choice Is a System Design Decision, Not Just a Search Setting
Approximate nearest neighbor search looks like a low-level retrieval detail, but it is actually a core system design choice for any RAG, recommendation, or semantic search pipeline.
An ANN index controls four production outcomes:
- User-perceived latency.
- Recall of relevant documents.
- Infrastructure memory cost.
- Rebuild and operational complexity.
When teams choose the wrong index, they usually see one of two symptoms:
- Retrieval is fast but low-quality, so answer quality drops and hallucinations increase.
- Retrieval is high-quality but too expensive or too slow, so SLOs and cost targets fail.
This is why index selection is not a one-time tuning knob. It is a capacity and architecture decision, similar to choosing between in-memory cache, disk-based store, or distributed queue design.
For foundational context on vectors and similarity, see Embeddings Explained and Vector Databases Explained.
Core Concept Explanations: Flat, HNSW, IVF, and PQ in Plain Engineering Terms
Flat Index
Flat means you compare a query vector against every vector in the dataset. It is exact nearest-neighbor search, not approximate.
- Pros: best possible recall, simplest correctness model.
- Cons: search cost grows linearly with dataset size.
Flat is the truth benchmark. Even if you never deploy it in production, you should use it to measure the quality loss of ANN alternatives.
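To make the "compare against every vector" idea concrete, here is a minimal NumPy sketch of exact top-k search by inner product. The function name `flat_search` and the synthetic data are illustrative assumptions; a library such as FAISS does the same work with optimized kernels.

```python
import numpy as np


def flat_search(base, queries, k):
    """Exact top-k by inner product: score every query against every base vector."""
    scores = queries @ base.T                              # (n_queries, n_base)
    top = np.argpartition(-scores, k - 1, axis=1)[:, :k]   # unordered top-k ids
    rows = np.arange(len(queries))[:, None]
    order = np.argsort(-scores[rows, top], axis=1)         # sort the k survivors
    return top[rows, order]


rng = np.random.default_rng(0)
base = rng.normal(size=(2_000, 64)).astype(np.float32)
base /= np.linalg.norm(base, axis=1, keepdims=True)  # unit vectors: IP == cosine
queries = base[:5].copy()                            # each query is a known base vector

ids = flat_search(base, queries, k=3)
print(ids[:, 0])  # the nearest neighbor of each query is the vector itself
```

The cost is one dot product per stored vector per query, which is why latency grows linearly with corpus size.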
HNSW (Hierarchical Navigable Small World)
HNSW builds a graph where each vector connects to selected neighbors. Queries traverse the graph from upper sparse layers down to dense local neighborhoods.
- Pros: strong recall-latency tradeoff, widely used for online semantic retrieval.
- Cons: high memory overhead, tuning complexity (M, efConstruction, efSearch).
HNSW is often the default winner for medium-to-large real-time workloads where recall matters.
IVF (Inverted File Index)
IVF clusters vectors into coarse cells (lists). At query time, you probe only the nprobe nearest clusters instead of scanning everything.
- Pros: much faster than Flat at scale, predictable tuning axis (nlist, nprobe).
- Cons: a cluster miss can hurt recall, especially for long-tail queries.
IVF is useful when brute-force is no longer practical and memory remains manageable.
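The cluster-then-probe idea can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions: `build_ivf`, `ivf_search`, and `sq_dists` are hypothetical helper names, the clustering is a handful of plain k-means iterations, and the data is synthetic. Production libraries train and store the lists far more efficiently.

```python
import numpy as np


def sq_dists(a, b):
    """Pairwise squared L2 distances, shape (len(a), len(b))."""
    return (a**2).sum(1)[:, None] - 2.0 * a @ b.T + (b**2).sum(1)[None, :]


def build_ivf(base, nlist, rng, iters=5):
    """Toy IVF build: a few k-means iterations, then one id list per centroid."""
    centroids = base[rng.choice(len(base), size=nlist, replace=False)].copy()
    for _ in range(iters):
        assign = sq_dists(base, centroids).argmin(axis=1)
        for c in range(nlist):
            members = base[assign == c]
            if len(members):  # keep the old centroid if a cell empties
                centroids[c] = members.mean(axis=0)
    assign = sq_dists(base, centroids).argmin(axis=1)
    lists = [np.flatnonzero(assign == c) for c in range(nlist)]
    return centroids, lists


def ivf_search(query, base, centroids, lists, nprobe, k):
    """Scan only the nprobe lists whose centroids are closest to the query."""
    probe = sq_dists(query[None, :], centroids)[0].argsort()[:nprobe]
    cand = np.concatenate([lists[c] for c in probe])
    d = sq_dists(query[None, :], base[cand])[0]
    return cand[d.argsort()[:k]]


rng = np.random.default_rng(0)
base = rng.normal(size=(5_000, 32))
query = rng.normal(size=32)
nlist = 32
centroids, lists = build_ivf(base, nlist, rng)

exact = sq_dists(query[None, :], base)[0].argsort()[:5]
approx = ivf_search(query, base, centroids, lists, nprobe=4, k=5)
full = ivf_search(query, base, centroids, lists, nprobe=nlist, k=5)
print("overlap with exact at nprobe=4:", len(set(approx) & set(exact)), "of 5")
```

Probing every list (nprobe equal to nlist) degenerates to a full scan and recovers the exact result, which is a useful sanity check when wiring up a benchmark.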
PQ (Product Quantization)
PQ compresses vectors into compact codes by splitting dimensions into subspaces and storing centroid IDs instead of full precision values. While this resembles model weight compression (which you can read about in LLM Model Quantization: Why, When, and How), vector product quantization specifically quantizes the dataset representations rather than the neural network layers themselves.
- Pros: major memory reduction, faster scan over compressed codes.
- Cons: approximation error lowers similarity precision.
PQ is usually combined with IVF as IVF-PQ. It is less about raw speed and more about cost-efficient large-scale retrieval.
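The encode and decode steps of PQ can be sketched as follows. This is a simplified illustration, not how a production library trains codebooks: the helper names are hypothetical, and the codebooks are built by sampling training sub-vectors rather than running k-means per subspace, which keeps the sketch short at the cost of higher reconstruction error.

```python
import numpy as np


def train_pq(train, m, nbits, rng):
    """Toy codebooks: sample 2**nbits training sub-vectors per subspace.
    (Real PQ runs k-means in each subspace; sampling is a simplification.)"""
    n, dim = train.shape
    assert dim % m == 0 and nbits <= 8
    ksub = 2**nbits
    sub = train.reshape(n, m, dim // m)
    picks = rng.choice(n, size=ksub, replace=False)
    return sub[picks].transpose(1, 0, 2).copy()  # (m, ksub, dsub)


def pq_encode(x, codebooks):
    """Replace each sub-vector with the id of its nearest codebook entry."""
    m, ksub, dsub = codebooks.shape
    sub = x.reshape(len(x), m, dsub)
    codes = np.empty((len(x), m), dtype=np.uint8)
    for j in range(m):
        d2 = ((sub[:, j, None, :] - codebooks[j][None, :, :]) ** 2).sum(-1)
        codes[:, j] = d2.argmin(axis=1)
    return codes


def pq_decode(codes, codebooks):
    """Rebuild approximate vectors by concatenating the selected centroids."""
    m = codebooks.shape[0]
    return np.concatenate([codebooks[j][codes[:, j]] for j in range(m)], axis=1)


rng = np.random.default_rng(0)
train = rng.normal(size=(2_000, 32)).astype(np.float32)
codebooks = train_pq(train, m=8, nbits=8, rng=rng)

x = train[:100]
codes = pq_encode(x, codebooks)
recon = pq_decode(codes, codebooks)
print("bytes per vector:", x.nbytes // len(x), "->", codes.nbytes // len(codes))
```

With 32 float32 dimensions and 8 one-byte codes, each vector shrinks from 128 bytes to 8 bytes; the reconstruction is lossy, which is the precision cost described above.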
Core Mechanics: How ANN Candidate Generation Actually Runs
ANN indexes all implement the same broad retrieval loop with different internal data structures.
- Convert the user query into an embedding vector.
- Use the index structure to find a small candidate set.
- Compute similarity over those candidates (exact on full vectors, or approximate on compressed codes).
- Return top-k nearest results.
The difference is where approximation happens:
- Flat: no approximation, full scan.
- HNSW: graph traversal approximation.
- IVF: cluster pre-filter approximation.
- IVF-PQ: cluster pre-filter plus compressed-distance approximation.
flowchart TD
Q[Query Text] --> E[Query Embedding]
E --> D{Index Family}
D --> FL[Flat Full Scan]
D --> HG[HNSW Graph Walk]
D --> IV[IVF Coarse Cluster Probe]
IV --> PQ[PQ Code Distance]
FL --> TOP[Top-K Candidates]
HG --> TOP
IV --> TOP
PQ --> TOP
TOP --> RR[Rerank Optional]
How to read this diagram:
- Every path starts from the same embedding vector and ends with top-k candidates.
- ANN speedups come from reducing the number of exact distance computations.
- Extra approximation stages improve throughput but can reduce recall if not tuned.
Vector Index Comparison: Latency, Recall, Memory, and Operations
| Dimension | Flat | HNSW | IVF | IVF-PQ |
| --- | --- | --- | --- | --- |
| Recall ceiling | Highest | Very high | Medium to high | Medium |
| Query latency at scale | Slowest | Fast | Fast | Fast to very fast |
| Memory footprint | High (raw vectors) | Highest (raw vectors plus graph links) | High (raw vectors plus list overhead) | Lowest |
| Build complexity | Low | Medium | Medium | High |
| Update friendliness | High | Medium to high | Medium | Lower |
| Best fit | Baseline or small corpus | Online RAG default | Large corpus with balanced constraints | Massive corpus and tight budget |
Quick takeaway:
- HNSW wins many production workloads where quality matters and RAM is acceptable.
- IVF-PQ wins when cost per million vectors matters more than perfect recall.
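A back-of-the-envelope estimate helps put numbers on the memory row. The sketch below uses commonly cited rule-of-thumb formulas (float32 vectors, roughly 2*M 4-byte links per node for HNSW, one code plus a 64-bit id per vector for IVF-PQ). These are approximations and assumptions, not measurements; the function name and default parameters are illustrative.

```python
def bytes_per_vector(index, dim, M=32, m=48, nbits=8):
    """Rough per-vector footprint in bytes; ignores centroids and allocator overhead."""
    raw = dim * 4                        # float32 components
    if index == "Flat":
        return raw
    if index == "HNSW":
        return raw + M * 2 * 4           # raw vector + ~2*M int32 links at layer 0
    if index == "IVF":
        return raw + 8                   # raw vector + 64-bit id in the inverted list
    if index == "IVF-PQ":
        return (m * nbits + 7) // 8 + 8  # PQ code + 64-bit id
    raise ValueError(f"unknown index family: {index}")


dim = 384
for name in ("Flat", "HNSW", "IVF", "IVF-PQ"):
    per_vec = bytes_per_vector(name, dim)
    print(f"{name:7s} {per_vec:5d} B/vector  ~{per_vec * 1_000_000 / 1e9:.2f} GB per million")
```

For 384-dimensional vectors this gives 1536, 1792, 1544, and 56 bytes per vector respectively, which is why compression, not partitioning alone, is what moves the memory bill.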
Real-World Scenarios: Which Index Wins Under Different Constraints
Scenario 1: Internal Docs RAG for Engineering Support
- Corpus: 3 million chunks.
- Target: p95 retrieval < 80 ms.
- Quality requirement: high recall for rare incident terms.
Best first choice: HNSW.
Why: You need high recall on tail queries and interactive latency. HNSW usually beats IVF at equivalent recall in this range, assuming enough memory.
Scenario 2: Consumer Recommendation with 200 Million Items
- Corpus: 200 million embeddings.
- Cost pressure: strict RAM budget.
- Latency target: p95 < 120 ms.
Best first choice: IVF-PQ.
Why: Flat and HNSW memory overhead can be prohibitive. IVF-PQ provides large compression and acceptable candidate quality when paired with reranking.
Scenario 3: Legal Search with Explainability and High Precision
- Corpus: 800k vectors.
- Requirement: minimal false negatives.
- Latency is important, but correctness dominates.
Best first choice: Flat for baseline plus HNSW for production candidate generation.
Why: Use Flat to validate top-k quality, then move to HNSW if latency needs improvement. Maintain periodic Flat audits to detect drift.
Scenario 4: Near-Real-Time News Index with Frequent Updates
- Corpus changes every minute.
- Need low-latency updates and consistent serving.
Best first choice: HNSW or a hybrid two-tier architecture.
Why: Heavily compressed pipelines can complicate rapid updates. A hot HNSW tier plus periodic cold compaction works better operationally. For details on how to design and coordinate this tier-routing pattern inside robust agent runtimes, check out our companion post on System Design for Agentic AI Systems.
Architecture and Workflow: How ANN Index Selection Fits into a Retrieval Stack
Deep Dive: Tuning Knobs That Actually Move Outcomes
You can spend weeks tuning random parameters. Focus on the small set that directly moves latency-recall behavior.
The Internals
HNSW Parameters
- M: number of bi-directional edges per node.
  - Higher M improves recall but increases memory and build time.
- efConstruction: search breadth during graph build.
  - Higher values improve graph quality and recall, but slow indexing.
- efSearch: search breadth at query time.
  - Higher values improve recall but increase latency.
Operational pattern: keep M and efConstruction stable after benchmarking, then tune efSearch dynamically by query class.
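One way to apply this pattern is a small lookup from query class to efSearch. The following is a minimal sketch: the query classes and values are hypothetical placeholders that would come from your own benchmark sweep. The clamp to k reflects a real constraint, since hnswlib expects ef to be at least the number of requested neighbors.

```python
# Hypothetical query classes and efSearch values; derive real ones from a sweep.
EF_SEARCH_BY_CLASS = {
    "autocomplete": 32,   # latency-critical, lower recall acceptable
    "chat": 80,           # balanced default
    "audit": 256,         # recall-critical, latency relaxed
}
DEFAULT_EF_SEARCH = 80


def ef_search_for(query_class: str, k: int) -> int:
    """Pick efSearch for a query class, never below k."""
    ef = EF_SEARCH_BY_CLASS.get(query_class, DEFAULT_EF_SEARCH)
    return max(ef, k)


print(ef_search_for("audit", k=10))
print(ef_search_for("unknown-class", k=10))
```

With hnswlib, the chosen value would be applied through `index.set_ef(...)` before calling `knn_query`; because that setting lives on the index object, concurrent callers with different classes need some coordination.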
IVF Parameters
- nlist: number of coarse clusters.
  - Too low: large lists, poor filtering.
  - Too high: sparse lists, training instability.
- nprobe: number of clusters scanned at query time.
  - Higher nprobe improves recall, increases latency.
Operational pattern: choose nlist from dataset scale, then tune nprobe per SLO tier.
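A common starting heuristic, echoed in FAISS guidance, is to size nlist as a small multiple of the square root of the corpus size and to give k-means enough training points per centroid. The sketch below encodes that heuristic; the helper names and the default factor are assumptions for illustration, and the 39-points-per-centroid figure is the threshold below which FAISS k-means emits a warning.

```python
import math


def suggest_nlist(num_vectors: int, factor: int = 4) -> int:
    """Heuristic starting point: nlist ~ factor * sqrt(N), with factor often 4 to 16."""
    return max(1, int(factor * math.sqrt(num_vectors)))


def min_training_vectors(nlist: int, per_centroid: int = 39) -> int:
    """Training-set size below which FAISS k-means warns about too few points."""
    return nlist * per_centroid


n = 1_000_000
nlist = suggest_nlist(n)
print("suggested nlist:", nlist)
print("minimum training vectors:", min_training_vectors(nlist))
```

These numbers only seed the search; the final nlist should still be validated with a recall and latency sweep over nprobe.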
PQ Parameters
- m: number of subquantizers.
- nbits: codebook bits per subvector.
More bits generally improve recall but reduce compression. In cost-sensitive systems, this is where you buy back quality with memory.
Metric Alignment Gotcha
If your embedding model was trained with cosine similarity, but you use a raw L2 setup without normalization, quality can collapse even with a perfect index configuration.
Always validate metric/index alignment:
- cosine-trained embeddings -> normalize vectors and use inner product or cosine-aware setup.
- euclidean-trained embeddings -> use L2.
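The reason normalization matters can be checked numerically: for unit vectors, squared L2 distance equals 2 - 2 * cosine, so L2 and cosine produce the same ranking, while raw L2 on unnormalized vectors is also influenced by vector length. A minimal sketch on synthetic data (all names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic embeddings with deliberately varied norms.
base = rng.normal(size=(1_000, 32)) * rng.uniform(0.5, 5.0, size=(1_000, 1))
query = rng.normal(size=32)


def unit(x):
    """Scale vectors to unit L2 norm along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


cosine_rank = np.argsort(-(unit(base) @ unit(query)))
l2_unit_rank = np.argsort(((unit(base) - unit(query)) ** 2).sum(axis=1))
l2_raw_rank = np.argsort(((base - query) ** 2).sum(axis=1))

same = np.array_equal(cosine_rank[:10], l2_unit_rank[:10])
overlap = len(set(cosine_rank[:10].tolist()) & set(l2_raw_rank[:10].tolist()))
print("normalized L2 matches cosine top-10:", same)
print("raw L2 overlap with cosine top-10:", overlap, "of 10")
```

Running this kind of check on a sample of real embeddings is a cheap way to catch a metric mismatch before blaming the index.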
Performance Analysis
For production choices, compare indexes on the same dataset, same embedding model, and same top-k target. Do not compare numbers from separate blog posts or benchmark papers with different hardware.
A practical evaluation grid:
| Metric | Why it matters | Typical guardrail |
| --- | --- | --- |
| Recall@k versus Flat | Measures retrieval quality loss | Keep within acceptable drop per product objective |
| p50/p95 query latency | Captures interactive experience | Match user-facing SLO budget |
| Memory per million vectors | Determines cloud cost and node fit | Stay inside planned RAM envelope |
| Build and rebuild time | Affects freshness and operations | Meet index refresh windows |
| Update throughput | Important for dynamic corpora | Sustain ingestion targets |
Use this process:
- Establish Flat baseline recall and latency.
- Evaluate HNSW sweep across efSearch values.
- Evaluate IVF across nlist and nprobe.
- Evaluate IVF-PQ across m, nbits, and reranking depth.
- Select the lowest-cost setup that still meets recall and latency objectives.
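The quality side of this process boils down to one number per configuration: recall@k of the ANN results measured against the Flat results for the same queries. A minimal sketch, assuming both result sets are given as lists of id lists (the function name and the tiny example ids are illustrative only):

```python
def recall_at_k(ann_ids, exact_ids, k):
    """Fraction of the exact top-k neighbors that the ANN index also returned."""
    hits = sum(
        len(set(ann[:k]) & set(exact[:k]))
        for ann, exact in zip(ann_ids, exact_ids)
    )
    return hits / (k * len(exact_ids))


# Two queries, k=3: the first ANN result misses one true neighbor (id 9).
exact = [[1, 2, 9], [4, 5, 6]]
ann = [[1, 2, 3], [6, 5, 4]]
print(f"recall@3 = {recall_at_k(ann, exact, k=3):.3f}")
```

Order within the top-k does not affect this metric, so it measures candidate coverage; ranking quality is better judged after the reranking stage.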
Practical Example 1: Building and Querying HNSW with hnswlib
This example demonstrates a realistic online-serving setup: build an HNSW index, tune efSearch, and query top-k neighbors. Watch how efSearch changes recall-latency behavior.
import numpy as np
import hnswlib
import time
# Synthetic dataset: 1M vectors in production is common, kept small here for demo speed.
num_elements = 100_000
dim = 384
rng = np.random.default_rng(42)
data = rng.normal(size=(num_elements, dim)).astype(np.float32)
queries = rng.normal(size=(100, dim)).astype(np.float32)
# Build HNSW index
index = hnswlib.Index(space='cosine', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=32)
index.add_items(data, np.arange(num_elements))
index.set_ef(80) # query-time breadth; tune this for recall/latency
# Query benchmark
k = 10
start = time.perf_counter()
labels, distances = index.knn_query(queries, k=k)
elapsed_ms = (time.perf_counter() - start) * 1000
print(f'Queried {len(queries)} queries, top-{k}, total_time_ms={elapsed_ms:.2f}')
print('First query nearest IDs:', labels[0][:5])
What to observe:
- Increasing efSearch usually raises recall and latency together.
- Memory consumption is driven heavily by M and corpus size.
- Use this pattern in a benchmark harness with real query logs, not synthetic vectors only.
Practical Example 2: IVF-PQ with FAISS for Memory-Efficient Scale
This example shows a common large-scale approach: train IVF-PQ, search with adjustable nprobe, and use compression to control memory growth. The right way to deploy this is with offline evaluation against a Flat or high-quality HNSW baseline.
import numpy as np
import faiss
# Data
num = 300_000
dim = 384
rng = np.random.default_rng(7)
xb = rng.normal(size=(num, dim)).astype('float32')
xq = rng.normal(size=(500, dim)).astype('float32')
# Normalize for cosine-like search via inner product
faiss.normalize_L2(xb)
faiss.normalize_L2(xq)
# IVF-PQ settings
nlist = 4096 # coarse clusters
m = 48 # subquantizers
nbits = 8 # bits per subvector
quantizer = faiss.IndexFlatIP(dim)
index = faiss.IndexIVFPQ(quantizer, dim, nlist, m, nbits, faiss.METRIC_INNER_PRODUCT)
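# Train + add
# FAISS k-means warns below 39 training points per centroid
# (39 * 4096 = 159,744), so sample at least that many vectors.
train_samples = xb[:200_000]
index.train(train_samples)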
index.add(xb)
# Search control
index.nprobe = 24
D, I = index.search(xq, 10)
print('Index ntotal:', index.ntotal)
print('Top IDs for first query:', I[0][:5])
What to observe:
- nprobe is the main query-time recall-latency dial for IVF.
- PQ compression reduces memory significantly but may hurt fine-grained ranking.
- Pair with reranking for quality-sensitive workloads.
FAISS and Milvus: How Open-Source Systems Implement ANN at Scale
FAISS is a high-performance similarity search library from Meta. It provides exact and ANN index families including Flat, HNSW, IVF, and IVF-PQ, with GPU support and battle-tested primitives for large vector collections.
Milvus is an open-source vector database that builds on ANN concepts with distributed storage, filtering, metadata operations, and production deployment workflows.
How they solve the problem in practice:
- FAISS gives low-level algorithm control and excellent benchmarking fidelity.
- Milvus adds operational layers: sharding, segment lifecycle, persistence, and service interfaces.
A minimal Milvus-flavored configuration idea is:
- Online tier for HNSW-based hot documents.
- Cold tier using IVF-PQ for long-tail archives.
- Query router selecting tier by recency, tenant, or latency budget.
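The routing step in this layout can be sketched as a pure function. Everything here is hypothetical and for illustration only: the tier names, the `HOT_WINDOW_DAYS` and `COLD_TIER_MIN_BUDGET_MS` thresholds, and the `Query` fields are assumptions, not Milvus APIs.

```python
from dataclasses import dataclass

# Hypothetical thresholds; set them from your own freshness and latency data.
HOT_WINDOW_DAYS = 30          # documents newer than this live in the HNSW tier
COLD_TIER_MIN_BUDGET_MS = 50  # below this budget, skip the IVF-PQ tier


@dataclass
class Query:
    max_doc_age_days: int     # how far back the caller wants to search
    latency_budget_ms: int


def route(q: Query) -> list:
    """Return the tiers to search, hot tier first."""
    if q.max_doc_age_days <= HOT_WINDOW_DAYS:
        return ["hot-hnsw"]                    # recent documents only
    if q.latency_budget_ms < COLD_TIER_MIN_BUDGET_MS:
        return ["hot-hnsw"]                    # cannot afford the cold tier
    return ["hot-hnsw", "cold-ivfpq"]          # search both, then merge and rerank


print(route(Query(max_doc_age_days=7, latency_budget_ms=40)))
print(route(Query(max_doc_age_days=365, latency_budget_ms=120)))
```

Keeping the router free of side effects makes it easy to unit test and to replay against logged queries when thresholds change.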
This split architecture frequently outperforms forcing one index type to satisfy all workloads. For a deep-dive into LLM observability, tracking, and telemetry inside production vector database pipelines, see LLM Observability: Tracing, Logging, and Debugging.
For a full deep-dive on FAISS internals, Milvus segment lifecycle, and hybrid ANN serving patterns, a dedicated follow-up post is planned.
Lessons Learned: Operationalizing ANN Vector Search in Production
- Index choice is workload design, not library preference.
- Always benchmark against Flat before accepting ANN tradeoffs.
- HNSW is an excellent default, but memory cost must be monitored.
- IVF-PQ is often the economic winner at very large scale, especially with reranking.
- Metric mismatch can silently destroy relevance even when latency looks good.
- Query-class-aware tuning (for example dynamic efSearch or nprobe) often beats one static configuration.
- Treat ANN as candidate generation and let rerankers restore precision for critical tasks.
Key Takeaways: Selecting the Right ANN Index for Vector Search
- Flat: exact, simple, expensive at scale.
- HNSW: best default for online RAG when recall matters.
- IVF: scalable partitioned search with moderate complexity.
- IVF-PQ: compressed large-scale retrieval for memory-constrained systems.
- Start with HNSW, validate with Flat, move to IVF-PQ only when scale or cost requires it.
Written by
Abstract Algorithms
@abstractalgorithms