What are Hash Tables? Basics Explained
Hash Tables are the magic behind O(1) lookups. We explain Hash Functions, Collisions, and why the...
By Abstract Algorithms

TLDR: A hash table gives you near-O(1) lookups, inserts, and deletes by using a hash function to map keys to array indices. The tradeoff: collisions (when two keys hash to the same slot) must be handled, and a filling hash table must periodically be resized.
The Magic Mailroom Stamp
Imagine a mailroom with 100 drawers. A magic stamp converts any employee name to a number 1–100. Want John's mail? Stamp "John" → you get 42 → open Drawer 42. No search needed.
That stamp is a hash function. The row of drawers is the array (the backing store). The combination is a hash table.
Building Blocks: Array + Hash Function
```mermaid
flowchart LR
    Key["Key: 'cat'"] --> HashFn["Hash Function\nh('cat') = 2"]
    HashFn --> Array["Array\n[0][1][2: 'cat→meow'][3]..."]
    Array --> Value["Value: 'meow'"]
```
A hash function must be:
- Deterministic: Same input always produces the same output.
- Fast: O(1) computation.
- Uniform: Keys distribute evenly to minimize collisions.
Simple hash function (NOT production-quality):

```python
def simple_hash(key: str, size: int) -> int:
    # Sum the character codes, then wrap into the array's index range.
    return sum(ord(c) for c in key) % size
```
Python's built-in hash() uses SipHash (a keyed hash function) for strings and bytes, which defends against hash-flooding denial-of-service attacks.
Collisions: When Two Keys Land in the Same Slot
No hash function is perfect. Given a finite array, two different keys will eventually map to the same slot.
Example:
hash('cat') = 2 → stored at index 2.
hash('bird') = 2 → collision at index 2!
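One way to see this concretely: the toy simple_hash above only sums character codes, so it ignores order entirely and any two anagrams collide, regardless of table size (the function is repeated here so the snippet runs on its own):

```python
# Toy hash from earlier: sums character codes, then wraps into the index range.
def simple_hash(key: str, size: int) -> int:
    return sum(ord(c) for c in key) % size

# Deterministic: the same key always lands in the same slot.
assert simple_hash("cat", 8) == simple_hash("cat", 8)

# But anagrams share the same character sum, so they always collide.
assert simple_hash("cat", 8) == simple_hash("act", 8)
```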
Two collision resolution strategies:
Separate Chaining
Each slot holds a linked list of key-value pairs. On collision, append to the list.
index 2: [('cat', 'meow') → ('bird', 'tweet')]
O(1) average, O(n) worst case if all keys hash to the same slot.
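A minimal chaining table might look like the sketch below (illustrative only; the name ChainedHashTable is invented for this example, and Python's built-in hash stands in for a custom hash function):

```python
# Minimal separate-chaining hash table (sketch, not production code).
class ChainedHashTable:
    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]  # one chain per slot

    def _index(self, key):
        return hash(key) % len(self.buckets)

    def put(self, key, value):
        bucket = self.buckets[self._index(key)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # key already present: overwrite
                bucket[i] = (key, value)
                return
        bucket.append((key, value))       # new key (or collision): append to chain

    def get(self, key):
        for k, v in self.buckets[self._index(key)]:
            if k == key:
                return v
        raise KeyError(key)

t = ChainedHashTable()
t.put("cat", "meow")
t.put("bird", "tweet")
```

A lookup only walks a single chain, so as long as keys spread evenly across slots, each chain stays short and operations stay O(1) on average.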
Open Addressing (Linear Probing)
No linked lists. On collision, step forward until you find an empty slot.
hash('cat') = 2 → store at 2
hash('bird') = 2 (occupied) → probe 3 → store at 3
Pros: Better cache performance (all data in one array). Cons: Clustering, where runs of occupied slots build up and slow down probing.
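A sketch of the same idea with linear probing (again illustrative; a real implementation also needs tombstones to support deletion, and must resize before the array fills, or the probe loop below would never find a free slot):

```python
# Minimal linear-probing hash table (sketch, not production code).
class ProbingHashTable:
    def __init__(self, capacity=8):
        self.slots = [None] * capacity   # each slot holds (key, value) or None

    def put(self, key, value):
        i = hash(key) % len(self.slots)
        # On collision, step forward (wrapping around) until we find the key
        # or an empty slot. Assumes the table is never completely full.
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        self.slots[i] = (key, value)

    def get(self, key):
        i = hash(key) % len(self.slots)
        while self.slots[i] is not None:
            if self.slots[i][0] == key:
                return self.slots[i][1]
            i = (i + 1) % len(self.slots)
        raise KeyError(key)              # hit an empty slot: key is absent
```

Note that lookup must keep probing past mismatched keys until it reaches an empty slot; that is exactly why clustering hurts performance.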
| Strategy | Space | Lookup (avg) | Cache Friendly | Used In |
| --- | --- | --- | --- | --- |
| Separate Chaining | Extra memory for chains | O(1) | No | Java HashMap (long chains become trees in JDK 8+) |
| Linear Probing | Compact array | O(1) | Yes | Python dict |
| Quadratic Probing | Compact array | O(1) | Partial | Many open-addressing tables |
| Robin Hood Hashing | Compact array | O(1) | Yes | Rust HashMap (before the SwissTable-based hashbrown) |
Load Factor and Rehashing
Load factor = (number of entries) / (array capacity).
As the table fills up (load factor → 1.0), collisions increase and performance degrades.
Standard practice: rehash when load factor exceeds 0.75.
Rehashing:
- Create a new array 2× the size.
- Re-insert every existing key into the new array (re-hash all keys).
- Old array is garbage collected.
Cost: O(N) for one rehash. But because the capacity doubles each time, rehashes happen at geometrically spaced intervals, so the amortized cost per insert stays O(1).
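The resize logic can be sketched like this, layered on a chaining table (the class and threshold names are invented for this example):

```python
# Chaining table that rehashes when the load factor would exceed 0.75 (sketch).
class ResizingTable:
    LOAD_FACTOR_LIMIT = 0.75

    def __init__(self, capacity=8):
        self.buckets = [[] for _ in range(capacity)]
        self.count = 0

    def put(self, key, value):
        # Resize first if this insert would push the load factor past the limit.
        if (self.count + 1) / len(self.buckets) > self.LOAD_FACTOR_LIMIT:
            self._rehash()
        bucket = self.buckets[hash(key) % len(self.buckets)]
        for i, (k, _) in enumerate(bucket):
            if k == key:                  # overwrite, count unchanged
                bucket[i] = (key, value)
                return
        bucket.append((key, value))
        self.count += 1

    def _rehash(self):
        old = self.buckets
        self.buckets = [[] for _ in range(2 * len(old))]   # 2x the capacity
        for bucket in old:                                 # re-hash every entry,
            for key, value in bucket:                      # since indices change
                self.buckets[hash(key) % len(self.buckets)].append((key, value))
```

Every existing key must be re-hashed because the index depends on the capacity: hash(key) % 8 and hash(key) % 16 generally disagree.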
Why Python Dicts Are Fast
CPython's dict uses open addressing, and since Python 3.6 it adds a compact layout: entries live in a dense array in insertion order, with a separate sparse table of indices pointing into it. This gives:
- Fast iteration (compact memory, no pointer chasing).
- Memory-efficient (no linked list overhead).
- Insertion-order iteration (a language guarantee since Python 3.7).
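The ordering guarantee is easy to check:

```python
# Since Python 3.7, dict preserving insertion order is a language guarantee.
d = {}
d["banana"] = 2
d["apple"] = 1
d["cherry"] = 3

assert list(d) == ["banana", "apple", "cherry"]  # insertion order, not sorted
```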
Hash Tables vs. Other Data Structures
| Operation | Hash Table | BST (e.g., TreeMap) | Sorted Array |
| --- | --- | --- | --- |
| Lookup by key | O(1) avg | O(log n) | O(log n) binary search |
| Insert | O(1) avg | O(log n) | O(n) shift |
| Delete | O(1) avg | O(log n) | O(n) shift |
| Range query | Not supported | O(log n + k) | O(log n + k) |
| Ordered iteration | No order | In-order | Already sorted |
Choose hash table when: You need fast key-value lookup and don't need ordering or range queries. Choose BST when: You need sorted iteration or range queries (e.g., "find all keys between 10 and 50").
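For contrast, here is what a range query looks like over a sorted structure, sketched with the standard-library bisect module (the range_query helper is invented for this example):

```python
import bisect

keys = sorted([3, 17, 25, 42, 50, 88])

def range_query(sorted_keys, lo, hi):
    # Two binary searches bound the range, then one slice collects the hits:
    # O(log n + k) overall, where k is the number of results.
    left = bisect.bisect_left(sorted_keys, lo)
    right = bisect.bisect_right(sorted_keys, hi)
    return sorted_keys[left:right]

assert range_query(keys, 10, 50) == [17, 25, 42, 50]
```

A dict holding the same keys would have to scan every entry to answer this query, because hashing deliberately scatters nearby keys across the array.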
Summary
- Hash function maps key → array index in O(1). Must be deterministic and uniform.
- Collisions are inevitable; separate chaining uses linked lists; linear probing uses the array itself.
- Rehash at load factor 0.75 to keep average O(1); amortized cost is O(1) per insert.
- Hash tables have no order → use a BST when you need sorted iteration or range queries.
Practice Quiz
What is the primary purpose of a hash function in a hash table?
- A) To encrypt the key for security.
- B) To convert a key to an integer array index in O(1) time, enabling direct-access storage.
- C) To sort the keys in the table.
Answer: B
A hash table has 100 slots and 90 entries. What is the load factor, and what should happen next?
- A) Load factor = 0.9; rehash now (exceeds 0.75 threshold) β create a 200-slot table and re-insert all entries.
- B) Load factor = 90; nothing β the table still has 10 empty slots.
- C) Load factor = 0.9; delete the 15 least-recently-used entries to stay under capacity.
Answer: A
When would you choose a Binary Search Tree over a Hash Table?
- A) When you need O(1) lookup.
- B) When you need sorted iteration or range queries (e.g., "all keys between 50 and 100").
- C) When your dataset is too small for a hash table.
Answer: B
