Home/Blog/Database/How It Works: The Internals of Database B+ Tree Indexes
DatabaseIntermediateβ€’11 min readβ€’

How It Works: The Internals of Database B+ Tree Indexes

Understand how B+ Trees minimize disk I/O, manage page splits, and accelerate database queries.

Abstract Algorithms

Abstract Algorithms

Helping engineers master software engineering topics.

TLDR: Database indexes utilize B+ Trees to minimize slow disk I/O operations. By organizing data into balanced, wide node pages and linking leaf nodes sequentially, databases resolve queries in logarithmic time and support fast range scans.


πŸ“– Design Challenge: The Cost of Full Table Scans

Imagine you are running a financial reporting application storing 100 million transaction records in a database table. The table is stored on a Solid State Drive (SSD). Every record is 100 bytes, making the total raw table size 10 gigabytes.

When a user runs a query to find a transaction by its unique IDβ€”SELECT * FROM transactions WHERE id = 9876543β€”and the database does not have an index, the storage engine must execute a full table scan.

It reads every single page of the 10 GB file from disk into memory to check if the ID matches. Even on a fast NVMe SSD with a read throughput of 1 GB/second, this lookup takes 10 full seconds of raw disk I/O, exhausting CPU cycles and locking database pages.

To resolve this, we need a data structure that allows the database to locate any specific row in under 5 milliseconds. The structure must be optimized for disk block storage, allowing the database to search millions of records by reading only a tiny fraction of the data from disk. This structure is the B+ Tree.


πŸ” Core Structure: The B+ Tree Node Hierarchy

A B+ Tree is a self-balancing search tree designed to store sorted data. It is a variation of the standard B-Tree, optimized specifically for filesystem blocks and database page layouts.

The tree is organized into three distinct node layers:

  1. Root Node: The single entry point at the top of the tree, containing pointers to intermediate child pages.
  2. Internal Nodes: Middle-tier routing pages. They do not store actual database rows; they only hold search keys and child page pointers to guide the search.
  3. Leaf Nodes: The bottom layer of the tree. These pages store the actual keys and associated data pointer references (or the raw rows themselves in a clustered index). Crucially, all leaf nodes are connected sequentially using a doubly linked list.

The diagram below visualizes this node hierarchy structure:

graph TD
    Root[Root Node Page] -->|Keys: 20, 50| Int1[Internal Page 1: Keys 5-15]
    Root -->|Pointer > 50| Int2[Internal Page 2: Keys 60-80]
    Int1 -->|Pointer < 20| Leaf1[Leaf Page 1: Data 1-19]
    Int1 -->|Pointer 20-50| Leaf2[Leaf Page 2: Data 20-49]
    Int2 -->|Pointer > 50| Leaf3[Leaf Page 3: Data 50-99]
    Leaf1 <--> Leaf2
    Leaf2 <--> Leaf3

This node diagram illustrates how B+ Trees separate index routing from storage. The root and internal nodes only contain keys and page pointers, acting as a directory. The leaf nodes at the bottom contain the actual keys and data, and they are linked sequentially to support fast, single-pass range scans.


βš™οΈ Core Mechanics: Lookups, Splits, and Leaf Pointers

The B+ Tree maintains balance and search efficiency through three core operations:

1. The Lookup Operation

To locate a key, the database starts at the root page. It runs a binary search on the keys stored in the root page to find the child pointer range that covers the target. It follows the pointer to the next internal node page and repeats this process until it reaches a leaf node page, where the actual data resides.

2. The Page Split (Write Paths)

When you insert a new row, the database navigates to the target leaf page. If the leaf page has free space, the key is inserted in sorted order. If the page is full (typically 16 KB in size), a page split occurs:

  • A new leaf page is allocated.
  • Half of the keys from the full page are moved to the new page.
  • The minimum key of the new page is copied up to the parent internal page.
  • If the parent internal page is also full, this split propagates up the tree, occasionally creating a new root page and increasing the height of the tree by 1.

3. Leaf Pointer Linking

Because the leaf nodes are connected via a doubly linked list, executing range queries (e.g., WHERE age BETWEEN 20 AND 30) is highly efficient. The database navigates to the first matching key (20) at the leaf layer, then traverses the linked list pointers directly across the leaf pages until it hits the end key (30), avoiding the need to traverse back up the tree.


πŸ“Š Architectural Blueprint: Query Traversal Flow

To visualize a database search, the sequence diagram below maps the interaction between the query engine, cache buffer pool, and disk blocks during a lookup:

sequenceDiagram
    participant QE as Query Engine
    participant BP as Buffer Pool (RAM)
    participant Disk as Disk Storage (SSD)

    QE->>BP: Request Root Page 0
    BP-->>QE: Page 0 (Cache Hit)
    Note over QE: Binary search Page 0. Select Child Page 12
    QE->>BP: Request Internal Page 12
    BP->>Disk: Fetch Page 12 (Cache Miss)
    Disk-->>BP: Return Page 12
    BP-->>QE: Page 12
    Note over QE: Binary search Page 12. Select Leaf Page 85
    QE->>BP: Request Leaf Page 85
    BP-->>QE: Page 85 (Cache Hit)
    Note over QE: Extract record matching Key 45

This sequence diagram illustrates a typical B+ Tree lookup traversal. The Query Engine requests pages sequentially from the Buffer Pool cache. If a page is absent from RAM (cache miss), the Buffer Pool fetches the 16 KB page block from SSD disk storage. The Query Engine performs a fast binary search on each page's keys in memory, repeating this traversal down the hierarchy until it extracts the target record from the destination leaf page.


🧠 Deep Dive: Disk I/O Blocks and Node Fan-out

Understanding why B+ Trees outperform other trees (like Red-Black trees or AVL trees) on disk requires analyzing how hardware reads data.

The Internals of Page Allocations and Row Offsets

Operating systems and storage controllers read and write data in fixed-size blocks (typically 4 KB). Databases group these blocks into pages (e.g., 16 KB pages in MySQL InnoDB). When the database reads a single byte from a row, the storage controller must transfer the entire 16 KB page block from disk to the RAM buffer pool.

A B+ Tree is designed specifically to match this block structure. Each node in the tree is mapped to a single 16 KB page. Because internal nodes do not store actual row data (only keys and page pointers), they have a massive fan-out (the number of child pointers per node):

  • Assuming a key is 8 bytes and a child pointer is 8 bytes, an internal node slot is 16 bytes.
  • A single 16 KB page can hold approximately $16384 / 16 \approx 1000$ child pointers.

This high fan-out keeps the tree very flat:

  • Height 1 (Root only): Holds 1,000 leaf pointers.
  • Height 2 (Root + 1 internal layer): Holds $1,000 \times 1,000 = 1,000,000$ leaf pointers.
  • Height 3 (Root + 2 internal layers): Holds $1,000 \times 1,000 \times 1,000 = 1,000,000,000$ (1 billion) leaf pointers.

A tree height of 3 is sufficient to index 1 billion rows. This means the database can locate any record among 1 billion rows by reading only 3 pages from disk, keeping disk seek latency extremely low.

Performance Analysis of Tree Depth and Index Scans

The time complexity of a B+ Tree lookup is $O(\log_B N)$, where $B$ is the fan-out (node capacity) and $N$ is the number of keys. Because $B$ is large (typically $\ge 1000$), the log base is large, resulting in a very flat search complexity compared to binary search trees ($O(\log_2 N)$).

The table below contrasts B+ Trees with other tree architectures under database workloads.

FeatureB+ TreeBinary Search Tree (AVL/Red-Black)Hash Index
Lookup Complexity$O(\log_B N)$$O(\log_2 N)$$O(1)$
Range Query SpeedExtremely Fast (via leaf linked list)Slow (requires tree traversals)Not Supported (hash values are unsorted)
Disk I/O CountLow (3-4 reads for 1B rows)High (20-30 reads for 1B rows)Low (1 read)
Memory FootprintLow (internal pages are small)High (every node contains pointers)High (requires large hash bucket array)

🌍 Real-World Implementation: MySQL InnoDB and PostgreSQL

Most modern relational databases implement B+ Trees as their primary indexing structure.

In MySQL's InnoDB storage engine, tables are organized as a clustered index. The leaf nodes of the primary key index do not just hold pointers to rowsβ€”they store the actual row data itself. If you query by primary key, the lookup returns the row immediately. Secondary indexes (e.g., an index on email) contain the secondary key and point to the primary key, requiring a second lookup (index traversal) to resolve the full row.

In PostgreSQL, indexes are non-clustered. The leaf nodes of the B+ Tree (called nbtree) hold pointers to the physical page locations (TIDs) of the rows stored in the main table heap.


βš–οΈ Trade-offs and Failure Modes: Write Amplification vs. Search Speed

While B+ Trees provide fast reads, they introduce trade-offs for write operations:

  • Write Amplification (Page Splits): When inserting rows in random order, pages fill up quickly and split. Writing a single 100-byte row can force the database to allocate, split, and write multiple 16 KB pages back to disk, causing high disk write amplification.
  • Index Overhead: Every index on a table must be updated during inserts, updates, and deletes. Having too many indexes significantly degrades write throughput.

🧭 Decision Guide: Clustered vs. Non-Clustered Indexes

Use this decision table to guide index design in relational databases.

SituationRecommendationAlternative
Frequently queried range scans or primary key lookupsClustered Index (Primary Key)Keep primary key data types small to minimize secondary index overhead.
Lookups on non-unique columns (e.g., name, status)Non-Clustered Secondary IndexCover frequently queried fields using composite indexes.
High write throughput with low read volumeAppend-Only Log / LSM TreeAvoids page split overhead (common in timeseries DBs).

πŸ§ͺ Practical Implementation: MySQL InnoDB Page Sizing

In MySQL InnoDB, the default page size is 16 KB, which is optimal for most OLTP workloads. However, for systems with massive row widths or specialized high-throughput storage, page sizes can be adjusted to optimize alignment with hardware blocks.

-- View the current InnoDB page size configuration in MySQL
SHOW VARIABLES LIKE 'innodb_page_size';

If the variable returns 16384, it indicates a 16 KB page size. To change this, you must configure the database during bootstrap:

# MySQL Configuration File (my.cnf)
[mysqld]
innodb_page_size = 32K

Setting innodb_page_size = 32K increases the node fan-out, which can flatten the tree further for massive datasets, but it increases the RAM requirements for the buffer pool.


πŸ“š Lessons Learned: Indexing Antipatterns in Production

Avoid these indexing mistakes when scaling database architectures:

  1. UUIDs as Primary Keys: Using random UUIDs as primary keys in a clustered index causes inserts to target random leaf pages. This triggers constant page splits and database fragmentation. Use sequential UUIDs (like UUIDv7) instead.
  2. Indexing Low-Cardinality Columns: Creating an index on columns with low cardinality (e.g., gender, is_active) is useless. The database optimizer will ignore the index and run a full table scan because index traversal overhead exceeds the cost of a direct scan.
  3. Unused Indexes: Every index consumes disk space and slows down write operations. Audit index usage regularly and drop unused indexes.

πŸ“Œ Summary: The B+ Tree Indexing Rules

  • Wide Fan-out: B+ Tree nodes match disk blocks, keeping the tree flat (height 3-4) to minimize disk reads.
  • Leaf Linked List: Doubly linked leaf nodes allow fast range scans without traversing parent nodes.
  • Separated Storage: Internal nodes hold only keys and pointers, maximizing node capacity.
  • Write Overhead: Page splits create disk write amplification, meaning index counts must be managed carefully.
  • Sequential Keys: Always use sequential primary keys to prevent page fragmentation.

AI-generated article quiz

Test your understanding

🧠

Ready to test what you just learned?

Generate four focused questions from this article. Answers include immediate explanations.

Guided series path

How It Works: Internals Explained

View all lessons β†’
Lesson 25 of 34

Reader feedback

Was this article useful?

Rate it if it helped, then continue with the next deep dive when you are ready.

Sign in to save your rating.