All Posts

Understanding Inverted Index and Its Benefits in Software Development

Why is Google so fast? It doesn't scan every webpage. It uses an Inverted Index. We explain how t...

Abstract AlgorithmsAbstract Algorithms
ยทยท5 min read
Share
Share on X / Twitter
Share on LinkedIn
Copy link

TLDR: An Inverted Index maps every word to the list of documents containing it โ€” the same structure as the back-of-the-book index. It is the core data structure behind every full-text search engine, including Elasticsearch, Lucene, and PostgreSQL full-text search.


๐Ÿ“– The Textbook Index You Already Know

Open any textbook. Flip to the index at the back. Under "Binary Tree" you see: pages 42, 78, 203. That listing tells you exactly which pages to go to without reading every page.

A database full-text search without an inverted index would mean reading every document to find which ones contain "Binary Tree." For a billion documents, that's O(N).

An inverted index turns that into O(1) lookup followed by O(k log k) sort, where k = number of matching documents.


๐Ÿ”ข How an Inverted Index Is Built

Given three documents:

Doc 1: "Apple is fruit"
Doc 2: "Banana is fruit"
Doc 3: "Apple and Banana are both fruit"

Step 1 โ€” Tokenize: Split into tokens. Apply stemming ("fruits" โ†’ "fruit") and stop word removal ("is", "are", "and" removed).

Step 2 โ€” Build posting lists:

WordPosting List
apple[1, 3]
banana[2, 3]
fruit[1, 2, 3]

Each posting list entry also stores the position of the word in the document (for phrase queries) and the frequency (for TF-IDF ranking).

Step 3 โ€” Sort posting lists by document ID. Sorted order is critical for fast intersection.


โš™๏ธ Executing a Query: AND, OR, and Phrase

AND Query: "Apple AND Banana"

Intersect the two posting lists:

apple:  [1, 3]
banana: [2, 3]
intersection: [3]

Since both lists are sorted, a simple two-pointer merge achieves O(m + n) intersection โ€” no nested loops.

i โ†’ 1, j โ†’ 2:  1 < 2, advance i
i โ†’ 3, j โ†’ 2:  3 > 2, advance j
i โ†’ 3, j โ†’ 3:  MATCH โ†’ output Doc 3

OR Query: "Apple OR Banana"

Union: [1, 2, 3] โ€” all documents containing either word.

Phrase Query: "Apple Banana" (adjacent)

Use position data: find documents in both lists where apple's position + 1 = banana's position.


๐Ÿง  Skip Pointers: Speeding Up Large Posting List Intersection

When posting lists are large (millions of documents), pointer-by-pointer traversal is slow.

Skip pointers index every $\sqrt{n}$ entries in the list, allowing jumps:

apple: [1, 3, 7, 11, 15, 22, ...] with skip pointer at 7 โ†’ jump over 3,7 if needed

If apple[i]=3 and banana[j]=22, the skip pointer can jump directly to the first entry โ‰ฅ 22 in the apple list without touching 7, 11, 15.


โš™๏ธ TF-IDF: Why "the" Ranks Lower Than "quantum"

A search for "quantum computing" should rank documents that specifically discuss quantum computing higher than documents that just happen to contain the word "quantum" once.

TF-IDF (Term Frequency โ€“ Inverse Document Frequency):

$$\text{score}(d, t) = TF(d, t) \times IDF(t)$$

$$TF(d, t) = \frac{\text{count of } t \text{ in doc } d}{\text{total tokens in doc } d}$$

$$IDF(t) = \log \frac{N}{\text{docs containing } t}$$

Intuition:

  • TF: High if the term appears often in this document.
  • IDF: High if the term appears in few documents overall (rare = discriminative).

"The" appears in every document โ†’ IDF โ‰ˆ 0 โ†’ score โ‰ˆ 0. "Quantum" appears in 100/1M documents โ†’ high IDF โ†’ high score in docs with high TF.


โš–๏ธ Inverted Index in Elasticsearch / Lucene

Elasticsearch (built on Lucene) stores inverted indexes per shard:

flowchart LR
    Doc["New Document\n(JSON)"] --> Analyze["Text Analyzer\n(tokenize, stem, stop-words)"]
    Analyze --> Index["Inverted Index\n(term โ†’ [docId, position, freq])"]
    Index --> Segment["Immutable Segment\nflushed to disk"]
    Segment --> Merge["Merge Segments\n(background, periodic)"]

Segments are immutable โ€” new documents go into a new segment, not into existing ones. Periodic merging consolidates segments for query efficiency.


๐Ÿ“Œ Summary

  • An Inverted Index maps term โ†’ sorted list of document IDs (posting list).
  • AND queries use two-pointer sorted intersection (O(m+n)).
  • Skip pointers enable O(โˆšn) jumps in large posting lists.
  • TF-IDF scores documents by term frequency within the document ร— rarity across all documents.
  • Elasticsearch stores indexes in immutable segments โ€” new writes go to new segments; background merge consolidates them.

๐Ÿ“ Practice Quiz

  1. You search for "Apple AND Banana." Apple's posting list is [1, 3] and Banana's is [2, 3]. What is the result?

    • A) [1, 2, 3] โ€” all documents mentioning either word.
    • B) [3] โ€” only the document containing both words.
    • C) [1] โ€” the first document in the Apple list.
      Answer: B
  2. Why does "the" score near zero in TF-IDF, even if it appears 100 times in a document?

    • A) Stop word removal always filters "the" before indexing.
    • B) IDF is near zero because "the" appears in nearly every document, making it non-discriminative โ€” even high TF can't rescue it.
    • C) TF penalizes words that appear too frequently.
      Answer: B
  3. Why does Elasticsearch write new documents to new immutable segments instead of updating the existing segment?

    • A) To avoid disk fragmentation.
    • B) Immutability means no locking during reads โ€” existing segments can be searched concurrently while new segments are written.
    • C) To support time-series indexing.
      Answer: B

Abstract Algorithms

Written by

Abstract Algorithms

@abstractalgorithms