
System Design HLD Example: Web Crawler

A practical interview-ready HLD for a distributed web crawler handling politeness, deduplication, and scale.

Abstract Algorithms · 14 min read

TLDR: A distributed web crawler must balance global throughput with per-domain politeness. The architectural crux is the URL Frontier, which manages priority and rate-limiting across a distributed fetcher pool. By combining Bloom Filters for URL deduplication and SimHash for content near-duplicate detection, the system can crawl billions of pages while maintaining a minimal storage footprint and respecting the stability of the target websites.

πŸ•·οΈ The Infinite Web Problem

Imagine you’ve built a simple crawler. It works great on your personal blog. You point it at a small news site, and it finishes in minutes. Then, you point it at a major e-commerce site with millions of products.

Within seconds, your crawler starts hitting the "Calendar Trap." The site has a calendar page for every product, and your crawler is clicking "Next Month" infinitely. Meanwhile, the site's webmaster is seeing 5,000 requests per second from your IP address. Their server CPU spikes to 100%, their database locks up, and their legitimate customers can no longer buy products.

You haven't just crawled the site; you've accidentally launched a Distributed Denial of Service (DDoS) attack.

The design paradox at the heart of a web crawler is this: you need to be as fast as possible globally, but as slow as necessary per domain. If you design for speed, you break the web's politeness rules. If you design for politeness, you'll never finish crawling the billions of pages that exist. This tension drives every architectural decision in a production-grade web crawler. At 1 billion pages per month, failure to handle these constraints doesn't just result in a slow systemβ€”it results in being blacklisted by ISPs and potential legal ramifications.

πŸ“– Web Crawler: Use Cases & Requirements

Actors & Journeys

  • Search Engine Indexer: The main actor that consumes crawled content to build a search index. It provides "Seed URLs" to start the process.
  • Webmaster: The entity that owns the content being crawled. They communicate via robots.txt to define which parts of their site are off-limits.
  • Data Scientist: Analyzes web-scale data for trends, using the crawler's output as a raw data lake.

In/Out Scope

  • In-Scope: Distributed crawling, URL deduplication, content near-duplicate detection, politeness, and robots.txt handling.
  • Out-of-Scope: Building the search index itself (Indexing), real-time ranking algorithms, and deep-web crawling (password-protected sites).

Functional Requirements

  • Scalable Discovery: Identify and extract all valid hyperlinks (HTTP/HTTPS) from fetched pages.
  • Polite Fetching: Ensure the system respects robots.txt and never exceeds a specified QPS (Queries Per Second) per domain.
  • Deduplication: Skip URLs already crawled and content that is "substantially similar" to previously indexed pages.
  • Fault Tolerance: If a crawler node fails, another must pick up the work without losing the frontier state.

Non-Functional Requirements (NFRs)

  • High Throughput: Handle 1 billion pages/month (≈385 QPS globally).
  • Scalability: The architecture must scale horizontally by adding more fetcher nodes.
  • Extensibility: The modular design should allow adding support for PDF, FTP, or other protocols with minimal refactoring.
  • Robustness: Gracefully handle "crawler traps" (infinite loops), malformed HTML, and slow network responses.

πŸ” Foundations: How Crawling Actually Works

At its most basic level, a web crawler is a graph traversal engine. The web is a directed graph where pages are nodes and hyperlinks are edges.

The baseline architecture follows a simple loop:

  1. Pick a URL from the list of known URLs (the "Frontier").
  2. Download the page from the internet.
  3. Extract new URLs found in that page.
  4. Filter & Store those URLs back into the Frontier.

However, moving from a single-threaded script to a production system requires solving "The Three Ps": Persistence (storing billions of URLs), Politeness (not overwhelming servers), and Parallelism (using hundreds of nodes). Without a structured "Frontier" management system, your crawler would either run out of memory or get blocked by every major firewall on the planet.

βš™οΈ The Fetcher Mechanics: DNS & URL Normalization

The mechanism of fetching isn't as simple as GET /. To operate at scale, we must optimize the "pre-fetch" phase.

  • URL Normalization: Before checking if a URL has been crawled, we must normalize it. This includes converting to lowercase, removing default ports (e.g., :80), and stripping session IDs or tracking parameters (e.g., ?utm_source=...). Without this, the crawler would fetch the same page thousands of times.
  • Asynchronous DNS: Standard DNS lookups are synchronous and blocking. A production fetcher uses an asynchronous DNS resolver (like Netty's) and maintains a local cache of IP addresses to avoid hitting public DNS servers 400 times a second.
  • Protocol Agnosticism: The fetcher should be a pluggable module. While 99% of the web is HTTP/HTTPS, a robust crawler should be able to handle FTP or Gopher by simply swapping the protocol handler.
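A minimal normalization pass using only the standard library can illustrate the first bullet. The tracking-parameter list and the sorting of remaining query keys are assumptions of this sketch; production crawlers maintain much larger, per-site rule sets.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative list of parameters to strip; real crawlers use larger rule sets.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sid"}
DEFAULT_PORTS = {"http": "80", "https": "443"}

def normalize_url(url: str) -> str:
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # Drop the default port (http://x.com:80 -> http://x.com); keep others.
    if parts.port is not None and str(parts.port) != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Strip tracking/session parameters; sort the rest so equivalent URLs
    # with reordered queries normalize to the same string.
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query)
        if k.lower() not in TRACKING_PARAMS
    ))
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, query, ""))
```

Note that only the scheme and host are lowercased; paths are case-sensitive on most servers and must be left alone.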

πŸ“ Estimations & Design Goals

Capacity Math (The 1B Page Scale)

To crawl 1 billion pages per month, we need to understand the physical constraints of our hardware and network.

  • Global QPS: $1,000,000,000 \text{ pages} / 30 \text{ days} / 86,400 \text{ seconds} \approx \mathbf{385 \text{ pages/sec}}$.
  • Storage (Raw Content): Assuming 250 KB per page (HTML + metadata), $1B \times 250KB = \mathbf{250 \text{ TB/month}}$. We likely need S3 or a distributed file system (HDFS) for this.
  • URL Storage: If a URL is 100 bytes on average, 1B URLs = 100 GB.
  • Bloom Filter Memory: To keep track of 10B URLs (including discovered but not yet crawled) with a 1% false positive rate, we need $\approx 12 \text{ GB}$ of RAM.
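The figures above can be reproduced with the standard Bloom filter sizing formula, $m = -n \ln(p) / (\ln 2)^2$ bits for $n$ items at false-positive rate $p$:

```python
import math

PAGES_PER_MONTH = 1_000_000_000
SECONDS_PER_MONTH = 30 * 86_400

qps = PAGES_PER_MONTH / SECONDS_PER_MONTH           # ~386 pages/sec
raw_storage_tb = PAGES_PER_MONTH * 250_000 / 1e12   # 250 TB/month at 250 KB/page

# Bloom filter sizing for 10B URLs at a 1% false-positive rate.
n, p = 10_000_000_000, 0.01
bits = -n * math.log(p) / math.log(2) ** 2
bloom_gb = bits / 8 / 1e9                           # ~12 GB of RAM

print(f"{qps:.0f} QPS, {raw_storage_tb:.0f} TB/month, {bloom_gb:.1f} GB bloom filter")
```

Running the numbers like this in an interview also shows why a hash set is off the table: 10B URLs at ~100 bytes each would need ~1 TB of RAM versus ~12 GB for the filter.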

Scaling Targets

  • Minimum Delay per Domain: at least 5 seconds between successive requests to the same host (a common politeness default).
  • DNS Resolution: Must be $< 10ms$ (requires local caching/asynchronous resolvers).
  • Fetcher Utilization: Target 80% CPU/Network utilization per node.

πŸ“Š High-Level Design: The Distributed Crawl Architecture

The following diagram shows the separation between the URL management (The Frontier) and the execution (The Fetchers).

graph TD
    Seeds[Seed URLs] --> Frontier[URL Frontier]
    Frontier --> Fetchers[Distributed Fetcher Pool]
    Fetchers --> DNS[Local DNS Resolver]
    Fetchers --> Robots[Robots.txt Cache]
    Fetchers --> ContentStore[(Content Store: S3)]
    Fetchers --> ContentDedupe[SimHash Engine]
    ContentDedupe --> LinkExt[Link Extractor]
    LinkExt --> URLDedupe[URL Dedupe: Bloom Filter]
    URLDedupe --> Frontier

Explanation of the Architecture: The architecture uses a Centralized Frontier and Decentralized Fetchers. The Frontier acts as the orchestrator, maintaining a prioritized list of URLs across many domains. Fetchers pull URLs from the Frontier, resolve them via a local DNS cache, and fetch the HTML. The content is then checked for uniqueness using SimHash; if unique, links are extracted and passed through a Bloom Filter to prevent re-adding already-known URLs back into the Frontier. This circular pipeline ensures the crawler is self-sustaining and efficient.

πŸ”Œ API Design: The Crawler Control Plane

While the crawler is largely autonomous, it needs a control plane for management and monitoring.

| Endpoint | Method | Payload | Description |
|---|---|---|---|
| /v1/seeds | POST | {"urls": ["site1.com", "site2.com"]} | Add new starting points to the frontier. |
| /v1/stats | GET | N/A | Get global crawl status (QPS, error rates, frontier size). |
| /v1/domains/{host}/blacklist | POST | {"reason": "trap"} | Manually block a domain from being crawled. |
| /v1/frontier/pause | POST | {} | Halt all fetching activities immediately. |

πŸ—„οΈ Data Model: Schema Definitions

URL Metadata Store (Apache Cassandra)

We need a distributed wide-column store to track the status and history of every URL. Cassandra is chosen for its write-heavy optimization.

| Table | Column | Type | Notes |
|---|---|---|---|
| urls | url_hash | BYTE[20] | SHA-1 hash of the URL (row key). |
| urls | url | TEXT | The full normalized URL. |
| urls | last_crawl | TIMESTAMP | When it was last fetched. |
| urls | status | ENUM | DISCOVERED, FETCHED, FAILED. |
| urls | priority | INT | Calculated priority score (0-10). |

Robots.txt Cache (Redis)

To ensure politeness, we cache the robots.txt directives for every domain.

  • Key: robots:{domain_name}
  • Value: JSON blob of rules and Crawl-delay.
  • TTL: 24 hours (refreshed daily).
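A process-local sketch of the same cache contract (one entry per domain, 24-hour TTL) can be built with Python's stdlib robots.txt parser. A dict stands in for Redis here, and `fetch_robots` is a caller-supplied stand-in for the HTTP fetch of `/robots.txt`.

```python
import time
from urllib import robotparser

ROBOTS_TTL = 24 * 3600  # 24 hours, matching the cache design above

class RobotsCache:
    """In-process stand-in for the Redis cache: one entry per domain, 24h TTL."""

    def __init__(self, fetch_robots):
        self._fetch = fetch_robots      # callable: domain -> robots.txt body
        self._cache = {}                # domain -> (expires_at, parsed rules)

    def allowed(self, domain: str, path: str, agent: str = "mybot") -> bool:
        now = time.time()
        entry = self._cache.get(domain)
        if entry is None or entry[0] < now:     # miss or expired: refetch
            parser = robotparser.RobotFileParser()
            parser.parse(self._fetch(domain).splitlines())
            entry = (now + ROBOTS_TTL, parser)
            self._cache[domain] = entry
        return entry[1].can_fetch(agent, f"https://{domain}{path}")
```

The important property is that the fetchers never hit a live `/robots.txt` on the hot path; they consult the cache, and only a miss or expiry triggers a refetch.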

πŸ”§ Tech Stack & Design Choices

| Component | Choice | Rationale |
|---|---|---|
| URL Frontier | Kafka + Redis | Kafka for distributed queuing; Redis for per-domain rate-limiting state. |
| Content Store | AWS S3 | Standard for durability and massive blob storage at low cost. |
| URL Deduplication | Redis Bloom Filter | Space-efficient, $O(1)$ lookup for billions of items in RAM. |
| Metadata DB | Apache Cassandra | Write-optimized, scales linearly with nodes; handles sparse metadata well. |
| Language | Java (Netty) | High-performance non-blocking I/O; massive ecosystem for crawler libraries. |
| Parsing Engine | Jsoup / Tika | Robust HTML parsing and metadata extraction. |

🧠 Design Deep Dive

πŸ›‘οΈ Internals: The URL Frontier (Priority & Politeness)

The Frontier is the brain of the system. It must solve two conflicting problems: Politeness (don't DDoS a site) and Priority (crawl the best pages first).

We implement this using a two-tiered queueing structure:

  1. Priority Queues (Front): URLs are assigned a score (0-10) based on domain authority and PageRank. High-priority URLs go into a high-priority Kafka topic.
  2. Politeness Queues (Back): Once a URL is picked from the Front Queues, it's mapped to a "Back Queue" based on its domain. A Back Queue Selector ensures that only one worker is pulling from a specific domain's queue at a time, and it enforces a sleep time (the Crawl-delay) between requests to that host using Redis EXPIRE keys as semaphores.
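The per-domain semaphore in step 2 can be sketched as follows. An in-memory dict stands in for the Redis `SET key NX EX delay` pattern, and the 5-second delay is the illustrative default from the scaling targets; if the key (slot) is still live, the URL must wait.

```python
import time
from typing import Optional

class PolitenessGate:
    """Per-domain cooldown, emulating a Redis key with NX + EXPIRE semantics.

    try_acquire returns True and starts the cooldown if the domain is free;
    it returns False while another fetch of that domain is still cooling down.
    """

    def __init__(self, delay_seconds: float = 5.0):
        self.delay = delay_seconds
        self._locked_until = {}     # domain -> epoch time when the "key" expires

    def try_acquire(self, domain: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        if self._locked_until.get(domain, 0.0) > now:
            return False            # key still exists: domain is cooling down
        self._locked_until[domain] = now + self.delay
        return True
```

In the real system the atomicity of `SET NX EX` is what makes this safe across many fetcher nodes; the dict version only illustrates the time-window logic.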

πŸ“Š Performance Analysis: DNS and Deduplication

  • The DNS Bottleneck: A standard gethostbyname() call is blocking and slow. Our fetchers use an Asynchronous DNS Resolver that keeps thousands of domain-to-IP mappings in memory.
  • SimHash for Content: To avoid storing 50 copies of the same news article (reposted on different blogs), we generate a 64-bit SimHash. By comparing the Hamming Distance (number of bit flips) between hashes, we can detect if two pages are 95% similar and discard the duplicate without expensive string comparisons.
  • SLO Targets: 95th percentile fetch time $< 500ms$. Bloom filter false positive rate $\le 1\%$ (the rate assumed in the capacity math).
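A toy SimHash over whitespace tokens illustrates the Hamming-distance comparison described above. Real systems hash weighted shingles rather than raw tokens, and the duplicate threshold of 3 bits is an assumption of this sketch.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """64-bit SimHash over whitespace tokens (a deliberately toy feature set)."""
    vector = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1   # vote each bit up or down
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    """Similar documents produce hashes only a few bit-flips apart."""
    return hamming_distance(simhash(a), simhash(b)) <= threshold
```

The payoff is that similarity checking becomes an XOR and a popcount over 64-bit integers instead of a string comparison over full page bodies.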

🌍 Real-World Applications

Crawler architectures are foundational to several massive industries:

  • Search Engines: Google and Bing use crawlers as the primary data ingestion path.
  • Price Aggregators: Sites like Kayak or Honey crawl airline and e-commerce sites to find the best deals.
  • Web Archiving: The Internet Archive's "Wayback Machine" uses crawlers to preserve the history of the web.
  • Security Auditing: Vulnerability scanners crawl internal networks to map attack surfaces.

βš–οΈ Trade-offs & Failure Modes

  1. Freshness vs. Depth: A crawler cannot crawl everything instantly. We trade off depth (crawling every page on a small site) for freshness (re-crawling the CNN homepage every 5 minutes).
  2. Bloom Filter Accuracy vs. Memory: A Bloom filter can have false positives, meaning it might think a URL is crawled when it isn't. We accept this small loss of coverage to save terabytes of RAM.
  3. Failure Mode: Infinite Loops. A "Calendar Trap" can fill the Frontier with junk. We mitigate this with max-depth limits and pattern-based URL exclusion.
  4. Failure Mode: IP Blacklisting. If our politeness logic fails, our IPs will be blocked. We use a distributed IP pool and rotation to minimize the impact of single-domain blocks.
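The max-depth and pattern-based mitigation from point 3 can be sketched as a frontier-side filter. Both the depth limit and the patterns below are illustrative assumptions; production systems learn per-site rules over time.

```python
import re
from urllib.parse import urlsplit

# Patterns that commonly indicate crawler traps; the list is illustrative.
TRAP_PATTERNS = [
    re.compile(r"/calendar/\d{4}/\d{2}"),   # infinite "next month" pages
    re.compile(r"(/[^/]+)\1{2,}"),          # same path segment repeated 3+ times
]
MAX_DEPTH = 15                              # max path segments; an assumed limit

def looks_like_trap(url: str) -> bool:
    """Cheap heuristic applied before a URL is admitted to the Frontier."""
    path = urlsplit(url).path
    if path.count("/") > MAX_DEPTH:         # suspiciously deep paths
        return True
    return any(p.search(path) for p in TRAP_PATTERNS)
```

A rejected URL costs one regex pass; a trap that reaches the Frontier can cost millions of wasted fetches, so this check sits on the cheap side of the pipeline.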

πŸ—οΈ Advanced Concepts for Production Evolution

  1. JavaScript Rendering: Many modern sites (React/Vue) are empty shells without JS. A production crawler uses a Headless Chrome (Puppeteer/Playwright) Cluster to render pages before extracting links, though this is 100x more expensive than simple HTML fetching.
  2. Adaptive Re-crawl: Instead of a fixed TTL, we use machine learning to predict how often a page changes. A news site changes every hour; a blog might change once a month.
  3. Checkpointing & Snapshots: The Frontier state is massive. We use Kafka's consumer offsets and periodic Cassandra snapshots so that if the whole cluster restarts, we can resume within minutes.
  4. Mobile vs. Desktop Crawling: Different user-agents trigger different content. We crawl with both to ensure mobile-first indexing parity.

🧭 Decision Guide

| Situation | Recommendation |
|---|---|
| Small scale (<1M pages) | Use a single-node library like Scrapy with a Postgres backend. |
| Enterprise scale (1B+ pages) | Use a distributed architecture with a Kafka Frontier and S3 storage. |
| High JS content | Use a headless browser pool; expect 10x higher costs. |
| Real-time discovery | Use RSS feeds or sitemaps as the primary seed source instead of raw crawling. |

πŸ§ͺ Practical Example: Interview Delivery

If asked to design this in a 20-minute interview, focus on these three beats:

  1. The Politeness Constraint: Start with the "Calendar Trap" scenario. Explain why a naive crawler is dangerous and how the Back-Queue selector solves it.
  2. The Bloom Filter Trade-off: Explain why you use a Bloom filter for URL deduplication. Admit that it has false positives but argue that it's the only way to handle billions of URLs in memory.
  3. The Distributed Frontier: Sketch how Kafka allows you to separate the discovery of links from the fetching of content, allowing each to scale independently.

Standard Interview Closer: "I chose a distributed architecture with a Kafka-based Frontier because it allows us to handle the massive I/O of 1 billion pages per month while maintaining strict politeness via domain-sharded queues. We accept eventual coverage loss due to Bloom Filter false positives to ensure our memory footprint remains manageable at scale."

πŸ› οΈ Apache Nutch: How It Solves This in Practice

Apache Nutch is the industry-standard open-source crawler. It uses a Hadoop-style architecture to handle massive crawls.

Nutch Politeness Configuration (nutch-site.xml)

<!-- nutch-site.xml configuration for politeness -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds between successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>Maximum threads fetching from a single host simultaneously.</description>
</property>

<property>
  <name>http.robots.cache.ttl</name>
  <value>86400000</value> <!-- 24 hours in ms -->
</property>

Nutch handles "Politeness" by partitioning the fetch list by host and ensuring that each fetcher thread processes URLs from a single host sequentially, respecting the fetcher.server.delay.

πŸ“š Lessons Learned

  • Architecture is about Legal Boundaries: robots.txt isn't just a suggestion; it's a protocol for staying out of legal trouble and respecting the web.
  • Memory is your biggest cost: At scale, storing raw strings for billions of URLs is impossible. Hashing and Bloom filters are mandatory.
  • Don't trust the network: Use aggressive timeouts (e.g., 10 seconds) and retry logic with exponential backoff for failed fetches.

πŸ“Œ Summary & Key Takeaways

  • URL Frontier: The core orchestrator that manages priority and domain-based politeness.
  • Bloom Filters: The standard for space-efficient URL deduplication in RAM.
  • SimHash: Essential for detecting "near-duplicate" content and avoiding index bloat.
  • Asynchronous I/O: Using Java/Netty to handle thousands of concurrent network connections without thread exhaustion.

πŸ“ Practice Quiz

  1. Why is a Bloom Filter preferred over a Hash Set for URL deduplication at a 10B URL scale?

    • A) It is faster to write to.
    • B) It uses significantly less memory (bits vs. bytes per entry).
    • C) It guarantees 100% accuracy.

    Correct Answer: B
  2. What is the primary role of the "Back Queue" in the URL Frontier?

    • A) To store failed URLs for retry.
    • B) To ensure politeness by serializing requests to the same domain.
    • C) To cache content before it is sent to S3.

    Correct Answer: B
  3. How does SimHash help reduce the storage footprint of a crawler?

    • A) It compresses the HTML content.
    • B) It identifies near-duplicate content so it can be discarded before storage.
    • C) It encrypts the data for faster transmission.

    Correct Answer: B
  4. [Open-ended] Describe a scenario where a Bloom Filter's "false positive" would be a major problem for a crawler. How would you adjust the architecture to fix it for that specific case?
