
System Design HLD Example: Web Crawler

A practical interview-ready HLD for a distributed web crawler handling politeness, deduplication, and scale.

Abstract Algorithms · 14 min read

TLDR: A distributed web crawler must balance global throughput with per-domain politeness. The architectural crux is the URL Frontier, which manages priority and rate-limiting across a distributed fetcher pool. By combining Bloom Filters for URL deduplication and SimHash for content near-duplicate detection, the system can crawl billions of pages while maintaining a minimal storage footprint and respecting the stability of the target websites.

πŸ•·οΈ The Infinite Web Problem

Imagine you’ve built a simple crawler. It works great on your personal blog. You point it at a small news site, and it finishes in minutes. Then, you point it at a major e-commerce site with millions of products.

Within seconds, your crawler starts hitting the "Calendar Trap." The site has a calendar page for every product, and your crawler is clicking "Next Month" infinitely. Meanwhile, the site's webmaster is seeing 5,000 requests per second from your IP address. Their server CPU spikes to 100%, their database locks up, and their legitimate customers can no longer buy products.

You haven't just crawled the site; you've accidentally launched a Distributed Denial of Service (DDoS) attack.

The design paradox at the heart of a web crawler is this: you need to be as fast as possible globally, but as slow as necessary per domain. If you design for speed, you break the web's politeness rules. If you design for politeness, you'll never finish crawling the billions of pages that exist. This tension drives every architectural decision in a production-grade web crawler. At 1 billion pages per month, failure to handle these constraints doesn't just result in a slow systemβ€”it results in being blacklisted by ISPs and potential legal ramifications.

πŸ“– Web Crawler: Use Cases & Requirements

Actors & Journeys

  • Search Engine Indexer: The main actor that consumes crawled content to build a search index. It provides "Seed URLs" to start the process.
  • Webmaster: The entity that owns the content being crawled. They communicate via robots.txt to define which parts of their site are off-limits.
  • Data Scientist: Analyzes web-scale data for trends, using the crawler's output as a raw data lake.

In/Out Scope

  • In-Scope: Distributed crawling, URL deduplication, content near-duplicate detection, politeness, and robots.txt handling.
  • Out-of-Scope: Building the search index itself (Indexing), real-time ranking algorithms, and deep-web crawling (password-protected sites).

Functional Requirements

  • Scalable Discovery: Identify and extract all valid hyperlinks (HTTP/HTTPS) from fetched pages.
  • Polite Fetching: Ensure the system respects robots.txt and never exceeds a specified QPS (Queries Per Second) per domain.
  • Deduplication: Skip URLs already crawled and content that is "substantially similar" to previously indexed pages.
  • Fault Tolerance: If a crawler node fails, another must pick up the work without losing the frontier state.

Non-Functional Requirements (NFRs)

  • High Throughput: Handle 1 billion pages/month (≈385 QPS globally).
  • Scalability: The architecture must scale horizontally by adding more fetcher nodes.
  • Extensibility: The modular design should allow adding support for PDF, FTP, or other protocols with minimal refactoring.
  • Robustness: Gracefully handle "crawler traps" (infinite loops), malformed HTML, and slow network responses.

πŸ” Foundations: How Crawling Actually Works

At its most basic level, a web crawler is a graph traversal engine. The web is a directed graph where pages are nodes and hyperlinks are edges.

The baseline architecture follows a simple loop:

  1. Pick a URL from the list of known URLs (the "Frontier").
  2. Download the page from the internet.
  3. Extract new URLs found in that page.
  4. Filter & Store those URLs back into the Frontier.

However, moving from a single-threaded script to a production system requires solving "The Three Ps": Persistence (storing billions of URLs), Politeness (not overwhelming servers), and Parallelism (using hundreds of nodes). Without a structured "Frontier" management system, your crawler would either run out of memory or get blocked by every major firewall on the planet.

βš™οΈ The Fetcher Mechanics: DNS & URL Normalization

The mechanism of fetching isn't as simple as GET /. To operate at scale, we must optimize the "pre-fetch" phase.

  • URL Normalization: Before checking if a URL has been crawled, we must normalize it. This includes converting to lowercase, removing default ports (e.g., :80), and stripping session IDs or tracking parameters (e.g., ?utm_source=...). Without this, the crawler would fetch the same page thousands of times.
  • Asynchronous DNS: Standard DNS lookups are synchronous and blocking. A production fetcher uses an asynchronous DNS resolver (like Netty's) and maintains a local cache of IP addresses to avoid hitting public DNS servers 400 times a second.
  • Protocol Agnosticism: The fetcher should be a pluggable module. While 99% of the web is HTTP/HTTPS, a robust crawler should be able to handle FTP or Gopher by simply swapping the protocol handler.
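A minimal normalization pass using only the standard library can illustrate the first bullet. The tracking-parameter list and the sorting of remaining query keys are assumptions of this sketch; production crawlers maintain much larger, per-site rule sets.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative list of parameters to strip; real crawlers use larger rule sets.
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "sessionid", "sid"}
DEFAULT_PORTS = {"http": "80", "https": "443"}

def normalize_url(url: str) -> str:
    parts = urlsplit(url.strip())
    scheme = parts.scheme.lower()
    host = parts.hostname.lower() if parts.hostname else ""
    # Drop the default port (http://x.com:80 -> http://x.com); keep others.
    if parts.port is not None and str(parts.port) != DEFAULT_PORTS.get(scheme):
        host = f"{host}:{parts.port}"
    # Strip tracking/session parameters; sort the rest so equivalent URLs
    # with reordered queries normalize to the same string.
    query = urlencode(sorted(
        (k, v) for k, v in parse_qsl(parts.query)
        if k.lower() not in TRACKING_PARAMS
    ))
    path = parts.path or "/"
    return urlunsplit((scheme, host, path, query, ""))
```

Note that only the scheme and host are lowercased; paths are case-sensitive on most servers and must be left alone.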

πŸ“ Estimations & Design Goals

Capacity Math (The 1B Page Scale)

To crawl 1 billion pages per month, we need to understand the physical constraints of our hardware and network.

  • Global QPS: $1,000,000,000 \text{ pages} / 30 \text{ days} / 86,400 \text{ seconds} \approx \mathbf{385 \text{ pages/sec}}$.
  • Storage (Raw Content): Assuming 250 KB per page (HTML + metadata), $1B \times 250KB = \mathbf{250 \text{ TB/month}}$. We likely need S3 or a distributed file system (HDFS) for this.
  • URL Storage: If a URL is 100 bytes on average, 1B URLs = 100 GB.
  • Bloom Filter Memory: To keep track of 10B URLs (including discovered but not yet crawled) with a 1% false positive rate, we need $\approx 12 \text{ GB}$ of RAM.
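The figures above can be reproduced with the standard Bloom filter sizing formula, $m = -n \ln(p) / (\ln 2)^2$ bits for $n$ items at false-positive rate $p$:

```python
import math

PAGES_PER_MONTH = 1_000_000_000
SECONDS_PER_MONTH = 30 * 86_400

qps = PAGES_PER_MONTH / SECONDS_PER_MONTH           # ~386 pages/sec
raw_storage_tb = PAGES_PER_MONTH * 250_000 / 1e12   # 250 TB/month at 250 KB/page

# Bloom filter sizing for 10B URLs at a 1% false-positive rate.
n, p = 10_000_000_000, 0.01
bits = -n * math.log(p) / math.log(2) ** 2
bloom_gb = bits / 8 / 1e9                           # ~12 GB of RAM

print(f"{qps:.0f} QPS, {raw_storage_tb:.0f} TB/month, {bloom_gb:.1f} GB bloom filter")
```

Running the numbers like this in an interview also shows why a hash set is off the table: 10B URLs at ~100 bytes each would need ~1 TB of RAM versus ~12 GB for the filter.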

Scaling Targets

  • Minimum Delay per Domain: at least 5 seconds between successive requests to the same host (a common politeness default).
  • DNS Resolution: Must be $< 10ms$ (requires local caching/asynchronous resolvers).
  • Fetcher Utilization: Target 80% CPU/Network utilization per node.

πŸ“Š High-Level Design: The Distributed Crawl Architecture

The following diagram shows the separation between the URL management (The Frontier) and the execution (The Fetchers).

graph TD
    Seeds[Seed URLs] --> Frontier[URL Frontier]
    Frontier --> Fetchers[Distributed Fetcher Pool]
    Fetchers --> DNS[Local DNS Resolver]
    Fetchers --> Robots[Robots.txt Cache]
    Fetchers --> ContentStore[(Content Store: S3)]
    Fetchers --> ContentDedupe[SimHash Engine]
    ContentDedupe --> LinkExt[Link Extractor]
    LinkExt --> URLDedupe[URL Dedupe: Bloom Filter]
    URLDedupe --> Frontier

Explanation of the Architecture: The architecture uses a Centralized Frontier and Decentralized Fetchers. The Frontier acts as the orchestrator, maintaining a prioritized list of URLs across many domains. Fetchers pull URLs from the Frontier, resolve them via a local DNS cache, and fetch the HTML. The content is then checked for uniqueness using SimHash; if unique, links are extracted and passed through a Bloom Filter to prevent re-adding already-known URLs back into the Frontier. This circular pipeline ensures the crawler is self-sustaining and efficient.

πŸ”Œ API Design: The Crawler Control Plane

While the crawler is largely autonomous, it needs a control plane for management and monitoring.

| Endpoint | Method | Payload | Description |
|---|---|---|---|
| /v1/seeds | POST | {"urls": ["site1.com", "site2.com"]} | Add new starting points to the frontier. |
| /v1/stats | GET | N/A | Get global crawl status (QPS, error rates, frontier size). |
| /v1/domains/{host}/blacklist | POST | {"reason": "trap"} | Manually block a domain from being crawled. |
| /v1/frontier/pause | POST | {} | Halt all fetching activities immediately. |

πŸ—„οΈ Data Model: Schema Definitions

URL Metadata Store (Apache Cassandra)

We need a distributed wide-column store to track the status and history of every URL. Cassandra is chosen for its write-heavy optimization.

| Table | Column | Type | Notes |
|---|---|---|---|
| urls | url_hash | BYTE[20] | SHA-1 hash of the URL (row key). |
| urls | url | TEXT | The full normalized URL. |
| urls | last_crawl | TIMESTAMP | When it was last fetched. |
| urls | status | ENUM | DISCOVERED, FETCHED, FAILED. |
| urls | priority | INT | Calculated priority score (0-10). |

Robots.txt Cache (Redis)

To ensure politeness, we cache the robots.txt directives for every domain.

  • Key: robots:{domain_name}
  • Value: JSON blob of rules and Crawl-delay.
  • TTL: 24 hours (refreshed daily).
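A process-local sketch of the same cache contract (one entry per domain, 24-hour TTL) can be built with Python's stdlib robots.txt parser. A dict stands in for Redis here, and `fetch_robots` is a caller-supplied stand-in for the HTTP fetch of `/robots.txt`.

```python
import time
from urllib import robotparser

ROBOTS_TTL = 24 * 3600  # 24 hours, matching the cache design above

class RobotsCache:
    """In-process stand-in for the Redis cache: one entry per domain, 24h TTL."""

    def __init__(self, fetch_robots):
        self._fetch = fetch_robots      # callable: domain -> robots.txt body
        self._cache = {}                # domain -> (expires_at, parsed rules)

    def allowed(self, domain: str, path: str, agent: str = "mybot") -> bool:
        now = time.time()
        entry = self._cache.get(domain)
        if entry is None or entry[0] < now:     # miss or expired: refetch
            parser = robotparser.RobotFileParser()
            parser.parse(self._fetch(domain).splitlines())
            entry = (now + ROBOTS_TTL, parser)
            self._cache[domain] = entry
        return entry[1].can_fetch(agent, f"https://{domain}{path}")
```

The important property is that the fetchers never hit a live `/robots.txt` on the hot path; they consult the cache, and only a miss or expiry triggers a refetch.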

πŸ”§ Tech Stack & Design Choices

| Component | Choice | Rationale |
|---|---|---|
| URL Frontier | Kafka + Redis | Kafka for distributed queuing; Redis for per-domain rate-limiting state. |
| Content Store | AWS S3 | Standard for durability and massive blob storage at low cost. |
| URL Deduplication | Redis Bloom Filter | Space-efficient, $O(1)$ lookup for billions of items in RAM. |
| Metadata DB | Apache Cassandra | Write-optimized, scales linearly with nodes; handles sparse metadata well. |
| Language | Java (Netty) | High-performance non-blocking I/O; massive ecosystem for crawler libraries. |
| Parsing Engine | Jsoup / Tika | Robust HTML parsing and metadata extraction. |

🧠 Design Deep Dive

πŸ›‘οΈ Internals: The URL Frontier (Priority & Politeness)

The Frontier is the brain of the system. It must solve two conflicting problems: Politeness (don't DDoS a site) and Priority (crawl the best pages first).

We implement this using a two-tiered queueing structure:

  1. Priority Queues (Front): URLs are assigned a score (0-10) based on domain authority and PageRank. High-priority URLs go into a high-priority Kafka topic.
  2. Politeness Queues (Back): Once a URL is picked from the Front Queues, it's mapped to a "Back Queue" based on its domain. A Back Queue Selector ensures that only one worker is pulling from a specific domain's queue at a time, and it enforces a sleep time (the Crawl-delay) between requests to that host using Redis EXPIRE keys as semaphores.
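The per-domain semaphore in step 2 can be sketched as follows. An in-memory dict stands in for the Redis `SET key NX EX delay` pattern, and the 5-second delay is the illustrative default from the scaling targets; if the key (slot) is still live, the URL must wait.

```python
import time
from typing import Optional

class PolitenessGate:
    """Per-domain cooldown, emulating a Redis key with NX + EXPIRE semantics.

    try_acquire returns True and starts the cooldown if the domain is free;
    it returns False while another fetch of that domain is still cooling down.
    """

    def __init__(self, delay_seconds: float = 5.0):
        self.delay = delay_seconds
        self._locked_until = {}     # domain -> epoch time when the "key" expires

    def try_acquire(self, domain: str, now: Optional[float] = None) -> bool:
        now = time.time() if now is None else now
        if self._locked_until.get(domain, 0.0) > now:
            return False            # key still exists: domain is cooling down
        self._locked_until[domain] = now + self.delay
        return True
```

In the real system the atomicity of `SET NX EX` is what makes this safe across many fetcher nodes; the dict version only illustrates the time-window logic.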

πŸ“Š Performance Analysis: DNS and Deduplication

  • The DNS Bottleneck: A standard gethostbyname() call is blocking and slow. Our fetchers use an Asynchronous DNS Resolver that keeps thousands of domain-to-IP mappings in memory.
  • SimHash for Content: To avoid storing 50 copies of the same news article (reposted on different blogs), we generate a 64-bit SimHash. By comparing the Hamming Distance (number of bit flips) between hashes, we can detect if two pages are 95% similar and discard the duplicate without expensive string comparisons.
  • SLO Targets: 95th percentile fetch time $< 500ms$. Bloom filter false positive rate $\le 1\%$ (the rate assumed in the capacity math).
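A toy SimHash over whitespace tokens illustrates the Hamming-distance comparison described above. Real systems hash weighted shingles rather than raw tokens, and the duplicate threshold of 3 bits is an assumption of this sketch.

```python
import hashlib

def simhash(text: str, bits: int = 64) -> int:
    """64-bit SimHash over whitespace tokens (a deliberately toy feature set)."""
    vector = [0] * bits
    for token in text.split():
        h = int.from_bytes(hashlib.md5(token.encode()).digest()[:8], "big")
        for i in range(bits):
            vector[i] += 1 if (h >> i) & 1 else -1   # vote each bit up or down
    return sum(1 << i for i in range(bits) if vector[i] > 0)

def hamming_distance(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")

def near_duplicate(a: str, b: str, threshold: int = 3) -> bool:
    """Similar documents produce hashes only a few bit-flips apart."""
    return hamming_distance(simhash(a), simhash(b)) <= threshold
```

The payoff is that similarity checking becomes an XOR and a popcount over 64-bit integers instead of a string comparison over full page bodies.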

🌍 Real-World Applications

Crawler architectures are foundational to several massive industries:

  • Search Engines: Google and Bing use crawlers as the primary data ingestion path.
  • Price Aggregators: Sites like Kayak or Honey crawl airline and e-commerce sites to find the best deals.
  • Web Archiving: The Internet Archive's "Wayback Machine" uses crawlers to preserve the history of the web.
  • Security Auditing: Vulnerability scanners crawl internal networks to map attack surfaces.

βš–οΈ Trade-offs & Failure Modes

  1. Freshness vs. Depth: A crawler cannot crawl everything instantly. We trade off depth (crawling every page on a small site) for freshness (re-crawling the CNN homepage every 5 minutes).
  2. Bloom Filter Accuracy vs. Memory: A Bloom filter can have false positives, meaning it might think a URL is crawled when it isn't. We accept this small loss of coverage to save terabytes of RAM.
  3. Failure Mode: Infinite Loops. A "Calendar Trap" can fill the Frontier with junk. We mitigate this with max-depth limits and pattern-based URL exclusion.
  4. Failure Mode: IP Blacklisting. If our politeness logic fails, our IPs will be blocked. We use a distributed IP pool and rotation to minimize the impact of single-domain blocks.
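The max-depth and pattern-based mitigation from point 3 can be sketched as a frontier-side filter. Both the depth limit and the patterns below are illustrative assumptions; production systems learn per-site rules over time.

```python
import re
from urllib.parse import urlsplit

# Patterns that commonly indicate crawler traps; the list is illustrative.
TRAP_PATTERNS = [
    re.compile(r"/calendar/\d{4}/\d{2}"),   # infinite "next month" pages
    re.compile(r"(/[^/]+)\1{2,}"),          # same path segment repeated 3+ times
]
MAX_DEPTH = 15                              # max path segments; an assumed limit

def looks_like_trap(url: str) -> bool:
    """Cheap heuristic applied before a URL is admitted to the Frontier."""
    path = urlsplit(url).path
    if path.count("/") > MAX_DEPTH:         # suspiciously deep paths
        return True
    return any(p.search(path) for p in TRAP_PATTERNS)
```

A rejected URL costs one regex pass; a trap that reaches the Frontier can cost millions of wasted fetches, so this check sits on the cheap side of the pipeline.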

πŸ—οΈ Advanced Concepts for Production Evolution

  1. JavaScript Rendering: Many modern sites (React/Vue) are empty shells without JS. A production crawler uses a Headless Chrome (Puppeteer/Playwright) Cluster to render pages before extracting links, though this is 100x more expensive than simple HTML fetching.
  2. Adaptive Re-crawl: Instead of a fixed TTL, we use machine learning to predict how often a page changes. A news site changes every hour; a blog might change once a month.
  3. Checkpointing & Snapshots: The Frontier state is massive. We use Kafka's consumer offsets and periodic Cassandra snapshots so that if the whole cluster restarts, we can resume within minutes.
  4. Mobile vs. Desktop Crawling: Different user-agents trigger different content. We crawl with both to ensure mobile-first indexing parity.

🧭 Decision Guide

| Situation | Recommendation |
|---|---|
| Small scale (<1M pages) | Use a single-node library like Scrapy with a Postgres backend. |
| Enterprise scale (1B+ pages) | Use a distributed architecture with a Kafka Frontier and S3 storage. |
| High JS content | Use a headless browser pool; expect 10x higher costs. |
| Real-time discovery | Use RSS feeds or sitemaps as the primary seed source instead of raw crawling. |

πŸ§ͺ Practical Example: Interview Delivery

If asked to design this in a 20-minute interview, focus on these three beats:

  1. The Politeness Constraint: Start with the "Calendar Trap" scenario. Explain why a naive crawler is dangerous and how the Back-Queue selector solves it.
  2. The Bloom Filter Trade-off: Explain why you use a Bloom filter for URL deduplication. Admit that it has false positives but argue that it's the only way to handle billions of URLs in memory.
  3. The Distributed Frontier: Sketch how Kafka allows you to separate the discovery of links from the fetching of content, allowing each to scale independently.

Standard Interview Closer: "I chose a distributed architecture with a Kafka-based Frontier because it allows us to handle the massive I/O of 1 billion pages per month while maintaining strict politeness via domain-sharded queues. We accept eventual coverage loss due to Bloom Filter false positives to ensure our memory footprint remains manageable at scale."

πŸ› οΈ Apache Nutch: How It Solves This in Practice

Apache Nutch is the industry-standard open-source crawler. It uses a Hadoop-style architecture to handle massive crawls.

Nutch Politeness Configuration (nutch-site.xml)

<!-- nutch-site.xml configuration for politeness -->
<property>
  <name>fetcher.server.delay</name>
  <value>5.0</value>
  <description>The number of seconds between successive requests to the same server.</description>
</property>

<property>
  <name>fetcher.threads.per.host</name>
  <value>1</value>
  <description>Maximum threads fetching from a single host simultaneously.</description>
</property>

<property>
  <name>http.robots.cache.ttl</name>
  <value>86400000</value> <!-- 24 hours in ms -->
</property>

Nutch handles "Politeness" by partitioning the fetch list by host and ensuring that each fetcher thread processes URLs from a single host sequentially, respecting the fetcher.server.delay.

πŸ“š Lessons Learned

  • Architecture is about Legal Boundaries: robots.txt isn't just a suggestion; it's a protocol for staying out of legal trouble and respecting the web.
  • Memory is your biggest cost: At scale, storing raw strings for billions of URLs is impossible. Hashing and Bloom filters are mandatory.
  • Don't trust the network: Use aggressive timeouts (e.g., 10 seconds) and retry logic with exponential backoff for failed fetches.

πŸ“Œ Summary & Key Takeaways

  • URL Frontier: The core orchestrator that manages priority and domain-based politeness.
  • Bloom Filters: The standard for space-efficient URL deduplication in RAM.
  • SimHash: Essential for detecting "near-duplicate" content and avoiding index bloat.
  • Asynchronous I/O: Using Java/Netty to handle thousands of concurrent network connections without thread exhaustion.

πŸ“ Practice Quiz

  1. Why is a Bloom Filter preferred over a Hash Set for URL deduplication at a 10B URL scale?

    • A) It is faster to write to.
    • B) It uses significantly less memory (bits vs. bytes per entry).
    • C) It guarantees 100% accuracy.

    Correct Answer: B
  2. What is the primary role of the "Back Queue" in the URL Frontier?

    • A) To store failed URLs for retry.
    • B) To ensure politeness by serializing requests to the same domain.
    • C) To cache content before it is sent to S3.

    Correct Answer: B
  3. How does SimHash help reduce the storage footprint of a crawler?

    • A) It compresses the HTML content.
    • B) It identifies near-duplicate content so it can be discarded before storage.
    • C) It encrypts the data for faster transmission.

    Correct Answer: B
  4. [Open-ended] Describe a scenario where a Bloom Filter's "false positive" would be a major problem for a crawler. How would you adjust the architecture to fix it for that specific case?
