System Design HLD Example: E-Commerce Platform (Amazon)

A practical interview-ready HLD for a large-scale e-commerce system handling catalog, cart, inventory, and orders.

Abstract Algorithms
AI-assisted content.

TLDR: A large-scale e-commerce platform separates catalog, cart, inventory, orders, and payments into independent microservices. The core architectural challenge is Inventory Correctness during flash sales, solved with a two-phase reservation pattern: an atomic Redis DECR for high-speed "soft" reservation and an optimistic-lock SQL update for the final "hard" commitment.

๐Ÿ›๏ธ The Prime Day Pressure Cooker

Imagine it's 12:00 PM on Prime Day. A "Lightning Deal" for the latest iPhone goes live at 90% off. There are exactly 1,000 units available. In the first 10 seconds, 500,000 users click "Buy Now."

In a naive system, every click triggers a database transaction: SELECT stock FROM inventory WHERE sku_id = 'iphone15'. If stock > 0, then UPDATE inventory SET stock = stock - 1. Under this concurrent load, every request contends for the lock on the same row, producing a "thundering herd": database connections max out, latency climbs from milliseconds to minutes, and, worst of all, the system can sell 1,050 units because of a race condition between the check and the decrement.
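The race can be replayed deterministically by interleaving the two steps by hand. The sketch below is only an illustration: the `inventory` dict is a hypothetical in-memory stand-in for the inventory table, and the function names are invented for this example. It contrasts the naive read-then-write flow with a single conditional decrement (the shape of `UPDATE ... WHERE stock > 0`).

```python
# Deterministic replay of the check-then-decrement race.
# `inventory` is a hypothetical in-memory stand-in for the inventory table.
inventory = {"iphone15": 1}


def read_stock(sku):
    """Step 1 of the naive flow: SELECT stock."""
    return inventory[sku]


def write_decrement(sku, stock_seen):
    """Step 2 of the naive flow: decide using the value read earlier."""
    if stock_seen > 0:
        inventory[sku] -= 1
        return True   # order confirmed
    return False      # out of stock


def naive_interleaving():
    """Both buyers read before either writes, so both pass the check."""
    inventory["iphone15"] = 1
    seen_a = read_stock("iphone15")
    seen_b = read_stock("iphone15")
    confirmed = [
        write_decrement("iphone15", seen_a),
        write_decrement("iphone15", seen_b),
    ]
    return confirmed, inventory["iphone15"]


def conditional_decrement(sku):
    """Check and decrement as one step, like UPDATE ... WHERE stock > 0."""
    if inventory[sku] > 0:
        inventory[sku] -= 1
        return True
    return False


def conditional_interleaving():
    """The same two buyers, but the check cannot be separated from the write."""
    inventory["iphone15"] = 1
    confirmed = [
        conditional_decrement("iphone15"),
        conditional_decrement("iphone15"),
    ]
    return confirmed, inventory["iphone15"]
```

With one unit in stock, `naive_interleaving()` confirms both orders and leaves the stock at -1, while `conditional_interleaving()` confirms only the first and leaves it at 0.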

This is the Overselling Trap. In e-commerce, selling an item you don't have isn't just a technical bug; it's a financial and reputational disaster involving cancelled orders, refund processing fees, and lost customer trust. If you design for the average day, you fail on the only day that matters.

E-Commerce Systems: Use Cases & Requirements

Actors

  • Shopper / Buyer: Browses products, manages a cart, and places orders.
  • Merchant / Seller: Lists products, manages stock levels, and fulfills orders.
  • System Admin: Monitors platform health, manages global promotions, and handles fraud.

Functional Requirements

  • Catalog Management: Search and browse millions of products with filters.
  • Shopping Cart: Persistent cart that works across devices and handles guest users.
  • Inventory Reservation: Atomic stock subtraction that prevents overselling.
  • Order Processing: State machine tracking from Placed to Delivered.
  • Payment Integration: Secure, idempotent payment processing via third-party gateways.

Non-Functional Requirements

  • High Read Availability: Catalog browsing targets 99.99% availability.
  • Strong Write Consistency: Inventory and Order records must be 100% accurate.
  • Low Latency: Product pages must load in < 100ms to prevent conversion drop-off.
  • Scale: Support 1M+ concurrent users and 10k+ orders/sec during peak spikes.

๐Ÿ” Basics: Baseline Architecture

An e-commerce system is essentially a Distributed State Machine. Every order moves through a series of transitions, and the system must ensure that data remains consistent across multiple specialized services.
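The state-machine framing can be made concrete with a transition table. This is a minimal sketch, not the platform's actual model: the article only names Placed and Delivered, so the PAID, SHIPPED, and CANCELLED states and the allowed transitions below are assumptions chosen for illustration.

```python
from enum import Enum


class OrderState(Enum):
    PLACED = "PLACED"
    PAID = "PAID"            # assumed intermediate state
    SHIPPED = "SHIPPED"      # assumed intermediate state
    DELIVERED = "DELIVERED"
    CANCELLED = "CANCELLED"  # assumed terminal state


# Hypothetical transition table: which states each state may move to.
ALLOWED = {
    OrderState.PLACED: {OrderState.PAID, OrderState.CANCELLED},
    OrderState.PAID: {OrderState.SHIPPED, OrderState.CANCELLED},
    OrderState.SHIPPED: {OrderState.DELIVERED},
    OrderState.DELIVERED: set(),
    OrderState.CANCELLED: set(),
}


def transition(current, target):
    """Return the new state, or raise if the move is not in the table."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target
```

Rejecting illegal moves in one place means a duplicate or out-of-order event cannot, for example, move a delivered order back to placed.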

The baseline architecture separates concerns into:

  1. The Read Path (Catalog & Search): Optimized for massive scale and eventual consistency.
  2. The Write Path (Checkout & Payment): Optimized for ACID compliance and strong consistency.
  3. The Async Path (Notifications & Analytics): Decoupled from the critical user path to ensure high performance.

Without this separation, a spike in checkouts would slow down users who are just browsing, leading to a massive loss in potential revenue.

โš™๏ธ Mechanics: The Two-Phase Reservation Logic

The most critical mechanic in e-commerce is how we handle the "Check-and-Reserve" of inventory. We use a Hybrid Two-Phase Pattern:

  • Phase 1: Soft Reservation (Redis): We keep a high-speed counter in Redis. Every checkout attempt performs an atomic DECR. If the result is $\ge 0$, the user proceeds; otherwise the decrement is undone and the user sees "out of stock". This takes $\approx 2$ ms.
  • Phase 2: Hard Commitment (Postgres): Once payment is authorized, we write the reservation to the primary database using an optimistic lock. If this fails, we "compensate" by incrementing the Redis counter back.

This mechanic allows us to handle 50k requests per second on a single SKU without locking our primary database.
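A minimal sketch of the Phase 1 gate follows. So that it runs without a server, it uses `CounterStore`, a hypothetical in-memory class whose `decr` and `incr` are made atomic with a lock to mimic Redis DECR and INCR; the key format and function names are likewise assumptions.

```python
import threading


class CounterStore:
    """In-memory stand-in for Redis; decr/incr are atomic like DECR/INCR."""

    def __init__(self):
        self._values = {}
        self._lock = threading.Lock()

    def set(self, key, value):
        with self._lock:
            self._values[key] = value

    def get(self, key):
        with self._lock:
            return self._values[key]

    def decr(self, key):
        with self._lock:
            self._values[key] -= 1
            return self._values[key]

    def incr(self, key):
        with self._lock:
            self._values[key] += 1
            return self._values[key]


def soft_reserve(store, sku):
    """Phase 1: decrement first, then undo if the counter went negative."""
    remaining = store.decr(f"stock:{sku}")
    if remaining < 0:
        store.incr(f"stock:{sku}")  # nothing was available; give it back
        return False
    return True


def release(store, sku):
    """Compensation: return a unit when payment or Phase 2 fails."""
    store.incr(f"stock:{sku}")


def simulate_flash_sale(units, buyers):
    """Run `buyers` concurrent attempts against `units` of stock."""
    store = CounterStore()
    store.set("stock:iphone15", units)
    results = []
    results_lock = threading.Lock()

    def buyer():
        ok = soft_reserve(store, "iphone15")
        with results_lock:
            results.append(ok)

    threads = [threading.Thread(target=buyer) for _ in range(buyers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sum(results), store.get("stock:iphone15")
```

Because every decrement returns the post-decrement value, only the first `units` callers can ever see a non-negative result, so the gate never admits more buyers than there is stock, however the threads interleave.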

๐Ÿ“ Estimations & Design Goals

The Math of Amazon-Scale

  • Total Products: 100 Million SKUs.
  • Peak Orders: 10,000 orders per second.
  • Read-to-Write Ratio: 50:1. If we have 10k orders/sec, we have 500k product views/sec.
  • Storage Growth: 10k orders/sec × 2 KB per order = 20 MB/sec, or about 1.2 GB per minute (roughly 1.7 TB per day at sustained peak).
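The figures above can be worked through directly. The short calculation below assumes decimal units (2 KB taken as 2,000 bytes) and a peak rate sustained for the whole period; the variable names are illustrative.

```python
# Back-of-envelope capacity math using the figures from the list above.
PEAK_ORDERS_PER_SEC = 10_000
READ_TO_WRITE_RATIO = 50
ORDER_SIZE_BYTES = 2_000  # assumption: 2 KB in decimal units

views_per_sec = PEAK_ORDERS_PER_SEC * READ_TO_WRITE_RATIO
bytes_per_sec = PEAK_ORDERS_PER_SEC * ORDER_SIZE_BYTES

# Convert the per-second write rate to larger windows.
gb_per_minute = bytes_per_sec * 60 / 1e9
tb_per_day = bytes_per_sec * 86_400 / 1e12

print(f"{views_per_sec:,} product views/sec")
print(f"{bytes_per_sec / 1e6:.0f} MB/sec of order data")
print(f"{gb_per_minute:.1f} GB/minute, {tb_per_day:.2f} TB/day at sustained peak")
```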

Design Goal: We must use a Cache-Aside Pattern for the catalog and a Message-Driven Architecture for post-order processing to ensure the "Buy Now" button remains responsive even if the notification service is lagging.
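The Cache-Aside Pattern named here can be sketched in a few lines. This is an illustration under assumptions: the cache is a plain dict rather than Redis, `load_from_db` is a hypothetical callback standing in for the catalog database query, and the 60-second TTL is an arbitrary default.

```python
import time


class CacheAside:
    """Read-through-on-miss cache: the application owns the cache population."""

    def __init__(self, load_from_db, ttl_seconds=60.0, clock=time.monotonic):
        self._load = load_from_db   # hypothetical catalog DB lookup
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}          # key -> (expires_at, value)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]                       # cache hit
        value = self._load(key)                   # miss: go to the database
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Call after a product update so the next read reloads it."""
        self._entries.pop(key, None)
```

The short TTL bounds how stale a product page can be, which is acceptable on the read path because it is designed for eventual consistency.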

High-Level Design: The Distributed Microservices Architecture

The following architecture illustrates the separation of concerns between discovery, transaction, and fulfillment.

graph TD
    User((User)) --> LB[Load Balancer]
    LB --> AG[API Gateway]

    subgraph Discovery_Path
        AG --> PCS[Product Service]
        PCS --> PDB[(Catalog DB: Postgres)]
        PCS --> RC[(Catalog Cache: Redis)]
        AG --> SES[Search Service]
        SES --> ES[(Elasticsearch)]
    end

    subgraph Transaction_Path
        AG --> CS[Cart Service]
        CS --> RCart[(Cart Cache: Redis)]
        AG --> OS[Order Service]
        OS --> IS[Inventory Service]
        IS --> RInv[(Inv Counter: Redis)]
        IS --> PInv[(Inv DB: Postgres)]
        OS --> PS[Payment Service]
    end

    subgraph Fulfillment_Async
        OS --> Kafka[Kafka]
        Kafka --> NS[Notification Service]
        Kafka --> AS[Analytics Service]
        Kafka --> WS[Warehouse Service]
    end

The diagram above reveals the three-path separation that makes Amazon-scale e-commerce resilient. The Discovery Path serves reads from Elasticsearch and Redis, both read-optimized, so browsing never touches the transactional inventory and order databases; cache misses fall through to the separate catalog database. The Transaction Path contains all ACID operations. The Fulfillment Path is entirely asynchronous via Kafka, so a slow notification service does not delay the checkout experience.

Deep Dive: How the Two-Phase Reservation Prevents the Overselling Trap

The Two-Phase Reservation pattern is the core innovation that separates a production e-commerce platform from a demo. Understanding its internals explains why the inventory system can sustain 50,000 simultaneous checkout attempts on a single SKU.

Internals: Phase 1 (Soft Reserve) and Phase 2 (Hard Commit) State Machine

Phase 1 is a single atomic decrement on a Redis counter for the SKU. Redis executes commands one at a time on a single thread, so each DECR runs to completion before the next command starts and stays atomic even under extreme concurrency. If the counter drops below zero, the decrement is immediately reversed with an INCR and the user receives an "out of stock" response. At no point does Phase 1 touch Postgres, which is what gives it its ~2 ms speed.

| Field | Type | Description |
| --- | --- | --- |
| sku_id | VARCHAR(50) | Unique product identifier |
| name | TEXT | Product display name |
| base_price | DECIMAL(10,2) | Listed price before promotions |
| stock_physical | INTEGER | Actual warehouse on-hand count |
| stock_reserved | INTEGER | Soft-reserved count held in Redis |
| version | INTEGER | Optimistic lock counter for Phase 2 |
| status | ENUM | ACTIVE, DISCONTINUED, OUT_OF_STOCK |

Phase 2, the Hard Commitment, runs only after the payment gateway returns a successful charge authorization. The Order Service reads the row's current version number, then issues a Postgres UPDATE that performs the stock decrement only if the version has not changed. If another concurrent transaction modified the row between the read and the write, the version will differ and the UPDATE returns zero rows affected, so the operation re-reads the row and retries. This is optimistic locking: no database locks are held during the 2 to 5 second payment processing window.
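The version check can be sketched with the standard-library `sqlite3` module standing in for Postgres, so the example runs without a server. The table is a cut-down, assumed version of the inventory schema (only `sku_id`, `stock_physical`, and `version`), and the function names and retry limit are illustrative.

```python
import sqlite3


def open_inventory(stock):
    """Create an in-memory table with one SKU; SQLite stands in for Postgres."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory ("
        "sku_id TEXT PRIMARY KEY, stock_physical INTEGER, version INTEGER)"
    )
    conn.execute("INSERT INTO inventory VALUES ('iphone15', ?, 0)", (stock,))
    conn.commit()
    return conn


def read_row(conn, sku):
    """Read stock and version without taking any lock."""
    return conn.execute(
        "SELECT stock_physical, version FROM inventory WHERE sku_id = ?",
        (sku,),
    ).fetchone()


def commit_with_version(conn, sku, version_seen):
    """Decrement only if nobody changed the row since `version_seen`."""
    cur = conn.execute(
        "UPDATE inventory "
        "SET stock_physical = stock_physical - 1, version = version + 1 "
        "WHERE sku_id = ? AND version = ? AND stock_physical > 0",
        (sku, version_seen),
    )
    conn.commit()
    return cur.rowcount == 1  # zero rows affected means a version conflict


def hard_commit(conn, sku, max_retries=3):
    """Phase 2: re-read and retry on conflict, give up after max_retries."""
    for _ in range(max_retries):
        stock, version = read_row(conn, sku)
        if stock <= 0:
            return False
        if commit_with_version(conn, sku, version):
            return True
    return False
```

A caller holding a stale version simply gets `False` back from `commit_with_version`, which is the "zero rows affected" signal described above. The extra `stock_physical > 0` guard is an added safety net, not something the text requires.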

| Phase | Storage | Mechanism | Latency | Failure Recovery |
| --- | --- | --- | --- | --- |
| Soft Reserve | Redis | Atomic DECR | ~2 ms | INCR to compensate on payment failure |
| Hard Commit | Postgres | Optimistic lock (version check) | ~20 ms | Retry on version conflict |
| Async Notify | Kafka | Event publish | ~5 ms | At-least-once delivery via consumer groups |

Performance Analysis: Handling 500,000 Flash Sale Clicks in 10 Seconds

At 500,000 clicks in 10 seconds, the system must absorb 50,000 checkout requests per second. If Postgres held a row lock across the 2-second payment window, every update to that SKU row would serialize behind it, and with one lock wait per connection a pool of ~500 connections would quickly fill with blocked transactions. Redis, by contrast, can sustain hundreds of thousands of atomic DECRs per second on commodity hardware.

The API Gateway's rate limiter is the first defense: it applies a token bucket per SKU, capping checkout requests at 10,000 per second per product. This protects both Redis and Postgres from thundering-herd behavior and returns a 429 "Try Again" response to excess users, a far better user experience than a 503 server error.
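A per-SKU token bucket might look like the sketch below. It is a single-process illustration only (a real gateway would need shared state across instances), and the class names, the injectable clock, and the choice of capacity are assumptions.

```python
import time


class TokenBucket:
    """Refills at `rate_per_sec` up to `capacity`; each request costs 1 token."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self.rate = rate_per_sec
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self):
        now = self._clock()
        # Add the tokens earned since the last call, capped at capacity.
        earned = (now - self._last) * self.rate
        self._tokens = min(self.capacity, self._tokens + earned)
        self._last = now
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


class SkuRateLimiter:
    """One bucket per SKU, so a hot product cannot starve the others."""

    def __init__(self, rate_per_sec, capacity, clock=time.monotonic):
        self._args = (rate_per_sec, capacity, clock)
        self._buckets = {}

    def check(self, sku):
        """Return the HTTP status the gateway would send for this request."""
        bucket = self._buckets.get(sku)
        if bucket is None:
            bucket = self._buckets[sku] = TokenBucket(*self._args)
        return 200 if bucket.allow() else 429
```

For the cap described above, the limiter would be built as `SkuRateLimiter(rate_per_sec=10_000, capacity=10_000)`; the capacity sets how large a burst is admitted before the steady rate applies.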

| Checkpoint | Target | Technology |
| --- | --- | --- |
| Peak order throughput | 10,000 orders/sec | Redis Phase 1 + Kafka dispatch |
| Catalog page latency | < 100 ms | Elasticsearch + Redis cache-aside |
| Inventory reservation latency | < 10 ms | Redis DECR (Phase 1 only) |
| Payment + hard commit latency | < 2,000 ms | Postgres optimistic lock + payment gateway |

๐ŸŒ Real-World Inventory Systems: Amazon, Shopify, and ASOS

Amazon uses a multi-layer inventory system where each fulfillment center maintains its own local stock count, and a global aggregation layer provides approximate availability for product pages. The final stock deduction happens at fulfillment center selection time, not at "Add to Cart." This is why Amazon occasionally allows an order that is later listed as "delayed": the global system showed inventory that a local fulfillment center had already exhausted.

Shopify adopted an event-sourced inventory model during its 2021 Black Friday scale-up. Rather than storing "current stock" as a mutable integer, Shopify stores "stock adjustment events" and computes current stock as the event aggregate. This makes the audit trail perfect and enables point-in-time stock reconstruction, but it adds read complexity (aggregating all events) that requires a materialized view for performance.
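The event-sourced idea can be illustrated generically. The sketch below is not based on any vendor's code: `StockAdjusted`, the integer logical timestamp, and `StockView` are hypothetical names showing how current stock falls out of summing adjustment events, and how a materialized view avoids re-aggregating on every read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StockAdjusted:
    """One immutable stock adjustment event (hypothetical shape)."""
    sku: str
    delta: int   # positive for restock, negative for a sale
    at: int      # logical timestamp


def stock_as_of(events, sku, at):
    """Point-in-time stock: aggregate every event up to and including `at`."""
    return sum(e.delta for e in events if e.sku == sku and e.at <= at)


class StockView:
    """Materialized view: running totals updated as each event is appended."""

    def __init__(self):
        self.totals = {}

    def apply(self, event):
        self.totals[event.sku] = self.totals.get(event.sku, 0) + event.delta


events = [
    StockAdjusted("iphone15", 1000, at=1),
    StockAdjusted("iphone15", -1, at=2),
    StockAdjusted("iphone15", -1, at=3),
    StockAdjusted("case", 50, at=3),
]
view = StockView()
for event in events:
    view.apply(event)
```

Reading `view.totals["iphone15"]` is a constant-time lookup, whereas `stock_as_of` scans the log; that is the read-complexity trade-off the paragraph describes.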

ASOS routes all flash-sale traffic through a dedicated Flash Sale Service isolated from the main catalog and cart services. This prevents a Black Friday traffic spike from degrading the browse experience for regular shoppers, a clean example of the Bulkhead pattern applied at the service level.

โš–๏ธ Microservice Independence vs. Distributed Transaction Complexity

| Design Decision | Advantage | Risk |
| --- | --- | --- |
| Redis-first inventory | Extremely fast soft reservation at scale | Redis restart loses in-flight reservations without persistence |
| Optimistic locking in Postgres | No lock contention; high concurrency | High retry rate under extreme write pressure per SKU |
| Kafka for post-order async | Decouples notification/warehouse from checkout | Delayed warehouse dispatch if Kafka consumer lags |
| Microservice separation | Independent scaling per domain | Distributed transactions require a Saga (compensating actions) or 2PC |
| Elasticsearch for catalog | Sub-100 ms search across 100M products | 1 to 2 second indexing lag after product updates |

Critical Failure Mode: The Compensation Loop. If the payment gateway succeeds but the Postgres hard-commit fails (e.g., database timeout during high load), the Redis counter has already been decremented and the customer charged. The system must execute an idempotent compensation job: detect "PAYMENT_CAPTURED / DB_WRITE_FAILED" state, retry the Postgres write with the same idempotency key, or issue a refund if retries are exhausted. Without this compensation loop, the company has charged a customer with no order record, a financial and reputational liability.
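One way to structure such a compensation job is sketched below. Everything here is hypothetical scaffolding: the order is a plain dict, and `write_order`, `refund`, and `release_stock` are callbacks standing in for the Postgres write, the payment gateway refund, and the Redis INCR. Releasing the soft reservation on refund follows the earlier rule that a failed hard commit increments the counter back.

```python
def compensate(order, write_order, refund, release_stock, max_retries=3):
    """Resolve one order stuck in PAYMENT_CAPTURED / DB_WRITE_FAILED.

    Safe to run repeatedly: an order that is already resolved is returned
    untouched, and every side effect carries the same idempotency key.
    """
    if order["state"] != "DB_WRITE_FAILED":
        return order["state"]

    for _ in range(max_retries):
        try:
            # Same key on every attempt, so a write that actually landed
            # before the timeout is not applied twice.
            write_order(order["order_id"], order["idempotency_key"])
        except TimeoutError:
            continue
        order["state"] = "CONFIRMED"
        return order["state"]

    # Retries exhausted: give the money and the unit back.
    refund(order["order_id"], order["idempotency_key"])
    release_stock(order["sku"])
    order["state"] = "REFUNDED"
    return order["state"]
```

In practice a scheduler would scan for orders in the failed state and call this for each; because resolved orders are skipped, overlapping runs do not double-refund.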

Choosing the Right Inventory Consistency Model for Your Scale

Use the Two-Phase Reservation when:

  • You run flash sales or sell limited-quantity products where overselling is a significant business risk.
  • A single unit of the SKU has material monetary value (electronics, limited-edition items).
  • The "check-and-reserve" window spans multiple seconds due to payment processing.

Simpler inventory patterns are sufficient when:

  • Products have thousands of units and a small overcount is commercially acceptable.
  • B2B systems where the buyer confirms quantity through a purchase order before payment.
  • Digital goods with unlimited inventory: software licenses, streaming access, SaaS seats.

Scaling the Order Service beyond a single region:

  • Assign each SKU to a primary region based on geographic demand concentration.
  • Route all reservation requests for that SKU to its primary region.
  • Use a cross-region Kafka replication topic for inventory reconciliation and audit.

Delivering This Design in a System Design Interview

Act 1: Frame the Overselling Trap (2 minutes). Open with the Prime Day scenario. Draw two concurrent SELECT stock + UPDATE stock pairs racing against the same inventory row. Show how both transactions read "1 unit available" and both successfully decrement, resulting in stock at -1 with two confirmed orders. This immediately demonstrates you understand the core race condition.

Act 2: The Three-Path Architecture (5 minutes). Draw the Discovery, Transaction, and Fulfillment separation. Explain why catalog reads and order writes must be on entirely separate data paths. Walk through the Two-Phase Reservation: Redis DECR → Payment Gateway → Postgres optimistic lock version check.

Act 3: Failure Scenarios (3 minutes). When the interviewer asks "What if Redis crashes?", answer: Phase 1 is a soft gate. If Redis is unavailable, fall back to Postgres-only mode with strict rate limiting rather than going fully down. The trade-off is lower throughput, not data loss.

| Interviewer Question | Strong Answer |
| --- | --- |
| How do you handle cart abandonment? | Redis cart key expires after 30 minutes; no inventory decrement until checkout begins |
| How do you prevent one bad seller from crashing the platform? | Bulkhead: Flash Sale Service is isolated from catalog; Kafka decouples fulfillment |
| How would you add a recommendations engine? | Read from the order history Kafka topic; never add latency to the checkout critical path |

๐Ÿ› ๏ธ Open Source Building Blocks for E-Commerce Scale

Apache Kafka is the standard for the Fulfillment Async path. Its durable, partitioned log makes it ideal for the high-fan-out order event stream (notification, analytics, warehouse). Kafka's consumer group model enables independent scaling of each downstream service.

Elasticsearch powers the Catalog Search service. Its inverted index handles full-text and filtered product search across 100M SKUs with sub-100 ms latency. Debezium provides CDC-based synchronization: it captures row changes from Postgres into Kafka, and a sink connector writes them to Elasticsearch, which avoids application-level dual writes.

Redis is used in three distinct roles in this architecture: the inventory Phase 1 counter (String with DECR), the shopping cart store (Hash per session with TTL), and the catalog cache (String with short TTL for product JSON blobs).

Lessons Learned From Operating E-Commerce Systems at Flash Sale Scale

Lesson 1: The cart is not the inventory. Never decrement real inventory when an item is added to cart. Only decrement during the checkout-to-payment flow. Cart abandonment rates of 70 to 80% make pre-reservation at cart time completely unworkable.

Lesson 2: Rate limiting at the SKU level is critical. Global rate limiting protects infrastructure. SKU-level rate limiting protects product fairness during flash sales and prevents single-product thundering herds from impacting all other products.

Lesson 3: Payment idempotency prevents double charges. Every payment request must include an idempotency key tied to the order ID. The payment gateway must return the same result for repeated requests with the same key during network retries, eliminating the risk of double charges.

Lesson 4: Monitor the reservation leak metric. Track the count of orders where Phase 1 succeeded but Phase 2 failed. A rising leak rate is an early warning of Postgres write pressure, network instability between services, or payment gateway latency spikes.
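The contract in Lesson 3 can be shown with a toy gateway. `FakeGateway` and `idempotency_key_for` are hypothetical names for illustration and do not correspond to any real payment provider's API; the point is only that a repeated key returns the stored result instead of charging again.

```python
class FakeGateway:
    """Hypothetical gateway that remembers results by idempotency key."""

    def __init__(self):
        self._results = {}
        self.charges_made = 0

    def charge(self, idempotency_key, amount_cents):
        # A retry with a known key replays the original result.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.charges_made += 1
        result = {
            "charge_id": f"ch_{self.charges_made}",
            "amount_cents": amount_cents,
            "status": "captured",
        }
        self._results[idempotency_key] = result
        return result


def idempotency_key_for(order_id):
    """Tie the key to the order so every retry of that order reuses it."""
    return f"order-{order_id}"


gateway = FakeGateway()
key = idempotency_key_for("A1001")
first = gateway.charge(key, 9_990)
retry = gateway.charge(key, 9_990)  # e.g. the client timed out and retried
```

Here `retry` is the same result as `first` and `gateway.charges_made` stays at 1, so the network retry cannot produce a second charge.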

TLDR & Key Takeaways for E-Commerce Platform Design

  • Core challenge: Prevent overselling during flash sales while maintaining sub-100 ms catalog response times.
  • The solution: Two-Phase Reservation, with an atomic Redis DECR for soft reservation and an optimistic Postgres lock for hard commitment after payment success.
  • Architecture: Three separated paths: Discovery (Elasticsearch + Redis), Transaction (Order + Inventory + Payment services with ACID writes), Fulfillment (Kafka async fan-out).
  • Critical failure mode: The Compensation Loop, which detects and resolves payment-captured-but-db-write-failed states with idempotent retry jobs.
  • Key trade-off: Redis speed vs. durability; always pair soft reservations with a compensation strategy for Phase 2 failures.
  • At scale: SKU-level rate limiting at the API Gateway is as important as the inventory mechanism itself.