
System Design HLD Example: Ride-Sharing (Uber/Lyft)

A practical interview-ready HLD for a ride-sharing platform with real-time matching and location tracking.

Abstract Algorithms · 12 min read

TLDR: A ride-sharing platform is a high-velocity geospatial matching engine. Drivers stream GPS coordinates every 5 seconds into a Redis Geospatial Index. When a rider requests a trip, the Matching Service executes a GEORADIUS query to find the 10 closest available drivers. A distributed Trip State Machine ensures transactional consistency (no double-matching), while WebSockets provide the real-time "car on map" experience that users expect.

πŸš— The New Year's Eve Meltdown

Imagine it’s New Year's Eve in New York City. At 12:05 AM, one million people open their ride-sharing apps simultaneously. In the background, 50,000 drivers are moving through the city, sending GPS updates every few seconds.

Your matching algorithm is working perfectly, but your database is screaming. Every ride request triggers a massive SQL query: SELECT * FROM drivers WHERE ST_Distance(location, rider_location) < 2000. With 100,000 concurrent requests, the database's CPU hits 100%, lock contention skyrockets, and the "Nearest Driver" query that usually takes 10ms now takes 30 seconds. By the time a driver is "matched," they’ve already moved three blocks away or accepted a ride from a competitor.

This is the Geospatial Hotspot problem. In ride-sharing, the challenge isn't just matchingβ€”it's maintaining a high-velocity, in-memory index of moving objects that can be queried hundreds of times per second without falling behind. At the scale of Uber or Lyft, you aren't just building a database; you are building a real-time reflection of the physical world in silicon.

πŸ“– Ride-Sharing: Use Cases & Requirements

Actors & Journeys

  • Rider: Requests a ride, views ETAs, tracks the driver in real-time, and pays for the trip.
  • Driver: Updates their availability (Online/Offline), broadcasts their location, and accepts/rejects trip requests.
  • System Admin: Monitors surge pricing, platform health, and fraud detection.

In/Out Scope

  • In-Scope: Real-time driver tracking, geospatial matching, trip lifecycle management, and live position streaming.
  • Out-of-Scope: Advanced driver background checks (manual workflows), automated vehicle maintenance scheduling, and complex airport queueing logic.

Functional Requirements

  • Real-time Location Stream: Drivers must broadcast GPS coordinates every 5 seconds.
  • Geospatial Matching: Efficiently find available drivers within a specific radius (e.g., 2km) of a rider.
  • Trip Management: Manage the full lifecycle of a ride (Request $\rightarrow$ Match $\rightarrow$ Pickup $\rightarrow$ Dropoff).
  • Live Tracking: Stream the driver's current position to the rider's map with low latency.

Non-Functional Requirements (NFRs)

  • Ultra-Low Matching Latency: A rider should see a "Driver Found" confirmation in $< 2$ seconds.
  • High Write Availability: The system must handle 100k+ GPS updates per second without dropping pings.
  • Strict Consistency for Matching: One driver must never be matched to two riders simultaneously (Double Booking prevention).
  • Scalability: The architecture must handle city-wide surges (e.g., New Year's Eve) without cascading failures.

πŸ” Foundations: How Geospatial Matching Works

At its core, a ride-sharing app is a high-frequency matching market.

The baseline architecture relies on three concepts:

  1. Geospatial Indexing: Traditional B-Tree indexes cannot efficiently handle 2D range queries. We use Geohashing or Quadtrees to convert 2D coordinates into a 1D searchable space.
  2. The Ping-Pong Protocol: Drivers "ping" the server with their location; the server "pongs" back with current demand or nearby rider requests.
  3. The Buffer Zone: We don't just match the "closest" driver; we match the "best" driver, which might consider ETA, driver rating, or heading.

Without these foundations, the system would require a full scan of every driver in the world for every rider request, which is prohibitively expensive at scale.

βš™οΈ The Mechanics of Geohashing

The mechanism of spatial search is powered by Geohashing.

  • Geohash Encoding: A Geohash is a hierarchical spatial data structure that subdivides the world into a grid of cells. Each cell is assigned a unique alphanumeric string.
  • The Prefix Property: Two points that are close together in the physical world will usually have Geohashes with a common prefix. For example, 9q8yy and 9q8yz are both in San Francisco.
  • Radius Search: By finding the rider's Geohash and its 8 neighboring cells, we can perform a simple range query in a Sorted Set (like Redis) to find all drivers in the vicinity. This turns an $O(N)$ scan over every driver into an $O(\log N)$ range lookup.
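
The prefix property above can be sketched in Java. This is a minimal, illustrative geohash encoder (the class and method names are hypothetical; neighbor lookup and edge cases near cell borders are omitted):

```java
// Minimal geohash encoder illustrating the prefix property.
public class Geohash {
    // Standard geohash base32 alphabet (no a, i, l, o).
    private static final String BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz";

    public static String encode(double lat, double lng, int precision) {
        double minLat = -90, maxLat = 90, minLng = -180, maxLng = 180;
        StringBuilder hash = new StringBuilder();
        boolean evenBit = true; // even bits split longitude, odd bits latitude
        int bit = 0, charValue = 0;
        while (hash.length() < precision) {
            if (evenBit) {
                double mid = (minLng + maxLng) / 2;
                if (lng >= mid) { charValue = (charValue << 1) | 1; minLng = mid; }
                else            { charValue = charValue << 1;       maxLng = mid; }
            } else {
                double mid = (minLat + maxLat) / 2;
                if (lat >= mid) { charValue = (charValue << 1) | 1; minLat = mid; }
                else            { charValue = charValue << 1;       maxLat = mid; }
            }
            evenBit = !evenBit;
            if (++bit == 5) { // every 5 bits become one base32 character
                hash.append(BASE32.charAt(charValue));
                bit = 0;
                charValue = 0;
            }
        }
        return hash.toString();
    }
}
```

Encoding (37.7749, -122.4194) at precision 5 yields 9q8yy, and a point a few blocks away shares the 9q8y prefix, which is exactly the property the Sorted Set range query exploits.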

πŸ“ Estimations & Design Goals

The Math of Real-Time Geospatial

  • Total Drivers: 1,000,000 globally.
  • Active Drivers (Peak): 500,000.
  • Location Update Frequency: Every 5 seconds.
  • Location Write QPS: $500,000 / 5 = \mathbf{100,000 \text{ writes/sec}}$.
  • Ride Request QPS (Peak): $\approx \mathbf{500 \text{ requests/sec}}$.
  • In-Memory Index Size: 500k drivers $\times$ 100 bytes (ID + Lat + Lng + Metadata) $\approx$ 50 MB. This is tiny enough for a single Redis node but needs sharding for QPS.
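
A quick sanity check of the arithmetic above, using the same assumed figures:

```java
// Back-of-the-envelope check for the estimates above.
public class Estimates {
    public static void main(String[] args) {
        long activeDrivers = 500_000;
        int pingIntervalSec = 5;
        long writeQps = activeDrivers / pingIntervalSec;          // 100,000 writes/sec
        long bytesPerDriver = 100;                                // ID + lat + lng + metadata
        double indexMb = activeDrivers * bytesPerDriver / 1e6;    // ~50 MB
        System.out.println(writeQps + " writes/sec, ~" + indexMb + " MB index");
    }
}
```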

Design Goals

  • In-Memory Indexing: Use Redis Sorted Sets for $O(\log N)$ spatial search.
  • Decoupled Write Path: GPS updates are buffered through a high-throughput Location Service.
  • Double-Booking Prevention: Use distributed locking for atomic matching.

πŸ“Š High-Level Design: The Real-Time Matching Flow

The architecture separates the high-frequency location stream from the transactional trip management logic.

graph TD
    Driver -->|GPS 5s| LS[Location Service]
    LS -->|GEOADD| Redis[(Redis Geo Index)]

    Rider -->|Request Ride| MS[Matching Service]
    MS -->|GEORADIUS| Redis
    MS -->|Lock Driver| TS[Trip Service]
    TS --> DB[(Trip DB: Postgres)]

    TS -->|Match Event| NS[Notification Service]
    NS -->|Push| Driver
    NS -->|Push| Rider

    LS -->|Stream| WS[WebSocket Gateway]
    WS -->|Live Update| Rider

Explanation of the Architecture: The architecture utilizes a Write-Heavy Location ingestion path. The Location Service receives 5-second GPS pings from drivers and updates an in-memory Redis Geospatial Index. When a rider requests a trip, the Matching Service performs a radius search on Redis. To ensure a driver is only matched once, the Trip Service handles the transactional "handshake" using Postgres. Real-time visual updates are provided by a WebSocket Gateway that streams driver positions directly to the rider's map, bypassing the database entirely for coordinates.

πŸ”Œ API Design: The Service Contract

The communication between the mobile apps and the backend requires efficient, lightweight endpoints.

| Endpoint | Method | Payload | Description |
|---|---|---|---|
| /v1/locations | POST | {"lat": 37.7, "lng": -122.4} | Driver sends GPS ping. Highly optimized for speed. |
| /v1/trips | POST | {"rider_id": "r1", "pickup": "..."} | Rider requests a ride. Triggers matching. |
| /v1/trips/{id}/accept | POST | {"driver_id": "d1"} | Driver accepts the matched request. |
| /v1/trips/{id}/status | GET | N/A | Polling fallback for trip state updates. |

πŸ—„οΈ Data Model: Schema Definitions

Trip Store (PostgreSQL)

We use a relational database for trips because we need strong ACID guarantees for billing and history.

| Table | Column | Type | Notes |
|---|---|---|---|
| trips | id | UUID (PK) | Unique trip identifier. |
| trips | rider_id | UUID | Indexed for rider history. |
| trips | driver_id | UUID | Indexed for driver payouts. |
| trips | status | ENUM | REQUESTED, MATCHED, IN_PROGRESS, COMPLETED. |
| trips | fare | DECIMAL | Calculated final price. |

Active Driver Index (Redis)

This is not a traditional table but a Geospatial Sorted Set.

  • Key: drivers:active:{city_id}
  • Member: driver_id
  • Score: 52-bit Geohash (managed by Redis GEOADD).

πŸ”§ Tech Stack & Design Choices

| Component | Choice | Rationale |
|---|---|---|
| Location Store | Redis (Geo Commands) | Native support for radius queries; extreme write throughput. |
| Transactional DB | PostgreSQL | Reliability and complex relationship support for payments. |
| Real-time Comms | WebSockets (Netty) | Bi-directional low-latency map updates. |
| Messaging | Apache Kafka | Decouples location updates from analytics and logging. |
| Distributed Lock | Redlock (Redis) | Essential for atomic matching in a distributed environment. |

🧠 Design Deep Dive

πŸ›‘οΈ Internals: The Trip State Machine

Consistency is maintained by a strict State Machine. A trip cannot skip states; for example, a trip cannot go from REQUESTED to IN_PROGRESS without being MATCHED.

  1. The Request: Rider creates a trip record.
  2. The Match: Matching Service finds Driver A. It attempts to acquire a Distributed Lock in Redis: SET driver:lock:A trip_123 NX EX 15.
  3. The Acceptance: If the lock is acquired, Driver A gets 15 seconds to accept. If they accept, the state moves to MATCHED and the driver is removed from the "Available" Redis set.
  4. The Pickup: Driver clicks "Start Trip". State moves to IN_PROGRESS.

This ensures that even if two Matching Service nodes try to match Driver A at the exact same millisecond, only one will succeed.
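
The transition rules above can be sketched as a small table-driven state machine (a minimal illustration with hypothetical names; cancellation, timeouts, and persistence are omitted):

```java
import java.util.EnumMap;
import java.util.EnumSet;
import java.util.Map;
import java.util.Set;

// Table-driven trip state machine: only the listed transitions are legal,
// so a trip can never jump from REQUESTED straight to IN_PROGRESS.
public class TripStateMachine {
    public enum State { REQUESTED, MATCHED, IN_PROGRESS, COMPLETED }

    private static final Map<State, Set<State>> LEGAL = new EnumMap<>(State.class);
    static {
        LEGAL.put(State.REQUESTED, EnumSet.of(State.MATCHED));
        LEGAL.put(State.MATCHED, EnumSet.of(State.IN_PROGRESS));
        LEGAL.put(State.IN_PROGRESS, EnumSet.of(State.COMPLETED));
        LEGAL.put(State.COMPLETED, EnumSet.noneOf(State.class));
    }

    private State current = State.REQUESTED;

    public State current() { return current; }

    public void transition(State next) {
        if (!LEGAL.get(current).contains(next)) {
            throw new IllegalStateException("Illegal transition: " + current + " -> " + next);
        }
        current = next;
    }
}
```

In production the transition would be persisted atomically in Postgres (e.g., an UPDATE guarded by the expected current status) rather than held in memory.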

⚑ Internals: Redis Spatial Search

Redis performs spatial search by mapping Geohash cells to a 1D Sorted Set.

  • Complexity: GEORADIUS is $O(N + \log M)$. For a city with 20,000 drivers, the query takes $< 5$ ms.
  • Bottleneck: The real bottleneck is the network stack. 100k GPS pings/sec requires a massive fleet of Location Service nodes and optimized TCP/UDP handling.
  • SLO: Driver discovery time $< 1s$. GPS-to-Map latency $< 2s$.

🌍 Real-World Applications

Ride-sharing architecture is a template for the "On-Demand Economy":

  • Food Delivery: DoorDash and UberEats use similar matching logic for couriers.
  • Logistics: Trucking companies use real-time tracking for fleet management.
  • Micro-mobility: Bird and Lime use geospatial indexing to show available scooters to users.
  • Local Services: TaskRabbit or Handy matching service providers to customers.

βš–οΈ Trade-offs & Failure Modes

  1. Availability vs. Consistency: We trade off a tiny bit of availability (locking a driver for 15 seconds) to ensure strict consistency (no double booking).
  2. Ping Frequency vs. Battery: We could ping every 1 second for perfect accuracy, but that would kill driver phone batteries. We trade off 5-second intervals for battery health.
  3. Failure Mode: Hot Keys. A massive event in a small area (e.g., Coachella) can overwhelm a single Redis shard. We use Geohash Sharding to distribute load.
  4. Failure Mode: Dead Reckoning. If a driver enters a tunnel, their GPS stops. We use Kalman Filtering on the client-side to "predict" the driver's movement until the next signal.
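
The dead-reckoning idea in point 4 can be sketched with a constant-velocity predictor. A production client would run a full Kalman filter over noisy fixes; this simplified version (hypothetical names, straight-line motion assumed) just projects the last observed velocity forward:

```java
// Constant-velocity dead reckoning: while GPS is unavailable (e.g., a tunnel),
// project the last known position forward along the last observed velocity.
public class DeadReckoning {
    private boolean hasFix = false;
    private double lat, lng;              // last known fix
    private double latPerSec, lngPerSec;  // velocity estimated from the last two fixes
    private double lastFixTime;           // seconds

    public void onGpsFix(double newLat, double newLng, double timeSec) {
        if (hasFix) {
            double dt = timeSec - lastFixTime;
            if (dt > 0) {
                latPerSec = (newLat - lat) / dt;
                lngPerSec = (newLng - lng) / dt;
            }
        }
        lat = newLat; lng = newLng; lastFixTime = timeSec; hasFix = true;
    }

    // Predicted position while no fresh fix is available.
    public double[] predict(double timeSec) {
        double dt = timeSec - lastFixTime;
        return new double[] { lat + latPerSec * dt, lng + lngPerSec * dt };
    }
}
```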

πŸ—οΈ Advanced Concepts for Production Evolution

  1. Geohash Sharding: We shard Redis by Geohash prefixes (e.g., all 9q8... go to Shard A). If a rider is on a border, the Matching Service queries neighbors in parallel.
  2. Surge Pricing Engine: Monitors the "Supply/Demand" ratio. If 1,000 riders are in a cell with 10 drivers, the multiplier increases.
  3. Map Matching: Raw GPS is jumpy. We use a service to snap coordinates to valid road segments.
  4. WebSocket Load Balancing: Maintaining 1M+ concurrent WebSockets requires specialized gateways that can handle connection draining during deployments.
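
Geohash sharding from point 1 can be sketched as a prefix router (an illustrative assumption about the shard layout; names are hypothetical). A border query would call shardFor for each neighboring cell's hash and fan out to those shards in parallel:

```java
// Routes a driver or query to a Redis shard by geohash prefix, so drivers in
// the same cell (and thus near each other) land on the same shard.
public class GeoShardRouter {
    private final int numShards;
    private final int prefixLen;

    public GeoShardRouter(int numShards, int prefixLen) {
        this.numShards = numShards;
        this.prefixLen = prefixLen;
    }

    public int shardFor(String geohash) {
        String prefix = geohash.substring(0, Math.min(prefixLen, geohash.length()));
        // Stable hash of the prefix -> shard id in [0, numShards)
        return Math.floorMod(prefix.hashCode(), numShards);
    }
}
```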

🧭 Decision Guide

| Situation | Recommendation |
|---|---|
| Low QPS (<100/sec) | Use Postgres with PostGIS. It's simpler and more robust. |
| High QPS (>1k/sec) | Use Redis Geospatial or Quadtrees in memory. |
| Global Scale | Shard your in-memory index by City or Geohash prefix. |
| Accuracy Critical | Use Map Matching and HMM algorithms to smooth GPS noise. |

πŸ§ͺ Practical Example: Interview Delivery

If asked to design Uber in 20 minutes, focus on these three beats:

  1. The Write-Volatile Location Store: Explain why GPS updates go to Redis, not Postgres. Use the "100k writes/sec" math.
  2. The Matching Race Condition: Explain the "Double Booking" problem and the Redis SET NX lock solution.
  3. Geospatial Efficiency: Explain how Geohashing turns a 2D search into a 1D range query.

Standard Interview Closer: "I chose a Redis-backed geospatial architecture to handle the extreme velocity of location updates while ensuring sub-second matching latency. By decoupling the real-time tracking via WebSockets and using a transactional Postgres store for the trip lifecycle, we achieve the perfect balance between real-time performance and financial reliability."

πŸ› οΈ Redis Geospatial: How It Works in Practice

Redis is the de-facto standard for this problem.

Update Driver Location (Java Snippet)

// Using Jedis to update driver location in a city-specific set
public void updateDriverLocation(String cityId, String driverId, double lat, double lng) {
    // GEOADD key longitude latitude member (note: longitude comes first)
    jedis.geoadd("drivers:active:" + cityId, lng, lat, driverId);

    // Caveat: EXPIRE applies to the whole set key, not to a single member.
    // This keeps the city index from leaking if a city goes quiet; a per-driver
    // "offline after 60s" rule requires tracking last-ping timestamps and
    // pruning stale members separately (e.g., a companion Sorted Set).
    jedis.expire("drivers:active:" + cityId, 60);
}

Find Nearest Drivers (CLI Example)

# Find 10 drivers within 2km of the rider's lng/lat in San Francisco
GEORADIUS drivers:active:sf -122.41 37.77 2 km WITHDIST ASC COUNT 10

# Redis 6.2+ deprecates GEORADIUS; the equivalent GEOSEARCH form:
GEOSEARCH drivers:active:sf FROMLONLAT -122.41 37.77 BYRADIUS 2 km ASC COUNT 10 WITHDIST

Transactional Matching (Redlock)

RLock lock = redisson.getLock("driver_lock:" + driverId);
// Wait up to 10s to acquire; auto-release after 15s (the matching window).
// Note: tryLock(waitTime, leaseTime, unit) throws InterruptedException.
if (lock.tryLock(10, 15, TimeUnit.SECONDS)) {
    try {
        // Create trip in Postgres and notify driver via push
    } finally {
        lock.unlock();
    }
}

πŸ“š Lessons Learned

  • Architecture is about Frequency: High-frequency data (GPS) belongs in RAM; low-frequency data (Billing) belongs on Disk.
  • Maps are Noisy: Never show raw GPS to a user; always smooth it or snap it to the road.
  • Fail Fast: If a driver doesn't accept a match in 15 seconds, release the lock and find the next candidate immediately to keep the rider happy.

πŸ“Œ Summary & Key Takeaways

  • Redis Geospatial: The industry standard for high-velocity spatial indexing.
  • Geohashing: Converts a 2D coordinate into a 1D search problem.
  • Trip State Machine: The "Source of Truth" that prevents illegal trip transitions.
  • WebSocket Gateway: The bridge that enables the "Real-time" feel of the application.

πŸ“ Practice Quiz

  1. Why is Geohashing better than a simple radius search in a relational database?

    • A) It is more accurate.
    • B) It allows $O(\log N)$ searches by converting 2D coordinates into 1D strings/integers.
    • C) It is easier to code.

    Correct Answer: B

  2. What is the purpose of the Redis SET NX lock in the matching process?

    • A) To encrypt the driver's data.
    • B) To ensure a driver isn't matched to two riders simultaneously.
    • C) To speed up the database.

    Correct Answer: B

  3. What is "Map Matching"?

    • A) Finding the best route for a driver.
    • B) Snapping jumpy GPS coordinates to the nearest road on a map.
    • C) Comparing two maps for differences.

    Correct Answer: B

  4. [Open-ended] How would you handle a driver who loses their internet connection mid-trip? Discuss the impact on the State Machine and the User Experience.

Written by Abstract Algorithms (@abstractalgorithms)