System Design HLD Example: Hotel Booking System (Airbnb)
A senior-level HLD for a hotel booking platform handling availability, concurrency, and reservations.
Abstract Algorithms
TLDR: A robust hotel booking system must guarantee atomicity in inventory subtraction. The core trade-off is Consistency vs. Availability: we prioritize strong consistency for the booking path (PostgreSQL with Optimistic Locking) while allowing eventual consistency and high availability for the search path (Elasticsearch). A two-phase "Hold-then-Confirm" model ensures that inventory isn't leaked during payment failures.
The New Year's Eve Nightmare
Imagine it's 11:59 PM on New Year's Eve. Two different travelers, one in London and one in New York, are looking at the exact same penthouse in Manhattan for the upcoming weekend. Both click "Book Now" in the same millisecond.
In a poorly designed system, the sequence of events looks like this:
- Request A checks the database: "Is the room available?" -> Yes.
- Request B checks the database: "Is the room available?" -> Yes.
- Request A writes a booking record: "Room booked for User A".
- Request B writes a booking record: "Room booked for User B".
Both users receive a confirmation email. Both pay their non-refundable deposits. On Friday, they both show up at the same door with their luggage. This is the Double-Booking Race Condition, and it is the single most important problem a booking system must solve. At scale, "rare" edge cases happen thousands of times a day. If you design for the average case, you fail at the edges.
Global Reservation Systems: Use Cases & Requirements
Actors
- Guest / Traveler: Searches for rooms, views availability, and makes reservations.
- Host / Property Manager: Manages inventory, sets pricing, and views upcoming bookings.
- Admin: Handles disputes, refunds, and platform-wide monitoring.
Functional Requirements
- Search: Users can search rooms by location (geo-coordinates), date range, and guest count.
- Availability: Users see real-time availability for a listing before booking.
- Reservation (Hold): Selecting a room places a temporary 15-minute hold.
- Booking (Confirm): Successful payment converts a hold into a confirmed booking.
- Cancellation: Releasing a booking restores inventory for those specific dates.
Non-Functional Requirements
- Zero Double-Bookings: Strong consistency is non-negotiable for the final booking transaction.
- High Search Availability: Search should remain functional even if the booking database is under heavy load.
- Low Latency: Search results should return in < 200ms; booking confirmation in < 2s.
- Scalability: Handle 100k searches/sec and 500 bookings/sec (peak holiday spikes).
Basics: Baseline Architecture
At its core, a booking system is an Inventory Management Engine. Unlike a standard e-commerce site where you might have 1,000 units of a SKU, a hotel booking system has "Perishable Inventory." A room night on December 31st is a different "product" than the same room on January 1st.
The baseline architecture involves:
- Inventory Generation: Pre-calculating available slots for every room for the next 365 days.
- The Lock Mechanism: Ensuring that only one user can transition a slot from available to booked.
- The Buffer (Hold): Providing a grace period for payment processing so the user doesn't lose the room mid-transaction.
Without these basics, you end up with "Phantom Inventory": rooms that appear available but are actually locked in failing payment processes.
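The inventory-generation step above can be sketched as follows. This is a minimal in-memory illustration, not the production path: `Slot` and `InventoryGenerator` are hypothetical names, and a real system would presumably batch-insert these rows into the database rather than build a list.

```java
import java.time.LocalDate;
import java.util.ArrayList;
import java.util.List;

// Hypothetical row shape: one slot per room per night.
final class Slot {
    final String roomId;
    final LocalDate date;
    final String status;

    Slot(String roomId, LocalDate date, String status) {
        this.roomId = roomId;
        this.date = date;
        this.status = status;
    }
}

class InventoryGenerator {
    // Builds one "available" slot for each night in [start, start + days).
    static List<Slot> generate(String roomId, LocalDate start, int days) {
        List<Slot> slots = new ArrayList<>(days);
        for (int i = 0; i < days; i++) {
            slots.add(new Slot(roomId, start.plusDays(i), "available"));
        }
        return slots;
    }
}
```

Calling `InventoryGenerator.generate("room-1", LocalDate.of(2025, 1, 1), 365)` yields 365 slots, one per night of the booking window.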
Mechanics: Distribution & Processing Logic
The distribution of inventory must be handled carefully. When a host adds a new listing, we don't just add one row. We must generate 365 rows in the availability_slots table.
- Inventory Fan-out: Every update to a room's base availability (e.g., taking the room offline for maintenance) must propagate to all 365 days.
- Search Synchronization: Since search is handled by Elasticsearch, we use an asynchronous pipeline. A write to the primary DB triggers a Kafka event, which is then indexed into ES. This introduces a 1-2 second lag, which is acceptable for search but not for booking.
- State Machine: Every booking follows a strict state machine: Available -> Held -> Booked (or back to Available if the hold expires).
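One way to make the state machine explicit is an enum that lists the legal transitions. This is a sketch under assumptions: `SlotStatus` and its methods are hypothetical names, and the Booked -> Available edge is added here to model the cancellation requirement listed earlier.

```java
import java.util.EnumSet;
import java.util.Set;

// Slot lifecycle: Available -> Held -> Booked, with Held -> Available on expiry.
enum SlotStatus {
    AVAILABLE, HELD, BOOKED;

    // Legal next states for this status.
    Set<SlotStatus> allowedNext() {
        switch (this) {
            case AVAILABLE: return EnumSet.of(HELD);
            case HELD:      return EnumSet.of(BOOKED, AVAILABLE); // confirm or expire
            case BOOKED:    return EnumSet.of(AVAILABLE);         // cancellation (assumed)
            default:        return EnumSet.noneOf(SlotStatus.class);
        }
    }

    boolean canTransitionTo(SlotStatus next) {
        return allowedNext().contains(next);
    }
}
```

Rejecting any transition for which `canTransitionTo` returns false keeps a slot from jumping straight from Available to Booked without passing through a hold.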
Estimations & Design Goals
The Math of Inventory
- Total Listings: 10 Million rooms.
- Booking Window: 1 year (365 days).
- Total Inventory Rows: 10M × 365 = 3.65 Billion rows.
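- Search-to-Booking Ratio: 20:1. If we have 10k searches/sec, we might have 500 booking attempts/sec. At the 100k searches/sec peak from the requirements, a steady 500 bookings/sec implies a ratio closer to 200:1.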
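The estimates above can be checked with a few lines of arithmetic. `InventoryMath` is a hypothetical helper written only for this illustration; the one practical point it makes is that the row count does not fit in a 32-bit `int`.

```java
class InventoryMath {
    // Uses long arithmetic: 3.65 billion exceeds Integer.MAX_VALUE (2,147,483,647).
    static long totalSlotRows(long listings, int bookingWindowDays) {
        return listings * bookingWindowDays;
    }

    // Booking attempts per second implied by a searches-per-booking ratio.
    static long bookingAttemptsPerSec(long searchesPerSec, int searchesPerBooking) {
        return searchesPerSec / searchesPerBooking;
    }
}
```

`InventoryMath.totalSlotRows(10_000_000L, 365)` returns 3,650,000,000, and `InventoryMath.bookingAttemptsPerSec(10_000, 20)` returns 500.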
Design Goal: Decouple the "Read-Heavy" search path from the "Write-Heavy" booking path. We use a Command Query Responsibility Segregation (CQRS) inspired approach where Elasticsearch handles the searches and PostgreSQL handles the ACID transactions.
High-Level Design: Separating Search from Booking
The following architecture ensures that high-volume search traffic never interferes with the critical booking path.
    graph TD
        User((User)) --> LB[Load Balancer]
        LB --> AG[API Gateway]
        subgraph Search_Path
            AG --> SS[Search Service]
            SS --> ES[(Elasticsearch: Geo + Dates)]
            SS --> RC[(Search Cache: Redis)]
        end
        subgraph Booking_Path
            AG --> BS[Booking Service]
            BS --> AS[Availability Service]
            AS --> PDB[(Primary DB: Postgres)]
            BS --> PS[Payment Service]
        end
        subgraph Async_Sync
            PDB --> CDC[Debezium / CDC]
            CDC --> Kafka[Kafka]
            Kafka --> SS
            Kafka --> NS[Notification Service]
        end
Explanation of the Architecture: The design splits the system into two distinct paths. The Search Path leverages Elasticsearch for its superior geo-spatial and multi-attribute filtering capabilities, allowing users to find rooms based on location and date availability with sub-second latency. The Booking Path is built on PostgreSQL to take advantage of strict ACID compliance and row-level locking, which are essential for preventing double-bookings. Change Data Capture (CDC) via Debezium ensures that the Search index is eventually consistent with the source of truth in the Booking database.
API Design: The Contract
| Endpoint | Method | Payload | Description |
| --- | --- | --- | --- |
| /v1/search | GET | ?lat=...&lon=...&start=...&end=... | Searches for available listings in a region. |
| /v1/listings/{id} | GET | {} | Fetches detailed listing info and 30-day calendar. |
| /v1/holds | POST | { room_id, start_date, end_date, idempotency_key } | Places a temporary 15-minute lock on inventory. |
| /v1/bookings | POST | { hold_id, payment_token } | Confirms the booking after payment success. |
| /v1/bookings/{id} | DELETE | { reason } | Cancels a booking and releases inventory slots. |
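The `idempotency_key` in the `/v1/holds` payload suggests that a retried request should return the original hold rather than create a second one. The source does not spell out the mechanism, so the following is only a sketch of one common approach: `HoldIdempotencyStore` is a hypothetical in-memory stand-in for what would normally be a persistent table or cache.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical in-memory stand-in for a persistent idempotency table.
class HoldIdempotencyStore {
    private final Map<String, String> holdIdByKey = new ConcurrentHashMap<>();

    // Returns the hold ID for this key, creating one only on first use.
    // computeIfAbsent is atomic per key, so concurrent retries get the same ID.
    String holdIdFor(String idempotencyKey) {
        return holdIdByKey.computeIfAbsent(idempotencyKey, k -> UUID.randomUUID().toString());
    }
}
```

A client that times out and resends the same key gets the same hold ID back, so a flaky network cannot hold the room twice.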
Data Model: Schema Definitions
Availability Table (PostgreSQL)
This is the critical table for inventory management. It is sharded by room_id.
| Table | Column | Type | Notes |
| --- | --- | --- | --- |
| availability_slots | room_id | UUID (PK) | Sharding Key. |
| availability_slots | slot_date | DATE (PK) | Composite PK with room_id. |
| availability_slots | status | ENUM | available, held, booked. |
| availability_slots | version | INT | For Optimistic Locking. |
| availability_slots | held_until | TIMESTAMP | Expiry for the payment grace period. |
Search Index (Elasticsearch)
| Field | Type | Description |
| --- | --- | --- |
| listing_id | keyword | Unique identifier. |
| location | geo_point | Lat/Lon for radius search. |
| booked_dates | date (array) | List of dates currently unavailable. |
| price_per_night | scaled_float | Dynamic pricing for search filtering. |
Tech Stack & Design Choices
| Component | Choice | Rationale |
| --- | --- | --- |
| Primary Database | PostgreSQL | Required for ACID transactions and row-level locking to prevent double-booking. |
| Search Engine | Elasticsearch | Optimized for geo-spatial queries and complex filtering across billions of documents. |
| Caching Layer | Redis | Used for session management and distributed locks on "Hot Listings." |
| Message Broker | Apache Kafka | Decouples the booking completion from the search index update (CDC). |
| Distributed Lock | Redisson (Redis) | Provides reliable locking for high-contention scenarios at the application layer. |
Design Deep Dive
Internals: The Atomic "Hold" Logic
To prevent double-booking, we must ensure that the transition from available to held is atomic. We use Optimistic Locking to avoid heavy database locks that could stall the system.
The Workflow:
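- The application reads the current version of the slots for the requested dates.
- It attempts a conditional update inside a single transaction:
  UPDATE availability_slots SET status = 'held', version = version + 1, held_until = now() + interval '15 minutes' WHERE room_id = :id AND slot_date IN (:dates) AND status = 'available' AND version = :v;
  Here :v assumes every requested slot shares one version; if versions differ per date, the version check has to be applied per row.
- If the number of affected rows matches the number of requested dates, the hold is successful. If not, the transaction is rolled back so that no partial hold remains, and a Conflict error is returned.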
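The workflow above can be sketched in memory to show the all-or-nothing behaviour. This is not the database code: `RoomSlots` is a hypothetical stand-in for one room's rows, `synchronized` plays the role of the transaction, and versions are tracked per date.

```java
import java.time.LocalDate;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical in-memory stand-in for one room's rows in availability_slots.
class RoomSlots {
    static final class Row {
        String status = "available";
        int version = 0;
    }

    private final Map<LocalDate, Row> rows = new HashMap<>();

    RoomSlots(List<LocalDate> dates) {
        for (LocalDate d : dates) rows.put(d, new Row());
    }

    // Step 1: read the current version of a slot.
    synchronized int readVersion(LocalDate date) {
        return rows.get(date).version;
    }

    // Steps 2-3: conditional update. Returns false (a conflict) unless every
    // requested slot is still available at the version the caller read.
    synchronized boolean tryHold(Map<LocalDate, Integer> expectedVersions) {
        for (Map.Entry<LocalDate, Integer> e : expectedVersions.entrySet()) {
            Row row = rows.get(e.getKey());
            if (row == null || !row.status.equals("available") || row.version != e.getValue()) {
                return false; // nothing was modified, so nothing to roll back
            }
        }
        for (LocalDate d : expectedVersions.keySet()) {
            Row row = rows.get(d);
            row.status = "held";
            row.version++;
        }
        return true;
    }
}
```

If two callers read the same versions and both call `tryHold`, the first succeeds and bumps the versions, so the second sees a mismatch and gets a conflict.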
Performance Analysis: Handling "Hot" Listings
In cases like a world-famous hotel during a festival, thousands of users might target the same room_id.
- Bottleneck: The single Postgres shard for that room_id becomes a CPU bottleneck due to contention.
- SLO: Booking confirmation should happen in < 500ms even under load.
- Optimization: We introduce a Redis-based Pre-filter. Before hitting Postgres, we check a bitmask in Redis. If the bit for that date is already set, we reject the request at the edge, saving DB cycles.
Real-World Applications: Beyond Hotels
This architecture isn't just for Airbnb. It is the gold standard for any Perishable Inventory System:
- Flight Reservations: Seats on a specific flight are limited and time-sensitive.
- Concert Ticketing: Thousands of fans vying for the same "Seat A12" at the same second.
- Doctor Appointments: Specific 15-minute slots that cannot be double-booked.
- Ride-Sharing: Matching a specific driver to a rider for a specific time window.
Trade-offs & Failure Modes
- Consistency vs. Latency: We chose Strong Consistency for bookings. This means if the primary DB is slow, bookings slow down. We accept this because a double-booking is more expensive than a 1-second delay.
- Search Lag: Because we update Elasticsearch via CDC (Kafka), there is a 1-2 second "lag." A user might see a room as "Available" in search, click it, and then be told it's "Already Booked." This is a better UX than a system that is slow for everyone.
- Cascading Failure: If Kafka goes down, the search index becomes stale. Mitigation: A secondary "Availability Check" at the API Gateway level that queries a Redis cache of booked dates.
Advanced Concepts for Production: Scaling to Millions
- Database Sharding: We shard the availability_slots table by room_id. Since almost all queries are for a specific room's dates, we avoid cross-shard joins.
- Read Replicas for Search: While Elasticsearch is primary for search, the Availability Service can use Postgres Read Replicas to show the "Calendar View" on the listing page, reducing load on the Primary DB.
- Multi-Region Availability: To handle global traffic, we replicate Listing metadata globally, but keep the "Inventory Authority" in the region where the hotel is located to minimize cross-oceanic latency during the booking transaction.
- Predictive Pricing: Integrating an ML service that adjusts price_per_night based on demand signals in the search path.
Decision Guide: Choosing Your Locking Strategy
| Strategy | When to Use | Pros | Cons |
| --- | --- | --- | --- |
| Pessimistic Locking | Low-volume, high-value | Simplest to implement; guaranteed safety. | Blocks competing writers until commit; prone to deadlocks. |
| Optimistic Locking | High-volume, low-conflict | High throughput; no long-held DB locks. | Requires retry logic in application. |
| Redis Pre-Locking | Ultra-high-volume (Hot Keys) | Protects DB from spikes; lightning fast. | Adds dependency on Redis availability. |
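The retry logic that optimistic locking requires can be kept small. The sketch below assumes a hypothetical three-way outcome (`HoldOutcome`) so that only a stale version is retried, while a slot that is genuinely taken fails immediately; the names and the attempt limit are illustrative, not part of the original design.

```java
import java.util.function.Supplier;

// Hypothetical outcomes of one optimistic hold attempt.
enum HoldOutcome { SUCCESS, STALE_VERSION, UNAVAILABLE }

class OptimisticRetry {
    // Retries only while the version was stale; an unavailable slot fails fast.
    // Each call to attempt.get() is expected to re-read versions before updating.
    static HoldOutcome holdWithRetry(int maxAttempts, Supplier<HoldOutcome> attempt) {
        HoldOutcome last = HoldOutcome.STALE_VERSION;
        for (int i = 0; i < maxAttempts && last == HoldOutcome.STALE_VERSION; i++) {
            last = attempt.get();
        }
        return last;
    }
}
```

Separating the two failure cases keeps retries from hammering the database for a room that is already gone.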
Practical Example: Interview Delivery
In a 45-minute interview, spend your time here:
- The Math (5 min): Show you understand the 3.65 billion rows of inventory.
- The Schema (10 min): Draw the availability_slots table. It shows you know how to model time-based inventory.
- The Race Condition (15 min): Walk through the UPDATE ... WHERE version = V logic. This is the "Senior Engineer" moment.
Standard Response: "I designed this using a date-level inventory table because it allows for granular row-level locking. I separated the search path into Elasticsearch to ensure that even during a massive booking spike on New Year's Eve, the rest of the site remains responsive."
Redisson: Implementing Distributed Holds
In a Java/Spring environment, we use Redisson to serialize hold attempts on a room before hitting the database. This acts as a high-speed arbiter for high-contention listings.
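    import java.time.LocalDate;
    import java.util.List;
    import java.util.concurrent.TimeUnit;
    import org.redisson.api.RLock;

    // Lives in a service class that has `redisson` and `databaseService` fields.
    public boolean tryHoldInventory(String roomId, List<LocalDate> dates) {
        RLock lock = redisson.getLock("lock:room:" + roomId);
        boolean acquired = false;
        try {
            // 1. Wait up to 2 seconds for the lock; it auto-releases after a 15-second lease
            acquired = lock.tryLock(2, 15, TimeUnit.SECONDS);
            if (acquired) {
                // 2. Perform DB Optimistic Lock Update
                return databaseService.updateSlotsToHeld(roomId, dates);
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            // Unlock only if we acquired the lock and still own it; unlocking a lock
            // we never held (or whose lease expired) throws IllegalMonitorStateException
            if (acquired && lock.isHeldByCurrentThread()) {
                lock.unlock();
            }
        }
        return false;
    }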
This snippet demonstrates the Defense in Depth strategy: a distributed lock handles the "thundering herd" at the app layer, while the database version check handles the final source-of-truth integrity.
Lessons Learned
- Don't use booleans for state: is_booked is a trap. Always model inventory over time.
- Fail Fast: If an optimistic lock fails, tell the user immediately. Don't let them fill out a 5-page payment form for a room that's already gone.
- CDC is your friend: Using Kafka to keep Search in sync with Booking is more reliable than "Dual Writes" from the application layer.
Summary
- Inventory = Date-Level Rows: The only way to scale availability checks.
- Optimistic Locking: Prevents double-bookings without killing performance.
- Search/Booking Split: Uses the right tool for the job (ES for search, Postgres for transactions).
- Hold-then-Confirm: The industry standard for handling external payment failures.
Practice Quiz
Why is a "Hold" status necessary before payment?
- A) To make the database faster.
- B) To ensure the user has enough time to enter credit card details without someone else taking the room.
- C) To calculate the total price.
Correct Answer: B
What happens if the Search Service (Elasticsearch) is 5 seconds behind the Booking Service?
- A) The system crashes.
- B) A user might see a room as available that was just booked.
- C) The price will be incorrect.
Correct Answer: B
Which database constraint is the ultimate "Backstop" for double-bookings?
- A) Foreign Key.
- B) Unique Constraint on (room_id, slot_date).
- C) Not Null constraint.
Correct Answer: B
[Open-ended] How would you handle a "Bulk Booking" where a user wants to book 50 rooms for a wedding? Should it be one transaction or 50? What are the risks?
Written by
Abstract Algorithms
@abstractalgorithms