System Design HLD Example: Distributed Job Scheduler
A practical interview-ready HLD for a distributed job scheduler handling millions of scheduled tasks.
TLDR: A distributed job scheduler ensures tasks fire reliably using a durable Job Store with a next_fire_time index. To handle multiple scheduler instances without double-firing, we use optimistic row-level locking (UPDATE ... WHERE status='SCHEDULED'). By decoupling the Trigger Engine from the Worker Pool via a message queue (Kafka/SQS), the system achieves independent scaling and guaranteed recovery from node failures.
The "Ghost Task" Problem
Imagine you're building a billing system for a global SaaS platform. Every month, on the 1st at midnight, your system must trigger 10 million subscription renewals. If your scheduler is a simple in-memory cron job (like a basic Timer in Java or a crontab on a single server), and the server restarts at 11:59 PM, all 10 million jobs vanish. No bills are sent, and the company loses millions in revenue.
Conversely, if you run two copies of the scheduler for "high availability" without proper coordination, both might see that the clock hit midnight and both trigger the same billing job. Now, 10 million customers are double-charged. Your support lines are flooded, and your brand reputation is destroyed.
This is the Scheduling Paradox: the safer you make the system against missing a job (durability), the more likely you are to double-fire it (concurrency). A production-grade scheduler must solve two invariants simultaneously: never miss a job and never fire the same job twice. At the scale of 100 million tasks per day, solving this requires a robust distributed locking and state management strategy.
Distributed Job Scheduler: Use Cases & Requirements
Actors & Journeys
- Client Application: Submits tasks to be executed later (e.g., "Send this email in 2 hours").
- Trigger Engine: The "clock" of the system that identifies which tasks are due for execution.
- Worker Pool: The fleet of stateless execution nodes that perform the actual work (e.g., calling an external API).
- Admin Console: Provides visibility into the job pipeline, allowing for manual retries and status monitoring.
Functional Requirements
- One-time Scheduling: Support for "Run Job A at Timestamp T."
- Recurring Schedules (Cron): Support for "Run Job B every Monday at 9:00 AM."
- Job Lifecycle Management: Ability to cancel, pause, or update the payload of a pending job.
- Task Isolation: Failure in one job should not impact the scheduling or execution of others.
Non-Functional Requirements (NFRs)
- High Reliability (Durability): 99.99%. Once a job is accepted, it must be stored in a durable database to survive system crashes.
- High Precision: Jobs should fire within $\pm 1$ second of their target time.
- Scalability: Handle a backlog of 100M+ jobs and a throughput of 10k+ task dispatches per second.
- At-Least-Once Delivery: In the event of a network failure, the system prefers retrying a job over skipping it.
Basics: How Time-Based Triggers Work
At its simplest level, a scheduler is a Priority Queue where the "Priority" is the next_fire_time.
- The Polling Loop: A background thread wakes up every second, queries for all jobs where next_fire_time <= now, and marks them for execution.
- The Precision Problem: If your polling loop takes 2 seconds but you poll every 1 second, the polls start to overlap. If your database query takes 5 seconds, your "real-time" scheduler is now lagging significantly.
- The Solution: We must use a Durable Priority Store. Instead of a simple table scan, we use a B-tree index on the next_fire_time column. This allows the database to find the "next 100 jobs" in logarithmic time, ensuring the trigger engine stays snappy even with a 100M job backlog.
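The polling loop described above can be sketched with an in-memory priority queue ordered by next_fire_time, standing in for the B-tree-indexed Job Store (class and method names are illustrative, not from the source):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Minimal sketch of the trigger loop: a priority queue ordered by
// next_fire_time plays the role of the indexed Job Store.
public class TriggerLoop {
    record Job(String id, long nextFireTime) {}

    private final PriorityQueue<Job> store =
        new PriorityQueue<>((a, b) -> Long.compare(a.nextFireTime(), b.nextFireTime()));

    public void schedule(String id, long fireTime) {
        store.add(new Job(id, fireTime));
    }

    // One tick of the polling loop: drain every job whose fire time has passed.
    public List<String> pollDue(long now) {
        List<String> due = new ArrayList<>();
        while (!store.isEmpty() && store.peek().nextFireTime() <= now) {
            due.add(store.poll().id());
        }
        return due;
    }

    public static void main(String[] args) {
        TriggerLoop loop = new TriggerLoop();
        loop.schedule("job-a", 100);
        loop.schedule("job-b", 50);
        loop.schedule("job-c", 200);
        System.out.println(loop.pollDue(150)); // job-b and job-a are due, job-c is not
    }
}
```

In the real system the queue is the database index, so the same "peek the smallest fire time" operation costs a logarithmic B-tree descent instead of a full table scan.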
Mechanics: The Lifecycle of a Distributed Task
The life of a scheduled job follows a precise state machine to ensure reliability:
- SCHEDULED: The job is persisted in the DB and indexed in Redis.
- ACQUIRED/RUNNING: A scheduler instance has claimed the job and is currently pushing it to the message queue.
- DISPATCHED: The job is on the queue, waiting for a worker.
- SUCCESS/FAILED: Terminal states. If the job failed, it may be rescheduled for a retry with a new next_fire_time (Exponential Backoff).
The critical "ACQUIRED" step is protected by Optimistic Locking. We use a version number or a status check in the UPDATE statement to ensure that only one node can move a job from SCHEDULED to RUNNING.
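The retry path's new next_fire_time is typically derived with exponential backoff. A minimal sketch; the base delay and cap below are illustrative assumptions, not values from the design:

```java
// Exponential backoff sketch: delay doubles per attempt up to a cap.
// BASE_DELAY_SEC and MAX_DELAY_SEC are illustrative, not from the source.
public class RetryPolicy {
    static final long BASE_DELAY_SEC = 30;
    static final long MAX_DELAY_SEC = 3600;

    // attempt 0 -> 30s, 1 -> 60s, 2 -> 120s, ... capped at 1 hour
    public static long backoffSeconds(int attempt) {
        long delay = BASE_DELAY_SEC * (1L << Math.min(attempt, 20));
        return Math.min(delay, MAX_DELAY_SEC);
    }

    // The rescheduled next_fire_time written back to the Job Store.
    public static long nextFireTime(long failedAtEpochSec, int attempt) {
        return failedAtEpochSec + backoffSeconds(attempt);
    }
}
```

A random jitter term is often added on top of the backoff so that a burst of simultaneous failures does not retry in lockstep.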
Estimations & Design Goals
The Math of "Midnight Bursts"
- Daily Volume: 100 Million jobs/day.
- The Hot Minute: 20% of all daily jobs cluster at 00:00 (Midnight) for daily reports and billing.
- Burst Throughput: 20M jobs / 60 seconds $\approx \mathbf{333,000 \text{ jobs/sec}}$ to be triggered.
- Storage Requirement: 100M jobs $\times$ 500 bytes (JSON payload + metadata) $\approx \mathbf{50 \text{ GB/day}}$.
- Job Retention: If we keep 30 days of history, we need $\approx \mathbf{1.5 \text{ TB}}$ of storage.
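The burst and storage numbers above are easy to sanity-check with a few lines of arithmetic (method names are illustrative):

```java
// Back-of-the-envelope checks for the "Midnight Burst" estimates.
public class Estimates {
    // Fraction of the daily volume firing inside one hot window.
    public static long burstJobsPerSec(long dailyJobs, double hotFraction, long windowSec) {
        return Math.round(dailyJobs * hotFraction / windowSec);
    }

    // Payload + metadata bytes written per day.
    public static long storageBytesPerDay(long dailyJobs, long bytesPerJob) {
        return dailyJobs * bytesPerJob;
    }

    // Total footprint for N days of job history.
    public static long retentionBytes(long bytesPerDay, int days) {
        return bytesPerDay * days;
    }
}
```

Plugging in 100M jobs/day, a 20% hot minute, and 500 bytes per job reproduces the ~333k jobs/sec burst, 50 GB/day, and 1.5 TB/month figures.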
Design Goals
- Stateless Trigger Nodes: Schedulers should not "know" about jobs in memory. They must poll a shared durable store.
- Decoupled Ingest: Use a REST API to accept jobs and immediately return a 202 Accepted to the client.
High-Level Design: The Poll-Dispatch-Execute Pattern
The architecture separates the time-keeping (Trigger) from the work-execution (Worker).
```mermaid
graph TD
    Client -->|POST /jobs| API[Job API]
    API -->|Insert| DB[(Job Store: Postgres)]
    API -->|ZADD| Redis[(Redis Trigger Index)]
    subgraph Trigger_Engine
        Sched[Scheduler Service] -->|Poll| Redis
        Sched -->|CAS Lock| DB
        Sched -->|Dispatch| MQ[Message Queue: Kafka]
    end
    MQ --> Workers[Worker Pool]
    Workers -->|Heartbeat| DB
    Workers -->|Update Status| DB
```
Explanation of the Architecture:
The Job API persists the job metadata in Postgres and adds a lightweight "hint" (Job ID + Fire Time) to a Redis Sorted Set. The Scheduler Service polls Redis every second for jobs where timestamp <= current_time. To prevent multiple schedulers from picking the same job, it uses a Compare-And-Swap (CAS) update in Postgres. Only the scheduler that successfully updates the status to RUNNING is allowed to put the job on Kafka. Finally, the Worker Pool consumes from Kafka and executes the task.
API Design: The Scheduling Contract
The API allows clients to manage the lifecycle of their tasks without knowing about the underlying queue.
| Endpoint | Method | Payload | Description |
|---|---|---|---|
| /v1/jobs | POST | {"type": "email", "time": 1711706400, "data": {...}} | Schedule a new task. |
| /v1/jobs/{id} | GET | N/A | Check status: SCHEDULED, RUNNING, SUCCESS, FAILED. |
| /v1/jobs/{id} | DELETE | N/A | Cancel a pending job before it fires. |
Example Job Response:
```json
{
  "job_id": "j_abc_123",
  "status": "SCHEDULED",
  "next_fire_time": "2024-03-29T10:00:00Z",
  "retry_count": 0
}
```
Data Model: Schema Definitions
Job Store (PostgreSQL)
Postgres is the authoritative source of truth. We use a Partial Index to keep performance high even as the table grows to billions of rows.
| Table | Column | Type | Notes |
|---|---|---|---|
| jobs | id | UUID (PK) | Unique job identifier. |
| jobs | payload | JSONB | The data the worker needs to execute the task. |
| jobs | next_fire_time | BIGINT | Unix timestamp of the next run (Indexed). |
| jobs | status | VARCHAR(20) | SCHEDULED, RUNNING, SUCCESS, FAILED. |
| jobs | version | INT | Used for optimistic locking. |
Trigger Index (Redis)
- Key: jobs_due
- Score: next_fire_time
- Value: job_id
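The trigger index above can be sketched with a TreeMap standing in for the Redis sorted set: an in-memory approximation of ZADD, ZRANGEBYSCORE, and ZREM on the jobs_due key (not a real Redis client):

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

// In-memory stand-in for the Redis Trigger Index (sorted set "jobs_due").
public class TriggerIndex {
    private final TreeMap<Long, Set<String>> byScore = new TreeMap<>();

    // ZADD jobs_due <next_fire_time> <job_id>
    public void add(String jobId, long fireTime) {
        byScore.computeIfAbsent(fireTime, t -> new LinkedHashSet<>()).add(jobId);
    }

    // ZRANGEBYSCORE jobs_due -inf <now> LIMIT 0 <limit>
    public List<String> dueJobs(long now, int limit) {
        List<String> out = new ArrayList<>();
        for (Set<String> ids : byScore.headMap(now, true).values()) {
            for (String id : ids) {
                if (out.size() == limit) return out;
                out.add(id);
            }
        }
        return out;
    }

    // ZREM jobs_due <job_id> once the job has been dispatched
    public void remove(String jobId, long fireTime) {
        Set<String> ids = byScore.get(fireTime);
        if (ids != null && ids.remove(jobId) && ids.isEmpty()) byScore.remove(fireTime);
    }
}
```

The key property mirrored here is that "give me everything due by now" is a range scan from the smallest score, which Redis serves in O(log N + M) time.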
Tech Stack & Design Choices
| Component | Choice | Rationale |
|---|---|---|
| Primary Database | PostgreSQL | Reliability and strong support for row-level locking and JSONB. |
| In-Memory Index | Redis | Sorted Sets allow for $O(\log N)$ retrieval of "due" jobs. |
| Message Broker | Apache Kafka | Ensures high-throughput delivery and allows workers to "catch up" if they fall behind. |
| Worker Framework | Spring Boot / Go | Optimized for long-running, IO-bound tasks. |
| Locking Mechanism | DB Optimistic Locking | Simpler and more reliable than distributed ZooKeeper locks for this use case. |
Design Deep Dive
Internals: The Atomic "Claim" Logic
To ensure that a job is dispatched Exactly-Once (or as close to it as possible), the Scheduler Service follows a strict "Pick and Lock" protocol:
- Poll: Fetch 1,000 IDs from Redis where score <= now.
- Lock: For each ID, execute the following SQL: UPDATE jobs SET status = 'RUNNING', version = version + 1 WHERE id = ? AND status = 'SCHEDULED';
- Dispatch: If the database returns Rows Updated: 1, the scheduler has "won" the job and pushes it to Kafka. If it returns 0, another scheduler instance beat it to the punch.
- Clean up: Once dispatched, remove the ID from the Redis Sorted Set.
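The "Lock" step can be simulated in-process: ConcurrentHashMap.replace(key, expected, updated) is an atomic compare-and-swap, playing the same role as the conditional UPDATE above. This is a sketch of the winner-takes-the-job semantics, not the real database path:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of the "Pick and Lock" claim. The atomic replace() stands in for
// UPDATE jobs SET status='RUNNING' WHERE id=? AND status='SCHEDULED'.
public class JobClaimer {
    private final Map<String, String> status = new ConcurrentHashMap<>();

    public void schedule(String jobId) {
        status.put(jobId, "SCHEDULED");
    }

    // Returns true only for the one scheduler instance that wins the job;
    // every other contender sees the status already changed and backs off.
    public boolean tryClaim(String jobId) {
        return status.replace(jobId, "SCHEDULED", "RUNNING");
    }
}
```

Even if every scheduler node calls tryClaim on the same ID concurrently, exactly one call observes SCHEDULED and flips it, which is precisely the invariant the SQL CAS enforces.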
Performance Analysis: Partial Indexing
A table with 100M rows will be slow to query even with an index on next_fire_time. However, we only care about jobs that are not yet finished.
- The Optimization: We create a Partial Index: CREATE INDEX idx_pending_jobs ON jobs (next_fire_time) WHERE status = 'SCHEDULED';
- The Impact: This index only contains the currently pending tasks (e.g., 500k rows instead of 100M). This keeps the B-tree shallow and ensures our "Poll" queries remain sub-10ms even as we ingest millions of new jobs.
Real-World Applications
- Billing & Recurring Payments: High-precision tasks like subscription renewals and end-of-period billing where accuracy and durability are critical.
- Notification Campaigns: Time-delivered marketing emails or push notifications that must be scheduled and retried reliably.
- IoT Command Scheduling: Coordinated device updates and commands across regions that require durable, retryable execution semantics.
Trade-offs & Failure Modes
- Consistency vs Throughput: Optimistic locking keeps throughput high but increases retries under contention; pessimistic locks reduce retries but add latency and resource contention.
- Durability vs Latency: Using Postgres as the single source of truth ensures durability but makes the claim path heavier; pushing only "hints" to Redis reduces latency but requires robust recovery for missed entries.
- Exactly-Once vs At-Least-Once: Exactly-once requires extra coordination (e.g., distributed transactions or idempotency stores). The recommended approach favors at-least-once delivery with strong worker-side idempotency to simplify the architecture.
Advanced Concepts for Production Evolution
- Zombie Job Recovery (The Heartbeat): What if a worker starts a job but the pod crashes mid-execution? The status remains RUNNING forever.
  - Solution: Workers send a heartbeat to a last_heartbeat column in Postgres every 30 seconds. A "Watchdog Service" polls for jobs in RUNNING status with no heartbeat for $> 2$ minutes and resets them to SCHEDULED.
- Time-Wheel Optimization: For ultra-high precision (millisecond level), use a Hashed Hierarchical Time-Wheel in memory on the Scheduler nodes. This reduces the DB polling frequency and provides smoother trigger firing.
- Idempotent Workers: Despite our locking, network failures can cause duplicate Kafka messages. Every job should include an idempotency_key. Before the worker starts, it checks a Redis SET to see if that key was already processed in the last hour.
- Dynamic Sharding: If a single Postgres instance becomes the bottleneck, we shard the jobs table by job_id % N. Each Scheduler node is then assigned to poll a specific subset of shards.
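The worker-side idempotency check can be sketched with an in-memory set standing in for the Redis SET of recently processed keys (TTL expiry is omitted and names are illustrative):

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Sketch of worker-side deduplication: an in-memory set plays the role of
// the Redis SET of idempotency keys (real deployments would add a TTL).
public class IdempotentWorker {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    // Returns true if this delivery should run the job body;
    // false means the key was already seen, i.e. a duplicate Kafka message.
    public boolean beginIfFirst(String idempotencyKey) {
        return processed.add(idempotencyKey); // like SADD returning 1 only once
    }
}
```

The worker calls beginIfFirst before executing; a duplicate delivery is acknowledged to Kafka but its side effects are skipped, which is what makes at-least-once delivery safe.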
Decision Guide
Use this table when choosing a scheduling strategy for your system:
| Requirement | Recommended Choice | Reasoning |
|---|---|---|
| Sub-second precision | Kafka Delayed Messages or Time-Wheel | DB polling introduces 1-5s lag |
| Durable, auditable history | PostgreSQL + Partial Index | Full ACID semantics and crash recovery |
| Simple enterprise app | Quartz (DB-backed cluster) | Battle-tested, low operational overhead |
| Millions of jobs/day | Redis Sorted Set + Kafka dispatcher | In-memory lookahead, decoupled execution |
| Fault-tolerant retries | Dead Letter Queue + heartbeat watchdog | Isolates failures from the main job flow |
| Idempotent at-least-once | Worker-side idempotency key in Redis | Simpler than distributed transactions |
Practical Example: Interview Delivery
If asked to design this in a 20-minute interview, focus on these three beats:
- The Double-Fire Risk: Immediately identify that running multiple schedulers is the "Hard Part."
- Optimistic Locking: Explain the UPDATE ... WHERE status='SCHEDULED' trick. This shows you understand data consistency in distributed systems.
- Decoupled Workers: Explain why the scheduler doesn't do the work itself. This allows you to scale your "Clock" (the scheduler) and your "Hands" (the workers) independently.
Standard Interview Closer: "I designed a stateless scheduler architecture that leverages Redis for low-latency triggers and PostgreSQL for durable state management. By enforcing effectively-once dispatch through optimistic locking and incorporating a heartbeat mechanism for fault tolerance, we ensure that the system remains reliable and consistent even under massive 'Midnight Burst' traffic conditions."
Quartz Scheduler: How It Solves This in Practice
Quartz is the most mature open-source job scheduler for the JVM. It implements a clustered model very similar to the one we've designed.
Example: Quartz Cluster Lock (Java Snippet)
Quartz uses a specialized table QRTZ_LOCKS to ensure only one node in the cluster is firing at a time.
```java
// Quartz internal logic, simplified to pseudocode comments
public void triggerJob(JobKey key) {
    try (Connection conn = dataSource.getConnection()) {
        // Step 1: Acquire the cluster-wide row lock
        // SELECT * FROM QRTZ_LOCKS WHERE LOCK_NAME = 'TRIGGER_ACCESS' FOR UPDATE
        // Step 2: Check whether the trigger is already 'ACQUIRED' by another node
        // Step 3: Change status to 'EXECUTING' and commit
        // Step 4: Dispatch to the local thread pool
    }
}
```
While Quartz is excellent for enterprise apps, for a Global SaaS Scale, you should replace its "DB Lock" mechanism with the Kafka Dispatcher model described in this HLD to achieve better horizontal scalability.
Lessons Learned
- Never rely on Memory: If your scheduler node restarts, it should be able to rebuild its world from the DB in seconds.
- Jitter is your friend: If 10 million jobs are due at 00:00:00, don't fire them all at once. Add a random $\pm 10$ second "jitter" to the fire time to prevent a thundering herd on your downstream services.
- Isolate the Failure: Use a Dead Letter Queue (DLQ). If a specific job type is crashing workers, don't let it take down the whole scheduler.
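The jitter technique is a one-liner: spread each fire time across a small random window. A sketch, with the ±10 second window as an example value:

```java
import java.util.concurrent.ThreadLocalRandom;

// Jitter sketch: shift a job's fire time by a random offset in
// [-jitterSec, +jitterSec] so a midnight burst does not hit downstream
// services as a single thundering herd.
public class Jitter {
    public static long jittered(long fireTimeEpochSec, long jitterSec) {
        long offset = ThreadLocalRandom.current().nextLong(-jitterSec, jitterSec + 1);
        return fireTimeEpochSec + offset;
    }
}
```

Applied at enqueue time, a 20-second window turns a 333k jobs/sec spike into a far flatter ~17k jobs/sec stream.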
Summary & Key Takeaways
- Durable Store: Postgres with a partial index is the anchor of the system.
- Redis Trigger: Used as a fast, in-memory "lookahead" index.
- Optimistic Locking: The standard way to prevent double-firing across multiple scheduler nodes.
- Heartbeats: Necessary for detecting and recovering from worker node failures.
Practice Quiz
Why do we separate the Scheduler (Trigger) from the Worker?
- A) To make the code easier to write.
- B) To allow the Worker pool to scale independently based on the complexity of the jobs.
- C) To reduce the number of database connections.

Correct Answer: B
What happens if a Scheduler crashes after it locks a job in Postgres but before it puts the message on Kafka?
- A) The job is lost forever.
- B) The "Watchdog Service" will eventually see that the job is in RUNNING status but has no progress, and will reset it to SCHEDULED.
- C) Kafka will automatically find the job.

Correct Answer: B
In the SQL query UPDATE jobs SET status='RUNNING' WHERE id=? AND status='SCHEDULED', why is the AND status='SCHEDULED' clause critical?
- A) It makes the query faster.
- B) It ensures only one scheduler "wins" the job (Atomic Compare-and-Swap).
- C) It is required by Postgres syntax.

Correct Answer: B
[Open-ended] Your job scheduler needs to support millions of "Delayed Messages" (e.g., retry an order in 30 seconds). Would you use this HLD architecture or a specialized message queue like Amazon SQS or RabbitMQ? Discuss the trade-offs.
Written by Abstract Algorithms (@abstractalgorithms)