
System Design HLD Example: Distributed Job Scheduler

A practical interview-ready HLD for a distributed job scheduler handling millions of scheduled tasks.

Abstract Algorithms · 14 min read

TLDR: A distributed job scheduler ensures tasks fire reliably using a durable Job Store with a next_fire_time index. To handle multiple scheduler instances without double-firing, we use optimistic row-level locking (UPDATE WHERE status='SCHEDULED'). By decoupling the Trigger Engine from the Worker Pool via a message queue (Kafka/SQS), the system achieves independent scaling and guaranteed recovery from node failures.

πŸ‘» The "Ghost Task" Problem

Imagine you’re building a billing system for a global SaaS platform. Every month, on the 1st at midnight, your system must trigger 10 million subscription renewals. If your scheduler is a simple in-memory cron job (like a basic Timer in Java or a crontab entry on a single server), and the server restarts at 11:59 PM, all 10 million jobs vanish. No bills are sent, and the company loses millions in revenue.

Conversely, if you run two copies of the scheduler for "high availability" without proper coordination, both might see that the clock hit midnight and both trigger the same billing job. Now, 10 million customers are double-charged. Your support lines are flooded, and your brand reputation is destroyed.

This is the Scheduling Paradox: the safer you make the system against missing a job (durability), the more likely you are to double-fire it (concurrency). A production-grade scheduler must solve two invariants simultaneously: never miss a job and never fire the same job twice. At the scale of 100 million tasks per day, solving this requires a robust distributed locking and state management strategy.

πŸ“– Distributed Job Scheduler: Use Cases & Requirements

Actors & Journeys

  • Client Application: Submits tasks to be executed later (e.g., "Send this email in 2 hours").
  • Trigger Engine: The "clock" of the system that identifies which tasks are due for execution.
  • Worker Pool: The fleet of stateless execution nodes that perform the actual work (e.g., calling an external API).
  • Admin Console: Provides visibility into the job pipeline, allowing for manual retries and status monitoring.

Functional Requirements

  • One-time Scheduling: Support for "Run Job A at Timestamp T."
  • Recurring Schedules (Cron): Support for "Run Job B every Monday at 9:00 AM."
  • Job Lifecycle Management: Ability to cancel, pause, or update the payload of a pending job.
  • Task Isolation: Failure in one job should not impact the scheduling or execution of others.

Non-Functional Requirements (NFRs)

  • High Reliability (Durability): 99.99%. Once a job is accepted, it must be stored in a durable database to survive system crashes.
  • High Precision: Jobs should fire within $\pm 1$ second of their target time.
  • Scalability: Handle a backlog of 100M+ jobs and a throughput of 10k+ task dispatches per second.
  • At-Least-Once Delivery: In the event of a network failure, the system prefers retrying a job over skipping it.

πŸ” Basics: How Time-Based Triggers Work

At its simplest level, a scheduler is a Priority Queue where the "Priority" is the next_fire_time.

  1. The Polling Loop: A background thread wakes up every second, queries for all jobs where next_fire_time <= now, and marks them for execution.
  2. The Precision Problem: If your polling loop takes 2 seconds but you poll every 1 second, the polls start to overlap. If your database query takes 5 seconds, your "real-time" scheduler is now lagging significantly.
  3. The Solution: We must use a Durable Priority Store. Instead of a simple table scan, we use a B-tree index on the next_fire_time column. This allows the database to find the "next 100 jobs" in logarithmic time, ensuring the trigger engine stays snappy even with a 100M job backlog.
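The polling loop above can be pictured as a min-heap keyed on `next_fire_time`. Below is a minimal in-memory sketch (class and method names are my own, not part of any real scheduler); in production this role is played by the durable store and its B-tree index, not process memory:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class TriggerLoop {
    static class Job {
        final String id;
        final long nextFireTime;
        Job(String id, long nextFireTime) { this.id = id; this.nextFireTime = nextFireTime; }
    }

    // Min-heap ordered by next_fire_time -- the in-memory analogue
    // of the B-tree index on the next_fire_time column.
    private final PriorityQueue<Job> pending =
            new PriorityQueue<>(Comparator.comparingLong(j -> j.nextFireTime));

    public void schedule(String id, long fireTime) {
        pending.add(new Job(id, fireTime));
    }

    // One iteration of the polling loop: pop every job whose
    // next_fire_time <= now, in fire-time order.
    public List<String> pollDue(long now) {
        List<String> due = new ArrayList<>();
        while (!pending.isEmpty() && pending.peek().nextFireTime <= now) {
            due.add(pending.poll().id);
        }
        return due;
    }
}
```

Each `pollDue(now)` call drains only the jobs that are due, in order, without scanning the whole backlog.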

βš™οΈ Mechanics: The Lifecycle of a Distributed Task

The life of a scheduled job follows a precise state machine to ensure reliability:

  1. SCHEDULED: The job is persisted in the DB and indexed in Redis.
  2. ACQUIRED/RUNNING: A scheduler instance has claimed the job and is currently pushing it to the message queue.
  3. DISPATCHED: The job is on the queue, waiting for a worker.
  4. SUCCESS/FAILED: Final terminal state. If failed, the job may be rescheduled for a "Retry" with a new next_fire_time (Exponential Backoff).

The critical "ACQUIRED" step is protected by Optimistic Locking. We use a version number or a status check in the UPDATE statement to ensure that only one node can move a job from SCHEDULED to RUNNING.
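The compare-and-swap semantics of that conditional UPDATE can be mimicked in memory with `ConcurrentHashMap.replace`, which atomically succeeds for exactly one caller. This is a hedged sketch (the class and its names are illustrative, standing in for the real DB layer):

```java
import java.util.concurrent.ConcurrentHashMap;

public class JobClaim {
    // Stand-in for the jobs table: job_id -> status.
    private final ConcurrentHashMap<String, String> statusById = new ConcurrentHashMap<>();

    public void schedule(String jobId) {
        statusById.put(jobId, "SCHEDULED");
    }

    // Mirrors: UPDATE jobs SET status = 'RUNNING'
    //          WHERE id = ? AND status = 'SCHEDULED';
    // replace(key, expected, next) is an atomic compare-and-swap,
    // so only one scheduler node can move a job out of SCHEDULED.
    public boolean tryClaim(String jobId) {
        return statusById.replace(jobId, "SCHEDULED", "RUNNING");
    }
}
```

The first caller of `tryClaim` wins; every concurrent or subsequent caller sees `false` and skips the job, exactly like a `Rows Updated: 0` result from Postgres.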

πŸ“ Estimations & Design Goals

The Math of "Midnight Bursts"

  • Daily Volume: 100 Million jobs/day.
  • The Hot Minute: 20% of all daily jobs cluster at 00:00 (Midnight) for daily reports and billing.
  • Burst Throughput: 20M jobs / 60 seconds $\approx \mathbf{333,000 \text{ jobs/sec}}$ to be triggered.
  • Storage Requirement: 100M jobs $\times$ 500 bytes (JSON payload + metadata) $\approx \mathbf{50 \text{ GB/day}}$.
  • Job Retention: If we keep 30 days of history, we need $\approx \mathbf{1.5 \text{ TB}}$ of storage.

Design Goals

  • Stateless Trigger Nodes: Schedulers should not "know" about jobs in memory. They must poll a shared durable store.
  • Decoupled Ingest: Use a REST API to accept jobs and immediately return a 202 Accepted to the client.

πŸ“Š High-Level Design: The Poll-Dispatch-Execute Pattern

The architecture separates the time-keeping (Trigger) from the work-execution (Worker).

```mermaid
graph TD
    Client -->|POST /jobs| API[Job API]
    API -->|Insert| DB[(Job Store: Postgres)]
    API -->|ZADD| Redis[(Redis Trigger Index)]

    subgraph Trigger_Engine
        Sched[Scheduler Service] -->|Poll| Redis
        Sched -->|CAS Lock| DB
        Sched -->|Dispatch| MQ[Message Queue: Kafka]
    end

    MQ --> Workers[Worker Pool]
    Workers -->|Heartbeat| DB
    Workers -->|Update Status| DB
```
Explanation of the Architecture: The Job API persists the job metadata in Postgres and adds a lightweight "hint" (Job ID + Fire Time) to a Redis Sorted Set. The Scheduler Service polls Redis every second for jobs where timestamp <= current_time. To prevent multiple schedulers from picking the same job, it uses a Compare-And-Swap (CAS) update in Postgres. Only the scheduler that successfully updates the status to RUNNING is allowed to put the job on Kafka. Finally, the Worker Pool consumes from Kafka and executes the task.
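The Redis "hint" index can be pictured as follows. Here a `TreeMap` stands in for the Sorted Set (the `add`/`due` methods correspond to `ZADD` and `ZRANGEBYSCORE`); all names are illustrative, not a real Redis client:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class TriggerIndex {
    // Stand-in for the Redis Sorted Set: score (fire time) -> job ids.
    private final TreeMap<Long, List<String>> byFireTime = new TreeMap<>();

    // Analogue of: ZADD jobs_due <fireTime> <jobId>
    public void add(long fireTime, String jobId) {
        byFireTime.computeIfAbsent(fireTime, t -> new ArrayList<>()).add(jobId);
    }

    // Analogue of: ZRANGEBYSCORE jobs_due -inf <now>
    // headMap(now, true) returns all entries with key <= now.
    public List<String> due(long now) {
        List<String> ids = new ArrayList<>();
        byFireTime.headMap(now, true).values().forEach(ids::addAll);
        return ids;
    }
}
```

Because the structure is ordered by score, the scheduler never scans the full backlog; it only reads the prefix of jobs whose fire time has passed.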

πŸ”Œ API Design: The Scheduling Contract

The API allows clients to manage the lifecycle of their tasks without knowing about the underlying queue.

| Endpoint | Method | Payload | Description |
| --- | --- | --- | --- |
| `/v1/jobs` | POST | `{"type": "email", "time": 1711706400, "data": {...}}` | Schedule a new task. |
| `/v1/jobs/{id}` | GET | N/A | Check status: SCHEDULED, RUNNING, SUCCESS, FAILED. |
| `/v1/jobs/{id}` | DELETE | N/A | Cancel a pending job before it fires. |

Example Job Response:

```json
{
  "job_id": "j_abc_123",
  "status": "SCHEDULED",
  "next_fire_time": "2024-03-29T10:00:00Z",
  "retry_count": 0
}
```

πŸ—„οΈ Data Model: Schema Definitions

Job Store (PostgreSQL)

Postgres is the authoritative source of truth. We use a Partial Index to keep performance high even as the table grows to billions of rows.

| Table | Column | Type | Notes |
| --- | --- | --- | --- |
| jobs | id | UUID (PK) | Unique job identifier. |
| jobs | payload | JSONB | The data the worker needs to execute the task. |
| jobs | next_fire_time | BIGINT | Unix timestamp of the next run (indexed). |
| jobs | status | VARCHAR(20) | SCHEDULED, RUNNING, SUCCESS, FAILED. |
| jobs | version | INT | Used for optimistic locking. |

Trigger Index (Redis)

  • Key: jobs_due
  • Score: next_fire_time
  • Value: job_id

πŸ”§ Tech Stack & Design Choices

| Component | Choice | Rationale |
| --- | --- | --- |
| Primary Database | PostgreSQL | Reliability and strong support for row-level locking and JSONB. |
| In-Memory Index | Redis | Sorted Sets allow for $O(\log N)$ retrieval of "due" jobs. |
| Message Broker | Apache Kafka | Ensures high-throughput delivery and allows workers to "catch up" if they fall behind. |
| Worker Framework | Spring Boot / Go | Optimized for long-running, IO-bound tasks. |
| Locking Mechanism | DB Optimistic Locking | Simpler and more reliable than distributed ZooKeeper locks for this use case. |

🧠 Design Deep Dive

πŸ›‘οΈ Internals: The Atomic "Claim" Logic

To ensure that a job is dispatched Exactly-Once (or as close to it as possible), the Scheduler Service follows a strict "Pick and Lock" protocol:

  1. Poll: Fetch 1,000 IDs from Redis where score <= now.
  2. Lock: For each ID, execute the following SQL:
    UPDATE jobs 
    SET status = 'RUNNING', version = version + 1
    WHERE id = ? AND status = 'SCHEDULED';
    
  3. Dispatch: If the database returns Rows Updated: 1, the scheduler has "won" the job and pushes it to Kafka. If it returns 0, it means another scheduler instance beat it to the punch.
  4. Clean up: Once dispatched, remove the ID from the Redis Sorted Set.
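Putting steps 1–4 together, a single scheduler tick can be sketched against in-memory stand-ins for Postgres, Kafka, and Redis (all class and field names here are illustrative assumptions, not a real implementation):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

public class SchedulerTick {
    final ConcurrentHashMap<String, String> jobTable = new ConcurrentHashMap<>(); // Postgres stand-in
    final List<String> queue = new CopyOnWriteArrayList<>();                      // Kafka stand-in
    final Set<String> redisDue = ConcurrentHashMap.newKeySet();                   // Redis stand-in

    public void schedule(String id) {
        jobTable.put(id, "SCHEDULED");
        redisDue.add(id);
    }

    // One tick: claim each due id with a CAS; only winners are
    // dispatched to the queue. The hint is cleaned up either way,
    // since a loser knows another node already owns the job.
    public List<String> tick() {
        List<String> dispatched = new ArrayList<>();
        for (String id : new ArrayList<>(redisDue)) {
            boolean won = jobTable.replace(id, "SCHEDULED", "RUNNING"); // Step 2: lock
            if (won) {
                queue.add(id);       // Step 3: dispatch to the queue
                dispatched.add(id);
            }
            redisDue.remove(id);     // Step 4: clean up the trigger hint
        }
        return dispatched;
    }
}
```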

πŸ“Š Performance Analysis: Partial Indexing

With 100M rows, even an index on next_fire_time becomes deep and cluttered with entries for jobs that already finished. However, we only care about jobs that are not yet finished.

  • The Optimization: We create a Partial Index: CREATE INDEX idx_pending_jobs ON jobs (next_fire_time) WHERE status = 'SCHEDULED';
  • The Impact: This index only contains the currently pending tasks (e.g., 500k rows instead of 100M). This keeps the B-tree shallow and ensures our "Poll" queries remain sub-10ms even as we ingest millions of new jobs.

🌍 Real-World Applications

  • Billing & Recurring Payments: High-precision tasks like subscription renewals and end-of-period billing where accuracy and durability are critical.
  • Notification Campaigns: Time-delivered marketing emails or push notifications that must be scheduled and retried reliably.
  • IoT Command Scheduling: Coordinated device updates and commands across regions that require durable, retryable execution semantics.

βš–οΈ Trade-offs & Failure Modes

  1. Consistency vs Throughput: Optimistic locking keeps throughput high but increases retries under contention; pessimistic locks reduce retries but add latency and resource contention.
  2. Durability vs Latency: Using Postgres as the single source of truth ensures durability but makes the claim path heavier; pushing only "hints" to Redis reduces latency but requires robust recovery for missed entries.
  3. Exactly-Once vs At-Least-Once: Exactly-once requires extra coordination (e.g., distributed transactions or idempotency stores). The recommended approach favors at-least-once delivery with strong worker-side idempotency to simplify the architecture.

πŸ—οΈ Advanced Concepts for Production Evolution

  1. Zombie Job Recovery (The Heartbeat): What if a worker starts a job but the pod crashes mid-execution? The status remains RUNNING forever.
    • Solution: Workers send a heartbeat to a last_heartbeat column in Postgres every 30 seconds. A "Watchdog Service" polls for jobs in RUNNING status with no heartbeat for $> 2$ minutes and resets them to SCHEDULED.
  2. Time-Wheel Optimization: For ultra-high precision (millisecond level), use a Hashed Hierarchical Time-Wheel in memory on the Scheduler nodes. This reduces the DB polling frequency and provides smoother trigger firing.
  3. Idempotent Workers: Despite our locking, network failures can cause duplicate Kafka messages. Every job should include an idempotency_key. Before the worker starts, it checks a Redis SET to see if that key was already processed in the last hour.
  4. Dynamic Sharding: If a single Postgres instance becomes the bottleneck, we shard the jobs table by job_id % N. Each Scheduler node is then assigned to poll a specific subset of shards.
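The idempotent-worker check from point 3 can be sketched with `putIfAbsent`, which admits a key exactly once. In a real deployment this would be a Redis `SET` with a TTL; the in-memory version and its names are illustrative assumptions:

```java
import java.util.concurrent.ConcurrentHashMap;

public class IdempotentWorker {
    // Stand-in for the Redis SET of recently processed keys
    // (a real deployment would also attach a TTL, e.g. one hour).
    private final ConcurrentHashMap<String, Boolean> processed = new ConcurrentHashMap<>();

    // Returns true only for the first delivery of a given key;
    // duplicate Kafka deliveries become harmless no-ops.
    public boolean processOnce(String idempotencyKey) {
        return processed.putIfAbsent(idempotencyKey, Boolean.TRUE) == null;
    }
}
```

The worker runs the actual task only when `processOnce` returns true, making at-least-once delivery safe for non-idempotent side effects like charging a card.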

🧭 Decision Guide

Use this table when choosing a scheduling strategy for your system:

| Requirement | Recommended Choice | Reasoning |
| --- | --- | --- |
| Sub-second precision | Kafka Delayed Messages or Time-Wheel | DB polling introduces 1–5s lag |
| Durable, auditable history | PostgreSQL + Partial Index | Full ACID semantics and crash recovery |
| Simple enterprise app | Quartz (DB-backed cluster) | Battle-tested, low operational overhead |
| Millions of jobs/day | Redis Sorted Set + Kafka dispatcher | In-memory lookahead, decoupled execution |
| Fault-tolerant retries | Dead Letter Queue + heartbeat watchdog | Isolates failures from the main job flow |
| Idempotent at-least-once | Worker-side idempotency key in Redis | Simpler than distributed transactions |

πŸ§ͺ Practical Example: Interview Delivery

If asked to design this in a 20-minute interview, focus on these three beats:

  1. The Double-Fire Risk: Immediately identify that running multiple schedulers is the "Hard Part."
  2. Optimistic Locking: Explain the UPDATE ... WHERE status='SCHEDULED' trick. This shows you understand data consistency in distributed systems.
  3. Decoupled Workers: Explain why the scheduler doesn't do the work itself. This allows you to scale your "Clock" (the scheduler) and your "Hands" (the workers) independently.

Standard Interview Closer: "I designed a stateless scheduler architecture that leverages Redis for low-latency triggers and PostgreSQL for durable state management. By enforcing exactly-once dispatch through optimistic locking and incorporating a heartbeat mechanism for fault tolerance, we ensure that our system remains reliable and consistent even under massive 'Midnight Burst' traffic conditions."

πŸ› οΈ Quartz Scheduler: How It Solves This in Practice

Quartz is the most mature open-source job scheduler for the JVM. It implements a clustered model very similar to the one we've designed.

Example: Quartz Cluster Lock (Java Snippet)

Quartz uses a specialized table QRTZ_LOCKS to ensure only one node in the cluster is firing at a time.

```java
// Quartz's clustered firing logic, heavily simplified for illustration
public void triggerJob(JobKey key) throws SQLException {
    try (Connection conn = dataSource.getConnection()) {
        // Step 1: Acquire a row-level lock on the trigger row:
        //   SELECT * FROM QRTZ_TRIGGERS WHERE TRIGGER_NAME = ? FOR UPDATE

        // Step 2: Verify no other node has already marked it 'ACQUIRED'

        // Step 3: Set the state to 'EXECUTING' and commit, releasing the lock

        // Step 4: Hand the fired trigger to the local worker thread pool
    }
}
```

While Quartz is excellent for enterprise apps, for a Global SaaS Scale, you should replace its "DB Lock" mechanism with the Kafka Dispatcher model described in this HLD to achieve better horizontal scalability.

πŸ“š Lessons Learned

  • Never rely on Memory: If your scheduler node restarts, it should be able to rebuild its world from the DB in seconds.
  • Jitter is your friend: If 10 million jobs are due at 00:00:00, don't fire them all at once. Add a random $\pm 10$ second "jitter" to the fire time to prevent a thundering herd on your downstream services.
  • Isolate the Failure: Use a Dead Letter Queue (DLQ). If a specific job type is crashing workers, don't let it take down the whole scheduler.
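The jitter trick can be sketched in a few lines; the class and method names are my own, and the window size is a tunable parameter:

```java
import java.util.concurrent.ThreadLocalRandom;

public class FireTimeJitter {
    // Spreads jobs due at the same instant across a +/- window
    // (e.g. 10 seconds) to avoid a thundering herd on downstream
    // services when millions of jobs share one fire time.
    public static long apply(long fireTimeMs, long maxJitterMs) {
        long delta = ThreadLocalRandom.current().nextLong(-maxJitterMs, maxJitterMs + 1);
        return fireTimeMs + delta;
    }
}
```

Applying this at ingest time keeps the stored next_fire_time values spread out, so no single polling tick has to claim the entire burst.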

πŸ“Œ Summary & Key Takeaways

  • Durable Store: Postgres with a partial index is the anchor of the system.
  • Redis Trigger: Used as a fast, in-memory "lookahead" index.
  • Optimistic Locking: The standard way to prevent double-firing across multiple scheduler nodes.
  • Heartbeats: Necessary for detecting and recovering from worker node failures.

πŸ“ Practice Quiz

  1. Why do we separate the Scheduler (Trigger) from the Worker?

    • A) To make the code easier to write.
    • B) To allow the Worker pool to scale independently based on the complexity of the jobs.
    • C) To reduce the number of database connections.

    Correct Answer: B
  2. What happens if a Scheduler crashes after it locks a job in Postgres but before it puts the message on Kafka?

    • A) The job is lost forever.
    • B) The "Watchdog Service" will eventually see that the job is in RUNNING status but has no progress, and will reset it to SCHEDULED.
    • C) Kafka will automatically find the job.

    Correct Answer: B
  3. In the SQL query UPDATE jobs SET status='RUNNING' WHERE id=? AND status='SCHEDULED', why is the AND status='SCHEDULED' clause critical?

    • A) It makes the query faster.
    • B) It ensures only one scheduler "wins" the job (Atomic Compare-and-Swap).
    • C) It is required by Postgres syntax.

    Correct Answer: B
  4. [Open-ended] Your job scheduler needs to support millions of "Delayed Messages" (e.g., retry an order in 30 seconds). Would you use this HLD architecture or a specialized message queue like Amazon SQS or RabbitMQ? Discuss the trade-offs.

Written by Abstract Algorithms (@abstractalgorithms)