
System Design HLD Example: Payment Processing Platform

An interview-ready HLD for payments focusing on correctness, idempotency, and recovery.

Abstract Algorithms · 23 min read

TLDR: Design a payment processing system for online checkout. This article covers idempotent authorization, two-phase authorize-capture, double-entry ledger writes, webhook delivery with retry, and nightly reconciliation — with concrete schema, Redis key layout, and per-feature Java snippets.

TLDR: Payment systems optimize for correctness first, then throughput, because mistakes are expensive, auditable, and potentially career-ending.

Stripe processes over 250 million API requests per day, and every single payment must be idempotent: a user clicking "Pay" twice must never produce two charges, even when the first HTTP call timed out and no confirmation reached the client. The Knight Capital incident of 2012, in which a faulty software deployment sent millions of unintended orders and cost the firm roughly $440 million in about 45 minutes, shows how quickly automated financial operations can go wrong without guards against unintended repeated execution.

Designing a payment system forces you to prioritise correctness above all else. It is the domain where eventual consistency is a business and regulatory risk rather than merely a technical trade-off, and where every architecture decision has a permanent audit trail.

By the end of this walkthrough you'll know why payment amounts must be stored in integer cents (not floating point decimals), why ledger writes use double-entry accounting for reconcilability, and why idempotency keys must be client-supplied and cached for 24 hours to survive network retries without producing duplicate charges.

📖 Use Cases

Actors

| Actor | Role |
|---|---|
| Customer | Initiates payment at checkout; expects instant authorization |
| Merchant | Receives payment confirmation via webhook; reconciles with their own ledger |
| Payment API | Validates, deduplicates, routes to provider, writes to ledger |
| Payment provider | Stripe/Adyen: processes card authorization with card networks |
| Webhook worker | Delivers payment events to merchant endpoints with retry |
| Reconciliation job | Nightly: compares platform ledger against provider settlement file |

Use Cases

  • Primary interview prompt: Design a payment processing system for online checkout.
  • Core user journeys:
    • Authorize — hold funds on customer card; provider returns auth code; create authorized payment record
    • Capture — settle held funds; called after merchant confirms shipment (or immediately for instant capture)
    • Refund — return funds to customer; creates refunded record and reverses ledger entries
    • Idempotency — if client retries with same Idempotency-Key, return cached response without charging again
    • Ledger writes — double-entry: debit customer account, credit merchant account atomically
    • Webhook processing — emit events (payment.authorized, payment.captured) to merchant webhook URL with retry
    • Reconciliation — nightly job compares platform ledger against provider CSV settlement file; flags discrepancies
  • Read and write paths are explained separately so bottlenecks and consistency boundaries are explicit.

This template starts with actors and use cases because architecture only makes sense when user behavior and workload shape are clear. In interviews, this section prevents random tool selection and keeps the answer grounded in business outcomes.

🔍 Functional Requirements

In Scope

  • Payment authorization: POST /payments with Idempotency-Key; routes to Stripe/Adyen; returns authorized status
  • Payment capture: POST /payments/{id}/capture; settles held funds; updates ledger
  • Refund: POST /payments/{id}/refund; reverses the charge; creates offsetting ledger entries
  • Idempotency: Redis cache for 24h; DB unique constraint as the concurrent-request safety net
  • Ledger writes: double-entry bookkeeping; PostgreSQL with serializable isolation
  • Webhook delivery: Kafka-based async events; exponential backoff retry; DLQ after 5 failures
  • Reconciliation: nightly Spring Batch job; diff platform ledger vs provider CSV; alerts on discrepancies

Out of Scope (v1 boundary)

  • Multi-currency conversion (FX rates, cross-border settlement)
  • Fraud scoring and risk model evaluation
  • PCI-DSS card vault (delegate tokenization to Stripe/Adyen)
  • Chargeback dispute management

Functional Breakdown

| Feature | API Contract | Key Decision |
|---|---|---|
| Authorize | POST /payments + Idempotency-Key → { paymentId, status: "authorized" } | Persist payment row BEFORE calling provider |
| Capture | POST /payments/{id}/capture → { status: "captured" } | Distributed lock prevents double-capture |
| Refund | POST /payments/{id}/refund → { status: "refunded" } | Idempotent; check current status before refunding |
| Idempotency | Redis idempotency:{key} → cached response for 24h | DB unique constraint handles concurrent retries |
| Ledger writes | Atomic INSERT of 2 rows (debit + credit) per payment | Serializable isolation; SUM(credits) = SUM(debits) |
| Webhook | Kafka payment.captured → POST merchant URL | Retry with backoff; DLQ after 5 failures |
| Reconciliation | Nightly Spring Batch → compare ledger vs provider CSV | Flag unmatched payments; alert on discrepancy |

Initial building blocks:

  • Payment API — stateless Spring Boot service; idempotency check + provider adapter + ledger write
  • Idempotency Store — Redis idempotency:{key} → serialised response JSON; 24h TTL
  • Provider Adapter — Strategy pattern wrapping Stripe/Adyen/Braintree behind a common interface
  • Ledger Service — PostgreSQL with serializable isolation; double-entry writes per payment
  • Webhook Worker — Kafka consumer; POST to merchant URL; exponential backoff + DLQ
  • Reconciliation Job — nightly Spring Batch; reads provider settlement CSV + platform ledger; diffs and alerts

A strong answer names non-goals explicitly. Interviewers use this to judge prioritization quality and architectural maturity under time constraints.

⚙️ Non Functional Requirements

| Dimension | Target | Why it matters |
|---|---|---|
| Correctness | Zero duplicate charges; zero lost payments | Financial data correctness > availability |
| Performance | Authorization p95 < 3s (provider SLA bound); capture p95 < 1s | Provider API latency dominates; optimise what we can |
| Availability | 99.99% for payment intake; 99.9% for webhook delivery | Payment downtime = lost revenue |
| Consistency | Strong for payment and ledger writes; eventual for webhook delivery | Ledger must be consistent; webhook delivery is best-effort |
| Operability | Auth success rate, unreconciled count, webhook DLQ depth | Financial health signals |

Non-functional requirements are where many designs fail in practice. Naming measurable targets and coupling architecture decisions to those targets is far more useful than listing technologies.

🧠 Deep Dive: Estimations and Design Goals

The Internals: Data Model

The primary tables are payments, ledger_entries, and webhooks. Every payment is identified by an idempotency_key supplied by the client — this is the deduplication anchor for the entire system.

CREATE TABLE payments (
    payment_id       UUID         PRIMARY KEY DEFAULT gen_random_uuid(),
    idempotency_key  TEXT         UNIQUE NOT NULL,   -- client-supplied dedup key
    merchant_id      UUID         NOT NULL,
    amount_cents     BIGINT       NOT NULL,           -- always store in minor currency units
    currency         CHAR(3)      NOT NULL,           -- ISO 4217: USD, EUR, GBP
    status           TEXT         NOT NULL DEFAULT 'pending'
                     CHECK (status IN ('pending','authorized','captured','refunded','failed','cancelled')),
    provider         TEXT         NOT NULL,           -- 'stripe', 'adyen', 'braintree'
    provider_txn_id  TEXT,                            -- provider's transaction reference
    created_at       TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    authorized_at    TIMESTAMPTZ,
    captured_at      TIMESTAMPTZ,
    refunded_at      TIMESTAMPTZ,
    metadata         JSONB                            -- cart_id, customer_id, etc.
);
-- The UNIQUE constraint on idempotency_key already creates a unique index, so no separate index is declared here.
CREATE INDEX idx_payments_merchant_created ON payments (merchant_id, created_at DESC);

CREATE TABLE ledger_entries (
    entry_id         UUID         PRIMARY KEY DEFAULT gen_random_uuid(),
    payment_id       UUID         NOT NULL REFERENCES payments(payment_id),
    account_id       UUID         NOT NULL,           -- merchant or platform account
    entry_type       TEXT         NOT NULL CHECK (entry_type IN ('debit','credit')),
    amount_cents     BIGINT       NOT NULL,
    currency         CHAR(3)      NOT NULL,
    created_at       TIMESTAMPTZ  NOT NULL DEFAULT NOW(),
    reconciled_at    TIMESTAMPTZ                      -- NULL = unreconciled
);
-- Double-entry: every payment produces exactly 2 ledger rows (debit customer, credit merchant)
CREATE INDEX idx_ledger_reconcile ON ledger_entries (reconciled_at) WHERE reconciled_at IS NULL;

CREATE TABLE webhooks (
    webhook_id       UUID         PRIMARY KEY DEFAULT gen_random_uuid(),
    merchant_id      UUID         NOT NULL,
    payment_id       UUID         NOT NULL,
    event_type       TEXT         NOT NULL,           -- 'payment.authorized', 'payment.captured', etc.
    payload          JSONB        NOT NULL,
    status           TEXT         NOT NULL DEFAULT 'pending',
    retry_count      INT          NOT NULL DEFAULT 0,
    next_retry_at    TIMESTAMPTZ,
    delivered_at     TIMESTAMPTZ,
    created_at       TIMESTAMPTZ  NOT NULL DEFAULT NOW()
);
CREATE INDEX idx_webhooks_pending ON webhooks (next_retry_at) WHERE status = 'pending';

Redis key layout:

| Key | Value | TTL |
|---|---|---|
| idempotency:{key} | Serialised payment response JSON | 24 hours |
| payment:lock:{paymentId} | "1" (distributed lock) | 30 seconds |
| provider:health:{provider} | "healthy" or "degraded" | 60 seconds |

Estimations

Assumptions for a Stripe-scale service:

| Dimension | Assumption | Derivation |
|---|---|---|
| Payments/day | 10M/day | ≈ 116 payments/sec steady; 1K/sec peak |
| Ledger entries/day | 20M | 2 entries per payment (double-entry) |
| Webhook deliveries/day | 30M | 3 events per payment on average |
| Payment row size | ~2 KB | Payment + metadata JSONB |
| Idempotency key cache | ~5 GB Redis | 10M keys × 500B response |
| Provider API calls | 20M/day | 1 authorize + 1 capture per payment |

Key insight: Correctness, not throughput, is the primary constraint. The idempotency store and distributed lock are the two most critical components — a miss on the idempotency check means a potential duplicate charge.
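
The arithmetic behind the estimation table can be checked with a few lines of Java. This is only a back-of-envelope sketch: the class name `PaymentEstimates` is hypothetical, the per-row figure is rounded to 2,000 bytes, and the daily storage number is derived here rather than quoted from the table.

```java
// Back-of-envelope check of the estimation table. All inputs are the
// table's assumptions; the class and method names are illustrative only.
final class PaymentEstimates {
    static final long PAYMENTS_PER_DAY = 10_000_000L;
    static final long SECONDS_PER_DAY = 24L * 60 * 60;

    // Steady-state request rate, about 115.7 payments per second.
    static double steadyPaymentsPerSecond() {
        return (double) PAYMENTS_PER_DAY / SECONDS_PER_DAY;
    }

    // Double-entry: two ledger rows per payment.
    static long ledgerEntriesPerDay() {
        return PAYMENTS_PER_DAY * 2;
    }

    // Three webhook events per payment on average.
    static long webhookDeliveriesPerDay() {
        return PAYMENTS_PER_DAY * 3;
    }

    // One cached response of roughly 500 bytes per idempotency key.
    static long idempotencyCacheBytes() {
        return PAYMENTS_PER_DAY * 500L;
    }

    // Derived figure (not in the table): about 2 KB per payment row.
    static long paymentRowBytesPerDay() {
        return PAYMENTS_PER_DAY * 2_000L;
    }

    public static void main(String[] args) {
        System.out.printf("steady rate: %.1f payments/sec%n", steadyPaymentsPerSecond());
        System.out.printf("idempotency cache: %.1f GB%n", idempotencyCacheBytes() / 1e9);
        System.out.printf("payment rows: %.1f GB/day%n", paymentRowBytesPerDay() / 1e9);
    }
}
```

None of these numbers is large for modern hardware, which supports the point that correctness rather than raw throughput is the hard part.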

Design Goals

| Goal | Why it matters | Decision it drives |
|---|---|---|
| Idempotency is non-negotiable | Network retries will cause duplicate charge attempts without idempotency | Client-supplied Idempotency-Key header; cache response in Redis for 24h; return cached response on retry |
| Money amounts must never use floating point | 0.1 + 0.2 = 0.30000000000000004 in IEEE 754, which is catastrophic for finance | Always store and compute in integer minor units (cents); display layer converts to decimal |
| Ledger writes must be double-entry | Single-entry bookkeeping cannot detect consistency errors | Every payment produces exactly 2 ledger rows: debit customer, credit merchant; reconciliation verifies that debits and credits net to zero |
| Authorization and capture are separate steps | Hotels, car rentals need to hold funds before final amount is known | Two-phase flow (not a distributed two-phase commit): authorize holds funds; capture settles; cancel releases hold |
| Provider failover must be transparent to callers | Stripe downtime must not fail all checkouts | Provider adapter abstraction; failover to Adyen if Stripe health check fails |
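
The provider-failover goal can be illustrated with a small Strategy-pattern sketch. Everything here is an assumption made for illustration: the `PaymentProvider` interface, the `stub` factory, and the response string format are hypothetical, and a real adapter would wrap the provider SDKs and read health from the provider:health:{provider} key.

```java
import java.util.List;

// Hypothetical sketch of provider failover behind a common interface.
final class ProviderFailoverSketch {

    // Common contract that each provider adapter would implement.
    interface PaymentProvider {
        boolean isHealthy();
        String authorize(long amountCents, String currency);
    }

    // Test double standing in for a Stripe or Adyen adapter.
    static PaymentProvider stub(String name, boolean healthy) {
        return new PaymentProvider() {
            public boolean isHealthy() { return healthy; }
            public String authorize(long amountCents, String currency) {
                return name + ":authorized:" + amountCents + ":" + currency;
            }
        };
    }

    // Routes to the first healthy provider in priority order, so callers
    // never need to know which provider handled the request.
    static String authorize(List<PaymentProvider> providersInPriorityOrder,
                            long amountCents, String currency) {
        for (PaymentProvider provider : providersInPriorityOrder) {
            if (provider.isHealthy()) {
                return provider.authorize(amountCents, currency);
            }
        }
        throw new IllegalStateException("no healthy payment provider");
    }
}
```

One design caveat: routing decisions should be made before the request is sent. Retrying a timed-out authorization on a different provider could place two holds on the same card, so failover is safest when driven by health signals rather than by individual request timeouts.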

Performance Analysis

| Pressure point | Symptom | First response | Second response |
|---|---|---|---|
| Provider API slowdown | p95 auth latency > 3s | Circuit breaker + failover to secondary provider | Async capture path |
| Idempotency cache miss storm | Duplicate charge attempts spike | Increase Redis memory; check TTL settings | DB unique constraint as safety net |
| Ledger write contention | INSERT p99 climbs under burst | Batch ledger writes; use advisory locks | Partition ledger by merchant_id |
| Webhook backpressure | DLQ depth growing | Scale Kafka consumer group | Priority retry for high-value merchants |

Payment system health signals — the five metrics that matter most:

| Metric | Alert threshold | What it reveals |
|---|---|---|
| payment.authorization_success_rate | < 95% | Provider rejections increasing; check provider dashboard |
| payment.p95_latency_ms | > 3000ms | Provider API slow; network issue |
| idempotency.cache_hit_rate | > 5% of requests | High retry volume; possible client-side bug or network instability |
| ledger.unreconciled_entries_count | > 0 after daily reconciliation | Missed capture or double-entry inconsistency; needs immediate investigation |
| webhook.delivery_failure_rate | > 2% | Merchant endpoint down; DLQ accumulating |

📊 High Level Design - Architecture for Functional Requirements

Building Blocks

| Component | Responsibility | Technology |
|---|---|---|
| Payment API | Accept payment requests; idempotency check; route to provider | Spring Boot |
| Idempotency Store | Cache payment responses for 24h | Redis |
| Provider Adapter | Abstract Stripe/Adyen/Braintree behind common interface | Strategy pattern; HTTP client per provider |
| Ledger Service | Double-entry writes; account balance queries | PostgreSQL with serializable isolation |
| Webhook Worker | Deliver events to merchant URLs; retry with backoff | Kafka consumer + Spring Retry |
| Reconciliation Job | Nightly diff between platform ledger and provider CSV | Spring Batch |
| Payment Event Bus | Decouple payment state changes from webhook delivery | Kafka topics: payment.authorized, payment.captured |

Design the APIs

Authorize + Capture (instant capture):

POST /payments
Content-Type: application/json
Idempotency-Key: a1b2c3d4-e5f6-7890-abcd-ef1234567890

{
  "merchantId": "merchant-uuid",
  "amountCents": 4999,
  "currency": "USD",
  "paymentMethodId": "pm_card_visa",
  "captureMode": "automatic"
}

HTTP 201 Created
{
  "paymentId": "pay_uuid",
  "status": "captured",
  "amountCents": 4999,
  "currency": "USD",
  "providerTxnId": "pi_stripe_abc123",
  "createdAt": "2025-01-15T10:23:00Z"
}

Explicit Capture:

POST /payments/{paymentId}/capture

HTTP 200 OK
{
  "paymentId": "pay_uuid",
  "status": "captured",
  "capturedAt": "2025-01-15T10:30:00Z"
}

Refund:

POST /payments/{paymentId}/refund
Content-Type: application/json

{
  "amountCents": 4999,
  "reason": "customer_request"
}

HTTP 200 OK
{
  "paymentId": "pay_uuid",
  "status": "refunded",
  "refundedAt": "2025-01-15T11:00:00Z"
}

Status Check:

GET /payments/{paymentId}

HTTP 200 OK
{
  "paymentId": "pay_uuid",
  "status": "captured",
  "amountCents": 4999,
  "currency": "USD",
  "authorizedAt": "2025-01-15T10:23:00Z",
  "capturedAt": "2025-01-15T10:30:00Z"
}

Webhook payload (payment.captured):

{
  "eventType": "payment.captured",
  "paymentId": "pay_uuid",
  "merchantId": "merchant-uuid",
  "amountCents": 4999,
  "currency": "USD",
  "providerTxnId": "pi_stripe_abc123",
  "capturedAt": "2025-01-15T10:30:00Z",
  "metadata": { "orderId": "order-123", "customerId": "cust-456" }
}

Key design decisions:

  • Idempotency-Key header is mandatory on POST /payments; server caches the response in Redis for 24h
  • Amounts are always in integer cents — never float or decimal in the API or storage layer
  • captureMode: automatic is the default; manual is for hotels and car rentals that need to adjust the final amount

Communication Between Components

| Path | Protocol | Why |
|---|---|---|
| Client → Payment API | HTTPS (synchronous) | User-facing; must be fast and reliable |
| Payment API → Redis | TCP (synchronous) | Idempotency cache check exits here on hit |
| Payment API → PostgreSQL | JDBC (synchronous) | Persist payment + ledger in one transaction |
| Payment API → Provider | HTTPS (synchronous) | Authorization requires a synchronous response |
| Payment API → Kafka | Async publish | Decouple state-change events from webhook delivery |
| Kafka → Webhook Worker | Streaming consumer | Retry with backoff; DLQ after 5 failures |
| Reconciliation Job → Provider | SFTP/HTTPS batch | Nightly settlement file download |

Data Flow

Authorize path:

Client → POST /payments (Idempotency-Key: abc) → Payment API
  → Redis GET idempotency:abc      (cache hit: return cached response immediately)
  → PostgreSQL: payments INSERT (status=pending)   (persist before calling the provider; the unique idempotency_key rejects a concurrent duplicate)
  → Provider Adapter → Stripe Authorize API
  → PostgreSQL: UPDATE payments SET status=authorized + ledger INSERT (2 rows) in single transaction
  → Kafka publish payment.authorized
  → Redis SET idempotency:abc response (24h TTL)
  → return 201

Capture path:

Client → POST /payments/{id}/capture → Payment API
  → Redis SET payment:lock:{id} "1" NX EX 30 (distributed lock, 30s TTL; continue only if the key was newly set)
  → Provider Adapter → Stripe Capture API
  → PostgreSQL: UPDATE payments SET status=captured, captured_at=NOW()
  → Kafka publish payment.captured
  → return 200
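
The lock step in the capture path relies on set-if-absent semantics with a TTL. The sketch below imitates that behaviour in memory so the rule is easy to see; `CaptureLockSketch` and its methods are hypothetical names, and a production service would issue the equivalent Redis command instead of using a local map.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// In-memory imitation of a "set if absent, with expiry" lock.
// Illustrative only: a real deployment would use Redis for this.
final class CaptureLockSketch {
    private final Map<String, Long> lockExpiryMillis = new ConcurrentHashMap<>();

    // Acquires the lock only if no unexpired lock exists for this payment.
    // compute() runs atomically per key, so two callers cannot both win.
    boolean tryLock(String paymentId, long nowMillis, long ttlMillis) {
        boolean[] acquired = {false};
        lockExpiryMillis.compute("payment:lock:" + paymentId, (key, expiry) -> {
            if (expiry == null || expiry <= nowMillis) {
                acquired[0] = true;
                return nowMillis + ttlMillis;
            }
            return expiry; // someone else still holds the lock
        });
        return acquired[0];
    }

    // Releases the lock early once the capture has finished.
    void unlock(String paymentId) {
        lockExpiryMillis.remove("payment:lock:" + paymentId);
    }
}
```

A lock with a TTL narrows the race window but cannot close it entirely if a capture outlives the TTL. Checking the current payment status before updating, as the functional breakdown suggests for refunds, is a sensible second guard.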

Webhook path:

Kafka CONSUME payment.captured
  → Webhook Worker → POST merchant_webhook_url
      → 200 OK: UPDATE webhooks SET delivered_at=NOW(), status=delivered
      → Non-200: schedule retry (exponential backoff: 1s, 2s, 4s, 8s, 16s)
          → 5 failures: move to DLQ topic; alert on-call

Architecture flowchart (Mermaid source):

flowchart TD
    A[Customer] -->|POST /payments + Idempotency-Key| B[Payment API]
    B -->|GET idempotency:key| C[Redis Idempotency Store]
    C -->|cache hit: return cached| B
    B -->|authorize| D[Provider Adapter]
    D -->|Stripe Authorize API| E[Stripe / Adyen]
    E -->|auth code| D
    D --> B
    B -->|payments INSERT + ledger INSERT 2 rows| F[PostgreSQL]
    B -->|SET idempotency:key 24h| C
    B -->|publish payment.authorized| G[Kafka]
    G -->|consume| H[Webhook Worker]
    H -->|POST webhook_url| I[Merchant Server]
    I -->|200 OK| H
    H -->|retry / DLQ after 5 failures| J[Dead Letter Queue]
    K[Reconciliation Job] -->|nightly diff| F
    K -->|provider settlement CSV| E
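
The webhook retry schedule from the data flow above (1s, 2s, 4s, 8s, 16s, then DLQ) can be written as a small pure function. This sketch assumes one reading of that schedule: retry k waits 2^(k-1) seconds, and once five retries have failed the event moves to the dead letter queue. The class and method names are hypothetical.

```java
import java.time.Duration;
import java.util.Optional;

// Exponential backoff schedule for webhook retries (illustrative sketch).
final class WebhookBackoffSketch {
    static final int MAX_RETRIES = 5;
    static final Duration BASE_DELAY = Duration.ofSeconds(1);

    // retriesSoFar = number of retries already attempted and failed.
    // Returns the delay before the next retry, or empty when the event
    // should be moved to the dead letter queue instead.
    static Optional<Duration> nextDelay(int retriesSoFar) {
        if (retriesSoFar < 0) {
            throw new IllegalArgumentException("retriesSoFar must be >= 0");
        }
        if (retriesSoFar >= MAX_RETRIES) {
            return Optional.empty();
        }
        // Doubles on every retry: 1s, 2s, 4s, 8s, 16s.
        return Optional.of(BASE_DELAY.multipliedBy(1L << retriesSoFar));
    }
}
```

The returned delay would be added to the current time and stored in webhooks.next_retry_at. Production systems often add random jitter to each delay so that many failing deliveries do not retry in lockstep; that is omitted here to keep the schedule identical to the one in the text.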

🌍 Real-World Applications: API Mapping

Payment platform patterns power every major checkout system. Here is how each feature maps to real production systems:

| Feature | Real-world example | Design element |
|---|---|---|
| Idempotency | Stripe's Idempotency-Key header on every POST request | Redis cache + DB unique constraint on idempotency_key |
| Two-phase authorize/capture | Airbnb holds a payment on booking, captures on check-in | authorize → capture with distributed lock |
| Double-entry ledger | Stripe's financial ledger; PayPal's balance accounting | Two ledger rows per payment: debit + credit |
| Webhook delivery | Stripe webhooks for payment_intent.succeeded | Kafka consumer with exponential backoff + DLQ |
| Reconciliation | Adyen's settlement reports vs platform ledger | Spring Batch nightly diff + alerting |

The $0.10 problem: PayPal's early system reportedly stored amounts as FLOAT. A value such as $1.10 has no exact binary floating-point representation (a 64-bit double stores it as approximately 1.100000000000000089), so sums and products drift by tiny fractions of a cent and produce rounding errors at scale. The widely used convention is BIGINT minor units: 110 cents for $1.10, never 1.10 as a float.
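
A short Java sketch makes the difference concrete. The class `MoneyCentsDemo` and its helpers are illustrative names rather than part of any platform code, and the formatter assumes a currency with two minor-unit digits.

```java
import java.math.BigDecimal;

// Contrasts binary floating point with integer minor units (sketch).
final class MoneyCentsDemo {

    // Adds two amounts in minor units; addExact throws ArithmeticException
    // on overflow instead of silently wrapping around.
    static long addCents(long a, long b) {
        return Math.addExact(a, b);
    }

    // Display-layer conversion only: 110 becomes "1.10".
    // Assumes a two-decimal currency such as USD or EUR.
    static String formatCents(long cents) {
        return BigDecimal.valueOf(cents, 2).toPlainString();
    }

    public static void main(String[] args) {
        System.out.println(0.1 + 0.2 == 0.3);        // false: doubles drift
        System.out.println(addCents(10, 20) == 30);  // true: integers are exact
        System.out.println(formatCents(110));        // 1.10
    }
}
```

The conversion to a decimal string happens only at the display boundary; every stored value and every calculation stays in whole cents.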

🔑 Feature Deep Dive: Idempotency Implementation

A client's connection drops after sending POST /payments. Did the charge succeed? The client doesn't know — so it retries. Without idempotency, the customer is charged twice. With idempotency, the second request finds a cached response and returns it immediately.

The complete idempotency check pattern:

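// Imports this method relies on (Spring Data Redis and Spring's DAO exception hierarchy):
// import java.time.Duration;
// import org.springframework.dao.DataIntegrityViolationException;
public PaymentResponse createPayment(CreatePaymentRequest req, String idempotencyKey) {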
    // Step 1: Check Redis cache
    String cached = redis.opsForValue().get("idempotency:" + idempotencyKey);
    if (cached != null) return deserialize(cached, PaymentResponse.class);

    // Step 2: Try DB idempotency constraint (handles race between two identical concurrent requests)
    try {
        Payment payment = paymentRepository.save(new Payment(idempotencyKey, req));
        PaymentResponse response = providerAdapter.authorize(payment);
        payment.updateStatus(response);
        paymentRepository.save(payment);

        // Cache for 24h so retries within the day get the same response
        redis.opsForValue().set("idempotency:" + idempotencyKey, serialize(response), Duration.ofHours(24));
        return response;

    } catch (DataIntegrityViolationException e) {
        // Concurrent request with same key — wait for it to complete, then return its result
        return waitAndReturnCachedResponse(idempotencyKey);
    }
}

Why two layers (Redis + DB constraint)?

  • Redis handles the fast path: 99%+ of retries are caught here without touching the database.
  • DB unique constraint handles the race condition: two concurrent requests with the same key both miss Redis (cache not yet populated), but only one succeeds at the INSERT level. The loser gets a DataIntegrityViolationException and waits for the winner's response to appear in Redis.
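
The two layers can be exercised without Redis or PostgreSQL by substituting in-memory stand-ins. In this sketch, which is an illustration rather than the service code above, `responseCache` plays the role of Redis and `claimedKeys` plays the role of the unique constraint; the "in_progress" return value is an assumed placeholder for whatever wait-or-409 behaviour a real service would choose.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

// In-memory model of the two-layer idempotency check (illustrative).
final class IdempotentChargeSketch {
    private final Map<String, String> responseCache = new ConcurrentHashMap<>();
    private final Map<String, Boolean> claimedKeys = new ConcurrentHashMap<>();
    final AtomicInteger providerCalls = new AtomicInteger();

    String createPayment(String idempotencyKey, long amountCents) {
        // Layer 1: fast path, a completed request is answered from cache.
        String cached = responseCache.get(idempotencyKey);
        if (cached != null) {
            return cached;
        }
        // Layer 2: putIfAbsent is atomic, like an INSERT against a UNIQUE
        // column. Exactly one concurrent caller gets null back and proceeds.
        if (claimedKeys.putIfAbsent(idempotencyKey, Boolean.TRUE) != null) {
            String finished = responseCache.get(idempotencyKey);
            return finished != null ? finished : "in_progress";
        }
        providerCalls.incrementAndGet(); // stands in for the provider call
        String response = "authorized:" + amountCents;
        responseCache.put(idempotencyKey, response);
        return response;
    }
}
```

However many threads race on the same key, `providerCalls` ends at one, which is the property that prevents the double charge.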

🔑 Feature Deep Dive: Double-Entry Ledger

Every payment in a payment platform creates exactly two ledger entries. This is the same principle used in traditional accounting since Pacioli's 1494 Summa de arithmetica, adapted for digital payments.

The principle: Every transaction has an equal and opposite effect on two accounts:

  • Debit the customer account (funds leaving)
  • Credit the merchant account (funds arriving)

Both rows are written in a single PostgreSQL transaction with serializable isolation. If either write fails, both are rolled back — no partial state is ever persisted.
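
As a minimal illustration of the all-or-nothing rule, the sketch below appends both rows in one step and exposes a balance check. It is an in-memory model with hypothetical names (`LedgerSketch`, `post`, `imbalance`); the real guarantee comes from the PostgreSQL transaction, not from Java synchronization.

```java
import java.util.ArrayList;
import java.util.List;

// In-memory model of double-entry posting (illustrative only).
final class LedgerSketch {
    enum EntryType { DEBIT, CREDIT }

    static final class Entry {
        final String paymentId;
        final String accountId;
        final EntryType type;
        final long amountCents;

        Entry(String paymentId, String accountId, EntryType type, long amountCents) {
            this.paymentId = paymentId;
            this.accountId = accountId;
            this.type = type;
            this.amountCents = amountCents;
        }
    }

    private final List<Entry> entries = new ArrayList<>();

    // Validates first, then appends both rows together: either the debit
    // and the credit are both recorded, or neither is.
    synchronized void post(String paymentId, String debitAccount,
                           String creditAccount, long amountCents) {
        if (amountCents <= 0) {
            throw new IllegalArgumentException("amount must be positive");
        }
        entries.addAll(List.of(
            new Entry(paymentId, debitAccount, EntryType.DEBIT, amountCents),
            new Entry(paymentId, creditAccount, EntryType.CREDIT, amountCents)));
    }

    // Credits minus debits for one payment; zero means balanced.
    synchronized long imbalance(String paymentId) {
        long sum = 0;
        for (Entry e : entries) {
            if (e.paymentId.equals(paymentId)) {
                sum += e.type == EntryType.CREDIT ? e.amountCents : -e.amountCents;
            }
        }
        return sum;
    }

    synchronized int size() {
        return entries.size();
    }
}
```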

Why double-entry matters for reconciliation: The nightly reconciliation job runs this query:

-- Every payment should have exactly one debit and one credit
-- SUM(all credits) = SUM(all debits) — any row returned here is a data inconsistency
SELECT
    p.payment_id,
    SUM(CASE WHEN l.entry_type = 'credit' THEN l.amount_cents ELSE 0 END) AS total_credits,
    SUM(CASE WHEN l.entry_type = 'debit'  THEN l.amount_cents ELSE 0 END) AS total_debits
FROM payments p
JOIN ledger_entries l ON l.payment_id = p.payment_id
WHERE p.captured_at >= NOW() - INTERVAL '1 day'
GROUP BY p.payment_id
HAVING SUM(CASE WHEN l.entry_type = 'credit' THEN l.amount_cents ELSE 0 END)
    != SUM(CASE WHEN l.entry_type = 'debit'  THEN l.amount_cents ELSE 0 END);

Any row returned is a data inconsistency requiring immediate human investigation. The ledger.unreconciled_entries_count metric alerts if this count is ever non-zero after the daily run.
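
The query above checks internal balance. The other half of reconciliation, comparing the platform's records with the provider settlement file, can be sketched as a map diff. This assumes both sides have already been parsed into maps keyed by provider transaction ID; the class name and message formats are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeSet;

// Diffs platform amounts against provider settlement amounts (sketch).
final class ReconciliationSketch {

    // Both maps: provider transaction id -> captured amount in cents.
    // Returns one human-readable line per discrepancy, sorted by id.
    static List<String> diff(Map<String, Long> platformCents,
                             Map<String, Long> providerCents) {
        List<String> discrepancies = new ArrayList<>();
        TreeSet<String> allIds = new TreeSet<>(platformCents.keySet());
        allIds.addAll(providerCents.keySet());

        for (String id : allIds) {
            Long ours = platformCents.get(id);
            Long theirs = providerCents.get(id);
            if (ours == null) {
                discrepancies.add(id + ": missing in platform ledger");
            } else if (theirs == null) {
                discrepancies.add(id + ": missing in provider file");
            } else if (!ours.equals(theirs)) {
                discrepancies.add(id + ": amount mismatch " + ours + " vs " + theirs);
            }
        }
        return discrepancies;
    }
}
```

An empty result means every transaction matched; any line in the result would feed the alerting described above.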


🔑 Feature Deep Dive: Two-Phase Authorization and Capture

Most payment UIs present authorization and capture as a single "charge" operation, but card payments through providers such as Stripe and Adyen are typically processed in two phases:

  1. Authorize — the Payment API sends an authorization request to the card network via Stripe/Adyen. The network places a hold on the customer's funds (a pending charge appears on their statement). No money moves yet. The authorization code is stored in provider_txn_id.
  2. Capture — the Payment API sends a capture request referencing the authorization code. The card network settles the transaction and transfers funds. The captured_at timestamp is written.

Why this matters for specific industries:

  • Hotels authorize at booking time but don't know the final amount (minibar, room service). They capture at checkout with the final total.
  • Car rentals authorize a deposit hold; capture the actual rental fee on return.
  • Marketplaces authorize at order placement; capture only after the seller confirms shipment.

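Authorization expiry: An authorization does not stay valid indefinitely. This design assumes a 7-day window, a common default for online card payments, although the exact validity period varies by card network, merchant category, and provider. If capture is not called within that window, the authorization expires and the hold is released. The platform must track authorized_at and capture or cancel proactively before expiry.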

authorize → payment.status = 'authorized', authorized_at = NOW()
   |
   ├── capture called within 7 days
   |     → payment.status = 'captured', captured_at = NOW()
   |
   └── 7 days pass without capture
         → authorization expires, hold released automatically by card network
         → platform should cancel proactively at day 6 to avoid abandoned authorizations
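
The lifecycle above can be captured as a small state machine with an expiry check. The status names come from the payments table; the transition rules and the 6-day cancellation threshold follow the diagram, while the class and method names are hypothetical.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.EnumSet;
import java.util.Set;

// Payment status transitions and authorization expiry (sketch).
final class PaymentLifecycleSketch {
    enum Status { PENDING, AUTHORIZED, CAPTURED, REFUNDED, FAILED, CANCELLED }

    // Assumed 7-day validity window, cancel proactively after 6 days.
    static final Duration AUTH_VALIDITY = Duration.ofDays(7);
    static final Duration CANCEL_AFTER = Duration.ofDays(6);

    // Transitions assumed for this sketch; terminal states allow none.
    static Set<Status> allowedNext(Status from) {
        switch (from) {
            case PENDING:    return EnumSet.of(Status.AUTHORIZED, Status.FAILED);
            case AUTHORIZED: return EnumSet.of(Status.CAPTURED, Status.CANCELLED);
            case CAPTURED:   return EnumSet.of(Status.REFUNDED);
            default:         return EnumSet.noneOf(Status.class);
        }
    }

    static boolean canTransition(Status from, Status to) {
        return allowedNext(from).contains(to);
    }

    // True once the authorization is old enough to be cancelled proactively.
    static boolean shouldCancelProactively(Instant authorizedAt, Instant now) {
        return !now.isBefore(authorizedAt.plus(CANCEL_AFTER));
    }

    // True once the assumed validity window has fully elapsed.
    static boolean isExpired(Instant authorizedAt, Instant now) {
        return !now.isBefore(authorizedAt.plus(AUTH_VALIDITY));
    }
}
```

A scheduled job could scan payments in the authorized state and apply `shouldCancelProactively` to each authorized_at value.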

⚖️ Trade-offs & Failure Modes (Design Deep Dive for Non Functional Requirements)

Scaling Strategy

| Bottleneck | Symptom | Fix |
|---|---|---|
| Idempotency cache miss | Duplicate charge attempts; Redis miss rate climbs | Increase Redis memory; extend TTL; monitor key eviction |
| Ledger write contention | INSERT p99 grows under burst | Shard ledger by merchant_id; use write batching |
| Provider API latency | Auth p95 > 3s | Circuit breaker; automatic failover to secondary provider |
| Webhook DLQ depth | Merchant events delayed > 1h | Scale Kafka consumer group; add priority lanes for high-value merchants |

Availability and Resilience

  • Payment API is stateless — horizontal scale behind a load balancer; any pod can handle any request
  • Redis failure fallback — if Redis idempotency store is unreachable, fall back to DB unique constraint check; accept higher DB load, not downtime
  • PostgreSQL replication — synchronous replica for failover; read replicas for status-check queries
  • Circuit breaker — Resilience4j wraps every provider call; opens after 5 consecutive failures; returns 503 rather than hanging; health probe re-closes after 1 successful call
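
The circuit-breaker rule in the last bullet (open after 5 consecutive failures, fail fast while open, close again after a successful probe) can be modelled in a few lines. This is a hand-rolled sketch with hypothetical names, not Resilience4j code; Resilience4j itself is usually configured through failure-rate thresholds over a sliding window rather than a raw consecutive-failure count.

```java
import java.util.function.Supplier;

// Minimal consecutive-failure circuit breaker (illustrative sketch).
final class ConsecutiveFailureBreaker {
    enum State { CLOSED, OPEN }

    private final int threshold;
    private int consecutiveFailures = 0;
    private State state = State.CLOSED;

    ConsecutiveFailureBreaker(int threshold) {
        this.threshold = threshold;
    }

    synchronized State state() {
        return state;
    }

    // Runs the provider call while closed. While open it fails fast, which
    // the API layer would translate into a 503 instead of hanging.
    <T> T call(Supplier<T> providerCall) {
        synchronized (this) {
            if (state == State.OPEN) {
                throw new IllegalStateException("circuit open");
            }
        }
        try {
            T result = providerCall.get();
            recordSuccess();
            return result;
        } catch (RuntimeException e) {
            recordFailure();
            throw e;
        }
    }

    // Also invoked by a health probe to close the breaker again.
    synchronized void recordSuccess() {
        consecutiveFailures = 0;
        state = State.CLOSED;
    }

    synchronized void recordFailure() {
        consecutiveFailures++;
        if (consecutiveFailures >= threshold) {
            state = State.OPEN;
        }
    }
}
```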

Storage and Caching

| Layer | What it stores | Eviction policy |
|---|---|---|
| Redis (idempotency) | idempotency:{key} → response JSON | 24h TTL; LRU eviction on memory pressure |
| Redis (distributed lock) | payment:lock:{paymentId} | 30s TTL; auto-release prevents lock starvation |
| PostgreSQL (payments) | All payment records; source of truth | Soft-archive after 7 years (regulatory retention) |
| PostgreSQL (ledger) | All ledger entries; double-entry source | Never delete; append-only; partition by year |

Consistency, Security, and Monitoring

Consistency model by operation:

  • Payment creation — strong consistency; PostgreSQL write must succeed before returning paymentId
  • Ledger writes — serializable isolation; both rows commit atomically or not at all
  • Webhook delivery — eventual; at-least-once delivery with idempotency check at the merchant side
  • Reconciliation — eventual; nightly batch; discrepancies flagged next business day
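
Because webhook delivery is at-least-once, the merchant side has to tolerate duplicates. A minimal sketch of such a handler follows; it assumes the pair (eventType, paymentId) identifies an event uniquely, since the payload shown earlier carries no separate event ID, and all names are hypothetical.

```java
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Merchant-side duplicate suppression for webhook events (sketch).
final class WebhookDedupSketch {
    private final Set<String> processed = ConcurrentHashMap.newKeySet();

    // Runs the side effect only the first time an event is seen.
    // Returns true if the event was applied, false if it was a duplicate.
    boolean handle(String eventType, String paymentId, Runnable sideEffect) {
        String dedupKey = eventType + ":" + paymentId;
        if (!processed.add(dedupKey)) {
            return false; // already handled; acknowledge without repeating work
        }
        sideEffect.run();
        return true;
    }
}
```

In a real merchant system the processed-event record would be stored durably, ideally in the same database transaction as the side effect, so a crash between the two cannot lose or repeat work.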

Security:

  • TLS on all external calls; card data never touches the platform (tokenization by provider)
  • All amounts validated to be positive integers before provider call
  • Merchant webhook URLs validated against allowlist; HTTPS only
  • Idempotency-Key scoped per merchant — one merchant cannot replay another's key

Key SLO signals to monitor:

  • payment.authorization_success_rate — alert if < 95%
  • ledger.unreconciled_entries_count — alert if > 0 after daily reconciliation
  • webhook.delivery_failure_rate — alert if > 2%
  • idempotency.cache_hit_rate — alert if > 5% (high retry volume indicates client-side issue)

🧭 Decision Guide

| Situation | Recommendation |
|---|---|
| Idempotency strategy | Client-supplied Idempotency-Key + Redis cache (24h) + DB unique constraint safety net; never accept opaque keys without scoping to merchant |
| Double-entry vs single-entry ledger | Double-entry always; single-entry cannot detect inconsistencies; the extra row costs pennies; the audit value is priceless |
| Float vs integer for money | Integer minor units (cents) always; IEEE 754 cannot represent 0.10 exactly; display layer formats for humans |
| At-least-once vs exactly-once webhooks | At-least-once with idempotency check on merchant side; exactly-once delivery is prohibitively complex; make webhooks idempotent instead |
| Automatic vs manual capture | Default automatic for e-commerce; manual for hospitality, car rental, and marketplaces that need final-amount flexibility |
| Provider failover | Warm secondary provider (Adyen as backup to Stripe) with health probe; circuit breaker routes traffic on degraded signal |

🧪 Practical Example for Interview Delivery

A repeatable way to deliver this design in interviews:

  1. Start with actors, use cases, and scope boundaries.
  2. State estimation assumptions (QPS, storage growth, idempotency key volume).
  3. Draw HLD and explain each component responsibility.
  4. Walk through the idempotency flow end-to-end — this is the interviewer's most common follow-up.
  5. Describe the double-entry ledger and reconciliation job.

Question-specific practical note:

  • The three non-negotiables for payment systems: amounts in integer cents, idempotency keys on all write APIs, and double-entry ledger writes. Name these early.

A concise closing sentence that works well: "I would launch with this minimal architecture, monitor authorization success rate, unreconciled ledger count, and webhook DLQ depth, then scale the first saturated component before adding further complexity."

🏗️ Advanced Concepts for Production Evolution

When interviewers ask follow-up scaling questions, use a phased approach:

  1. Phase 1 — Correctness baseline: Idempotency store, double-entry ledger, and reconciliation job in place before any other concern.
  2. Phase 2 — Throughput: Partition ledger by merchant_id; scale Payment API horizontally; add read replicas for status queries.
  3. Phase 3 — Resilience: Multi-provider setup (Stripe primary, Adyen secondary); circuit breaker with automatic failover; Redis cluster for idempotency store.
  4. Phase 4 — Observability: Distributed tracing across Payment API → Provider → Ledger; payment funnel analytics; provider SLA tracking per endpoint.
  5. Phase 5 — Multi-region: Active-active payment intake with regional providers; cross-region ledger replication with conflict resolution; regulatory data residency controls.

This framing demonstrates that architecture decisions are tied to measurable outcomes, not architecture fashion trends.

🛠️ Persist-Before-Call and Content-Hash Idempotency: The Two Decisions That Prevent Double-Charges

Two architectural decisions make payment processing safe under failures. First, persist the payment intent record before calling the external provider, so a timed-out HTTP call can be retried without re-charging. Second, derive the idempotency key sent to the provider from a hash of the request content rather than forwarding an opaque client-supplied string, so every retry from any server instance produces the same key and the provider deduplicates the charge automatically.

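import com.stripe.model.PaymentIntent;
import com.stripe.net.RequestOptions;
import com.stripe.param.PaymentIntentCreateParams;
import org.apache.commons.codec.digest.DigestUtils;

// 1. Persist intent BEFORE calling Stripe — enables clean recovery on timeout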
//    On retry: same intentId is found, idempotency key is recomputed, Stripe returns original result
String intentId = ledger.createPendingIntent(
    request.orderId(), request.amountCents(), request.currency());

// 2. Content-hash idempotency key — deterministic across retries from any instance
//    Stripe will return the original PaymentIntent if this key was seen before
String idempotencyKey = DigestUtils.sha256Hex(
    request.orderId() + ":" + request.amountCents() + ":" + request.currency());

PaymentIntent intent = PaymentIntent.create(
    PaymentIntentCreateParams.builder()
        .setAmount((long) request.amountCents())
        .setCurrency(request.currency())
        .setPaymentMethod(request.paymentMethodId())
        .setConfirm(true).build(),
    RequestOptions.builder().setIdempotencyKey(idempotencyKey).build());

ledger.updateStatus(intentId, intent.getStatus(), intent.getId());

If the HTTP call to Stripe times out after createPendingIntent but before updateStatus, the retry reads the existing pending intent, recomputes the same SHA-256 key, and Stripe returns the original PaymentIntent, so no second charge is created. Because the provider-facing key is derived from the request content rather than taken from an opaque client-supplied string, a misbehaving client that sends a fresh key on every retry cannot bypass duplicate detection.
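
For readers who want to try the key derivation without pulling in Apache Commons Codec, the same idea can be sketched with the JDK alone. The class and method names here are hypothetical; the canonical string mirrors the orderId:amountCents:currency format used above.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Deterministic content-hash idempotency key using only the JDK (sketch).
final class ContentHashKeySketch {

    // The same order, amount, and currency always yield the same
    // 64-character lowercase hex key, on any server instance.
    static String idempotencyKey(String orderId, long amountCents, String currency) {
        String canonical = orderId + ":" + amountCents + ":" + currency;
        try {
            byte[] digest = MessageDigest.getInstance("SHA-256")
                .digest(canonical.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-256 is a mandatory algorithm on every Java platform.
            throw new IllegalStateException("SHA-256 unavailable", e);
        }
    }
}
```

Including the order ID in the hashed content matters: two genuinely separate purchases of the same amount still get different keys, while a retry of the same order does not.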

Spring Retry (@Retryable with exponential backoff) handles transient Stripe 5xx responses without adding try-catch scaffolding — but the idempotency and persist-before-call decisions are what guarantee correctness regardless of retry behaviour.

For a full deep-dive on double-entry ledger writes, reconciliation pipelines, and Stripe webhook processing, a dedicated follow-up post is planned.

📚 Lessons Learned

  • Always store money as integer minor units (cents) — floating point is catastrophic for financial calculations at scale.
  • Idempotency is not optional in payment systems — every write API must be idempotent from day one.
  • Double-entry bookkeeping is the only reliable way to detect ledger inconsistencies; the reconciliation query is your safety net.
  • Persist the payment record before calling the external provider — this single decision enables clean timeout recovery.
  • Webhook delivery is inherently at-least-once; design merchant endpoints to be idempotent rather than fighting the delivery model.

📌 TLDR: Summary & Key Takeaways

  • Integer cents, not floats: IEEE 754 binary floating point cannot represent 0.10 exactly; store amount_cents BIGINT, never a FLOAT or DOUBLE PRECISION amount.
  • Idempotency-Key on every write — Redis cache (24h) + DB unique constraint ensures retries never double-charge, even under concurrent requests.
  • Double-entry ledger — debit customer, credit merchant atomically in one PostgreSQL transaction; reconciliation verifies SUM(debits) = SUM(credits).
  • Two-phase authorize/capture — authorize holds funds; capture settles; this flexibility powers hotels, rentals, and marketplaces.
  • Provider abstraction — Strategy pattern over Stripe/Adyen enables transparent failover without caller changes.

📝 Practice Quiz

  1. Why must payment amounts be stored in integer minor units (cents) instead of floating-point values?

A) Integers are faster to index
B) IEEE 754 floating point cannot represent 0.10 exactly — rounding errors cause financial inconsistencies
C) Decimals are not supported by PostgreSQL

Correct Answer: B

  2. What is the purpose of the Idempotency-Key header on payment API requests?

A) To authenticate the merchant
B) To ensure that retrying a failed request does not result in duplicate charges
C) To route the request to the correct payment provider

Correct Answer: B

  3. What is double-entry bookkeeping and why does a payment platform require it?

A) Writing payment data to two databases simultaneously
B) Every transaction creates two ledger entries (debit + credit); imbalances reveal data inconsistencies
C) Sending the payment request twice to guarantee delivery

Correct Answer: B

  4. Open-ended challenge: A merchant's server fails after receiving a payment authorization but before confirming the capture. Walk through what happens with idempotency, the ledger, and the reconciliation job.