
The Dual Write Problem: Why Two Writes Always Fail Eventually — and How to Fix It

Transactional Outbox, CDC publishing, and cache invalidation — the patterns that eliminate dual write failures

Abstract Algorithms · 25 min read

TLDR: Any service that writes to a database and publishes a message in the same logical operation has a dual write problem. try/catch retries don't fix it — they turn failures into duplicates. The Transactional Outbox pattern co-writes business data and a pending event into the same DB transaction, then a relay delivers to Kafka separately. At-least-once delivery, idempotent consumers — no ghost orders, no phantom fulfillments.

📖 The Order That Silently Disappeared From Fulfillment

It is 11:42 PM on Black Friday. An order service receives a checkout request, writes the new row to PostgreSQL — status PENDING, payment captured — and then calls kafkaTemplate.send("order-events", payload). The Kafka broker is under peak load. The network call times out. A NetworkException is thrown, caught by catch (Exception e) { log.error(...); }, and silently swallowed.

The order exists in the database. The fulfillment service never receives the OrderPlaced event. The warehouse never picks the item. The customer sees "Order Confirmed ✅" on their screen. Three days later they call support — where is my order?

This is the dual write problem: a service commits state to two independent systems in what should be a single logical operation. Because the two writes are not atomic, any failure between them leaves the systems permanently inconsistent. And critically — both the database and the message broker appear healthy. There is no alert. There is no error page. Just a silent gap.

The dual write problem is not an edge case. It exists in every service that writes to a database and then sends a message, updates a cache, or pushes a search index entry in the same code path without a coordination strategy. The code looks correct. The bug is structural.


🔍 Three Failure Modes That Strike Every Multi-System Service

Before exploring solutions, it is essential to see all three failure modes. Engineers who only know Scenario A are surprised by B and C in production.

Scenario A — The Ghost Order: DB Succeeds, Kafka Fails

The most common failure. The PostgreSQL transaction commits successfully. The Kafka publish throws an exception after the commit. The downstream fulfillment consumer never receives the event — the order exists in the database but is invisible to every consumer that acts on it.

sequenceDiagram
    participant App as Order Service
    participant DB as PostgreSQL
    participant K as Kafka Broker

    App->>DB: BEGIN - INSERT INTO orders - COMMIT
    DB-->>App: Row committed id=7821
    App->>K: send order-events OrderPlaced id=7821
    K-->>App: NetworkException - timeout under load
    Note over App,K: Event never delivered. Order 7821 is a ghost in fulfillment.

The sequence diagram shows the critical timing: the COMMIT returns before the Kafka call is even attempted. PostgreSQL has no knowledge of Kafka — once committed, that row is permanent. There is no automatic compensation. The database and the message broker are now out of sync, forever, unless a human notices.

Scenario B — The Phantom Fulfillment: Kafka Succeeds, DB Fails

The failure can run in the opposite direction, and the consequences are worse. The event is published to Kafka successfully. Consumers start processing immediately — inventory is reserved, a shipment record is created. Then the database write fails with a ConstraintViolationException. The local transaction rolls back. The order row does not exist in the source-of-truth database. But the fulfillment service has already acted on the event and there is no undo. The customer may be charged for a shipment that has no corresponding order record.

Scenario C — The Stale Cache: DB Succeeds, Redis Fails

Not every dual write involves Kafka. When a user updates their shipping address, the service writes to PostgreSQL and then calls redisTemplate.opsForValue().set("user:" + userId, json). The Redis SET call fails silently — connection pool exhausted, Redis OOM, a transient network blip. The database holds the new address. The cache holds the old one. For the duration of the TTL, every read returns the stale address. The user places another order, sees their old address pre-filled, misses the discrepancy, and their package ships to the wrong location.

This variant is especially dangerous because Redis connection pool exhaustion rarely throws a loud exception in an async fire-and-forget call. The failure disappears without a trace.

The common thread across all three scenarios: writing to two independent systems without coordination creates a consistency window. Any failure inside that window produces permanent divergence.


⚙️ Why Every Instinctive Fix Makes the Problem Worse

When engineers first encounter the dual write problem, there are three instinctive fixes. All three are wrong — not because of implementation details, but because of structural incompatibilities with the systems involved.

Fix Attempt 1 — "Just Wrap It in a Try-Catch and Retry"

// This looks safe. It is not.
try {
    orderRepo.save(order);             // DB write
    kafka.send("order-events", event); // Kafka write — outside DB transaction
} catch (Exception e) {
    retryPublish(event); // Which write failed? We don't know.
}

The problem is that the catch block cannot distinguish which write failed. If the DB succeeded and Kafka failed, the retry re-publishes the event — safe only if every consumer is provably idempotent. If Kafka succeeded and the DB failed (and rolled back), the retry publishes a second message for data that no longer exists. Retry logic converts write failures into duplicate delivery problems. It moves the consistency gap; it does not close it.

Fix Attempt 2 — "Use a Distributed Transaction (2PC / XA)"

Two-Phase Commit is the theoretically correct solution. A coordinator sends PREPARE to all participants, waits for votes, then sends COMMIT. Either all commit or all roll back.

There are two fundamental blockers for modern distributed systems:

sequenceDiagram
    participant C as Coordinator
    participant DB as PostgreSQL XA
    participant K as Kafka

    C->>DB: PREPARE
    C->>K: PREPARE
    K-->>C: Protocol Not Supported - no XA implementation
    Note over C,K: 2PC cannot proceed. Kafka has no XA resource manager.

First: Kafka does not implement the XA protocol. The Kafka wire protocol has no concept of XA prepare/commit. Even Kafka's own transactions API (idempotent producers + beginTransaction) only guarantees atomicity across Kafka partitions — not across Kafka and an external RDBMS. You cannot enlist Kafka in a JTA transaction.

Second: 2PC is a blocking protocol prone to coordinator failure. If the coordinator crashes after sending PREPARE but before sending COMMIT, all participants hold locks and wait indefinitely. In a microservices environment where pods restart constantly, this pathological blocked state is not theoretical. The resolution requires manual operator intervention, which means downtime.

Fix Attempt 3 — "Write to Kafka First, Then the DB"

Reversing the order does not eliminate the dual write problem — it only changes which system holds the orphan record on failure. If Kafka publish succeeds and the DB write subsequently fails, you now have an event for an order that was never persisted. Scenario B becomes the default failure mode.

There is no safe ordering of two independent writes to two independent systems. The root cause is the absence of a shared transaction boundary.


🧠 Deep Dive: How the Transactional Outbox Achieves True Atomicity

The Transactional Outbox pattern resolves the dual write problem by changing the question. Instead of "how do we atomically write to DB and Kafka?" — which is impossible — it asks: "what if both writes go to the database, and we let a separate process deliver to Kafka?"

The answer is that you can write the business record and a pending event record to the same PostgreSQL database in the same ACID transaction. The two rows either commit together or roll back together. There is no gap. Kafka delivery becomes a separate, retryable, eventually-consistent operation.

The Internals: Outbox Table, Transaction Boundary, and Relay Pipeline

The pattern has three components:

Component 1: The outbox_events table — a staging area in the same database as your business tables. Each row represents an event that should be published to a message broker. The row is written atomically with the business record in the same transaction.

| Column | Type | Purpose |
| --- | --- | --- |
| event_id | VARCHAR (PK) | Globally unique ID — used as Kafka message key for consumer-side deduplication |
| aggregate_type | VARCHAR | Domain entity type, e.g. "Order" |
| aggregate_id | VARCHAR | Business entity ID, e.g. "7821" |
| event_type | VARCHAR | Logical event name, e.g. "OrderPlaced" |
| payload | TEXT | JSON-serialized event body |
| created_at | TIMESTAMPTZ | Commit time — controls relay ordering |
| published | BOOLEAN | false = pending delivery; true = relayed to Kafka |

Component 2: The OrderService — co-writes business data and the outbox row in a single @Transactional method. If anything throws before the transaction commits, both rows roll back. Neither a ghost order nor a missing event is possible at the database level.

Component 3: The Outbox Relay — a @Scheduled background bean that reads unpublished rows, calls KafkaTemplate.send(), and marks rows published = true on success. If delivery fails, the row remains published = false and is retried on the next poll cycle. This gives at-least-once delivery — consumers must be idempotent.

Performance Analysis: Polling Latency, Batch Size, and the CDC Alternative

The polling relay introduces a latency floor equal to the poll interval — typically 500ms–2s. Three tuning variables govern the throughput ceiling:

| Variable | Impact |
| --- | --- |
| Poll interval (fixedDelay) | Lower interval = lower latency, higher DB query frequency |
| Batch size (PageRequest) | Larger batch = higher throughput per cycle, more memory pressure |
| Index on (published, created_at) | Without this index, the SELECT WHERE published = false query degrades to a full table scan as the outbox grows |
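The first two variables set a hard throughput ceiling that is worth computing before tuning. A minimal sketch (RelayThroughput is an illustrative name; the formula deliberately ignores Kafka round-trip and query time, which only lower the real number):

```java
public class RelayThroughput {

    // Upper bound on events drained per second by a polling relay:
    // at most batchSize rows move per poll cycle. With fixedDelay,
    // the real cycle is pollInterval plus processing time, so actual
    // throughput is strictly lower than this ceiling.
    static double ceilingEventsPerSecond(int batchSize, double pollIntervalSeconds) {
        return batchSize / pollIntervalSeconds;
    }

    public static void main(String[] args) {
        // 100-row batches every 500ms, the defaults used later in this article
        System.out.println(ceilingEventsPerSecond(100, 0.5)); // prints 200.0
    }
}
```

If sustained event volume approaches this ceiling, published = false rows accumulate faster than the relay drains them; that backlog is the signal to shrink the interval, grow the batch, or move to CDC.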

The (published, created_at) composite index is non-optional on any table that will hold more than a few hundred rows. A missing index is the most common cause of outbox relay CPU spikes in production.

For scenarios where 500ms–2s latency is unacceptable — payment confirmations, real-time notifications, trading systems — the polling relay should be replaced with Change Data Capture (CDC). Debezium reads directly from the PostgreSQL Write-Ahead Log (WAL), detecting new outbox_events inserts within 50–100ms of commit without polling the primary database at all.


📊 Visualizing the Dual Write Gap and the Outbox Solution End to End

The diagram below contrasts the naive dual write (left path, failure gap visible) with the Transactional Outbox (right path, atomicity guaranteed through the shared DB transaction).

flowchart TD
    subgraph naive[Naive Dual Write - failure gap]
        N1[orderRepo.save] --> N2[COMMIT success]
        N2 --> N3[kafkaTemplate.send]
        N3 -->|NetworkException| N4[Event lost - gap opens]
    end

    subgraph outbox[Transactional Outbox - atomic]
        O1[BEGIN TX] --> O2[INSERT INTO orders]
        O2 --> O3[INSERT INTO outbox_events published=false]
        O3 --> O4[COMMIT - both rows durable]
        O4 --> O5[Outbox Relay - Scheduled 500ms]
        O5 --> O6[KafkaTemplate.send]
        O6 -->|success| O7[UPDATE outbox_events published=true]
        O6 -->|failure| O8[Row stays unpublished - retried next cycle]
    end

The two subgraphs make the structural difference explicit. In the naive path, the commit and the Kafka send are sequential and independent — any exception between them creates a permanent gap. In the Outbox path, both writes happen inside one database transaction. The Kafka send is delegated to the relay, which runs independently and retries safely. Failure means retry, not data loss.


🌍 Where Dual Writes Appear Across Real Production Architecture Layers

The dual write problem appears at every layer of a production system. Once you know the pattern, you see it everywhere.

Order service (DB + Kafka): The canonical case. Save order → publish OrderPlaced. Every e-commerce checkout path has this dual write. Netflix, Amazon, Stripe all solved it with some variant of the Transactional Outbox.

User profile service (DB + Redis): Update profile → update cache. Every read-heavy user service has this. The failure mode is stale PII served to downstream callers — wrong address, outdated payment preference, stale permission set.

Search indexing service (DB + Elasticsearch): Save product → index in Elasticsearch. Failed index writes mean stale search results. Users searching for a product that exists in the catalog cannot find it. Debezium's Elasticsearch Sink Connector solves this via CDC on the product table.

Read replica synchronization (primary DB + read replica): Technically handled by the database replication protocol, but application code that reads from a replica immediately after a write can observe stale data in the replication lag window. Services must either read from primary for consistency-sensitive reads or use read-your-writes sessions.

Notification service (DB + SQS/SNS): Save notification record → enqueue SQS message. If the SQS publish fails, the notification never delivers. The Transactional Outbox works identically with SQS as the delivery target — just change the relay to call sqsClient.sendMessage() instead of kafkaTemplate.send().

The pattern is universal. Any service that writes to a primary store and then propagates that write to a secondary system needs a coordination strategy.


⚖️ Trade-offs and Failure Modes Across Outbox, CDC, and Event Sourcing

No solution is universally optimal. Understanding the failure modes of each pattern is what lets you choose intelligently.

| Dimension | Transactional Outbox (Polling) | CDC + Debezium | Event Sourcing | Cache-Aside DELETE |
| --- | --- | --- | --- | --- |
| Atomicity guarantee | ✅ Both DB writes in one TX | ✅ Derives from WAL | ✅ Event IS the write | ✅ DB write is the source of truth |
| Delivery semantics | At-least-once | At-least-once | At-least-once | N/A (no broker) |
| Event latency | ~500ms–2s | ~50–100ms | Eventually consistent reads | ~1ms (cache miss path) |
| Operational complexity | Low — one @Scheduled bean | Medium — Debezium cluster | High — CQRS, projections, schema evolution | Very Low |
| Consumer requirement | Idempotent | Idempotent | Idempotent | N/A |
| Primary failure mode | Relay pod crash → delayed delivery (not data loss) | Debezium connector lag → delayed delivery | Projection rebuild on schema change | DELETE call failure → cache miss (acceptable) |
| DB WAL/binlog required | No | Yes (wal_level = logical) | N/A | No |
| Works with Kafka SaaS | ✅ Any broker | ✅ Via Kafka Connect | N/A | N/A |
| Outbox table growth | ❌ Needs archival job | ✅ Debezium marks offset, DB rows cleaned | ❌ Event log grows forever | N/A |

The most important row is primary failure mode. The Transactional Outbox cannot lose events — a relay crash means delayed delivery, not lost delivery. When the relay restarts, it re-reads all published = false rows and retries. This at-least-once guarantee is the whole point of the pattern.
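Since the relay guarantees at-least-once delivery, deduplication lives on the consumer side. A minimal, self-contained sketch of that check (IdempotentConsumer and the in-memory set are illustrative; production code would back the set with a processed_events table or a Redis set, keyed by the event_id that the outbox pattern uses as the Kafka message key):

```java
import java.util.HashSet;
import java.util.Set;
import java.util.function.Consumer;

// Consumer-side dedup wrapper: the event_id is checked before the business
// handler runs, so a redelivered event is acknowledged but not re-processed.
public class IdempotentConsumer {

    private final Set<String> processedEventIds = new HashSet<>();
    private final Consumer<String> handler;
    private int handled = 0;

    public IdempotentConsumer(Consumer<String> handler) {
        this.handler = handler;
    }

    public void onMessage(String eventId, String payload) {
        if (!processedEventIds.add(eventId)) {
            return; // duplicate delivery: already processed, skip silently
        }
        handler.accept(payload);
        handled++;
    }

    public int handledCount() { return handled; }

    public static void main(String[] args) {
        IdempotentConsumer consumer = new IdempotentConsumer(p -> {});
        consumer.onMessage("evt-1", "{\"orderId\":7821}");
        consumer.onMessage("evt-1", "{\"orderId\":7821}"); // relay retry duplicate
        consumer.onMessage("evt-2", "{\"orderId\":7822}");
        System.out.println(consumer.handledCount()); // prints 2
    }
}
```

The wrapper makes redelivery harmless: the relay may resend the same row any number of times, and the handler still runs exactly once per event_id.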

The dangerous failure mode for CDC is Debezium connector lag: if the connector falls behind the WAL (e.g., during a Kafka outage), events queue up in the WAL. Once WAL segments rotate, the connector must catch up or replay. Monitoring connector lag is mandatory in production.

Event Sourcing's primary failure mode is projection rebuild complexity: when the event schema evolves, existing projections may become invalid and must be rebuilt from the full event log, which can take hours for large event stores.


🧭 Choosing the Right Pattern for Your System's Constraints

The decision tree below maps system constraints to the appropriate pattern. Start with the type of secondary system you are writing to, then apply the latency and complexity budget.

flowchart TD
    A[Service writes to two systems] --> B{Secondary system type?}
    B -->|Message broker Kafka SQS RabbitMQ| C{Sub-100ms latency required?}
    C -->|Yes| D[CDC + Debezium Outbox Router - 50ms WAL-based]
    C -->|No| E[Transactional Outbox Polling Relay - 500ms to 2s]
    B -->|Cache Redis Memcached| F[Cache-Aside DELETE on write - read-through repopulates]
    B -->|Search Index Elasticsearch| G[CDC to ES Sink Connector]
    B -->|Another service DB or cross-service state| H{Audit trail or time-travel required?}
    H -->|Yes| I[Event Sourcing + CQRS]
    H -->|No| J[Transactional Outbox + Saga for compensation]

The flowchart above guides the initial pattern selection. Use the decision table below to validate your choice against operational constraints.

Consolidated Decision Table:

| Scenario | Recommended Pattern | Latency | Complexity | Consumer Idempotency |
| --- | --- | --- | --- | --- |
| DB + Kafka, standard team | Transactional Outbox (polling) | ~500ms | Low | Required |
| DB + Kafka, sub-100ms SLA | CDC + Debezium Outbox Router | ~50ms | Medium | Required |
| DB + Redis cache | Cache-Aside DELETE on write | ~1ms | Very Low | N/A |
| DB + Elasticsearch | CDC to ES Sink Connector | ~100ms | Medium | Required |
| Event-driven domain, audit trail | Event Sourcing + CQRS | Eventually consistent reads | High | Required |
| Multi-service distributed operation | Saga + Outbox per service | Eventually consistent | High | Required |

When Event Sourcing is the right choice: Audit trail is a first-class business requirement (financial services, healthcare, legal). You need time-travel or full replay capability — "what was the state of order 7821 at 3:15 PM on Tuesday?" The domain model is naturally event-driven. For the full implementation of Event Sourcing with Axon Framework, see Event Sourcing Pattern: Auditability, Replay, and Evolution of Domain State.


🧪 Full Transactional Outbox Implementation in Spring Boot

The following is a complete, production-ready implementation using Spring Data JPA for the database writes and Spring Kafka for event delivery. Each class maps directly to one of the three components described in the Deep Dive section.

The OutboxEvent Entity — Staging Table for Pending Events

@Entity
@Table(name = "outbox_events", indexes = {
    @Index(name = "idx_outbox_pending", columnList = "published, created_at")
})
public class OutboxEvent {

    @Id
    private String eventId;

    private String aggregateType;   // "Order", "Payment", "User"
    private String aggregateId;     // business entity ID — becomes Kafka message key
    private String eventType;       // "OrderPlaced", "PaymentCaptured"

    @Column(columnDefinition = "TEXT")
    private String payload;         // JSON-serialized domain event

    private Instant createdAt;
    private boolean published;

    protected OutboxEvent() {}

    public OutboxEvent(String eventId, String aggregateType, String aggregateId,
                       String eventType, String payload, Instant createdAt, boolean published) {
        this.eventId = eventId;
        this.aggregateType = aggregateType;
        this.aggregateId = aggregateId;
        this.eventType = eventType;
        this.payload = payload;
        this.createdAt = createdAt;
        this.published = published;
    }

    public String getEventId()     { return eventId; }
    public String getAggregateId() { return aggregateId; }
    public String getPayload()     { return payload; }
    public boolean isPublished()   { return published; }
    public void setPublished(boolean p) { this.published = p; }
}

The OrderService — Atomic Dual Write in a Single Transaction

Both the business record and the outbox row are written inside one @Transactional boundary. If any step throws before the transaction commits, both rows roll back atomically. Ghost orders and phantom events are structurally impossible at this layer.

@Service
@Transactional
public class OrderService {

    private final OrderRepository orderRepo;
    private final OutboxEventRepository outboxRepo;
    private final ObjectMapper mapper;

    public OrderService(OrderRepository orderRepo,
                        OutboxEventRepository outboxRepo,
                        ObjectMapper mapper) {
        this.orderRepo  = orderRepo;
        this.outboxRepo = outboxRepo;
        this.mapper     = mapper;
    }

    public Order placeOrder(PlaceOrderRequest req) throws JsonProcessingException {
        // Step 1: save business entity
        Order order = new Order(
            req.getCustomerId(),
            req.getItems(),
            OrderStatus.PENDING
        );
        orderRepo.save(order);

        // Step 2: write outbox event — SAME transaction as step 1
        String payload = mapper.writeValueAsString(
            new OrderPlacedEvent(order.getId(), req.getCustomerId(), order.getTotalAmount())
        );
        OutboxEvent event = new OutboxEvent(
            UUID.randomUUID().toString(),   // unique event_id — Kafka key for dedup
            "Order",
            order.getId().toString(),
            "OrderPlaced",
            payload,
            Instant.now(),
            false
        );
        outboxRepo.save(event);

        return order;
        // @Transactional commits both rows here — or rolls both back on any exception
    }
}

The OutboxRelay — Polls and Publishes Every 500ms

The relay is a lightweight Spring component with one responsibility: read published = false rows, deliver to Kafka, mark as delivered. It runs every 500ms with a batch cap of 100 rows to bound the Kafka round-trip window.

@Component
public class OutboxRelay {

    private final OutboxEventRepository outboxRepo;
    private final KafkaTemplate<String, String> kafka;

    public OutboxRelay(OutboxEventRepository outboxRepo,
                       KafkaTemplate<String, String> kafka) {
        this.outboxRepo = outboxRepo;
        this.kafka      = kafka;
    }

    @Scheduled(fixedDelay = 500)
    @Transactional
    public void relayPendingEvents() {
        List<OutboxEvent> pending = outboxRepo
            .findByPublishedFalseOrderByCreatedAtAsc(PageRequest.of(0, 100));

        for (OutboxEvent event : pending) {
            try {
                // Block until the broker acknowledges, so published=true is
                // only written after a confirmed send. An async callback would
                // run after this transaction commits, on a different thread.
                kafka.send("order-events", event.getAggregateId(), event.getPayload())
                     .get(5, TimeUnit.SECONDS);
                event.setPublished(true);
                outboxRepo.save(event);
                // Row is now permanently published and will not be retried
            } catch (Exception e) {
                // Row stays published=false and is retried next cycle.
                // This is correct at-least-once behavior, not a bug.
                break; // stop the batch to preserve per-aggregate ordering
            }
        }
    }
}

The OutboxEventRepository

public interface OutboxEventRepository extends JpaRepository<OutboxEvent, String> {
    List<OutboxEvent> findByPublishedFalseOrderByCreatedAtAsc(Pageable pageable);
}

The CDC + Debezium Alternative: Connector Config Only

When the polling relay's 500ms latency floor is unacceptable, replace the relay entirely with a Debezium PostgreSQL connector. Debezium reads from the WAL — no poll thread, ~50ms latency from commit to Kafka delivery.

{
  "name": "order-outbox-connector",
  "config": {
    "connector.class": "io.debezium.connector.postgresql.PostgresConnector",
    "database.hostname": "postgres",
    "database.port": "5432",
    "database.user": "debezium",
    "database.password": "secret",
    "database.dbname": "orders",
    "table.include.list": "public.outbox_events",
    "transforms": "outbox",
    "transforms.outbox.type": "io.debezium.transforms.outbox.EventRouter",
    "transforms.outbox.table.field.event.id": "event_id",
    "transforms.outbox.table.field.event.key": "aggregate_id",
    "transforms.outbox.route.by.field": "event_type"
  }
}

The EventRouter Single Message Transform reads event_type from each captured row and routes the event to the corresponding Kafka topic. The application writes to one table; consumers receive type-partitioned topics. For the full mechanics of WAL-based CDC, see How CDC Works Across Databases.

The DB + Redis Cache Variant: Invalidate, Never Update

For the DB + Redis scenario, the outbox is not needed. The correct fix is a single-line change in the write path: replace SET with DELETE.

@Transactional
public void updateUserProfile(Long userId, ProfileUpdateRequest req) {
    userRepo.save(new User(userId, req));        // DB write — source of truth
    redisTemplate.delete("user:" + userId);      // invalidate key — NOT update
    // Next GET /users/{userId}: cache miss → read from DB → repopulate with fresh data
}

Why DELETE beats SET under failures:

  • DELETE is idempotent. Deleting a non-existent key is a no-op. Retry is always safe.
  • A stale SET is dangerous. A SET with an outdated value propagates incorrect data to every caller until TTL expiry — potentially minutes.
  • Read-through repopulation is correct by construction. The next read misses, fetches from the authoritative database (which has the new value), and repopulates the cache with fresh data. The window of inconsistency is bounded to a single cache miss latency — milliseconds.
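The read path that makes DELETE safe can be sketched with a plain Map standing in for Redis (CacheAsideReader and loadFromDb are illustrative names, not part of the article's service code):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Cache-aside read path: miss -> load from the authoritative store -> repopulate.
// Combined with DELETE-on-write, the first read after an update refills the
// cache with the new value; a failed DELETE can simply be retried (idempotent).
public class CacheAsideReader {

    private final Map<String, String> cache = new HashMap<>();  // stands in for Redis
    private final Function<String, String> loadFromDb;          // authoritative source

    public CacheAsideReader(Function<String, String> loadFromDb) {
        this.loadFromDb = loadFromDb;
    }

    public String get(String key) {
        return cache.computeIfAbsent(key, loadFromDb); // miss -> DB -> repopulate
    }

    public void invalidate(String key) {
        cache.remove(key); // DELETE, not SET; removing an absent key is a no-op
    }
}
```

Note that a stale entry persists only until the key is invalidated; once the write path deletes it, the next get() is guaranteed to see the database's current value.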

🛠️ Spring Modulith: Transactional Events Without the Outbox Boilerplate

Spring Modulith offers a higher-level abstraction that delivers transactional event semantics without writing an outbox entity, relay bean, or repository. Under the hood it uses @TransactionalEventListener(phase = AFTER_COMMIT) — events are only dispatched after the owning transaction successfully commits to the database.

// Publish a domain event within the business transaction — dispatch is deferred
@Service
@Transactional
public class OrderService {

    private final ApplicationEventPublisher events;
    private final OrderRepository repo;

    public OrderService(ApplicationEventPublisher events, OrderRepository repo) {
        this.events = events;
        this.repo   = repo;
    }

    public Order placeOrder(PlaceOrderRequest req) {
        Order order = repo.save(new Order(req));
        events.publishEvent(new OrderPlacedEvent(order.getId(), req.getCustomerId()));
        // event is held in the Spring context until AFTER this @Transactional method commits
        return order;
    }
}

// Listener fires only after the DB transaction commits — if the DB rolls back, this never executes
@Component
public class OrderPlacedListener {
    private KafkaTemplate<String, String> kafkaTemplate; // injected by Spring

    @ApplicationModuleListener
    public void onOrderPlaced(OrderPlacedEvent event) {
        kafkaTemplate.send("order-events", event.orderId().toString(), toJson(event));
    }
}

The key guarantee: if repo.save() throws and the transaction rolls back, publishEvent() is discarded — @ApplicationModuleListener never fires. This eliminates Scenario A (ghost orders) without a polling relay or a separate outbox table.

The remaining exposure: the Kafka send() inside the listener can still fail. Spring Modulith does not retry failed listener invocations by default — you need a dead-letter queue or a fallback to the polling outbox for full at-least-once guarantees. For teams that want WAL-level durability without managing Debezium, combining Spring Modulith events with a persisted outbox table gives both the developer experience benefit and the operational safety net.

For a full deep-dive on Spring Modulith's event system, see the official Spring Modulith documentation.


📚 Hard Lessons from Dual Write Failures in Production

  • Every service that writes to DB and publishes an event has a dual write problem. Audit every method body containing both save() and kafkaTemplate.send() — each one is a latent incident with a date on it.
  • "Just retry" converts write failures into duplicate events. Retry without knowing which write failed turns a consistency problem into a duplication problem. It is not a fix.
  • 2PC is dead for Kafka. Kafka has no XA implementation. Even JTA-aware brokers make 2PC a coordinator-failure liability. Distributed systems practice abandoned 2PC across heterogeneous systems years ago; do not rediscover why on-call at 2 AM.
  • The Transactional Outbox is the default answer for 80% of teams. One extra table, one @Scheduled bean, no external tooling. It works with any message broker. Reach for CDC only when you have hit the polling latency wall.
  • The outbox table is not self-cleaning. published = true rows accumulate. Without a periodic archival job (DELETE FROM outbox_events WHERE published = true AND created_at < NOW() - INTERVAL '7 days'), the table becomes a performance liability within weeks on high-throughput systems.
  • Delete beats update for cache invalidation. An idempotent DELETE is always safe under failures. A stale SET propagates incorrect data to every reader. Treat the cache as a read accelerator, not a write-through mirror.
  • At-least-once delivery + idempotent consumers is the correct default posture for event-driven systems. Design for it from day one. The event_id as Kafka message key, checked against a processed_events table or Redis set, is the standard consumer-side deduplication pattern.

📌 TLDR: The Dual Write Decision Cheat Sheet

  • The problem: Writing to two independent systems (DB + Kafka, DB + Redis) without coordination creates a consistency window — any failure inside that window leaves the systems permanently out of sync.
  • Try-catch retries are wrong: They cannot distinguish which write failed — they turn write failures into duplicate delivery without closing the consistency gap.
  • 2PC is a dead end for Kafka: Kafka has no XA support; 2PC coordinator failure permanently blocks all participants. The solution adds more failure modes than it removes.
  • Transactional Outbox is the default fix: Co-write business data and a pending event in one DB transaction. A relay delivers to Kafka separately. At-least-once — consumers must be idempotent.
  • CDC + Debezium upgrades the latency: WAL-reading replaces polling, cutting event latency from ~500ms to ~50ms at the cost of a Debezium cluster to operate.
  • Cache invalidation: always DELETE, never SET: A DELETE on write is idempotent and safe under failures. A stale SET propagates incorrect data to every caller until TTL expiry.

📝 Practice Quiz

Q1. An order service saves a row to PostgreSQL and then publishes to Kafka. The Kafka publish times out after the DB commit. What does the fulfillment consumer observe?

  • A) Nothing — the DB rolled back because the Kafka call failed
  • B) The OrderPlaced event arrives — Kafka internally buffers and retries
  • C) No event arrives — the order exists in the DB but fulfillment never receives it
  • D) A duplicate event — the @Transactional annotation triggers a compensating retry

Correct Answer: C — The PostgreSQL commit completes before the Kafka call is attempted. Once committed, the row is durable and permanent. The Kafka failure happens independently with no automatic compensation. The result is a ghost order: visible in the DB, invisible to every consumer.

Q2. Why can XA / Two-Phase Commit not solve the dual write problem when Kafka is involved?

  • A) XA transactions are only supported on NoSQL databases
  • B) The Kafka wire protocol has no XA resource manager — it cannot participate in a JTA transaction
  • C) XA requires all participants to share the same connection pool
  • D) 2PC only works when database schemas are identical across participants

Correct Answer: B — The Kafka protocol has no concept of XA prepare/commit. Kafka's own transactions API guarantees atomicity only across Kafka partitions, not across Kafka and an external RDBMS. There is no way to enlist a Kafka producer in a JTA transaction using the standard client.

Q3. The Outbox Relay successfully delivers an event to Kafka. Before it marks published = true, the relay pod crashes and restarts. What happens next, and what consumer behaviour does this require?

  • A) The event is lost — the relay must regenerate it by replaying the orders table
  • B) The relay re-reads the row (published = false), resends the event; the consumer receives a duplicate and must deduplicate on event_id
  • C) Kafka automatically detects the duplicate at the broker level and drops the second message
  • D) The relay detects the crash via a heartbeat and skips the row on restart

Correct Answer: B — At-least-once delivery is a deliberate design choice of the Outbox pattern. On restart, the relay re-reads all published = false rows and retries. The consumer must deduplicate using the event_id — for example via a processed_events table or a Redis set — before acting.

Q4. A team's order service publishes events with a 1.2s average latency (polling outbox, 1s interval). A new SLA requires payment confirmation events within 150ms of the DB commit. What is the architecturally correct fix?

  • A) Reduce the poll interval from 1000ms to 100ms
  • B) Add three more relay instances to parallelise polling
  • C) Replace the polling relay with a Debezium CDC connector reading from the PostgreSQL WAL
  • D) Move the Kafka send inside the @Transactional method to eliminate the relay delay

Correct Answer: C — Reducing the poll interval adds query load on the primary DB without guaranteeing sub-150ms end-to-end latency. Option D re-introduces the dual write problem. Debezium reads directly from the WAL and delivers to Kafka within 50–100ms of commit with no polling overhead.

Q5. Open-ended challenge: Your checkout service uses the Transactional Outbox for OrderPlaced events. Product wants two new behaviours: (a) a loyalty points event that can take up to 5 seconds, and (b) a fraud-screening event that must arrive within 200ms. Both originate from the same placeOrder() method. How would you design delivery for each? Should they share an outbox table and relay, or be separated? What are the trade-offs of co-locating high-latency and low-latency events in a single relay?

There is no single correct answer. Strong answers will address: separating the relay into two instances (slow-poll for loyalty, CDC for fraud), head-of-line blocking risk in a shared relay, consumer isolation so fraud latency is not affected by loyalty backlog, and alerting on published = false row age exceeding 300ms for the fraud path.

