Abstract Algorithms

Data Anomalies in Distributed Systems: Split Brain, Clock Skew, Stale Reads, and More

A taxonomy of the anomalies that distributed systems produce beyond SQL transaction isolation β€” and the engineering patterns that prevent them

Abstract AlgorithmsAbstract Algorithms//System Design Interview Prep

On this page

Reader feedback

Was this article useful?

Rate it if it helped, then continue with the next deep dive when you are ready.

Executive TLDR

  • TLDR: Distributed systems produce anomalies not because the code is buggy β€” but because physics makes perfect consistency impossible across network boundaries.
  • Split brain, stale reads, clock skew, causality violations, and cascading failures are the canonical failure modes.
  • Each has a distinct root cause and a targeted set of prevention patterns.
  • This post is the index: it maps every anomaly to its cause and cure, and links to the focused deep dives for the three main families.

Core mental model

Read this as a system of state, constraints, and failure boundaries.

A taxonomy of the anomalies that distributed systems produce beyond SQL transaction isolation β€” and the engineering patterns that prevent them

Key systems visualization

The article’s conceptual path

01

πŸ“– Why Distributing Data Breaks Rules Physics Never Mentioned

->

02

βš™οΈ Split Brain: When Two Nodes Both Believe They Are the Leader

->

03

🧠 Stale Reads and Clock Skew: The Two Anomalies That Break Your Sense of Time

->

04

πŸ—οΈ Causality Violations and Vector Clocks: When the Reply Arrives Before the Post

->

05

πŸ“Š Network Partition Anomalies: What CAP Looks Like in Production

> TLDR: Distributed systems produce anomalies not because the code is buggy β€” but because physics makes perfect consistency impossible across network boundaries. Split brain, stale reads, clock skew, causality violations, and cascading failures are the canonical failure modes. Each has a distinct root cause and a targeted set of prevention patterns. This post is the index: it maps every anomaly to its cause and cure, and links to the focused deep-dives for the three main families.

A single-node relational database is a marvel of predictability. ACID properties hold because every transaction passes through one scheduler, one memory space, one disk. Distribute that same database across five nodes in two datacenters and the rules change entirely. You now have five independent clocks that drift apart. You have a network that can drop, delay, reorder, and duplicate messages. You have five failure domains that can fail independently β€” or fail in correlated ways that your monitoring cannot distinguish from healthy operation.

The anomalies covered in this post do not arise from bugs. They arise from the physics of distribution: the impossibility of instant communication, the impossibility of synchronized clocks, and the impossibility of knowing whether a silent node is dead or merely slow.

This post is the index. It provides the taxonomy, the anomaly map, the prevention cheatsheet, and links to three focused deep-dives β€” one for each major anomaly family covered here.


πŸ“– Distributed Anomalies vs. SQL Transaction Anomalies

The SQL transaction anomaly posts in this series (dirty read, phantom read, write skew, lost update) cover anomalies that arise from concurrent transactions within a single database. They are solved by choosing the right isolation level. The anomalies in this post are different in origin, detection, and cure.

Anomaly TypeWhere It ArisesPrimary CausePrevention Domain
Dirty read, phantom read, write skewSingle database, concurrent transactionsInsufficient isolation levelSQL isolation levels (READ COMMITTED, SERIALIZABLE)
Split brainDistributed replica setNetwork partition β†’ two leadersQuorum writes, Raft terms, fencing tokens
Stale readsPrimary–replica replicationAsync replication lagConsistency models (read-your-writes, LSN tokens)
Clock skew / causality violationsAny distributed systemClock drift + LWW conflict resolutionHLC, vector clocks, TrueTime
Cascading failuresMicroservice dependency chainsRetry amplification after node failureCircuit breakers, bulkheads, jitter

Understanding which family an anomaly belongs to immediately narrows the solution space. Visibility anomalies are fixed by routing and consistency tokens. Ordering anomalies are fixed by logical clocks. Availability anomalies are fixed by epoch tracking and fencing. Integrity anomalies are fixed by merge semantics or conflict-free data structures.


πŸ” The Core Principle: Consistency and Availability Cannot Coexist Under Partition

The CAP theorem states that a distributed system can guarantee at most two of three properties: Consistency (every read receives the latest write), Availability (every request receives a response), and Partition tolerance (the system operates despite network failures). Since network partitions are unavoidable in production, every distributed system must choose between consistency and availability when a partition occurs.

Systems that prefer consistency (Raft-based clusters, Zookeeper, Spanner) refuse writes when they cannot form a quorum. Systems that prefer availability (Cassandra, DynamoDB, Riak) accept writes from any node and reconcile conflicts later. Every anomaly in this taxonomy traces directly back to where a system sits on this spectrum and which guarantee it relaxes under failure.


βš™οΈ The Full Anomaly Taxonomy

flowchart TD
    A["Anomaly Detected in Production"] --> B{"Classify by family"}
    B -->|"Visibility"| C["Stale or torn data returned to reader"]
    B -->|"Ordering"| D["Events appear in wrong causal order"]
    B -->|"Availability"| E["Node falsely healthy or falsely dead"]
    B -->|"Integrity"| F["Two nodes hold conflicting values"]
    C --> C1{"Stale read or torn?"}
    C1 -->|"Stale"| C2["Bounded staleness / LSN token / Quorum read"]
    C1 -->|"Torn"| C3["WAL atomic replay / Commit marker gating"]
    D --> D1{"Clock drift or async delivery?"}
    D1 -->|"Clock drift"| D2["HLC / TrueTime / Avoid LWW"]
    D1 -->|"Async delivery"| D3["Vector clocks / Causal middleware"]
    E --> E1{"Split brain or zombie?"}
    E1 -->|"Split brain"| E2["Quorum writes / Raft terms / STONITH"]
    E1 -->|"Zombie leader"| E3["Fencing tokens / Epoch rejection"]
    F --> F1{"Merge or overwrite?"}
    F1 -->|"Overwrite acceptable"| F2["LWW with NTP discipline or HLC"]
    F1 -->|"Must preserve all"| F3["CRDTs / Application merge"]

The flowchart is the triage tool. When a production anomaly surfaces, classify it into one of the four families first. The classification step narrows the solution space from twelve options to two or three before you consult documentation.


πŸ“Š Anomaly Observable Impact in Production

AnomalyObservable ImpactMonitoring SignalFirst Response
Split brainConflicting data; duplicate writes acknowledgedTwo primary alerts; write conflict countersFence stale leader; inspect replication logs
Stale readsReads return superseded values with no errorreplication_lag; read-after-write failuresRoute reads to primary; use quorum reads
Clock skewLWW silently discards newer writesClock offset alerts; ntpstat drift > 100 msSwitch to HLC or server-side timestamps
Cascading failureError rate climbs across cluster in minutesp99 latency spike; connection pool exhaustionOpen circuit breakers; enable load shedding

🧠 Deep Dive: How Network Boundaries Break ACID

ACID properties are straightforward in a single-node database where every operation passes through one scheduler, one memory space, and one disk. Distributing the same data across five nodes turns each property into a distributed protocol with new failure modes.

Atomicity across nodes requires two-phase commit, which blocks indefinitely if the coordinator fails mid-transaction. Consistency requires consensus (Raft or Paxos), adding multiple network round-trips per write. Isolation under concurrent distributed transactions requires distributed locking or optimistic concurrency control. Durability requires writes to persist on multiple nodes before acknowledging, adding replication overhead to the critical path.

Each anomaly in this series is a consequence of one of these distributed ACID protocols either failing under partition conditions or being deliberately traded for lower latency and higher availability.

Internals: How Each Anomaly Class Traces Back to a Specific ACID Failure

Each anomaly family maps to a specific distributed ACID breakdown:

  • Split brain β†’ Atomicity failure: Two coordinators both accept writes because the cluster cannot atomically agree on primary ownership across a network partition.
  • Clock skew β†’ Consistency failure: Last-Write-Wins conflict resolution compares wall-clock timestamps, but clocks on different nodes diverge by 10–500 ms under NTP drift. The causally newer write loses.
  • Stale reads β†’ Durability/Isolation failure: Asynchronous replication writes to the primary and returns success before replicas apply the change. Readers on replicas observe pre-write state during the lag window.
  • Cascading failures β†’ Atomicity/Availability trade-off: Circuit breakers and bulkheads represent deliberate choices to sacrifice atomicity of a distributed transaction in exchange for isolating failure blast radius.

Performance Analysis: The Latency Cost of Each Anomaly Prevention Mechanism

Every prevention mechanism adds latency. Understanding the cost-to-safety trade-off is required for system design decisions:

Prevention MechanismLatency AddedThroughput Impact
Quorum reads (R + W > N)+1–3 network RTT per read~30–50% reduction vs. local read
Synchronous replication (2PC)+1–2 RTT per write commitScales linearly with replica count
Raft consensus write+2 RTT (leader + majority ACK)Leader bottleneck under high write rate
Vector clocksNegligible (metadata size grows)O(n) metadata per node in cluster
Circuit breaker state checkSub-millisecond (in-memory)No throughput impact at rest
Full-jitter retry backoffAdds delay under failureReduces retry storm by 90%+ vs. naive retry

🌍 Which Distributed Systems Are Exposed to Which Anomalies

SystemPrimary Anomaly RiskDefault ConsistencyRecommended Fix
PostgreSQL with async replicasStale readsEventual (async WAL)synchronous_commit = remote_apply or LSN token routing
CassandraClock skew LWW corruptionEventual (multi-leader, tunable)Server-side timestamps; HLC
MongoDB (w=1)Split brain on failoverEventualUpgrade to w=majority
Redis Sentinel (quorum=1)Split brain in 3-node setupEventualSet quorum to 2
Microservice chainsCascading failures via retry stormApplication-levelCircuit breakers; bulkheads; jitter

βš–οΈ Prevention Strategy Cheatsheet

AnomalyRoot CauseDetection SignalPrevention PatternKey Systems
Split brainNetwork partition isolates leader; peers elect new leaderTwo nodes claim leadership for same shardQuorum writes, fencing tokens, STONITH, Raft term numbersMongoDB, etcd, Redis Sentinel
Stale readsAsync replication lag; reads from lagging replicapg_stat_replication.write_lag; Seconds_Behind_MasterRead-your-writes (LSN), bounded staleness, quorum readsPostgreSQL, MySQL, Cassandra
Clock skewNTP imprecision; VM clock pausesClock offset metrics; ntpstat; chronyHybrid Logical Clocks (HLC), TrueTime (Spanner), avoid wall-clock LWWCassandra, Spanner, CockroachDB
Causality violationAsync message delivery via different network pathsOut-of-order events; reply seen before postVector clocks, causal consistency middleware, total order broadcastRiak, DynamoDB Streams, Kafka
Conflicting writesMulti-leader or leaderless concurrent writesVersion conflicts on read; anti-entropy divergenceLWW (lossy), application merge, CRDTsCassandra, DynamoDB, Riak
Zombie leaderGC pause exceeds election timeout; old leader resumesMultiple leaders in monitoring; duplicate writesFencing tokens (monotonic epoch), STONITH, Raft term rejectionZooKeeper, etcd, Redis Sentinel
Cascading failureSingle node failure increases load on survivors β†’ chain reactionLatency and error rate climbing across all replicasCircuit breakers, bulkheads, load shedding, jittered backoffResilience4j, Istio, Hystrix
Thundering herdSynchronized client retries after common failure eventRetry spike in traffic metrics immediately after outageFull-jitter exponential backoff, staggered restartsAll HTTP clients, gRPC
Cache stampedeHot cache key expires; all waiters simultaneously hit DBDB query spike on cache expiryProbabilistic early expiration, request coalescing (dog-pile lock)Redis, Memcached, Varnish

πŸ“š Deep-Dive Posts: The Three Major Families

The rest of this series covers each anomaly family in its own focused post. Each post provides: detailed root-cause analysis, Mermaid diagrams showing the failure sequence and prevention mechanisms, real-world incident case studies, trade-off tables, and a decision guide for choosing prevention strategies.

πŸ”΄ Split Brain β€” Two Leaders, Conflicting Writes, Permanent Data Loss

When a network partition isolates a replica set's leader from its peers, the peers trigger a new election and elect a new leader. Both the old and new leader are now simultaneously accepting writes to the same dataset. When the partition heals, the old leader's writes are rolled back. Clients that received success acknowledgments find their data gone.

Covered in: Split Brain Explained: When Two Nodes Both Think They Are Leader

Topics: Raft term numbers and quorum math, STONITH out-of-band fencing, zombie leader + GC pause scenario, etcd and Redis Sentinel configuration, MongoDB 2012 two-primary bug, Elasticsearch minimum master nodes misconfiguration.

🟑 Clock Skew and Causality Violations β€” When Timestamps Lie

Physical clocks on distributed machines drift apart continuously. NTP corrects the drift imprecisely, leaving residual errors of tens to hundreds of milliseconds. Last-Write-Wins conflict resolution using wall-clock timestamps silently discards the correct write when any clock has drifted. Causality violations β€” a reply arriving before the message that triggered it β€” can occur even with perfect clocks, purely from asynchronous message delivery.

Covered in: Clock Skew and Causality Violations: Why Distributed Clocks Lie

Topics: NTP drift characteristics, Lamport timestamps, vector clocks (causal ordering + concurrent event detection), Hybrid Logical Clocks (HLC), Google Spanner TrueTime commit-wait, the Cassandra delete-before-write anomaly, CockroachDB HLC configuration.

🟠 Stale Reads and Cascading Failures β€” Lag Windows and Collapse Chains

Stale reads occur when an asynchronous replica has not yet applied the primary's latest write and a read is routed to it during that window. The read returns the superseded value β€” with no error and no indication that anything is wrong. Cascading failures are a separate but related pattern: a single node failure redistributes load to survivors, increasing their latency, causing clients to retry, doubling the effective traffic, and triggering further failures in a self-amplifying chain.

Covered in: Stale Reads and Cascading Failures in Distributed Systems

Topics: Read-your-writes, monotonic reads, bounded staleness, quorum reads (R + W > N), WAL-based replication lag mechanics, circuit breaker state machine, bulkhead partitioning, load shedding, thundering herd + full-jitter backoff, cache stampede + probabilistic early expiration.


🧭 Decision Guide: Which Anomaly Prevention Applies to Your Use Case

The right prevention strategy depends on what your system cannot afford to get wrong:

  • Cannot afford stale data (inventory, payments, seat reservations): Use strong consistency with quorum reads, synchronous replication, or read-your-writes tokens. Accept higher latency on reads and writes.
  • Cannot afford split brain (any replicated primary): Use Raft-based consensus with majority quorum. Configure fencing tokens and STONITH. Never use w=1 or quorum=1 in sentinel setups.
  • Cannot afford LWW timestamp corruption (Cassandra, Dynamo-style): Switch to server-side timestamps, HLC, or vector clocks. Never use client-supplied wall-clock timestamps for conflict resolution.
  • Cannot afford cascading failures (microservice chains): Add circuit breakers at each service boundary. Use full-jitter exponential backoff. Implement bulkheads to prevent cross-service pool exhaustion.

πŸ§ͺ Practical Anomaly-First Design Review

Before choosing a database or distributed architecture, walk through each anomaly class explicitly:

  1. Split brain: What is the write quorum size? Can two nodes simultaneously believe they are primary without the cluster detecting it?
  2. Stale reads: What is the maximum acceptable replication lag? Does the application need read-your-writes consistency for any user-facing operation?
  3. Clock skew: Are wall-clock timestamps used for conflict resolution anywhere? What is the NTP drift budget across all application servers?
  4. Cascading failures: Is there a retry strategy defined? Do all downstream dependencies have circuit breakers with appropriate thresholds?

Addressing these four questions before selecting infrastructure eliminates the majority of the operational surprises covered in this series.


πŸ“Œ Key Takeaways

  • Distributed anomalies arise from physics β€” network partitions, clock drift, and asynchronous replication are unavoidable properties of distributed systems, not bugs.
  • Every anomaly belongs to one of four families: visibility (stale or torn data), ordering (causality violations), availability (split brain, zombie leader), or integrity (conflicting writes).
  • Each anomaly has a specific monitoring signal and a targeted prevention pattern. Classifying the anomaly type before searching for a fix narrows the solution space from twelve options to two or three.
  • The CAP theorem makes the trade-off explicit: systems optimized for availability under partition are exposed to visibility and integrity anomalies; systems optimized for consistency are exposed to availability anomalies during partitions.

Quiet AI help

Abstract Algorithms

Written by

Abstract Algorithms

@abstractalgorithms

Related deep dives

Continue reading

Abstract Algorithms Β· Β© 2026 Β· Engineering learning lab