Clock Skew and Causality Violations: Why Distributed Clocks Lie
Why wall clocks diverge, how NTP drift breaks Last-Write-Wins, and how vector clocks and HLC restore causal ordering
Abstract Algorithms · Intermediate
For developers with some experience. Builds on fundamentals.
Estimated read time: 18 min
AI-assisted content. This post may have been written or enhanced with AI tools. Please verify critical information independently.
TLDR: Physical clocks on distributed machines cannot be perfectly synchronized. NTP keeps them within tens to hundreds of milliseconds in normal conditions, but under load, across datacenters, or after a VM pause, the drift can reach seconds. When systems use wall-clock timestamps to resolve write conflicts (Last-Write-Wins), that drift causes the correct write to lose to a stale one, silently. Causality violations are a related but distinct problem: even with perfect clocks, asynchronous message delivery means a reply can arrive before the question. Lamport timestamps, vector clocks, and Hybrid Logical Clocks are the tools that replace unreliable wall-clock ordering with provably correct causal ordering.
In 2017, a series of Cassandra bug reports described a mysterious data loss pattern: rows that had been written were later returned by reads as if they had never existed, or worse, old values were returned in place of newer ones. The investigation traced the root cause to client-supplied timestamps that were offset by several seconds due to NTP drift on application servers. When Cassandra applied Last-Write-Wins conflict resolution, it discarded the correct write because the client that made it had a clock that lagged behind another client's clock by two seconds. The write happened after the other in physical time. It lost because its timestamp was earlier.
No network packet was dropped. No node was down. The cluster was operating correctly by its own logic. The problem was that the logic was trusting clocks that were silently wrong.
Why Two Machines Can Never Perfectly Agree on the Time
A physical clock is a counter that increments based on the oscillation frequency of a quartz crystal (in most servers) or an atomic resonator (in GPS receivers and atomic clocks). Quartz crystals drift, typically by 1 to 100 parts per million, meaning a server's clock can gain or lose between roughly 86 milliseconds and 8.6 seconds per day even without any external disturbance.
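The arithmetic behind those bounds is worth making explicit. A quick sketch (plain Python, using the numbers from the paragraph above):

```python
# Worst-case daily drift for a quartz crystal at a given frequency error,
# expressed in parts per million (ppm).
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def daily_drift_seconds(ppm: float) -> float:
    """Seconds gained or lost per day at a constant `ppm` frequency error."""
    return ppm * 1e-6 * SECONDS_PER_DAY

print(daily_drift_seconds(1))    # ~0.0864 s, i.e. ~86 ms per day
print(daily_drift_seconds(100))  # ~8.64 s per day
```

A 10 ppm crystal, typical for commodity hardware, lands in the middle of that range at roughly 0.86 seconds per day without correction.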
NTP (Network Time Protocol) corrects this by synchronizing server clocks against a hierarchy of reference clocks. Under normal conditions on a well-connected server, NTP keeps clock error below 1–10 milliseconds. But the synchronization is imperfect by design: NTP cannot fully compensate for the variable network delay on the sync messages themselves, and it applies corrections gradually to avoid sudden jumps that could confuse applications.
The gap between theory and production is significant:
| Condition | Typical Clock Error |
| --- | --- |
| Single datacenter, well-configured NTP | 1–10 ms |
| Cross-datacenter, same cloud region | 10–100 ms |
| Cross-region or cross-continent | 100 ms – 1 s |
| VM with paused clock (live migration, GC, hypervisor contention) | 1 s – 30 s |
| Network partition (NTP sync blocked) | Grows unboundedly at crystal drift rate |
The analogy: two wristwatches synchronized to the same radio signal will drift apart between broadcasts. If the radio signal is unavailable for 10 minutes, the watches diverge by however much each has drifted in those 10 minutes. The longer the gap between synchronization events, the larger the disagreement. Distributed server clocks behave identically: NTP is the radio signal, and the gap between syncs is where drift accumulates.
The consequence: any distributed system that uses wall-clock timestamps to determine the order of events will sometimes get the order wrong. The systems that are most exposed are those that use Last-Write-Wins (LWW) conflict resolution, because LWW explicitly relies on comparing timestamps to decide which write should survive.
The Basics: Clock Synchronization and Its Limits
Clock skew is the difference between the wall-clock readings of two different machines at the same instant. Even machines running NTP on the same local network diverge by 1–10 milliseconds between synchronization events. Across datacenters, or after a VM live-migration pause, that drift can reach seconds.
Why it matters for distributed systems:
- Any operation that uses a timestamp to determine event order will get that order wrong when clocks disagree.
- Last-Write-Wins (LWW) conflict resolution (the default in Cassandra, and the basis for Dynamo-style stores) picks the write with the highest timestamp. If one client's clock is 2 seconds behind, its writes lose to writes from clients with accurate clocks, even when the behind-clock write happened later in physical time.
- Causality is a separate but related problem: even with perfect clocks, asynchronous message delivery means a reply can arrive before its triggering message. Wall clocks cannot track this.
The fix for LWW corruption is NTP hardening plus server-side timestamps. The fix for causality tracking is a logical clock (Lamport timestamps, vector clocks, or HLC) that is independent of physical time.
How Clock Skew Breaks Last-Write-Wins Conflict Resolution
Last-Write-Wins is the simplest conflict resolution strategy: when two replicas hold different values for the same key, the value with the higher timestamp wins. The losing value is discarded. In a system with perfectly synchronized clocks, this is a reasonable heuristic. In a system with clock skew, it silently corrupts data.
The scenario:
- Node A's clock is 2 seconds behind wall time (its NTP sync was delayed).
- Node B's clock is accurate.
- At wall time T=10:00:00.000, a client writes `user.status = "inactive"` via Node B. The write carries timestamp `10:00:00.000` (B's accurate clock).
- At wall time T=10:00:00.500 (half a second later), a second client writes `user.status = "active"` via Node A. The write carries timestamp `09:59:58.500` (A's clock, 2 seconds behind).
- When the two writes are replicated and resolved via LWW, B's write wins because `10:00:00.000 > 09:59:58.500`.
- The `"inactive"` value survives, even though the `"active"` write happened 500 ms later in physical time.

The `"active"` write, the most recent in physical time, is silently discarded. The client that wrote `"active"` received a success acknowledgment. The value on disk is `"inactive"`. No error was returned. No alarm fired.
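A toy resolver makes the failure mode concrete. This is an illustrative sketch, not Cassandra's implementation; the value names and timestamps are hypothetical:

```python
# Minimal Last-Write-Wins resolver: the value with the highest timestamp
# survives. The timestamps are whatever the writing client's clock said,
# not true wall-clock time.

def lww_resolve(writes):
    """Given (value, client_timestamp) pairs, return the surviving value."""
    return max(writes, key=lambda w: w[1])[0]

# The second write happened 500 ms later in physical time, but its node's
# clock runs 2 s slow, so it carries the lower timestamp and loses.
writes = [
    ("stale-value", 100.000),  # written first, accurate clock
    ("fresh-value",  98.500),  # written 500 ms later, clock 2 s behind
]
print(lww_resolve(writes))  # stale-value
```

The resolver is doing exactly what it was told; the corruption comes from feeding it timestamps that do not reflect physical ordering.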
Cassandra uses LWW with client-supplied timestamps as its default conflict resolution mechanism. If the clients are application servers with drifted clocks, data corruption is a predictable consequence โ not a rare edge case.
Deep Dive: From Lamport Clocks to Vector Clocks
Internals: Why Causality Breaks Without Synchronized Clocks
Causality is the relationship between events where one event causes or enables another. If user A posts a message and user B replies to it, A's post causally precedes B's reply. A system that delivers B's reply before A's post to some observer has violated causality: the observer sees an answer to a question that hasn't been asked yet.
This violation can happen even with perfectly synchronized clocks, purely from asynchronous network delivery. Message M1 (A's post) and message M2 (B's reply) may travel different network paths; M2 may arrive at observer C before M1. Wall-clock timestamps do not help here: both messages may carry identical or nearly identical timestamps.
The fix requires a logical clock: a mechanism for tracking which events happened before which other events that is independent of physical time.
Lamport Timestamps
Leslie Lamport's 1978 paper introduced the simplest logical clock. Each node maintains a single integer counter. The rules are:
- On a local event: increment the counter by 1.
- On sending a message: increment the counter by 1 and attach the counter value to the message.
- On receiving a message with timestamp t: set the counter to `max(local_counter, t) + 1`.
This guarantees the happens-before property: if event A happened before event B, then A's Lamport timestamp is strictly less than B's. But the converse is not true: if A's timestamp is less than B's, we cannot conclude that A happened before B; they may be concurrent.
Lamport timestamps are useful for ordering events in a log and for ensuring that a response cannot carry a timestamp lower than the request that triggered it. They are insufficient for detecting concurrent events; for that, vector clocks are required.
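The three rules fit in a few lines. A minimal sketch (plain Python, class and method names are my own):

```python
class LamportClock:
    """Single-counter logical clock (Lamport, 1978). A minimal sketch."""

    def __init__(self):
        self.counter = 0

    def local_event(self) -> int:
        self.counter += 1
        return self.counter

    def send(self) -> int:
        # Increment, then attach the returned value to the outgoing message.
        self.counter += 1
        return self.counter

    def receive(self, msg_timestamp: int) -> int:
        # Jump past everything the sender has seen, then count this event.
        self.counter = max(self.counter, msg_timestamp) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
t_send = a.send()           # a's counter becomes 1
t_recv = b.receive(t_send)  # max(0, 1) + 1 = 2
print(t_send < t_recv)      # True: a reply never carries a lower timestamp
```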
Vector Clocks: Tracking Causal History Precisely
A vector clock is a list of counters, one per node in the system. Node i's vector clock is V = [V[1], V[2], ..., V[N]] where V[j] represents the number of events from node j that node i has observed.
The rules:
- On a local event at node i: increment `V[i]` by 1.
- On sending a message from node i: increment `V[i]` by 1 and attach the full vector `V` to the message.
- On receiving a message at node i with vector `W`: set `V[j] = max(V[j], W[j])` for all j, then increment `V[i]` by 1.
Comparing two vector clocks V1 and V2:
- V1 happens-before V2 if every component of V1 is ≤ the corresponding component of V2, and at least one component is strictly less.
- V2 happens-before V1 if the reverse holds.
- V1 and V2 are concurrent if neither happens-before the other: they evolved independently, neither knowing about the other's events.
The concurrent case is the key: it is the precise condition under which a conflict exists and application-level merge logic is needed.
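The three comparison outcomes follow directly from the definitions. A minimal sketch (plain Python):

```python
from typing import List

def happens_before(v1: List[int], v2: List[int]) -> bool:
    """True iff v1 -> v2: every component <=, at least one strictly <."""
    return (all(x <= y for x, y in zip(v1, v2))
            and any(x < y for x, y in zip(v1, v2)))

def concurrent(v1: List[int], v2: List[int]) -> bool:
    """Neither clock happens-before the other: a genuine conflict."""
    return not happens_before(v1, v2) and not happens_before(v2, v1)

# A's post [1,0,0] causally precedes B's reply [1,1,0]:
print(happens_before([1, 0, 0], [1, 1, 0]))  # True
# Two clocks that evolved without seeing each other's events:
print(concurrent([2, 0, 0], [0, 0, 1]))      # True
```

The `concurrent` case is what LWW cannot express: instead of surfacing the conflict for a merge, it silently picks one side.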
Performance Comparison
Vector clocks are precise but expensive:
| Clock Mechanism | Space per Event | Causal Detection | Concurrent Event Detection | Use Cases |
| --- | --- | --- | --- | --- |
| Wall clock | 8 bytes (int64 timestamp) | No (physical time only) | No | Simple logging, rough ordering |
| Lamport timestamp | 8 bytes (int64 counter) | Partial (happens-before only) | No | Log ordering, distributed debugging |
| Vector clock | 8 × N bytes (N = node count) | Full | Yes (explicit concurrency detection) | Distributed version control, Riak, Dynamo |
| Hybrid Logical Clock (HLC) | 16 bytes (wall + logical counter) | Full | Partial | CockroachDB, YugabyteDB, production databases |
For systems with N < 10 nodes, vector clocks are practical. For systems with N > 100 nodes (large Cassandra clusters, global microservice meshes), the O(N) space per message becomes prohibitive. Hybrid Logical Clocks provide a practical middle ground.
Visualizing Causality Violations and How Vector Clocks Detect Them
The sequence diagram below shows the classic causality violation: B's reply to A's post arrives at C before A's post does. The vector clock attached to each message lets C detect that it is missing a causal predecessor before displaying any content.
```mermaid
sequenceDiagram
    participant A as Node A (poster)
    participant B as Node B (replier)
    participant C as Node C (observer)
    A->>B: Post "Hello?" with vector [A:1, B:0, C:0]
    B->>B: Reply "Yes, hi!" with vector [A:1, B:1, C:0]
    B->>C: Deliver reply first (network path is faster)
    Note over C: C sees [A:1, B:1, C:0] but has not seen A:1
    C->>C: Detect missing predecessor, buffer the reply
    A->>C: Deliver original post (delayed path)
    Note over C: C now has A:1, causal predecessor satisfied
    C->>C: Display post first, then reply in causal order
```
The flowchart below shows how a distributed system should process incoming events using vector clocks to enforce causal delivery:
```mermaid
flowchart TD
    A["Message arrives with vector clock W"] --> B{"Is my local V causally ready for W?"}
    B -->|"Yes: all of W's predecessors already seen"| C["Deliver message to application"]
    B -->|"No: missing causal predecessor"| D["Buffer message in causal queue"]
    D --> E["Wait for missing predecessor to arrive"]
    E --> F["Missing predecessor arrives"]
    F --> G["Deliver buffered messages in causal order"]
    C --> H["Update local vector clock: V[j] = max(V[j], W[j]), then V[i]++"]
    G --> H
```
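The buffer-and-retry logic above can be sketched in a few lines. This is an illustrative model, assuming each vector entry counts messages delivered per sender; the node labels A, B, C match the sequence diagram:

```python
A, B, C = 0, 1, 2  # node indices into the vector

def causally_ready(V, W, sender):
    """W is deliverable iff it is the next message from `sender` and every
    message it causally depends on from other nodes has been delivered."""
    return W[sender] == V[sender] + 1 and all(
        w <= v for k, (v, w) in enumerate(zip(V, W)) if k != sender)

def receive(V, W, sender, buffered, log):
    if not causally_ready(V, W, sender):
        buffered.append((W, sender))  # missing a causal predecessor: buffer
        return
    V[sender] += 1
    log.append(W)
    # A delivery may unblock previously buffered messages; retry them.
    for pending in list(buffered):
        if causally_ready(V, pending[0], pending[1]):
            buffered.remove(pending)
            receive(V, pending[0], pending[1], buffered, log)

V, buffered, log = [0, 0, 0], [], []
receive(V, [1, 1, 0], B, buffered, log)  # B's reply arrives first: buffered
receive(V, [1, 0, 0], A, buffered, log)  # A's post arrives: both delivered
print(log)  # [[1, 0, 0], [1, 1, 0]] -- causal order restored
```

The reply sat in the buffer until its missing predecessor (A's post) arrived, exactly as in the diagram.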
Real-World Cases: Cassandra, Spanner, and the Delete-Before-Write Anomaly
The Cassandra Delete-Before-Write Anomaly
Cassandra's LWW conflict resolution uses the timestamp embedded in each write. A particularly dangerous manifestation occurs with the delete-before-write pattern:
- Client writes `user.email = "a@example.com"` with timestamp T=100.
- Client deletes `user.email` (tombstone) with timestamp T=101.
- Client re-writes `user.email = "b@example.com"` with timestamp T=102.
In a system with clock skew, these three operations may arrive at different replicas with different apparent orderings. If the re-write (step 3) is sent from an application server whose clock is 5 seconds ahead of the server that sent the delete (step 2), the re-write might carry timestamp T=107 while the delete carries T=101. LWW says T=107 wins: the re-write survives, which is correct.
But if the clock relationship is reversed, with the delete sent from the fast-clock server (timestamp T=106) and the re-write from the slow-clock server (timestamp T=97), then LWW discards the re-write and the tombstone wins. The email address appears to have been successfully re-created, but on any replica that applies the tombstone after the re-write, the row silently disappears. The application sees an inconsistent view depending on which replica it reads from.
This is not a bug in Cassandra. It is the documented behavior of LWW with client-supplied timestamps in a system where clocks are not perfectly synchronized.
Google Spanner and TrueTime
Google Spanner takes the opposite approach from logical clocks: it does not hide clock uncertainty, it exposes and bounds it. Spanner uses GPS receivers and atomic clocks at each datacenter to bound clock uncertainty to ±7 milliseconds globally.
Rather than providing a single timestamp T, Spanner's TrueTime API returns an interval [earliest, latest] that is guaranteed to contain the true current time. The interval is narrow because of the GPS/atomic clock infrastructure, typically 1–7 ms wide.
To provide external consistency (strict serializability), Spanner implements commit-wait: after a transaction's commit timestamp T_commit is chosen, Spanner waits until TrueTime.now().earliest > T_commit before returning success to the client. This guarantees that no future transaction on any node can legitimately claim an earlier timestamp, because all clocks are bounded to within 7 ms of true time.
The cost: each Spanner write incurs a commit-wait delay of 0–14 ms, bounded by the TrueTime uncertainty interval. This is the price of globally consistent external ordering, a guarantee that is very hard to provide without specialized clock hardware.
CockroachDB and Hybrid Logical Clocks
CockroachDB achieves similar correctness guarantees to Spanner without GPS infrastructure by using Hybrid Logical Clocks (HLC). An HLC timestamp is a pair (wall_component, logical_component). The wall component tracks physical time; the logical component breaks ties.
HLC's critical property: an HLC timestamp is always at least as large as the latest HLC timestamp the node has observed. If a node receives a message with HLC (T=100, L=3) and its own HLC is (T=98, L=0), it advances its clock to (T=100, L=4), combining the physical time from the received message with a logical counter that ensures uniqueness.
This means HLC timestamps are monotonically increasing even in the presence of significant clock drift: any event that causally follows another always carries a strictly higher HLC, regardless of individual node clock accuracy. LWW using HLC timestamps is therefore safe for causally related writes; truly concurrent writes still require a tie-breaking or merge policy.
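The send and receive rules can be written down directly. This is a sketch of the published HLC algorithm, not CockroachDB's actual implementation; the injectable `physical` clock exists only to make the example deterministic:

```python
import time

class HLC:
    """Hybrid Logical Clock: a (wall, logical) pair, monotonic under skew."""

    def __init__(self, physical=time.time):
        self.physical = physical         # injectable physical clock
        self.wall, self.logical = 0, 0

    def now(self):
        """Timestamp for a local or send event."""
        pt = self.physical()
        if pt > self.wall:
            self.wall, self.logical = pt, 0  # physical clock moved ahead
        else:
            self.logical += 1                # clock stalled: tick logically
        return (self.wall, self.logical)

    def update(self, msg_wall, msg_logical):
        """Merge a received timestamp so the result exceeds both clocks."""
        pt = self.physical()
        if pt > self.wall and pt > msg_wall:
            self.wall, self.logical = pt, 0
        elif msg_wall > self.wall:
            self.wall, self.logical = msg_wall, msg_logical + 1
        elif msg_wall == self.wall:
            self.logical = max(self.logical, msg_logical) + 1
        else:
            self.logical += 1
        return (self.wall, self.logical)

# Reproduce the example above: a node whose physical clock reads T=98
# receives a message stamped (T=100, L=3).
node = HLC(physical=lambda: 98)
node.now()                  # local clock is now (98, 0)
print(node.update(100, 3))  # (100, 4)
```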
Trade-offs Across Clock Mechanisms
| Mechanism | Accuracy | Space Overhead | Operational Dependency | Best For |
| --- | --- | --- | --- | --- |
| Wall clock (NTP) | Low (100 ms to seconds of error) | Minimal (8 bytes) | NTP infrastructure | Rough logging, non-critical ordering |
| Lamport timestamp | Correct for happens-before, no concurrency detection | Minimal (8 bytes) | None | Event log ordering, debug tracing |
| Vector clock | Precise causal ordering plus concurrency detection | O(N) per event | None | Small clusters, distributed version control |
| Hybrid Logical Clock (HLC) | Correct causal ordering, monotonic | 16 bytes | NTP (for the wall component) | Production databases, global systems |
| TrueTime (GPS + atomic) | Bounded uncertainty (±7 ms) | 16 bytes | GPS receivers, atomic clocks | Google-scale global linearizability |
Choosing the Right Clock Mechanism
| Situation | Recommendation |
| --- | --- |
| Use wall clocks when | Ordering precision is not critical; events are for logging or analytics only; you control client clock discipline |
| Use Lamport timestamps when | You need a simple happened-before relationship for event ordering; concurrent events are acceptable without detection |
| Use vector clocks when | You need to detect concurrent events (conflict detection for distributed version control, Riak-style sibling values); your cluster has fewer than 20–30 nodes |
| Use HLC when | You need production-safe conflict resolution across nodes with imperfect NTP; you cannot afford GPS infrastructure; CockroachDB or YugabyteDB is your database |
| Use TrueTime when | You are operating at Google scale; globally distributed linearizable transactions are required; GPS receiver infrastructure is acceptable |
| Avoid client-supplied LWW timestamps when | Any node in your system can have clock drift; clients supply their own timestamps for writes (use server-assigned HLC instead) |
The Delete-Before-Write Anomaly: A Detailed Walkthrough
The delete-before-write anomaly is the most practically dangerous manifestation of clock skew in Cassandra deployments. Understanding exactly why it occurs is the first step to preventing it.
Setup: A Cassandra cluster with replication factor 3. Client application has two server instances, App-1 and App-2. App-1's clock is 3 seconds ahead of App-2's clock. Both are producing writes with their local wall-clock timestamps.
Timeline:
- App-2 (clock: T=1000) writes `row[key="user:42"].email = "old@example.com"`: timestamp 1000.
- Business logic decides the row is stale and should be deleted. App-1 (clock: T=1003) issues `DELETE row[key="user:42"]`: a tombstone with timestamp 1003.
- App-2 (clock: T=1001) re-creates the row with `row[key="user:42"].email = "new@example.com"`: timestamp 1001.
From LWW's perspective: the tombstone (1003) beats the new write (1001). The tombstone wins. The row disappears.
From physical time: the actual order was write, then delete, then re-create. Taking App-1's clock as accurate (App-2 runs 3 seconds behind), the re-create happened at actual wall time T=1004 even though App-2's clock stamped it T=1001. The re-create was genuinely the last operation in physical time. LWW with skewed clocks discarded it.
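The timeline reduces to a three-way timestamp comparison. A toy resolver (plain Python, illustrative only) makes the outcome explicit:

```python
# Each operation: (kind, value, client_timestamp). LWW keeps the operation
# with the highest timestamp, whether it is a write or a tombstone.
ops = [
    ("write",     "old@example.com", 1000),  # App-2
    ("tombstone", None,              1003),  # App-1, clock 3 s ahead
    ("write",     "new@example.com", 1001),  # App-2; last in physical time
]
winner = max(ops, key=lambda op: op[2])
print(winner[0])  # tombstone -- the re-created row silently disappears
```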
The fix: let the coordinator assign write timestamps server-side (omit client-supplied USING TIMESTAMP values) instead of relying on application-server clocks. Monitor clock offset metrics on every application server, keep read repair enabled so replicas converge, and bound NTP drift to well under 1 second for any cluster that relies on LWW semantics.
Cassandra and CockroachDB: Clock Safety Configuration
Cassandra: Mitigating Clock Skew with Quorum Consistency
The safest mitigation for clock skew in Cassandra is to use quorum consistency for both reads and writes. This ensures that a majority of replicas participate in every operation, reducing (but not eliminating) the risk that a write with a stale timestamp silently loses to a concurrent write.
```yaml
# cassandra.yaml -- consistency configuration.
# Use QUORUM or LOCAL_QUORUM for reads and writes on critical tables
# to mitigate the risk of stale-timestamp LWW conflicts.

# Default read consistency is set at the application level, not in
# cassandra.yaml: CONSISTENCY QUORUM (via CQL or driver configuration).

# Clock skew monitoring: alert if any node's NTP offset exceeds ~500 ms
# (monitor via your NTP daemon's metrics alongside nodetool tpstats).

# Tombstone window: ensure gc_grace_seconds exceeds the maximum possible
# replication lag. If tombstones are compacted before they reach all
# replicas, deleted data reappears.
gc_grace_seconds: 864000  # 10 days (default); do not reduce without understanding tombstone mechanics
```
CockroachDB: HLC Configuration
CockroachDB uses HLC natively and requires no special configuration for causal correctness. The key operational parameter is the maximum clock offset tolerance, set per node at startup:

```shell
# Maximum acceptable clock offset between nodes (default: 500ms).
# The value must be identical on every node in the cluster; reduce it
# (for example to 250ms) only with well-disciplined NTP.
cockroach start --max-offset=500ms

# CockroachDB monitors clock offset against other nodes continuously and
# self-terminates any node whose offset approaches the configured maximum,
# preventing it from serving operations that could violate consistency.
```
Lessons from Clock-Driven Production Bugs
1. Clocks are observations, not facts. A timestamp from a distributed node is that node's belief about the current time. It is not a fact. Any system that treats timestamps as authoritative ordering facts will eventually encounter a case where two nodes disagree and the wrong value wins.
2. NTP discipline is necessary but not sufficient for conflict resolution. NTP keeps clocks within tens of milliseconds under normal conditions. Tens of milliseconds is more than enough for LWW to produce incorrect results in a high-throughput system โ hundreds of writes per second means hundreds of potential write-ordering inversions per second.
3. Client-supplied timestamps are a trap. Cassandra, MongoDB, and several other systems allow clients to supply their own write timestamps. This is convenient, but it means that any clock skew on any client propagates directly into conflict resolution. Use server-assigned timestamps wherever possible.
4. The delete-before-write anomaly is predictable, not rare. Any deployment of Cassandra with client-supplied timestamps and NTP drift of more than a few hundred milliseconds will eventually exhibit deleted-data-reappearing or overwritten-new-values. This is not a race condition โ it is a deterministic consequence of the conflict resolution model.
5. HLC is the production-safe default. CockroachDB's adoption of HLC shows that you do not need GPS infrastructure to achieve safe conflict resolution in a globally distributed system. An HLC timestamp fits in 16 bytes per event (just 8 more than a plain wall-clock timestamp) and requires no external infrastructure beyond the NTP that servers already run.
Key Takeaways
- Physical clocks on distributed machines drift apart at rates of 1–100 ppm; NTP corrects this imprecisely, leaving residual errors of 10 ms to seconds depending on conditions.
- Last-Write-Wins conflict resolution using wall-clock timestamps silently discards the correct write when any node's clock has drifted: the most recent write in physical time loses to an earlier write with a higher timestamp.
- Causality violations are distinct from clock skew: they occur when asynchronous message delivery causes a reply to arrive before the message that triggered it, even with perfect clocks.
- Lamport timestamps solve the happens-before ordering problem but cannot distinguish concurrent events; vector clocks solve both but grow O(N) in size with the number of nodes.
- Hybrid Logical Clocks (HLC) combine the physical time approximation of wall clocks with the monotonic correctness of logical clocks, fitting in 16 bytes and requiring no GPS infrastructure.
- Google Spanner's TrueTime bounds clock uncertainty to ±7 ms using GPS and atomic clocks, then uses commit-wait to guarantee that no transaction can claim an earlier timestamp than one already committed anywhere in the system.
- Any Cassandra deployment using client-supplied timestamps must either bound NTP drift below the minimum meaningful conflict window or switch to quorum consistency and server-assigned timestamps.
The meta-lesson: a distributed system that trusts physical clocks for correctness is building on a foundation that physics actively tries to undermine. Use logical clocks for ordering. Use wall clocks only for approximate human-readable timestamps and performance monitoring.
Related Posts
- Split Brain Explained: When Two Nodes Both Think They Are Leader – fencing tokens and epoch numbers for leader election safety
- Stale Reads and Cascading Failures in Distributed Systems – replication lag, consistency models, and circuit breaker patterns
- Data Anomalies in Distributed Systems: An Overview – the full distributed anomaly taxonomy, including split brain, stale reads, and cascading failures
- The Consistency Continuum: Patterns in Distributed Systems – read-your-writes, monotonic reads, and the full consistency model spectrum
- Dirty Read Database Anomaly – the SQL transaction counterpart: reading uncommitted data from a concurrent transaction
- Write Skew Database Anomaly – a concurrent SQL transaction anomaly that arises under snapshot isolation
Written by
Abstract Algorithms
@abstractalgorithms