Clock Skew and Causality Violations: Why Distributed Clocks Lie
Why wall clocks diverge, how NTP drift breaks Last-Write-Wins, and how vector clocks and HLC restore causal ordering
Abstract Algorithms · Intermediate
For developers with some experience. Builds on fundamentals.
Estimated read time: 18 min
AI-assisted content. This post may have been written or enhanced with AI tools. Please verify critical information independently.
TLDR: Physical clocks on distributed machines cannot be perfectly synchronized. NTP keeps them within tens to hundreds of milliseconds in normal conditions, but under load, across datacenters, or after a VM pause, the drift can reach seconds. When systems use wall-clock timestamps to resolve write conflicts (Last-Write-Wins), that drift causes the correct write to lose to a stale one, silently. Causality violations are a related but distinct problem: even with perfect clocks, asynchronous message delivery means a reply can arrive before the question. Lamport timestamps, vector clocks, and Hybrid Logical Clocks are the tools that replace unreliable wall-clock ordering with provably correct causal ordering.
In 2017, a series of Cassandra bug reports described a mysterious data loss pattern: rows that had been written were later returned by reads as if they had never existed, or worse, old values were returned in place of newer ones. The investigation traced the root cause to client-supplied timestamps that were offset by several seconds due to NTP drift on application servers. When Cassandra applied Last-Write-Wins conflict resolution, it discarded the correct write because the client that made it had a clock that lagged behind another client's clock by two seconds. The write happened after the other in physical time. It lost because its timestamp was earlier.
No network packet was dropped. No node was down. The cluster was operating correctly by its own logic. The problem was that the logic was trusting clocks that were silently wrong.
Why Two Machines Can Never Perfectly Agree on the Time
A physical clock is a counter that increments based on the oscillation frequency of a quartz crystal (in most servers) or an atomic resonator (in GPS receivers and atomic clocks). Quartz crystals drift, typically by 1 to 100 parts per million, meaning a server's clock can gain or lose between roughly 86 milliseconds and 8.6 seconds per day even without any external disturbance.
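The arithmetic behind those bounds is worth making explicit. A quick sketch (plain Python, using the numbers from the paragraph above):

```python
# Worst-case daily drift for a quartz crystal at a given frequency error,
# expressed in parts per million (ppm).
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

def daily_drift_seconds(ppm: float) -> float:
    """Seconds gained or lost per day at a constant `ppm` frequency error."""
    return ppm * 1e-6 * SECONDS_PER_DAY

print(daily_drift_seconds(1))    # ~0.0864 s, i.e. ~86 ms per day
print(daily_drift_seconds(100))  # ~8.64 s per day
```

A 10 ppm crystal, typical for commodity hardware, lands in the middle of that range at roughly 0.86 seconds per day without correction.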
NTP (Network Time Protocol) corrects this by synchronizing server clocks against a hierarchy of reference clocks. Under normal conditions on a well-connected server, NTP keeps clock error below 1–10 milliseconds. But the synchronization is imperfect by design: NTP cannot fully compensate for the variable network delay on the sync messages themselves, and it applies corrections gradually to avoid sudden jumps that could confuse applications.
The gap between theory and production is significant:
| Condition | Typical Clock Error |
| --- | --- |
| Single datacenter, well-configured NTP | 1–10 ms |
| Cross-datacenter, same cloud region | 10–100 ms |
| Cross-region or cross-continent | 100 ms – 1 s |
| VM with paused clock (live migration, GC, hypervisor contention) | 1 s – 30 s |
| Network partition (NTP sync blocked) | Grows unboundedly at crystal drift rate |
The analogy: two wristwatches synchronized to the same radio signal will drift apart between broadcasts. If the radio signal is unavailable for 10 minutes, the watches diverge by however much each has drifted in those 10 minutes. The longer the gap between synchronization events, the larger the disagreement. Distributed server clocks behave identically: NTP is the radio signal, and the gap between syncs is where drift accumulates.
The consequence: any distributed system that uses wall-clock timestamps to determine the order of events will sometimes get the order wrong. The systems that are most exposed are those that use Last-Write-Wins (LWW) conflict resolution, because LWW explicitly relies on comparing timestamps to decide which write should survive.
The Basics: Clock Synchronization and Its Limits
Clock skew is the difference between the wall-clock readings of two different machines at the same instant. Even machines running NTP on the same local network diverge by 1–10 milliseconds between synchronization events. Across datacenters, or after a VM live-migration pause, that drift can reach seconds.
Why it matters for distributed systems:
- Any operation that uses a timestamp to determine event order will get that order wrong when clocks disagree.
- Last-Write-Wins (LWW) conflict resolution (the default in Cassandra, and the basis for Dynamo-style stores) picks the write with the highest timestamp. If one client's clock is 2 seconds behind, its writes lose to writes from clients with accurate clocks, even when the behind-clock write happened later in physical time.
- Causality is a separate but related problem: even with perfect clocks, asynchronous message delivery means a reply can arrive before its triggering message. Wall clocks cannot track this.
The fix for LWW corruption is NTP hardening plus server-side timestamps. The fix for causality tracking is a logical clock (Lamport timestamps, vector clocks, or HLC) that is independent of physical time.
How Clock Skew Breaks Last-Write-Wins Conflict Resolution
Last-Write-Wins is the simplest conflict resolution strategy: when two replicas hold different values for the same key, the value with the higher timestamp wins. The losing value is discarded. In a system with perfectly synchronized clocks, this is a reasonable heuristic. In a system with clock skew, it silently corrupts data.
The scenario:
- Node A's clock is 2 seconds behind wall time (its NTP sync was delayed).
- Node B's clock is accurate.
- At wall time T=10:00:00.000, a client writes `user.status = "inactive"` via Node B. The write carries timestamp `10:00:00.000` (B's accurate clock).
- At wall time T=10:00:00.500 (half a second later), a second client writes `user.status = "active"` via Node A. The write carries timestamp `09:59:58.500` (A's clock, 2 seconds behind).
- When the two writes are replicated and resolved via LWW, B's write wins because `10:00:00.000 > 09:59:58.500`.
- The `"inactive"` value survives, even though the `"active"` write happened 500 ms later in physical time.

The `"active"` write, the most recent in physical time, is silently discarded. The client that wrote `"active"` received a success acknowledgment. The value on disk is `"inactive"`. No error was returned. No alarm fired.
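A toy resolver makes the failure mode concrete. This is an illustrative sketch, not Cassandra's implementation; the value names and timestamps are hypothetical:

```python
# Minimal Last-Write-Wins resolver: the value with the highest timestamp
# survives. The timestamps are whatever the writing client's clock said,
# not true wall-clock time.

def lww_resolve(writes):
    """Given (value, client_timestamp) pairs, return the surviving value."""
    return max(writes, key=lambda w: w[1])[0]

# The second write happened 500 ms later in physical time, but its node's
# clock runs 2 s slow, so it carries the lower timestamp and loses.
writes = [
    ("stale-value", 100.000),  # written first, accurate clock
    ("fresh-value",  98.500),  # written 500 ms later, clock 2 s behind
]
print(lww_resolve(writes))  # stale-value
```

The resolver is doing exactly what it was told; the corruption comes from feeding it timestamps that do not reflect physical ordering.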
Cassandra uses LWW with client-supplied timestamps as its default conflict resolution mechanism. If the clients are application servers with drifted clocks, data corruption is a predictable consequence โ not a rare edge case.
Deep Dive: From Lamport Clocks to Vector Clocks
Internals: Why Causality Breaks Without Synchronized Clocks
Causality is the relationship between events where one event causes or enables another. If user A posts a message and user B replies to it, A's post causally precedes B's reply. A system that delivers B's reply before A's post to some observer has violated causality: the observer sees an answer to a question that hasn't been asked yet.
This violation can happen even with perfectly synchronized clocks, purely from asynchronous network delivery. Message M1 (A's post) and message M2 (B's reply) may travel different network paths; M2 may arrive at observer C before M1. Wall-clock timestamps do not help here: both messages may carry identical or nearly identical timestamps.
The fix requires a logical clock: a mechanism for tracking which events happened before which other events that is independent of physical time.
Lamport Timestamps
Leslie Lamport's 1978 paper introduced the simplest logical clock. Each node maintains a single integer counter. The rules are:
- On a local event: increment the counter by 1.
- On sending a message: increment the counter by 1 and attach the counter value to the message.
- On receiving a message with timestamp t: set the counter to `max(local_counter, t) + 1`.
This guarantees the happens-before property: if event A happened before event B, then A's Lamport timestamp is strictly less than B's. But the converse is not true: if A's timestamp is less than B's, we cannot conclude that A happened before B; they may be concurrent.
Lamport timestamps are useful for ordering events in a log and for ensuring that a response cannot carry a timestamp lower than the request that triggered it. They are insufficient for detecting concurrent events; for that, vector clocks are required.
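The three rules fit in a few lines. A minimal sketch (plain Python, class and method names are my own):

```python
class LamportClock:
    """Single-counter logical clock (Lamport, 1978). A minimal sketch."""

    def __init__(self):
        self.counter = 0

    def local_event(self) -> int:
        self.counter += 1
        return self.counter

    def send(self) -> int:
        # Increment, then attach the returned value to the outgoing message.
        self.counter += 1
        return self.counter

    def receive(self, msg_timestamp: int) -> int:
        # Jump past everything the sender has seen, then count this event.
        self.counter = max(self.counter, msg_timestamp) + 1
        return self.counter

a, b = LamportClock(), LamportClock()
t_send = a.send()           # a's counter becomes 1
t_recv = b.receive(t_send)  # max(0, 1) + 1 = 2
print(t_send < t_recv)      # True: a reply never carries a lower timestamp
```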
Vector Clocks: Tracking Causal History Precisely
A vector clock is a list of counters, one per node in the system. Node i's vector clock is V = [V[1], V[2], ..., V[N]] where V[j] represents the number of events from node j that node i has observed.
The rules:
- On a local event at node i: increment `V[i]` by 1.
- On sending a message from node i: increment `V[i]` by 1 and attach the full vector `V` to the message.
- On receiving a message at node i with vector `W`: set `V[j] = max(V[j], W[j])` for all j, then increment `V[i]` by 1.
Comparing two vector clocks V1 and V2:
- V1 happens-before V2 if every component of V1 is ≤ the corresponding component of V2, and at least one component is strictly less.
- V2 happens-before V1 if the reverse holds.
- V1 and V2 are concurrent if neither happens-before the other: they evolved independently, neither knowing about the other's events.
The concurrent case is the key: it is the precise condition under which a conflict exists and application-level merge logic is needed.
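The three comparison outcomes follow directly from the definitions. A minimal sketch (plain Python):

```python
from typing import List

def happens_before(v1: List[int], v2: List[int]) -> bool:
    """True iff v1 -> v2: every component <=, at least one strictly <."""
    return (all(x <= y for x, y in zip(v1, v2))
            and any(x < y for x, y in zip(v1, v2)))

def concurrent(v1: List[int], v2: List[int]) -> bool:
    """Neither clock happens-before the other: a genuine conflict."""
    return not happens_before(v1, v2) and not happens_before(v2, v1)

# A's post [1,0,0] causally precedes B's reply [1,1,0]:
print(happens_before([1, 0, 0], [1, 1, 0]))  # True
# Two clocks that evolved without seeing each other's events:
print(concurrent([2, 0, 0], [0, 0, 1]))      # True
```

The `concurrent` case is what LWW cannot express: instead of surfacing the conflict for a merge, it silently picks one side.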
Performance Comparison
Vector clocks are precise but expensive:
| Clock Mechanism | Space per Event | Causal Detection | Concurrent Event Detection | Use Cases |
| --- | --- | --- | --- | --- |
| Wall clock | 8 bytes (int64 timestamp) | No (physical time only) | No | Simple logging, rough ordering |
| Lamport timestamp | 8 bytes (int64 counter) | Partial (happens-before only) | No | Log ordering, distributed debugging |
| Vector clock | 8 × N bytes (N = node count) | Full | Yes (explicit concurrency detection) | Distributed version control, Riak, Dynamo |
| Hybrid Logical Clock (HLC) | 16 bytes (wall + logical counter) | Full | Partial | CockroachDB, YugabyteDB, production databases |
For systems with N < 10 nodes, vector clocks are practical. For systems with N > 100 nodes (large Cassandra clusters, global microservice meshes), the O(N) space per message becomes prohibitive. Hybrid Logical Clocks provide a practical middle ground.
Visualizing Causality Violations and How Vector Clocks Detect Them
The sequence diagram below shows the classic causality violation: B's reply to A's post arrives at C before A's post does. The vector clock attached to each message lets C detect that it is missing a causal predecessor before displaying any content.
```mermaid
sequenceDiagram
    participant A as Node A (poster)
    participant B as Node B (replier)
    participant C as Node C (observer)
    A->>B: Post "Hello?" with vector [A:1, B:0, C:0]
    B->>B: Reply "Yes, hi!" with vector [A:1, B:1, C:0]
    B->>C: Deliver reply first (network path is faster)
    Note over C: C sees [A:1, B:1, C:0] but has not seen A:1
    C->>C: Detect missing predecessor, buffer the reply
    A->>C: Deliver original post (delayed path)
    Note over C: C now has A:1, causal predecessor satisfied
    C->>C: Display post first, then reply in causal order
```
The flowchart below shows how a distributed system should process incoming events using vector clocks to enforce causal delivery:
```mermaid
flowchart TD
    A["Message arrives with vector clock W"] --> B{"Is my local V causally ready for W?"}
    B -->|"Yes: all of W's predecessors already seen"| C["Deliver message to application"]
    B -->|"No: missing causal predecessor"| D["Buffer message in causal queue"]
    D --> E["Wait for missing predecessor to arrive"]
    E --> F["Missing predecessor arrives"]
    F --> G["Deliver buffered messages in causal order"]
    C --> H["Update local vector clock: V[j] = max(V[j], W[j]), then V[i]++"]
    G --> H
```
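The buffer-and-retry logic above can be sketched in a few lines. This is an illustrative model, assuming each vector entry counts messages delivered per sender; the node labels A, B, C match the sequence diagram:

```python
A, B, C = 0, 1, 2  # node indices into the vector

def causally_ready(V, W, sender):
    """W is deliverable iff it is the next message from `sender` and every
    message it causally depends on from other nodes has been delivered."""
    return W[sender] == V[sender] + 1 and all(
        w <= v for k, (v, w) in enumerate(zip(V, W)) if k != sender)

def receive(V, W, sender, buffered, log):
    if not causally_ready(V, W, sender):
        buffered.append((W, sender))  # missing a causal predecessor: buffer
        return
    V[sender] += 1
    log.append(W)
    # A delivery may unblock previously buffered messages; retry them.
    for pending in list(buffered):
        if causally_ready(V, pending[0], pending[1]):
            buffered.remove(pending)
            receive(V, pending[0], pending[1], buffered, log)

V, buffered, log = [0, 0, 0], [], []
receive(V, [1, 1, 0], B, buffered, log)  # B's reply arrives first: buffered
receive(V, [1, 0, 0], A, buffered, log)  # A's post arrives: both delivered
print(log)  # [[1, 0, 0], [1, 1, 0]] -- causal order restored
```

The reply sat in the buffer until its missing predecessor (A's post) arrived, exactly as in the diagram.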
Real-World Cases: Cassandra, Spanner, and the Delete-Before-Write Anomaly
The Cassandra Delete-Before-Write Anomaly
Cassandra's LWW conflict resolution uses the timestamp embedded in each write. A particularly dangerous manifestation occurs with the delete-before-write pattern:
- Client writes `user.email = "a@example.com"` with timestamp T=100.
- Client deletes `user.email` (tombstone) with timestamp T=101.
- Client re-writes `user.email = "b@example.com"` with timestamp T=102.
In a system with clock skew, these three operations may arrive at different replicas with different apparent orderings. If the re-write (step 3) is sent from an application server whose clock is 5 seconds ahead of the server that sent the delete (step 2), the re-write might carry timestamp T=107 while the delete carries T=101. LWW says T=107 wins: the re-write survives, which is correct.
But if the clock relationship is reversed, with the delete sent from the fast-clock server (timestamp T=106) and the re-write from the slow-clock server (timestamp T=97), then LWW discards the re-write and the tombstone wins. The email address appears to have been successfully re-created, but on any replica that applies the tombstone after the re-write, the row silently disappears. The application sees an inconsistent view depending on which replica it reads from.
This is not a bug in Cassandra. It is the documented behavior of LWW with client-supplied timestamps in a system where clocks are not perfectly synchronized.
Google Spanner and TrueTime
Google Spanner takes the opposite approach from logical clocks: it does not hide clock uncertainty, it exposes and bounds it. Spanner uses GPS receivers and atomic clocks at each datacenter to bound clock uncertainty to ±7 milliseconds globally.
Rather than providing a single timestamp T, Spanner's TrueTime API returns an interval [earliest, latest] that is guaranteed to contain the true current time. The interval is narrow because of the GPS/atomic clock infrastructure, typically 1–7 ms wide.
To provide external consistency (strict serializability), Spanner implements commit-wait: after a transaction's commit timestamp T_commit is chosen, Spanner waits until TrueTime.now().earliest > T_commit before returning success to the client. This guarantees that no future transaction on any node can legitimately claim an earlier timestamp, because all clocks are bounded to within 7 ms of true time.
The cost: each Spanner write incurs a commit-wait delay of 0–14 ms, bounded by the TrueTime uncertainty interval. This is the price of globally consistent external ordering, a guarantee that is very hard to provide without specialized clock hardware.
CockroachDB and Hybrid Logical Clocks
CockroachDB achieves similar correctness guarantees to Spanner without GPS infrastructure by using Hybrid Logical Clocks (HLC). An HLC timestamp is a pair (wall_component, logical_component). The wall component tracks physical time; the logical component breaks ties.
HLC's critical property: an HLC timestamp is always at least as large as the latest HLC timestamp the node has observed. If a node receives a message with HLC (T=100, L=3) and its own HLC is (T=98, L=0), it advances its clock to (T=100, L=4), combining the physical time from the received message with a logical counter that ensures uniqueness.
This means HLC timestamps are monotonically increasing even in the presence of significant clock drift: any event that causally follows another always carries a strictly higher HLC, regardless of individual node clock accuracy. LWW using HLC timestamps is therefore safe for causally related writes; truly concurrent writes still require a tie-breaking or merge policy.
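The send and receive rules can be written down directly. This is a sketch of the published HLC algorithm, not CockroachDB's actual implementation; the injectable `physical` clock exists only to make the example deterministic:

```python
import time

class HLC:
    """Hybrid Logical Clock: a (wall, logical) pair, monotonic under skew."""

    def __init__(self, physical=time.time):
        self.physical = physical         # injectable physical clock
        self.wall, self.logical = 0, 0

    def now(self):
        """Timestamp for a local or send event."""
        pt = self.physical()
        if pt > self.wall:
            self.wall, self.logical = pt, 0  # physical clock moved ahead
        else:
            self.logical += 1                # clock stalled: tick logically
        return (self.wall, self.logical)

    def update(self, msg_wall, msg_logical):
        """Merge a received timestamp so the result exceeds both clocks."""
        pt = self.physical()
        if pt > self.wall and pt > msg_wall:
            self.wall, self.logical = pt, 0
        elif msg_wall > self.wall:
            self.wall, self.logical = msg_wall, msg_logical + 1
        elif msg_wall == self.wall:
            self.logical = max(self.logical, msg_logical) + 1
        else:
            self.logical += 1
        return (self.wall, self.logical)

# Reproduce the example above: a node whose physical clock reads T=98
# receives a message stamped (T=100, L=3).
node = HLC(physical=lambda: 98)
node.now()                  # local clock is now (98, 0)
print(node.update(100, 3))  # (100, 4)
```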
Trade-offs Across Clock Mechanisms
| Mechanism | Accuracy | Space Overhead | Operational Dependency | Best For |
| --- | --- | --- | --- | --- |
| Wall clock (NTP) | Low (100 ms to seconds of error) | Minimal (8 bytes) | NTP infrastructure | Rough logging, non-critical ordering |
| Lamport timestamp | Correct for happens-before, no concurrency detection | Minimal (8 bytes) | None | Event log ordering, debug tracing |
| Vector clock | Precise causal ordering plus concurrency detection | O(N) per event | None | Small clusters, distributed version control |
| Hybrid Logical Clock (HLC) | Correct causal ordering, monotonic | 16 bytes | NTP (for the wall component) | Production databases, global systems |
| TrueTime (GPS + atomic) | Bounded uncertainty (±7 ms) | 16 bytes | GPS receivers, atomic clocks | Google-scale global linearizability |
Choosing the Right Clock Mechanism
| Situation | Recommendation |
| --- | --- |
| Use wall clocks when | Ordering precision is not critical; events are for logging or analytics only; you control client clock discipline |
| Use Lamport timestamps when | You need a simple happened-before relationship for event ordering; concurrent events are acceptable without detection |
| Use vector clocks when | You need to detect concurrent events (conflict detection for distributed version control, Riak-style sibling values); your cluster has fewer than 20–30 nodes |
| Use HLC when | You need production-safe conflict resolution across nodes with imperfect NTP; you cannot afford GPS infrastructure; CockroachDB or YugabyteDB is your database |
| Use TrueTime when | You are operating at Google scale; globally distributed linearizable transactions are required; GPS receiver infrastructure is acceptable |
| Avoid client-supplied LWW timestamps when | Any node in your system can have clock drift; clients supply their own timestamps for writes (use server-assigned HLC instead) |
The Delete-Before-Write Anomaly: A Detailed Walkthrough
The delete-before-write anomaly is the most practically dangerous manifestation of clock skew in Cassandra deployments. Understanding exactly why it occurs is the first step to preventing it.
Setup: A Cassandra cluster with replication factor 3. Client application has two server instances, App-1 and App-2. App-1's clock is 3 seconds ahead of App-2's clock. Both are producing writes with their local wall-clock timestamps.
Timeline:
- App-2 (clock: T=1000) writes `row[key="user:42"].email = "old@example.com"`: timestamp 1000.
- Business logic decides the row is stale and should be deleted. App-1 (clock: T=1003) issues `DELETE row[key="user:42"]`: a tombstone with timestamp 1003.
- App-2 (clock: T=1001) re-creates the row with `row[key="user:42"].email = "new@example.com"`: timestamp 1001.
From LWW's perspective: the tombstone (1003) beats the new write (1001). The tombstone wins. The row disappears.
From physical time: the actual order was write, then delete, then re-create. Taking App-1's clock as accurate (App-2 runs 3 seconds behind), the re-create happened at actual wall time T=1004 even though App-2's clock stamped it T=1001. The re-create was genuinely the last operation in physical time. LWW with skewed clocks discarded it.
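The timeline reduces to a three-way timestamp comparison. A toy resolver (plain Python, illustrative only) makes the outcome explicit:

```python
# Each operation: (kind, value, client_timestamp). LWW keeps the operation
# with the highest timestamp, whether it is a write or a tombstone.
ops = [
    ("write",     "old@example.com", 1000),  # App-2
    ("tombstone", None,              1003),  # App-1, clock 3 s ahead
    ("write",     "new@example.com", 1001),  # App-2; last in physical time
]
winner = max(ops, key=lambda op: op[2])
print(winner[0])  # tombstone -- the re-created row silently disappears
```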
The fix: let the coordinator assign write timestamps server-side (omit client-supplied USING TIMESTAMP values) instead of relying on application-server clocks. Monitor clock offset metrics on every application server, keep read repair enabled so replicas converge, and bound NTP drift to well under 1 second for any cluster that relies on LWW semantics.
Cassandra and CockroachDB: Clock Safety Configuration
Cassandra: Mitigating Clock Skew with Quorum Consistency
The safest mitigation for clock skew in Cassandra is to use quorum consistency for both reads and writes. This ensures that a majority of replicas participate in every operation, reducing (but not eliminating) the risk that a write with a stale timestamp silently loses to a concurrent write.
```yaml
# cassandra.yaml -- consistency configuration.
# Use QUORUM or LOCAL_QUORUM for reads and writes on critical tables
# to mitigate the risk of stale-timestamp LWW conflicts.

# Default read consistency is set at the application level, not in
# cassandra.yaml: CONSISTENCY QUORUM (via CQL or driver configuration).

# Clock skew monitoring: alert if any node's NTP offset exceeds ~500 ms
# (monitor via your NTP daemon's metrics alongside nodetool tpstats).

# Tombstone window: ensure gc_grace_seconds exceeds the maximum possible
# replication lag. If tombstones are compacted before they reach all
# replicas, deleted data reappears.
gc_grace_seconds: 864000  # 10 days (default); do not reduce without understanding tombstone mechanics
```
CockroachDB: HLC Configuration
CockroachDB uses HLC natively and requires no special configuration for causal correctness. The key operational parameter is the maximum clock offset tolerance, set per node at startup:

```shell
# Maximum acceptable clock offset between nodes (default: 500ms).
# The value must be identical on every node in the cluster; reduce it
# (for example to 250ms) only with well-disciplined NTP.
cockroach start --max-offset=500ms

# CockroachDB monitors clock offset against other nodes continuously and
# self-terminates any node whose offset approaches the configured maximum,
# preventing it from serving operations that could violate consistency.
```
Lessons from Clock-Driven Production Bugs
1. Clocks are observations, not facts. A timestamp from a distributed node is that node's belief about the current time. It is not a fact. Any system that treats timestamps as authoritative ordering facts will eventually encounter a case where two nodes disagree and the wrong value wins.
2. NTP discipline is necessary but not sufficient for conflict resolution. NTP keeps clocks within tens of milliseconds under normal conditions. Tens of milliseconds is more than enough for LWW to produce incorrect results in a high-throughput system โ hundreds of writes per second means hundreds of potential write-ordering inversions per second.
3. Client-supplied timestamps are a trap. Cassandra, MongoDB, and several other systems allow clients to supply their own write timestamps. This is convenient, but it means that any clock skew on any client propagates directly into conflict resolution. Use server-assigned timestamps wherever possible.
4. The delete-before-write anomaly is predictable, not rare. Any deployment of Cassandra with client-supplied timestamps and NTP drift of more than a few hundred milliseconds will eventually exhibit deleted-data-reappearing or overwritten-new-values. This is not a race condition โ it is a deterministic consequence of the conflict resolution model.
5. HLC is the production-safe default. CockroachDB's adoption of HLC shows that you do not need GPS infrastructure to achieve safe conflict resolution in a globally distributed system. An HLC timestamp fits in 16 bytes per event (just 8 more than a plain wall-clock timestamp) and requires no external infrastructure beyond the NTP that servers already run.
Key Takeaways
- Physical clocks on distributed machines drift apart at rates of 1–100 ppm; NTP corrects this imprecisely, leaving residual errors of 10 ms to seconds depending on conditions.
- Last-Write-Wins conflict resolution using wall-clock timestamps silently discards the correct write when any node's clock has drifted: the most recent write in physical time loses to an earlier write with a higher timestamp.
- Causality violations are distinct from clock skew: they occur when asynchronous message delivery causes a reply to arrive before the message that triggered it, even with perfect clocks.
- Lamport timestamps solve the happens-before ordering problem but cannot distinguish concurrent events; vector clocks solve both but grow O(N) in size with the number of nodes.
- Hybrid Logical Clocks (HLC) combine the physical time approximation of wall clocks with the monotonic correctness of logical clocks, fitting in 16 bytes and requiring no GPS infrastructure.
- Google Spanner's TrueTime bounds clock uncertainty to ±7 ms using GPS and atomic clocks, then uses commit-wait to guarantee that no transaction can claim an earlier timestamp than one already committed anywhere in the system.
- Any Cassandra deployment using client-supplied timestamps must either bound NTP drift below the minimum meaningful conflict window or switch to quorum consistency and server-assigned timestamps.
The meta-lesson: a distributed system that trusts physical clocks for correctness is building on a foundation that physics actively tries to undermine. Use logical clocks for ordering. Use wall clocks only for approximate human-readable timestamps and performance monitoring.
Related Posts
- Split Brain Explained: When Two Nodes Both Think They Are Leader – fencing tokens and epoch numbers for leader election safety
- Stale Reads and Cascading Failures in Distributed Systems – replication lag, consistency models, and circuit breaker patterns
- Data Anomalies in Distributed Systems: An Overview – the full distributed anomaly taxonomy, including split brain, stale reads, and cascading failures
- The Consistency Continuum: Patterns in Distributed Systems – read-your-writes, monotonic reads, and the full consistency model spectrum
- Dirty Read Database Anomaly – the SQL transaction counterpart: reading uncommitted data from a concurrent transaction
- Write Skew Database Anomaly – a concurrent SQL transaction anomaly that arises under snapshot isolation
Written by
Abstract Algorithms
@abstractalgorithms