
CosmosDB Partition Internals: Logical vs Physical Partitions Explained

How CosmosDB maps your partition key to physical storage, and why the wrong choice costs 10x more at scale

Abstract Algorithms
19 min read

AI-assisted content.

## 🔥 When Your Database Bill Triples Overnight

A retail engineering team ships a flash-sale feature. Traffic spikes 10x. Their Azure CosmosDB bill triples within 24 hours. Queries that ran in 5ms now take 800ms. The on-call engineer bumps provisioned RU/s from 10,000 to 40,000, and the throttling barely improves.

The problem is not throughput headroom. It's that 90% of all reads hit a single partition key: status = "active". No matter how many RU/s they add to the account, those requests all funnel into one logical partition, which maps to one physical partition, which has a hard ceiling of 10,000 RU/s.

This is the hot partition problem, and avoiding it starts with understanding CosmosDB's two-layer partition model: logical partitions (which you design) and physical partitions (which CosmosDB manages). Get the distinction wrong at schema design time, and you cannot fix it without a full data migration. Get it right, and you get near-linear horizontal scalability with sub-10ms reads at any traffic level.

## 📖 CosmosDB Partitioning: Why Two Layers Exist

CosmosDB is a globally distributed, multi-model database designed to scale horizontally without the operational burden of manual sharding. To do this, it splits data across nodes, but it exposes two conceptually distinct partition layers rather than one.

The logical partition is your contract with the database: you declare a partition key, and CosmosDB groups every document with the same key value together. The physical partition is CosmosDB's internal implementation detail: it actually stores data in replica sets distributed across nodes, and it manages those replica sets on your behalf.

This two-layer design gives you control over data placement (so queries stay in-partition) while giving CosmosDB freedom to redistribute load and grow capacity (so you don't need to re-shard manually). Understanding what each layer can and cannot do is the foundation of every correct CosmosDB data model.

๐Ÿ” Two Layers of Partitioning โ€” One You Control, One You Don't

CosmosDB partitions data across nodes to achieve horizontal scalability. It does this with two distinct abstractions that operate at different layers:

| Concept | Who controls it | Application visibility | Hard size limit |
|---|---|---|---|
| Logical Partition | You (via partition key) | Fully visible | 20 GB |
| Physical Partition | CosmosDB internally | Mostly hidden | ~50 GB |

Think of logical partitions as the filing system you design: you label every folder (partition key value). Physical partitions are the filing cabinets CosmosDB builds; it decides which folders go in which cabinet, and it can add cabinets automatically as you fill them.

### What Is a Logical Partition?

A logical partition is the set of all documents that share the same partition key value. You define the partition key when creating a container; it is immutable after creation.

  • All documents with customerId = "cust-123" form one logical partition.
  • Maximum size: 20 GB per logical partition value, a hard ceiling that cannot be raised.
  • Maximum throughput: approximately 10,000 RU/s per logical partition.
  • All documents in a logical partition are co-located on the same physical storage, which enables efficient single-partition queries without cross-node fan-out.

Logical partitions are permanent groupings. Once a document is written, its logical partition assignment is sealed by its partition key value. You cannot reassign a document to a different logical partition without deleting and re-inserting it.

### What Is a Physical Partition?

A physical partition is the actual server-side storage and compute unit. CosmosDB creates, manages, and splits physical partitions entirely on your behalf.

  • Each physical partition is backed by a replica set of four nodes for durability and high availability.
  • A single physical partition holds one or more logical partitions.
  • Physical partitions grow to approximately 50 GB and can serve up to 10,000 RU/s.
  • When a physical partition becomes too large or receives too much traffic, CosmosDB automatically splits it into two physical partitions, with zero downtime.

You never create, delete, or explicitly manage physical partitions. They are CosmosDB's internal elasticity mechanism.

โš™๏ธ How CosmosDB Routes Requests: The Hash-Range Mapping

The routing layer between your application and physical storage is a partition key hash space spanning from 0x0000 to 0xFFFF. CosmosDB hashes each document's partition key value into this range. Physical partitions own contiguous slices of the hash space, and logical partitions land in the physical partition that owns their hash.

```mermaid
graph TD
    A[Client Request] --> B[CosmosDB Gateway]
    B --> C[Partition Key Router]
    C --> D{Hash of Partition Key}
    D --> E[Physical Partition 1 - Range 0x0000 to 0x3FFF]
    D --> F[Physical Partition 2 - Range 0x4000 to 0x7FFF]
    D --> G[Physical Partition 3 - Range 0x8000 to 0xBFFF]
    D --> H[Physical Partition 4 - Range 0xC000 to 0xFFFF]
    E --> I[LP: userId-A, userId-B, userId-C]
    F --> J[LP: userId-D, userId-E]
    G --> K[LP: userId-F, userId-G, userId-H]
    H --> L[LP: userId-J, userId-K]
```


This diagram shows how a client request reaches a physical partition. The partition key is hashed into the `0x0000-0xFFFF` range. Each physical partition owns a contiguous band of that range and holds the logical partitions whose hashes land there. Multiple logical partitions can share one physical partition, but a single logical partition always lives entirely within one physical partition. This atomic boundary is the critical architectural invariant: logical partitions are indivisible.
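
The routing above can be sketched as a small simulation. This is illustrative only: it uses Java's `String.hashCode()` where CosmosDB actually uses a Murmur3 hash over the partition key value, and a fixed four-partition layout rather than a dynamic one.

```java
import java.util.List;

// Illustrative sketch of hash-range routing. NOT the real CosmosDB
// implementation: the service hashes with Murmur3, not hashCode().
public class PartitionRouter {
    // Each physical partition owns a contiguous band of the 0x0000-0xFFFF hash space.
    record HashRange(int start, int end) {}

    static final List<HashRange> PHYSICAL_PARTITIONS = List.of(
        new HashRange(0x0000, 0x3FFF),
        new HashRange(0x4000, 0x7FFF),
        new HashRange(0x8000, 0xBFFF),
        new HashRange(0xC000, 0xFFFF)
    );

    // Map a partition key value into the 16-bit hash space.
    static int hashKey(String partitionKeyValue) {
        return Math.floorMod(partitionKeyValue.hashCode(), 0x10000);
    }

    // A logical partition lands in exactly one physical partition: the one
    // whose band contains its hash. It can never span two bands.
    static int route(String partitionKeyValue) {
        int h = hashKey(partitionKeyValue);
        for (int i = 0; i < PHYSICAL_PARTITIONS.size(); i++) {
            HashRange r = PHYSICAL_PARTITIONS.get(i);
            if (h >= r.start() && h <= r.end()) return i;
        }
        throw new IllegalStateException("unreachable: bands cover the full space");
    }
}
```

The key property the sketch demonstrates: routing is a pure function of the key value, so every document with the same partition key value deterministically lands on the same physical partition.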

## 🧠 Partition Splitting Internals: How CosmosDB Scales Without Downtime

### The Internals of a Split

CosmosDB triggers a physical partition split under two conditions:

1. **Data volume:** The physical partition approaches ~50 GB.
2. **Throughput scale-up:** You increase provisioned or autoscale RU/s and CosmosDB needs to spread load across more partitions to fulfill the new capacity.

```mermaid
graph TD
    A[Physical Partition P1 - 45GB - 8000 RU/s] -->|Split triggered| B[Physical Partition P1a - 22GB - 4000 RU/s]
    A -->|Split triggered| C[Physical Partition P1b - 23GB - 4000 RU/s]
    B --> D[Logical Partitions LP1 to LP5]
    C --> E[Logical Partitions LP6 to LP10]
```

When a split occurs, CosmosDB divides the hash range of the original physical partition into two equal halves and migrates each half's logical partitions to one of the two resulting physical partitions. Data migration happens in the background while reads and writes continue; the gateway layer transparently redirects each request to the correct post-split target.

What you actually observe during a split: brief latency spikes (typically under one second), no data loss, and no application code change required. The entire operation is orchestrated by the CosmosDB control plane and is invisible to your application except for transient retry events.
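
The split mechanics can be modeled in a few lines: halve the hash band and move each logical partition, whole, to whichever half contains its hash. Again a simplified sketch with a stand-in hash function, not CosmosDB's internal code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified model of a physical partition split. Logical partitions move
// atomically; none is ever divided between the two halves.
public class PartitionSplit {
    // Stand-in for the real Murmur3 hash over the partition key value.
    static int hash(String partitionKeyValue) {
        return Math.floorMod(partitionKeyValue.hashCode(), 0x10000);
    }

    // Split the band [start, end] at its midpoint and reassign each logical
    // partition to the half that owns its hash.
    static Map<String, List<String>> split(int start, int end, List<String> logicalPartitions) {
        int mid = start + (end - start) / 2;
        List<String> lower = new ArrayList<>();  // new partition "P1a": [start, mid]
        List<String> upper = new ArrayList<>();  // new partition "P1b": (mid, end]
        for (String lp : logicalPartitions) {
            (hash(lp) <= mid ? lower : upper).add(lp);
        }
        return Map.of("P1a", lower, "P1b", upper);
    }
}
```

Note what this implies for hot partitions: a hot logical partition ends up entirely in one half, so splitting never dilutes the heat of a single key value.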

### Performance Analysis During a Split

Physical partition splits are generally transparent, but they do have measurable performance effects worth understanding for capacity planning:

  • RU/s temporarily doubles: During migration, both the original and the new physical partition may serve reads simultaneously, meaning the total RU capacity briefly doubles before the old partition drains.
  • Write latency increases transiently: Writes are coordinated between the old and new partition during data migration. The spike is typically under 200ms but can surface in P99 latency dashboards.
  • Partition count multiplies on scale-up: When you double provisioned RU/s, CosmosDB may double the physical partition count. More partitions enable more parallelism, but only if the partition key provides sufficient cardinality to distribute load across them. A low-cardinality key (three status values) means most new physical partitions remain idle after the split.

Key implication: when autoscale raises your throughput ceiling and splits partitions to match, the extra capacity is only usable if your key can actually spread load across the new partitions.

## 📊 Hot Partition Flow: How a Bad Key Saturates Physical Storage

Physical partition splits and increased RU/s cannot save you from a poorly chosen partition key. The fundamental problem is at the logical partition layer, before CosmosDB's elasticity mechanisms even get involved.

### Scenario 1: Low-Cardinality Key

```yaml
container: "orders"
partitionKey: "/status"
values: ["active", "completed", "cancelled"]
```

Three logical partitions exist regardless of how many millions of orders you store. Ninety percent of queries target status = "active". That single logical partition and the physical partition it resides on absorb 90% of all RU/s. CosmosDB can split that physical partition, but the hot logical partition moves intact to one of the two halves; the hotness follows the key, not the partition.
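
A toy traffic model (illustrative numbers, seeded for reproducibility) makes the skew concrete: with three status values and 90% of requests on one of them, one logical partition absorbs nearly everything no matter how many physical partitions exist.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Toy traffic model for a low-cardinality key: 90% of requests hit
// status = "active", so one logical partition takes nearly all RU/s.
public class HotPartitionDemo {
    static Map<String, Integer> requestsPerKey(int totalRequests, long seed) {
        Map<String, Integer> counts = new HashMap<>();
        Random rnd = new Random(seed);
        for (int i = 0; i < totalRequests; i++) {
            double roll = rnd.nextDouble();
            String key;
            if (roll < 0.90)      key = "active";     // the hot key
            else if (roll < 0.95) key = "completed";
            else                  key = "cancelled";
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

Whatever the physical layout, the "active" bucket here dwarfs the other two: that is the load profile a split cannot fix.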

### Scenario 2: Skewed Natural Key

```yaml
container: "users"
partitionKey: "/country"
```

If 80% of users are in the United States, country = "US" accumulates towards the 20 GB logical partition ceiling. Adding RU/s does not help once the ceiling is hit: further writes to that partition fail with an HTTP 403 error ("partition key reached maximum size").
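
A quick back-of-envelope check, using this scenario's skew as an assumption, shows why the failure arrives on a schedule rather than by surprise: if one key value receives 80% of all new data, the days until it hits the 20 GB wall are easy to project.

```java
// Back-of-envelope projection: days until the hottest key value hits the
// 20 GB logical partition ceiling. The inputs are assumptions you supply.
public class PartitionSizeProjection {
    static final double LOGICAL_PARTITION_CEILING_GB = 20.0;

    static double daysToCeiling(double containerGrowthGbPerDay, double hottestKeyShare) {
        double hottestKeyGbPerDay = containerGrowthGbPerDay * hottestKeyShare;
        return LOGICAL_PARTITION_CEILING_GB / hottestKeyGbPerDay;
    }
}
```

With 1 GB/day of container growth and an 80% skew, the "US" partition fills in 25 days, and no RU/s setting changes that date.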

### Even Distribution vs. Hot Partition: A Visual Contrast

```mermaid
graph TD
    subgraph Even[Good Key - userId - Even Distribution]
        PP1[Physical Partition 1 - 3333 RU/s]
        PP2[Physical Partition 2 - 3333 RU/s]
        PP3[Physical Partition 3 - 3334 RU/s]
    end
    subgraph Uneven[Bad Key - status - Hot Partition]
        HP1[Physical Partition A - 9500 RU/s THROTTLED]
        HP2[Physical Partition B - 300 RU/s]
        HP3[Physical Partition C - 200 RU/s]
    end
```


This diagram illustrates the contrast between a well-chosen and a poorly chosen partition key at the physical level. With `userId` as the key and millions of users, each physical partition handles a balanced fraction of total traffic, all operating well below the 10,000 RU/s physical partition ceiling. With `status` as the key, the `active` partition absorbs almost all traffic and saturates its host physical partition, producing 429 throttling errors even while the other two physical partitions sit nearly idle. The account-level RU/s meter shows low utilization, yet users experience failures: a counter-intuitive outcome that catches many teams off guard.

## 🧪 Practical: Choosing and Fixing Partition Keys

The goal is **high cardinality + uniform access distribution + query alignment**.

| Design Rule | Good Example | Bad Example |
|---|---|---|
| **High cardinality** | `userId` (millions of distinct values) | `status` (3-5 values) |
| **Uniform access** | Each key accessed at roughly equal rate | One key dominates all reads |
| **Query alignment** | Most queries include the partition key | Most queries omit the partition key |
| **Size headroom** | Each logical partition stays well under 20 GB | Any single key will accumulate > 20 GB |
| **Write spread** | Writes distribute across many distinct key values | All writes target the "latest" key |
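
The first two rules in the table can be checked empirically against a sample of documents before provisioning. The helper below is a hypothetical sketch, not an SDK feature: it reports a candidate key's distinct-value count and the share held by its most common value.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-launch audit of a candidate partition key over sample data.
public class KeyAudit {
    record Report(int distinctValues, double maxValueShare) {}

    static Report audit(List<String> sampleKeyValues) {
        Map<String, Long> counts = new HashMap<>();
        for (String v : sampleKeyValues) counts.merge(v, 1L, Long::sum);
        long max = counts.values().stream().mapToLong(Long::longValue).max().orElse(0L);
        double share = sampleKeyValues.isEmpty() ? 0.0 : max / (double) sampleKeyValues.size();
        return new Report(counts.size(), share);
    }
}
```

A result like `distinctValues = 3, maxValueShare = 0.9` fails both the cardinality and uniformity rules before any production data is written.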

### Synthetic Partition Keys for Difficult Workloads

When your data does not have a natural high-cardinality key, build one:

**Suffix bucketing** appends a modulo bucket to an existing key:
`orderId + "_" + (epochSeconds % 50)` → 50 logical partitions per base ID

**Prefix randomization** prepends a random segment to force even hashing:
`String.valueOf(random.nextInt(100)) + "_" + entityId`

**Composite key** combines two meaningful fields:
`tenantId + "_" + userId` → cardinality multiplies across both fields

Suffix bucketing is especially important for **time-series and append-heavy workloads** where all writes naturally cluster on the most recent timestamp. Distributing writes across `sensorId_0` through `sensorId_99` prevents 100% of write traffic from hitting a single logical partition.
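
The three patterns above can be written as small helpers. These are sketches of the patterns described in this section; the helper names are illustrative and not part of any CosmosDB SDK.

```java
import java.util.Random;

// Sketches of the three synthetic partition key patterns.
public class SyntheticKeys {
    // Suffix bucketing: a modulo bucket spreads writes over `buckets` logical partitions.
    static String suffixBucketed(String baseId, long epochSeconds, int buckets) {
        return baseId + "_" + (epochSeconds % buckets);
    }

    // Prefix randomization: a random leading segment forces even hashing.
    static String prefixRandomized(String entityId, Random random, int fanout) {
        return random.nextInt(fanout) + "_" + entityId;
    }

    // Composite key: cardinality multiplies across both fields.
    static String composite(String tenantId, String userId) {
        return tenantId + "_" + userId;
    }
}
```

The trade-off: reads must now either know the bucket or fan out across all buckets of one base ID, which is still far cheaper than a container-wide fan-out.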

## โš–๏ธ Trade-offs and Failure Modes

| Scenario | Root Cause | Observable Impact | Resolution |
|---|---|---|---|
| **429 Too Many Requests** | Logical partition exceeds physical partition RU/s ceiling | Read/write throttling on all items in that physical partition | Redesign partition key; add suffix bucketing |
| **20 GB limit exceeded** | Low-cardinality key with unbounded data growth | Writes to that partition fail permanently | Container migration to a new key; cannot be done in-place |
| **Cross-partition fan-out** | Query filter omits partition key | Latency 10x-100x higher; RU cost scales with partition count | Add partition key to query filter; use secondary indexes |
| **Transient latency spikes** | Physical partition split in progress | Sub-second latency blips during autoscale | Expected behavior; implement retries with exponential backoff |
| **Uneven distribution after split** | Too few distinct partition key values | Some physical partitions overloaded post-split | Higher-cardinality key improves distribution after each split |

### The Cross-Partition Query Tax

A query without a partition key filter executes as a **fan-out**: CosmosDB sends the query to every physical partition in parallel and merges results. At 50 physical partitions, you pay 50x the RU cost of the equivalent single-partition query. Monitor the `x-ms-documentdb-query-metrics` response header: if `retrievedDocumentCount` greatly exceeds `outputDocumentCount`, you are paying for cross-partition scans. The fix is either adding the partition key to the filter or introducing a secondary index on a projected lookup field.
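
The tax is simple arithmetic, and the metrics check reduces to a ratio test. A sketch follows; the 10x threshold is an illustrative choice, not an official guideline.

```java
// Back-of-envelope model of the cross-partition fan-out tax, plus the
// scan-efficiency ratio check on query metrics. Thresholds are illustrative.
public class QueryCostCheck {
    // A fan-out query pays roughly the single-partition cost on every partition.
    static double fanOutRuCost(double singlePartitionRu, int physicalPartitionCount) {
        return singlePartitionRu * physicalPartitionCount;
    }

    // Flags queries that scanned far more documents than they returned.
    static boolean looksLikeCrossPartitionScan(long retrievedDocumentCount, long outputDocumentCount) {
        if (outputDocumentCount == 0) return retrievedDocumentCount > 0;
        return retrievedDocumentCount / (double) outputDocumentCount > 10.0;
    }
}
```

Feeding the two counters from the query-metrics header into a check like this in your application's logging layer surfaces fan-out queries long before they show up on the bill.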

## ๐ŸŒ Real-World Applications

**E-commerce at scale:** Order management systems partition by `customerId` rather than `status`. Every customer's order history lives in one logical partition (enabling O(1) "my orders" queries) while preventing any single key from accumulating unbounded data. Order IDs (GUIDs) are used as the key only when cross-customer analytics, not per-customer retrieval, is the primary access pattern.

**IoT device telemetry:** Devices emit readings continuously. A naive `timestamp` partition key creates a single hot logical partition for every second of data. Production IoT systems use `deviceId_date` (for example, `device-abc-2026-04-18`), creating thousands of logical partitions per day and spreading writes evenly across physical partitions.

**Multi-tenant SaaS:** Small tenants fit comfortably within 20 GB per `tenantId` partition. For enterprise tenants approaching the limit, suffix buckets (`tenantId_0`, `tenantId_1`, ... `tenantId_9`) distribute one logical tenant across 10 logical partitions. CosmosDB 2023 introduced **hierarchical partition keys** (up to 3 levels: `/tenantId`, `/userId`, `/sessionId`) which handle this pattern natively without manual bucketing.

**Social media feed:** Partition by `userId` so each user's posts form one logical partition. Read-your-own-writes and timeline queries stay in-partition and complete in under 5ms. Celebrity accounts with millions of followers require fan-out to read, but writes stay on one partition: an acceptable trade-off given that reads can be cached at the CDN layer.

## 🧭 Decision Guide: Selecting a Partition Key by Use Case

| Use Case | Recommended Key | Rationale |
|---|---|---|
| User profiles and activity | `userId` | High cardinality; even distribution; aligns with lookup pattern |
| Order management | `customerId` | Co-locates a customer's orders; efficient per-customer queries |
| IoT sensor telemetry | `deviceId_date` | Prevents write hot spots; keeps device history queryable |
| Multi-tenant app (small tenants) | `tenantId` | Isolates tenant data; all queries stay in-partition |
| Multi-tenant app (large tenants) | `tenantId_bucket` | Prevents 20 GB ceiling; spreads load across buckets |
| Product catalog | `productId` (GUID-based) | Uniform hashing; no natural cardinality problem |
| Audit / append-heavy logs | `sourceId_bucket` | Distributes high-frequency writes across logical partitions |
| Time-series analytics | `metricName_hour` | Balances writes; keeps related metrics queryable per hour |

## ๐Ÿ› ๏ธ Azure CosmosDB SDK: Configuring Partition Key at Container Creation

The partition key is set at container provisioning and is permanent. There is no migration path within the same container; if you choose the wrong key, you must create a new container and copy data. The SDK call that sets this irrevocable choice is simple and easy to overlook:

```java
CosmosContainerProperties props = new CosmosContainerProperties(
    "orders",
    "/customerId"   // partition key path - cannot change after creation
);

// Hierarchical partition key (subpartitioning, Java SDK 4.37+):
// PartitionKeyDefinition pkDef = new PartitionKeyDefinition();
// pkDef.setPaths(List.of("/tenantId", "/customerId"));
// pkDef.setKind(PartitionKind.MULTI_HASH);
// pkDef.setVersion(PartitionKeyDefinitionVersion.V2);
// CosmosContainerProperties props = new CosmosContainerProperties("orders", pkDef);

ThroughputProperties autoscale = ThroughputProperties.createAutoscaledThroughput(4000);

database.createContainerIfNotExists(props, autoscale).block();
```

CosmosDB automatically creates the initial set of physical partitions and handles all splits as data volume and provisioned RU/s grow. The hierarchical key variant (commented out above) allows up to three levels of nesting, co-locating tenant data while maintaining high cardinality and eliminating the need for manual suffix bucketing in most multi-tenant scenarios.

For a full deep-dive on CosmosDB SDK configuration, indexing policies, and consistency levels, see the Azure CosmosDB Java SDK v4 documentation.

## 📚 Lessons Learned from CosmosDB Partition Failures

  • The 20 GB limit is a hard wall with no operational escape hatch. Unlike throughput limits that you can resolve by raising RU/s, hitting the logical partition size ceiling causes all writes to that partition to fail. The only fix is a full container migration with a new partition key: an expensive operation requiring double storage, a data copy pipeline, and a cutover window. Validate expected data volume per key value before launch.

  • Hot partitions are invisible in account-level metrics. CosmosDB's default monitoring shows account-level RU/s consumption. A container can report 40% account utilization while one physical partition inside it is at 100% and throttling. Always monitor per-partition RU/s metrics (available in the Azure portal under "Insights → Throughput → Normalized RU consumption by partition key range") to detect hot partitions before they cause production incidents.

  • Cross-partition queries are a cost multiplier, not just a latency issue. A query without a partition key filter fans out to every physical partition. At 100 physical partitions (a realistic count for a multi-year production container), a cross-partition query costs 100x more RU/s than the equivalent in-partition query. Teams that start with 5 physical partitions rarely notice the cost. Teams that have grown to 100 experience unexplained billing spikes. Audit your query patterns whenever you significantly scale a container.

  • Partition key design is a first-class architecture decision. It determines query patterns, cost structure, and the scalability ceiling for the lifetime of that container. It deserves the same deliberate analysis as schema design in a relational database, before the first byte of production data is written.

  • Hierarchical partition keys (introduced 2023) eliminate most synthetic key hacks. If you are designing a new container for a multi-tenant or hierarchical access pattern, evaluate whether hierarchical keys make suffix bucketing unnecessary. They allow CosmosDB to co-locate data at the top-level key while maintaining the cardinality of the full composite key.

## 📌 TLDR & Key Takeaways

TLDR: In CosmosDB, logical partitions are your grouping contract (defined by partition key, max 20 GB, permanent), and physical partitions are CosmosDB's elastic storage units that hold many logical partitions and split automatically. Hot partitions, caused by low-cardinality or skewed keys, cannot be fixed by adding RU/s; they require a better partition key or synthetic suffix bucketing.

  • Logical partition = all documents sharing the same partition key value. You define it. Max 20 GB and ~10,000 RU/s per value. Permanent and indivisible.
  • Physical partition = CosmosDB's actual storage unit backed by a 4-node replica set. Holds multiple logical partitions. Splits automatically when data or throughput exceeds thresholds.
  • Routing = CosmosDB hashes your partition key into a fixed range; physical partitions own contiguous hash bands. Many logical partitions map to one physical partition, never the reverse.
  • Hot partitions arise from low-cardinality or skewed partition keys. They cannot be resolved by increasing RU/s; the bottleneck is at the logical partition level.
  • Fix hot partitions by choosing a high-cardinality key or building synthetic suffix-bucket keys to distribute data and traffic across many logical partitions.
  • Partition key is permanent: plan for worst-case data volume and access patterns before the first write. Changing it requires a full container migration.

๐Ÿ“ Practice Quiz

  1. A CosmosDB container uses status as its partition key with values: draft, published, archived. Traffic is 95% reads on status = "published". What is the primary failure mode, and why can it not be resolved by adding more RU/s?

     **Answer:** The published logical partition absorbs ~95% of all read RU/s. The physical partition hosting it hits its 10,000 RU/s ceiling and returns 429 Too Many Requests errors. Adding account-level RU/s does not help because the bottleneck is the single logical partition, which is capped at ~10,000 RU/s regardless of total account throughput. More RU/s is distributed across physical partitions, but the hot logical partition cannot spread across multiple physical partitions. The fix requires a new partition key with higher cardinality.

  2. What is the maximum data size allowed per logical partition in CosmosDB, and what happens when it is exceeded?

     **Answer:** 20 GB per logical partition. Exceeding this limit causes all writes to that logical partition to fail with an HTTP 403 error ("partition key reached maximum size"). Unlike throughput limits, this ceiling cannot be raised by increasing RU/s or adding capacity. The only resolution is migrating the container to a new partition key.

  3. True or False: A single logical partition can span multiple physical partitions.

     **Answer:** False. A logical partition always lives entirely within one physical partition. Physical partitions can contain many logical partitions, but the reverse is never true. Logical partitions are atomic units that move as a whole when physical partitions split.

  4. An IoT system records sensor readings every second for 50,000 sensors. The team proposes timestamp as the partition key. What problem will they encounter, and what is the recommended fix?

     **Answer:** Using timestamp as a partition key concentrates all current writes into a single "latest" logical partition (one per second). Every write from every sensor in that second hits the same partition, causing a write hot spot. The fix is a synthetic composite key such as sensorId + "_" + (epochSeconds % 100). This distributes writes across 100 logical partition groups per sensor while keeping each sensor's recent history queryable with a prefix scan. Alternatively, hierarchical partition keys (/sensorId, /date) achieve the same result natively.

  5. CosmosDB splits a physical partition. Describe what happens to the logical partitions inside it and what the application observes.

     **Answer:** The physical partition's hash range is divided into two halves. Each logical partition, based on its hash position, is assigned in its entirety to one of the two resulting physical partitions. Logical partitions are never split; they move atomically. The application experiences brief sub-second latency spikes as the gateway layer updates its routing table. No data is lost and no application code change is required. The split completes transparently in the background while reads and writes continue.

  6. A multi-tenant SaaS product stores data for 10,000 customers. One enterprise customer accounts for 30% of all data. The container uses tenantId as the partition key. What risk exists, and how would you mitigate it?

     **Answer:** The enterprise tenant's logical partition (tenantId = "enterprise-corp") may approach or exceed the 20 GB hard limit as data grows. The mitigation is suffix bucketing: replace the tenantId key with a composite key like tenantId_bucket where bucket is computed as (documentId.hashCode() % 10). This creates 10 logical partitions for that tenant, each capable of holding 20 GB, a 10x increase in effective capacity. Alternatively, use CosmosDB hierarchical partition keys (/tenantId, /entityId) which achieve the same distribution natively without manual bucketing.

  7. Open-ended challenge: Your team is designing a new CosmosDB container for a global e-commerce platform. You expect 500 million orders, with 80% of users in the US and 20% spread across 40 other countries. The most common query is "get all orders for a customer." Propose a partition key strategy, identify the risks in your design, and explain what monitoring you would put in place on day one.

     **Answer:** Use customerId as the partition key. This aligns with the primary query pattern ("get all orders for customer X"), distributes writes across millions of distinct customer IDs, and prevents any single key from accumulating more than 20 GB unless a single customer generates an extraordinary volume. The geographic skew (80% US) does not affect this design because customerId is independent of country; each customer gets their own logical partition. Risks: a B2B enterprise customer with millions of sub-orders could hit the 20 GB ceiling; mitigate with a customerId_year bucket key if any customer is projected to exceed 15 GB. Day-one monitoring: enable per-partition normalized RU consumption in Azure Monitor, set an alert at 70% utilization per partition key range, and add x-ms-documentdb-query-metrics logging to your application layer to detect cross-partition fan-out queries early.
Written by Abstract Algorithms (@abstractalgorithms)