
CosmosDB Partition Internals: Logical vs Physical Partitions Explained

How CosmosDB maps your partition key to physical storage, and why the wrong choice costs 10x more at scale

Abstract Algorithms
19 min read

AI-assisted content.

## 🔥 When Your Database Bill Triples Overnight

A retail engineering team ships a flash-sale feature. Traffic spikes 10x. Their Azure CosmosDB bill triples within 24 hours. Queries that ran in 5ms now take 800ms. The on-call engineer bumps provisioned RU/s from 10,000 to 40,000, and the throttling barely improves.

The problem is not throughput headroom. It's that 90% of all reads hit a single partition key: status = "active". No matter how many RU/s they add to the account, those requests all funnel into one logical partition, which maps to one physical partition, which has a hard ceiling of 10,000 RU/s.

This is the hot partition problem, and avoiding it starts with understanding CosmosDB's two-layer partition model: logical partitions (which you design) and physical partitions (which CosmosDB manages). Get the distinction wrong at schema design time, and you cannot fix it without a full data migration. Get it right, and you get near-linear horizontal scalability with sub-10ms reads at any traffic level.

## 📖 CosmosDB Partitioning: Why Two Layers Exist

CosmosDB is a globally distributed, multi-model database designed to scale horizontally without the operational burden of manual sharding. To do this, it splits data across nodes, but it exposes two conceptually distinct partition layers rather than one.

The logical partition is your contract with the database: you declare a partition key, and CosmosDB groups every document with the same key value together. The physical partition is CosmosDB's internal implementation detail: it actually stores data in replica sets distributed across nodes, and it manages those replica sets on your behalf.

This two-layer design gives you control over data placement (so queries stay in-partition) while giving CosmosDB freedom to redistribute load and grow capacity (so you don't need to re-shard manually). Understanding what each layer can and cannot do is the foundation of every correct CosmosDB data model.

๐Ÿ” Two Layers of Partitioning โ€” One You Control, One You Don't

CosmosDB partitions data across nodes to achieve horizontal scalability. It does this with two distinct abstractions that operate at different layers:

| Concept | Who controls it | Application visibility | Hard size limit |
|---|---|---|---|
| Logical Partition | You (via partition key) | Fully visible | 20 GB |
| Physical Partition | CosmosDB internally | Mostly hidden | ~50 GB |

Think of logical partitions as the filing system you design: you label every folder (partition key value). Physical partitions are the filing cabinets CosmosDB builds; it decides which folders go in which cabinet, and it can add cabinets automatically as you fill them.

### What Is a Logical Partition?

A logical partition is the set of all documents that share the same partition key value. You define the partition key when creating a container; it is immutable after creation.

  • All documents with customerId = "cust-123" form one logical partition.
  • Maximum size: 20 GB per logical partition value, a hard ceiling that cannot be raised.
  • Maximum throughput: approximately 10,000 RU/s per logical partition.
  • All documents in a logical partition are co-located on the same physical storage, which enables efficient single-partition queries without cross-node fan-out.

Logical partitions are permanent groupings. Once a document is written, its logical partition assignment is sealed by its partition key value. You cannot reassign a document to a different logical partition without deleting and re-inserting it.

### What Is a Physical Partition?

A physical partition is the actual server-side storage and compute unit. CosmosDB creates, manages, and splits physical partitions entirely on your behalf.

  • Each physical partition is backed by a replica set of four nodes for durability and high availability.
  • A single physical partition holds one or more logical partitions.
  • Physical partitions grow to approximately 50 GB and can serve up to 10,000 RU/s.
  • When a physical partition becomes too large or receives too much traffic, CosmosDB automatically splits it into two physical partitions, with zero downtime.

You never create, delete, or explicitly manage physical partitions. They are CosmosDB's internal elasticity mechanism.

โš™๏ธ How CosmosDB Routes Requests: The Hash-Range Mapping

The routing layer between your application and physical storage is a partition key hash space spanning from 0x0000 to 0xFFFF. CosmosDB hashes each document's partition key value into this range. Physical partitions own contiguous slices of the hash space, and logical partitions land in the physical partition that owns their hash.

```mermaid
graph TD
    A[Client Request] --> B[CosmosDB Gateway]
    B --> C[Partition Key Router]
    C --> D{Hash of Partition Key}
    D --> E[Physical Partition 1 - Range 0x0000 to 0x3FFF]
    D --> F[Physical Partition 2 - Range 0x4000 to 0x7FFF]
    D --> G[Physical Partition 3 - Range 0x8000 to 0xBFFF]
    D --> H[Physical Partition 4 - Range 0xC000 to 0xFFFF]
    E --> I[LP: userId-A, userId-B, userId-C]
    F --> J[LP: userId-D, userId-E]
    G --> K[LP: userId-F, userId-G, userId-H]
    H --> L[LP: userId-J, userId-K]
```


This diagram shows how a client request reaches a physical partition. The partition key is hashed into the `0x0000-0xFFFF` range. Each physical partition owns a contiguous band of that range and holds the logical partitions whose hashes land there. Multiple logical partitions can share one physical partition, but a single logical partition always lives entirely within one physical partition. This atomic boundary is the critical architectural invariant: logical partitions are indivisible.
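
The routing above can be sketched as a small simulation. This is illustrative only: it uses Java's `String.hashCode()` where CosmosDB actually uses a Murmur3 hash over the partition key value, and a fixed four-partition layout rather than a dynamic one.

```java
import java.util.List;

// Illustrative sketch of hash-range routing. NOT the real CosmosDB
// implementation: the service hashes with Murmur3, not hashCode().
public class PartitionRouter {
    // Each physical partition owns a contiguous band of the 0x0000-0xFFFF hash space.
    record HashRange(int start, int end) {}

    static final List<HashRange> PHYSICAL_PARTITIONS = List.of(
        new HashRange(0x0000, 0x3FFF),
        new HashRange(0x4000, 0x7FFF),
        new HashRange(0x8000, 0xBFFF),
        new HashRange(0xC000, 0xFFFF)
    );

    // Map a partition key value into the 16-bit hash space.
    static int hashKey(String partitionKeyValue) {
        return Math.floorMod(partitionKeyValue.hashCode(), 0x10000);
    }

    // A logical partition lands in exactly one physical partition: the one
    // whose band contains its hash. It can never span two bands.
    static int route(String partitionKeyValue) {
        int h = hashKey(partitionKeyValue);
        for (int i = 0; i < PHYSICAL_PARTITIONS.size(); i++) {
            HashRange r = PHYSICAL_PARTITIONS.get(i);
            if (h >= r.start() && h <= r.end()) return i;
        }
        throw new IllegalStateException("unreachable: bands cover the full space");
    }
}
```

The key property the sketch demonstrates: routing is a pure function of the key value, so every document with the same partition key value deterministically lands on the same physical partition.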

## 🧠 Partition Splitting Internals: How CosmosDB Scales Without Downtime

### The Internals of a Split

CosmosDB triggers a physical partition split under two conditions:

1. **Data volume:** The physical partition approaches ~50 GB.
2. **Throughput scale-up:** You increase provisioned or autoscale RU/s and CosmosDB needs to spread load across more partitions to fulfill the new capacity.

```mermaid
graph TD
    A[Physical Partition P1 - 45GB - 8000 RU/s] -->|Split triggered| B[Physical Partition P1a - 22GB - 4000 RU/s]
    A -->|Split triggered| C[Physical Partition P1b - 23GB - 4000 RU/s]
    B --> D[Logical Partitions LP1 to LP5]
    C --> E[Logical Partitions LP6 to LP10]
```

When a split occurs, CosmosDB divides the hash range of the original physical partition into two equal halves and migrates each half's logical partitions to one of the two resulting physical partitions. Data migration happens in the background while reads and writes continue; the gateway layer transparently redirects each request to the correct post-split target.

What you actually observe during a split: brief latency spikes (typically under one second), no data loss, and no application code change required. The entire operation is orchestrated by the CosmosDB control plane and is invisible to your application except for transient retry events.
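
The split mechanics can be modeled in a few lines: halve the hash band and move each logical partition, whole, to whichever half contains its hash. Again a simplified sketch with a stand-in hash function, not CosmosDB's internal code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified model of a physical partition split. Logical partitions move
// atomically; none is ever divided between the two halves.
public class PartitionSplit {
    // Stand-in for the real Murmur3 hash over the partition key value.
    static int hash(String partitionKeyValue) {
        return Math.floorMod(partitionKeyValue.hashCode(), 0x10000);
    }

    // Split the band [start, end] at its midpoint and reassign each logical
    // partition to the half that owns its hash.
    static Map<String, List<String>> split(int start, int end, List<String> logicalPartitions) {
        int mid = start + (end - start) / 2;
        List<String> lower = new ArrayList<>();  // new partition "P1a": [start, mid]
        List<String> upper = new ArrayList<>();  // new partition "P1b": (mid, end]
        for (String lp : logicalPartitions) {
            (hash(lp) <= mid ? lower : upper).add(lp);
        }
        return Map.of("P1a", lower, "P1b", upper);
    }
}
```

Note what this implies for hot partitions: a hot logical partition ends up entirely in one half, so splitting never dilutes the heat of a single key value.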

### Performance Analysis During a Split

Physical partition splits are generally transparent, but they do have measurable performance effects worth understanding for capacity planning:

  • RU/s temporarily doubles: During migration, both the original and the new physical partition may serve reads simultaneously, meaning the total RU capacity briefly doubles before the old partition drains.
  • Write latency increases transiently: Writes are coordinated between the old and new partition during data migration. The spike is typically under 200ms but can surface in P99 latency dashboards.
  • Partition count multiplies on scale-up: When you double provisioned RU/s, CosmosDB may double the physical partition count. More partitions enable more parallelism, but only if the partition key provides sufficient cardinality to distribute load across them. A low-cardinality key (three status values) means most new physical partitions remain idle after the split.

Key implication: when autoscale raises your throughput ceiling and splits partitions to match, the extra capacity is only usable if your key can actually spread load across the new partitions.

## 📊 Hot Partition Flow: How a Bad Key Saturates Physical Storage

Physical partition splits and increased RU/s cannot save you from a poorly chosen partition key. The fundamental problem is at the logical partition layer, before CosmosDB's elasticity mechanisms even get involved.

### Scenario 1: Low-Cardinality Key

```yaml
container: "orders"
partitionKey: "/status"
values: ["active", "completed", "cancelled"]
```

Three logical partitions exist regardless of how many millions of orders you store. Ninety percent of queries target status = "active". That single logical partition and the physical partition it resides on absorb 90% of all RU/s. CosmosDB can split that physical partition, but the hot logical partition moves intact to one of the two halves; the hotness follows the key, not the partition.
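
A toy traffic model (illustrative numbers, seeded for reproducibility) makes the skew concrete: with three status values and 90% of requests on one of them, one logical partition absorbs nearly everything no matter how many physical partitions exist.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Random;

// Toy traffic model for a low-cardinality key: 90% of requests hit
// status = "active", so one logical partition takes nearly all RU/s.
public class HotPartitionDemo {
    static Map<String, Integer> requestsPerKey(int totalRequests, long seed) {
        Map<String, Integer> counts = new HashMap<>();
        Random rnd = new Random(seed);
        for (int i = 0; i < totalRequests; i++) {
            double roll = rnd.nextDouble();
            String key;
            if (roll < 0.90)      key = "active";     // the hot key
            else if (roll < 0.95) key = "completed";
            else                  key = "cancelled";
            counts.merge(key, 1, Integer::sum);
        }
        return counts;
    }
}
```

Whatever the physical layout, the "active" bucket here dwarfs the other two: that is the load profile a split cannot fix.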

### Scenario 2: Skewed Natural Key

```yaml
container: "users"
partitionKey: "/country"
```

If 80% of users are in the United States, country = "US" accumulates towards the 20 GB logical partition ceiling. Adding RU/s does not help once the ceiling is hit: further writes to that partition fail with an HTTP 403 error ("partition key reached maximum size").
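
A quick back-of-envelope check, using this scenario's skew as an assumption, shows why the failure arrives on a schedule rather than by surprise: if one key value receives 80% of all new data, the days until it hits the 20 GB wall are easy to project.

```java
// Back-of-envelope projection: days until the hottest key value hits the
// 20 GB logical partition ceiling. The inputs are assumptions you supply.
public class PartitionSizeProjection {
    static final double LOGICAL_PARTITION_CEILING_GB = 20.0;

    static double daysToCeiling(double containerGrowthGbPerDay, double hottestKeyShare) {
        double hottestKeyGbPerDay = containerGrowthGbPerDay * hottestKeyShare;
        return LOGICAL_PARTITION_CEILING_GB / hottestKeyGbPerDay;
    }
}
```

With 1 GB/day of container growth and an 80% skew, the "US" partition fills in 25 days, and no RU/s setting changes that date.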

### Even Distribution vs. Hot Partition: A Visual Contrast

```mermaid
graph TD
    subgraph Even[Good Key - userId - Even Distribution]
        PP1[Physical Partition 1 - 3333 RU/s]
        PP2[Physical Partition 2 - 3333 RU/s]
        PP3[Physical Partition 3 - 3334 RU/s]
    end
    subgraph Uneven[Bad Key - status - Hot Partition]
        HP1[Physical Partition A - 9500 RU/s THROTTLED]
        HP2[Physical Partition B - 300 RU/s]
        HP3[Physical Partition C - 200 RU/s]
    end
```


This diagram illustrates the contrast between a well-chosen and a poorly chosen partition key at the physical level. With `userId` as the key and millions of users, each physical partition handles a balanced fraction of total traffic, all operating well below the 10,000 RU/s physical partition ceiling. With `status` as the key, the `active` partition absorbs almost all traffic and saturates its host physical partition, producing 429 throttling errors even while the other two physical partitions sit nearly idle. The account-level RU/s meter shows low utilization, yet users experience failures: a counter-intuitive outcome that catches many teams off guard.

## 🧪 Practical: Choosing and Fixing Partition Keys

The goal is **high cardinality + uniform access distribution + query alignment**.

| Design Rule | Good Example | Bad Example |
|---|---|---|
| **High cardinality** | `userId` (millions of distinct values) | `status` (3-5 values) |
| **Uniform access** | Each key accessed at roughly equal rate | One key dominates all reads |
| **Query alignment** | Most queries include the partition key | Most queries omit the partition key |
| **Size headroom** | Each logical partition stays well under 20 GB | Any single key will accumulate > 20 GB |
| **Write spread** | Writes distribute across many distinct key values | All writes target the "latest" key |
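
The first two rules in the table can be checked empirically against a sample of documents before provisioning. The helper below is a hypothetical sketch, not an SDK feature: it reports a candidate key's distinct-value count and the share held by its most common value.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical pre-launch audit of a candidate partition key over sample data.
public class KeyAudit {
    record Report(int distinctValues, double maxValueShare) {}

    static Report audit(List<String> sampleKeyValues) {
        Map<String, Long> counts = new HashMap<>();
        for (String v : sampleKeyValues) counts.merge(v, 1L, Long::sum);
        long max = counts.values().stream().mapToLong(Long::longValue).max().orElse(0L);
        double share = sampleKeyValues.isEmpty() ? 0.0 : max / (double) sampleKeyValues.size();
        return new Report(counts.size(), share);
    }
}
```

A result like `distinctValues = 3, maxValueShare = 0.9` fails both the cardinality and uniformity rules before any production data is written.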

### Synthetic Partition Keys for Difficult Workloads

When your data does not have a natural high-cardinality key, build one:

**Suffix bucketing** appends a modulo bucket to an existing key:
`orderId + "_" + (epochSeconds % 50)` → 50 logical partitions per base ID

**Prefix randomization** prepends a random segment to force even hashing:
`String.valueOf(random.nextInt(100)) + "_" + entityId`

**Composite key** combines two meaningful fields:
`tenantId + "_" + userId` → cardinality multiplies across both fields

Suffix bucketing is especially important for **time-series and append-heavy workloads** where all writes naturally cluster on the most recent timestamp. Distributing writes across `sensorId_0` through `sensorId_99` prevents 100% of write traffic from hitting a single logical partition.
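
The three patterns above can be written as small helpers. These are sketches of the patterns described in this section; the helper names are illustrative and not part of any CosmosDB SDK.

```java
import java.util.Random;

// Sketches of the three synthetic partition key patterns.
public class SyntheticKeys {
    // Suffix bucketing: a modulo bucket spreads writes over `buckets` logical partitions.
    static String suffixBucketed(String baseId, long epochSeconds, int buckets) {
        return baseId + "_" + (epochSeconds % buckets);
    }

    // Prefix randomization: a random leading segment forces even hashing.
    static String prefixRandomized(String entityId, Random random, int fanout) {
        return random.nextInt(fanout) + "_" + entityId;
    }

    // Composite key: cardinality multiplies across both fields.
    static String composite(String tenantId, String userId) {
        return tenantId + "_" + userId;
    }
}
```

The trade-off: reads must now either know the bucket or fan out across all buckets of one base ID, which is still far cheaper than a container-wide fan-out.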

## โš–๏ธ Trade-offs and Failure Modes

| Scenario | Root Cause | Observable Impact | Resolution |
|---|---|---|---|
| **429 Too Many Requests** | Logical partition exceeds physical partition RU/s ceiling | Read/write throttling on all items in that physical partition | Redesign partition key; add suffix bucketing |
| **20 GB limit exceeded** | Low-cardinality key with unbounded data growth | Writes to that partition fail permanently | Container migration to a new key; cannot be done in-place |
| **Cross-partition fan-out** | Query filter omits partition key | Latency 10x-100x higher; RU cost scales with partition count | Add partition key to query filter; use secondary indexes |
| **Transient latency spikes** | Physical partition split in progress | Sub-second latency blips during autoscale | Expected behavior; implement retries with exponential backoff |
| **Uneven distribution after split** | Too few distinct partition key values | Some physical partitions overloaded post-split | Higher-cardinality key improves distribution after each split |

### The Cross-Partition Query Tax

A query without a partition key filter executes as a **fan-out**: CosmosDB sends the query to every physical partition in parallel and merges results. At 50 physical partitions, you pay 50x the RU cost of the equivalent single-partition query. Monitor the `x-ms-documentdb-query-metrics` response header: if `retrievedDocumentCount` greatly exceeds `outputDocumentCount`, you are paying for cross-partition scans. The fix is either adding the partition key to the filter or introducing a secondary index on a projected lookup field.
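
The tax is simple arithmetic, and the metrics check reduces to a ratio test. A sketch follows; the 10x threshold is an illustrative choice, not an official guideline.

```java
// Back-of-envelope model of the cross-partition fan-out tax, plus the
// scan-efficiency ratio check on query metrics. Thresholds are illustrative.
public class QueryCostCheck {
    // A fan-out query pays roughly the single-partition cost on every partition.
    static double fanOutRuCost(double singlePartitionRu, int physicalPartitionCount) {
        return singlePartitionRu * physicalPartitionCount;
    }

    // Flags queries that scanned far more documents than they returned.
    static boolean looksLikeCrossPartitionScan(long retrievedDocumentCount, long outputDocumentCount) {
        if (outputDocumentCount == 0) return retrievedDocumentCount > 0;
        return retrievedDocumentCount / (double) outputDocumentCount > 10.0;
    }
}
```

Feeding the two counters from the query-metrics header into a check like this in your application's logging layer surfaces fan-out queries long before they show up on the bill.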

## ๐ŸŒ Real-World Applications

**E-commerce at scale:** Order management systems partition by `customerId` rather than `status`. Every customer's order history lives in one logical partition (enabling O(1) "my orders" queries) while preventing any single key from accumulating unbounded data. Order IDs (GUIDs) are used as the key only when cross-customer analytics, not per-customer retrieval, is the primary access pattern.

**IoT device telemetry:** Devices emit readings continuously. A naive `timestamp` partition key creates a single hot logical partition for every second of data. Production IoT systems use `deviceId_date` (for example, `device-abc-2026-04-18`), creating thousands of logical partitions per day and spreading writes evenly across physical partitions.

**Multi-tenant SaaS:** Small tenants fit comfortably within 20 GB per `tenantId` partition. For enterprise tenants approaching the limit, suffix buckets (`tenantId_0`, `tenantId_1`, ... `tenantId_9`) distribute one logical tenant across 10 logical partitions. CosmosDB 2023 introduced **hierarchical partition keys** (up to 3 levels: `/tenantId`, `/userId`, `/sessionId`) which handle this pattern natively without manual bucketing.

**Social media feed:** Partition by `userId` so each user's posts form one logical partition. Read-your-own-writes and timeline queries stay in-partition and complete in under 5ms. Celebrity accounts with millions of followers require fan-out to read, but writes stay on one partition: an acceptable trade-off given that reads can be cached at the CDN layer.

## 🧭 Decision Guide: Selecting a Partition Key by Use Case

| Use Case | Recommended Key | Rationale |
|---|---|---|
| User profiles and activity | `userId` | High cardinality; even distribution; aligns with lookup pattern |
| Order management | `customerId` | Co-locates a customer's orders; efficient per-customer queries |
| IoT sensor telemetry | `deviceId_date` | Prevents write hot spots; keeps device history queryable |
| Multi-tenant app (small tenants) | `tenantId` | Isolates tenant data; all queries stay in-partition |
| Multi-tenant app (large tenants) | `tenantId_bucket` | Prevents 20 GB ceiling; spreads load across buckets |
| Product catalog | `productId` (GUID-based) | Uniform hashing; no natural cardinality problem |
| Audit / append-heavy logs | `sourceId_bucket` | Distributes high-frequency writes across logical partitions |
| Time-series analytics | `metricName_hour` | Balances writes; keeps related metrics queryable per hour |

## ๐Ÿ› ๏ธ Azure CosmosDB SDK: Configuring Partition Key at Container Creation

The partition key is set at container provisioning and is permanent. There is no migration path within the same container; if you choose the wrong key, you must create a new container and copy data. The SDK call that sets this irrevocable choice is simple and easy to overlook:

```java
CosmosContainerProperties props = new CosmosContainerProperties(
    "orders",
    "/customerId"   // partition key path - cannot change after creation
);

// Hierarchical partition key (subpartitioning, Java SDK 4.37+):
// PartitionKeyDefinition pkDef = new PartitionKeyDefinition();
// pkDef.setPaths(List.of("/tenantId", "/customerId"));
// pkDef.setKind(PartitionKind.MULTI_HASH);
// pkDef.setVersion(PartitionKeyDefinitionVersion.V2);
// CosmosContainerProperties props = new CosmosContainerProperties("orders", pkDef);

ThroughputProperties autoscale = ThroughputProperties.createAutoscaledThroughput(4000);

database.createContainerIfNotExists(props, autoscale).block();
```

CosmosDB automatically creates the initial set of physical partitions and handles all splits as data volume and provisioned RU/s grow. The hierarchical key variant (commented out above) allows up to three levels of nesting, co-locating tenant data while maintaining high cardinality and eliminating the need for manual suffix bucketing in most multi-tenant scenarios.

For a full deep-dive on CosmosDB SDK configuration, indexing policies, and consistency levels, see the Azure CosmosDB Java SDK v4 documentation.

## 📚 Lessons Learned from CosmosDB Partition Failures

  • The 20 GB limit is a hard wall with no operational escape hatch. Unlike throughput limits that you can resolve by raising RU/s, hitting the logical partition size ceiling causes all writes to that partition to fail. The only fix is a full container migration with a new partition key: an expensive operation requiring double storage, a data copy pipeline, and a cutover window. Validate expected data volume per key value before launch.

  • Hot partitions are invisible in account-level metrics. CosmosDB's default monitoring shows account-level RU/s consumption. A container can report 40% account utilization while one physical partition inside it is at 100% and throttling. Always monitor per-partition RU/s metrics (available in the Azure portal under "Insights → Throughput → Normalized RU consumption by partition key range") to detect hot partitions before they cause production incidents.

  • Cross-partition queries are a cost multiplier, not just a latency issue. A query without a partition key filter fans out to every physical partition. At 100 physical partitions (a realistic count for a multi-year production container), a cross-partition query costs 100x more RU/s than the equivalent in-partition query. Teams that start with 5 physical partitions rarely notice the cost. Teams that have grown to 100 experience unexplained billing spikes. Audit your query patterns whenever you significantly scale a container.

  • Partition key design is a first-class architecture decision. It determines query patterns, cost structure, and the scalability ceiling for the lifetime of that container. It deserves the same deliberate analysis as schema design in a relational database, before the first byte of production data is written.

  • Hierarchical partition keys (introduced 2023) eliminate most synthetic key hacks. If you are designing a new container for a multi-tenant or hierarchical access pattern, evaluate whether hierarchical keys make suffix bucketing unnecessary. They allow CosmosDB to co-locate data at the top-level key while maintaining the cardinality of the full composite key.

## 📌 TLDR & Key Takeaways

TLDR: In CosmosDB, logical partitions are your grouping contract (defined by partition key, max 20 GB, permanent), and physical partitions are CosmosDB's elastic storage units that hold many logical partitions and split automatically. Hot partitions, caused by low-cardinality or skewed keys, cannot be fixed by adding RU/s; they require a better partition key or synthetic suffix bucketing.

  • Logical partition = all documents sharing the same partition key value. You define it. Max 20 GB and ~10,000 RU/s per value. Permanent and indivisible.
  • Physical partition = CosmosDB's actual storage unit backed by a 4-node replica set. Holds multiple logical partitions. Splits automatically when data or throughput exceeds thresholds.
  • Routing = CosmosDB hashes your partition key into a fixed range; physical partitions own contiguous hash bands. Many logical partitions map to one physical partition, never the reverse.
  • Hot partitions arise from low-cardinality or skewed partition keys. They cannot be resolved by increasing RU/s; the bottleneck is at the logical partition level.
  • Fix hot partitions by choosing a high-cardinality key or building synthetic suffix-bucket keys to distribute data and traffic across many logical partitions.
  • Partition key is permanent: plan for worst-case data volume and access patterns before the first write. Changing it requires a full container migration.

๐Ÿ“ Practice Quiz

  1. A CosmosDB container uses status as its partition key with values: draft, published, archived. Traffic is 95% reads on status = "published". What is the primary failure mode, and why can it not be resolved by adding more RU/s?

     **Answer:** The published logical partition absorbs ~95% of all read RU/s. The physical partition hosting it hits its 10,000 RU/s ceiling and returns 429 Too Many Requests errors. Adding account-level RU/s does not help because the bottleneck is the single logical partition, which is capped at ~10,000 RU/s regardless of total account throughput. More RU/s is distributed across physical partitions, but the hot logical partition cannot spread across multiple physical partitions. The fix requires a new partition key with higher cardinality.

  2. What is the maximum data size allowed per logical partition in CosmosDB, and what happens when it is exceeded?

     **Answer:** 20 GB per logical partition. Exceeding this limit causes all writes to that logical partition to fail with an HTTP 403 error ("partition key reached maximum size"). Unlike throughput limits, this ceiling cannot be raised by increasing RU/s or adding capacity. The only resolution is migrating the container to a new partition key.

  3. True or False: A single logical partition can span multiple physical partitions.

     **Answer:** False. A logical partition always lives entirely within one physical partition. Physical partitions can contain many logical partitions, but the reverse is never true. Logical partitions are atomic units that move as a whole when physical partitions split.

  4. An IoT system records sensor readings every second for 50,000 sensors. The team proposes timestamp as the partition key. What problem will they encounter, and what is the recommended fix?

     **Answer:** Using timestamp as a partition key concentrates all current writes into a single "latest" logical partition (one per second). Every write from every sensor in that second hits the same partition, causing a write hot spot. The fix is a synthetic composite key such as sensorId + "_" + (epochSeconds % 100). This distributes writes across 100 logical partition groups per sensor while keeping each sensor's recent history queryable with a prefix scan. Alternatively, hierarchical partition keys (/sensorId, /date) achieve the same result natively.

  5. CosmosDB splits a physical partition. Describe what happens to the logical partitions inside it and what the application observes.

     **Answer:** The physical partition's hash range is divided into two halves. Each logical partition, based on its hash position, is assigned in its entirety to one of the two resulting physical partitions. Logical partitions are never split; they move atomically. The application experiences brief sub-second latency spikes as the gateway layer updates its routing table. No data is lost and no application code change is required. The split completes transparently in the background while reads and writes continue.

  6. A multi-tenant SaaS product stores data for 10,000 customers. One enterprise customer accounts for 30% of all data. The container uses tenantId as the partition key. What risk exists, and how would you mitigate it?

     **Answer:** The enterprise tenant's logical partition (tenantId = "enterprise-corp") may approach or exceed the 20 GB hard limit as data grows. The mitigation is suffix bucketing: replace the tenantId key with a composite key like tenantId_bucket where bucket is computed as (documentId.hashCode() % 10). This creates 10 logical partitions for that tenant, each capable of holding 20 GB, a 10x increase in effective capacity. Alternatively, use CosmosDB hierarchical partition keys (/tenantId, /entityId) which achieve the same distribution natively without manual bucketing.

  7. Open-ended challenge: Your team is designing a new CosmosDB container for a global e-commerce platform. You expect 500 million orders, with 80% of users in the US and 20% spread across 40 other countries. The most common query is "get all orders for a customer." Propose a partition key strategy, identify the risks in your design, and explain what monitoring you would put in place on day one.

     **Answer:** Use customerId as the partition key. This aligns with the primary query pattern ("get all orders for customer X"), distributes writes across millions of distinct customer IDs, and prevents any single key from accumulating more than 20 GB unless a single customer generates an extraordinary volume. The geographic skew (80% US) does not affect this design because customerId is independent of country; each customer gets their own logical partition. Risks: a B2B enterprise customer with millions of sub-orders could hit the 20 GB ceiling; mitigate with a customerId_year bucket key if any customer is projected to exceed 15 GB. Day-one monitoring: enable per-partition normalized RU consumption in Azure Monitor, set an alert at 70% utilization per partition key range, and add x-ms-documentdb-query-metrics logging to your application layer to detect cross-partition fan-out queries early.
Written by Abstract Algorithms (@abstractalgorithms)