System Design HLD Example: File Storage and Sync (Google Drive and Dropbox)
Build an HLD for file storage, metadata consistency, and cross-device synchronization.
TLDR: Design a cloud file storage and sync system like Dropbox. File storage and sync systems separate blob durability from metadata correctness and conflict resolution. This article follows a standard system design interview template: use cases, requirements, estimations, design goals, HLD, and a design deep dive.
Dropbox serves 500 million registered users who edit files simultaneously across desktop, mobile, and web clients. The hard problem is not storing bytes: object stores handle petabyte durability reliably. The real complexity is metadata consistency: if a user edits a file offline on a laptop while the same file is modified on a phone, the system must detect the conflict, preserve both versions, and let the user resolve it without any data loss.
Designing a file sync system teaches you how to separate blob durability from metadata correctness, a separation pattern that recurs in object databases, CDNs, and distributed filesystems at every scale.
By the end of this walkthrough you'll know why files are split into 4 MB chunks before upload (enabling resumable transfers and content-addressed deduplication across accounts), why metadata commits require version vectors to detect concurrent edit conflicts, and why the sync event bus fans out change deltas rather than full file states to avoid bandwidth amplification on every edit.
📋 Use Cases
Actors
- End users consuming the primary product surface.
- Producer entities that create or update domain content.
- Platform services enforcing policy, routing, and reliability controls.
Use Cases
- Primary interview prompt: Design a cloud file storage and sync system like Dropbox.
- Core user journeys: Upload and download, chunking, metadata commits, versioning, sharing, and multi-device sync.
- Read and write paths are explained separately so bottlenecks and consistency boundaries are explicit.
This template starts with actors and use cases because architecture only makes sense when user behavior and workload shape are clear. In interviews, this section prevents random tool selection and keeps the answer grounded in business outcomes.
📝 Functional Requirements
In Scope
- Support the core product flow end-to-end with clear API contracts.
- Preserve business correctness for critical operations.
- Expose reliable read and write interfaces with predictable behavior.
- Support an incremental scaling path instead of requiring a redesign.
Out of Scope (v1 boundary)
- Full global active-active writes across every region.
- Heavy analytical workloads mixed into latency-critical request paths.
- Complex personalization experiments in the first architecture version.
Functional Breakdown
- Prompt: Design a cloud file storage and sync system like Dropbox.
- Focus: Upload and download, chunking, metadata commits, versioning, sharing, and multi-device sync.
- Initial building-block perspective: Upload gateway, chunk store, metadata DB, sync event bus, conflict resolver, CDN read path.
A strong answer names non-goals explicitly. Interviewers use this to judge prioritization quality and architectural maturity under time constraints.
⚖️ Non-Functional Requirements
| Dimension | Target | Why it matters |
| --- | --- | --- |
| Scalability | Horizontal scale across services and workers | Handles growth without rewriting core flows |
| Availability | 99.9% baseline with path to 99.99% | Reduces user-visible downtime |
| Performance | Clear p95 and p99 latency SLOs | Avoids average-latency blind spots |
| Consistency | Explicit strong vs eventual boundaries | Prevents hidden correctness defects |
| Operability | Metrics, logs, traces, and runbooks | Speeds incident isolation and recovery |
Non-functional requirements are where many designs fail in practice. Naming measurable targets and coupling architecture decisions to those targets is far more useful than listing technologies.
🧠 Deep Dive: Estimations and Design Goals
The Internals
- Service boundaries should align with ownership and deployment isolation.
- Data model choices should follow access patterns, not default preferences.
- Retries, idempotency, and timeout budgets must be explicit before scale.
- Dependency failure behavior should be defined before incidents happen.
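The "dependency failure behavior" bullet can be made concrete with a minimal circuit-breaker sketch: after N consecutive failures the breaker opens and calls fail fast until a cooldown elapses. The threshold and cooldown values below are illustrative assumptions, not tuned production numbers.

```java
// Minimal circuit breaker: fail fast while open, probe again after a cooldown.
public class CircuitBreaker {
    private final int failureThreshold;
    private final long cooldownMs;
    private int consecutiveFailures = 0;
    private long openedAt = -1; // -1 means the breaker is closed

    CircuitBreaker(int failureThreshold, long cooldownMs) {
        this.failureThreshold = failureThreshold;
        this.cooldownMs = cooldownMs;
    }

    boolean allowRequest(long nowMs) {
        if (openedAt < 0) return true;                   // closed: let traffic through
        if (nowMs - openedAt >= cooldownMs) return true; // half-open: allow one probe
        return false;                                    // open: fail fast
    }

    void recordSuccess() {
        consecutiveFailures = 0;
        openedAt = -1;
    }

    void recordFailure(long nowMs) {
        if (++consecutiveFailures >= failureThreshold) openedAt = nowMs;
    }
}
```

The point of defining this before incidents happen is that the fail-fast budget (threshold, cooldown) becomes a reviewed configuration value instead of an on-call improvisation.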
Estimations
Use structured rough-order numbers in interviews:
- Read and write throughput (steady and peak).
- Read/write ratio and burst amplification factor.
- Typical payload size and large-object edge cases.
- Daily storage growth and retention horizon.
- Cache memory for hot keys and frequently accessed entities.
| Estimation axis | Question to answer early |
| --- | --- |
| Read QPS | Which read path saturates first at 10x? |
| Write QPS | Which state mutation becomes the first bottleneck? |
| Storage growth | When does repartitioning become mandatory? |
| Memory envelope | What hot set must remain in memory? |
| Network profile | Which hops create the highest latency variance? |
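A worked back-of-envelope pass makes these axes concrete. The numbers below are interview-style assumptions (10% of 500M registered users active daily, 2 file edits per active user per day, 500 KB average edited file), not Dropbox's real figures:

```java
// Back-of-envelope sizing for the sync write path, under stated assumptions.
public class SyncEstimations {
    // Average metadata-commit QPS; peak traffic is often 3-5x the average.
    static long writeQps(long dailyActiveUsers, long editsPerUserPerDay) {
        long editsPerDay = dailyActiveUsers * editsPerUserPerDay;
        return editsPerDay / 86_400; // seconds per day
    }

    // Raw daily upload volume before content-addressed deduplication.
    static long dailyUploadBytes(long dailyActiveUsers, long editsPerUserPerDay,
                                 long avgFileBytes) {
        return dailyActiveUsers * editsPerUserPerDay * avgFileBytes;
    }

    public static void main(String[] args) {
        long dau = 50_000_000L; // assumed 10% of 500M registered users
        long qps = writeQps(dau, 2);                      // ~1,157 commits/sec average
        long tbPerDay = dailyUploadBytes(dau, 2, 500_000L) / 1_000_000_000_000L;
        System.out.println(qps + " write QPS, " + tbPerDay + " TB/day before dedup");
    }
}
```

Even rough numbers like these immediately identify the metadata DB write path and daily blob-storage growth as the two axes to interrogate at 10x.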
Design Goals
- Keep synchronous user-facing paths short and deterministic.
- Shift heavy side effects and fan-out work to asynchronous channels.
- Minimize coupling between control-plane and data-plane components.
- Introduce complexity in phases tied to measurable bottlenecks.
Performance Analysis
| Pressure point | Symptom | First response | Second response |
| --- | --- | --- | --- |
| Hot partitions | Tail latency spikes | Key redesign | Repartition by load |
| Cache churn | Miss storms | TTL and key tuning | Multi-layer caching |
| Async backlog | Delayed downstream work | Worker scale-out | Priority queues |
| Dependency instability | Timeout cascades | Fail-fast budgets | Degraded fallback mode |
Metrics that should drive architecture evolution:
- p95 and p99 latency by operation.
- Error-budget burn by service and endpoint.
- Queue lag, retry volume, and dead-letter trends.
- Cache hit ratio by key family.
- Partition or shard utilization skew.
🏗️ High Level Design - Architecture for Functional Requirements
Building Blocks
- Upload gateway, chunk store, metadata DB, sync event bus, conflict resolver, CDN read path.
- API edge layer for authentication, authorization, and policy checks.
- Domain services for read and write responsibilities.
- Durable storage plus cache for fast retrieval and controlled consistency.
- Async event path for secondary processing and integrations.
Design the APIs
- Keep contracts explicit and version-friendly.
- Use idempotency keys for retriable writes.
- Return actionable error metadata for clients and retries.
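The idempotency-key bullet can be sketched in a few lines: replaying the same key returns the stored result instead of re-executing the side effect. This is a minimal in-memory sketch; in production the key-to-result map would live in a durable store with a TTL, and the names here (`IdempotentWrites`, `execute`) are illustrative, not a real API.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Supplier;

// Idempotent write handling: at-most-once side effects per idempotency key.
public class IdempotentWrites {
    private final Map<String, String> completed = new ConcurrentHashMap<>();

    // 'commit' stands in for the side-effecting operation, e.g. a metadata commit.
    public String execute(String idempotencyKey, Supplier<String> commit) {
        // computeIfAbsent runs the commit only on first sight of the key;
        // retries with the same key get the recorded result back.
        return completed.computeIfAbsent(idempotencyKey, k -> commit.get());
    }
}
```

This is what makes client retries safe: a mobile client that times out and resends an upload-commit with the same key cannot create a duplicate file version.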
Communication Between Components
- Synchronous path for user-visible confirmation.
- Asynchronous path for fan-out, indexing, notifications, and analytics.
Data Flow
- Chunk upload -> object persistence -> metadata commit -> sync events -> device delta apply.
```mermaid
flowchart TD
    A[Client or Producer] --> B[API and Policy Layer]
    B --> C[Core Domain Service]
    C --> D[Primary Data Store and Cache]
    C --> E[Async Event or Job Queue]
    D --> F[User-Facing Response]
    E --> G[Workers and Integrations]
    G --> H[State Update and Telemetry]
```
🌍 Real-World Applications and API Mapping
This architecture pattern appears in real production systems because traffic is bursty, dependencies fail partially, and correctness requirements vary by operation type.
Practical API mapping examples:
- POST /resources for write operations with idempotency support.
- GET /resources/{id} for low-latency object retrieval.
- GET /resources?cursor= for scalable pagination and stable traversal.
- Async event emissions for indexing, notifications, and reporting.
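The cursor-pagination endpoint above deserves a concrete sketch, because it is why cursors beat offsets at scale: the cursor is the last-seen sort key, so pages stay stable even when rows are inserted or deleted between requests. The in-memory list and `Row` record below stand in for an indexed database query; this is a sketch, not a specific library's API.

```java
import java.util.List;

// Cursor pagination: resume strictly after the last-seen id.
public class CursorPage {
    record Row(long id, String name) {}

    // sortedById stands in for "SELECT ... WHERE id > :cursor ORDER BY id LIMIT :limit"
    static List<Row> page(List<Row> sortedById, long afterId, int limit) {
        return sortedById.stream()
                .filter(r -> r.id() > afterId) // WHERE id > :cursor
                .limit(limit)                  // LIMIT :limit
                .toList();
    }
}
```

The client carries `afterId` from the last row of each response as the next cursor; unlike `OFFSET`, no rows are skipped or repeated under concurrent writes.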
Real-world system behavior is defined during failure, not normal operation. Good designs clearly specify what can be stale, what must be exact, and what should fail fast to preserve reliability.
⚖️ Trade-offs & Failure Modes (Design Deep Dive for Non-Functional Requirements)
Scaling Strategy
- Scale stateless services horizontally behind load balancing.
- Partition stateful data by access-pattern-aware keys.
- Add queue-based buffering where write bursts exceed synchronous capacity.
Availability and Resilience
- Multi-instance deployment across failure domains.
- Replication and failover planning for stateful systems.
- Circuit breakers, retries with backoff, and bounded timeouts.
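The "retries with backoff" bullet is worth one concrete formula, because naive fixed-interval retries amplify outages. Below is a capped exponential backoff with full jitter; the base and cap values are assumptions for illustration:

```java
import java.util.concurrent.ThreadLocalRandom;

// Capped exponential backoff with full jitter: delay drawn uniformly from
// [0, min(cap, base * 2^attempt)], which spreads retry storms across time.
public class Backoff {
    static long delayMillis(int attempt, long baseMs, long capMs) {
        // Clamp the shift so the exponential term cannot overflow a long.
        long exp = Math.min(capMs, baseMs * (1L << Math.min(attempt, 20)));
        return ThreadLocalRandom.current().nextLong(exp + 1); // full jitter
    }
}
```

Paired with a bounded total timeout budget per request, this keeps a failing dependency from recruiting every client into a synchronized retry wave.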
Storage and Caching
- Cache-aside for read-heavy access paths.
- Explicit invalidation and refresh policy.
- Tiered storage for hot, warm, and cold access profiles.
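The cache-aside bullet maps to a small, well-known read/write discipline: read through the cache, fall back to the store on a miss, and invalidate (rather than update) the cache on writes. The `HashMap` fields below stand in for Redis/Memcached and the metadata DB; this is a pattern sketch, not a client library's API.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Cache-aside: the application, not the cache, owns the load-on-miss logic.
public class CacheAside {
    private final Map<String, String> cache = new HashMap<>();

    public String get(String key, Function<String, String> loadFromStore) {
        // Miss -> load from the source of truth and populate the cache.
        return cache.computeIfAbsent(key, loadFromStore);
    }

    public void put(String key, String value, Map<String, String> store) {
        store.put(key, value); // write to the source of truth first
        cache.remove(key);     // then invalidate, forcing a fresh read next time
    }
}
```

Invalidate-on-write is chosen over update-on-write because it cannot leave the cache holding a value newer writes have already replaced in the store.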
Consistency, Security, and Monitoring
- Clear strong vs eventual consistency contracts per operation.
- Authentication, authorization, and encryption in transit and at rest.
- Monitoring stack with metrics, logs, traces, SLO dashboards, and alerting.
This section is the architecture-for-NFRs view from your template. It explains how the system remains stable under scale, failures, and incident pressure.
🧭 Decision Guide
| Situation | Recommendation |
| --- | --- |
| Early stage with moderate traffic | Keep architecture minimal and highly observable |
| Read-heavy workload dominates | Optimize cache and read model before complex rewrites |
| Write hotspots appear | Rework key strategy and partitioning plan |
| Incident frequency increases | Strengthen SLOs, runbooks, and fallback controls |
🧪 Practical Example for Interview Delivery
A repeatable way to deliver this design in interviews:
- Start with actors, use cases, and scope boundaries.
- State estimation assumptions (QPS, payload size, storage growth).
- Draw HLD and explain each component responsibility.
- Walk through one failure cascade and mitigation strategy.
- Describe phase-based evolution for 10x traffic.
Question-specific practical note:
- Use chunk deduplication, atomic metadata versioning, and incremental sync events for clients.
A concise closing sentence that works well: "I would launch with this minimal architecture, monitor p95 latency, error-budget burn, and queue lag, then scale the first saturated component before adding further complexity."
🚀 Advanced Concepts for Production Evolution
When interviewers ask follow-up scaling questions, use a phased approach:
- Stabilize critical path dependencies with better observability.
- Increase throughput by isolating heavy side effects asynchronously.
- Reduce hotspot pressure through key redesign and repartitioning.
- Improve resilience using automated failover and tested runbooks.
- Expand to multi-region only when latency, compliance, or reliability targets require it.
This framing demonstrates that architecture decisions are tied to measurable outcomes, not architecture fashion trends.
🛠️ Content-Addressed Keys and 4 MB Chunking: The Two Decisions Behind Dropbox-Scale Storage
Two decisions dominate the file storage architecture: content-addressed object keys (the SHA-256 hash of the file bytes becomes the S3 key, enabling automatic cross-account deduplication) and 4 MB chunk boundaries (enabling resumable uploads, so a retry from chunk N never re-uploads chunks 0 through N-1).
```java
// Content-addressed key: identical bytes from any user map to the same S3 object.
// Two users uploading the same 50 MB file store only one object in S3.
String contentHash = DigestUtils.sha256Hex(file.getInputStream()); // Apache Commons Codec
String objectKey = "blobs/" + contentHash; // no per-user prefix, or cross-account dedup breaks

// Deduplication check: if this hash already exists, skip the upload entirely.
// Per-user ownership is still recorded in the metadata DB.
if (s3ObjectExists(bucket, objectKey)) {
    metadata.save(new FileRecord(userId, objectKey, contentHash, file.getSize()));
    return objectKey; // zero-byte upload: content-addressed dedup in action
}

// Chunked multipart upload: fixed chunk boundaries make transfers resumable.
// A network failure at chunk 7 means only chunk 7 is retried, not the full file.
s3.createMultipartUpload(req -> req.bucket(bucket).key(objectKey)
        .serverSideEncryption(ServerSideEncryption.AES256));
// ... upload fixed-size parts (S3 requires at least 5 MB per part except the last),
// then complete the upload; see the AWS SDK UploadPart / CompleteMultipartUpload docs.
```
The key insight is that the object key is the architectural decision: hash(content) replaces userId + timestamp + filename. This single change eliminates duplicate storage across all accounts at near-zero application-layer cost. Chunk size is equally intentional: Dropbox's client-side chunks are 4 MB, while S3 multipart uploads require parts of at least 5 MB (except the final part), so the exact size differs by layer. The fixed boundary, not the specific number, is what enables resumability and per-chunk deduplication.
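The boundary argument above can be shown directly: splitting content at fixed offsets and hashing each chunk means only chunks whose hash changed need re-upload. A minimal sketch (the chunk size is parameterized so the boundary logic is easy to exercise; class and method names are illustrative):

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HexFormat;
import java.util.List;

// Fixed-boundary chunking with per-chunk SHA-256 hashes. Chunks with an
// unchanged hash are skipped on retry/re-sync; only changed chunks move.
public class Chunker {
    static List<String> chunkHashes(byte[] content, int chunkSize) {
        try {
            List<String> hashes = new ArrayList<>();
            for (int off = 0; off < content.length; off += chunkSize) {
                int len = Math.min(chunkSize, content.length - off);
                MessageDigest d = MessageDigest.getInstance("SHA-256");
                d.update(content, off, len);
                hashes.add(HexFormat.of().formatHex(d.digest()));
            }
            return hashes;
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("SHA-256 is required by the JVM spec", e);
        }
    }
}
```

In the real system `chunkSize` would be the 4 MB constant; the hash list doubles as the file's chunk manifest in the metadata DB, which is what makes delta sync a manifest diff rather than a byte diff.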
MinIO (an S3-compatible open-source object store) accepts the same S3Client code pointed at http://localhost:9000, with no code changes for local development or on-premise deployments.
For a full deep-dive on conflict resolution via version vectors, metadata commit atomicity, and multi-device delta sync, a dedicated follow-up post is planned.
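As a preview of that conflict-resolution discussion, here is a minimal sketch of the version-vector comparison at the heart of concurrent-edit detection. Each device keeps a counter per device id; if neither vector dominates the other, the edits were concurrent and the resolver must preserve both versions (e.g. as a "conflicted copy"). Names are illustrative:

```java
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Version-vector comparison: detect whether one edit happened-before the
// other, or whether they were concurrent (a true conflict).
public class VersionVectors {
    enum Order { BEFORE, AFTER, EQUAL, CONCURRENT }

    static Order compare(Map<String, Integer> a, Map<String, Integer> b) {
        boolean aAhead = false, bAhead = false;
        Set<String> devices = new HashSet<>(a.keySet());
        devices.addAll(b.keySet());
        for (String d : devices) {
            int av = a.getOrDefault(d, 0), bv = b.getOrDefault(d, 0);
            if (av > bv) aAhead = true;
            if (bv > av) bAhead = true;
        }
        if (aAhead && bAhead) return Order.CONCURRENT; // conflict: keep both versions
        if (aAhead) return Order.AFTER;                // a supersedes b
        if (bAhead) return Order.BEFORE;               // b supersedes a
        return Order.EQUAL;
    }
}
```

This is exactly the offline-laptop-vs-phone scenario from the introduction: {laptop: 2, phone: 1} versus {laptop: 1, phone: 2} compares as CONCURRENT, so neither edit may silently overwrite the other.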
🎓 Lessons Learned
- Start with actors and use cases before drawing any diagram.
- Define in-scope and out-of-scope boundaries to prevent architecture sprawl.
- Convert NFRs into measurable SLO-style targets.
- Separate functional HLD from non-functional deep dive reasoning.
- Scale the first measured bottleneck, not the most visible component.
📌 TLDR: Summary & Key Takeaways
- Template-aligned answers are clearer, faster to evaluate, and easier to communicate.
- Good HLDs explain both request flow and state update flow.
- Non-functional architecture determines reliability under pressure.
- Phase-based evolution outperforms one-shot overengineering.
- Theory-linked reasoning improves consistency across different interview prompts.
❓ Practice Quiz
- Why should system design answers begin with actors and use cases?
A) To avoid architecture work entirely
B) To anchor architecture decisions to workload and user behavior
C) To skip non-functional requirements
Correct Answer: B
- Which section should define p95 and p99 targets?
A) Non Functional Requirements
B) Only the quiz section
C) Only the related posts section
Correct Answer: A
- What is the primary benefit of separating synchronous and asynchronous paths?
A) It removes all consistency trade-offs
B) It isolates latency-critical user flows from heavy side effects
C) It eliminates monitoring needs
Correct Answer: B
- Open-ended challenge: for this design, which component would you scale first at 10x traffic and which metric would you use to justify that decision?
Written by Abstract Algorithms (@abstractalgorithms)