System Design: Designing an Autonomous AI Coding Agent (Devin at Scale)
Learn how to design a secure, isolated, and scalable high-level system for an autonomous AI software engineer at enterprise scale.

Abstract Algorithms
Helping engineers master software engineering topics.
TLDR: Designing an autonomous AI coding agent at scale is not a prompt engineering task; it is a complex systems problem. The system requires secure multitenancy via Firecracker microVMs, a low-latency workspace syncing event loop, and a dual planning-execution engine backed by durable state to survive compile-and-retry loops.
Part 1: Approach - Requirements and Scale Challenges
When building an autonomous AI developer like Devin or SWE-agent, developers often start with a simple script that calls the OpenAI API, reads local files, runs local compilers, and writes code.
In production, this naive setup fails immediately:
- The Infinite Loop Resource Burn: An LLM planner gets stuck in a compile-fail-retry cycle, compiling a broken package 500 times in 2 minutes, consuming all system CPU and incurring thousands of dollars in API costs.
- Sandbox Security Escape: An agent writes a Python script that runs malicious code locally, deletes files, or attempts to port-scan other nodes in the network.
- Workspace Sync Bottleneck: If the workspace repository size is over 5GB, copying files between the agent orchestrator node and the compiler sandbox takes minutes, destroying the interaction loop.
To prevent this, we must design an enterprise-grade system that isolates execution, tracks costs, manages task cycles, and synchronizes workspace code at sub-second speeds.
Use Cases & Requirements
Actors
- Developer User: Initiates tasks (e.g., "Fix issue #123"), inspects logs, and reviews code diffs.
- Agent Orchestrator: The central state coordinator that schedules jobs, runs the planning loop, and manages user context.
- Sandbox Supervisor: The secure infrastructure component that spins up, monitors, and terminates execution environments.
- Model Inference Provider: Returns completions and structured tool calls to the orchestrator.
Functional Requirements
- Initialize Job: Users can clone a git repository, set context variables, and trigger a coding task.
- Execute Terminal Commands: The agent can run arbitrary shell commands inside an isolated environment and capture console stdout/stderr.
- Edit Code Files: The agent can view, search, and edit files inside the workspace repository.
- Pause/Resume for Feedback: If the agent needs help (e.g., API key missing) or requires approval for a commit, the execution loop pauses and resumes upon user response.
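The pause/resume requirement implies that a job moves through a small set of states. The sketch below is one possible way to guard those moves, using the status names that appear later in the `agent_jobs` table. The `ALLOWED` transition map and the `transition` helper are assumptions made for illustration, not part of the original design.

```python
from enum import Enum


class JobStatus(str, Enum):
    PLANNING = "PLANNING"
    RUNNING = "RUNNING"
    PAUSED = "PAUSED"
    COMPLETED = "COMPLETED"
    FAILED = "FAILED"


# Assumed transition map: terminal states (COMPLETED, FAILED) allow no further moves.
ALLOWED = {
    JobStatus.PLANNING: {JobStatus.RUNNING, JobStatus.PAUSED, JobStatus.FAILED},
    JobStatus.RUNNING: {JobStatus.PLANNING, JobStatus.PAUSED, JobStatus.COMPLETED, JobStatus.FAILED},
    JobStatus.PAUSED: {JobStatus.PLANNING, JobStatus.RUNNING, JobStatus.FAILED},
    JobStatus.COMPLETED: set(),
    JobStatus.FAILED: set(),
}


def transition(current: JobStatus, target: JobStatus) -> JobStatus:
    """Return the new status, or raise if the move is not in the allowed map."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.value} -> {target.value}")
    return target


status = JobStatus.RUNNING
status = transition(status, JobStatus.PAUSED)   # agent asks the user for an API key
status = transition(status, JobStatus.RUNNING)  # user responds, the loop resumes
print(status.value)
```

Rejecting illegal moves in one place keeps a resumed job from being restarted after it has already completed or failed.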
Non-Functional Requirements (NFRs)
- Strong Isolation: Sandboxes must run with zero shared kernel state to prevent container escapes.
- Low-Latency VM Lifecycle: Sandbox environments must spin up in less than 150 milliseconds.
- Cost and Iteration Caps: Each job has a maximum token budget and step attempt threshold, terminating immediately upon limit breach.
- Scale Targets: Sized to support 10,000 concurrent active coding sandboxes.
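The cost and iteration caps above can be enforced with a small guard that the orchestrator consults on every planner step. This is a minimal sketch; the class name `JobBudget`, the exception `BudgetExceeded`, and the limit values are hypothetical.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a job crosses its token or step limit."""


@dataclass
class JobBudget:
    # Both limits are illustrative; the article does not fix concrete values.
    max_tokens: int
    max_steps: int
    tokens_used: int = 0
    steps_taken: int = 0

    def charge(self, tokens: int) -> None:
        """Record one planner step and fail fast if either cap is breached."""
        self.tokens_used += tokens
        self.steps_taken += 1
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(
                f"token budget exceeded: {self.tokens_used}/{self.max_tokens}"
            )
        if self.steps_taken > self.max_steps:
            raise BudgetExceeded(
                f"step limit exceeded: {self.steps_taken}/{self.max_steps}"
            )


budget = JobBudget(max_tokens=10_000, max_steps=50)
budget.charge(4_000)
budget.charge(4_000)
try:
    budget.charge(4_000)  # 12,000 tokens in total, over the 10,000 cap
except BudgetExceeded as exc:
    print(exc)
```

Raising an exception rather than returning a flag makes it harder for a planning loop to ignore the breach by accident.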
Capacity Estimations
For foundational capacity estimation principles and how to structure scale metrics, refer to our Capacity Estimation Guide.
Let's estimate the system demands for 10,000 active concurrent agent runs:
1. Compute & RAM Sizing
- Assume each coding agent run requires an isolated microVM sandbox with 2 vCPUs and 4 GB of RAM.
- Total RAM needed: 10,000 runs * 4 GB = 40,000 GB (40 TB).
- Total vCPUs needed: 10,000 runs * 2 vCPUs = 20,000 cores.
- Assuming server nodes with 128 cores and 512 GB RAM, we need:
  - By RAM limit: 40,000 GB / 512 GB ≈ 78.1, rounded up to 79 physical nodes.
  - By CPU limit (with 2:1 overcommit): 20,000 vCPUs / (128 * 2) ≈ 78.1, rounded up to 79 physical nodes.
- We size our cluster with 90 physical compute nodes to provide headroom.
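The node counts can be checked with a few lines of arithmetic. The sketch below takes 40 TB as 40,000 GB, as the estimate does, and rounds up because a fractional node cannot be provisioned. If 40 TB were read as 40,960 GB instead, the RAM bound would come out at exactly 80 nodes.

```python
import math

CONCURRENT_RUNS = 10_000
VCPUS_PER_RUN = 2
RAM_GB_PER_RUN = 4
NODE_CORES = 128
NODE_RAM_GB = 512
CPU_OVERCOMMIT = 2  # 2:1 vCPU-to-core ratio, as assumed in the estimate

total_ram_gb = CONCURRENT_RUNS * RAM_GB_PER_RUN  # 40,000 GB
total_vcpus = CONCURRENT_RUNS * VCPUS_PER_RUN    # 20,000 vCPUs

# Round up: a partially used node still has to be provisioned.
nodes_by_ram = math.ceil(total_ram_gb / NODE_RAM_GB)
nodes_by_cpu = math.ceil(total_vcpus / (NODE_CORES * CPU_OVERCOMMIT))
required_nodes = max(nodes_by_ram, nodes_by_cpu)

print(nodes_by_ram, nodes_by_cpu, required_nodes)  # 79 79 79
```

With either reading, a 90-node cluster leaves roughly 10 nodes of headroom for host overhead and node failures.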
2. Disk Storage & I/O Sizing
- If the average git repository workspace is 200 MB, cloning 10,000 repositories concurrently requires:
  - 10,000 * 200 MB = 2 TB of active workspace storage.
- Since agents run high-IOPS tasks (compiling code, installing packages via `npm install`), we allocate NVMe SSD arrays capable of handling 50,000 write IOPS per storage node.
Design Goals
Our design addresses specific engineering constraints to maintain stability:
- Secure Sandbox Isolation: Prevent rogue agents from accessing host infrastructure or other sandboxes.
- State Resumability: Ensure long-running workflows (which can take hours) survive orchestrator container restarts.
- Low-Latency Loop: Sync file edits and terminal streams to the UI in real-time.
High-Level Architecture and System Mechanics
The diagram below details the components of our autonomous agent system, splitting execution into a control plane and a sandbox execution plane.
```mermaid
graph TD
    Client[Web Browser Client] -->|HTTP/Websocket| Gateway[API Gateway & Router]
    Gateway -->|gRPC| Orchestrator[Agent Orchestrator Service]
    Gateway -->|Stream| LogCollector[Log & Stream Collector]
    Orchestrator -->|Queue| TaskQueue[Task Message Queue]
    Orchestrator -->|State Updates| StateDB[(State Database)]
    Orchestrator -->|Locking| Cache[(Redis Session Cache)]
    TaskQueue --> SandboxSupervisor[Sandbox Supervisor Service]
    SandboxSupervisor -->|Control API| VMHost[MicroVM Compute Host]
    VMHost --> VM1[MicroVM Sandbox 1]
    VMHost --> VM2[MicroVM Sandbox 2]
    VM1 -->|gRPC Stream| LogCollector
    VM2 -->|gRPC Stream| LogCollector
```
This high-level architecture separates orchestration from isolated compute execution. The API Gateway routes user requests. The Agent Orchestrator manages task planning and state database updates, while delegating command execution to the Sandbox Supervisor. The Sandbox Supervisor boots isolated microVMs on the compute hosts, and each microVM streams build logs back to the Log Collector.
API Design
Below is the REST contract for managing coding jobs.
| Endpoint | Method | Request Payload | Response | Description |
| --- | --- | --- | --- | --- |
| `/api/v1/jobs` | POST | `{repoUrl, taskDesc}` | `{jobId, status}` | Initiates a new coding agent run. |
| `/api/v1/jobs/{id}/stop` | POST | None | `{status: "stopped"}` | Interrupts execution and kills VMs. |
| `/api/v1/jobs/{id}/stream` | GET | None | Server-Sent Events | Streams terminal logs and file diffs. |
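The stream endpoint returns Server-Sent Events, which are plain text frames made of `id:`, `event:`, and `data:` lines terminated by a blank line. The helper below is a small sketch of how one frame could be serialized. The event name `terminal` and the payload fields are assumptions, since the contract above does not specify them.

```python
import json
from typing import Optional


def format_sse(event: str, payload: dict, event_id: Optional[int] = None) -> str:
    """Serialize one Server-Sent Events frame with a JSON body."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")  # lets a client resume via Last-Event-ID
    lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload always fits on one data line.
    lines.append(f"data: {json.dumps(payload)}")
    return "\n".join(lines) + "\n\n"  # a blank line terminates the frame


frame = format_sse("terminal", {"stream": "stdout", "chunk": "npm test\n"}, event_id=7)
print(frame, end="")
```

Sending an `id` with each frame is what allows a browser to reconnect after a dropped connection and ask for only the events it missed.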
Data Model / Schema
We store our job metadata and historical steps in a relational schema.
1. Table: agent_jobs
- `job_id` (UUID, Primary Key)
- `repository_url` (VARCHAR, Indexed)
- `task_description` (TEXT)
- `status` (VARCHAR) - E.g., `PLANNING`, `RUNNING`, `PAUSED`, `COMPLETED`, `FAILED`
- `created_at` (TIMESTAMP)
- `token_spend_usd` (DECIMAL)
2. Table: execution_steps
- `step_id` (UUID, Primary Key)
- `job_id` (UUID, Foreign Key referencing `agent_jobs`, Indexed)
- `step_number` (INT)
- `planner_thought` (TEXT)
- `tool_called` (VARCHAR)
- `tool_arguments` (TEXT)
- `tool_output` (TEXT)
- `executed_at` (TIMESTAMP)
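To make the two tables concrete, here is a sketch that creates them in an in-memory SQLite database so the example runs anywhere. SQLite stands in for the production database, so UUID and TIMESTAMP columns are approximated as TEXT. The `UNIQUE (job_id, step_number)` constraint is an assumption added here to keep step numbers from colliding within a job.

```python
import sqlite3
import uuid

DDL = """
CREATE TABLE agent_jobs (
    job_id           TEXT PRIMARY KEY,
    repository_url   TEXT NOT NULL,
    task_description TEXT NOT NULL,
    status           TEXT NOT NULL,
    created_at       TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    token_spend_usd  NUMERIC NOT NULL DEFAULT 0
);
CREATE INDEX idx_agent_jobs_repo ON agent_jobs (repository_url);

CREATE TABLE execution_steps (
    step_id         TEXT PRIMARY KEY,
    job_id          TEXT NOT NULL REFERENCES agent_jobs (job_id),
    step_number     INTEGER NOT NULL,
    planner_thought TEXT,
    tool_called     TEXT,
    tool_arguments  TEXT,
    tool_output     TEXT,
    executed_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (job_id, step_number)  -- assumption: one row per step per job
);
"""

conn = sqlite3.connect(":memory:")
conn.executescript(DDL)

job_id = str(uuid.uuid4())
conn.execute(
    "INSERT INTO agent_jobs (job_id, repository_url, task_description, status) "
    "VALUES (?, ?, ?, ?)",
    (job_id, "https://example.com/org/repo.git", "Fix issue #123", "PLANNING"),
)
conn.execute(
    "INSERT INTO execution_steps (step_id, job_id, step_number, tool_called, tool_arguments) "
    "VALUES (?, ?, ?, ?, ?)",
    (str(uuid.uuid4()), job_id, 1, "shell", "npm test"),
)

# Replaying a job's history is a single ordered scan over its steps.
rows = conn.execute(
    "SELECT step_number, tool_called FROM execution_steps "
    "WHERE job_id = ? ORDER BY step_number",
    (job_id,),
).fetchall()
print(rows)  # [(1, 'shell')]
```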
Cache Schema
Redis stores transient state and lock details to prevent double-execution:
- Key: `session:lock:{jobId}` | Type: String | Value: Orchestrator instance UUID | TTL: 10 seconds (heartbeat renewed). Prevents multiple orchestrators from executing the same job.
- Key: `sandbox:token:{jobId}` | Type: String | Value: MicroVM JWT token | TTL: 1 hour. Used to authenticate gRPC log streaming.
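The session lock follows the common "set if absent with a TTL, then renew on heartbeat" pattern. The sketch below simulates that behavior in memory so it can run without a Redis server. `InMemoryLockStore` is a hypothetical stand-in; in Redis the acquire step would typically map to `SET key value NX PX <ttl>`, and the renewal would need an atomic compare-and-extend, usually done with a Lua script.

```python
import time
from typing import Callable, Dict, Optional, Tuple


class InMemoryLockStore:
    """Tiny stand-in for Redis string keys with a TTL (illustration only)."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._data: Dict[str, Tuple[str, float]] = {}  # key -> (owner, expires_at)

    def _live_owner(self, key: str) -> Optional[str]:
        entry = self._data.get(key)
        if entry is None or entry[1] <= self._clock():
            self._data.pop(key, None)  # expired keys behave as if absent
            return None
        return entry[0]

    def acquire(self, key: str, owner: str, ttl: float) -> bool:
        """Take the lock only if no live owner exists."""
        if self._live_owner(key) is not None:
            return False
        self._data[key] = (owner, self._clock() + ttl)
        return True

    def renew(self, key: str, owner: str, ttl: float) -> bool:
        """Heartbeat: extend the TTL only if this owner still holds the lock."""
        if self._live_owner(key) != owner:
            return False
        self._data[key] = (owner, self._clock() + ttl)
        return True


now = [0.0]  # fake clock so the example is deterministic
store = InMemoryLockStore(clock=lambda: now[0])
key = "session:lock:job-1"

print(store.acquire(key, "orchestrator-a", ttl=10))  # True
print(store.acquire(key, "orchestrator-b", ttl=10))  # False: lock is held
now[0] = 30.0  # orchestrator-a stopped heartbeating and the TTL lapsed
print(store.acquire(key, "orchestrator-b", ttl=10))  # True: takeover after expiry
```

The short TTL is what turns an orchestrator crash into an automatic handover: once heartbeats stop, another instance can take the job within seconds.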
Tech Stack & Design Choices
| Component | Technology | Rationale |
| --- | --- | --- |
| Sandbox Isolation | Firecracker MicroVMs | Virtualization-level safety with minimal footprint, launching in under 150ms. |
| State Storage | PostgreSQL | Relational schema is ideal for tabular structured logs and audit traces. |
| Log Streaming | gRPC / HTTP/2 | High-throughput bi-directional terminal stdout/stderr streaming. |
Deep Dive: Execution Sandboxing and Workspace Syncing
Building a production-ready system requires looking closely at how virtualization runs commands securely and syncs workspace directories between host and VM.
The Internals
To run untrusted agent code, we avoid standard Docker containers because they share the host OS kernel. Instead, we use Firecracker microVMs. Firecracker leverages KVM to create lightweight virtual machines.
When an agent changes a file, we avoid slow network transfers by using a copy-on-write overlay file system. The base image containing standard programming languages (Python, Node, Java) is shared read-only across all microVMs. When a VM starts, it gets a thin writable directory overlay. Workspace syncing runs over a shared memory protocol (Virtio-FS) connecting the host node directory to the guest VM directory.
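The copy-on-write idea can be modeled in a few lines: reads fall through to a shared base layer, while writes land only in a per-VM layer. The sketch below uses `collections.ChainMap` as a toy model of that lookup order. It is not a filesystem, and the paths and contents are invented for illustration.

```python
from collections import ChainMap

# Shared read-only base image (toy model: path -> contents).
base_image = {
    "/usr/bin/python3": "python runtime",
    "/usr/bin/node": "node runtime",
}


def new_sandbox_view(base: dict) -> ChainMap:
    """Give each VM an empty writable layer stacked on top of the shared base."""
    return ChainMap({}, base)


vm1 = new_sandbox_view(base_image)
vm2 = new_sandbox_view(base_image)

# ChainMap sends writes to the first mapping only, so the base stays untouched.
vm1["/workspace/index.js"] = "console.log('patched')"
vm1["/usr/bin/node"] = "node runtime (locally modified)"

print(vm1["/usr/bin/node"])          # VM 1 sees its own modified copy
print(vm2["/usr/bin/node"])          # VM 2 still sees the shared base file
print(len(vm1.maps[0]), "entries in VM 1's writable layer")
```

Because each VM stores only what it changed, startup cost is independent of the base image size, which is the property the overlay design relies on.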
Mathematical Model
The agent's iterative planning-execution loop can be modeled as a state machine. Let:
- $S_t \in \mathcal{S}$ represent the system state at step $t$, defined as the tuple $(C_t, W_t)$ where $C_t$ is the current context token window (system prompt, conversation history) and $W_t$ is the workspace state (directory file tree and git diff).
- $P_t = \text{LLM}(S_t)$ be the planning thought output from the model.
- $T_t = g(P_t, S_t)$ be the tool command (e.g., execute shell command `npm test`, write file `index.js`) parsed from the output.
- $\Omega: T \times W \to (W', R)$ be the sandbox execution transition function, taking a tool command and workspace state, and returning an updated workspace $W'$ and console output result $R$.
The system updates state sequentially:
$$ W_{t+1} = \text{projection}_W(\Omega(T_t, W_t)) $$
$$ C_{t+1} = C_t \cup \{P_t, T_t, \text{projection}_R(\Omega(T_t, W_t))\} $$
$$ S_{t+1} = (C_{t+1}, W_{t+1}) $$
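The update rules translate almost directly into a pure function that maps one state to the next. The sketch below uses stub callables in place of the model, the parser $g$, and the sandbox $\Omega$, all of which are hypothetical. Two simplifications are assumed: the parser looks only at the thought rather than the full state, and the context is an ordered list (append) rather than a set union.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Workspace = Dict[str, str]  # W_t: path -> file contents
Context = List[str]         # C_t: ordered history entries


@dataclass(frozen=True)
class ToolCall:             # T_t
    name: str
    argument: str


def step(
    context: Context,
    workspace: Workspace,
    llm: Callable[[Context, Workspace], str],
    parse: Callable[[str], ToolCall],
    sandbox: Callable[[ToolCall, Workspace], Tuple[Workspace, str]],
) -> Tuple[Context, Workspace]:
    """Compute S_{t+1} = (C_{t+1}, W_{t+1}) from S_t = (C_t, W_t)."""
    thought = llm(context, workspace)                 # P_t
    tool = parse(thought)                             # T_t
    new_workspace, result = sandbox(tool, workspace)  # Omega(T_t, W_t) = (W', R)
    new_context = context + [thought, f"{tool.name}:{tool.argument}", result]
    return new_context, new_workspace


# Stubs standing in for the real model and sandbox.
def fake_llm(context: Context, workspace: Workspace) -> str:
    return "write index.js"


def fake_parse(thought: str) -> ToolCall:
    name, argument = thought.split(" ", 1)
    return ToolCall(name, argument)


def fake_sandbox(tool: ToolCall, workspace: Workspace) -> Tuple[Workspace, str]:
    updated = dict(workspace)  # copy, so W_t itself is never mutated
    updated[tool.argument] = "// generated"
    return updated, f"wrote {tool.argument}"


c0, w0 = ["system prompt"], {}
c1, w1 = step(c0, w0, fake_llm, fake_parse, fake_sandbox)
print(c1)
print(w1)
```

Keeping `step` free of side effects means each $(C_t, W_t)$ pair can be persisted and replayed, which is what makes crash recovery at step boundaries straightforward.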
Performance Analysis
- Sandbox Boot Latency: Firecracker VM startup takes ~120ms. Mounting Virtio-FS directories adds ~30ms, bringing aggregate startup time to ~150ms, much faster than traditional heavy hypervisors.
- Workspace Sync Overhead: Virtio-FS memory mapping enables file read-writes inside the VM at ~95% of native SSD speed.
- Context Limit Gating: As $C_t$ expands with large error dumps, LLM API call time increases. To prevent latency degradation, the orchestrator runs a summarization compression step once $C_t$ exceeds 64k tokens.
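The context gating rule can be sketched as a compaction pass that runs before each model call. In this illustration the token estimate is a crude character-based heuristic, the summarizer is injected as a callable, and the choice to keep the first entry plus the last few entries verbatim is an assumption rather than something the design above prescribes.

```python
from typing import Callable, List

TOKEN_LIMIT = 64_000  # threshold from the text
KEEP_RECENT = 4       # assumption: how many of the latest entries stay verbatim


def estimate_tokens(text: str) -> int:
    """Rough heuristic (about 4 characters per token); a real tokenizer would replace it."""
    return max(1, len(text) // 4)


def compact_context(
    entries: List[str],
    summarize: Callable[[List[str]], str],
    limit: int = TOKEN_LIMIT,
    keep_recent: int = KEEP_RECENT,
) -> List[str]:
    """Replace the middle of the history with one summary once the limit is exceeded."""
    total = sum(estimate_tokens(entry) for entry in entries)
    if total <= limit or len(entries) <= keep_recent + 1:
        return entries  # under budget, or nothing old enough to summarize
    split = len(entries) - keep_recent
    head, old, recent = entries[:1], entries[1:split], entries[split:]
    return head + [summarize(old)] + recent


history = ["system prompt"] + ["E" * 40_000 for _ in range(10)]  # ten large error dumps
compacted = compact_context(history, lambda old: f"summary of {len(old)} entries")
print(len(history), "->", len(compacted))  # 11 -> 6
```

In practice the `summarize` callable would itself be a model call, so it is worth counting its tokens against the job budget as well.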
Advanced Concepts: Multi-Tenant Sandbox Security and Event Streaming
Secure Network Sandboxing
Each microVM is isolated using a dedicated virtual network interface (TUN/TAP). Traffic routing is restricted using host-level iptables rules:
- VMs cannot access the metadata API endpoints of the host cloud provider (e.g., AWS IMDSv2).
- Outbound internet access is restricted to whitelisted package registries (e.g., npmjs.org, pypi.org) using a proxy gateway to prevent data exfiltration.
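The decision a proxy gateway makes for each outbound request can be sketched as an exact-match host check. The host list below is illustrative: `registry.npmjs.org` and `files.pythonhosted.org` are included because those are the hosts the npm and pip clients actually download from, but a real deployment would define its own list. The function name `is_egress_allowed` is hypothetical.

```python
from urllib.parse import urlsplit

# Illustrative allowlist; exact host names only, no wildcard or suffix matching.
ALLOWED_REGISTRY_HOSTS = frozenset(
    {"registry.npmjs.org", "pypi.org", "files.pythonhosted.org"}
)


def is_egress_allowed(url: str, allowed_hosts=ALLOWED_REGISTRY_HOSTS) -> bool:
    """Allow only HTTPS requests whose host exactly matches an allowlisted registry."""
    parts = urlsplit(url)
    if parts.scheme != "https":
        return False
    host = (parts.hostname or "").lower()
    return host in allowed_hosts


for candidate in (
    "https://registry.npmjs.org/left-pad",
    "https://pypi.org.evil.example/simple/",     # look-alike host, rejected
    "http://169.254.169.254/latest/meta-data/",  # cloud metadata endpoint, rejected
):
    print(candidate, "->", is_egress_allowed(candidate))
```

Exact matching is deliberate: a suffix or substring check would accept look-alike domains such as the second example.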
Event-Driven Orchestration Loop
The orchestrator planning loop runs asynchronously. When an agent triggers a long test run (e.g., 5 minutes), the Sandbox Supervisor emits execution updates to a Kafka cluster. The Orchestrator releases the processing thread, subscribes to the Kafka topic, and resumes the planning loop only when it receives a completion event.
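The suspend-and-resume behavior can be illustrated with `asyncio`, where awaiting a queue yields control instead of blocking a thread. In this sketch an `asyncio.Queue` stands in for the Kafka topic, and the event fields (`type`, `exit_code`) are hypothetical.

```python
import asyncio


async def sandbox_supervisor(events: asyncio.Queue, job_id: str) -> None:
    """Pretend to run a long test suite, then publish a completion event."""
    await asyncio.sleep(0.05)  # stands in for a multi-minute test run
    await events.put({"job_id": job_id, "type": "COMMAND_COMPLETED", "exit_code": 0})


async def orchestrate(job_id: str) -> list:
    events: asyncio.Queue = asyncio.Queue()  # stand-in for the event topic
    timeline = ["dispatched"]
    worker = asyncio.create_task(sandbox_supervisor(events, job_id))
    # Awaiting suspends this coroutine only; the event loop can serve other jobs.
    event = await events.get()
    await worker
    timeline.append(f"resumed:{event['type']}:{event['exit_code']}")
    return timeline


timeline = asyncio.run(orchestrate("job-1"))
print(timeline)  # ['dispatched', 'resumed:COMMAND_COMPLETED:0']
```

With a real broker the orchestrator process can also restart between dispatch and completion, provided the pending step was persisted before it was dispatched.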
Visualizing the System Flows
Write Path: Job Submission and Sandbox Boot Flow
The diagram below details the sequence of steps that occur when a user triggers a new coding task.
```mermaid
flowchart TD
    User([User]) -->|POST /jobs| Gateway[API Gateway]
    Gateway -->|Schedule| Orchestrator[Orchestrator]
    Orchestrator -->|1. Create State Record| StateDB[(State Database)]
    Orchestrator -->|2. Send Start Event| TaskQueue[Task Queue]
    TaskQueue -->|3. Consume Event| Supervisor[Sandbox Supervisor]
    Supervisor -->|4. Clone Git Repo| HostStorage[Host Storage NVMe]
    Supervisor -->|5. Boot microVM with Virtio-FS| VM[MicroVM Sandbox]
    VM -->|6. Start execution agent daemon| Exec([Active Agent Loop])
```
As shown in this sequence, the database record is persisted first. This ensures that even if the host running the supervisor crashes while cloning the repository, the state engine can recover the task and retry on a separate host.
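The "persist first, then enqueue" ordering and the recovery it enables can be sketched with in-memory stand-ins. Here a dictionary plays the State Database and a deque plays the Task Queue. The `sandbox_ready` flag and the `recover_unstarted_jobs` sweep are assumptions about how a recovery pass might detect lost work.

```python
import uuid
from collections import deque

state_db = {}         # job_id -> record; stands in for the State Database
task_queue = deque()  # stands in for the Task Queue


def submit_job(repo_url: str, task_desc: str) -> str:
    """Write the durable record before publishing the start event."""
    job_id = str(uuid.uuid4())
    state_db[job_id] = {
        "repoUrl": repo_url,
        "taskDesc": task_desc,
        "status": "PLANNING",
        "sandbox_ready": False,  # hypothetical flag set once the microVM is up
    }
    task_queue.append(job_id)
    return job_id


def recover_unstarted_jobs() -> list:
    """Re-enqueue jobs that have a record but no sandbox and no pending event."""
    lost = [
        job_id
        for job_id, record in state_db.items()
        if not record["sandbox_ready"] and job_id not in task_queue
    ]
    task_queue.extend(lost)
    return lost


job_id = submit_job("https://example.com/org/repo.git", "Fix issue #123")
task_queue.popleft()  # a supervisor consumes the event, then its host crashes
requeued = recover_unstarted_jobs()
print(requeued == [job_id])  # True: the job is scheduled again on another host
```

If the order were reversed, a crash between enqueue and insert would leave an event pointing at a job that has no record to recover from.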
Read Path: Live Terminal and Diffs Streaming Flow
This diagram describes how command execution output inside the VM reaches the user's browser dashboard.
```mermaid
flowchart TD
    VM[MicroVM Sandbox] -->|1. Emit console stdout/stderr| AgentDaemon[Agent Daemon]
    AgentDaemon -->|2. gRPC stream chunk| LogCollector[Log Collector Service]
    LogCollector -->|3. Push updates| EventStream[Event Streaming Service]
    EventStream -->|4. Publish SSE| Gateway[API Gateway]
    Gateway -->|5. Render terminal view| Client([Web UI Dashboard])
```
By pushing logs through a streaming collector independent of the Orchestrator, we avoid putting file-transfer memory pressure on the orchestration engine, keeping API routing fast and lightweight.
Real-World Applications: Devin at Scale
Two representative scenarios show how this design is applied:
- Massive Library Migrations: Upgrading an API client library across 1,000 distinct service repositories. The system spawns 1,000 parallel sandboxes, allowing the agents to compile, run tests, fix import errors, and submit pull requests simultaneously.
- Auto-Patching Security Alerts: Scanning code for CVE vulnerability patterns. The system clones repositories, executes static analysis tools inside the sandbox, edits files to patch dependencies, runs unit test suites, and flags failures for human review.
Architectural Trade-offs and Failure Modes
Designing an autonomous coder involves trading execution isolation against latency and host density.
| Option A | Option B | System Trade-off |
| --- | --- | --- |
| Docker Containers | Firecracker MicroVMs | Docker yields higher per-host density and faster startup, but shares the OS kernel, creating security risks. MicroVMs provide virtualization-level isolation but consume more idle RAM. |
| Git Clone per Job | Pre-cached NFS Volumes | NFS volumes reduce clone latencies but introduce lock coordination overhead and network bottleneck risks. Cloning directly to NVMe SSDs avoids bottlenecks but consumes network bandwidth. |
| Continuous Autonomy | Step-Level User Approval | Uncapped autonomy speeds up completion for routine tasks but risks runaway token cost loops. Step approvals prevent cost spikes but increase task latency due to human waiting time. |
System-Specific Bottlenecks & Failure Modes
- Infinite Retry Storm: The agent tries to install an incompatible package, fails, and repeats the command. Mitigation: The orchestrator maintains a loop detector that halts execution if the identical file-change hash occurs three times.
- Sandbox Disk Space Exhaustion: Running builds fills up the VM's writable overlay filesystem. Mitigation: Implement storage quotas on the Virtio-FS mount points and kill VMs that exceed disk budgets.
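The loop detector from the first mitigation above can be sketched as a counter over step fingerprints. One assumption here goes slightly beyond the text: the fingerprint hashes the command together with the resulting file diff, so that a retry producing no file change at all is still caught. The class name `LoopDetector` is hypothetical.

```python
import hashlib
from collections import Counter


class LoopDetector:
    """Halts a job once the same step fingerprint has been seen max_repeats times."""

    def __init__(self, max_repeats: int = 3) -> None:
        self.max_repeats = max_repeats
        self._seen: Counter = Counter()

    def should_halt(self, command: str, file_diff: str) -> bool:
        # A NUL byte separates the fields so ("ab", "c") and ("a", "bc") differ.
        material = (command + "\x00" + file_diff).encode("utf-8")
        digest = hashlib.sha256(material).hexdigest()
        self._seen[digest] += 1
        return self._seen[digest] >= self.max_repeats


detector = LoopDetector(max_repeats=3)
attempts = [
    detector.should_halt("npm install incompatible-pkg", file_diff="")
    for _ in range(3)
]
print(attempts)  # [False, False, True]
```

A stricter variant could count only consecutive repeats, which would avoid halting a job that legitimately reruns the same test command after unrelated edits.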
Decision Guide: Orchestrator vs. Sandbox Architecture
Use the matrix below to choose how to coordinate planning and execution layers based on your deployment constraints.
| Situation | Recommended Approach | Why |
| --- | --- | --- |
| Early MVP, low traffic | Single Orchestrator + Docker Sandbox on same node | Low complexity, fast to deploy, minimal network overhead. |
| High traffic, open-source codebases | Separated Orchestrator + Firecracker Cluster via gRPC | Secure virtualization boundary prevents malicious container escapes. |
| Large multi-service enterprise repos | Pre-provisioned Persistent VM Nodes + NFS Cache | Minimizes code checkout latency for large internal repositories. |
| High-security compliance projects | Ephemeral VMs + Private VPC Proxy | Prevents data exfiltration and blocks unauthorized external connections. |
Interview Delivery Example
In a system design interview, structured presentation demonstrates seniority. Below is a 45-minute breakdown showing how to communicate this coding agent design.
1. Requirements & Scope Definition (Minutes 0 - 5)
- Clarify boundaries: Ask if the interviewer wants to cover sandbox orchestration, model fine-tuning, or IDE integration. Focus on sandbox lifecycle and reliability loops.
- Define scale targets: Establish that the system must support 10k concurrent active agent runs, with secure sandboxing and low-latency workspace syncing.
2. High-Level Blueprint (Minutes 5 - 15)
- Draw the core split: the Agent Orchestrator (state, planning) and the Sandbox Supervisor (virtualization execution).
- Detail API contracts, PostgreSQL schemas, and Redis cache keys used to prevent duplicate executions.
3. Deep Dive (Minutes 15 - 35)
- Secure execution: Explain why standard containers are unsafe and how Firecracker MicroVMs solve kernel sharing issues.
- File synchronization: Detail Virtio-FS memory-mapped workspace directories and how copy-on-write overlays keep sandbox startup under 150ms.
- Planning-execution math model: Walk through the state-transition equation to illustrate loop safety.
4. Failure Modes & Edge Cases (Minutes 35 - 45)
- Discuss runaway loop mitigation, disk space caps, network exfiltration prevention, and how to scale writes to the state database.
Open-Source Frameworks: How Open-Source Solves Sandboxing
Enterprise platforms draw patterns from active open-source agent environments:
- OpenHands (formerly OpenDevin): Runs sandbox workspaces in Docker containers. It maps local directories directly into containers, enabling real-time terminal output rendering.
- SWE-bench: A benchmark harness that provisions isolated virtual environments to test agents on real GitHub issues.
- LocalSandbox: An open-source lightweight microVM controller interface that abstracts Firecracker API configurations, enabling instant VM initialization.
Lessons Learned: Production Realities
- Git Loop Lockups: When compiler checks fail, agents often attempt to revert commits, creating circular revert-and-retry loops. Implement git-ref guardrails so agents cannot push changes unless the commit history is clean.
- Silent Dependency Poisoning: If an agent downloads packages directly from external registries during compile phases, it can pull compromised dependencies. Force all sandboxes to read from a sanitized internal private package registry mirror.
Key Takeaways
- Compute Isolation: Never run untrusted code on shared hosts; decouple orchestrators from compute execution using virtualization-level microVMs.
- State Persistence: Save execution metrics and history traces at each step transition boundary to enable crash recovery.
- Design over Code: High-level design interviews focus on component interaction, data schemas, and trade-offs rather than application code.
- Virtio-FS Scaling: Optimize file sync times between supervisor nodes and VMs using memory-mapped filesystem overlays.