System Design • Advanced • 14 min read • Jun 18, 2026

System Design: Designing an Autonomous AI Coding Agent (Devin at Scale)

Learn how to design a secure, isolated, and scalable high-level system for an autonomous AI software engineer at enterprise scale.

Abstract Algorithms

Helping engineers master software engineering topics.

TLDR: Designing an autonomous AI coding agent at scale is not a prompt engineering task; it is a complex systems problem. The system requires secure multitenancy via Firecracker microVMs, a low-latency workspace syncing event loop, and a dual planning-execution engine backed by durable state to survive compile-and-retry loops.

📖 Part 1: Approach — Requirements and Scale Challenges

When building an autonomous AI developer like Devin or SWE-agent, developers often start with a simple script that calls the OpenAI API, reads local files, runs local compilers, and writes code.

In production, this naive setup fails immediately:

The Infinite Loop Resource Burn: An LLM planner gets stuck in a compile-fail-retry cycle, compiling a broken package 500 times in 2 minutes, consuming all system CPU and incurring thousands of dollars in API costs.
Sandbox Security Escape: An agent writes a Python script that runs malicious code locally, deletes files, or attempts to port-scan other nodes in the network.
Workspace Sync Bottleneck: If the workspace repository is larger than 5 GB, copying files between the agent orchestrator node and the compiler sandbox takes minutes, stalling the interaction loop.

To prevent this, we must design an enterprise-grade system that isolates execution, tracks costs, manages task cycles, and synchronizes workspace code at sub-second speeds.

Use Cases & Requirements

Actors

Developer User: Initiates tasks (e.g., "Fix issue #123"), inspects logs, and reviews code diffs.
Agent Orchestrator: The central state coordinator that schedules jobs, runs the planning loop, and manages user context.
Sandbox Supervisor: The secure infrastructure component that spins up, monitors, and terminates execution environments.
Model Inference Provider: Returns completions and structured tool calls in response to the orchestrator's prompts.

Functional Requirements

Initialize Job: Users can clone a git repository, set context variables, and trigger a coding task.
Execute Terminal Commands: The agent can run arbitrary shell commands inside an isolated environment and capture console stdout/stderr.
Edit Code Files: The agent can view, search, and edit files inside the workspace repository.
Pause/Resume for Feedback: If the agent needs help (e.g., API key missing) or requires approval for a commit, the execution loop pauses and resumes upon user response.

Non-Functional Requirements (NFRs)

Strong Isolation: Sandboxes must run with zero shared kernel state to prevent container escapes.
Low-Latency VM Lifecycle: Sandbox environments must spin up in less than 150 milliseconds.
Cost and Iteration Caps: Each job has a maximum token budget and step attempt threshold, terminating immediately upon limit breach.
Scale Targets: Sized to support 10,000 concurrent active coding sandboxes.
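
The cost and iteration caps above can be sketched as a small guard that the orchestrator charges once per planner step. This is a minimal illustration, not a prescribed implementation: the `JobBudget` and `BudgetExceeded` names and the limit values are hypothetical.

```python
from dataclasses import dataclass


class BudgetExceeded(Exception):
    """Raised when a job crosses its token or step limit."""


@dataclass
class JobBudget:
    max_tokens: int       # hypothetical per-job token ceiling
    max_steps: int        # hypothetical per-job step ceiling
    tokens_used: int = 0
    steps_used: int = 0

    def charge(self, tokens: int) -> None:
        """Record one planner step and fail fast if either cap is breached."""
        self.tokens_used += tokens
        self.steps_used += 1
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded(
                f"token budget exceeded: {self.tokens_used}/{self.max_tokens}")
        if self.steps_used > self.max_steps:
            raise BudgetExceeded(
                f"step limit exceeded: {self.steps_used}/{self.max_steps}")


budget = JobBudget(max_tokens=10_000, max_steps=3)
budget.charge(4_000)
budget.charge(4_000)
try:
    budget.charge(4_000)  # third step pushes usage past the token ceiling
    halted = False
except BudgetExceeded:
    halted = True
print(halted, budget.tokens_used)
```

Raising an exception rather than returning a flag makes it harder for a caller to ignore a breach, which matches the "terminating immediately" requirement.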

Capacity Estimations

For foundational capacity estimation principles and how to structure scale metrics, refer to our Capacity Estimation Guide.

Let's estimate the system demands for 10,000 active concurrent agent runs:

1. Compute & RAM Sizing

Assume each coding agent run requires an isolated microVM sandbox with 2 vCPUs and 4 GB of RAM.
Total RAM needed: 10,000 runs * 4 GB = 40,000 GB (40 TB).
Total vCPUs needed: 10,000 runs * 2 vCPUs = 20,000 cores.
Assuming server nodes with 128 cores and 512 GB RAM, we need:
- By RAM limit: 40,000 GB / 512 GB ≈ 78.1, so 79 physical nodes.
- By CPU limit (with 2:1 overcommit): 20,000 vCPUs / (128 * 2) ≈ 78.1, so 79 physical nodes.
- We size our cluster with 90 physical compute nodes to provide headroom.
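
The sizing arithmetic can be checked with a few lines of Python. The constants are the assumptions stated above; node counts are rounded up because a fractional node cannot be provisioned.

```python
import math

RUNS = 10_000
RAM_PER_RUN_GB = 4
VCPUS_PER_RUN = 2
NODE_RAM_GB = 512
NODE_CORES = 128
CPU_OVERCOMMIT = 2      # 2:1 vCPU-to-core ratio, as assumed above
PROVISIONED_NODES = 90  # cluster size chosen above

total_ram_gb = RUNS * RAM_PER_RUN_GB
total_vcpus = RUNS * VCPUS_PER_RUN

# Round up: partial nodes do not exist.
nodes_by_ram = math.ceil(total_ram_gb / NODE_RAM_GB)
nodes_by_cpu = math.ceil(total_vcpus / (NODE_CORES * CPU_OVERCOMMIT))
nodes_required = max(nodes_by_ram, nodes_by_cpu)
headroom = PROVISIONED_NODES / nodes_required - 1

print(f"RAM-bound nodes: {nodes_by_ram}")
print(f"CPU-bound nodes: {nodes_by_cpu}")
print(f"Headroom at {PROVISIONED_NODES} nodes: {headroom:.1%}")
```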

2. Disk Storage & I/O Sizing

If the average git repository workspace is 200 MB, cloning 10,000 repositories concurrently requires:
- 10,000 * 200 MB = 2 TB of active workspace storage.
- Since agents run high-IOPS tasks (compiling code, installing packages via npm install), we allocate NVMe SSD arrays capable of handling 50,000 write IOPS per storage node.

Design Goals

Our design addresses specific engineering constraints to maintain stability:

Secure Sandbox Isolation: Prevent rogue agents from accessing host infrastructure or other sandboxes.
State Resumability: Ensure long-running workflows (which can take hours) survive orchestrator container restarts.
Low-Latency Loop: Sync file edits and terminal streams to the UI in real-time.

⚙️ High-Level Architecture and System Mechanics

The diagram below details the components of our autonomous agent system, splitting execution into a control plane and a sandbox execution plane.

graph TD
    Client[Web Browser Client] -->|HTTP/Websocket| Gateway[API Gateway & Router]
    Gateway -->|gRPC| Orchestrator[Agent Orchestrator Service]
    Gateway -->|Stream| LogCollector[Log & Stream Collector]
    Orchestrator -->|Queue| TaskQueue[Task Message Queue]
    Orchestrator -->|State Updates| StateDB[(State Database)]
    Orchestrator -->|Locking| Cache[(Redis Session Cache)]

    TaskQueue --> SandboxSupervisor[Sandbox Supervisor Service]
    SandboxSupervisor -->|Control API| VMHost[MicroVM Compute Host]

    VMHost --> VM1[MicroVM Sandbox 1]
    VMHost --> VM2[MicroVM Sandbox 2]

    VM1 -->|gRPC Stream| LogCollector
    VM2 -->|gRPC Stream| LogCollector

This high-level architecture separates orchestration from isolated compute execution. The API Gateway routes client requests and stream subscriptions. The Agent Orchestrator manages task planning and state database updates, and delegates command execution to the Sandbox Supervisor through the task queue. The Sandbox Supervisor boots isolated microVMs on the compute hosts, and each microVM streams build logs back to the Log Collector.

API Design

Below is the REST contract for managing coding jobs.

Endpoint	Method	Request Payload	Response	Description
`/api/v1/jobs`	POST	`{repoUrl, taskDesc}`	`{jobId, status}`	Initiates a new coding agent run.
`/api/v1/jobs/{id}/stop`	POST	None	`{status: "stopped"}`	Interrupts execution and kills VMs.
`/api/v1/jobs/{id}/stream`	GET	None	Server-Sent Events	Streams terminal logs and file diffs.

Data Model / Schema

We store our job metadata and historical steps in a relational schema.

1. Table: `agent_jobs`

job_id (UUID, Primary Key)
repository_url (VARCHAR, Indexed)
task_description (TEXT)
status (VARCHAR) - E.g., PLANNING, RUNNING, PAUSED, COMPLETED, FAILED
created_at (TIMESTAMP)
token_spend_usd (DECIMAL)

2. Table: `execution_steps`

step_id (UUID, Primary Key)
job_id (UUID, Foreign Key referencing agent_jobs, Indexed)
step_number (INT)
planner_thought (TEXT)
tool_called (VARCHAR)
tool_arguments (TEXT)
tool_output (TEXT)
executed_at (TIMESTAMP)
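
To make the two tables concrete, here is a minimal sketch using Python's built-in `sqlite3` module as a stand-in for PostgreSQL. The column names follow the schema above; the SQLite types, the `CHECK` constraint on `status`, the `UNIQUE (job_id, step_number)` constraint, and the index names are assumptions added for illustration.

```python
import sqlite3
import uuid

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite enforces FKs only when enabled
conn.executescript("""
CREATE TABLE agent_jobs (
    job_id           TEXT PRIMARY KEY,
    repository_url   TEXT NOT NULL,
    task_description TEXT NOT NULL,
    status           TEXT NOT NULL CHECK (status IN
                     ('PLANNING', 'RUNNING', 'PAUSED', 'COMPLETED', 'FAILED')),
    created_at       TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    token_spend_usd  NUMERIC NOT NULL DEFAULT 0
);
CREATE INDEX idx_jobs_repo ON agent_jobs (repository_url);

CREATE TABLE execution_steps (
    step_id         TEXT PRIMARY KEY,
    job_id          TEXT NOT NULL REFERENCES agent_jobs (job_id),
    step_number     INTEGER NOT NULL,
    planner_thought TEXT,
    tool_called     TEXT,
    tool_arguments  TEXT,
    tool_output     TEXT,
    executed_at     TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    UNIQUE (job_id, step_number)  -- assumption: one row per step per job
);
CREATE INDEX idx_steps_job ON execution_steps (job_id);
""")

job_id = str(uuid.uuid4())
conn.execute(
    "INSERT INTO agent_jobs (job_id, repository_url, task_description, status) "
    "VALUES (?, ?, ?, ?)",
    (job_id, "https://example.com/org/repo.git", "Fix issue #123", "PLANNING"),
)
conn.execute(
    "INSERT INTO execution_steps (step_id, job_id, step_number, tool_called, tool_arguments) "
    "VALUES (?, ?, ?, ?, ?)",
    (str(uuid.uuid4()), job_id, 1, "shell", "npm test"),
)
conn.commit()

step_count = conn.execute(
    "SELECT COUNT(*) FROM execution_steps WHERE job_id = ?", (job_id,)
).fetchone()[0]
print(step_count)
```

The uniqueness constraint on `(job_id, step_number)` is one way to make step writes idempotent if an orchestrator retries after a crash.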

Cache Schema

Redis stores transient state and lock details to prevent double-execution:

Key: session:lock:{jobId} | Type: String | Value: Orchestrator instance UUID | TTL: 10 seconds (heartbeat renewed). Prevents multiple orchestrators from executing the same job.
Key: sandbox:token:{jobId} | Type: String | Value: MicroVM JWT token | TTL: 1 hour. Used to authenticate gRPC log streaming.
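
The session lock behaves like a lease: an orchestrator acquires it only if no live owner exists, and keeps it only while it renews the TTL. The sketch below models those semantics with an in-memory stand-in rather than a real Redis client, so the `InMemoryLockStore` class and its methods are hypothetical. In Redis the acquire step would typically be a `SET` with the `NX` and `PX` options, and the owner-checked renewal would need to run atomically, for example in a server-side script.

```python
import time
import uuid


class InMemoryLockStore:
    """Stand-in for Redis: set-if-absent with a TTL plus owner-checked renewal."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # key -> (owner, expires_at)

    def _live_owner(self, key):
        entry = self._entries.get(key)
        if entry is None or entry[1] <= self._clock():
            self._entries.pop(key, None)  # expired leases vanish
            return None
        return entry[0]

    def acquire(self, key, owner, ttl_s):
        if self._live_owner(key) is not None:
            return False
        self._entries[key] = (owner, self._clock() + ttl_s)
        return True

    def renew(self, key, owner, ttl_s):
        # Only the current owner may extend the lease.
        if self._live_owner(key) != owner:
            return False
        self._entries[key] = (owner, self._clock() + ttl_s)
        return True


now = [0.0]  # fake clock so the example is deterministic
store = InMemoryLockStore(clock=lambda: now[0])
orch_a, orch_b = str(uuid.uuid4()), str(uuid.uuid4())
key = "session:lock:job-42"

first = store.acquire(key, orch_a, ttl_s=10)     # orchestrator A wins
second = store.acquire(key, orch_b, ttl_s=10)    # B is rejected while A holds the lease
now[0] = 8.0
renewed = store.renew(key, orch_a, ttl_s=10)     # heartbeat extends expiry to t=18
now[0] = 19.0
takeover = store.acquire(key, orch_b, ttl_s=10)  # A stopped heartbeating, so B takes over
print(first, second, renewed, takeover)
```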

Tech Stack & Design Choices

Component	Technology	Rationale
Sandbox Isolation	Firecracker MicroVMs	Virtualization-level safety with minimal footprint, launching in under 150ms.
State Storage	PostgreSQL	Relational schema is ideal for tabular structured logs and audit traces.
Log Streaming	gRPC / HTTP/2	High-throughput bi-directional terminal stdout/stderr streaming.

🧠 Deep Dive: Execution Sandboxing and Workspace Syncing

Building a production-ready system requires looking closely at how virtualization runs commands securely and syncs workspace directories between host and VM.

The Internals

To run untrusted agent code, we avoid standard Docker containers because they share the host OS kernel. Instead, we use Firecracker microVMs. Firecracker leverages KVM to create lightweight virtual machines.

When an agent changes a file, we avoid slow network transfers by using a copy-on-write overlay file system. The base image containing standard programming languages (Python, Node, Java) is shared read-only across all microVMs. When a VM starts, it gets a thin writable directory overlay. Workspace syncing runs over a shared memory protocol (Virtio-FS) connecting the host node directory to the guest VM directory.
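
The copy-on-write idea can be illustrated with a toy model: reads fall through to a shared read-only base, while writes land in a private upper layer. This is a conceptual sketch only; `OverlayWorkspace` is a hypothetical name and it models dictionaries, not a real overlay filesystem.

```python
from types import MappingProxyType


class OverlayWorkspace:
    """Toy copy-on-write view: shared read-only base plus a per-VM writable layer."""

    _DELETED = object()  # whiteout marker for files removed in the upper layer

    def __init__(self, base):
        self._base = base  # shared across all sandboxes, never mutated
        self._upper = {}   # private to this sandbox

    def read(self, path):
        if path in self._upper:
            value = self._upper[path]
            if value is self._DELETED:
                raise FileNotFoundError(path)
            return value
        if path in self._base:
            return self._base[path]
        raise FileNotFoundError(path)

    def write(self, path, data):
        self._upper[path] = data  # the base image is left untouched

    def delete(self, path):
        self.read(path)  # raises if the file is not visible
        self._upper[path] = self._DELETED


base = MappingProxyType({
    "/usr/bin/python3": b"<interpreter>",
    "/etc/os-release": b"base-image",
})
vm1, vm2 = OverlayWorkspace(base), OverlayWorkspace(base)
vm1.write("/etc/os-release", b"patched-by-vm1")

print(vm1.read("/etc/os-release"))  # vm1 sees its own copy
print(vm2.read("/etc/os-release"))  # vm2 still sees the shared base
```

Because each sandbox only stores its own changes, starting a new VM does not require copying the base image.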

Mathematical Model

The agent's iterative planning-execution loop can be modeled as a state machine. Let:

$S_t \in \mathcal{S}$ represent the system state at step $t$, defined as the tuple $(C_t, W_t)$ where $C_t$ is the current context token window (system prompt, conversation history) and $W_t$ is the workspace state (directory file tree and git diff).
$P_t = \text{LLM}(S_t)$ be the planning thought output from the model.
$T_t = g(P_t, S_t)$ be the tool command (e.g., execute shell command npm test, write file index.js) parsed from the output.
$\Omega: T \times W \to (W', R)$ be the sandbox execution transition function, taking a tool command and workspace state, and returning an updated workspace $W'$ and console output result $R$.

The system updates state sequentially:

$$ W_{t+1} = \text{projection}_W(\Omega(T_t, W_t)) $$

$$ C_{t+1} = C_t \cup \{P_t, T_t, \text{projection}_R(\Omega(T_t, W_t))\} $$

$$ S_{t+1} = (C_{t+1}, W_{t+1}) $$
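
One way to read the state-transition model is as a pure `step` function from $S_t$ to $S_{t+1}$. The sketch below uses stub functions in place of the model, the parser $g$, and the sandbox $\Omega$; every name in it is hypothetical. It also represents $C_t$ as an ordered tuple rather than a set, on the assumption that history order matters when the context is sent to a model.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

Workspace = Dict[str, str]  # W_t: path -> file contents
Context = Tuple[str, ...]   # C_t: ordered history entries


@dataclass(frozen=True)
class State:
    context: Context
    workspace: Workspace


def step(state, llm, parse_tool, sandbox):
    thought = llm(state)                                    # P_t = LLM(S_t)
    tool = parse_tool(thought, state)                       # T_t = g(P_t, S_t)
    new_workspace, result = sandbox(tool, state.workspace)  # Omega(T_t, W_t) -> (W', R)
    new_context = state.context + (thought, tool, result)   # C_{t+1}
    return State(new_context, new_workspace)                # S_{t+1}


# Stubs standing in for the model, the tool parser, and the sandbox.
def stub_llm(state):
    return "write a greeting file"


def stub_parse(thought, state):
    return "write hello.txt hi"


def stub_sandbox(tool, workspace):
    _, path, content = tool.split(" ", 2)
    updated = dict(workspace)  # copy so the previous W_t is preserved
    updated[path] = content
    return updated, f"wrote {path}"


s0 = State(context=("system prompt",), workspace={})
s1 = step(s0, stub_llm, stub_parse, stub_sandbox)
print(s1.workspace)
print(s1.context)
```

Keeping `step` free of side effects means each $S_t$ can be persisted at the transition boundary and replayed after a crash.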

Performance Analysis

Sandbox Boot Latency: Firecracker VM startup takes ~120ms. Mounting Virtio-FS directories adds ~30ms, bringing aggregate startup time to ~150ms, much faster than traditional heavy hypervisors.
Workspace Sync Overhead: Virtio-FS memory mapping enables file read-writes inside the VM at ~95% of native SSD speed.
Context Limit Gating: As $C_t$ expands with large error dumps, LLM API call time increases. To prevent latency degradation, the orchestrator runs a summarization compression step once $C_t$ exceeds 64k tokens.
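
The context gating rule can be sketched as follows. This is an illustrative assumption about how compression might work, not a description of a specific product: `count_tokens` is a crude whitespace counter standing in for a real tokenizer, `summarize` is a caller-supplied function (in practice probably another model call), and the policy of keeping the first message plus the most recent ones is hypothetical.

```python
def count_tokens(text):
    """Crude stand-in for a real tokenizer: whitespace-separated words."""
    return len(text.split())


def compress_context(messages, limit, keep_recent, summarize):
    """Replace the middle of the history with a summary once it exceeds `limit` tokens."""
    total = sum(count_tokens(m) for m in messages)
    if total <= limit or len(messages) <= keep_recent + 1:
        return messages
    cut = len(messages) - keep_recent
    head = messages[:1]       # keep the system prompt verbatim
    middle = messages[1:cut]  # older turns, including large error dumps
    tail = messages[cut:]     # most recent turns stay verbatim
    return head + [summarize(middle)] + tail


def fake_summarize(msgs):
    return f"[summary of {len(msgs)} earlier messages]"


history = [
    "system prompt",
    "error " * 50,  # a large compiler error dump
    "error " * 50,
    "latest test output",
]
compacted = compress_context(history, limit=40, keep_recent=1, summarize=fake_summarize)
print(compacted)
```

In the design above, `limit` would be the 64k-token threshold.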

🏗️ Advanced Concepts: Multi-Tenant Sandbox Security and Event Streaming

Secure Network Sandboxing

Each microVM is isolated using a dedicated virtual network interface (TUN/TAP). Traffic routing is restricted using host-level iptables rules:

VMs cannot access the metadata API endpoints of the host cloud provider (e.g., AWS IMDSv2).
Outbound internet access is restricted to whitelisted package registries (e.g., npmjs.org, pypi.org) using a proxy gateway to prevent data exfiltration.
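
The allowlist decision made by the proxy gateway might look like the sketch below. It is a default-deny check on the requested hostname and complements, rather than replaces, the host-level iptables rules; the function name and the exact suffix list are assumptions for illustration.

```python
import ipaddress

# Registry domains taken from the examples above; a real list would be configuration.
ALLOWED_REGISTRY_SUFFIXES = ("npmjs.org", "pypi.org")


def is_egress_allowed(host):
    """Default-deny: permit only allowlisted registry domains and their subdomains."""
    host = host.lower().rstrip(".")
    try:
        ipaddress.ip_address(host)
    except ValueError:
        pass  # not an IP literal, fall through to the domain check
    else:
        # Raw IP destinations are never allowed, which also covers the
        # link-local cloud metadata address 169.254.169.254.
        return False
    return any(
        host == suffix or host.endswith("." + suffix)
        for suffix in ALLOWED_REGISTRY_SUFFIXES
    )


for candidate in ("registry.npmjs.org", "pypi.org", "169.254.169.254",
                  "evilnpmjs.org", "npmjs.org.attacker.example"):
    print(candidate, is_egress_allowed(candidate))
```

Matching on `"." + suffix` rather than a bare `endswith(suffix)` prevents look-alike domains such as `evilnpmjs.org` from slipping through.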

Event-Driven Orchestration Loop

The orchestrator planning loop runs asynchronously. When an agent triggers a long test run (e.g., 5 minutes), the Sandbox Supervisor emits execution updates to a Kafka topic. The Orchestrator releases the processing thread, subscribes to that topic, and resumes the planning loop only when it receives the completion event.
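
The release-and-resume pattern can be sketched with `asyncio`, using an in-process `asyncio.Queue` as a stand-in for the Kafka topic. The event shape and function names are hypothetical, and the short sleep stands in for a multi-minute test run.

```python
import asyncio


async def sandbox_supervisor(events, job_id):
    """Simulates a long test run, then publishes a completion event."""
    await asyncio.sleep(0.05)  # stands in for a multi-minute test run
    await events.put({"job_id": job_id, "type": "COMMAND_COMPLETED", "exit_code": 0})


async def orchestrator(events, job_id, log):
    log.append("dispatched: npm test")
    # Awaiting the queue yields control, so no thread is blocked while tests run.
    event = await events.get()
    if event["job_id"] == job_id and event["type"] == "COMMAND_COMPLETED":
        log.append(f"resumed planning, exit_code={event['exit_code']}")


async def main():
    events = asyncio.Queue()  # in-process stand-in for a Kafka topic
    log = []
    await asyncio.gather(
        orchestrator(events, "job-42", log),
        sandbox_supervisor(events, "job-42"),
    )
    return log


log = asyncio.run(main())
print(log)
```

With a real broker, the same orchestrator instance would not need to be the one that resumes the job: any instance holding the session lock could pick up the completion event.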

📊 Visualizing the System Flows

Write Path: Job Submission and Sandbox Boot Flow

The diagram below details the sequence of steps that occur when a user triggers a new coding task.

flowchart TD
    User([User]) -->|POST /jobs| Gateway[API Gateway]
    Gateway -->|Schedule| Orchestrator[Orchestrator]
    Orchestrator -->|1. Create State Record| StateDB[(State Database)]
    Orchestrator -->|2. Send Start Event| TaskQueue[Task Queue]
    TaskQueue -->|3. Consume Event| Supervisor[Sandbox Supervisor]
    Supervisor -->|4. Clone Git Repo| HostStorage[Host Storage NVMe]
    Supervisor -->|5. Boot microVM with Virtio-FS| VM[MicroVM Sandbox]
    VM -->|6. Start execution agent daemon| Exec([Active Agent Loop])

As shown in this sequence, the database record is persisted first. This ensures that even if the compute host hosting the supervisor crashes while cloning the repository, the state engine can recover the task and retry on a separate host.

Read Path: Live Terminal and Diffs Streaming Flow

This diagram describes how command execution output inside the VM reaches the user's browser dashboard.

flowchart TD
    VM[MicroVM Sandbox] -->|1. Emit console stdout/stderr| AgentDaemon[Agent Daemon]
    AgentDaemon -->|2. gRPC stream chunk| LogCollector[Log Collector Service]
    LogCollector -->|3. Push updates| EventStream[Event Streaming Service]
    EventStream -->|4. Publish SSE| Gateway[API Gateway]
    Gateway -->|5. Render terminal view| Client([Web UI Dashboard])

By pushing logs through a streaming collector independent of the Orchestrator, we avoid putting file-transfer memory pressure on the orchestration engine, keeping API routing fast and lightweight.
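
At the last hop, log chunks are serialized as Server-Sent Events. The sketch below shows one possible framing, assuming each chunk is a `(stream, text)` pair and that the payload is JSON; the event name `terminal` and the field names are hypothetical. JSON encoding keeps embedded newlines escaped, so each payload fits on a single `data:` line.

```python
import json


def format_sse(event, data, event_id=None):
    """Serialize one event in the text/event-stream wire format."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")  # lets clients resume via Last-Event-ID
    lines.append(f"event: {event}")
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"     # a blank line terminates the event


def stream_chunks(chunks):
    """Turn (stream, text) log chunks into numbered SSE frames."""
    for seq, (stream, text) in enumerate(chunks, start=1):
        yield format_sse("terminal", {"stream": stream, "text": text}, event_id=seq)


frames = list(stream_chunks([("stdout", "ok\n"), ("stderr", "warning: unused import")]))
for frame in frames:
    print(frame, end="")
```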

🌍 Real-World Applications: Devin at Scale

Two representative use cases illustrate this design:

Massive Library Migrations: Upgrading an API client library across 1,000 distinct service repositories. The system spawns 1,000 parallel sandboxes, allowing the agents to compile, run tests, fix import errors, and submit pull requests simultaneously.
Auto-Patching Security Alerts: Scanning code for CVE vulnerability patterns. The system clones repositories, executes static analysis tools inside the sandbox, edits files to patch dependencies, runs unit test suites, and flags failures for human review.

⚖️ Architectural Trade-offs and Failure Modes

Designing an autonomous coder involves trading execution isolation against latency and host density.

Option A	Option B	System Trade-off
Docker Containers	Firecracker MicroVMs	Docker yields higher per-host sandbox density and faster startup, but shares the host OS kernel, creating security risks. MicroVMs provide virtualization-level isolation but consume more idle RAM per sandbox.
Git Clone per Job	Pre-cached NFS Volumes	NFS volumes reduce clone latencies but introduce lock coordination overhead and network bottleneck risks. Cloning directly to NVMe SSDs avoids bottlenecks but consumes network bandwidth.
Continuous Autonomy	Step-Level User Approval	Uncapped autonomy speeds up completion for routine tasks but risks runaway token cost loops. Step approvals prevent cost spikes but increase task latency due to human waiting time.

System-Specific Bottlenecks & Failure Modes

Infinite Retry Storm: The agent tries to install an incompatible package, fails, and repeats the command. Mitigation: The orchestrator maintains a loop detector that halts execution if an identical step fingerprint (the tool command together with the resulting file-change hash) occurs three times.
Sandbox Disk Space Exhaustion: Running builds fills up the VM's writable overlay filesystem. Mitigation: Implement storage quotas on the Virtio-FS mount points and kill VMs that exceed disk budgets.
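
A loop detector of this kind can be sketched in a few lines. As an assumption for illustration, the fingerprint here hashes the tool command together with the workspace diff, so that a repeated failing command with no file changes is still caught; the `LoopDetector` name and its interface are hypothetical.

```python
import hashlib
from collections import Counter


class LoopDetector:
    def __init__(self, max_repeats=3):
        self.max_repeats = max_repeats
        self._seen = Counter()

    def record(self, tool_command, workspace_diff):
        """Return True once the same command+diff fingerprint reaches the repeat limit."""
        fingerprint = hashlib.sha256(
            tool_command.encode() + b"\x00" + workspace_diff.encode()
        ).hexdigest()
        self._seen[fingerprint] += 1
        return self._seen[fingerprint] >= self.max_repeats


detector = LoopDetector(max_repeats=3)
# The same failing install is attempted three times with no file changes.
verdicts = [detector.record("npm install some-incompatible-package", "") for _ in range(3)]
print(verdicts)
```

The null byte separator keeps a command and a diff from colliding with a different pair whose concatenation happens to be identical.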

🧭 Decision Guide: Orchestrator vs. Sandbox Architecture

Use the matrix below to choose how to coordinate planning and execution layers based on your deployment constraints.

Situation	Recommended Approach	Why
Early MVP, low traffic	Single Orchestrator + Docker Sandbox on same node	Low complexity, fast to deploy, minimal network overhead.
High traffic, open-source codebases	Separated Orchestrator + Firecracker Cluster via gRPC	Secure virtualization boundary prevents malicious container escapes.
Large multi-service enterprise repos	Pre-provisioned Persistent VM Nodes + NFS Cache	Minimizes code checkout latency for large internal repositories.
High-security compliance projects	Ephemeral VMs + Private VPC Proxy	Prevents data exfiltration and blocks unauthorized external connections.

🧪 Interview Delivery Example

In a system design interview, structured presentation demonstrates seniority. Below is a 45-minute breakdown showing how to communicate this coding agent design.

1. Requirements & Scope Definition (Minutes 0 - 5)

Clarify boundaries: Ask if the interviewer wants to cover sandbox orchestration, model fine-tuning, or IDE integration. Focus on sandbox lifecycle and reliability loops.
Define scale targets: Establish that the system must support 10k concurrent active agent runs, with secure sandboxing and low-latency workspace syncing.

2. High-Level Blueprint (Minutes 5 - 15)

Draw the core split: the Agent Orchestrator (state, planning) and the Sandbox Supervisor (virtualization execution).
Detail API contracts, PostgreSQL schemas, and Redis cache keys used to prevent duplicate executions.

3. Deep Dive (Minutes 15 - 35)

Secure execution: Explain why standard containers are unsafe and how Firecracker MicroVMs solve kernel sharing issues.
File synchronization: Detail Virtio-FS memory-mapped workspace directories and how copy-on-write overlays keep sandbox startup under 150ms.
Planning-execution math model: Walk through the state-transition equation to illustrate loop safety.

4. Failure Modes & Edge Cases (Minutes 35 - 45)

Discuss runaway loop mitigation, disk space caps, network exfiltration prevention, and how to scale step-history writes to the state database.

🛠️ Open-Source Frameworks: How Open-Source Solves Sandboxing

Enterprise platforms draw patterns from active open-source agent environments:

OpenHands (formerly OpenDevin): Runs agent actions inside Docker sandbox containers. It mounts the local workspace directory into the container, enabling real-time terminal output rendering.
SWE-bench: A benchmark harness that provisions isolated virtual environments to test agents on real GitHub issues.
LocalSandbox: An open-source lightweight microVM controller interface that abstracts Firecracker API configurations, enabling instant VM initialization.

📚 Lessons Learned: Production Realities

Git Loop Lockups: When compiler checks fail, agents often attempt to revert commits, creating circular commit-and-revert loops. Implement git-ref guardrails that block pushes unless the branch history is clean.
Silent Dependency Poisoning: If an agent downloads packages directly from external registries during compile phases, it can pull compromised dependencies. Force all sandboxes to read from a sanitized internal private package registry mirror.

📌 Key Takeaways

Compute Isolation: Never run untrusted code on shared hosts; decouple orchestrators from compute execution using virtualization-level microVMs.
State Persistency: Save execution metrics and history traces at each step transition boundary to enable crash recovery.
Design Over Code: High-level design interviews focus on component interaction, data schemas, and trade-offs rather than application code.
Virtio-FS Scaling: Optimize file sync times between supervisor nodes and VMs using memory-mapped filesystem overlays.
