Building a Sandbox for AI Agents in YOLO Mode


Last week, I wrote about building a robust environment for horizontally scaling autonomous coding. The bottleneck quickly became me: I simply couldn’t keep up with the permission prompts the agents were generating. Today, I’d like to expand on how I’m addressing this.

Autonomous agents: safe until they’re not

An autonomous agent trapped in a box is not a threat on its own. It can’t be: nothing gets in or out. This is the principle behind air gaps and some early AI safety concepts. It follows that, even if this agent has sensitive information, risk only arises to the extent that the information can leave the system.

In a now-famous blog post, Simon Willison identified the so-called “lethal trifecta.” An autonomous agent becomes dangerous if (and only if) it has three things at once: sensitive data, untrusted information, and external communication.

The bad news is that most use cases require some degree of all three. The good news is that there’s an implicit fourth condition: a lack of supervision. That’s the one we want to play with, because it’s the only one that’s negotiable.

AI agents invert the cost of risk mitigation

The obvious solution to supervision is, you know, to supervise. The AI says “mind if I do so-and-so?” and you get to decide if this so-and-so is alright. The only failure mode is you. We’ve lived with this threat model since the first beasts of burden.

The over-engineered solution is to build a custom semantic layer that rigidly screens for the kinds of stuff you don’t want to travel across the boundary, and clamps down hard whenever it sees something sketchy. That kind of layer has historically been available only to enterprises, and at enterprise prices.

But the concept of “over-engineering” presupposes that engineering labor is scarce. In fact, it no longer is; what’s scarce is supervision coupled to judgment. And spending that scarce resource deciding whether to fetch an API spec is incredibly wasteful.

Delegation is the point

The entire value proposition of an autonomous agent is that it does things without you. If you’re approving every action, you have a fancy autocomplete with extra steps.

So we want to reduce supervision. But we can’t eliminate it; the lethal trifecta doesn’t go away just because we’re busy. What we can do is delegate it.

Traditional agent supervision puts a human in the loop for every sensitive operation. Command execution? Prompt. Network request? Prompt. File write? Prompt. The human is the gating mechanism. This works, but it doesn’t scale. Three agents generating prompts across three sessions and you’re playing whack-a-mole. Ten agents is not a thing you can do.

The alternative is to replace the human gate with an algorithmic one. Container isolation handles system operations: the agent can do what it wants, but only inside a locked-down box. Egress filtering handles network operations: the agent can call out, but a proxy scans every request for secrets before it leaves. The gating still happens; the human just isn’t doing it anymore.

What the human does instead is review work product. Not “may I run this command” but “here’s the branch I built.” The supervision moves up the stack.

Three layers

I’m running Claude Code in what Anthropic calls “YOLO mode”: no permission prompts, no guardrails, no safety net. This is typically used for throwaway experiments where you don’t care what happens. I’m using it for real work on a real codebase. The difference is containment.

Layer 1: Secrets hygiene. Treat your development environment as pre-compromised. This is good practice regardless of AI; secrets should never be in your codebase, not even in gitignored dotfiles. Use a secret store. Separate dev and prod credentials. Your AI doesn’t need prod secrets. It probably doesn’t need most secrets, either.
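
As a sketch of what that looks like in practice (placeholder names, not the yolo-cage code): credentials arrive through the environment at runtime, injected by whatever secret store you use, and anything missing fails loudly instead of silently falling back to a dotfile.

```python
import os

def require_secret(name: str) -> str:
    """Fetch a credential injected at runtime by the secret store.

    Hypothetical helper: the environment-variable names below are
    placeholders, not part of yolo-cage.
    """
    value = os.environ.get(name)
    if value is None:
        # Fail loudly instead of quietly reading a dotfile.
        raise RuntimeError(f"missing secret {name}; is the secret store wired up?")
    return value

# Dev-scoped credentials only; the agent's environment never sees prod keys.
GITHUB_TOKEN = require_secret("DEV_GITHUB_TOKEN")
DATABASE_URL = require_secret("DEV_DATABASE_URL")
```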

Layer 2: Container isolation. The agent runs in a Kubernetes pod with restricted privileges: non-root user, isolated filesystem, no host access. If the agent goes sideways, the blast radius stops at the container wall. This is standard practice for untrusted workloads, and an AI agent is an untrusted workload.
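
For illustration, here is roughly what that pod spec looks like when expressed with the official Kubernetes Python client. The image name, user ID, and labels are placeholders, not the actual yolo-cage manifest.

```python
from kubernetes import client

# A sketch of a locked-down agent pod: non-root, no privilege escalation,
# read-only root filesystem, scratch space that dies with the pod.
agent_pod = client.V1Pod(
    metadata=client.V1ObjectMeta(name="claude-agent", labels={"app": "yolo-cage"}),
    spec=client.V1PodSpec(
        automount_service_account_token=False,  # no cluster credentials inside the box
        host_network=False,
        containers=[
            client.V1Container(
                name="agent",
                image="yolo-cage/agent:latest",  # placeholder image
                security_context=client.V1SecurityContext(
                    run_as_non_root=True,
                    run_as_user=1000,
                    allow_privilege_escalation=False,
                    read_only_root_filesystem=True,
                    capabilities=client.V1Capabilities(drop=["ALL"]),
                ),
                volume_mounts=[
                    client.V1VolumeMount(name="workspace", mount_path="/workspace"),
                ],
            )
        ],
        volumes=[
            client.V1Volume(
                name="workspace",
                empty_dir=client.V1EmptyDirVolumeSource(),  # ephemeral working copy
            )
        ],
    ),
)
```

Submitting it is a single create_namespaced_pod() call; a NetworkPolicy alongside it can force all egress through the proxy described next.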

Layer 3: Egress filtering. All HTTP/HTTPS traffic passes through a scanning proxy. The proxy uses LLM-Guard (built on detect-secrets) to identify API keys, tokens, and private keys in outbound requests. It also blocklists paste sites and file-sharing services. Clean traffic passes; dirty traffic gets a 403.
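
Conceptually, the proxy is a mitmproxy addon wrapped around LLM-Guard’s Secrets scanner. The sketch below assumes the scanner’s (sanitized, is_valid, risk_score) return convention and an illustrative blocklist; it shows the shape of the thing, not the exact code in the repo.

```python
from llm_guard.input_scanners import Secrets  # LLM-Guard's detect-secrets-based scanner
from mitmproxy import http

# Hosts where exfiltrated data tends to end up; illustrative, not exhaustive.
BLOCKED_HOSTS = {"pastebin.com", "transfer.sh", "file.io"}

class EgressFilter:
    """mitmproxy addon: block paste sites, scan outbound requests for secrets."""

    def __init__(self) -> None:
        self.scanner = Secrets()

    def request(self, flow: http.HTTPFlow) -> None:
        host = flow.request.pretty_host

        # Blocklist check: refuse known exfiltration sinks outright.
        if any(host == h or host.endswith("." + h) for h in BLOCKED_HOSTS):
            flow.response = http.Response.make(
                403, b"blocked host", {"Content-Type": "text/plain"}
            )
            return

        # Secret scan: URL plus body, since either can carry a credential.
        payload = flow.request.pretty_url + "\n" + (flow.request.get_text(strict=False) or "")
        _, is_clean, _ = self.scanner.scan(payload)
        if not is_clean:
            flow.response = http.Response.make(
                403, b"secret detected in outbound request", {"Content-Type": "text/plain"}
            )

addons = [EgressFilter()]
```

Run it under mitmdump with -s pointing at the addon file, and route the container’s HTTP/HTTPS traffic through the proxy.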

The lethal trifecta is still present: the agent holds secrets, pulls untrusted content from the internet, and can communicate outbound, all while executing arbitrary code. But there’s now an algorithmic layer where the human used to sit. The trifecta is supervised; I’m just not the one doing it.

What’s novel, what’s not

None of the components here are new. Container sandboxing is standard. MITM proxies for data loss prevention exist. Secret scanning in CI/CD is common (GitGuardian, TruffleHog, detect-secrets). Kubernetes NetworkPolicy is just Kubernetes.

What’s new is the application. These tools were built for different threat models: malicious insiders, accidental commits, compliance requirements. Applying them to AI agent containment is a reframing, not an invention.

What’s also new is the economics. Five years ago, I priced out AWS egress filtering appliances: tens of thousands a year, minimum. Now LLM-Guard is open source. mitmproxy is open source. Claude wired them together in an evening. The enterprise moat evaporated.

Known gaps

I’m not claiming this is airtight, and the gaps are worth enumerating.

SSH bypasses the proxy. Git operations go direct to GitHub, which means secrets could theoretically be exfiltrated via commit messages or branch names. Mitigation: deploy keys with minimal permissions, or switch to git-over-HTTPS. I’m living with the risk for now.

Prompt injection is a real threat. A malicious file could instruct the agent to exfiltrate data. The egress filter actually helps here: even if the agent tries to send secrets, they get blocked at the perimeter. But sophisticated attacks (DNS exfiltration, steganography, encoding secrets in URL paths) could slip through.

The system is fail-open by default. If LLM-Guard goes down, traffic passes through with a warning. I chose availability over security because this is a dev environment, and I’d rather debug a downed service than debug a hung agent. Flip one config value if your threat model differs.
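
A sketch of what that toggle amounts to, using a hypothetical FAIL_OPEN flag wrapped around the scanner call from the addon above:

```python
import logging

FAIL_OPEN = True  # hypothetical toggle; set False to block traffic when scanning fails

def scan_is_clean(scanner, payload: str) -> bool:
    """Wrap the secret scan so a scanner outage degrades per the chosen policy."""
    try:
        _, clean, _ = scanner.scan(payload)
        return clean
    except Exception:
        logging.warning("secret scanner unavailable; failing %s",
                        "open" if FAIL_OPEN else "closed")
        # Fail-open passes traffic with a warning; fail-closed blocks it.
        return FAIL_OPEN
```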

The result

I now run multiple Claude Code instances in parallel, each on its own feature branch, each in YOLO mode. I check in periodically to review what they’ve produced. The bottleneck moved from approving individual operations to reviewing completed work.

The implementation is at github.com/borenstein/yolo-cage. For the tmux and Tailscale setup that makes this practical on a remote server, see [link to Post 1].