Your Proxy Is Probably Your Bottleneck
Here’s something nobody tells you when you start building autonomous AI agents: your LLM proxy is likely the weakest link in your entire stack.
Not the model. Not your prompt engineering. The proxy—the thing that sits between your code and the provider API, quietly adding latency to every single request.
I’ve been running heavy parallel agent workloads for months now. Dozens of autonomous coding assistants hitting APIs simultaneously, streaming responses, retrying failures. And I kept running into the same wall: my proxy infrastructure would buckle before the upstream providers even broke a sweat.
So I did what any reasonable engineer would do. I went digging for hard numbers instead of marketing pages. What I found was eye-opening.
The Numbers Don't Lie
Let’s start with some cold, hard data. No marketing fluff—just measurements from independent load tests, Ferro Labs’ k6 study, and Kong’s own published benchmarks.
LiteLLM: The Popular Choice That Hits a Ceiling
LiteLLM is fantastic for prototyping. It supports 100+ providers, the Python ecosystem loves it, and the API is clean. But the production numbers paint a different picture:
- CPU-bound ceiling: In independent k6 testing, LiteLLM flatlined at roughly 175 RPS regardless of whether you threw 150 or 1,000 virtual users at it. The Python GIL simply won’t let it scale further on a single instance.
- Memory creep: RAM usage ranged from 335 MB to 1,124 MB in load tests. In one production postmortem on Reddit, memory climbed steadily from 3.2 GB until it hit the 8 GB OOM limit, over just two hours at 350 RPS.
- Tail latency explosion: At 500 RPS on a t3.medium, the P99 latency reached a staggering 90.72 seconds. That’s not a typo. Ninety seconds of waiting for a proxy response.
The Broader Landscape
| Gateway | Language | RPS (500 VU) | Memory | Proxy Overhead |
|---|---|---|---|---|
| Kong AI | Lua/Go | ~8,133 | 43 MB | ~150 µs |
| Bifrost | Go | ~2,441* | 120 MB | ~59 µs |
| Portkey | Node.js | ~855 | N/A (hosted) | ~20–40 ms |
| LiteLLM | Python | ~175 | 335–1,124 MB | ~40–200 ms |
*Bifrost showed connection-pool starvation issues above 300 VU in one independent test.
Source data: Ferro Labs benchmark — all tested against the same 60ms mock upstream.
The pattern is crystal clear. Compiled languages (Go, Lua, Rust) operate in microsecond overhead territory. Interpreted languages (Python, Node.js) operate in millisecond territory. That’s three orders of magnitude of difference.
Why Microseconds Matter for Agents
“So what? A few milliseconds won’t kill me.”
Actually, it will. Think about how modern AI agents work. A single task might chain 10 to 15 sequential LLM calls—planning, tool use, reflection, correction, final output. If your proxy adds 40 ms of overhead per hop (a realistic number for LiteLLM with logging and retries enabled), that’s 400–600 ms of dead time per agent task, purely from infrastructure friction.
At scale, across hundreds of parallel agents, that dead time compounds into real dollars and real latency that your users feel.
A Go-based proxy adding 11 µs per hop? That same 10-step chain costs you 0.11 ms total. The infrastructure becomes invisible.
Where SwitchAILocal Fits In
This is where I started testing SwitchAILocal—an open-source AI gateway written in pure Go.
It’s not trying to be LiteLLM. It’s not a massive multi-tenant SaaS orchestrator with Redis clustering and Prisma ORM. It’s a bare-metal, single-binary proxy designed for one thing: routing your local AI traffic with as close to zero overhead as physically possible.
After running it under sustained parallel agent load for over 3 hours straight:
- Memory footprint: Stayed comfortably under 30 MB total. Compare that to LiteLLM’s 335–1,124 MB.
- Proxy overhead: Sub-5 ms on real traffic, including Lua hook execution and routing decisions. No GIL. No worker processes. Just Goroutines.
- Zero OOM events: Not once. Not even close. The Go garbage collector ran concurrent sweeps in under 1 ms without ever blocking the request path.
- Connection multiplexing: Thousands of warm HTTP connections reused through native `net/http` transport pooling. No `httpx` connection-pool bottleneck limiting you to 100 total / 10 per host like in Python.
Is it as feature-rich as LiteLLM? Not yet. It doesn’t have 100+ provider adapters or a managed cloud dashboard. But for local-first developers running autonomous agents, the raw performance advantage is undeniable.
Join Us
SwitchAILocal is fully open-source and actively looking for contributors. If you’re a Go developer, an infrastructure nerd, or you’ve ever watched your Python proxy eat 8 GB of RAM and wished there was a better way—come take a look.
- 📚 Documentation: ail.traylinx.com/introduction
- 💻 GitHub: github.com/traylinx/switchAILocal
Star it, fork it, break it, improve it. The AI infrastructure layer is still wide open—let’s build the right foundation.
Sebastian Schkudlara