Tags: ai-infrastructure, open-source

I Stress-Tested 5 AI Gateways. Only One Didn't Choke.

Sebastian Schkudlara · Mar 27, 2026 · 4 min read

Your Proxy Is Probably Your Bottleneck

Here’s something nobody tells you when you start building autonomous AI agents: your LLM proxy is likely the weakest link in your entire stack.

Not the model. Not your prompt engineering. The proxy—the thing that sits between your code and the provider API, quietly adding latency to every single request.

I’ve been running heavy parallel agent workloads for months now. Dozens of autonomous coding assistants hitting APIs simultaneously, streaming responses, retrying failures. And I kept running into the same wall: my proxy infrastructure would buckle before the upstream providers even broke a sweat.

So I did what any reasonable engineer would do. I went digging for hard numbers instead of marketing pages. What I found was eye-opening.


The Numbers Don't Lie

Let’s start with cold, hard data. No marketing fluff, just measurements from independent load tests, including Ferro Labs’ k6 study and Kong’s own published benchmarks.

LiteLLM is fantastic for prototyping. It supports 100+ providers, the Python ecosystem loves it, and the API is clean. But the production numbers paint a different picture:

  • CPU-bound ceiling: In independent k6 testing, LiteLLM flatlined at roughly 175 RPS regardless of whether you threw 150 or 1,000 virtual users at it. The Python GIL simply won’t let it scale further on a single instance.
  • Memory creep: RAM usage ranged from 335 MB to 1,124 MB in load tests. In one production postmortem on Reddit, memory climbed steadily from 3.2 GB until it hit the 8 GB OOM limit, over just two hours at 350 RPS.
  • Tail latency explosion: At 500 RPS on a t3.medium, the P99 latency reached a staggering 90.72 seconds. That’s not a typo. Ninety seconds of waiting for a proxy response.

The Broader Landscape

Gateway   Language   RPS (500 VU)   Memory          Proxy Overhead
Kong AI   Lua/Go     ~8,133         43 MB           ~150 µs
Bifrost   Go         ~2,441*        120 MB          ~59 µs
Portkey   Node.js    ~855           N/A (hosted)    ~20–40 ms
LiteLLM   Python     ~175           335–1,124 MB    ~40–200 ms

*Bifrost showed connection-pool starvation issues above 300 VU in one independent test.

Source data: Ferro Labs benchmark — all tested against the same 60ms mock upstream.

The pattern is crystal clear. Compiled-language gateways (Go, Lua, Rust) operate in microsecond overhead territory; interpreted ones (Python, Node.js) operate in milliseconds. That’s roughly three orders of magnitude of difference.


Why Microseconds Matter for Agents

“So what? A few milliseconds won’t kill me.”

Actually, they will. Think about how modern AI agents work. A single task might chain 10 to 15 sequential LLM calls: planning, tool use, reflection, correction, final output. If your proxy adds 40 ms of overhead per hop (a realistic figure for LiteLLM with logging and retries enabled), that’s 400–600 ms of dead time per agent task, purely from infrastructure friction.

At scale, across hundreds of parallel agents, that dead time compounds into real dollars and real latency that your users feel.

A Go-based proxy adding 11 µs per hop? That same 10-step chain costs you 0.11 ms total. The infrastructure becomes invisible.


Where SwitchAILocal Fits In

This is where I started testing SwitchAILocal—an open-source AI gateway written in pure Go.

It’s not trying to be LiteLLM. It’s not a massive multi-tenant SaaS orchestrator with Redis clustering and Prisma ORM. It’s a bare-metal, single-binary proxy designed for one thing: routing your local AI traffic with as close to zero overhead as physically possible.

After running it under sustained parallel agent load for over 3 hours straight:

  • Memory footprint: Stayed comfortably under 30 MB total. Compare that to LiteLLM’s 335–1,124 MB.
  • Proxy overhead: Sub-5 ms on real traffic, including Lua hook execution and routing decisions. No GIL. No worker processes. Just Goroutines.
  • Zero OOM events: Not once. Not even close. The Go garbage collector ran concurrent sweeps in under 1 ms without ever blocking the request path.
  • Connection multiplexing: Thousands of warm HTTP connections reused through native net/http transport pooling, with no client-side pool ceiling of the kind Python HTTP clients impose by default (httpx, for instance, caps at 100 connections total out of the box).

Is it as feature-rich as LiteLLM? Not yet. It doesn’t have 100+ provider adapters or a managed cloud dashboard. But for local-first developers running autonomous agents, the raw performance advantage is undeniable.


Join Us

SwitchAILocal is fully open-source and actively looking for contributors. If you’re a Go developer, an infrastructure nerd, or you’ve ever watched your Python proxy eat 8 GB of RAM and wished there was a better way—come take a look.

Star it, fork it, break it, improve it. The AI infrastructure layer is still wide open—let’s build the right foundation.

Bridging Architecture & Execution

Struggling to implement Agentic AI or Enterprise Microservices in your organization? I help CTOs and technical leaders transition from architectural bottlenecks to production-ready systems.

Hi, I am Sebastian Schkudlara, the author of Jevvellabs. I hope you enjoy my blog!