The Failure Mode Nobody Benchmarks
Every AI gateway benchmark you’ll find online measures the same thing: how fast can the proxy shuffle requests under ideal conditions? RPS, P50 latency, memory at peak. All useful numbers, sure.
But here’s what none of those benchmarks test: what happens when the upstream provider just stops responding?
Not a clean error. Not a 429 rate limit. Not a 500 server error. The provider simply… hangs. The TCP connection stays open. No response ever arrives. Your proxy sits there, holding the socket, waiting.
This is arguably the most dangerous failure mode in AI infrastructure, and it's surprisingly common. Model providers have cold-start delays, GPU queue buildups, and regional outages that can stall individual requests for minutes at a time.
I ran into this exact scenario in production last week and wanted to share what happened—because the outcome genuinely surprised me.
Why Hanging Connections Kill Servers
To understand why this matters, you need to think about what a proxy does with a hanging connection.
In a Python-based proxy like LiteLLM, each active request occupies resources in the async event loop—memory buffers, HTTP client connection slots, internal routing state. LiteLLM’s httpx client defaults to a pool capped at 100 concurrent connections (20 of them kept alive between requests). When connections hang, those slots fill up. New requests queue internally, waiting for a slot to open.
This is how you get the scary numbers people report on Reddit: LiteLLM climbing from 3.2 GB to 8 GB over two hours before hitting an OOM crash, or P99 latencies spiking to 90.72 seconds at 500 RPS.
In the worst case, one hanging upstream connection blocks the entire async event loop—which is exactly what happens with synchronous stream iterators (GitHub Issue #20268). A single frozen Bedrock request can freeze every other concurrent request, health check, and background task in the process.
The Real Incident
I was running SwitchAILocal as my local proxy with multiple autonomous coding agents hammering it in parallel. The upstream provider was MiniMax, and the traffic was heavy—constant parallel /v1/chat/completions requests.
Then this showed up in the logs:
[11:02:28] [1b8b85b2] [error] 504 | 10m0s | POST "/v1/chat/completions"
A full 10-minute TCP hang before the provider finally returned a 504 Gateway Time-out. Ten minutes of a connection sitting there, doing nothing, consuming resources.
In a naive proxy, this is where the cascade begins. But SwitchAILocal has a built-in Circuit Breaker—a state machine (Closed → Open → Half-Open) that monitors provider health in real time.
Here’s what the telemetry showed:
1. Instant circuit trip. The moment that 504 landed, the circuit breaker flipped to OPEN for the MiniMax provider. No human intervention. No configuration change. Automatic.
2. Load shedding in milliseconds. Over the next 15 seconds, three more parallel agents tried to push requests through:
[11:02:31] [ec79b99e] 500 | 24ms ← intercepted locally
[11:02:35] [79ca8f1e] 500 | 24ms ← intercepted locally
[11:02:43] [71779012] 500 | 29ms ← intercepted locally
None of those requests were forwarded to the provider. SwitchAILocal rejected them locally in under 30 milliseconds, returning clean 500 errors so the agents could gracefully retry or fail.
3. Automatic recovery. Exactly one minute later, the circuit breaker shifted to Half-Open, allowing a single test request through:
[11:03:28] [e56a13ed] Use API key sk-c...TUf0
[11:03:45] [e56a13ed] 200 | 46.379s ← provider is back
The test request succeeded. The circuit closed. Traffic resumed. Total downtime for the proxy: zero.
It Happened Again 30 Minutes Later
The best part? The exact same scenario repeated at 11:33:12. Another 10-minute upstream hang. And the circuit breaker handled it identically—but even faster this time:
- Load shedding at 23ms, 14ms, and 18ms (faster than the first incident because memory caches were warmed)
- Automatic recovery one minute later with a clean 200 OK
This isn’t a fluke. It’s a deterministic state machine doing exactly what it was designed to do, over and over, without human intervention.
Why This Matters for Your Stack
The difference between “my server survived” and “my server crashed at 3 AM and nobody noticed until morning” often comes down to one thing: does your proxy have a real circuit breaker, or is it just blindly forwarding traffic into a black hole?
Most Python-based gateways don’t have built-in circuit breaking at the provider level. You’d need to add it yourself—wrapping calls in tenacity or circuitbreaker libraries, managing state across async workers, dealing with race conditions in multi-process deployments. It’s doable but fragile.
SwitchAILocal handles this natively, in compiled Go, with lock-free atomic state transitions that don’t add measurable overhead to the happy path.
If you’re running autonomous agents in production (or planning to), circuit breaking isn’t optional. It’s the difference between enterprise-grade reliability and a ticking time bomb.
- 📚 Documentation: ail.traylinx.com/introduction
- 💻 GitHub: github.com/traylinx/switchAILocal
The project is open-source and actively looking for contributors. Come build resilient AI infrastructure with us.
Sebastian Schkudlara