Routing should feel instantaneous. If you have to wait for your router to “think,” you’ve already lost the game.
The mandate for Cortex Router Phase 2 was simple but brutal: Route intelligently in under 20 milliseconds.
We didn’t want standard, brittle “keyword matching.” We wanted true Semantic Understanding. But we also couldn’t afford the massive latency of an LLM call for every single request.
The solution? A biological architecture. We mimicked the human brain’s own efficiency layers.
The 4 Tiers of Cognition
We built a tiered system where every request fights its way up the evolutionary ladder.
Tier 0: The Semantic Cache (< 1ms)
“Déjà vu.” Before doing any work, the system hits the vector cache. If we’ve seen a semantically similar prompt (Cosine Similarity > 0.95), we don’t think. We react. We replay the exact successful routing decision from last time.
- Latency: effectively zero — a hit replays the cached decision in under a millisecond.
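The cache lookup can be sketched roughly like this. The `RoutingDecision` shape, the `cacheEntry` layout, and the linear scan are illustrative assumptions — a production cache would sit behind Redis or an ANN index rather than scanning a slice:

```go
package main

import "math"

// RoutingDecision is what a cache hit replays (hypothetical shape).
type RoutingDecision struct {
	Provider string
	Model    string
}

type cacheEntry struct {
	vec      []float32
	decision RoutingDecision
}

// SemanticCache stores prompt embeddings alongside past routing decisions.
type SemanticCache struct {
	entries   []cacheEntry
	threshold float32 // 0.95 per the tier description
}

// cosine computes cosine similarity; assumes equal-length vectors.
func cosine(a, b []float32) float32 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return float32(dot / (math.Sqrt(na) * math.Sqrt(nb)))
}

// Lookup scans cached embeddings; any match above the threshold
// short-circuits the rest of the routing pipeline.
func (c *SemanticCache) Lookup(promptVec []float32) (RoutingDecision, bool) {
	for _, e := range c.entries {
		if cosine(promptVec, e.vec) > c.threshold {
			return e.decision, true
		}
	}
	return RoutingDecision{}, false
}

// Store records a decision for replay on future similar prompts.
func (c *SemanticCache) Store(vec []float32, d RoutingDecision) {
	c.entries = append(c.entries, cacheEntry{vec, d})
}
```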
Tier 1: The Reflex Tier (< 1ms)
“The Reptilian Brain.” If the cache misses, we drop to optimized Regex patterns for immediate safety.
- Security: Catches PII (API keys, email addresses) before they ever hit a model.
- Safety: Spots massive binary pastes that would choke an LLM.
- Overrides: Respects explicit `model="gpt-4"` demands from the user.
This is the fail-safe. If everything else burns, the Reflex Tier still keeps the traffic moving.
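A minimal sketch of the reflex checks, assuming hypothetical patterns — the key/email regexes, the override syntax, and the binary-paste heuristic here are illustrative, not the production rule set:

```go
package main

import "regexp"

// Illustrative reflex patterns; real coverage would be far broader.
var (
	reAPIKey   = regexp.MustCompile(`\bsk-[A-Za-z0-9]{20,}\b`) // e.g. OpenAI-style keys
	reEmail    = regexp.MustCompile(`[\w.+-]+@[\w-]+\.[\w.]+`)
	reOverride = regexp.MustCompile(`model\s*=\s*"([^"]+)"`)
)

type ReflexResult struct {
	PIIDetected   bool
	ForcedModel   string // non-empty if the user demanded a specific model
	BinaryBlocked bool
}

// Reflex runs the sub-millisecond checks: PII, binary pastes, explicit overrides.
func Reflex(prompt string) ReflexResult {
	r := ReflexResult{
		PIIDetected: reAPIKey.MatchString(prompt) || reEmail.MatchString(prompt),
	}
	if m := reOverride.FindStringSubmatch(prompt); m != nil {
		r.ForcedModel = m[1]
	}
	// Crude binary-paste heuristic: a high ratio of non-printable bytes.
	nonPrintable := 0
	for _, b := range []byte(prompt) {
		if b < 9 || (b > 13 && b < 32) {
			nonPrintable++
		}
	}
	r.BinaryBlocked = len(prompt) > 0 && nonPrintable*10 > len(prompt)
	return r
}
```

Because this is plain RE2 over the raw prompt, it stays under a millisecond regardless of prompt content — which is exactly what a fail-safe tier needs.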
Tier 2: The Semantic Tier (< 20ms)
“The Instinct.” This is the breakthrough. We run a quantised embedding model via ONNX Runtime directly inside the Go binary. We vectorise your prompt and compare it against a pre-computed registry of 15+ Intents (Coding, Creative, Reasoning) and Skills (SQL Optimization, Go Refactoring).
If we find a high-confidence match (> 0.85), we bypass the classification LLM entirely.
- “Write a Python script…” -> Vector Match: CODING -> Route to Claude 3.5 Sonnet.
- Cost: $0.00. Time: 12ms.
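The match-and-bypass step might reduce to something like this. The `Intent` registry shape and `MatchIntent` helper are assumed names; in the real system the centroids would be pre-computed by the ONNX embedding model:

```go
package main

import "math"

// Intent pairs a label with a pre-computed centroid embedding
// (hypothetical registry shape).
type Intent struct {
	Label    string
	Centroid []float32
}

// cosine computes cosine similarity; assumes equal-length vectors.
func cosine(a, b []float32) float32 {
	var dot, na, nb float64
	for i := range a {
		dot += float64(a[i]) * float64(b[i])
		na += float64(a[i]) * float64(a[i])
		nb += float64(b[i]) * float64(b[i])
	}
	if na == 0 || nb == 0 {
		return 0
	}
	return float32(dot / (math.Sqrt(na) * math.Sqrt(nb)))
}

// MatchIntent returns the best intent only if it clears the confidence bar
// (0.85 per the article); otherwise the caller falls through to the
// Cognitive Tier.
func MatchIntent(promptVec []float32, registry []Intent, threshold float32) (string, bool) {
	best, bestScore := "", float32(-1)
	for _, in := range registry {
		if s := cosine(promptVec, in.Centroid); s > bestScore {
			best, bestScore = in.Label, s
		}
	}
	return best, bestScore >= threshold
}
```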
Tier 3: The Cognitive Tier (200ms+)
“The Prefrontal Cortex.” Only when the prompt is truly ambiguous do we wake up the Router LLM (Gemma 2 or Haiku). But we changed the game here too. We don’t just ask for a classification. We implemented a Verification Loop.
If the Router is unsure (Confidence < 0.60), it triggers a “Reasoning Check”—a secondary prompt to verify its own logic. “Measure twice, cut once.”
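The verification loop can be sketched as follows. The `Classifier` struct and its function fields are hypothetical stand-ins; in production they would wrap calls to Gemma 2 or Haiku:

```go
package main

// Classification is the Router LLM's answer (hypothetical shape).
type Classification struct {
	Intent     string
	Confidence float64
}

// Classifier abstracts the two LLM calls so the loop is testable.
type Classifier struct {
	Classify func(prompt string) Classification
	Verify   func(prompt string, first Classification) Classification // the "Reasoning Check"
}

// ClassifyWithVerification implements "measure twice, cut once": if the
// first pass scores below 0.60, a secondary prompt asks the model to
// verify its own logic.
func ClassifyWithVerification(c Classifier, prompt string) Classification {
	first := c.Classify(prompt)
	if first.Confidence >= 0.60 {
		return first
	}
	return c.Verify(prompt, first)
}
```

The design choice here is that the expensive second call only fires for the small fraction of prompts the router itself flags as uncertain.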
The Dynamic Matrix: Survival of the Fittest
Static config.yaml files are a relic.
The Intelligence Service runs a background process called the Capability Analyzer. It constantly tests your available providers.
- Is Ollama responding?
- Is the Gemini quota full?
- Is the latency on DeepSeek spiking?
It builds a Dynamic Matrix. When the router says “I need a Coder,” it doesn’t look at a stale config file. It asks the Matrix: “Who is the best available Coder on the network right now?”
If your primary model dies, the system doesn’t error out. It seamlessly degrades (or promotes) the next best “Coder” in the matrix. You never even notice the switch.
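Selection against the matrix might look like this. The `Provider` fields and the `BestFor` helper are illustrative assumptions about what the Capability Analyzer feeds into the matrix:

```go
package main

// Provider is one row of the Dynamic Matrix (hypothetical shape),
// kept fresh by the Capability Analyzer's background probes.
type Provider struct {
	Name      string
	Healthy   bool               // liveness probe passing?
	QuotaFull bool               // rate limit exhausted?
	Skills    map[string]float64 // capability -> fitness score
}

// BestFor picks the highest-scoring healthy provider for a capability,
// so a dead primary silently degrades to the next best candidate.
func BestFor(matrix []Provider, skill string) (Provider, bool) {
	var best Provider
	found := false
	for _, p := range matrix {
		if !p.Healthy || p.QuotaFull {
			continue
		}
		score, ok := p.Skills[skill]
		if !ok {
			continue
		}
		if !found || score > best.Skills[skill] {
			best, found = p, true
		}
	}
	return best, found
}
```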
Data Structures
For the engineers, here’s what the brain looks like in Go:
```go
type IntelligenceService struct {
	discovery  *DiscoveryService
	capability *CapabilityAnalyzer
	matrix     *DynamicMatrixBuilder
	embedding  *EmbeddingEngine // ONNX Runtime
	semantic   *SemanticTier    // Vector logic
	cache      *SemanticCache   // Redis / in-memory
	confidence *ConfidenceScorer
	verifier   *Verifier
}
```
This isn’t just routing. It’s orchestration. It’s the difference between a switchboard operator and a traffic controller.
Sebastian Schkudlara