Routing inference across LLM providers without breaking latency
2025-09-12 / 8 min / llm / infra / cost / latency
An orchestration layer that picks the right provider per request. 28% off infrastructure spend. Sub-second latency held. Caller code never changed.
Why single-provider setups stop scaling
Once an LLM product crosses meaningful traffic, a single-provider setup becomes the bottleneck for three reasons. Cost is set by a vendor who can re-price you on 30 days' notice. Throughput is rate-limited at the org level, so you fight every other tenant for headroom. Quality on your specific workload is rarely uniform across models, so you pay for the most capable model on requests that did not need it.
The naive answer is 'put a router in front'. That is the right answer. The implementation is where teams quietly burn weeks.
The orchestration layer
A thin Node.js service in front of OpenAI, Anthropic, and Gemini. Callers spoke a single internal request shape: prompt, tool definitions, expected output schema, latency budget, cost ceiling. Provider selection was opaque.
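A minimal sketch of that internal request shape, assuming TypeScript on the Node.js side. Field and type names here are illustrative, not the actual service's API:

```typescript
// Hypothetical internal request shape. Every caller speaks this;
// which provider serves it is invisible to them.
interface ToolDefinition {
  name: string;
  description: string;
  parameters: object; // JSON Schema for the tool's arguments
}

interface RouteRequest {
  prompt: string;
  tools?: ToolDefinition[];  // tool definitions the model may call
  outputSchema?: object;     // JSON Schema the response must satisfy
  latencyBudgetMs: number;   // caller's end-to-end latency budget
  costCeilingUsd: number;    // hard cap on spend for this request
}

const example: RouteRequest = {
  prompt: "Extract the invoice fields from this email.",
  outputSchema: { type: "object" },
  latencyBudgetMs: 800,
  costCeilingUsd: 0.01,
};
```

Because callers only ever see this shape, swapping or adding a provider behind it never forces a caller-side change.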
Selection ran a tiered policy. First, capability match: does this model support the tools and schema this caller needs? Second, rolling p95 latency over the last five minutes, per provider. Third, cost per expected token. The routing decision itself had a strict 50ms budget; anything that took longer fell back to the default route.
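The policy above can be sketched roughly as follows. The stats structure, scoring order, and provider names are assumptions for illustration, not the production implementation:

```typescript
// Illustrative sketch of the tiered selection policy.
interface ProviderStats {
  name: string;
  supportsTools: boolean;
  supportsJsonSchema: boolean;
  rollingP95Ms: number;        // p95 latency over the last five minutes
  costPer1kTokensUsd: number;  // proxy for cost per expected token
}

function selectProvider(
  providers: ProviderStats[],
  needsTools: boolean,
  needsSchema: boolean,
  defaultRoute: string,
  deadlineMs = 50, // budget on the routing decision itself
): string {
  const start = Date.now();

  // Tier 1: capability match — drop providers that can't serve this caller.
  const capable = providers.filter(
    (p) => (!needsTools || p.supportsTools) && (!needsSchema || p.supportsJsonSchema),
  );
  if (capable.length === 0) return defaultRoute;

  // Tiers 2 and 3: prefer lower rolling p95 latency, break ties on cost.
  capable.sort(
    (a, b) => a.rollingP95Ms - b.rollingP95Ms || a.costPer1kTokensUsd - b.costPer1kTokensUsd,
  );

  // Over the decision budget? Fall back to the default route.
  if (Date.now() - start > deadlineMs) return defaultRoute;
  return capable[0].name;
}
```

The key design point is the last guard: a slow routing decision is treated as a routing failure, so the 50ms budget can never leak into caller-visible latency.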
Failures within a tier fell through to a same-tier backup. Replays of the same request were idempotent against a content hash, so a 5xx never billed us twice.
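One way the content hash could work, as a sketch: canonicalise the request fields that define "the same request" and hash them, so a retry maps to the same key and can be served from a dedupe cache instead of re-billing. The exact fields included are an assumption here:

```typescript
import { createHash } from "node:crypto";

// Hypothetical idempotency key: identical request content yields an
// identical key, so a replay after a 5xx hits the dedupe cache.
function idempotencyKey(
  prompt: string,
  outputSchema: object | undefined,
  routeTier: string,
): string {
  // Fixed key order keeps the serialisation deterministic.
  const canonical = JSON.stringify({
    prompt,
    schema: outputSchema ?? null,
    tier: routeTier,
  });
  return createHash("sha256").update(canonical).digest("hex");
}

const k1 = idempotencyKey("summarise this thread", undefined, "tier-2");
const k2 = idempotencyKey("summarise this thread", undefined, "tier-2");
// k1 === k2: the replay is answered from cache, not billed again.
```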
Where the 28% savings came from
Most of the savings did not come from picking the cheapest model. They came from getting off the most-expensive default on traffic that did not need it. Internal-tool calls, summarisation, structured extraction with strict schemas: these went to cheaper models with no measurable quality drop. The expensive model held the long-tail reasoning workloads.
A smaller share came from killing duplicate calls during retries. The idempotency hash caught roughly 4% of total traffic that would otherwise have been double-billed.
Failure mode to watch for
Cross-provider behaviour is not uniform on edge cases. JSON-mode strictness, tool-call argument shapes, refusal patterns, token counting: these all vary. If your callers depend on undocumented quirks of one provider's output, the router will turn those bugs into intermittent flakes that look like infrastructure problems.
Add a normalisation layer between the router and the providers, and pin known-good provider versions per tier. Or you will spend the savings on debugging tickets that look like 'sometimes it returns weird JSON'.
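A sketch of what that normalisation layer might look like. The internal response shape and the fenced-JSON edge case are illustrative assumptions; real provider payloads differ in more ways than this:

```typescript
// Hypothetical internal response shape every provider adapter maps into.
interface NormalisedResponse {
  text: string;
  toolCalls: { name: string; args: object }[];
  finishReason: "stop" | "tool_call" | "refusal" | "length";
}

// One concrete divergence worth normalising: some models wrap
// JSON-mode output in a markdown code fence. Strip it before parsing
// so callers never see provider-specific quirks.
function extractJson(raw: string): object {
  const fenced = raw.match(/```(?:json)?\s*([\s\S]*?)```/);
  const body = fenced ? fenced[1] : raw;
  return JSON.parse(body.trim());
}
```

Each adapter does this kind of cleanup before the response crosses back over the router, which is what keeps caller-side bugs from turning into "sometimes it returns weird JSON" tickets.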