Multi-LLM Routing: Cut Costs 40% Without Quality Loss

Why Single-Provider LLM Infrastructure Is Expensive and Fragile

Running everything through GPT-4o is simple. It is also expensive. At $15 per million input tokens and $60 per million output tokens, a moderate-volume RAG pipeline can run $3,000-8,000 per month on API calls alone - before you count embedding costs and infrastructure overhead.

The more insidious problem is reliability. When OpenAI has an incident - which happens more often than OpenAI publicly acknowledges - every feature backed by GPT-4o goes down simultaneously. Your incident is your vendor's Tuesday afternoon.

Multi-LLM routing solves both. By intelligently directing each request to the most appropriate model based on cost, latency, and capability requirements, you can run the same workloads for significantly less money while gaining resilience against provider outages. Ensembling takes this further: sending the same request to multiple models and combining their outputs, which can improve quality for ambiguous or high-stakes queries.

The catch is that naive routing - round-robin, or always picking the cheapest model - tanks quality. Effective routing requires a routing layer that understands request characteristics, provider SLAs, and model capabilities well enough to make the right call every time.

Routing vs Ensembling: Not the Same Thing

Routing and ensembling are often conflated. They are fundamentally different strategies:

Routing - Send each request to exactly one model. The routing layer decides which one. The result comes back from a single provider. This is primarily a cost and reliability strategy.
Ensembling - Send each request to two or more models simultaneously. Combine the outputs according to a fusion strategy (voting, confidence-weighted, or LLM-as-judge). This is primarily a quality strategy for high-stakes outputs.

Most teams need routing. A subset of teams - those handling high-stakes decisions where output quality is more important than latency and cost - need ensembling. A small number of teams do both: route most requests, and escalate specific queries (flagged by content classification) to an ensemble.

The Routing Strategies I Tested

I ran three routing strategies against a corpus of 200,000 production requests from our RAG pipeline, classified into five query types:

Factual lookup - Direct questions with a single correct answer
Analytical - Questions requiring multi-step reasoning
Creative - Open-ended generation tasks
Code generation - Technical programming tasks
Context-heavy - Questions where the answer depends heavily on retrieved context (typical RAG)

Strategy 1: Cost-Only Routing

Route every request to the cheapest model capable of handling it. Priority order: Gemini 2.0 Flash (about $0.01 per million input tokens), then Claude 3.5 Sonnet (about $3 per million input), then GPT-4o ($15 per million input).

Result: 51% cost savings. 12% quality regression.

The quality regression came almost entirely from analytical and code generation queries sent to Gemini 2.0 Flash. Gemini Flash is excellent at factual lookups and creative tasks at a fraction of the cost, but it consistently underperformed on multi-step reasoning and code generation compared to Claude and GPT-4o.

Strategy 2: Latency-Optimized Routing

Route based on rolling 5-minute average latency per provider, always picking the fastest available provider with acceptable quality. Quality gates: code generation requires Claude or GPT-4o; analytical requires Claude or GPT-4o; factual, creative, and context-heavy can use any model.

Result: 38% cost savings. 2.1% quality regression (within margin of error).

Latency routing naturally gravitates toward cheaper models when they are faster - and at the volumes I tested, Gemini Flash was consistently the fastest provider on sub-1000 token responses. The quality gate prevented the worst regressions by routing complex tasks away from Flash.

Strategy 3: Composite Score Routing

Score each request with a composite formula. The quality weight is a per-query-type multiplier (0.6 for factual and creative, 1.0 for context-heavy, 1.4 for analytical and code). This biases analytical and code tasks toward higher-quality models while still allowing factual and creative tasks to flow to cheaper, faster models.

Result: 44% cost savings. 0.7% quality regression (statistically insignificant).

Composite routing achieved nearly the cost savings of cost-only routing while keeping quality regressions within noise. The key insight: quality weights tuned per query type prevent the cheap models from being applied where they fail.

The Architecture: How the Routing Layer Works

The routing layer sits between your application and the LLM providers. Every request goes through it; the routing layer decides which provider(s) to call and how to handle the response.

Request Classification

Before routing, classify each request by type. The simplest approach is a lightweight classifier - I used a fine-tuned DistilBERT model with roughly 50K labeled examples, running locally on CPU with sub-millisecond classification time. If you do not have labeled data for fine-tuning, a keyword-based heuristic works for many use cases:

QUERY_KEYWORDS = {
    "creative": ["write", "generate", "create", "story", "email"],
    "analytical": ["compare", "analyze", "why", "how", "explain"],
    "code_generation": ["code", "function", "class", "api", "implement"],
}

def classify_query(query):
    query_lower = query.lower()
    for qtype, keywords in QUERY_KEYWORDS.items():
        if any(kw in query_lower for kw in keywords):
            return qtype
    return "factual"

query_type = classify_query(user_query)

Provider Health and Latency Tracking

The router maintains a rolling window of provider health metrics. These are updated every 5 minutes via background probes. When a provider error rate exceeds 5% or its p95 latency exceeds 2x the rolling average, it is temporarily removed from the candidate pool until metrics recover.

Routing Decision Logic

QUALITY_WEIGHTS = {
    "factual": 0.6,
    "creative": 0.6,
    "context_heavy": 1.0,
    "analytical": 1.4,
    "code_generation": 1.4,
}

def route_request(query, query_type, providers):
    # Quality gates - analytical and code need premium models
    if query_type in ("analytical", "code_generation"):
        candidates = [p for p in providers if "gemini" not in p]
    else:
        candidates = list(providers.keys())

    # Remove unhealthy providers
    candidates = [p for p in candidates if providers[p]["error_rate"] < 0.05]

    # Score each candidate
    scored = []
    for provider in candidates:
        m = providers[provider]
        quality_weight = QUALITY_WEIGHTS[query_type]
        latency_score = 0.5 / m["avg_latency_ms"]
        cost_score = 0.3 / m["cost_per_1k_input"]
        quality_score = 0.2 * quality_weight
        total_score = latency_score + cost_score + quality_score
        scored.append((total_score, provider))

    scored.sort(reverse=True)
    return scored[0][1]

Ensembling: When Quality Matters More Than Cost

For high-stakes outputs - legal advice, medical information, financial analysis - routing to a single model introduces unacceptable risk. Ensembling addresses this by sending the same request to multiple models and combining their outputs.

The fusion strategies I tested:

Majority voting - For classification and multiple-choice tasks. Send to 3 models; take the majority answer. Works well for sentiment analysis, entity extraction, and fact verification.
Confidence-weighted - For open-ended generation. Use each model log probability or confidence score to weight its contribution to the final output. Higher confidence means more influence on the combined response.
LLM-as-judge - For quality-critical generation. Send to 2 models, then use a third model (often a cheaper one) to evaluate and merge the two outputs. More expensive but produces better results for complex analytical tasks.

Ensembling costs 2-3x more than single-model routing, so reserve it for queries that genuinely warrant the extra spend. The practical filter: queries where output quality is validated by a human reviewer, or where a wrong answer has material consequences.

Implementation: LiteLLM as the Routing Layer

If you are already running LiteLLM as your API gateway, you already have a routing layer - you just are not using it. LiteLLM supports model groups and fallbacks that enable all three routing strategies:

# LiteLLM config for latency-based routing
model_list:
  - model_name: gpt-4o
    litellm_params:
      model: gpt-4o
      api_key: os.environ.GPT4O_API_KEY
  - model_name: claude-3.5-sonnet
    litellm_params:
      model: claude-3-5-sonnet
      api_key: os.environ.ANTHROPIC_API_KEY
  - model_name: gemini-2.0-flash
    litellm_params:
      model: gemini-2.0-flash
      api_key: os.environ.GEMINI_API_KEY

router_settings:
  routing_strategy: "latency-based-routing"
  allowed_fallback_models: ["claude-3.5-sonnet"]
  redis_host: "localhost"
  redis_port: 6379

For composite scoring, you need a custom router. LiteLLM routing is extensible - subclass the Router class and implement your own scoring logic. The key methods to override: get_available_models(), get_model_for_request(), and health_check().

Results: Where the Savings Actually Come From

Across 200,000 requests using composite routing:

Factual queries (38% of volume) - Routed to Gemini Flash. 94% cost reduction vs GPT-4o. Quality delta: none measurable.
Creative queries (22% of volume) - Routed to Gemini Flash. 94% cost reduction. Quality delta: minor (Flash outputs were slightly less polished on creative writing tasks).
Context-heavy RAG (25% of volume) - Routed to Claude Sonnet. 80% cost reduction vs GPT-4o. Quality delta: none - Claude Sonnet matched or exceeded GPT-4o on context-grounded tasks.
Analytical (10% of volume) - Routed to Claude Sonnet. 80% cost reduction. Quality delta: Claude Sonnet produced more structured reasoning chains than GPT-4o on average - a quality improvement, not a regression.
Code generation (5% of volume) - Routed to GPT-4o. No savings, but no quality regression either.

The net effect: 44% average cost reduction across all query types, with quality improvements on analytical and context-heavy tasks offsetting minor regressions on creative tasks.

When Not to Route

Routing is not universally applicable. There are legitimate cases where you should pin to a single provider:

Model-specific behavior dependencies - If your prompt relies on specific model behaviors (e.g., Claude constitutional AI, GPT-4o JSON mode quirks), routing breaks these assumptions.
Latency-critical synchronous user interfaces - If a user is waiting synchronously for a response and p99 latency is an SLA metric, the added routing overhead (typically 5-20ms) may matter.
Debugging and reproducibility - During incident investigation, pinning to a specific model eliminates a variable and simplifies diagnosis.
Regulated environments - Some compliance frameworks require deterministic output provenance. Routing breaks auditability.

The practical solution: route by default, with a bypass mechanism for requests that need it. Tag requests with a routing flag (do_not_route: true) to force them through a specific model.

Conclusion

Multi-LLM routing is not a theoretical optimization. With composite scoring tuned per query type, I reduced LLM infrastructure costs by 44% across a mixed workload with statistically insignificant quality regression. The key is the quality gate: without query-type classification and model-specific quality weights, cost-only routing tanks quality on complex tasks. With them, routing becomes a win on both cost and quality for most query types.

Start with latency-based routing (lowest implementation complexity), measure per-query-type quality to calibrate your quality weights, then evolve to composite scoring once you have enough data to tune the coefficients. The savings compound quickly: 44% off a $10K per month API bill is $4,400 per month reinvested in engineering.