Inference API Gateways 2026: LiteLLM vs BentoML vs Ray Serve

Three months ago I watched a small team spend four weeks hand-rolling a provider router in Python. Three SDK integrations, two retry strategies, a spreadsheet for cost attribution. They swapped in LiteLLM and per-team cost attribution went from a Friday afternoon of grep work to a single SQL query. That moment is the gateway layer: the piece in front of your inference engine that most teams forget to design for until they are already drowning in SDK drift.

The StackPulsar articles cover the inference engine layer well — vLLM, SGLang, TGI, Ollama, Triton. What they do not cover is the gateway layer that sits in front: the piece that decides which model a request hits, counts tokens, enforces budget, and falls back when a provider 429s. Three open-source projects dominate in 2026: LiteLLM, BentoML, and Ray Serve. This is the comparison I wish I had when I started running multi-model stacks.

What the Gateway Layer Actually Does

Conflating the gateway with the engine is the most common mistake I see in production LLM stacks. They solve different problems:

Inference engine (vLLM, SGLang, TGI, TensorRT-LLM): takes tokens in, runs the forward pass on GPUs, streams tokens out. Optimizes for throughput, p99 latency, KV cache efficiency. Knows nothing about who is paying for the call.
Inference gateway (LiteLLM, BentoML, Ray Serve, plus commercial Portkey/Cloudflare AI Gateway): sits between the application and one or more engines or providers. Handles routing, auth, rate limiting, cost tracking, fallbacks, semantic caching, retries, request shaping.

An engine is a single-model high-throughput server. A gateway is the OpenAI-compatible façade that may forward to one or many engines, plus external providers like Anthropic, OpenAI, or Bedrock. They compose. The decision is not "gateway or engine" — it is "which gateway in front of which engine, and is my gateway also an engine?"

That last question is where the three tools diverge. LiteLLM is a pure gateway — it does not run models itself, it just routes to them. BentoML and Ray Serve are gateways and serving frameworks. That distinction drives most of the rest of this article.

LiteLLM: The Multi-Provider Router

LiteLLM is the most-installed inference gateway in the open-source LLM world, and for good reason. If you have ever had to write code like this, LiteLLM is the answer:

# The old way: provider-specific SDKs everywhere
if model.startswith("gpt-"):
    response = openai.chat.completions.create(...)
elif model.startswith("claude-"):
    response = anthropic.messages.create(...)
elif model.startswith("llama"):
    response = ollama.chat(...)

LiteLLM collapses that to a single call signature. Behind the scenes it has adapters for over 100 providers as of v1.88.0, including OpenAI, Anthropic, Azure OpenAI, Bedrock, Vertex, vLLM (running locally), Ollama, SGLang, Together, Groq, and Fireworks. The proxy server puts a single OpenAI-compatible endpoint on top, and in the SDK you swap providers by changing the model name string.
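
To make the "single call signature" idea concrete, here is a minimal sketch of a prefix-to-adapter registry that replaces the if/elif chain above. It is not LiteLLM's implementation: the adapter functions, the ChatResult type, and the completion() helper are hypothetical stand-ins that return canned text instead of calling a real SDK.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ChatResult:
    provider: str
    text: str


# Hypothetical adapters: each one would wrap a real provider SDK call.
def call_openai(model: str, messages: list[dict]) -> ChatResult:
    return ChatResult("openai", f"[{model}] {messages[-1]['content']}")


def call_anthropic(model: str, messages: list[dict]) -> ChatResult:
    return ChatResult("anthropic", f"[{model}] {messages[-1]['content']}")


def call_ollama(model: str, messages: list[dict]) -> ChatResult:
    return ChatResult("ollama", f"[{model}] {messages[-1]['content']}")


# Model-name prefix -> adapter. Adding a provider is one new entry here,
# not a new branch in every call site.
ADAPTERS: dict[str, Callable[[str, list[dict]], ChatResult]] = {
    "gpt-": call_openai,
    "claude-": call_anthropic,
    "llama": call_ollama,
}


def completion(model: str, messages: list[dict]) -> ChatResult:
    """Route one chat request to the adapter registered for the model name."""
    for prefix, adapter in ADAPTERS.items():
        if model.startswith(prefix):
            return adapter(model, messages)
    raise ValueError(f"no adapter registered for model {model!r}")
```

Callers only ever see completion(model, messages); the provider choice lives in one table, which is the property that makes routing, fallbacks, and cost tracking possible to add in one place.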

What LiteLLM is genuinely great at:

Spend tracking and budget enforcement: per-team, per-key virtual budgets, with hard stops at the proxy layer. The spend log table in the proxy's database (LiteLLM_SpendLogs in the Postgres schema; I shorten it to spend_logs here) is the cleanest cost-attribution surface in the open-source LLM stack.
Provider routing and fallbacks: configure a primary, a list of backups, and a cooldown window. When Anthropic starts throttling, LiteLLM rolls you over to OpenAI automatically. I covered the monitoring side in the LiteLLM production monitoring article.
Semantic caching via Redis with embedding-based similarity. Cache hit rates above 50% are realistic for internal-facing chat workloads.
Rate limiting per model, per team, per API key — a feature the engine layer generally does not give you.
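
The fallback-with-cooldown behavior in the list above can be sketched in a few lines. This is an illustration of the technique under simple assumptions, not LiteLLM's router: FallbackRouter, ProviderError, and the injectable clock are hypothetical names, and backends are plain callables.

```python
import time


class ProviderError(Exception):
    """Raised by a backend when the provider throttles or fails."""


class FallbackRouter:
    def __init__(self, backends, cooldown_s=30.0, clock=time.monotonic):
        # backends: ordered list of (name, callable) pairs, primary first.
        self.backends = backends
        self.cooldown_s = cooldown_s
        self.clock = clock
        self.cooling_until = {}  # backend name -> time it may be retried

    def call(self, prompt):
        now = self.clock()
        last_error = None
        for name, backend in self.backends:
            # Skip backends that failed recently and are still cooling down.
            if self.cooling_until.get(name, 0.0) > now:
                continue
            try:
                return name, backend(prompt)
            except ProviderError as exc:
                # Bench this backend for the cooldown window, try the next.
                self.cooling_until[name] = now + self.cooldown_s
                last_error = exc
        raise RuntimeError("all backends failed or are cooling down") from last_error
```

The cooldown is what keeps a throttled primary from eating a failed attempt on every request: after one 429 it is skipped until the window expires, then retried.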

Where LiteLLM is not the right tool: if you are running self-hosted models and want the gateway and the engine in one binary, LiteLLM is the wrong shape. It is a pure proxy, so if the vLLM instance behind it is down, you are down. It scales on request rate at the proxy and has no view of GPU load. And the proxy adds roughly 5-15 ms of overhead per request, which is real on TTFT-sensitive workloads.

BentoML: The Pythonic Serving Framework

BentoML started life as a generic model-serving framework for any ML model, then grew an LLM-specific path via its OpenLLM integration. The mental model is different from LiteLLM: you package a model plus its code plus its dependencies into a "Bento" (a serializable artifact), and a "Service" class exposes inference against that Bento. Older 1.x releases ran the model in a separate "Runner"; the decorator-based API used below folds that into the Service.

What makes BentoML useful in 2026 is the OpenLLM runner. The v1.4 line ships native LLM runners that wrap vLLM, SGLang, and llama.cpp under a uniform Python API:

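import bentoml

@bentoml.service(resources={"gpu": 1})
class LlamaService:
    def __init__(self):
        # Assumption: this sketch calls vLLM directly, because the OpenLLM
        # Python wrapper's API differs across releases.
        from vllm import LLM, SamplingParams
        self.llm = LLM(model="meta-llama/Meta-Llama-3-8B-Instruct")
        self.params = SamplingParams(max_tokens=256)

    @bentoml.api(batchable=True, max_batch_size=32)
    def generate(self, prompts: list[str]) -> list[str]:
        # vLLM returns RequestOutput objects; keep the first completion of each.
        outputs = self.llm.generate(prompts, self.params)
        return [out.outputs[0].text for out in outputs]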

The CLI covers the path from code to container: bentoml serve runs the service locally, bentoml build packages the Bento, bentoml containerize produces a Docker image from it, and bentoml deploy pushes it to BentoCloud. The Bento artifact model standardizes what "production-ready" means across PyTorch, TensorFlow, scikit-learn, and LLM workloads, which is genuinely nice for ML platform teams.

What BentoML is good at: self-hosted LLM serving with a clean packaging story (the Bento is the unit of deployment: versioned, immutable, reproducible), adaptive batching via batchable=True on the @bentoml.api decorator, multi-model services in one process, and first-class YAML deployment configs for K8s, EC2, and BentoCloud.
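
To show what request batching buys you, here is a small asyncio sketch of a fixed-window micro-batcher: callers submit single prompts, and the batcher groups them until the batch is full or a short timer fires. It is a simpler relative of adaptive batching (which tunes the window dynamically), not BentoML's implementation; MicroBatcher and its parameters are hypothetical names.

```python
import asyncio


class MicroBatcher:
    def __init__(self, batch_fn, max_batch_size=32, max_wait_s=0.01):
        self.batch_fn = batch_fn  # takes list[str], returns list[str]
        self.max_batch_size = max_batch_size
        self.max_wait_s = max_wait_s
        self.pending = []         # (prompt, future) pairs waiting for a batch
        self.flush_task = None    # timer for the current partial batch

    async def submit(self, prompt):
        fut = asyncio.get_running_loop().create_future()
        self.pending.append((prompt, fut))
        if len(self.pending) >= self.max_batch_size:
            self._flush()  # batch is full: run it now
        elif self.flush_task is None:
            self.flush_task = asyncio.create_task(self._flush_later())
        return await fut

    async def _flush_later(self):
        await asyncio.sleep(self.max_wait_s)
        self.flush_task = None
        self._flush()

    def _flush(self):
        # Cancel the timer if a full batch beat it to the flush.
        if self.flush_task is not None:
            self.flush_task.cancel()
            self.flush_task = None
        batch, self.pending = self.pending, []
        if not batch:
            return
        results = self.batch_fn([prompt for prompt, _ in batch])
        for (_, fut), result in zip(batch, results):
            fut.set_result(result)
```

The trade-off is explicit in the two knobs: a larger max_batch_size improves GPU utilization, while max_wait_s caps how much latency a lone request pays waiting for company.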

Where BentoML struggles: multi-provider routing is weak (BentoML is built around "I own this Bento"), cost tracking is DIY, and the dashboard story is less polished than LiteLLM's.

Ray Serve: The Distributed Engine-Gateway

Ray Serve is the outlier in this comparison. It is part of the Ray framework, which means you are not adopting a serving library — you are adopting a distributed computing runtime that happens to have a serving primitive. If you already run Ray for training or batch jobs, Serve is the natural place to put inference.

The architectural model is "inference graphs": you compose multiple deployments into a DAG. A typical chat pipeline might look like:

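from ray import serve
from starlette.requests import Request

@serve.deployment(ray_actor_options={"num_gpus": 1})
class VLLMEngine:
    def __init__(self, model_id: str):
        from vllm import LLM
        self.llm = LLM(model_id, tensor_parallel_size=1)

    def __call__(self, payload: dict) -> str:
        # vLLM returns RequestOutput objects; return plain text so it serializes.
        outputs = self.llm.generate(payload["prompt"])
        return outputs[0].outputs[0].text

@serve.deployment
class Preprocessor:
    def __call__(self, body: dict) -> dict:
        return {"prompt": body["prompt"].strip()}

@serve.deployment
class Router:
    def __init__(self, preprocessor, primary, fallback):
        self.preprocessor = preprocessor
        self.primary = primary
        self.fallback = fallback

    async def __call__(self, request: Request) -> str:
        # Parse the HTTP body at the ingress: request.json() is a coroutine,
        # and only plain data should be passed between deployments.
        payload = await self.preprocessor.remote(await request.json())
        try:
            return await self.primary.remote(payload)
        except Exception:
            return await self.fallback.remote(payload)

app = Router.bind(
    Preprocessor.bind(),
    VLLMEngine.bind("meta-llama/Meta-Llama-3-8B-Instruct"),
    VLLMEngine.bind("mistralai/Mistral-7B-Instruct-v0.3"),
)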

Where Ray Serve is genuinely strong: autoscaling (Ray's autoscaler in 2.40+ reacts to in-flight requests, queue depth, and custom metrics, so you can autoscale GPU pods based on real inference backlog), distributed composition (multi-model ensembles, retrieval + generation pipelines, and agent graphs are first-class), already-Ray teams (Serve removes a runtime — one observability stack, one scheduler, one resource manager), and multi-region / heterogeneous hardware.
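
The core of backlog-driven autoscaling is a ratio: in-flight requests divided by the number each replica should carry. The sketch below shows only that calculation, as an assumption about the general technique rather than Ray's actual policy (real autoscalers also apply delays and smoothing to avoid flapping). The function and parameter names are hypothetical.

```python
import math


def desired_replicas(ongoing_requests, target_per_replica, min_replicas=1, max_replicas=8):
    """Replicas needed so each carries about target_per_replica requests."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    raw = math.ceil(ongoing_requests / target_per_replica)
    # Clamp to the configured bounds so the pool never scales to zero
    # or past the GPU budget.
    return max(min_replicas, min(max_replicas, raw))
```

With a target of 4 in-flight requests per replica, a backlog of 10 asks for 3 replicas, an idle service stays at the minimum, and a spike is capped at the maximum. Scaling on this signal tracks real inference load in a way that CPU-based scaling on a GPU workload does not.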

Where Ray Serve is the wrong tool: if you are not already on Ray, adopting it for the serving primitive alone is expensive (head node, autoscaler, dashboard, GCS — non-trivial). If you just need multi-provider routing, Ray is overkill — standing up a Ray cluster to route between OpenAI and Anthropic is absurd. And for latency-sensitive single-model serving, the deployment abstraction adds overhead, and you are often better running vLLM directly behind a thin proxy.

The Comparison Table

Dimension	LiteLLM	BentoML	Ray Serve
Primary role	Pure gateway	Service framework + gateway	Distributed runtime + gateway
Best for	Multi-provider LLM routing, FinOps	Self-hosted model packaging, Pythonic serving	Ray-native teams, multi-model pipelines
Multi-provider routing (OpenAI, Anthropic, Bedrock)	First-class (100+ adapters)	DIY	DIY (route via Python)
Built-in cost tracking	Yes (spend logs, budgets)	No	No
Built-in semantic caching	Yes (Redis)	No	DIY
Autoscaling model	Request rate on the proxy	Adaptive batching, K8s HPA	Queue depth, in-flight, custom (Ray autoscaler)
GPU-aware scaling	Indirect (via K8s HPA on engine)	Yes (K8s GPU resources)	First-class (Ray autoscaler reacts to GPU queue)
Learning curve	Low (config files, Python SDK)	Medium (Bento concept, decorators)	High (Ray concepts, clusters, actors)
Observability story	Polished (Prometheus, Grafana templates, DB)	OpenTelemetry support, less turnkey	Ray dashboard, Prometheus exporter
Lock-in risk	Low (open spec, OpenAI-compatible)	Medium (Bento format is theirs)	Medium-high (Ray runtime)

Decision Guide: Which One for Which Team

Strong recommendations, not hedging.

You call OpenAI, Anthropic, and one or two other hosted providers, and you do not self-host models. Use LiteLLM. Cost attribution, rate limiting, and fallbacks in under a day. The multi-LLM routing article covers the routing layer on top.
You self-host one or two open-weight models on a fixed GPU pool. Use BentoML. The packaging story pays off the first time you deploy a fine-tune. Point LiteLLM at the Bento endpoint if you also need provider-style routing.
You already run Ray for training, batch, or feature pipelines. Use Ray Serve. The cost of an additional runtime is zero, and you get GPU-aware autoscaling for free.
You run 5+ models, multiple accelerators, multi-region, and have an actual platform team. Ray Serve with vLLM engines inside the deployments, with LiteLLM as the public-facing gateway. Layered, not simple, but it scales.
You run a single model on a single GPU box for a prototype. Skip all three. Run vLLM with --api-key and --port 8000 directly. Add a gateway when the second use case appears.

The most common anti-pattern is teams adopting Ray Serve when they have one model and one team using it — Ray's operational cost dwarfs the benefit. The mirror anti-pattern is hand-rolling provider routing in Python when LiteLLM would replace a month of work in a day.

What We Actually Run at StackPulsar

For internal AI tooling across a handful of providers, we run LiteLLM as the gateway with vLLM behind it for the two self-hosted models we keep warm. We charge back AI spend to feature teams, and LiteLLM's spend_logs table is the only thing between us and a Friday afternoon of log-grepping. The vLLM article covers the engine side.
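
For a sense of what "a single SQL query" means for chargeback, here is a self-contained sketch using sqlite3 from the standard library. The table layout, column names, and sample rows are hypothetical and much simpler than a real proxy schema; only the shape of the GROUP BY query carries over.

```python
import sqlite3

# In-memory stand-in for a spend log table (hypothetical columns and data).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE spend_logs (team TEXT, model TEXT, total_tokens INTEGER, spend_usd REAL)"
)
conn.executemany(
    "INSERT INTO spend_logs VALUES (?, ?, ?, ?)",
    [
        ("search", "gpt-4o", 1200, 0.5),
        ("search", "claude-3-haiku", 800, 0.25),
        ("support", "llama3", 5000, 0.125),
    ],
)


def spend_by_team(connection):
    """Return (team, total_tokens, total_spend_usd) rows, biggest spender first."""
    return connection.execute(
        "SELECT team, SUM(total_tokens), SUM(spend_usd) "
        "FROM spend_logs GROUP BY team ORDER BY SUM(spend_usd) DESC"
    ).fetchall()
```

Because every request lands in one table with a team attached, attribution is an aggregate rather than a log-parsing job.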

For the inference comparison workload we run for clients, we use Ray Serve. We need to swap engines and rerun benchmarks against multiple models in parallel, and Ray's deployment model lets us spin up a vLLM-backed deployment, kill it, and replace it without touching orchestration code.

For the BentoML-shaped use case, we have clients running it but not us. The Bento packaging model is excellent for platform teams shipping dozens of internal models with consistent deploys; for a two-model stack, it is overhead.

Limits and Anti-Patterns

A few things to be honest about:

All three share a tail-latency problem. When the engine stalls on a long prefill, the gateway cannot help. Time-to-first-token is dominated by the engine, not the gateway. Measure them separately.
Ray Serve for low-traffic self-hosted models is wasteful. The Ray head node, GCS, and dashboard consume resources a single vLLM instance does not need. We have measured roughly 2-4GB of overhead on the head node plus worker daemons.
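Do not stack two gateways by default. LiteLLM in front of Ray Serve in front of vLLM puts two proxy hops ahead of the engine, and each hop adds latency and another place to debug. The large multi-model platform in the decision guide is the one case where I accept that cost, because LiteLLM handles provider routing and spend while Ray Serve handles GPU scheduling. Otherwise pick one layer.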
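
On the first point, measuring time-to-first-token separately from total latency, a minimal sketch looks like this. It assumes your client exposes the response as an iterator of tokens or chunks; measure_stream and the injectable clock are hypothetical names.

```python
import time


def measure_stream(stream, clock=time.perf_counter):
    """Consume a token iterator and return (ttft_s, total_s, n_tokens)."""
    start = clock()
    ttft = None
    count = 0
    for _ in stream:
        if ttft is None:
            # First token arrived: this is the prefill-dominated part.
            ttft = clock() - start
        count += 1
    # ttft stays None if the stream produced nothing.
    return ttft, clock() - start, count
```

Run it once against the engine directly and once through the gateway with the same prompt. The difference in TTFT between the two runs is the gateway's share; everything else belongs to the engine.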

The gateway choice is a FinOps and operability decision far more than a performance one. The latency overhead of all three is in the same order of magnitude. The differences are in who can debug it at 2am and who can swap a provider without a redeploy. Pick the one that matches your team's actual operating model.