Latency is the most visible dimension of LLM performance. When a language model response streams back token-by-token, users experience it as a continuous conversation — and any glitch in that flow is immediately felt. A response that starts fast but crawls to a finish, or one that hesitates mid-stream, creates the impression of unreliability even when the underlying model is performing correctly.
Traditional API latency monitoring (p99 response time, error rate) is insufficient for LLM workloads. Language models have a fundamentally different performance profile: they produce partial results that stream to the user, they consume variable amounts of compute depending on output length, and their latency is a function of both input complexity and output length. Understanding LLM latency requires breaking it into components you can measure, act on, and correlate with user experience.
The Anatomy of LLM Latency
When a client sends a prompt and receives a complete model response, the end-to-end latency is the sum of distinct phases, each with different causes and remedies:
Time to First Token (TTFT) is the delay between the client sending a request and receiving the first generated token. It is the most perceptible latency dimension — the moment the user knows the model is working. TTFT is dominated by prefill processing: the model ingesting the entire prompt, computing attention across all input tokens, and preparing to generate the first output token. For long prompts, TTFT is primarily a function of input token count. For short prompts, it is more sensitive to model size and GPU compute availability.
Time Per Output Token (TPOT) is the average latency between each successive token during generation. Where TTFT tells you how quickly the model starts, TPOT tells you how consistently it sustains output. TPOT is a function of model compute per token, GPU memory bandwidth, and KV cache availability. A TPOT that climbs over time is a signal of memory pressure or model degradation.
Total End-to-End Latency is TTFT plus (TPOT times the number of output tokens). The ratio between TTFT and total latency changes dramatically with output length: for a 10-token response, TTFT might represent 60% of total latency; for a 500-token response, it drops to 5%. This means TTFT optimization matters most for short, interactive responses, while TPOT optimization matters more for long-form generation.
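The arithmetic above is worth making concrete. A minimal sketch, using illustrative numbers (300 ms TTFT, 30 ms/token TPOT — assumptions, not benchmarks):

```python
def total_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """End-to-end latency: prefill (TTFT) plus decode (TPOT per output token)."""
    return ttft_ms + tpot_ms * output_tokens

def ttft_share(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    """Fraction of total latency spent waiting for the first token."""
    return ttft_ms / total_latency_ms(ttft_ms, tpot_ms, output_tokens)

# Illustrative values: 300 ms TTFT, 30 ms/token TPOT
short = ttft_share(300, 30, 10)    # short interactive reply
long = ttft_share(300, 30, 500)    # long-form generation
print(f"TTFT share, 10 tokens:  {short:.0%}")   # TTFT dominates
print(f"TTFT share, 500 tokens: {long:.0%}")    # decode dominates
```

With these particular numbers TTFT is half the latency at 10 output tokens and about 2% at 500 — the exact crossover depends on your model and hardware, but the shape of the curve does not.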
What to Instrument
Effective LLM latency monitoring starts with capturing these four metrics per request:
- TTFT (ms) — time to first token, measured client-side or via a proxy that marks request receipt and first byte
- TPOT (ms/token) — average time per output token during the generation phase
- Input token count — the number of tokens in the prompt, including system messages and context
- Output token count — the number of tokens generated before the model produced a stop token or hit its max token limit
From these four raw measurements, you can derive the signals that matter for production reliability:
- Cost per request (estimated) — input tokens times your model's input price plus output tokens times its output price
- Generation throughput (tokens/sec) — 1000 / TPOT, the sustained generation speed
- Prefill-to-decode ratio — TTFT / (TTFT + TPOT x output_tokens), the fraction of latency attributable to prompt processing
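These derivations are one-liners. A sketch that turns the four raw measurements into the derived signals (field names and prices are illustrative, not tied to any particular provider):

```python
from dataclasses import dataclass

@dataclass
class RequestMetrics:
    """The four raw per-request measurements."""
    ttft_ms: float
    tpot_ms: float
    input_tokens: int
    output_tokens: int

def derive(m: RequestMetrics, input_price: float, output_price: float) -> dict:
    """Compute the derived production signals from one request's raw metrics.

    input_price / output_price are your model's per-token prices (assumed inputs).
    """
    decode_ms = m.tpot_ms * m.output_tokens
    return {
        "cost_usd": m.input_tokens * input_price + m.output_tokens * output_price,
        "throughput_tok_per_s": 1000.0 / m.tpot_ms,
        "prefill_to_decode": m.ttft_ms / (m.ttft_ms + decode_ms),
    }

m = RequestMetrics(ttft_ms=400, tpot_ms=25, input_tokens=1200, output_tokens=200)
signals = derive(m, input_price=0.5e-6, output_price=1.5e-6)
```

For this example request, throughput comes out at 40 tokens/sec and the prefill-to-decode ratio at roughly 0.07 — a decode-dominated workload.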
Tracking the prefill-to-decode ratio reveals what kind of optimization matters most for your workload. A ratio above 0.5 means prompt processing dominates latency — focus on input-side optimizations (prompt caching, semantic caching, or a smaller/faster model for simple queries). A ratio below 0.2 means generation dominates — focus on GPU throughput, batch scheduling, or quantization.
Measuring TTFT and TPOT from vLLM
If you are serving models with vLLM, the Prometheus metrics endpoint provides TTFT and TPOT histograms directly. The key metrics are:
- vllm:time_to_first_token_seconds — histogram of TTFT across all requests
- vllm:time_per_output_token_seconds — histogram of TPOT across all generation steps
- vllm:e2e_request_latency_seconds — total request latency from receipt to final token
- vllm:num_queue_tokens — current queue depth, the backpressure signal
For a Grafana dashboard that immediately surfaces actionable signals, set up these four panels:
- TTFT p50 / p95 / p99 — if p99 climbs above 2 seconds for your target model and prompt length, you have a prefill bottleneck. Check GPU utilization during the spike.
- TPOT distribution over time — a TPOT that trends upward week-over-week is an early warning of KV cache pressure or GPU saturation.
- Queue depth — a climbing queue depth means requests are backing up. Scale horizontally (add replicas) or implement priority batching.
- Tokens generated per minute — your throughput signal. If this drops without a traffic decrease, something is wrong upstream.
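The percentile panels are built on PromQL's histogram_quantile(), which estimates a quantile by linearly interpolating inside cumulative histogram buckets. A pure-Python sketch of the same estimate (the bucket bounds and counts here are made up for illustration, not real vLLM output):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, int]]) -> float:
    """Approximate a quantile from cumulative histogram buckets of
    (upper_bound_seconds, cumulative_count), mirroring PromQL's
    histogram_quantile with linear interpolation inside a bucket."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:          # empty bucket, no interpolation
                return bound
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + frac * (bound - prev_bound)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# Hypothetical TTFT buckets, as scraped from vllm:time_to_first_token_seconds
ttft_buckets = [(0.25, 40), (0.5, 70), (1.0, 90), (2.0, 98), (4.0, 100)]
p99 = histogram_quantile(0.99, ttft_buckets)   # lands in the 2-4 s bucket
```

Understanding this interpolation matters when reading the dashboard: a p99 of 3 seconds does not mean any request took exactly 3 seconds, only that the 99th-percentile rank falls partway through the 2-4 second bucket.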
The Grafana dashboard for vLLM is covered in depth in our vLLM production monitoring guide, including Prometheus scrape configs and panel definitions.
OpenTelemetry Tracing for Latency Breakdown
Prometheus gives you aggregated metrics. OpenTelemetry traces give you per-request latency breakdowns that let you isolate which step in your request pipeline is the bottleneck. For LLM applications that involve retrieval, prompt augmentation, or multi-step chains, this granularity is essential.
A well-structured trace for a RAG-augmented LLM request captures the latency of each stage:
- Embedding generation — converting the user query to a vector
- Vector search — retrieving top-k chunks from the vector database
- Context assembly — building the full prompt with system message, retrieved chunks, and conversation history
- Model inference — the LLM call itself (TTFT + generation)
- Response post-processing — any streaming transform, safety filter, or output parsing
The openinference-instrumentation library from Arize captures these spans automatically when you instrument your OpenAI-compatible client. For LangChain applications, LangSmith provides this breakdown natively. For custom stacks, the OTel Python SDK makes it straightforward to create custom spans around each operation.
```python
from opentelemetry import trace
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter

# Initialize the tracer
provider = TracerProvider()
processor = BatchSpanProcessor(
    OTLPSpanExporter(endpoint="http://collector:4317")
)
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)
tracer = trace.get_tracer("llm-pipeline")

# Wrap each pipeline stage
with tracer.start_as_current_span("embedding") as span:
    span.set_attribute("operation", "embed_query")
    span.set_attribute("input_tokens", len(query_tokens))
    embedding = embed_model.encode(query)

with tracer.start_as_current_span("retrieval") as span:
    span.set_attribute("operation", "vector_search")
    span.set_attribute("top_k", top_k)
    chunks = vector_db.search(embedding, k=top_k)

with tracer.start_as_current_span("llm_inference") as span:
    span.set_attribute("model", model_name)
    span.set_attribute("input_tokens", prompt_token_count)
    response = llm.generate(prompt)
```

These spans flow to Grafana Tempo (or Jaeger for development) and give you a waterfall diagram of where latency is concentrated per request. The pattern that appears most often in production: embedding and retrieval together take 20-50ms, but the LLM inference step takes 500-3000ms. Optimizing the retrieval step (which can be done with caching and better indexing) yields a better user experience per dollar than upgrading the model.
Client-Side TTFT Measurement
Server-side metrics from vLLM or the model API exclude network transit time and any proxy processing. For a complete picture of user-perceived latency, measure TTFT at the client level. The technique: record the timestamp when the request is sent, measure the timestamp when the first byte of the response arrives, and report the delta as client-side TTFT.
If you are using streaming responses via Server-Sent Events (SSE) or WebSocket, the first byte arrives when the model starts generating — which is the server-side TTFT plus network RTT. For non-streaming responses, you measure the time until the full response body is received, which includes the entire generation time.
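The technique can be written as a thin wrapper around whatever iterator your client library yields for a streamed response (an SSE client, an OpenAI streaming response, or here a stand-in generator). One caveat: the clock starts when the wrapper is created, so create it immediately before sending the request.

```python
import time
from typing import Iterable, Iterator

def measure_ttft(stream: Iterable) -> tuple[Iterator, dict]:
    """Wrap a streaming response iterator; record client-side TTFT
    (ms from wrapper creation to the first chunk) into `timings`."""
    timings = {"sent_at": time.perf_counter(), "ttft_ms": None}

    def wrapped():
        for i, chunk in enumerate(stream):
            if i == 0:
                timings["ttft_ms"] = (time.perf_counter() - timings["sent_at"]) * 1000
            yield chunk

    return wrapped(), timings

# Stand-in for a real SSE/streaming response
def fake_stream():
    time.sleep(0.05)   # simulated network transit + prefill delay
    yield "Hello"
    yield " world"

chunks, timings = measure_ttft(fake_stream())
text = "".join(chunks)
# After consuming the stream, timings["ttft_ms"] holds client-side TTFT
```

Report timings["ttft_ms"] alongside the server-side metric for the same request; the delta between the two is the network overhead discussed below.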
The gap between client-side TTFT and server-side TTFT is your network latency overhead. If this gap is consistently above 50ms, you may have geographic distance issues between your application servers and your inference endpoint. The fix is either moving the inference endpoint closer or implementing a caching layer that can serve responses without hitting the model.
Latency SLOs for LLM Applications
Defining latency SLOs for LLM workloads requires understanding what "good enough" means for your specific use case. A coding assistant tolerates higher TTFT than a customer-facing chatbot — the user context switch cost is lower when waiting for code. A real-time voice interface requires TTFT below 500ms or the conversation feels broken.
A practical starting point for most LLM applications:
- TTFT SLO: p95 below 1.5 seconds for prompts under 512 tokens on a 7B-class model
- TPOT SLO: p95 below 50ms/token for continuous generation (roughly 20 tokens/sec, or 1,200 tokens/min, sustained)
- End-to-end availability: 99.5% — successful non-timeout responses
These SLOs are targets, not guarantees; they vary with model size, prompt complexity, and hardware. The important practice is tracking SLO burn rate: the rate at which you are consuming your error budget. If more than 5% of requests exceed the 1.5-second TTFT target over a 1-hour window, you are burning budget faster than you can recover and should investigate.
What Causes Latency Spikes
The three most common causes of LLM latency spikes in production, and how to identify them:
GPU memory saturation. When the KV cache grows beyond what fits in GPU memory, vLLM must spill to CPU RAM, which dramatically slows token generation. You see this as a TPOT that climbs gradually over hours or days rather than suddenly. The fix: reduce the maximum concurrent requests (limit batch size), enable prefix caching to reuse KV cache across requests with shared prefixes, or scale horizontally to distribute the memory load.
Prefill bottlenecks from long prompts. When a prompt with a long context window arrives, the prefill phase processes all input tokens before generating the first output. For prompts exceeding 8K tokens, this can add seconds to TTFT even on fast hardware. The fix: implement prompt caching (vLLM supports prefix caching natively), or route long-context requests to a separate inference pool provisioned for memory-heavy workloads.
Queue buildup during traffic spikes. If traffic arrives faster than the model can process, requests queue up and each one waits for its turn. Queue latency is invisible in server-side TTFT metrics — the model measures from when the request reaches the front of the queue. Use queue depth (vLLM's vllm:num_queue_tokens) as your leading indicator. Set an alert when queue depth exceeds 10x your normal baseline.
Optimizing LLM Latency
Once you can measure latency accurately, the optimization path becomes clear:
- Quantization. Running a model at INT8 or FP8 precision reduces compute per token with minimal quality loss for most tasks. This directly reduces TPOT. GPTQ, AWQ, and vLLM's built-in quantization support make this straightforward to test.
- Semantic caching. If 30% of your queries are semantically similar to recent queries, a semantic cache can serve responses without invoking the model at all — zero latency for cached hits. Tools like Portkey and GPTCache provide this layer.
- Smaller models for simple queries. A 7B model answering "what is the weather?" should not route through the same inference endpoint as a 70B model answering complex code review questions. Tiered model routing based on query complexity is one of the highest-leverage latency optimizations available.
- Batch scheduling. vLLM's continuous batching maximizes GPU utilization by keeping the inference engine busy while requests arrive and complete asynchronously. Ensure your deployment uses continuous batching (the default in vLLM) rather than static batching, which introduces latency spikes.
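Tiered routing can start as something very simple: a heuristic on prompt length plus a keyword check, refined later with a learned classifier. A sketch (model names, thresholds, and marker words are placeholders, not recommendations):

```python
def route_model(prompt: str, token_count: int) -> str:
    """Route simple queries to a small fast model, complex ones to a large model.
    Thresholds and model names are illustrative placeholders."""
    complex_markers = ("review", "refactor", "analyze", "explain why")
    if token_count > 1024 or any(m in prompt.lower() for m in complex_markers):
        return "large-70b"    # slower, higher quality
    return "small-7b"         # low TTFT, cheap

assert route_model("what is the weather?", token_count=6) == "small-7b"
assert route_model("Please review this diff", token_count=400) == "large-70b"
```

The win is twofold: the small model's lower TTFT and TPOT improve latency for the easy majority of traffic, and the large model's capacity is reserved for requests that actually need it.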
For a complete guide to FinOps-aware model routing — reducing latency while cutting costs — see our LLM FinOps guide.
Conclusion
LLM latency monitoring is not a single metric problem. TTFT, TPOT, input token count, and output token count are four separate signals that together tell you whether your inference infrastructure is performing correctly. When TTFT climbs, you have a prefill problem. When TPOT climbs, you have a generation problem. When both are stable but queue depth grows, you have a capacity problem.
The tools to measure all of this exist today: vLLM's native Prometheus metrics, OpenTelemetry tracing, and Grafana dashboards. Getting them wired up takes an afternoon. What you get in return is the ability to distinguish between a model that is performing correctly and one that is degraded — before your users start complaining.
For a deeper dive into the open source LLM observability stack that ties these metrics together, see our guide to the Open Source LLM Monitoring Stack in 2026. And for understanding how latency relates to cost, the LLM FinOps guide covers the unit economics of inference in detail.