vLLM has become the de facto open-source inference engine for serving large language models in production. Its PagedAttention memory management and high throughput make it the default choice for teams running LLM inference at scale. But monitoring vLLM in production requires understanding a different set of metrics than conventional API services.
vLLM-Specific Metrics
Traditional API services expose request rate, error rate, and latency percentiles. vLLM is different. The core performance metrics are:
- GPU memory utilization: vLLM pre-allocates the KV cache according to gpu_memory_utilization, so the number to watch is the fraction of that cache actually in use. At 98% or higher you are at OOM risk. Set a warning at 85% and a critical alert at 95%.
- KV cache hit rate: vLLM's prefix cache is your primary throughput lever. A low hit rate means you are recomputing too many prompt tokens from scratch. Track it per request type.
- Prefill vs. decode throughput imbalance: prefill (processing the input prompt) is compute-bound, while decode is memory-bandwidth-bound. If prefill is the bottleneck, you need more compute or shorter prompts; if decode is the bottleneck, larger batch sizes improve GPU utilization.
- Number of ongoing sequences: vLLM's block manager tracks active sequences. Understanding batch composition helps you tune max_num_seqs.
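The warning/critical thresholds above translate directly into alert logic. As a minimal sketch (a hypothetical helper for your own monitoring pipeline, not part of vLLM):

```python
def classify_kv_cache_usage(usage: float) -> str:
    """Map a KV cache usage fraction (0.0-1.0) to an alert level.

    Thresholds follow the guidance above: warn at 85%, critical at 95%.
    """
    if usage >= 0.95:
        return "critical"
    if usage >= 0.85:
        return "warning"
    return "ok"
```

Wiring this into your alerting stack (Prometheus alert rules, PagerDuty, etc.) is left as deployment-specific glue.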
Key Prometheus Metrics from vLLM
vLLM exposes a native /metrics endpoint in Prometheus format. The critical metrics to track:
- vllm:gpu_cache_usage_perc: fraction of the allocated KV cache in use. 85-95% is healthy; above 98% is OOM risk.
- vllm:generation_tokens_total: total tokens generated. Use for throughput calculations.
- vllm:e2e_request_latency_seconds: histogram of end-to-end request latency.
- vllm:time_to_first_token_seconds: TTFT, the make-or-break metric for streaming UX. Target p99 under 1 second for most models.
- vllm:time_per_output_token_seconds: TPOT, inter-token latency. Target p99 under 100 ms for most use cases.
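To pull these off the /metrics endpoint, you can parse the Prometheus text exposition format directly. A stdlib-only sketch (it ignores HELP/TYPE comments and keeps any label string as part of the series name; a production setup would use a real Prometheus scraper or client library instead):

```python
def parse_metrics(body: str) -> dict[str, float]:
    """Parse Prometheus text exposition into {series: value}.

    Minimal sketch: assumes each sample line is "<name>[{labels}] <value>"
    with no trailing timestamp, and skips anything it cannot parse.
    """
    out: dict[str, float] = {}
    for line in body.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        try:
            out[name] = float(value)
        except ValueError:
            continue
    return out
```

Feed it the body of an HTTP GET against the server's /metrics endpoint and look up the series listed above.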
Common Failure Modes
OOM during batch expansion: if a long sequence arrives and vLLM tries to allocate more KV cache blocks than are available, the request fails. Fix: lower gpu_memory_utilization in 5-10% increments to leave headroom. If you are below 0.80 and still hitting OOM, your batches are too large: reduce max_num_seqs or max_model_len instead.
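The step-down procedure above can be made mechanical. A sketch (a hypothetical helper encoding the 5% steps and the 0.80 floor described here):

```python
def next_gpu_memory_utilization(current: float, step: float = 0.05) -> float:
    """Suggest the next --gpu-memory-utilization value after an OOM.

    Steps down in 5% increments per the guidance above, but refuses to
    go below 0.80: past that point, shrink the batch instead.
    """
    lowered = round(current - step, 2)
    if lowered < 0.80:
        raise ValueError(
            "already at/below 0.80; reduce max_num_seqs or max_model_len instead"
        )
    return lowered
```

Restart the server with the suggested value and re-test under the same traffic before stepping down again.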
Low cache hit rate: usually caused by insufficient GPU memory for the workload's concurrent sequence count. Increase gpu_memory_utilization or reduce max_num_seqs. Also check whether your workload has high prompt diversity: if every request carries a completely different context, the hit rate will be inherently low.
Prefill bottleneck: if your TTFT is high but TPOT is low, the bottleneck is processing the input prompt. This happens with long system prompts or retrieval-augmented prompts that include many chunks. Solutions: reduce prompt length, enable prefix caching so fixed system prompts are computed once, or enable chunked prefill so long prompts do not stall ongoing decodes.
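The TTFT/TPOT diagnosis above can be expressed as a small classifier. A sketch, using the p99 targets from the metrics list (under 1 s TTFT, under 100 ms TPOT) as default thresholds; the function and its thresholds are illustrative, not a vLLM API:

```python
def classify_bottleneck(
    ttft_p99: float,
    tpot_p99: float,
    ttft_slo: float = 1.0,   # seconds; p99 TTFT target
    tpot_slo: float = 0.1,   # seconds; p99 TPOT target
) -> str:
    """Classify where the latency budget is going from two p99 readings."""
    high_ttft = ttft_p99 > ttft_slo
    high_tpot = tpot_p99 > tpot_slo
    if high_ttft and high_tpot:
        return "overloaded"      # both phases slow: likely saturation
    if high_ttft:
        return "prefill-bound"   # slow first token, fast streaming after
    if high_tpot:
        return "decode-bound"    # fast first token, slow streaming
    return "healthy"
```

Run it against the p99 values of vllm:time_to_first_token_seconds and vllm:time_per_output_token_seconds to decide which remediation path to take.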