The most common misconception in observability tooling is that Prometheus and Grafana are competitors. They are not. They are complementary pieces of the same stack, and using them correctly requires understanding what each one is actually for.
What Prometheus Does Well
Prometheus is a time-series database and monitoring system built around a pull-based model. It reaches out to your services at configured intervals and scrapes exposed metrics. Every metric has a name and a set of labels (key-value pairs) that enable powerful dimensional querying. The PromQL query language is the real differentiator — you can compute rates, aggregations, joins, and functions directly in the query layer without preprocessing your data.
The four primary metric types in Prometheus:
- Counter — monotonically increasing value. Use for request counts, bytes sent, errors total. Never decreases.
- Gauge — can go up or down. Use for current memory usage, queue depth, temperature.
- Histogram — samples observations into configurable buckets. Use for latency distributions — your classic p50/p95/p99.
- Summary — similar to histogram but computes quantiles client-side. Less flexible than histogram for backend aggregation.
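The four types map directly onto the official prometheus_client Python library. A minimal sketch (metric names here are illustrative):

```python
from prometheus_client import Counter, Gauge, Histogram, Summary

# Counter: only ever increases (resets to zero on process restart)
requests_total = Counter('app_requests_total', 'Total requests handled')
requests_total.inc()

# Gauge: moves up and down with current state
queue_depth = Gauge('app_queue_depth', 'Items waiting in the queue')
queue_depth.set(42)
queue_depth.dec()

# Histogram: buckets observations server-side; quantiles are computed in PromQL
latency = Histogram('app_latency_seconds', 'Request latency',
                    buckets=[0.01, 0.1, 0.5, 1.0])
latency.observe(0.23)

# Summary: tracks count and sum client-side
# (note: the Python client does not emit quantiles for Summary)
payload = Summary('app_payload_bytes', 'Response payload size')
payload.observe(512)
```

Call `prometheus_client.start_http_server(port)` once at startup and these metrics appear on `/metrics` for Prometheus to scrape.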
What's new in Prometheus 3.x (2026): Prometheus 3.x, production-hardened in late 2025, brought significant improvements for high-scale AI/ML workloads. Remote write protocol v2 improves reliability and compression for metric streams generating 50K+ series — common for multi-GPU inference servers. Native OTLP ingestion means OpenTelemetry SDKs can send metrics directly to Prometheus without an adapter layer, collapsing a common source of operational complexity. Exemplars — contextual metadata attached to histogram samples — now link Prometheus metrics to distributed traces seamlessly, enabling correlation workflows that previously required custom instrumentation. PromQL subqueries are now stable, enabling runtime evaluation of complex nested queries like `histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))[1h:1m]` that previously required recording rules.
A practical example: exposing latency from a Python FastAPI service.
```python
from fastapi import FastAPI, Request, Response
from prometheus_client import (
    Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST
)

app = FastAPI()

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
)

@app.middleware("http")
async def track_requests(request: Request, call_next):
    labels = {"method": request.method, "endpoint": request.url.path}
    with REQUEST_LATENCY.labels(**labels).time():
        response = await call_next(request)
    REQUEST_COUNT.labels(**labels, status=str(response.status_code)).inc()
    return response

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)
```

Prometheus scrapes this endpoint and stores the time series. You query it with PromQL:
```
# Error rate per endpoint over the last 5 minutes
# (aggregate away the status label so the division matches correctly)
sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))
  / sum by (endpoint) (rate(http_requests_total[5m]))

# p99 latency by endpoint
histogram_quantile(0.99,
  sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))
```
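Expensive queries like the p99 above can also be precomputed with a recording rule, so dashboards read a cheap pre-aggregated series. A sketch of a rules file (the recorded metric name is an illustrative convention, not a standard):

```yaml
groups:
  - name: latency-recording
    rules:
      # Evaluated on Prometheus's rule interval; dashboards query the result
      - record: endpoint:http_request_duration_seconds:p99_5m
        expr: >
          histogram_quantile(0.99,
            sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))
```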
Prometheus Alertmanager: Routing and Silencing
Prometheus Alertmanager completes the picture — it handles the routing, grouping, and deduplication of alerts generated by Prometheus rules. Without it, you get a firehose of individual alert notifications. With it, you get actionable, grouped alerts that respect maintenance windows.
```bash
# Install alertmanager via Helm
helm install alertmanager prometheus-community/alertmanager \
  --namespace monitoring \
  --set persistentVolume.storageClass=standard
```

Key configuration: a route tree with receiver fallbacks.

```yaml
route:
  receiver: 'team-ops'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      group_wait: 10s
    - match:
        service: ai-inference
      receiver: 'ai-oncall'
```
Alertmanager's power comes from its inhibition rules — you can suppress lower-priority alerts when a higher-priority alert is already firing (e.g., suppress node-level alerts when the entire cluster is down). For AI inference services specifically, configure inhibit rules that silence GPU temperature warnings when a GPU is already in a critical throttling state.
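A sketch of such an inhibit rule in alertmanager.yml — the alert names and labels here are hypothetical, so match them to your own alerting rules:

```yaml
inhibit_rules:
  # Silence GPU temperature warnings while the same GPU is critically throttling
  - source_matchers:
      - alertname = GPUCriticalThrottle
    target_matchers:
      - alertname = GPUTempWarning
    # Only inhibit when source and target refer to the same GPU on the same node
    equal: ['node', 'gpu']
```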
What Grafana Does Well
Grafana is a visualization and analytics platform that speaks many data sources. It can query Prometheus, Loki (logs), Tempo (traces), Jaeger, Elasticsearch, InfluxDB, PostgreSQL, CloudWatch, and dozens of others through its plugin ecosystem. Its core value is unifying disparate monitoring data into a single pane of glass.
For AI infrastructure in particular, Grafana dashboards serve three critical roles:
- Real-time operational dashboards — GPU utilization, memory pressure, KV cache hit rates, token throughput. These need low-latency queries and live-refresh capability.
- SLO burn-rate tracking — multi-window alerting that correlates error budget consumption with actual user impact.
- Cost attribution — correlating GPU runtime, token generation volume, and cloud spend in a single view.
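The SLO burn-rate bullet above can be expressed directly in PromQL. This sketch assumes a 99.9% availability SLO and the `http_requests_total` metric from earlier; the 14.4x multiplier is the common fast-burn threshold for paired 1h/5m windows:

```
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
    / sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
and
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
    / sum(rate(http_requests_total[5m]))
) > (14.4 * 0.001)
```

Requiring both windows to exceed the threshold makes the alert fire quickly on real incidents while ignoring short blips.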
What's new in Grafana 11.x (2026): Grafana 11.x brings production-ready features that matter for AI infrastructure teams. Dashboard version control brings Git-like versioning to dashboards — rollbacks, PR-based changes, and audit trails for free. Grafana Assistant (AI-powered) generates PromQL queries from natural language, explains metrics, and suggests correlated signals during incident investigation. Improved alerting UX offers a visual node editor for multi-step alert routing. For AI/ML specifically, Grafana Cloud's pre-built AI dashboards now ship with templates for vLLM, Ray, and Hugging Face Transformers — token throughput, KV cache statistics, batch scheduling metrics, and GPU memory pressure out of the box. Grafana Enterprise adds SAML SSO, role-based access control, and audit logs for compliance-heavy environments.
Grafana Cloud is the managed option. It handles Prometheus, Loki, and Tempo hosting for you — attractive for teams that want observability without operating infrastructure. The free tier includes 10K Prometheus active series and 50GB logs, which covers small-to-medium deployments. The paid tiers start at $9/month for the Hosted Grafana service plus consumption-based pricing for metrics.
Head-to-Head: Prometheus vs Grafana
| Dimension | Prometheus | Grafana |
| --- | --- | --- |
| Primary role | Metrics collection, storage, and querying | Visualization, alerting, and multi-source correlation |
| Data storage | Built-in TSDB (time-series database) | None — queries external sources |
| Query language | PromQL (powerful, expressive) | Depends on data source (PromQL for Prometheus, LogQL for Loki, etc.) |
| Alerting | Rule evaluation plus Alertmanager for routing; no built-in UI for alert management | Built-in alert rules, notification policies, and alert state management |
| Dashboarding | Basic expression browser, not designed for complex dashboards | Purpose-built for rich, interactive dashboards |
| Log management | Not supported natively (use Loki) | Full log exploration via Loki or Elasticsearch |
| Tracing | Not supported natively | Native support for Jaeger, Tempo, X-Ray |
| Kubernetes integration | Best-in-class via prometheus-operator | Queries Prometheus data; broader integration via plugins |
| Scalability | Single instance: roughly 1M active series; Thanos/Mimir for horizontal scale | Scales with backend data sources |
| Cost model | Open-source, free to self-host; managed options available (e.g. Grafana Cloud, Amazon Managed Service for Prometheus) | Free open-source (self-hosted); Cloud from $9/mo plus consumption |
The Correct Architecture: Using Both Together
The right mental model: Prometheus is your data plane, Grafana is your control plane and visualization layer. They are not alternatives — you deploy them together.
For Kubernetes clusters
The standard production setup uses the kube-prometheus-stack Helm chart. It packages the prometheus-operator, pre-configured Prometheus instances, kube-state-metrics, node-exporter, and a set of pre-built Grafana dashboards into a single deployable unit.
```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.replicas=2
```
The prometheus-operator is the key component — it manages Prometheus configurations as Kubernetes CRDs. You define ServiceMonitor and PrometheusRule objects, and the operator handles the underlying Prometheus reloads automatically. This means your monitoring configuration is version-controlled, reviewed, and deployed like any other Kubernetes workload.
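A PrometheusRule CRD of the kind the operator watches might look like this — the alert name, threshold, and `release` label are illustrative and must match your operator's selectors:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: http-error-rate
  namespace: monitoring
  labels:
    release: prometheus   # must match the operator's ruleSelector
spec:
  groups:
    - name: http
      rules:
        - alert: HighErrorRate
          expr: >
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
          for: 10m
          labels:
            severity: critical
```

The operator detects the new object and reloads Prometheus automatically — no manual config edits or restarts.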
For AI inference workloads (vLLM, Ray, etc.)
AI inference servers expose their own Prometheus metrics. You add a ServiceMonitor to wire them into your existing Prometheus:
```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
    - port: metrics
      path: /metrics
      interval: 10s
```
Then Grafana queries the same Prometheus for AI-specific dashboards. A single Grafana Cloud or self-hosted instance can serve dashboards for your Kubernetes cluster health, your GPU inference servers, your LLM application layer, and your cloud billing — all from the same Prometheus backend.
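Wiring Grafana to that Prometheus can itself be declarative via datasource provisioning. A sketch — the URL is a placeholder for your in-cluster Prometheus service:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yaml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated.monitoring.svc:9090
    isDefault: true
```

Grafana loads this at startup, so the datasource is version-controlled alongside the rest of the monitoring stack.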
Monitoring GPU Metrics via DCGM Exporter
NVIDIA's DCGM (Data Center GPU Manager) exporter exposes GPU diagnostics, utilization, memory, temperature, and power metrics via Prometheus. For AI inference workloads, the key metrics are:
```
# GPU utilization (percent)
DCGM_FI_DEV_GPU_UTIL{gpu="0"}

# GPU framebuffer memory used (MiB)
DCGM_FI_DEV_FB_USED{gpu="0"}

# GPU temperature (Celsius)
DCGM_FI_DEV_GPU_TEMP

# Power draw (watts)
DCGM_FI_DEV_POWER_USAGE

# KV cache memory ratio (if supported by your inference server)
vllm:kv_cache_used_ratio / vllm:kv_cache_total_ratio
```
Deploy DCGM exporter as a DaemonSet so it runs on every node with a GPU:
```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      hostPID: true
      containers:
        - name: dcgm-exporter
          image: nvcr.io/nvidia/dcgm-exporter:3.3
          securityContext:
            privileged: true
          ports:
            - name: metrics
              containerPort: 9400
          volumeMounts:
            - name: nvidia-container
              mountPath: /var/lib/nvidia-container/runtime/nvidia.sock
      volumes:
        - name: nvidia-container
          hostPath:
            path: /var/lib/nvidia-container/runtime/nvidia.sock
```
Ray AI Runtime Metrics
For distributed AI training and inference on Ray, the Ray dashboard and Prometheus metrics are automatically exposed on port 8265. Add Ray head node discovery to your Prometheus config:
```yaml
scrape_configs:
  - job_name: 'ray'
    file_sd_configs:
      - files:
          - /etc/prometheus/ray_targets.yml
    relabel_configs:
      # Append the default port to any target listed without one
      - source_labels: [__address__]
        regex: '([^:]+)'
        target_label: __address__
        replacement: '$1:8265'
```
Ray emits metrics for actor creation, task scheduling, object store memory, and cluster resource utilization — essential for multi-node AI workloads on Kubernetes.
The storage hierarchy that works in production
- Short-term (0-30 days): Prometheus instances (hot storage, SSD-backed). Query latency under 1 second for all operational dashboards.
- Long-term (30 days - 1 year): Thanos sidecar uploads TSDB blocks to S3/GCS every 2 hours. Store gateway serves historical queries. No data loss, and minimal operational overhead once configured.
- Cross-cluster federation: Thanos querier provides a unified PromQL endpoint across all your Prometheus instances — useful for multi-region or multi-cloud setups.
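The sidecar wiring in the long-term tier above is a small amount of configuration. A sketch, assuming an object-store config at /etc/thanos/objstore.yml (bucket name and endpoint are placeholders):

```bash
# Run Thanos sidecar next to Prometheus: it uploads completed TSDB
# blocks to object storage and serves the StoreAPI for the querier.
thanos sidecar \
  --tsdb.path=/prometheus \
  --prometheus.url=http://localhost:9090 \
  --objstore.config-file=/etc/thanos/objstore.yml
```

```yaml
# /etc/thanos/objstore.yml
type: S3
config:
  bucket: metrics-long-term          # placeholder bucket name
  endpoint: s3.us-east-1.amazonaws.com
```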
Mimir vs Thanos for Long-Term Storage
Both Thanos and Mimir solve the same problem: horizontally scaling Prometheus metrics beyond a single server and storing historical data cost-effectively. The operational models differ significantly.
Mimir (Grafana Labs' fork of Cortex) deploys via a single Helm chart with a unified backend for store, query, and ingest. Its advantages in 2026: native Grafana integration so the metrics explorer handles long-term data identically to live Prometheus data; active quarterly releases; simplified operational surface when you are already on Grafana Cloud or Grafana Enterprise. Mimir's native multi-tenancy (tenant isolation via the `X-Scope-OrgID` header) makes multi-team setups straightforward.
Thanos has a more modular architecture — Sidecar, Store Gateway, Querier, Receive, and Compact are separate components. This gives flexibility but requires more operational expertise. Thanos advantages: broader S3-compatible storage support (including MinIO and Wasabi); more mature in some enterprise environments; the Receive component lets you push metrics directly from producers that cannot be scraped.
Recommendation for 2026: Mimir if you want operational simplicity and are already on the Grafana stack. Thanos if you need multi-cloud or on-premises S3 compatibility, or already have Thanos expertise on the team. Both are production-grade — the choice depends on your team's existing knowledge and cloud provider constraints.
Remote Write Best Practices
Prometheus 3.x's native OTLP support eliminates the OTLP receiver adapter for most deployments. Configure remote write directly:
```yaml
remote_write:
  - url: https://mimir.example.com/prometheus/api/v1/write
    queue_config:
      capacity: 10000
      max_shards: 30
      min_shards: 5
      max_samples_per_send: 2000
    metadata_config:
      send: true
      send_interval: 1m
```
Key sizing guidance: for 100K active series at a 15s scrape interval, `capacity: 10000` and `max_shards: 30` is a reasonable starting point. Monitor `prometheus_remote_storage_queue_highest_sent_timestamp_seconds` to detect backpressure — if it lags more than 5 minutes behind the newest ingested sample, increase `max_shards`.
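That backpressure check can be written as an alert expression. A sketch using Prometheus's own remote-write instrumentation metrics, with the 5-minute lag as a 300s threshold:

```
# Fires when remote write falls more than 5 minutes behind ingestion
(
  prometheus_remote_storage_highest_timestamp_in_seconds
    - ignoring(remote_name, url) group_right
  prometheus_remote_storage_queue_highest_sent_timestamp_seconds
) > 300
```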
When to Use Prometheus Alone vs Grafana + Prometheus
Use Prometheus alone when you are debugging in the terminal, running automated tests, or need to validate that your service is exposing the right metrics during development. The PromQL query language is powerful enough that many engineers do their initial exploration directly in the Prometheus UI or via promtool.
Use Grafana on top of Prometheus when you need to share operational state with non-engineers, track SLO burn rates over time, correlate metrics across multiple systems, or set up alert notifications that go to Slack, PagerDuty, or email. Grafana's unified alerting is substantially more approachable than hand-maintaining Alertmanager's YAML routing trees.
Use Grafana Cloud when you want the Prometheus+Grafana stack without managing the infrastructure. For a startup or small team, the operational simplicity is worth the $50-200/month cost for most production workloads. You get pre-built dashboards for Kubernetes, Prometheus, and popular applications out of the box.
The Complete Observability Stack in 2026
The modern monitoring stack has four layers:
1. Metrics — Prometheus + Thanos/Mimir
Prometheus scrapes instrumented applications. Thanos sidecar or Mimir provides horizontal scalability and long-term storage. For teams on Kubernetes, the prometheus-operator is the standard management layer.
2. Logs — Loki or Elasticsearch
Loki is Grafana Labs' log aggregation system — designed to work with Prometheus. It indexes only label metadata, not full log content, which makes it dramatically cheaper than Elasticsearch for high-volume environments. Most AI inference logs (Python application logs, Ray worker logs) are well-suited to Loki's label-based indexing.
3. Traces — Tempo or Jaeger
Distributed tracing connects a single request across multiple services. Tempo integrates with Grafana natively and can ingest from OpenTelemetry SDKs — the emerging standard for instrumentation.
4. Visualization + Alerting — Grafana
Grafana sits on top of all three. A single Grafana dashboard can correlate a latency spike in Prometheus metrics with error logs in Loki and a trace waterfall in Tempo — the full request journey in one view.
The architecture that works for most production AI infrastructure in 2026: Prometheus (metrics) + Loki (logs) + Tempo (traces) + Grafana (visualization and alerting). All four are open-source, all run on Kubernetes, and Grafana Cloud can host all of them if you prefer not to operate the infrastructure yourself.
Monitoring AI Inference: vLLM, Ray, and Custom Models
If you are running LLM inference in production — on vLLM, Ray, TensorRT-LLM, or a custom serving layer — the Prometheus + Grafana stack extends naturally. The key is instrumenting your inference server to expose the metrics that matter for AI workloads:
- Token throughput — tokens generated per second, broken down by model and endpoint
- Time to first token (TTFT) — time to first token, critical for UX in streaming responses
- KV cache hit rate — indicates whether your batch sizes and context lengths are well-tuned
- GPU utilization and VRAM — the bottleneck in most inference deployments
- Batch queue depth — tells you whether you have headroom to accept more requests or are already saturated
vLLM exposes a /metrics endpoint in Prometheus format by default. For custom serving layers, use the prometheus_client Python library to expose equivalent metrics. The pattern is the same regardless of the serving technology: expose metrics → Prometheus scrapes → Grafana visualizes.
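For a custom serving layer, a minimal prometheus_client sketch of the token-throughput and TTFT metrics above — the metric names are illustrative, not a standard:

```python
import time

from prometheus_client import Counter, Histogram

# Illustrative metric names — align them with your serving stack's conventions
PROMPT_TOKENS = Counter(
    'llm_prompt_tokens_total', 'Prompt tokens processed', ['model'])
COMPLETION_TOKENS = Counter(
    'llm_completion_tokens_total', 'Completion tokens generated', ['model'])
TTFT = Histogram(
    'llm_time_to_first_token_seconds', 'Time to first streamed token',
    ['model'], buckets=[0.05, 0.1, 0.25, 0.5, 1.0, 2.5, 5.0])

def record_generation(model: str, prompt_tokens: int, stream):
    """Wrap a token stream, recording TTFT and throughput counters."""
    PROMPT_TOKENS.labels(model=model).inc(prompt_tokens)
    start = time.monotonic()
    first = True
    for token in stream:
        if first:
            # Observe time-to-first-token exactly once per request
            TTFT.labels(model=model).observe(time.monotonic() - start)
            first = False
        COMPLETION_TOKENS.labels(model=model).inc()
        yield token
```

Expose the registry with `prometheus_client.start_http_server(9400)` at process startup; Prometheus then scrapes port 9400 like any other target.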
Essential PromQL for AI Inference
The queries you need for AI inference monitoring — applicable to any Prometheus-scraped inference server:
```
# KV cache usage ratio (vLLM-specific; metric names vary by version)
# Higher usage with stable latency indicates efficient batch scheduling
vllm:kv_cache_used_ratio{model="$model"}
  / vllm:kv_cache_total_ratio{model="$model"}

# GPU utilization via DCGM exporter, averaged over 5 minutes
# Alert if sustained >90% for >10 minutes
avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])

# Token throughput — prefill vs decode
rate(vllm:prompt_tokens_total[1m])      # Prefill throughput
rate(vllm:completion_tokens_total[1m])  # Decode throughput

# Prefill/decode ratio — diagnose batching efficiency
# A high prefill share means batches are dominated by prompt processing
rate(vllm:prompt_tokens_total[5m])
  / (rate(vllm:prompt_tokens_total[5m]) + rate(vllm:completion_tokens_total[5m]))

# Request-level latency contribution (p99)
histogram_quantile(0.99, rate(vllm:request_duration_seconds_bucket[5m]))

# Batch queue depth — watch for saturation
vllm:scheduler_pending_tokens
```
For Ray AI Runtime, metrics are exposed by the dashboard agent on each node (by default at http://localhost:52365/metrics). Key metrics: `ray_tasks` (labeled by state) for task throughput, `ray_object_store_memory` for shared memory pressure, and `ray_node_gpu_utilization` for per-node GPU usage.
Observability Stack by Team Size
Start with the stack that matches your team scale and scale up as needs grow:
- Solo / 1-2 engineers: Grafana Cloud free tier — 10K series, 50GB logs, enough for side projects and small production workloads. Pre-built vLLM dashboards mean you are operational in hours, not days.
- 3-10 engineers: Grafana Cloud Pro ($75/mo) or self-hosted kube-prometheus-stack. Cloud Pro if you want simplicity; self-hosted if you have ops bandwidth and need more than 50K series.
- 10-50 engineers: Self-hosted Mimir + Loki + Tempo + Grafana Enterprise. Full observability without per-query costs. Plan for 0.5-1 FTE dedicated to observability infrastructure.
- 50+ engineers: Multi-cluster Thanos/Mimir federation across regions, Loki for log aggregation, Tempo for distributed tracing, Grafana Enterprise for multi-tenancy and compliance.