The most common misconception in observability tooling is that Prometheus and Grafana are competitors. They are not. They are complementary pieces of the same stack, and using them correctly requires understanding what each one is actually for.

What Prometheus Does Well

Prometheus is a time-series database and monitoring system built around a pull-based model. It reaches out to your services at configured intervals and scrapes exposed metrics. Every metric has a name and a set of labels (key-value pairs) that enable powerful dimensional querying. The PromQL query language is the real differentiator — you can compute rates, aggregations, joins, and functions directly in the query layer without preprocessing your data.

The four primary metric types in Prometheus:

  • Counter — monotonically increasing value. Use for request counts, bytes sent, errors total. Never decreases.
  • Gauge — can go up or down. Use for current memory usage, queue depth, temperature.
  • Histogram — samples observations into configurable buckets. Use for latency distributions — your classic p50/p95/p99.
  • Summary — similar to histogram but computes quantiles client-side. Less flexible than histogram for backend aggregation.
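Counters do drop to zero when a process restarts; PromQL's rate() and increase() compensate by detecting the drop. A minimal stdlib sketch of that reset handling (the function name and sample format are illustrative, not a real API):

```python
def counter_increase(samples):
    """Total increase across ordered counter samples, compensating for
    resets (a drop to a lower value means the process restarted at 0)."""
    total = 0.0
    prev = samples[0]
    for value in samples[1:]:
        if value < prev:
            # Counter reset: the new value IS the increase since restart
            total += value
        else:
            total += value - prev
        prev = value
    return total

# A restart between 25 and 5 still yields the true increase
print(counter_increase([0, 10, 25, 5]))  # → 30.0
```

This is why you never graph a raw counter: the absolute value is meaningless across restarts, but the reset-corrected increase is not.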

What's new in Prometheus 3.x (2026): Prometheus 3.x, production-hardened in late 2025, brought significant improvements for high-scale AI/ML workloads. Remote write protocol v2 improves reliability and compression for metric streams generating 50K+ series — common for multi-GPU inference servers. Native OTLP ingestion means OpenTelemetry SDKs can send metrics directly to Prometheus without an adapter layer, collapsing a common source of operational complexity (traces and logs still need a dedicated backend such as Tempo or Loki). Exemplars — contextual metadata attached to histogram samples — now link Prometheus metrics to distributed traces seamlessly, enabling correlation workflows that previously required custom instrumentation. PromQL subqueries enable runtime evaluation of complex nested queries like `max_over_time(histogram_quantile(0.99, rate(request_duration_seconds_bucket[5m]))[1h:1m])` that previously required recording rules.

A practical example: exposing latency from a Python FastAPI service.

from fastapi import FastAPI, Request, Response
from prometheus_client import Counter, Histogram, generate_latest, CONTENT_TYPE_LATEST

app = FastAPI()

REQUEST_COUNT = Counter(
    'http_requests_total',
    'Total HTTP requests',
    ['method', 'endpoint', 'status']
)

REQUEST_LATENCY = Histogram(
    'http_request_duration_seconds',
    'HTTP request latency',
    ['method', 'endpoint'],
    buckets=[0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1.0, 2.5]
)

@app.middleware("http")
async def track_requests(request: Request, call_next):
    labels = {"method": request.method, "endpoint": request.url.path}
    with REQUEST_LATENCY.labels(**labels).time():
        response = await call_next(request)
    REQUEST_COUNT.labels(**labels, status=str(response.status_code)).inc()
    return response

@app.get("/metrics")
def metrics():
    return Response(generate_latest(), media_type=CONTENT_TYPE_LATEST)

Prometheus scrapes this endpoint and stores the time series. You query it with PromQL:

# Error rate per endpoint over the last 5 minutes
# (aggregate both sides so the label sets match for division)
sum by (endpoint) (rate(http_requests_total{status=~"5.."}[5m]))
/ sum by (endpoint) (rate(http_requests_total[5m]))

# p99 latency by endpoint
histogram_quantile(0.99, sum by (le, endpoint) (rate(http_request_duration_seconds_bucket[5m])))
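Under the hood, histogram_quantile finds the bucket where the target rank falls and linearly interpolates within it. A simplified stdlib sketch of that estimation (it skips PromQL's special handling of the +Inf bucket):

```python
def histogram_quantile(q, buckets):
    """Estimate quantile q from cumulative histogram buckets.
    `buckets` is a sorted list of (upper_bound, cumulative_count)."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for bound, count in buckets:
        if count >= rank:
            if count == prev_count:
                return bound
            # Linear interpolation: observations assumed uniform in-bucket
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count
    return buckets[-1][0]

# 90 of 100 requests finished under 0.5s, so the p90 estimate is 0.5
print(histogram_quantile(0.9, [(0.1, 50), (0.5, 90), (1.0, 100)]))  # → 0.5
```

The interpolation is why your bucket boundaries matter: a quantile that lands in a wide bucket is estimated coarsely, so place bucket edges near your SLO thresholds.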

Prometheus Alertmanager: Routing and Silencing

Prometheus Alertmanager completes the picture — it handles the routing, grouping, and deduplication of alerts generated by Prometheus rules. Without it, you get a firehose of individual alert notifications. With it, you get actionable, grouped alerts that respect maintenance windows.

```bash
# Install alertmanager via Helm
helm install alertmanager prometheus-community/alertmanager \
  --namespace monitoring \
  --set persistentVolume.storageClass=standard
```

The key configuration is the route tree with receiver fallbacks in alertmanager.yml:

```yaml
route:
  receiver: 'team-ops'
  group_by: ['alertname', 'cluster']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - match:
        severity: critical
      receiver: 'pagerduty'
      group_wait: 10s
    - match:
        service: ai-inference
      receiver: 'ai-oncall'
```

Alertmanager's power comes from its inhibition rules — you can suppress lower-priority alerts when a higher-priority alert is already firing (e.g., suppress node-level alerts when the entire cluster is down). For AI inference services specifically, configure inhibit rules that silence GPU temperature warnings when a GPU is already in a critical throttling state.
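A sketch of such an inhibit rule, using Alertmanager's matcher syntax — the alert names here are illustrative placeholders for rules you would define yourself:

```yaml
inhibit_rules:
  # When a GPU is already critically throttling, drop its temperature warnings
  - source_matchers:
      - alertname = GPUCriticalThrottle
    target_matchers:
      - alertname = GPUTempWarning
    equal: ['instance', 'gpu']   # only inhibit alerts for the same GPU
```

The `equal` clause is the important part: without it, a critical alert on one GPU would silence warnings on every GPU.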


What Grafana Does Well

Grafana is a visualization and analytics platform that speaks many data sources. It can query Prometheus, Loki (logs), Tempo (traces), Jaeger, Elasticsearch, InfluxDB, PostgreSQL, CloudWatch, and dozens of others through its plugin ecosystem. Its core value is unifying disparate monitoring data into a single pane of glass.

For AI infrastructure in particular, Grafana dashboards serve three critical roles:

  • Real-time operational dashboards — GPU utilization, memory pressure, KV cache hit rates, token throughput. These need low-latency queries and live-refresh capability.
  • SLO burn-rate tracking — multi-window alerting that correlates error budget consumption with actual user impact.
  • Cost attribution — correlating GPU runtime, token generation volume, and cloud spend in a single view.
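As an example of burn-rate tracking, a fast-burn alert condition for a 99.9% availability SLO might look like this in PromQL — the 14.4 multiplier is the standard fast-burn threshold from Google's SRE workbook, and the metric names follow the FastAPI example above:

```promql
# Error budget burning 14.4x faster than sustainable over the last hour
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
/
  sum(rate(http_requests_total[1h]))
) > (14.4 * 0.001)
```

In practice you pair this with a slower window (e.g. 6h at a lower multiplier) so that brief spikes page no one but sustained burns do.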

What's new in Grafana 11.x (2026): Grafana 11.x brings production-ready features that matter for AI infrastructure teams. Dashboard version control brings Git-like versioning to dashboards — rollbacks, PR-based changes, and audit trails for free. Grafana Assistant (AI-powered) generates PromQL queries from natural language, explains metrics, and suggests correlated signals during incident investigation. Improved alerting UX offers a visual node editor for multi-step alert routing. For AI/ML specifically, Grafana Cloud's pre-built AI dashboards now ship with templates for vLLM, Ray, and Hugging Face Transformers — token throughput, KV cache statistics, batch scheduling metrics, and GPU memory pressure out of the box. Grafana Enterprise adds SAML SSO, role-based access control, and audit logs for compliance-heavy environments.

Grafana Cloud is the managed option. It handles Prometheus, Loki, and Tempo hosting for you — attractive for teams that want observability without operating infrastructure. The free tier includes 10K Prometheus active series and 50GB logs, which covers small-to-medium deployments. The paid tiers start at $9/month for the Hosted Grafana service plus consumption-based pricing for metrics.

Head-to-Head: Prometheus vs Grafana

| Dimension | Prometheus | Grafana |
|---|---|---|
| Primary role | Metrics collection, storage, and querying | Visualization, alerting, and multi-source correlation |
| Data storage | Built-in TSDB (time-series database) | Does not store data; queries external sources |
| Query language | PromQL (powerful, expressive) | Depends on data source (PromQL for Prometheus, LogQL for Loki, etc.) |
| Alerting | Alertmanager for routing, but no built-in UI for alert management | Built-in alert rules, notification policies, and alert state management |
| Dashboarding | Basic UI, not designed for complex dashboards | Purpose-built for rich, interactive dashboards |
| Log management | Not supported natively (use Loki) | Full log exploration via Loki or Elasticsearch |
| Tracing | Not supported natively | Native support for Jaeger, Tempo, X-Ray |
| Kubernetes integration | Best-in-class via prometheus-operator | Queries Prometheus data; Kubernetes integration via plugins |
| Scalability | Single instance: ~1M active series; Thanos/Mimir for horizontal scale | Scales with backend data sources |
| Cost model | Open-source, self-hosted free; managed options via Grafana Enterprise or cloud providers | Free open-source (self-hosted); Cloud: $9/mo + consumption |

The Correct Architecture: Using Both Together

The right mental model: Prometheus is your data plane, Grafana is your control plane and visualization layer. They are not alternatives — you deploy them together.

For Kubernetes clusters

The standard production setup uses the kube-prometheus-stack Helm chart. It packages the prometheus-operator, pre-configured Prometheus instances, kube-state-metrics, node-exporter, and a set of pre-built Grafana dashboards into a single deployable unit.

helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.replicas=2

The prometheus-operator is the key component — it manages Prometheus configurations as Kubernetes CRDs. You define ServiceMonitor and PrometheusRule objects, and the operator handles the underlying Prometheus reloads automatically. This means your monitoring configuration is version-controlled, reviewed, and deployed like any other Kubernetes workload.
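A minimal PrometheusRule sketch to show the shape of that CRD — the alert name, threshold, and metric follow the FastAPI example earlier and should be adapted to your environment:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: inference-alerts
  namespace: monitoring
spec:
  groups:
    - name: inference.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5.."}[5m]))
              / sum(rate(http_requests_total[5m])) > 0.05
          for: 10m
          labels:
            severity: critical
```

The operator picks this up, templates it into the Prometheus rule file, and triggers a config reload — no manual restarts.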

For AI inference workloads (vLLM, Ray, etc.)

AI inference servers expose their own Prometheus metrics. You add a ServiceMonitor to wire them into your existing Prometheus:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: vllm
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: vllm
  endpoints:
    - port: metrics
      path: /metrics
      interval: 10s

Then Grafana queries the same Prometheus for AI-specific dashboards. A single Grafana Cloud or self-hosted instance can serve dashboards for your Kubernetes cluster health, your GPU inference servers, your LLM application layer, and your cloud billing — all from the same Prometheus backend.
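Wiring Grafana to that Prometheus can itself be version-controlled via Grafana's datasource provisioning files. A minimal sketch — the in-cluster URL is an assumption based on kube-prometheus-stack defaults, so substitute your own service name:

```yaml
# /etc/grafana/provisioning/datasources/prometheus.yml
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy
    url: http://prometheus-operated.monitoring.svc:9090
    isDefault: true
```

Provisioned datasources load at startup, which keeps dashboards reproducible across Grafana reinstalls.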

Monitoring GPU Metrics via DCGM Exporter

NVIDIA's DCGM (Data Center GPU Manager) exporter exposes GPU diagnostics, utilization, memory, temperature, and power metrics via Prometheus. For AI inference workloads, the key metrics are:

# GPU utilization
DCGM_FI_DEV_GPU_UTIL{gpu="0"}

# GPU memory used (bytes)
DCGM_FI_DEV_FB_USED{gpu="0"}

# GPU temperature (Celsius)
DCGM_FI_DEV_GPU_TEMP

# Power draw (watts)
DCGM_FI_DEV_POWER_USAGE

# KV cache usage (if supported by your inference server;
# vLLM metric names vary by version — check your /metrics output)
vllm:gpu_cache_usage_perc

Deploy DCGM exporter as a DaemonSet so it runs on every node with a GPU:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      hostPID: true
      # Schedule only onto GPU nodes; the label depends on your device plugin
      nodeSelector:
        nvidia.com/gpu.present: "true"
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/dcgm-exporter:3.3
        securityContext:
          privileged: true
        ports:
        - name: metrics
          containerPort: 9400
        volumeMounts:
        - name: nvidia-container
          mountPath: /var/lib/nvidia-container/runtime/nvidia.sock
      volumes:
      - name: nvidia-container
        hostPath:
          path: /var/lib/nvidia-container/runtime/nvidia.sock

Ray AI Runtime Metrics

For distributed AI training and inference on Ray, the Ray dashboard runs on port 8265, and each node exports Prometheus metrics on a configurable metrics export port (set with `--metrics-export-port` when starting Ray). Point Prometheus at those targets, for example via file-based service discovery:

scrape_configs:
  - job_name: 'ray'
    file_sd_configs:
      - files:
          - /etc/prometheus/ray_targets.yml  # one host:metrics-port entry per node

Ray emits metrics for actor creation, task scheduling, object store memory, and cluster resource utilization — essential for multi-node AI workloads on Kubernetes.

The storage hierarchy that works in production

  • Short-term (0-30 days): Prometheus instances (hot storage, SSD-backed). Query latency under 1 second for all operational dashboards.
  • Long-term (30 days - 1 year): Thanos sidecar uploads TSDB blocks to S3/GCS every 2 hours. Store gateway serves historical queries. No data loss, no operational overhead.
  • Cross-cluster federation: Thanos querier provides a unified PromQL endpoint across all your Prometheus instances — useful for multi-region or multi-cloud setups.
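The sidecar upload in the middle tier needs only an object storage config. A minimal Thanos objstore.yml sketch for S3 — bucket name and endpoint are placeholders:

```yaml
# objstore.yml, passed to the Thanos sidecar via --objstore.config-file
type: S3
config:
  bucket: prometheus-long-term        # placeholder bucket name
  endpoint: s3.us-east-1.amazonaws.com
  # Credentials come from env vars or instance profiles, not this file
```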

Mimir vs Thanos for Long-Term Storage

Both Thanos and Mimir solve the same problem: horizontally scaling Prometheus metrics beyond a single server and storing historical data cost-effectively. The operational models differ significantly.

Mimir (Grafana Labs' fork of Cortex) deploys via a single Helm chart with a unified backend for store, query, and ingest. Its advantages in 2026: native Grafana integration so the metrics explorer handles long-term data identically to live Prometheus data; active quarterly releases; simplified operational surface when you are already on Grafana Cloud or Grafana Enterprise. Mimir's built-in multi-tenancy (tenants isolated via the `X-Scope-OrgID` header) makes multi-team setups straightforward.

Thanos has a more modular architecture — Sidecar, Store Gateway, Querier, Receive, and Compact are separate components. This gives flexibility but requires more operational expertise. Thanos advantages: broader S3-compatible storage support (including MinIO and Wasabi); more mature in some enterprise environments; the Receive component lets you push metrics directly from producers that cannot be scraped.

Recommendation for 2026: Mimir if you want operational simplicity and are already on the Grafana stack. Thanos if you need multi-cloud or on-premises S3 compatibility, or already have Thanos expertise on the team. Both are production-grade — the choice depends on your team's existing knowledge and cloud provider constraints.

Remote Write Best Practices

Prometheus 3.x's native OTLP support eliminates the OTLP receiver adapter for incoming metrics; shipping metrics onward to Mimir or Thanos still uses remote write. Configure it directly:

remote_write:
  - url: https://mimir.example.com/prometheus/api/v1/write
    queue_config:
      capacity: 10000
      max_shards: 30
      min_shards: 5
      max_samples_per_send: 2000
    metadata_config:
      send: true
      send_interval: 1m

Key sizing guidance: for 100K active series with a 15s scrape interval, set `capacity: 10000` and `max_shards: 30`. To detect backpressure, monitor the gap between prometheus_remote_storage_highest_timestamp_in_seconds (newest sample received) and prometheus_remote_storage_queue_highest_sent_timestamp_seconds (newest sample sent) — if sending lags more than 5 minutes behind, increase max_shards.
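That backpressure check can be codified as an alert rule; this sketch follows the pattern used by the upstream Prometheus monitoring mixin (threshold and severity are assumptions to adapt):

```yaml
groups:
  - name: remote-write
    rules:
      - alert: RemoteWriteFallingBehind
        expr: |
          (
            max_over_time(prometheus_remote_storage_highest_timestamp_in_seconds[5m])
            - ignoring(remote_name, url) group_right
              max_over_time(prometheus_remote_storage_queue_highest_sent_timestamp_seconds[5m])
          ) > 300
        for: 15m
        labels:
          severity: critical
```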

When to Use Prometheus Alone vs Grafana + Prometheus

Use Prometheus alone when you are debugging in the terminal, running automated tests, or need to validate that your service is exposing the right metrics during development. The PromQL query language is powerful enough that many engineers do their initial exploration directly in the Prometheus UI or via promtool.

Use Grafana on top of Prometheus when you need to share operational state with non-engineers, track SLO burn rates over time, correlate metrics across multiple systems, or set up alert notifications that go to Slack, PagerDuty, or email. Grafana's unified alerting adds visual rule management and notification policies on top of what Alertmanager's configuration-file routing provides.

Use Grafana Cloud when you want the Prometheus+Grafana stack without managing the infrastructure. For a startup or small team, the operational simplicity is worth the $50-200/month cost for most production workloads. You get pre-built dashboards for Kubernetes, Prometheus, and popular applications out of the box.

The Complete Observability Stack in 2026

The modern monitoring stack has four layers:

1. Metrics — Prometheus + Thanos/Mimir

Prometheus scrapes instrumented applications. Thanos sidecar or Mimir provides horizontal scalability and long-term storage. For teams on Kubernetes, the prometheus-operator is the standard management layer.

2. Logs — Loki or Elasticsearch

Loki is Grafana Labs' log aggregation system — designed to work with Prometheus. It indexes only label metadata, not full log content, which makes it dramatically cheaper than Elasticsearch for high-volume environments. Most AI inference logs (Python application logs, Ray worker logs) are well-suited to Loki's label-based indexing.
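A LogQL query against those labels reads like PromQL with a log pipeline appended; for example (the label values here are illustrative):

```logql
# CUDA OOM events from vLLM pods over the last hour, counted per pod
sum by (pod) (count_over_time({namespace="ai-inference", app="vllm"} |= "CUDA out of memory" [1h]))
```

Because only the labels are indexed, keep label cardinality low (namespace, app, pod) and push everything else into the log line itself.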

3. Traces — Tempo or Jaeger

Distributed tracing connects a single request across multiple services. Tempo integrates with Grafana natively and can ingest from OpenTelemetry SDKs — the emerging standard for instrumentation.

4. Visualization + Alerting — Grafana

Grafana sits on top of all three. A single Grafana dashboard can correlate a latency spike in Prometheus metrics with error logs in Loki and a trace waterfall in Tempo — the full request journey in one view.

The architecture that works for most production AI infrastructure in 2026: Prometheus (metrics) + Loki (logs) + Tempo (traces) + Grafana (visualization and alerting). All four are open-source, all run on Kubernetes, and Grafana Cloud can host all of them if you prefer not to operate the infrastructure yourself.

Monitoring AI Inference: vLLM, Ray, and Custom Models

If you are running LLM inference in production — on vLLM, Ray, TensorRT-LLM, or a custom serving layer — the Prometheus + Grafana stack extends naturally. The key is instrumenting your inference server to expose the metrics that matter for AI workloads:

  • Token throughput — tokens generated per second, broken down by model and endpoint
  • Time to first token (TTFT) — time to first token, critical for UX in streaming responses
  • KV cache hit rate — indicates whether your batch sizes and context lengths are well-tuned
  • GPU utilization and VRAM — the bottleneck in most inference deployments
  • Batch queue depth — tells you whether you have headroom to accept more requests or are already saturated

vLLM exposes a /metrics endpoint in Prometheus format by default. For custom serving layers, use the prometheus_client Python library to expose equivalent metrics. The pattern is the same regardless of the serving technology: expose metrics → Prometheus scrapes → Grafana visualizes.
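Underneath, rate(tokens_total[60s]) is just a windowed rate over counter samples. A stdlib sketch of the same sliding-window tokens-per-second computation — useful for a quick local sanity check before wiring up prometheus_client; the class and method names are illustrative, not a real API:

```python
import time
from collections import deque

class TokenThroughput:
    """Sliding-window tokens/sec, the quantity rate(tokens_total[60s]) reports."""
    def __init__(self, window_s=60.0):
        self.window_s = window_s
        self.events = deque()  # (timestamp, token_count) pairs

    def record(self, tokens, now=None):
        now = time.monotonic() if now is None else now
        self.events.append((now, tokens))

    def rate(self, now=None):
        now = time.monotonic() if now is None else now
        # Drop events that have aged out of the window
        while self.events and self.events[0][0] < now - self.window_s:
            self.events.popleft()
        return sum(n for _, n in self.events) / self.window_s

tracker = TokenThroughput(window_s=60.0)
tracker.record(600, now=0.0)
tracker.record(600, now=30.0)
print(tracker.rate(now=30.0))  # → 20.0 tokens/sec
```

In production you export the raw counter and let Prometheus compute the rate — the counter survives scrapes and restarts, while in-process windows do not.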

Essential PromQL for AI Inference

The queries you need for AI inference monitoring — applicable to any Prometheus-scraped inference server:

# Note: vLLM metric names below reflect recent vLLM releases;
# they change between versions, so check your server's /metrics output

# KV cache usage (vLLM-specific)
# High usage means little headroom for new sequences
vllm:gpu_cache_usage_perc{model_name="$model"}

# GPU utilization via DCGM Exporter (a gauge, 0-100 — do not rate() it)
# Alert if sustained >90% for >10 minutes
avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m])

# Token throughput — prefill vs decode
rate(vllm:prompt_tokens_total[1m])       # Prefill throughput
rate(vllm:generation_tokens_total[1m])   # Decode throughput

# Prefill/decode ratio — diagnose batching efficiency
# High prefill ratio means large batches are stuck on prompt processing
rate(vllm:prompt_tokens_total[5m])
/ (rate(vllm:prompt_tokens_total[5m]) + rate(vllm:generation_tokens_total[5m]))

# Request-level end-to-end latency (p99)
histogram_quantile(0.99, rate(vllm:e2e_request_latency_seconds_bucket[5m]))

# Batch queue depth — watch for saturation
vllm:num_requests_waiting

For Ray AI Runtime, metrics are exposed at http://localhost:52365/metrics via the Ray dashboard agent on each node. Key metrics: ray_tasks (broken down by its State label) for task throughput, ray_object_store_memory for shared memory pressure, and ray_node_gpu_utilization for per-node GPU usage.

Observability Stack by Team Size

Start with the stack that matches your team scale and scale up as needs grow:

  • Solo / 1-2 engineers: Grafana Cloud free tier — 10K series, 50GB logs, enough for side projects and small production workloads. Pre-built vLLM dashboards mean you are operational in hours, not days.
  • 3-10 engineers: Grafana Cloud Pro ($75/mo) or self-hosted kube-prometheus-stack. Cloud Pro if you want simplicity; self-hosted if you have ops bandwidth and need more than 50K series.
  • 10-50 engineers: Self-hosted Mimir + Loki + Tempo + Grafana Enterprise. Full observability without per-query costs. Plan for 0.5-1 FTE dedicated to observability infrastructure.
  • 50+ engineers: Multi-cluster Thanos/Mimir federation across regions, Loki for log aggregation, Tempo for distributed tracing, Grafana Enterprise for multi-tenancy and compliance.