Text-only LLMs are becoming the minority. The latest generation of production AI systems processes images, transcribes audio, reasons over video frames, and generates synthetic speech — all within a single model pipeline. GPT-4V, Claude 3 Opus with vision, Gemini Pro, LLaVA, and vLLM's new multi-modal extensions have moved multi-modal AI from research demo to production workload.
And with this shift comes a monitoring challenge that existing LLMOps tooling was never designed to handle.
Text-only monitoring is hard enough: token counts, latency histograms, hallucination rates, and cost per request. Add vision tokens, audio transcription quality, cross-modal attention patterns, and media processing latency — and the observability surface area doubles or triples.
This guide covers what you need to monitor in a production multi-modal LLM (MLLM) system, how to instrument it with OpenTelemetry, and the metrics that actually predict user experience in multi-modal applications.
What Makes Multi-Modal Monitoring Different
Before diving into metrics, it is worth understanding what fundamentally changes when you add non-text modalities to an LLM system.
The Tokenization Problem
Text tokenization is well understood. OpenAI and Anthropic publish pricing per token, and you can count tokens with reasonable accuracy using tiktoken or similar libraries.
Image and audio tokenization is far more complex and model-specific. A single image does not map to a fixed number of tokens — it depends on the image dimensions, resolution, number of salient regions, and the model's own tokenization scheme. Some models chunk images into fixed-size patches (ViT-style), others use dynamic resolution tokenization, and some approximate image token counts using heuristic formulas.
This means image token costs are harder to predict and harder to bill accurately. If you are running multi-modal models on a per-token cost structure, you need explicit image token counting per request — something most commercial APIs do not expose cleanly.
Latency Has More Components
Text-only LLM latency is usually measured as TTFT (Time to First Token) and TPOT (Time Per Output Token). These two metrics capture most of what matters for user experience.
Multi-modal latency has at least six distinct components:
| Latency Component | Description | Typical Range |
|---|---|---|
| Media preprocessing | Image resize, normalization, audio resampling | 5-50ms |
| Media tokenization | Converting media to model tokens | 10-200ms (images) |
| Cross-modal encoding | Running the vision/audio encoder | 50-500ms |
| Context assembly | Combining text and media embeddings | 1-10ms |
| LLM inference | Text token generation | Variable (dominant for long outputs) |
| Media generation | If generating images/audio | 100-2000ms |
The preprocessing and encoding stages are often ignored in text-only monitoring setups — but they can account for 30-60% of total latency in image-heavy workloads.
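The stage breakdown above can be captured with a simple per-stage timing wrapper. A minimal stdlib-only sketch — the stage names are illustrative, and the bodies are stubs for your actual pipeline steps:

```python
import time
from contextlib import contextmanager

class StageTimer:
    """Records elapsed wall-clock time (in ms) for each named pipeline stage."""

    def __init__(self):
        self.stages = {}

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Store milliseconds so values map directly onto latency histograms
            self.stages[name] = (time.perf_counter() - start) * 1000.0

timer = StageTimer()
with timer.stage("media_preprocessing"):
    pass  # resize / normalize media here
with timer.stage("llm_inference"):
    pass  # call the model here
```

Emitting one histogram observation per stage (rather than one total) is what makes the stacked latency breakdown in the dashboard section possible.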
Quality Signals Are Harder to Define
For text outputs, you can approximate quality using reference-free evaluation (LLM-as-judge), structured output validation, or ground-truth comparison. For image and audio outputs, quality evaluation is its own sub-discipline.
Some quality signals that matter for multi-modal systems:
- Image caption accuracy: Does the model describe the input correctly?
- Object detection fidelity: Does the model correctly identify spatial relationships?
- Audio transcription accuracy: Word error rate (WER) compared to reference transcripts
- Cross-modal consistency: Does the model's text response match what is in the image?
- Multimodal hallucination rate: Does the model describe objects that are not in the image?
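Of the signals above, audio WER is the most mechanical to compute: it is a word-level Levenshtein edit distance divided by the reference length. A minimal stdlib-only sketch (production pipelines typically use a library such as jiwer):

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference word count,
    computed via word-level Levenshtein distance."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution
            )
    return d[len(ref)][len(hyp)] / max(len(ref), 1)
```

Note that WER can exceed 1.0 when the hypothesis inserts many extra words — do not clamp it silently, or you hide the worst failures.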
Input Validation Surface Expands
A text-only LLM input validation pipeline checks for prompt injection, PII leakage, and malicious content patterns. Multi-modal inputs require validation on each modality:
- Image inputs: NSFW detection, resolution limits, format validation (JPEG, PNG, WebP support varies)
- Audio inputs: Duration limits, sample rate validation, content classification
- Video inputs: Frame rate, total frame count, codec support
A malicious image payload can exploit image parsing libraries, and a corrupted audio file can crash an audio encoder. Input validation for multi-modal is not optional.
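A cheap first line of defense is magic-byte and size checking before any parsing library touches the payload. A stdlib-only sketch — the size limit and accepted formats are illustrative assumptions, not recommendations:

```python
# Magic-byte signatures for the formats this hypothetical service accepts
MAGIC = {
    b"\xff\xd8\xff": "jpeg",
    b"\x89PNG\r\n\x1a\n": "png",
    b"RIFF": "webp",  # WebP is RIFF container: "RIFF" at 0, "WEBP" at byte 8
}
MAX_BYTES = 20 * 1024 * 1024  # hypothetical 20 MB cap

def validate_image_payload(data: bytes) -> str:
    """Return the detected format, or raise ValueError before the payload
    reaches an image parsing library."""
    if len(data) > MAX_BYTES:
        raise ValueError("image exceeds size limit")
    for magic, fmt in MAGIC.items():
        if data.startswith(magic):
            if fmt == "webp" and data[8:12] != b"WEBP":
                continue  # RIFF container that is not actually WebP
            return fmt
    raise ValueError("unsupported or corrupted image format")
```

This does not replace a hardened decoder (a valid header can still precede a malformed body); it only rejects the obviously bad payloads before they consume decode CPU.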
The Metrics That Matter
Based on production deployments at scale, here are the metrics that actually correlate with user experience and cost control in multi-modal LLM systems.
Throughput Metrics
mllm.requests.total (Counter) — Total multi-modal requests, labeled by modality combination (image+text, audio+text, video+text). Segment by model version — multi-modal model capabilities change significantly between releases.
mllm.media.tokens.total (Counter) — Total media tokens consumed, labeled by modality type (image, audio, video). This is distinct from text tokens and must be tracked separately because pricing differs.
mllm.media.tokens.per_request (Histogram) — Distribution of media token count per request. Useful for identifying outliers — a user sending a 4K image when the application expects 224x224 thumbnails will inflate costs dramatically.
mllm.throughput.tokens_per_second (Gauge) — End-to-end throughput in tokens per second (text + media tokens combined). Segmented by model, by modality, and by request priority.
Latency Metrics
mllm.latency.preprocessing (Histogram) — Time spent on media preprocessing before it reaches the model. High preprocessing latency usually indicates CPU-bound image operations or slow audio resampling.
mllm.latency.media_encoding (Histogram) — Time spent in the vision/audio encoder. This is often GPU-bound and relatively consistent — spikes here usually indicate GPU contention.
mllm.latency.first_token (Histogram / TTFT) — Time to first text token. In multi-modal requests, this includes all the above stages plus the first-pass LLM inference. Break this down by modality count to understand the cost of adding more images.
mllm.latency.total (Histogram) — Total request latency from receiving the request to returning the final response. This is your primary SLO metric.
Quality Metrics
mllm.quality.image_caption_similarity (Gauge) — Semantic similarity between generated caption and reference captions (using text embeddings like sentence-transformers). Tracked as a rolling 24-hour average. A drop in caption similarity usually precedes user complaints by 24-48 hours.
mllm.hallucination.image_object_missing (Counter) — Count of times the model mentions an object not present in the input image, validated against an object detection model. This is expensive to compute per-request (run asynchronously), so track as a sampled metric on 1-5% of traffic.
mllm.quality.audio_wer (Gauge) — Word Error Rate for audio transcription, sampled against a reference dataset. Only computable if you have ground-truth transcripts — use a fixed evaluation dataset rather than live traffic.
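The caption similarity gauge reduces to a cosine similarity between two embedding vectors. A minimal sketch of the scoring step — the vectors here are toy placeholders; in production they would come from a text embedding model such as sentence-transformers:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

generated = [0.2, 0.7, 0.1]    # toy embedding of the generated caption
reference = [0.25, 0.65, 0.1]  # toy embedding of a reference caption
score = cosine_similarity(generated, reference)
```

Feed `score` into the rolling 24-hour gauge; the absolute value matters less than its stability over time.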
Cost Metrics
mllm.cost.per_request (Histogram) — Cost per request, combining text token cost + media token cost + any per-request API fees. Segmented by modality type to understand which request patterns are most expensive.
mllm.cost.media_type_distribution (Counter) — Breakdown of total cost by media type. If 80% of your spend is on image processing but 90% of requests are text-only, you have a routing optimization opportunity.
mllm.cost.preemption_count (Counter) — Count of requests that were processed on a lower-priority preemptible instance. Useful for capacity planning and spot vs. on-demand cost analysis.
Error Metrics
mllm.errors.media_parse (Counter) — Count of media parsing failures — corrupted images, unsupported formats, oversized files. Labeled by media type and error code. High rates indicate client-side issues or an active attack surface.
mllm.errors.model_timeout (Counter) — Requests that exceeded the maximum allowed inference time. Labeled by modality type — multi-modal requests often have longer tail latencies.
mllm.errors.quality_below_threshold (Counter) — Requests where a quality signal fell below a defined threshold. These are candidates for retry or fallback to a higher-quality model.
Architecture: Instrumenting a Multi-Modal LLM Stack
The Stack
A representative stack pairs a model server (such as vLLM) behind an API gateway with OpenTelemetry for instrumentation, Prometheus for metrics storage, and Grafana for dashboards — the same components covered in the tooling section below.
OpenTelemetry Instrumentation
The OTel Python SDK instruments multi-modal applications the same way it instruments text-only applications — via auto-instrumentation for HTTP clients and servers. The key is adding custom spans for the modality-specific stages.
The critical addition for multi-modal is the media.tokens attribute. Since image token counts are not directly returned by most APIs, you need to implement a token estimation function:
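One published heuristic is OpenAI's high-detail scheme for GPT-4V-era models (85 base tokens plus 170 per 512px tile after rescaling); other models use different schemes, so treat this as an approximation rather than a billing source:

```python
import math

def estimate_image_tokens(width: int, height: int) -> int:
    """Estimate vision tokens using OpenAI's published high-detail heuristic:
    rescale to fit 2048x2048, then shortest side to at most 768, then
    85 base tokens + 170 per 512px tile. Model-specific — an estimate only."""
    # Scale to fit within 2048x2048
    if max(width, height) > 2048:
        scale = 2048 / max(width, height)
        width, height = int(width * scale), int(height * scale)
    # Scale so the shortest side is at most 768
    if min(width, height) > 768:
        scale = 768 / min(width, height)
        width, height = int(width * scale), int(height * scale)
    tiles = math.ceil(width / 512) * math.ceil(height / 512)
    return 85 + 170 * tiles
```

Attach the estimate as the `media.tokens` span attribute at request time, then reconcile against the provider's invoice monthly to calibrate the heuristic.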
Prometheus Metrics for Multi-Modal Workloads
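The metric catalog above maps onto Prometheus instruments directly (dots become underscores per Prometheus naming conventions). A sketch using the prometheus_client library — label sets and bucket boundaries are illustrative:

```python
from prometheus_client import Counter, Histogram

MLLM_REQUESTS = Counter(
    "mllm_requests_total",
    "Total multi-modal requests",
    ["modality", "model_version"],
)
MEDIA_TOKENS = Counter(
    "mllm_media_tokens_total",
    "Total media tokens consumed",
    ["media_type"],
)
LATENCY_ENCODING = Histogram(
    "mllm_latency_media_encoding_ms",
    "Vision/audio encoder latency in milliseconds",
    ["media_type"],
    # Buckets chosen to straddle the 50-500ms encoding range from the table
    buckets=[25, 50, 100, 250, 500, 1000, 2500],
)

# Example observations for one image+text request
MLLM_REQUESTS.labels(modality="image+text", model_version="v1").inc()
MEDIA_TOKENS.labels(media_type="image").inc(255)
LATENCY_ENCODING.labels(media_type="image").observe(120.0)
```

Keep label cardinality low — labeling by user ID or image dimensions will blow up the time-series count; modality, media type, and model version are usually enough.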
Designing the Grafana Dashboard
A multi-modal LLM dashboard should have four panels that map to the distinct concerns: throughput, latency, cost, and quality.
Panel 1: Request Volume by Modality (Pie Chart)
Shows whether your traffic is predominantly text-only, image+text, or audio+text. Helps with capacity planning and routing decisions.
Panel 2: Latency Breakdown (Stacked Bar Chart)
Shows the p95 latency contribution of each stage. If preprocessing is 60% of your latency, you have a CPU bottleneck to fix.
Panel 3: Cost by Modality Over Time (Time Series)
Shows cost trends per modality. If audio+text is 10% of traffic but 40% of cost, a routing strategy to move audio-only tasks to a cheaper model could cut bills significantly.
Panel 4: Quality Score (Gauge)
Set thresholds: green > 0.85, yellow 0.70-0.85, red < 0.70. An alert on this gauge can catch model quality regressions before they become user-visible incidents.
Common Failure Modes and How to Detect Them
Failure Mode 1: Image Dimension Mismatch
A client sends a 4096x4096 PNG where the application expects 224x224 thumbnails. The image passes your API but gets resized internally — consuming 10x the processing time and 10x the GPU memory.
Detection: mllm.media.tokens.per_request histogram. Look for a bimodal distribution with a high-token tail. Set an alert when requests exceed the 99th percentile of image token count — this catches outlier images before they consume disproportionate resources.
Failure Mode 2: Vision Encoder Saturation
Under heavy load, the vision encoder becomes the bottleneck while the LLM sits idle waiting for encoded image tokens. Throughput drops but raw GPU utilization looks fine because the encoder is using a different device (or a CPU-based preprocessing pipeline).
Detection: Compare mllm_latency_media_encoding_ms p95 vs p50. A large gap (p50=100ms, p95=2000ms) indicates encoder queueing. Fix: horizontal scaling of the encoding stage or asynchronous image preprocessing.
Failure Mode 3: Multi-Modal Hallucination Spikes
Model updates can cause sudden increases in hallucination rates — the model starts describing objects that are not in images. This is especially common when updating to a new model version or when model weights are refreshed.
Detection: The mllm_hallucination_image_object_missing counter. Track this as a rolling 1-hour rate and alert on deviation from the 7-day baseline (e.g., > 3 standard deviations above mean).
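The deviation check itself is a few lines of arithmetic. A stdlib-only sketch, where `baseline_rates` would be the hourly rates sampled over the 7-day window:

```python
import statistics

def deviates_from_baseline(current_rate: float,
                           baseline_rates: list[float],
                           n_sigma: float = 3.0) -> bool:
    """Flag when the current hourly hallucination rate exceeds the baseline
    mean by more than n_sigma standard deviations."""
    mean = statistics.fmean(baseline_rates)
    stdev = statistics.stdev(baseline_rates)  # needs >= 2 samples
    return current_rate > mean + n_sigma * stdev
```

In practice you would run this in your alerting layer (or as a PromQL recording rule); guard against a near-zero stdev during quiet periods, which makes the threshold oversensitive.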
Failure Mode 4: Audio Transcription Quality Degradation
If you are running your own audio transcription model (Whisper, Nova, etc.), quality can degrade due to acoustic model drift, codec changes in upstream audio sources, or audio preprocessing pipeline bugs.
Detection: Track mllm_quality_audio_wer on your evaluation dataset. Alert on WER increases > 10% from baseline. Note: WER is only meaningful if you have a fixed evaluation dataset — you cannot compute WER on live traffic without reference transcripts.
Tooling Recommendations
Commercial Platforms with Multi-Modal Support
- Helicone (helicone.ai): Supports multi-modal request logging with cost tracking, image token estimation, and latency breakdowns. Integrates with OpenAI's vision API and Anthropic's Claude with vision.
- Portkey (portkey.ai): Full observability stack with multi-modal support, semantic caching for image-heavy requests, and custom metrics dashboards.
- Arize Phoenix (arize.com/phoenix): Open-source LLM observability with multi-modal evaluation capabilities, primarily for self-hosted models.
Open Source Stack
- vLLM 0.6+: Native multi-modal support (image, audio via extensions) with Prometheus metrics out of the box. Best for self-hosted deployments.
- OpenTelemetry: Universal instrumentation layer — works with commercial APIs and self-hosted models alike.
- Grafana + Prometheus: The standard for metrics visualization. Multi-modal dashboards follow the same pattern as text-only LLM monitoring, with additional panels for media-specific latency and cost metrics.
Self-Hosted vs. API: When to Split the Stack
If your multi-modal workload is predominantly image processing (document understanding, OCR, chart analysis), consider a hybrid approach: use a self-hosted vision model for high-volume, lower-stakes image tasks, and route to GPT-4V or Claude 3 Opus only for tasks requiring maximum quality. This architectural decision can cut multi-modal costs by 60-80% for document-heavy workflows.
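The routing decision can start as a simple rule before graduating to anything learned. A hypothetical sketch — the backend names and the high-stakes flag are assumptions, not a real API:

```python
def choose_backend(modality: str, high_stakes: bool) -> str:
    """Route high-volume, lower-stakes image tasks to a self-hosted vision
    model; escalate only high-stakes tasks to a commercial multi-modal API."""
    if modality == "text":
        return "self_hosted_text"
    if modality in ("image", "image+text") and not high_stakes:
        return "self_hosted_vision"
    return "commercial_api"
```

Labeling each request with the backend it was routed to (as a metric label) is what lets you later verify the claimed cost savings against `mllm.cost.media_type_distribution`.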
SLOs for Multi-Modal LLM Systems
Define SLOs across the four dimensions that matter:
| SLO | Target | Alert Threshold |
|---|---|---|
| Availability | 99.5% of requests return a response | < 99% in 1 hour |
| Latency (TTFT p95) | < 3s for image+text, < 1s for text-only | > 5s in 15 min |
| Cost per 1K requests | < $2.00 (image+text), < $0.50 (text-only) | > $3.00 in 1 hour |
| Quality (caption similarity) | > 0.80 rolling 24h average | < 0.70 in 1 hour |
Multi-modal SLOs are harder to meet than text-only SLOs because the latency and cost components are less predictable. Set your initial targets loose and tighten them as you gather production data.
Conclusion
Multi-modal LLM monitoring is not just "LLM monitoring + image support." The fundamental differences in tokenization, latency composition, quality evaluation, and cost structure require a dedicated observability approach.
The key principles:
- Track media tokens separately from text tokens — they have different cost structures and different predictability profiles
- Decompose latency into preprocessing, encoding, and inference stages — they have different bottlenecks and different fix strategies
- Measure quality per modality — image caption similarity, audio WER, and cross-modal consistency are all distinct signals
- Set modality-specific SLOs — text-only SLOs are too loose for image-heavy workloads and too tight for text-only requests
Start with the Prometheus metrics and Grafana dashboard outlined here, add the OpenTelemetry instrumentation for your specific model, and build from there. Multi-modal AI is growing faster than any other segment of the AI infrastructure space — the teams that have observability in place now will be the ones who can scale confidently.