Monitoring LLM APIs: Latency, Cost, Uptime

Anatoly Oshmanovsky

Published: 15.06.2026 · ~4 min read

Short answer. LLM APIs behave like any external HTTP service, but they add two twists: high, variable latency and per-token cost. You need to monitor three axes: uptime (is the endpoint reachable), latency (is it within your SLA), and cost (is token spend creeping up). Basic uptime is covered by ordinary HTTP monitoring of the API URL with alerts to Telegram, Slack, or a webhook; latency and cost require logging on the application side.

Why you can't monitor LLM APIs "the usual way"

Classic monitoring answers "is the service alive." For LLM APIs that isn't enough. A response might arrive in 200 ms or in 30 seconds, and formally both are "up." Cost, meanwhile, scales with token volume rather than with the number of requests. So you add two metrics to the familiar uptime check: latency and cost.

"The service is reachable" and "the service is performing acceptably" are different things. For an LLM, a slow 30-second reply can be worse than an honest timeout.

Three axes of monitoring

Axis	What you measure	Where
Uptime	Endpoint availability, response code	HTTP monitor on the URL
Latency	Response time, p95/p99	App logs + monitoring
Cost	Tokens per request, daily spend	Logging the usage field from the API response

Basic availability check with curl

The simplest check is a request to a lightweight API endpoint, timed from the client. Example latency check:

curl -s -o /dev/null \
  -w "http_code=%{http_code} total=%{time_total}s\n" \
  -H "Authorization: Bearer $LLM_API_KEY" \
  https://api.example-llm.com/v1/models

This command prints the HTTP status code and the total response time. Listing models does not trigger generation, so the check spends no tokens, which makes it a good cost-free health check.

The HTTP-monitor concept

On enterno.io you can set up a plain HTTP monitor on the API URL:

Target: the provider's health or models endpoint, not a generation endpoint, so the checks spend no tokens.
Interval: 1 minute on Pro (5 minutes on the free plan, up to 10 monitors).
Expected code: 200; anything else counts as a failure.
Alerts: Telegram, Slack, or a webhook on the very first failure.
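If you run your own checker alongside a hosted monitor, the rule above fits in a few lines of Python. This is a minimal sketch, not enterno.io's implementation: the `FirstFailureAlerter` class and its `notify` callback are hypothetical names, a connection error is assumed to be recorded as status `0`, and sending one message when an incident opens plus one on recovery is an assumed policy. In practice `notify` would call your Telegram, Slack, or webhook sender.

```python
from dataclasses import dataclass
from typing import Callable, List

EXPECTED_STATUS = 200  # anything else counts as a failure


@dataclass
class FirstFailureAlerter:
    """Hypothetical helper: alert on the very first failed check, not after N retries."""

    notify: Callable[[str], None]
    is_down: bool = False

    def record(self, status: int) -> None:
        """Feed one check result; use status 0 for timeouts or connection errors (assumption)."""
        ok = status == EXPECTED_STATUS
        if not ok and not self.is_down:
            # First failure of a new incident: alert immediately.
            self.is_down = True
            self.notify(f"DOWN: got HTTP {status}, expected {EXPECTED_STATUS}")
        elif ok and self.is_down:
            # Recovery closes the incident with a single message.
            self.is_down = False
            self.notify("RECOVERED: endpoint answered 200 again")
        # Further failures during an open incident stay silent to avoid alert spam.


if __name__ == "__main__":
    sent: List[str] = []
    alerter = FirstFailureAlerter(notify=sent.append)
    for code in [200, 200, 503, 503, 200]:
        alerter.record(code)
    print(sent)
```

Keeping the state per monitor means a long outage checked every minute produces two messages rather than one per check.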

Monitor a cheap health endpoint, not generation. Otherwise the monitoring itself becomes a token-cost line item.

Latency: what to track

p95 and p99, not the average — mean latency hides the tail.
Time to first token with streaming — that's what the user actually feels.
Client-side timeouts — set a reasonable limit and count overruns as incidents.
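All three latency signals can be computed from data you already log. A minimal sketch, assuming you record one latency sample in seconds per request: `percentile` uses the nearest-rank method, and `time_to_first_token` takes a hypothetical `open_stream` callable that sends the request and yields chunks, because the streaming interface differs between client libraries. The sample data is made up to show how a slow tail barely moves the mean but dominates p95 and p99.

```python
import math
import time
from statistics import mean
from typing import Callable, Iterable, List, Optional, Tuple


def percentile(samples: List[float], p: float) -> float:
    """Nearest-rank percentile of latency samples, p in (0, 100]."""
    if not samples:
        raise ValueError("no latency samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))  # 1-based rank
    return ordered[rank - 1]


def time_to_first_token(
    open_stream: Callable[[], Iterable[str]],
) -> Tuple[Optional[float], float]:
    """Return (seconds to first chunk, seconds to last chunk) for one streamed call.

    The clock starts before open_stream() is called, so request setup is included.
    The first value is None if the stream yields nothing.
    """
    start = time.perf_counter()
    first: Optional[float] = None
    for _chunk in open_stream():
        if first is None:
            first = time.perf_counter() - start
    return first, time.perf_counter() - start


def count_overruns(samples: List[float], timeout_s: float) -> int:
    """Number of requests slower than the client-side limit; each is an incident."""
    return sum(1 for s in samples if s > timeout_s)


if __name__ == "__main__":
    # Made-up sample: 18 normal replies and two slow ones.
    latencies = [0.8] * 18 + [12.0, 30.0]
    print(f"mean={mean(latencies):.2f}s")                 # about 2.82 s, looks tolerable
    print("p50", percentile(latencies, 50))                # 0.8
    print("p95", percentile(latencies, 95))                # 12.0
    print("p99", percentile(latencies, 99))                # 30.0
    print("overruns", count_overruns(latencies, 10.0))     # 2
```

Nearest-rank is chosen because it always returns a latency that was actually observed; interpolating methods give slightly different values on small samples.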

Cost: controlling token spend

Most LLM APIs return a usage object with input and output token counts. Log it on every request and aggregate by day. This gives an early signal if spend suddenly jumps — for example, from bloated prompts or a retry loop.

Log input/output tokens from the usage field of every response.
Set a daily spend threshold and alert when it's exceeded.
Watch for spikes — a token jump often means a bug, not a load increase.
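Logging `usage` and rolling it up per day needs only a small aggregator. A minimal in-memory sketch, assuming you call `log()` once per API response: the `DailySpend` class, the per-token prices, the daily limit, and the 3x spike factor are all hypothetical placeholders. Field names inside `usage` differ between providers (for example `prompt_tokens`/`completion_tokens` versus `input_tokens`/`output_tokens`), so both spellings are read.

```python
from datetime import date
from typing import Dict, List, Mapping, Tuple

# Hypothetical prices in USD per 1M tokens; use your provider's actual price list.
PRICE_IN_PER_M = 3.00
PRICE_OUT_PER_M = 15.00


def tokens_from_usage(usage: Mapping[str, int]) -> Tuple[int, int]:
    """Read (input, output) token counts from a response's usage object.

    Providers name these fields differently, so both common spellings are accepted.
    """
    inp = usage.get("input_tokens", usage.get("prompt_tokens", 0))
    out = usage.get("output_tokens", usage.get("completion_tokens", 0))
    return inp, out


class DailySpend:
    """Hypothetical in-memory aggregator: token totals and cost per calendar day."""

    def __init__(self) -> None:
        self.totals: Dict[date, List[int]] = {}  # day -> [input_tokens, output_tokens]

    def log(self, day: date, usage: Mapping[str, int]) -> None:
        inp, out = tokens_from_usage(usage)
        day_totals = self.totals.setdefault(day, [0, 0])
        day_totals[0] += inp
        day_totals[1] += out

    def cost(self, day: date) -> float:
        inp, out = self.totals.get(day, [0, 0])
        return (inp * PRICE_IN_PER_M + out * PRICE_OUT_PER_M) / 1_000_000

    def alerts(self, day: date, daily_limit_usd: float, spike_factor: float = 3.0) -> List[str]:
        """Messages for a daily-limit breach and for a jump versus earlier days."""
        messages: List[str] = []
        today = self.cost(day)
        if today > daily_limit_usd:
            messages.append(f"{day}: spend ${today:.2f} is over the ${daily_limit_usd:.2f} daily limit")
        earlier = [self.cost(d) for d in self.totals if d < day]
        if earlier:
            baseline = sum(earlier) / len(earlier)
            if baseline > 0 and today > spike_factor * baseline:
                messages.append(f"{day}: spend is {today / baseline:.1f}x the earlier daily average")
        return messages


if __name__ == "__main__":
    spend = DailySpend()
    spend.log(date(2026, 6, 14), {"prompt_tokens": 1_000, "completion_tokens": 200})
    spend.log(date(2026, 6, 15), {"input_tokens": 100_000, "output_tokens": 50_000})
    for msg in spend.alerts(date(2026, 6, 15), daily_limit_usd=1.00):
        print(msg)
```

In production the totals would live in a database table or your metrics system rather than in memory, and the spike baseline would usually cover more than a single earlier day.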

FAQ

Can I monitor an LLM API without spending tokens?

Yes. Use a lightweight health or models endpoint that doesn't trigger generation. That checks availability and network latency without paying for model output.

What check interval should I pick?

For a production dependency, 1 minute (Pro). It balances fast incident detection against load. The free plan offers 5-minute checks and up to 10 monitors.

How do I tell "slow" from "down"?

Set a client-side timeout and treat an overrun as a failure. Additionally track p95/p99 to catch degradation before a full outage.
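For a self-hosted probe, the slow-versus-down rule can be written down explicitly. A minimal sketch using only the Python standard library: `SLOW_AFTER_S` and `TIMEOUT_S` are hypothetical thresholds to tune against your SLA, and treating any non-200 answer as "down" mirrors the expected-code rule of the HTTP monitor.

```python
import time
import urllib.error
import urllib.request
from typing import Optional

SLOW_AFTER_S = 5.0   # hypothetical: slower than this counts as degraded
TIMEOUT_S = 30.0     # hypothetical: slower than this counts as down


def classify(status: Optional[int], elapsed_s: float,
             slow_after_s: float = SLOW_AFTER_S,
             timeout_s: float = TIMEOUT_S) -> str:
    """Map one check to "up", "slow" or "down".

    status is None when no HTTP answer arrived (timeout, DNS or connection error).
    """
    if status is None or elapsed_s > timeout_s:
        return "down"   # an overrun is a failure, not a slow success
    if status != 200:
        return "down"   # same rule as the HTTP monitor: anything but 200 fails
    if elapsed_s > slow_after_s:
        return "slow"   # reachable but degraded
    return "up"


def check(url: str, api_key: str, timeout_s: float = TIMEOUT_S) -> str:
    """Probe a cheap endpoint such as the models list, so no tokens are spent."""
    req = urllib.request.Request(url, headers={"Authorization": f"Bearer {api_key}"})
    start = time.perf_counter()
    status: Optional[int]
    try:
        # urllib's timeout applies to each blocking socket operation, not the
        # whole request, so the total elapsed time is also checked in classify().
        with urllib.request.urlopen(req, timeout=timeout_s) as resp:
            resp.read()  # include body download, like curl's time_total
            status = resp.status
    except urllib.error.HTTPError as err:
        status = err.code   # the server answered, just not with 200
    except OSError:
        status = None       # URLError, socket timeout, refused connection and similar
    return classify(status, time.perf_counter() - start, timeout_s=timeout_s)
```

For example, `check("https://api.example-llm.com/v1/models", api_key)` reuses the placeholder endpoint from the curl command. A run of "slow" results without any "down" is exactly the degradation that p95/p99 tracking is meant to surface before an outage.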

Where do the alerts go?

On enterno.io alerts are delivered to Telegram, Slack, or via webhook. See our guides on uptime monitoring and alerting best practices.
