Monitoring LLM APIs: Latency, Cost, Uptime
Short answer. LLM APIs behave like any external HTTP service, but they add two twists: high, variable latency and per-token cost. You need to monitor three axes: uptime (is the endpoint reachable), latency (is it within your SLA), and cost (is token spend creeping up). Basic uptime is covered by ordinary HTTP monitoring of the API URL with alerts to Telegram, Slack, or a webhook; latency and cost require logging on the application side.
Why you can't monitor LLM APIs "the usual way"
Classic monitoring answers "is the service alive." For LLM APIs that isn't enough. A response might arrive in 200 ms or in 30 seconds — and formally both are "up." Cost grows not from the number of requests but from token volume. So you add two metrics to the familiar uptime check.
"The service is reachable" and "the service is performing acceptably" are different things. For an LLM, a slow 30-second reply can be worse than an honest timeout.
Three axes of monitoring
| Axis | What you measure | Where |
|---|---|---|
| Uptime | Endpoint availability, response code | HTTP monitor on the URL |
| Latency | Response time, p95/p99 | App logs + monitoring |
| Cost | Tokens per request, daily spend | Logging the usage field from the API response |
Basic availability check with curl
The simplest check is a request to a lightweight API endpoint with a timing measurement. Example latency check:
curl -s -o /dev/null \
-w "http_code=%{http_code} total=%{time_total}s\n" \
-H "Authorization: Bearer $LLM_API_KEY" \
https://api.example-llm.com/v1/models
This command returns the HTTP code and total response time without spending tokens — ideal for a cost-free health check that skips generation.
The HTTP-monitor concept
On enterno.io you can set a plain HTTP monitor on the API URL:
- Target: the provider's health or models endpoint (not generation — to avoid spending tokens).
- Interval: 1 minute on Pro (5 minutes on the free plan, up to 10 monitors).
- Expected code: 200; anything else counts as a failure.
- Alerts: Telegram, Slack, or a webhook on the very first failure.
Monitor a cheap health endpoint, not generation. Otherwise the monitoring itself becomes a token-cost line item.
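The monitor settings above can also be approximated in application code. The following is a minimal Python sketch, not enterno.io's implementation: `ProbeResult`, `probe`, `classify`, and the 5-second `slow_after_s` threshold are hypothetical names and values introduced for illustration. It treats any code other than the expected one as a failure, as described above, and adds a "slow" state for replies that arrive but take too long.

```python
import http.client
import time
import urllib.error
import urllib.request
from typing import NamedTuple, Optional


class ProbeResult(NamedTuple):
    http_code: Optional[int]  # None when no HTTP response was received
    total_s: float            # wall-clock duration of the whole request


def probe(url: str, api_key: str, timeout_s: float = 10.0) -> ProbeResult:
    """Send one GET to a cheap endpoint (for example a models listing) and time it.

    Note: urllib applies timeout_s to each blocking socket operation,
    so the total duration can exceed it slightly.
    """
    request = urllib.request.Request(
        url, headers={"Authorization": f"Bearer {api_key}"}
    )
    start = time.monotonic()
    try:
        with urllib.request.urlopen(request, timeout=timeout_s) as response:
            response.read()
            code: Optional[int] = response.status
    except urllib.error.HTTPError as err:
        # The server answered, but with a 4xx/5xx status.
        code = err.code
    except (OSError, http.client.HTTPException):
        # DNS failure, refused connection, timeout, or a broken response.
        code = None
    return ProbeResult(code, time.monotonic() - start)


def classify(
    result: ProbeResult,
    expected_code: int = 200,
    slow_after_s: float = 5.0,  # assumed threshold; tune it to your SLA
) -> str:
    """Map a probe result to "up", "slow", or "down"."""
    if result.http_code != expected_code:
        return "down"  # no response, or anything other than the expected code
    if result.total_s > slow_after_s:
        return "slow"  # reachable, but not performing acceptably
    return "up"
```

Calling `classify(probe(url, key))` once per check interval and alerting on the first result that is not "up" mirrors the alerting rule in the list above.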
Latency: what to track
- p95 and p99, not the average — mean latency hides the tail.
- Time to first token with streaming — that's what the user actually feels.
- Client-side timeouts — set a reasonable limit and count overruns as incidents.
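The three points above can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions: the helper names (`percentile`, `count_overruns`, `time_to_first_token`) are hypothetical, the percentile uses the simple nearest-rank method (other definitions interpolate and give slightly different values), and the streaming helper assumes your client exposes the response as an iterable of text chunks.

```python
import math
import statistics
import time
from typing import Callable, Iterable, List, Optional, Sequence, Tuple


def percentile(samples: Sequence[float], q: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least q% of the data at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def count_overruns(samples: Sequence[float], timeout_s: float) -> int:
    """Number of requests that exceeded the client-side limit."""
    return sum(1 for s in samples if s > timeout_s)


def time_to_first_token(
    chunks: Iterable[str],
    clock: Callable[[], float] = time.monotonic,
) -> Tuple[Optional[float], List[str]]:
    """Consume a stream of chunks; return (seconds until the first chunk, all chunks).

    The first element is None if the stream produced nothing.
    """
    start = clock()
    first: Optional[float] = None
    received: List[str] = []
    for chunk in chunks:
        if first is None:
            first = clock() - start
        received.append(chunk)
    return first, received


if __name__ == "__main__":
    # Made-up sample: 90 fast replies and 10 very slow ones.
    latencies = [0.2] * 90 + [30.0] * 10
    print("mean:", statistics.fmean(latencies))
    print("p95:", percentile(latencies, 95))
    print("p99:", percentile(latencies, 99))
    print("overruns:", count_overruns(latencies, timeout_s=10.0))
```

With this made-up sample the mean comes out near 3 seconds, while p95 and p99 both land on the 30-second replies, which is the tail the average hides.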
Cost: controlling token spend
Most LLM APIs return a usage object with input and output token counts. Log it on every request and aggregate by day. This gives an early signal if spend suddenly jumps — for example, from bloated prompts or a retry loop.
- Log input/output tokens from the usage field of every response.
- Set a daily spend threshold and alert when it's exceeded.
- Watch for spikes — a token jump often means a bug, not a load increase.
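The checklist above could be implemented roughly as follows. This is a hedged Python sketch: the field names `input_tokens` and `output_tokens` are an assumption (providers differ; some use `prompt_tokens` and `completion_tokens`), the prices are placeholders rather than real rates, and the record layout and function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Mapping

# Placeholder prices in USD per 1,000,000 tokens; substitute your provider's real rates.
INPUT_PRICE_PER_MTOK = 3.00
OUTPUT_PRICE_PER_MTOK = 15.00


def request_cost(usage: Mapping[str, int]) -> float:
    """Cost of one request, computed from its logged usage object."""
    return (
        usage["input_tokens"] * INPUT_PRICE_PER_MTOK
        + usage["output_tokens"] * OUTPUT_PRICE_PER_MTOK
    ) / 1_000_000


def daily_spend(records: Iterable[Mapping]) -> Dict[str, float]:
    """Sum request costs per day; each record is {"date": "YYYY-MM-DD", "usage": {...}}."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[record["date"]] += request_cost(record["usage"])
    return dict(totals)


def days_over_budget(totals: Mapping[str, float], daily_limit_usd: float) -> List[str]:
    """Days whose spend exceeded the daily threshold."""
    return sorted(day for day, spend in totals.items() if spend > daily_limit_usd)


def spike_days(totals: Mapping[str, float], factor: float = 3.0) -> List[str]:
    """Days whose spend is more than `factor` times the previous logged day's spend."""
    days = sorted(totals)  # ISO dates sort chronologically
    return [
        cur
        for prev, cur in zip(days, days[1:])
        if totals[prev] > 0 and totals[cur] > factor * totals[prev]
    ]
```

Comparing each day to the previous one is a deliberately crude spike rule; a rolling average or a per-request token ceiling may suit steadier traffic better.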
FAQ
Can I monitor an LLM API without spending tokens?
Yes. Use a lightweight health or models endpoint that doesn't trigger generation. That checks availability and network latency without paying for model output.
What check interval should I pick?
For a production dependency, 1 minute (Pro). It balances fast incident detection against load. The free plan offers 5-minute checks and up to 10 monitors.
How do I tell "slow" from "down"?
Set a client-side timeout and treat an overrun as a failure. Additionally track p95/p99 to catch degradation before a full outage.
Where do the alerts go?
On enterno.io alerts are delivered to Telegram, Slack, or via webhook. See our guides on uptime monitoring and alerting best practices.
Related reading: uptime monitoring, API performance metrics, health-check endpoints, alerting best practices.