Short answer. A RAG pipeline is a chain of components: an embedding service, a vector DB, a retriever and an LLM API. A failure or slowdown at any stage degrades the answers. Monitoring RAG comes down to checking each component's availability over HTTP, watching per-stage latency, and tracking a heartbeat for background indexing. enterno.io provides the external availability layer from RU, EU and US; it does not replace output-quality eval.
RAG components that fail
In a typical RAG pipeline, every stage can fail:
- The embedding service: unavailable or slow;
- The vector DB (Qdrant, Weaviate, pgvector, etc.): down or degrading on latency;
- The retriever / search API: returns errors or empty results;
- The LLM API: 429/5xx responses, rising response time;
- Indexing: a background job stops updating the store.
Three layers of RAG monitoring
- Component availability — HTTP health checks of each service.
- Stage latency — where time is actually lost.
- Index freshness — a heartbeat for background indexing.
The trickiest RAG problem isn't a crash but silent degradation: the vector DB responds, but slowly, and the user gets a delayed answer or one built on stale documents.
| Component | Typical failure | How to monitor |
|---|---|---|
| Embedding service | Unavailable or slow | HTTP monitor |
| Vector DB | Down, rising latency | HTTP monitor + latency |
| Retriever / search API | Errors or empty results | HTTP monitor |
| LLM API | 429/5xx, slow response | HTTP monitor |
| Indexing | Didn't refresh the store in time | Heartbeat |
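The silent-degradation case described above can be caught by treating a slow 2xx response as its own state rather than as a pass. The sketch below is one minimal way to do that in shell. The function name `classify_check` and the 1.0-second default threshold are assumptions chosen for illustration, not enterno.io settings.

```shell
# classify_check CODE SECONDS [THRESHOLD]
# CODE is an HTTP status, SECONDS a plain number such as 0.231 (no "s" suffix).
# Prints "down" for any non-2xx code, "degraded" for a 2xx response slower
# than THRESHOLD seconds, and "ok" otherwise.
# The 1.0 s default threshold is an illustrative assumption.
classify_check() {
  code="$1"; seconds="$2"; threshold="${3:-1.0}"
  case "$code" in
    2[0-9][0-9])
      # awk handles the floating-point comparison that plain sh cannot.
      if awk -v t="$seconds" -v max="$threshold" 'BEGIN { exit !(t > max) }'; then
        echo "degraded"
      else
        echo "ok"
      fi
      ;;
    *)
      echo "down"
      ;;
  esac
}
```

The status code and duration can come from curl's `%{http_code}` and `%{time_total}` write-out variables. curl reports `000` when no HTTP response arrived, which this sketch classifies as `down`.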
Health-checking components
Set up a check for each critical pipeline service:
# Vector DB
curl -o /dev/null -s -w "vectordb %{http_code} %{time_total}s\n" \
https://vectordb.internal.example.com/healthz
# Retriever / search API
curl -o /dev/null -s -w "retriever %{http_code} %{time_total}s\n" \
https://retriever.example.com/health
# LLM API
curl -o /dev/null -s -w "llm %{http_code} %{time_total}s\n" \
https://api.example-llm.com/v1/health
Add each check to enterno.io as a separate HTTP monitor — so you instantly see which component took the pipeline down.
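For a quick local run before the monitors are in place, the same checks can be wrapped in a helper and filtered down to the components that failed. This is a rough sketch: the names `check` and `failing_components` and the 10-second timeout are assumptions, and the URLs are the placeholder hosts from the examples above.

```shell
# check NAME URL
# Prints "NAME <http_code> <time_total>s", the same format as the checks above.
# --max-time 10 is an assumed timeout so a hung service cannot stall the run.
check() {
  curl -o /dev/null -s --max-time 10 \
    -w "$1 %{http_code} %{time_total}s\n" "$2"
}

# failing_components
# Reads "NAME CODE TIME" lines on stdin and prints the name of every component
# whose status code is not 2xx (curl prints 000 when it got no response).
failing_components() {
  awk '$2 !~ /^2[0-9][0-9]$/ { print $1 }'
}

# Example usage (not executed here):
# {
#   check vectordb  https://vectordb.internal.example.com/healthz
#   check retriever https://retriever.example.com/health
#   check llm       https://api.example-llm.com/v1/health
# } | failing_components
```

An empty result means every component answered with a 2xx code. It says nothing about latency, which still has to be read from the timing column.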
Heartbeat for indexing
Background reindexing should signal completion. Have the job ping a heartbeat after each successful update:
# After a successful reindex
curl -fsS https://enterno.io/api/heartbeat/INDEX_TOKEN \
-o /dev/null && echo "index heartbeat sent"
If indexing doesn't run on time, you'll learn about a stale index before users start getting outdated answers.
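The important detail is that the ping must depend on the job's exit status, so a failed reindex never looks healthy. A minimal wrapper might look like the sketch below. The names `send_heartbeat` and `run_with_heartbeat`, the timeout and retry values, and the `./reindex.sh` script are hypothetical. `INDEX_TOKEN` is the same placeholder as above.

```shell
# Heartbeat URL; INDEX_TOKEN is a placeholder for the real token.
HEARTBEAT_URL="${HEARTBEAT_URL:-https://enterno.io/api/heartbeat/INDEX_TOKEN}"

# Sends one heartbeat ping. The timeout and retry counts are assumptions.
send_heartbeat() {
  curl -fsS --max-time 10 --retry 3 -o /dev/null "$HEARTBEAT_URL"
}

# run_with_heartbeat CMD [ARGS...]
# Runs the job and pings the heartbeat only if the job exits with status 0.
# On failure it sends nothing and returns the job's own exit status.
run_with_heartbeat() {
  if "$@"; then
    send_heartbeat
  else
    status=$?
    echo "job failed with exit $status, heartbeat not sent" >&2
    return "$status"
  fi
}

# Example usage (hypothetical job script, not executed here):
# run_with_heartbeat ./reindex.sh
```

With this shape, a crashed or partially completed reindex produces a missed heartbeat, which is what triggers the stale-index alert.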
What to monitor beyond availability
- Stage latency: separate monitors on the retriever and the LLM API.
- SSL certificates and DNS for the pipeline's external services.
- Cost: LLM tokens spent on answer generation (log these in your own tracing).
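To see where a slow request loses time (DNS, TCP connect, TLS handshake, server wait, or body transfer), curl's timing variables can be broken down per phase. curl reports them as cumulative timestamps, so the sketch below subtracts neighbouring values. The function names and the 10-second timeout are assumptions for illustration.

```shell
# phase_durations
# Reads one line of curl's cumulative timestamps on stdin:
#   namelookup connect appconnect starttransfer total
# and prints the duration of each phase in seconds.
phase_durations() {
  awk '{
    dns = $1
    tcp = $2 - $1
    # time_appconnect is 0 for plain HTTP, so fall back to the TCP timestamp.
    tls_done = ($3 > 0) ? $3 : $2
    tls = tls_done - $2
    srv = $4 - tls_done
    body = $5 - $4
    printf "dns=%.3f tcp=%.3f tls=%.3f wait=%.3f transfer=%.3f\n", dns, tcp, tls, srv, body
  }'
}

# timing_breakdown URL
# Fetches URL once and prints the per-phase breakdown.
timing_breakdown() {
  curl -o /dev/null -s --max-time 10 \
    -w '%{time_namelookup} %{time_connect} %{time_appconnect} %{time_starttransfer} %{time_total}\n' \
    "$1" | phase_durations
}

# Example usage (not executed here):
# timing_breakdown https://retriever.example.com/health
```

A large `wait` value points at the service itself, while large `dns` or `tls` values point at the network path or certificate setup rather than at the retriever or the LLM.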
The line: availability, not quality
Let's be honest: enterno.io doesn't score retrieval relevance or compute RAG quality metrics (retrieval precision/recall, faithfulness). That needs eval tools. enterno.io answers "is each pipeline component alive and how fast does it respond" — and that's the layer that most often breaks production.
FAQ
Does enterno.io evaluate retrieval quality?
No, that's a job for eval tools. enterno.io covers component availability and latency plus an indexing heartbeat.
How do I tell which stage is slow?
Create a separate monitor per service and compare latency — the bottleneck becomes obvious.
What about a stale index?
Use a background-indexing heartbeat: it raises an alert if reindexing didn't run within the expected window.
Can I monitor from Russia?
Yes, checks run from ru-msk, with EU and US added on paid tiers.
Cover the pipeline: create HTTP checks for each component on the monitors page and connect a heartbeat for indexing.