
How to Build a RAG Chatbot

Key idea:

RAG chatbot in 30 minutes: (1) Chunk documents into 500-1000 tokens, (2) Embed via OpenAI text-embedding-3-small ($0.02/1M), (3) Store in Qdrant (Rust open-source), (4) User query → embed → similaritySearch top-5 chunks, (5) Inject into prompt → Claude/GPT-5 generates answer with sources. Stack: Node.js + LangChain.js + Qdrant. Cost: ~$0.001 per query.

Below: step-by-step, working examples, common pitfalls, FAQ.


Step-by-Step Setup

  1. Install Qdrant: docker run -p 6333:6333 qdrant/qdrant
  2. Chunk docs: recursive text splitter with 100-token overlap
  3. Generate embeddings via OpenAI API (batch 100 docs per request)
  4. Upsert into Qdrant collection with payload (source URL, title)
  5. Query pipeline: user input → embed → Qdrant search top-5 → format context
  6. LLM call with system prompt: "Answer only from context, cite sources"
  7. UI: streaming response for UX, show citations in footnotes
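
The prompt-building half of steps 5-6 can be sketched without any external services. A minimal sketch, assuming each retrieved chunk carries the payload from step 4 (source URL and title); names like `buildPrompt` and `RetrievedChunk` are illustrative, not a library API:

```typescript
// Sketch of steps 5-6: turn retrieved chunks into a grounded prompt.
// The chunk shape mirrors the payload upserted in step 4.
interface RetrievedChunk {
  text: string;
  source: string; // payload: source URL
  title: string;  // payload: title
}

function buildPrompt(
  query: string,
  chunks: RetrievedChunk[]
): { system: string; user: string } {
  // Number the chunks so the model can cite them as [1], [2], ...
  const context = chunks
    .map((c, i) => `[${i + 1}] ${c.title} (${c.source})\n${c.text}`)
    .join("\n\n");
  const system =
    "Answer only from the context below and cite sources as [n]. " +
    "If the context does not contain the answer, say you do not know.\n\n" +
    `Context:\n${context}`;
  return { system, user: query };
}

const prompt = buildPrompt("What is HNSW?", [
  {
    text: "HNSW is a graph-based ANN index.",
    source: "https://qdrant.tech/documentation",
    title: "Qdrant docs",
  },
]);
console.log(prompt.system);
```

The `[n]` markers give the UI (step 7) stable anchors for footnote citations.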

Working Examples

LangChain.js full pipeline:

```typescript
import { QdrantVectorStore } from '@langchain/qdrant';
import { OpenAIEmbeddings, ChatOpenAI } from '@langchain/openai';

// `chunks` and `query` come from your ingestion and request handling
const store = await QdrantVectorStore.fromDocuments(
  chunks,
  new OpenAIEmbeddings(),
  { url: 'http://qdrant:6333', collectionName: 'docs' }
);
const docs = await store.similaritySearch(query, 5);
const llm = new ChatOpenAI({ model: 'gpt-5' });
const answer = await llm.invoke([
  // join the page content, not the Document objects themselves
  { role: 'system', content: `Context: ${docs.map((d) => d.pageContent).join('\n')}` },
  { role: 'user', content: query },
]);
```

Qdrant HNSW tuning:

```
PUT /collections/docs
{"vectors": {"size": 1536, "distance": "Cosine"},
 "hnsw_config": {"m": 16, "ef_construct": 100}}
```

Python (LlamaIndex):

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader('./docs').load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query('Your question')
```

Chunking strategy:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=['\n\n', '\n', '.', ' '],
)
```

Hybrid search (dense + sparse):

```
# Qdrant: create named vectors (dense + sparse BM25),
# then batch search and fuse the two result lists with weights
```
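
The hybrid search entry above only sketches the idea. One common client-side way to combine the dense and sparse result lists is Reciprocal Rank Fusion (RRF); the `rrf` helper below is a minimal illustration, not a Qdrant or LangChain API (Qdrant can also fuse results server-side):

```typescript
// Reciprocal Rank Fusion: merge two ranked result lists by summing
// 1/(k + rank) contributions; k=60 is the conventional default.
type Ranked = { id: string }[];

function rrf(dense: Ranked, sparse: Ranked, k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of [dense, sparse]) {
    list.forEach(({ id }, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// A doc ranked by both retrievers rises to the top:
console.log(rrf([{ id: "a" }, { id: "b" }], [{ id: "b" }, { id: "c" }]));
// → [ 'b', 'a', 'c' ]
```

RRF needs no score normalization, which is why it is popular for fusing cosine scores with BM25 scores that live on different scales.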

Common Pitfalls

  • Chunks too small → context is lost. Too large → one vector averages several topics and cosine similarity blurs. 500-1000 tokens is the sweet spot
  • No overlap between chunks — info on boundary is lost. Use 10-20% overlap (100-200 tokens)
  • Vector DB without filter on source/date → irrelevant matches. Use metadata filter
  • Embedding model mismatch: documents embedded with text-embedding-3-small cannot be queried with text-embedding-3-large; the vectors differ in dimension (1536 vs 3072) and are incompatible. Use the same model for indexing and querying
  • Hallucinations do not fully vanish — add "If the context does not contain the answer, say I do not know"
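
The overlap pitfall is easy to see with a toy sliding-window splitter. Words stand in for tokens here, and `chunkWords` is illustrative only; a real project should use a token-aware splitter such as the recursive one shown earlier:

```typescript
// Sliding-window chunking with overlap, illustrating the 10-20% rule:
// consecutive chunks share `overlap` words, so a fact sitting on a
// chunk boundary still appears whole in at least one chunk.
function chunkWords(text: string, size: number, overlap: number): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = size - overlap; // e.g. size 800, overlap 100 → step 700
  for (let i = 0; i < words.length; i += step) {
    chunks.push(words.slice(i, i + size).join(" "));
    if (i + size >= words.length) break; // last window reached the end
  }
  return chunks;
}

const text = Array.from({ length: 20 }, (_, i) => `w${i}`).join(" ");
// size 8, overlap 2 → consecutive chunks share 2 words at the boundary
console.log(chunkWords(text, 8, 2));
```

With overlap 0 the same input splits cleanly at word boundaries, and any sentence straddling a boundary is cut in half in both chunks.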


Frequently Asked Questions

How many documents are needed?

Even ~100 small documents already give useful answers. At 10k+ documents, add a reranker to maintain quality; at 100k+, shard the vector DB and use hybrid search.

Cost?

Embeddings: $0.02/1M tokens. LLM call: $0.15-15/1M. For 1k queries/day ~$0.50-5.
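
A back-of-envelope check of the per-query figure. The token counts (2,000-token context, 300-token answer) are assumptions for illustration, and the output price is an assumption too, since only input prices are quoted above:

```typescript
// Per-query cost estimate from the prices in this FAQ.
const EMBED_PER_M = 0.02;   // $ per 1M tokens (text-embedding-3-small)
const LLM_IN_PER_M = 0.15;  // $ per 1M input tokens (cheap end of the range)
const LLM_OUT_PER_M = 0.6;  // $ per 1M output tokens (assumed)

function queryCost(
  queryTokens: number,
  contextTokens: number,
  answerTokens: number
): number {
  const embed = (queryTokens / 1e6) * EMBED_PER_M; // embed the user query
  const llm =
    ((queryTokens + contextTokens) / 1e6) * LLM_IN_PER_M +
    (answerTokens / 1e6) * LLM_OUT_PER_M;
  return embed + llm;
}

// 20-token query, five ~400-token chunks of context, 300-token answer:
console.log(queryCost(20, 2000, 300).toFixed(5)); // → 0.00048
```

That lands at roughly $0.0005 per query on the cheap tier, i.e. ~$0.50/day at 1k queries, matching the low end of the range above; premium models at $15/1M input push it toward the high end.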

Best LLM for RAG?

Claude Opus 4.7 handles long context best. GPT-5 is a balanced default. Gemini 2.5 offers a 2M-token context window. Self-hosted Llama 3 70B is free apart from compute.

How to monitor RAG quality?

Ragas (Python) measures context_precision, context_recall, answer_relevancy. Set thresholds in CI.
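
Ragas itself is a Python library that judges relevance with an LLM. As a rough stand-in, the same metric shapes can be computed from hand-labelled relevant chunk ids, which is enough for a simple CI gate; `contextPrecision` and `contextRecall` below are simplified illustrations, not Ragas' actual formulas:

```typescript
// Label-based stand-ins for Ragas-style retrieval metrics:
// precision = share of retrieved chunks that are relevant,
// recall    = share of relevant chunks that were retrieved.
function contextPrecision(retrieved: string[], relevant: Set<string>): number {
  if (retrieved.length === 0) return 0;
  return retrieved.filter((id) => relevant.has(id)).length / retrieved.length;
}

function contextRecall(retrieved: string[], relevant: Set<string>): number {
  if (relevant.size === 0) return 1;
  return [...relevant].filter((id) => retrieved.includes(id)).length / relevant.size;
}

const relevant = new Set(["d1", "d2"]);
const retrieved = ["d1", "d3", "d2", "d4", "d5"]; // top-5 from the vector DB
const p = contextPrecision(retrieved, relevant); // 2/5 = 0.4
const r = contextRecall(retrieved, relevant);    // 2/2 = 1.0
// CI gate: fail the build when retrieval quality regresses
if (p < 0.3 || r < 0.9) throw new Error("RAG quality gate failed");
console.log({ p, r });
```

Run a gate like this over a fixed set of labelled queries in CI, and tighten the thresholds as the corpus and chunking settle.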