
How to Build a RAG Chatbot

Key idea:

RAG chatbot in 30 minutes: (1) Chunk documents into 500-1000 tokens, (2) Embed via OpenAI text-embedding-3-small ($0.02/1M), (3) Store in Qdrant (Rust open-source), (4) User query → embed → similaritySearch top-5 chunks, (5) Inject into prompt → Claude/GPT-5 generates answer with sources. Stack: Node.js + LangChain.js + Qdrant. Cost: ~$0.001 per query.

Below: step-by-step, working examples, common pitfalls, FAQ.


Step-by-Step Setup

  1. Install Qdrant: docker run -p 6333:6333 qdrant/qdrant
  2. Chunk docs: recursive text splitter with 100-token overlap
  3. Generate embeddings via OpenAI API (batch 100 docs per request)
  4. Upsert into Qdrant collection with payload (source URL, title)
  5. Query pipeline: user input → embed → Qdrant search top-5 → format context
  6. LLM call with system prompt: "Answer only from context, cite sources"
  7. UI: streaming response for UX, show citations in footnotes
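
The prompt-building half of steps 5-6 can be sketched without any external services. A minimal sketch, assuming each retrieved chunk carries the payload from step 4 (source URL and title); names like `buildPrompt` and `RetrievedChunk` are illustrative, not a library API:

```typescript
// Sketch of steps 5-6: turn retrieved chunks into a grounded prompt.
// The chunk shape mirrors the payload upserted in step 4.
interface RetrievedChunk {
  text: string;
  source: string; // payload: source URL
  title: string;  // payload: title
}

function buildPrompt(
  query: string,
  chunks: RetrievedChunk[]
): { system: string; user: string } {
  // Number the chunks so the model can cite them as [1], [2], ...
  const context = chunks
    .map((c, i) => `[${i + 1}] ${c.title} (${c.source})\n${c.text}`)
    .join("\n\n");
  const system =
    "Answer only from the context below and cite sources as [n]. " +
    "If the context does not contain the answer, say you do not know.\n\n" +
    `Context:\n${context}`;
  return { system, user: query };
}

const prompt = buildPrompt("What is HNSW?", [
  {
    text: "HNSW is a graph-based ANN index.",
    source: "https://qdrant.tech/documentation",
    title: "Qdrant docs",
  },
]);
console.log(prompt.system);
```

The `[n]` markers give the UI (step 7) stable anchors for footnote citations.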

Working Examples

LangChain.js full pipeline:

```typescript
import { QdrantVectorStore } from '@langchain/qdrant';
import { OpenAIEmbeddings, ChatOpenAI } from '@langchain/openai';

// `chunks` and `query` come from your ingestion and request handling
const store = await QdrantVectorStore.fromDocuments(
  chunks,
  new OpenAIEmbeddings(),
  { url: 'http://qdrant:6333', collectionName: 'docs' }
);
const docs = await store.similaritySearch(query, 5);
const llm = new ChatOpenAI({ model: 'gpt-5' });
const answer = await llm.invoke([
  // join the page content, not the Document objects themselves
  { role: 'system', content: `Context: ${docs.map((d) => d.pageContent).join('\n')}` },
  { role: 'user', content: query },
]);
```

Qdrant HNSW tuning:

```
PUT /collections/docs
{"vectors": {"size": 1536, "distance": "Cosine"},
 "hnsw_config": {"m": 16, "ef_construct": 100}}
```

Python (LlamaIndex):

```python
from llama_index.core import VectorStoreIndex, SimpleDirectoryReader

docs = SimpleDirectoryReader('./docs').load_data()
index = VectorStoreIndex.from_documents(docs)
query_engine = index.as_query_engine()
response = query_engine.query('Your question')
```

Chunking strategy:

```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=800,
    chunk_overlap=100,
    separators=['\n\n', '\n', '.', ' '],
)
```

Hybrid search (dense + sparse):

```
# Qdrant: create named vectors (dense + sparse BM25),
# then batch search and fuse the two result lists with weights
```
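
The hybrid search entry above only sketches the idea. One common client-side way to combine the dense and sparse result lists is Reciprocal Rank Fusion (RRF); the `rrf` helper below is a minimal illustration, not a Qdrant or LangChain API (Qdrant can also fuse results server-side):

```typescript
// Reciprocal Rank Fusion: merge two ranked result lists by summing
// 1/(k + rank) contributions; k=60 is the conventional default.
type Ranked = { id: string }[];

function rrf(dense: Ranked, sparse: Ranked, k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of [dense, sparse]) {
    list.forEach(({ id }, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}

// A doc ranked by both retrievers rises to the top:
console.log(rrf([{ id: "a" }, { id: "b" }], [{ id: "b" }, { id: "c" }]));
// → [ 'b', 'a', 'c' ]
```

RRF needs no score normalization, which is why it is popular for fusing cosine scores with BM25 scores that live on different scales.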

Common Pitfalls

  • Chunks too small → context is lost. Too large → one vector averages several topics and cosine similarity blurs. 500-1000 tokens is the sweet spot
  • No overlap between chunks — info on boundary is lost. Use 10-20% overlap (100-200 tokens)
  • Vector DB without filter on source/date → irrelevant matches. Use metadata filter
  • Embedding model mismatch: documents embedded with text-embedding-3-small cannot be queried with text-embedding-3-large; the vectors differ in dimension (1536 vs 3072) and are incompatible. Use the same model for indexing and querying
  • Hallucinations do not fully vanish — add "If the context does not contain the answer, say I do not know"
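
The overlap pitfall is easy to see with a toy sliding-window splitter. Words stand in for tokens here, and `chunkWords` is illustrative only; a real project should use a token-aware splitter such as the recursive one shown earlier:

```typescript
// Sliding-window chunking with overlap, illustrating the 10-20% rule:
// consecutive chunks share `overlap` words, so a fact sitting on a
// chunk boundary still appears whole in at least one chunk.
function chunkWords(text: string, size: number, overlap: number): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  const step = size - overlap; // e.g. size 800, overlap 100 → step 700
  for (let i = 0; i < words.length; i += step) {
    chunks.push(words.slice(i, i + size).join(" "));
    if (i + size >= words.length) break; // last window reached the end
  }
  return chunks;
}

const text = Array.from({ length: 20 }, (_, i) => `w${i}`).join(" ");
// size 8, overlap 2 → consecutive chunks share 2 words at the boundary
console.log(chunkWords(text, 8, 2));
```

With overlap 0 the same input splits cleanly at word boundaries, and any sentence straddling a boundary is cut in half in both chunks.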


Frequently Asked Questions

How many documents are needed?

Even ~100 small documents already give useful answers. At 10k+ documents, add a reranker to maintain quality; at 100k+, shard the vector DB and use hybrid search.

Cost?

Embeddings: $0.02/1M tokens. LLM call: $0.15-15/1M. For 1k queries/day ~$0.50-5.
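
A back-of-envelope check of the per-query figure. The token counts (2,000-token context, 300-token answer) are assumptions for illustration, and the output price is an assumption too, since only input prices are quoted above:

```typescript
// Per-query cost estimate from the prices in this FAQ.
const EMBED_PER_M = 0.02;   // $ per 1M tokens (text-embedding-3-small)
const LLM_IN_PER_M = 0.15;  // $ per 1M input tokens (cheap end of the range)
const LLM_OUT_PER_M = 0.6;  // $ per 1M output tokens (assumed)

function queryCost(
  queryTokens: number,
  contextTokens: number,
  answerTokens: number
): number {
  const embed = (queryTokens / 1e6) * EMBED_PER_M; // embed the user query
  const llm =
    ((queryTokens + contextTokens) / 1e6) * LLM_IN_PER_M +
    (answerTokens / 1e6) * LLM_OUT_PER_M;
  return embed + llm;
}

// 20-token query, five ~400-token chunks of context, 300-token answer:
console.log(queryCost(20, 2000, 300).toFixed(5)); // → 0.00048
```

That lands at roughly $0.0005 per query on the cheap tier, i.e. ~$0.50/day at 1k queries, matching the low end of the range above; premium models at $15/1M input push it toward the high end.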

Best LLM for RAG?

Claude Opus 4.7 handles long context best. GPT-5 is a balanced default. Gemini 2.5 offers a 2M-token context window. Self-hosted Llama 3 70B is free apart from compute.

How to monitor RAG quality?

Ragas (Python) measures context_precision, context_recall, answer_relevancy. Set thresholds in CI.
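
Ragas itself is a Python library that judges relevance with an LLM. As a rough stand-in, the same metric shapes can be computed from hand-labelled relevant chunk ids, which is enough for a simple CI gate; `contextPrecision` and `contextRecall` below are simplified illustrations, not Ragas' actual formulas:

```typescript
// Label-based stand-ins for Ragas-style retrieval metrics:
// precision = share of retrieved chunks that are relevant,
// recall    = share of relevant chunks that were retrieved.
function contextPrecision(retrieved: string[], relevant: Set<string>): number {
  if (retrieved.length === 0) return 0;
  return retrieved.filter((id) => relevant.has(id)).length / retrieved.length;
}

function contextRecall(retrieved: string[], relevant: Set<string>): number {
  if (relevant.size === 0) return 1;
  return [...relevant].filter((id) => retrieved.includes(id)).length / relevant.size;
}

const relevant = new Set(["d1", "d2"]);
const retrieved = ["d1", "d3", "d2", "d4", "d5"]; // top-5 from the vector DB
const p = contextPrecision(retrieved, relevant); // 2/5 = 0.4
const r = contextRecall(retrieved, relevant);    // 2/2 = 1.0
// CI gate: fail the build when retrieval quality regresses
if (p < 0.3 || r < 0.9) throw new Error("RAG quality gate failed");
console.log({ p, r });
```

Run a gate like this over a fixed set of labelled queries in CI, and tighten the thresholds as the corpus and chunking settle.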