RAG: Retrieval-Augmented Generation

Key idea:

RAG (Retrieval-Augmented Generation) — a pattern for grounding an LLM on specific data without fine-tuning. Steps: (1) embed documents into vectors → store them in a vector DB (Qdrant/Pinecone/Weaviate), (2) embed the user query → retrieve the top-k most similar chunks, (3) inject the retrieved context into the prompt → the LLM generates an answer with citations. Used in documentation chatbots, enterprise Q&A, and code search. Frameworks: LlamaIndex, LangChain, Haystack.

Below: details, example, related terms, FAQ.

Details

  • Chunking: split docs into 500-1500 token chunks (semantic or fixed)
  • Embedding models: OpenAI text-embedding-3-large, Cohere embed-v3, jina-embeddings-v3
  • Vector DB: Qdrant (Rust open-source), Pinecone (managed), Weaviate, pgvector (PostgreSQL extension)
  • Retrieval: ANN (HNSW) top-k=5-20 chunks + rerank via Cohere/Voyage
  • Generation: LLM with augmented context, often with citations in answer
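The chunking step above can be sketched as a simple fixed-size splitter with overlap. A minimal sketch: real pipelines usually count tokens with a tokenizer; here word count stands in as a rough approximation, and `chunkText` and its default sizes are illustrative, not from any framework.

```typescript
// Fixed-size chunking with overlap (word-based approximation of token chunking).
// Overlap keeps sentences that straddle a boundary retrievable from both chunks.
function chunkText(text: string, chunkSize = 200, overlap = 40): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  // Step forward by (chunkSize - overlap) so consecutive chunks share `overlap` words
  for (let start = 0; start < words.length; start += chunkSize - overlap) {
    chunks.push(words.slice(start, start + chunkSize).join(' '));
    if (start + chunkSize >= words.length) break; // last chunk reached the end
  }
  return chunks;
}
```

Semantic chunkers instead split on structural boundaries (headings, paragraphs), which usually retrieves better but needs format-aware parsing.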

Example

// RAG in LangChain.js
import { OpenAIEmbeddings, ChatOpenAI } from '@langchain/openai';
import { QdrantVectorStore } from '@langchain/qdrant';

const vectorStore = await QdrantVectorStore.fromExistingCollection(
  new OpenAIEmbeddings(),
  { url: 'http://qdrant:6333', collectionName: 'docs' }
);

// Retrieve the top-5 chunks most similar to the query
const relevantDocs = await vectorStore.similaritySearch(userQuery, 5);

// Inject the retrieved chunks into the prompt
const context = relevantDocs.map((d) => d.pageContent).join('\n\n');
const chatModel = new ChatOpenAI();
const answer = await chatModel.invoke([
  { role: 'system', content: `Context:\n${context}` },
  { role: 'user', content: userQuery }
]);

Frequently Asked Questions

RAG vs fine-tuning?

RAG: dynamic knowledge, easy to update, transparent (sources are visible). Fine-tuning: better style/tone control, but knowledge is frozen at training time. The two combine well.

Best chunk size?

Usually 512-1024 tokens. Larger chunks dilute the retrieved context; smaller ones lose surrounding meaning. Test on your corpus.

Hallucinations in RAG?

Reduced, not eliminated. Add a guardrail to the prompt: "If the answer is not in the context, say 'I do not know'". Requiring a citation for each claim also helps.
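The guardrail above can be baked into the system prompt when assembling the context. A sketch: `buildSystemPrompt` and the exact wording are assumptions, not a framework API; adapt the phrasing to your model.

```typescript
// Build a grounded system prompt: instructions first, then numbered chunks
// so the model can cite [1], [2], ... in its answer.
function buildSystemPrompt(contextChunks: string[]): string {
  return [
    'Answer using ONLY the context below.',
    'If the answer is not in the context, say "I do not know".',
    'Cite the chunk number [1], [2], ... for each claim.',
    '',
    ...contextChunks.map((chunk, i) => `[${i + 1}] ${chunk}`),
  ].join('\n');
}
```

The numbered chunks make the citations checkable: a cited index can be mapped back to the source document when rendering the answer.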