RAG (Retrieval-Augmented Generation) — a pattern for grounding an LLM on specific data without fine-tuning. Steps: (1) embed documents into vectors → store them in a vector DB (Qdrant/Pinecone/Weaviate), (2) embed the user query → retrieve the top-k most similar chunks, (3) inject the retrieved context into the prompt → the LLM generates an answer with citations. Used for chatbots over documentation, enterprise Q&A, and code search. Frameworks: LlamaIndex, LangChain, Haystack.
Below: details, example, related terms, FAQ.
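The retrieval step (2) boils down to nearest-neighbor search over embedding vectors. A minimal sketch in TypeScript with a toy in-memory store in place of a real vector DB — the `Embedded` type and `topK` helper are illustrative names, not a library API:

```typescript
// Toy in-memory vector store entry (illustrative, not a library type).
type Embedded = { text: string; vector: number[] };

// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Return the k entries most similar to the query vector.
function topK(query: number[], store: Embedded[], k: number): Embedded[] {
  return [...store]
    .sort((x, y) => cosine(query, y.vector) - cosine(query, x.vector))
    .slice(0, k);
}
```

A real vector DB replaces the linear scan with an approximate index (HNSW, IVF), but the ranking idea is the same.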
# RAG in LangChain.js

```ts
import { ChatOpenAI, OpenAIEmbeddings } from '@langchain/openai';
import { QdrantVectorStore } from '@langchain/qdrant';

// Connect to an existing Qdrant collection of embedded documents
const vectorStore = await QdrantVectorStore.fromExistingCollection(
  new OpenAIEmbeddings(),
  { url: 'http://qdrant:6333', collectionName: 'docs' }
);

// Retrieve the top-5 chunks most similar to the user query
const relevantDocs = await vectorStore.similaritySearch(userQuery, 5);

// Inject the retrieved context into the prompt
const chatModel = new ChatOpenAI();
const answer = await chatModel.invoke([
  {
    role: 'system',
    content: `Context:\n${relevantDocs.map(d => d.pageContent).join('\n')}`
  },
  { role: 'user', content: userQuery }
]);
```

**RAG or fine-tuning?** RAG: dynamic knowledge, easy to update, transparent (sources are visible). Fine-tuning: better style and tone, but the knowledge is frozen at training time. The two combine well.
**What chunk size?** Usually 512–1024 tokens. Larger chunks dilute the context; smaller ones lose meaning. Test on your own corpus.
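A minimal chunker sketch, using whitespace-separated words as a rough proxy for tokens (a production splitter would use the model's tokenizer and split on semantic boundaries; assumes `size > overlap`):

```typescript
// Split text into fixed-size chunks of `size` words with `overlap`
// words shared between consecutive chunks (word count approximates tokens).
function chunkWords(text: string, size: number, overlap: number): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push(words.slice(start, start + size).join(' '));
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}
```

Overlap keeps a sentence that straddles a boundary retrievable from either side.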
**Does RAG eliminate hallucinations?** Reduced, not eliminated. Add to the prompt: "If the answer is not in the context, say \"I do not know\"", and require a citation for each claim (chain-of-citations).
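One way to wire that guard into the system prompt, sketched as a hypothetical helper — the exact wording and the `[n]` citation format are assumptions, not a library convention:

```typescript
// Build a grounding system prompt: numbered sources for citation,
// plus the "I do not know" escape hatch for out-of-context questions.
function buildGroundedPrompt(chunks: string[]): string {
  const sources = chunks.map((c, i) => `[${i + 1}] ${c}`).join('\n');
  return [
    'Answer ONLY from the context below.',
    'If the answer is not in the context, say "I do not know".',
    'Cite sources as [n] after each claim.',
    '',
    `Context:\n${sources}`,
  ].join('\n');
}
```

The numbered sources make the model's citations checkable after the fact: each `[n]` can be mapped back to the retrieved chunk.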