RAG — Retrieval Augmented Generation
How AI reads your documents and gives accurate, sourced answers
The problem RAG solves
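Standard LLMs have a knowledge cutoff: they only know what was in their training data. Ask GPT-4 about your company's internal policy document and it cannot answer accurately, because it has never seen that document. Ask it about last week's news and, without a search tool, it will either say it doesn't know or hallucinate a plausible-sounding answer.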
RAG (Retrieval Augmented Generation) fixes this by giving the model a way to look things up before answering — like an open-book exam versus a closed-book one.
How RAG works — step by step
1. Indexing: Your documents (PDFs, websites, databases) are split into chunks and converted into numerical representations called embeddings. These are stored in a vector database.
2. Query: A user asks a question. The question is also converted into an embedding.
3. Retrieval: The system compares the question's embedding with the stored chunk embeddings and returns the closest matches. This is semantic search: it matches on meaning rather than exact keywords.
4. Augmentation: The relevant chunks are inserted into the LLM's prompt as context.
5. Generation: The LLM answers using both its training knowledge and the retrieved documents — and can cite its sources.
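The five steps above can be sketched in a few lines of Python. This is a toy illustration, not a production pipeline: the `embed` function below is a simple word-count vector standing in for a real embedding model, the "vector database" is a plain list, and the generation step is left out (the final prompt is what you would send to an LLM of your choice). All function names and the sample policy text are made up for this example.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy embedding: word counts. A real system would call an embedding model."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def build_index(chunks):
    """Step 1 (indexing): store each chunk next to its embedding."""
    return [(chunk, embed(chunk)) for chunk in chunks]


def retrieve(index, question, k=2):
    """Steps 2 and 3 (query, retrieval): embed the question, rank chunks by similarity."""
    q = embed(question)
    ranked = sorted(index, key=lambda item: cosine(q, item[1]), reverse=True)
    return [chunk for chunk, _ in ranked[:k]]


def build_prompt(question, context_chunks):
    """Step 4 (augmentation): place the retrieved chunks into the prompt as numbered context."""
    context = "\n".join(f"[{i}] {c}" for i, c in enumerate(context_chunks, start=1))
    return (
        "Answer using only the context below and cite sources like [1].\n\n"
        f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    )


# Hypothetical policy snippets used as the document collection.
chunks = [
    "Employees may work remotely up to three days per week.",
    "Expense reports must be submitted within 30 days of purchase.",
    "The office is closed on public holidays.",
]
index = build_index(chunks)
question = "How many days per week can employees work remotely?"
top_chunks = retrieve(index, question, k=1)
prompt = build_prompt(question, top_chunks)
print(prompt)  # Step 5 (generation) would send this prompt to an LLM.
```

Numbering the context chunks in the prompt is one simple way to let the model cite its sources: the answer can refer to [1], and the application maps that number back to the original document.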
Real-world example
Imagine you have 500 pages of legal contracts. That much text can exceed the context limit of many LLMs, so the model may not be able to read everything in one go, and it has never seen your specific contracts during training.
With RAG: all contracts are indexed. You ask "What are the termination clauses in the Reliance contract?" The system retrieves the relevant pages, passes them to the LLM, and gives you an accurate, cited answer.
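A cited answer is only possible if each chunk remembers where it came from. The sketch below shows one way to do that, assuming the contract text has already been extracted page by page. The `Chunk` class, the `chunk_pages` function, the window sizes, and the file name `example_contract.pdf` are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    source: str  # file the text came from (hypothetical name in this example)
    page: int    # 1-based page number, kept so answers can be cited
    text: str


def chunk_pages(source, pages, max_words=40, overlap=10):
    """Split each page into overlapping word windows, recording source and page."""
    if overlap >= max_words:
        raise ValueError("overlap must be smaller than max_words")
    step = max_words - overlap
    chunks = []
    for page_number, page_text in enumerate(pages, start=1):
        words = page_text.split()
        for start in range(0, len(words), step):
            window = words[start:start + max_words]
            chunks.append(Chunk(source, page_number, " ".join(window)))
            if start + max_words >= len(words):
                break  # this window already reached the end of the page
    return chunks


def format_citation(chunk):
    """Human-readable pointer back to the original document."""
    return f"{chunk.source}, p. {chunk.page}"


# Two made-up pages: a long filler page and a short termination clause.
pages = [
    " ".join(f"w{i}" for i in range(100)),
    "Either party may terminate this agreement with 90 days written notice.",
]
chunks = chunk_pages("example_contract.pdf", pages)
print(len(chunks), format_citation(chunks[-1]))
```

The overlap between neighbouring windows is a common precaution so that a clause split across a chunk boundary still appears whole in at least one chunk.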
This is how tools like Perplexity (searches the web), NotebookLM (reads your documents), and custom enterprise AI chatbots work.
When to use RAG vs fine-tuning
Use RAG when: You need the AI to access specific, frequently updated, or private documents. RAG is dynamic: add new documents and they become available to the model as soon as they are indexed, with no retraining.
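To make the "dynamic" point concrete, here is a small sketch in which adding one document changes what the system retrieves, with no training step involved. The `TinyIndex` class is invented for this example and scores matches by simple word overlap, a stand-in for the embedding similarity a real vector database would use. The policy texts are made up.

```python
import re


def tokens(text):
    """Lowercased set of words in the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


class TinyIndex:
    """In-memory stand-in for a vector database (word overlap instead of embeddings)."""

    def __init__(self):
        self.entries = []  # list of (text, token set)

    def add(self, text):
        # Indexing a new document is just an append; the model itself is untouched.
        self.entries.append((text, tokens(text)))

    def search(self, question):
        q = tokens(question)

        def score(entry):
            _, t = entry
            union = q | t
            return len(q & t) / len(union) if union else 0.0

        best = max(self.entries, key=score, default=None)
        return best[0] if best else None


index = TinyIndex()
index.add("Travel policy 2023: economy class for all flights.")
question = "What is the parental leave policy?"

before = index.search(question)  # only the travel policy exists so far
index.add("Parental leave policy: 16 weeks of paid leave.")
after = index.search(question)   # the new document is retrievable right away

print("before:", before)
print("after: ", after)
```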
Use fine-tuning when: You want the model to adopt a specific style, personality, or specialized skill baked in. Fine-tuning changes the model's weights, so it needs training data and compute, and every update means another training run.
Rule of thumb: For knowledge (facts, documents), use RAG. For behaviour (writing style, domain expertise), use fine-tuning.
Tools to build RAG yourself
LlamaIndex: A popular framework built specifically for RAG pipelines. Handles indexing, retrieval, and query logic.
LangChain: Broader AI framework with RAG support. More complex but very flexible.
Supabase + pgvector: PostgreSQL with vector search through the pgvector extension. Both are open-source, and Supabase's hosted service has a free tier.
Pinecone: Managed vector database. Easy to start, scales well.
Weaviate / Qdrant: Open-source vector databases for self-hosting.
For a simple start: LlamaIndex plus a local Llama model can run entirely on your laptop, provided it has enough memory for the model you choose.