Guide to Using RAG with LangChain and ChromaDB/FAISS
Build a 'Chat with PDF' app in 10 minutes. We walk through the code for loading documents, creating embeddings, storing them in a vector store, and querying with a RetrievalQA chain.
TLDR: RAG (Retrieval-Augmented Generation) gives an LLM access to your private documents at query time. You chunk and embed documents into a vector store (ChromaDB or FAISS), retrieve the relevant chunks at query time, and inject them into the LLM's prompt. The model answers from real data instead of hallucinating.
Why LLMs Hallucinate and How RAG Fixes It
An LLM's knowledge is frozen at its training cutoff. Ask it about your internal documentation, your product catalog, or a document uploaded today, and it will generate a plausible-sounding answer from its weights, not from your data.
RAG bridges this gap:
- No fine-tuning required (fine-tuning is expensive and won't help for dynamic data)
- Works with any LLM
- Updates in real time: add new documents to the vector store and they're immediately searchable
Step 1: Chunking and Embedding Your Documents
Before storing, split your documents into chunks and convert each to an embedding.
Why chunking?
| Chunk too small | Chunk too large |
| --- | --- |
| Missing context | Too much noise injected into the LLM prompt |
| Loses coherence across sentences | Uses up context window budget |
| Many irrelevant chunks returned | Retrieval quality degrades |
Overlap (typically 10-20% of chunk size) preserves context across chunk boundaries.
```python
from langchain.text_splitter import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,     # characters per chunk
    chunk_overlap=50,   # 10% overlap preserves boundary context
)
chunks = splitter.split_documents(docs)
```
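Under the hood, overlap-aware splitting is a sliding window over the text. A minimal pure-Python sketch of the idea (not LangChain's actual implementation, which splits recursively on separators like paragraphs and sentences first):

```python
def chunk_text(text: str, chunk_size: int, overlap: int) -> list[str]:
    """Split text into fixed-size chunks; each chunk shares `overlap`
    characters with the previous one, so boundary context is preserved."""
    step = chunk_size - overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "".join(chr(65 + i % 26) for i in range(120))  # 120 chars cycling A-Z
pieces = chunk_text(text, chunk_size=50, overlap=10)
# Three pieces: text[0:50], text[40:90], text[80:120] -- each new piece
# starts 40 characters after the previous one, repeating the last 10.
```

The last 10 characters of each piece reappear at the start of the next, which is exactly what keeps a sentence that straddles a boundary retrievable.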
Step 2: Storing and Searching with ChromaDB
```python
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db",
)

# Retrieve top-4 most relevant chunks for a query
retriever = vectorstore.as_retriever(search_kwargs={"k": 4})
```
ChromaDB stores embeddings locally (persisted to disk). FAISS is an alternative: in-memory only, faster for pure similarity search, with no built-in persistence.
| Feature | ChromaDB | FAISS |
| --- | --- | --- |
| Persistence | Yes (disk) | No (in-memory, manual save) |
| Metadata filtering | Yes | Limited |
| Latency | Slightly higher | Very low |
| Best for | Prototyping + production | High-throughput search, research |
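Whichever store you pick, retrieval boils down to ranking stored vectors by similarity to the query vector (L2 distance and cosine similarity are the common metrics). A minimal pure-Python sketch of cosine ranking, with toy vectors standing in for real embeddings:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity: dot(a, b) / (|a| * |b|).
    1.0 = identical direction, 0.0 = unrelated (orthogonal)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional "embeddings" (real ones have hundreds to thousands of dims).
store = {
    "refund policy chunk": [0.9, 0.1, 0.0],
    "shipping info chunk": [0.1, 0.8, 0.2],
}
query = [0.8, 0.2, 0.1]
best = max(store, key=lambda name: cosine(store[name], query))
# best == "refund policy chunk": its vector points in nearly the same direction
```

The vector store does exactly this ranking, just over millions of vectors with approximate-nearest-neighbor indexes instead of a linear scan.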
Step 3: The RetrievalQA Chain in LangChain
```python
from langchain.chains import RetrievalQA
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",  # "stuff" = inject all context at once
    retriever=retriever,
    return_source_documents=True,
)

result = qa_chain.invoke({"query": "What is our refund policy?"})
print(result["result"])
print(result["source_documents"])
```
With `chain_type="stuff"`, all retrieved chunks are injected into a single prompt. If the retrieved context is too large to fit the model's context window, switch to `"map_reduce"` (summarize each chunk separately, then combine the summaries) or `"refine"` (build up the answer iteratively, one chunk at a time).
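Conceptually, "stuff" is just string assembly: paste every retrieved chunk into one context block and append the question. A hedged sketch of the idea (`build_stuff_prompt` is an illustrative helper, not LangChain's actual prompt template):

```python
def build_stuff_prompt(question: str, chunks: list[str]) -> str:
    """'Stuff'-style prompt assembly: every retrieved chunk goes into one
    context block, followed by the user's question."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

prompt = build_stuff_prompt(
    "What is our refund policy?",
    ["Refunds are issued within 30 days.", "Contact support to start a refund."],
)
```

This makes the failure mode obvious: every retrieved chunk spends context-window budget, which is why chunk size and `k` matter.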
```mermaid
flowchart LR
    Q[User Question] --> Embed[Embed question]
    Embed --> Retrieve[Retrieve top-k chunks\nfrom ChromaDB]
    Retrieve --> Prompt[Inject chunks into\nLLM prompt]
    Prompt --> LLM[LLM call]
    LLM --> A[Answer with citations]
```
What Can Go Wrong: Retrieval Quality Traps
| Problem | Symptom | Fix |
| --- | --- | --- |
| Chunks too large | Irrelevant content in context | Reduce chunk size, test retrieval quality |
| Top-k too low | Answer misses key details | Increase k, or use a reranker |
| Embedding model mismatch | Poor retrieval | Use the same model for indexing and querying |
| No metadata filtering | Returns documents from the wrong project | Add `where` filters in ChromaDB |
| Chain type wrong for large docs | Context overflow | Switch to `map_reduce` or `refine` |
Key Takeaways
- RAG = retrieve relevant document chunks + inject them into the LLM prompt at query time.
- Chunking strategy is critical: too small loses context, too large adds noise.
- ChromaDB handles persistence and metadata filtering; FAISS is faster but in-memory only.
- The `RetrievalQA` chain is the standard LangChain building block for RAG.
- Monitor retrieval quality separately from answer quality: bad retrieval = bad answers regardless of model.
Test Your Understanding
- Why can't you just fine-tune an LLM instead of using RAG for private documents?
- You set `chunk_overlap=0` and notice answers miss context from the boundary of two chunks. What should you change?
- When should you use `chain_type="map_reduce"` instead of `"stuff"`?
- What is the risk of using different embedding models for indexing and querying?
Written by
Abstract Algorithms
@abstractalgorithms