Building a RAG Pipeline with LangChain — Complete Guide
This tutorial walks through building a retrieval-augmented generation (RAG) pipeline with LangChain. It covers the key concepts, working code for each step, and the errors you are most likely to hit.
What You'll Learn
Build a complete RAG pipeline using LangChain that can answer questions about your documents — from PDF loading to LLM-powered answers.
Why It Matters
RAG with LangChain is a common architecture for document Q&A, customer support bots, and knowledge base tools.
Real-World Use
Ask questions about technical documentation, legal contracts, research papers, or internal company wikis.
Setup
Install the five packages you need with pip: langchain, langchain-community, langchain-openai, chromadb, and pypdf. langchain is the orchestration framework that chains components together. langchain-community supplies the document loaders and vector store integrations imported in the steps below. langchain-openai provides the OpenAI integration layer, both for embeddings and for the chat model. chromadb is the vector database that stores embeddings and runs similarity searches. pypdf extracts text content from PDF files so the pipeline can read them. Without any one of these, the pipeline breaks.
Step 1: Load Documents
from langchain_community.document_loaders import PyPDFLoader
loader = PyPDFLoader("manual.pdf")
documents = loader.load()
print(f"Loaded {len(documents)} pages")
PyPDFLoader opens the PDF and extracts text page by page. Each page becomes a separate Document object with two fields: page_content (the extracted text) and metadata (a dict containing the page number and source file path). LangChain supports many other loaders — CSVLoader for spreadsheets, TextLoader for plain text files, UnstructuredHTMLLoader for web pages, and cloud connectors for Notion, Confluence, and Google Drive. Choose the loader that matches your source format.
Common pitfall: Scanned PDFs (image-based, not text-based) return empty pages with PyPDFLoader. If your PDF was created from a scanner rather than a word processor, you need OCR. Switch to UnstructuredPDFLoader with pytesseract installed, or use an OCR preprocessing step before loading.
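One quick check before reaching for OCR is to measure how many pages came back without text. The sketch below is only an illustration on plain strings; empty_page_ratio is a hypothetical helper, not part of LangChain, and in the pipeline you would pass it [doc.page_content for doc in documents].

```python
def empty_page_ratio(page_texts):
    """Return the fraction of pages whose extracted text is empty or whitespace."""
    if not page_texts:
        return 0.0
    empty = sum(1 for text in page_texts if not text.strip())
    return empty / len(page_texts)


# Stand-in data; in the pipeline this would be [d.page_content for d in documents].
pages = ["Warranty terms are listed in section 4.", "", "   \n"]
ratio = empty_page_ratio(pages)
if ratio > 0.5:
    print(f"{ratio:.0%} of pages have no text; the PDF is probably scanned")
```

The 0.5 cutoff is an arbitrary assumption; a document with a few blank separator pages is not necessarily scanned.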
Step 2: Split into Chunks
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,    # target characters per chunk
    chunk_overlap=50   # characters shared by adjacent chunks
)
chunks = splitter.split_documents(documents)
print(f"Split into {len(chunks)} chunks")
LLMs have context windows (4K to 128K tokens depending on the model). You cannot fit an entire book into one prompt. Chunking solves this by dividing documents into smaller pieces that fit within the context window. chunk_size=500 means each chunk targets 500 characters. chunk_overlap=50 means adjacent chunks share 50 characters of overlap so that sentences or ideas are not severed at the boundary.
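To see how chunk_size and chunk_overlap interact, here is a minimal sketch of a naive fixed-size splitter. It is not how RecursiveCharacterTextSplitter works internally; fixed_size_chunks is a hypothetical helper that only illustrates the sliding window, where each chunk starts chunk_size - chunk_overlap characters after the previous one.

```python
def fixed_size_chunks(text, chunk_size=500, chunk_overlap=50):
    """Naive character splitter with a sliding window."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


text = "x" * 1200
chunks = fixed_size_chunks(text)
print([len(c) for c in chunks])  # [500, 500, 300]
```

With a 450-character step, the windows cover characters 0-500, 450-950, and 900-1200, so the last 50 characters of each chunk reappear at the start of the next.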
The RecursiveCharacterTextSplitter tries to split at natural boundaries in this order: double newlines (paragraph breaks), single newlines (line breaks), spaces (word breaks), and finally characters. This produces more coherent chunks than a naive character split.
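The separator fallback can be sketched in a few lines. This is a simplified illustration, not LangChain's implementation: recursive_split is a hypothetical function, and it skips the step where the real splitter merges small pieces back together up to chunk_size and applies the overlap.

```python
def recursive_split(text, chunk_size, separators=("\n\n", "\n", " ", "")):
    """Split text at the coarsest separator that yields pieces within chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators or separators[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    sep, rest = separators[0], separators[1:]
    pieces = []
    for part in text.split(sep):
        # Pieces that are still too long fall through to the next, finer separator.
        pieces.extend(recursive_split(part, chunk_size, rest))
    return pieces


text = "para one\n\npara two is longer\nsecond line"
print(recursive_split(text, chunk_size=12))
```

The first paragraph fits and is kept whole, the long line falls back to word breaks, and "second line" survives as one piece. The many tiny word-level pieces show why the real splitter merges neighbors again after splitting.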
Choosing chunk size: Smaller chunks (200-300 characters) work well for factual Q&A where you want precise retrieval. Larger chunks (1000-2000) suit summarization tasks. For most document Q&A, 500-1000 characters with 10-20% overlap is a good starting point. Experiment and measure retrieval quality on your specific documents.
Step 3: Create Embeddings and Store
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma
embeddings = OpenAIEmbeddings()
# Embed every chunk and write the vectors to a local Chroma database
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
Embeddings convert text into a vector — a list of floating-point numbers that captures semantic meaning. OpenAI's embedding model (text-embedding-ada-002) produces 1536-dimensional vectors. Similar texts produce similar vectors, so the distance between vectors reflects the distance in meaning.
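To make "similar texts produce similar vectors" concrete, here is a small sketch that computes cosine similarity by hand. The three-dimensional vectors are made up for illustration; real embeddings come from the model and have far more dimensions.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Made-up vectors standing in for embeddings of three short texts.
warranty = [0.9, 0.1, 0.0]
guarantee = [0.8, 0.2, 0.1]
recipe = [0.0, 0.1, 0.9]

print(cosine_similarity(warranty, guarantee))  # close to 1: similar direction
print(cosine_similarity(warranty, recipe))     # close to 0: nearly orthogonal
```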
Chroma is a vector database that stores these embeddings and supports fast similarity searches. When a question comes in, it embeds the question using the same model and finds the stored vectors closest to it. Chroma's default metric is squared L2 distance, and cosine distance can be configured instead; for unit-length vectors such as OpenAI embeddings, both produce the same ranking. persist_directory="./chroma_db" saves the database to disk so you do not need to re-embed on every run. Without it, the vector store lives in memory only.
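Conceptually, a similarity search is a nearest-neighbor lookup. The sketch below does it by brute force over a tiny in-memory list; a real vector database uses an index rather than scanning every vector. The top_k and squared_l2 helpers, the texts, and the two-dimensional vectors are all assumptions made for illustration, and squared Euclidean distance is used only for brevity.

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two vectors of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k(query_vec, store, k=4):
    """Return the k texts whose vectors are closest to query_vec."""
    ranked = sorted(store, key=lambda item: squared_l2(query_vec, item[1]))
    return [text for text, _ in ranked[:k]]


# Made-up (text, vector) pairs standing in for embedded chunks.
store = [
    ("Warranty lasts two years.", [0.9, 0.1]),
    ("Shipping takes five days.", [0.1, 0.9]),
    ("Warranty excludes water damage.", [0.8, 0.3]),
]
query_vec = [1.0, 0.0]  # stand-in for the embedded question
print(top_k(query_vec, store, k=2))
```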
Requirement: Set OPENAI_API_KEY as an environment variable or in a .env file. Each embedding call costs a fraction of a cent, but costs add up for large document collections. For production, consider caching embeddings or using a local model like sentence-transformers.
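Caching can be as simple as keying stored vectors by a hash of the text. The following is a minimal in-memory sketch; CachedEmbedder and fake_embed are hypothetical names, and fake_embed stands in for a paid embedding call.

```python
import hashlib


class CachedEmbedder:
    """Wrap an embedding function so identical texts are embedded only once."""

    def __init__(self, embed_fn):
        self.embed_fn = embed_fn
        self.cache = {}
        self.calls = 0  # number of real (billable) embedding calls

    def embed(self, text):
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()
        if key not in self.cache:
            self.cache[key] = self.embed_fn(text)
            self.calls += 1
        return self.cache[key]


def fake_embed(text):
    # Hypothetical stand-in for a paid embedding API call.
    return [float(len(text)), float(text.count(" "))]


embedder = CachedEmbedder(fake_embed)
embedder.embed("two year warranty")
embedder.embed("two year warranty")  # served from the cache
print(embedder.calls)  # 1
```

A production version would persist the cache to disk or a database so it survives restarts; that part is left out here.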
Step 4: Query with RAG
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
llm = ChatOpenAI(model="gpt-4o-mini")
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever()
)
answer = qa.invoke("What is the warranty period?")
print(answer["result"])
This is where retrieval-augmented generation comes together. When you call qa.invoke("What is the warranty period?"), five steps execute internally:
- The question is embedded using the same embedding model from Step 3.
- Chroma finds the 4 most similar chunks (default k=4) by comparing their embeddings with the question's embedding.
- The "stuff" chain type concatenates all retrieved chunks into a single prompt as context.
- The LLM (gpt-4o-mini) reads the question alongside the context and generates an answer grounded in those documents.
- The answer is returned under the result key.
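The steps above can be sketched as one plain function. This is an illustration of the flow, not RetrievalQA's actual code: answer_question is a hypothetical name, and the embed, search, and llm arguments are stand-in callables that you would replace with the real embedding model, vector store, and chat model.

```python
def answer_question(question, embed, search, llm, k=4):
    query_vec = embed(question)                  # 1. embed the question
    chunks = search(query_vec, k)                # 2. fetch the k nearest chunks
    context = "\n\n".join(chunks)                # 3. "stuff": one combined context
    prompt = f"Context:\n{context}\n\nQuestion: {question}\nAnswer:"
    text = llm(prompt)                           # 4. generate the answer
    return {"query": question, "result": text}   # 5. answer under the "result" key


# Stand-ins so the sketch runs without any API key.
stored_chunks = ["The warranty period is listed on page 3.", "Shipping info.", "Returns policy."]
fake_embed = lambda text: [float(len(text))]
fake_search = lambda vec, k: stored_chunks[:k]
fake_llm = lambda prompt: f"(model output for a prompt of {len(prompt)} characters)"

answer = answer_question("What is the warranty period?", fake_embed, fake_search, fake_llm, k=2)
print(answer["result"])
```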
The "stuff" chain type works well when you retrieve a small number of chunks (4-6). For larger retrieval sets, use "map_reduce" (summarizes each chunk independently first) or "refine" (iteratively refines the answer across chunks). These handle larger context but cost more in tokens and latency.
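The extra cost of "map_reduce" is easier to see in a sketch. The function below is a simplified assumption of the pattern, not LangChain's implementation: map_reduce_answer and fake_llm are hypothetical, and the prompts are placeholders.

```python
def map_reduce_answer(question, chunks, llm):
    # Map: ask the model about each chunk on its own.
    partial = [llm(f"Context:\n{c}\n\nQuestion: {question}\nAnswer:") for c in chunks]
    # Reduce: combine the partial answers in one final call.
    combined = "\n".join(partial)
    return llm(f"Partial answers:\n{combined}\n\nQuestion: {question}\nFinal answer:")


calls = []


def fake_llm(prompt):
    # Stand-in model that records each prompt it receives.
    calls.append(prompt)
    return f"answer {len(calls)}"


final = map_reduce_answer("What is the warranty period?", ["chunk 1", "chunk 2", "chunk 3"], fake_llm)
print(len(calls))  # 4: one call per chunk plus one to combine
```

With "stuff" the same question takes a single model call, which is where the difference in tokens and latency comes from.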
Tuning tip: If answers are incomplete, increase the number of retrieved chunks with as_retriever(search_kwargs={"k": 6}). If answers contain irrelevant information, decrease k or lower chunk_size.
Customizing the Prompt
from langchain.prompts import PromptTemplate
template = """Use the following context to answer the question.
If you don't know, say you don't know.
Context: {context}
Question: {question}
Answer:"""
prompt = PromptTemplate.from_template(template)
# Rebuild the chain so the "stuff" step uses the custom prompt
qa = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=vectorstore.as_retriever(),
    chain_type_kwargs={"prompt": prompt}
)
The default prompt works, but customizing it gives you fine control over the model's output. You can enforce tone ("Answer in one short sentence"), add format instructions ("List the top 3 reasons with bullet points"), require source citations ("Include the source page number"), or define fallback behavior ("If the context does not contain the answer, say 'Not found in the provided documents'").
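The template uses {context} and {question} as placeholders. LangChain fills {context} with the concatenated retrieved chunks and {question} with the user's original question. The chain supplies only these two values at query time, so any other variable in the template has to be filled in beforehand, for example with prompt.partial(...), before you pass the prompt via chain_type_kwargs.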
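To preview what the model actually receives, you can fill the template by hand with plain string formatting. This sketch assumes two made-up chunks joined by blank lines; it does not call LangChain.

```python
template = """Use the following context to answer the question.
If you don't know, say you don't know.
Context: {context}
Question: {question}
Answer:"""

# Made-up chunks standing in for what the retriever would return.
chunks = ["The warranty lasts 24 months.", "Returns are accepted within 30 days."]

filled = template.format(
    context="\n\n".join(chunks),
    question="What is the warranty period?",
)
print(filled)
```

Printing the filled prompt is a cheap way to debug odd answers, because it shows exactly which context the model was given.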
Tip: Add a system instruction at the top — "You are a helpful technical support agent for Acme Corp" — to give the model a consistent persona across all queries.
Common Errors
1. Missing API key
openai.OpenAIError: The api_key client option must be set
Set the environment variable before running: export OPENAI_API_KEY="sk-...". Or create a .env file and use python-dotenv to load it automatically.
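Failing early with a clear message is often nicer than waiting for the client to raise. Here is a small sketch; require_env is a hypothetical helper, and the dictionary passed in the example stands in for the real environment so the snippet runs anywhere.

```python
import os


def require_env(name, env=None):
    """Return the value of an environment variable or raise a clear error."""
    env = os.environ if env is None else env
    value = env.get(name, "").strip()
    if not value:
        raise RuntimeError(f"{name} is not set; export it or add it to your .env file")
    return value


# In a real script: require_env("OPENAI_API_KEY") at the top, before building the pipeline.
# Here a plain dict stands in for os.environ.
print(require_env("OPENAI_API_KEY", {"OPENAI_API_KEY": "sk-example"}))
```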
2. Chroma persistence conflict
chromadb.errors.DuplicateIDError: ID already exists
If your documents changed and you re-run the script, delete the ./chroma_db folder first with rm -rf ./chroma_db, or use a different persist_directory path.
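If you prefer to reset the store from Python rather than the shell, the standard library can do the same job as rm -rf. A minimal sketch, with reset_vector_store as a hypothetical helper; the demo runs against a temporary directory so it does not touch a real ./chroma_db.

```python
import shutil
import tempfile
from pathlib import Path


def reset_vector_store(persist_directory):
    """Delete a persisted vector store directory; return True if something was removed."""
    path = Path(persist_directory)
    if path.is_dir():
        shutil.rmtree(path)
        return True
    return False


# Demo on a throwaway directory; in the pipeline you would pass "./chroma_db".
demo_dir = Path(tempfile.mkdtemp()) / "chroma_db"
demo_dir.mkdir()
(demo_dir / "index.bin").write_text("stale data")
print(reset_vector_store(demo_dir))  # True
print(reset_vector_store(demo_dir))  # False, already gone
```

Call it before Chroma.from_documents whenever the source documents have changed.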
3. PDF has no extractable text
PyPDFLoader returns Document objects with empty page_content for scanned PDFs. Verify your PDF is text-based by opening it in a reader and selecting text. If text is not selectable, use UnstructuredPDFLoader with OCR enabled.
4. Token limit exceeded
Context length exceeded... maximum context length is 4097 tokens
Reduce chunk_size, reduce k (number of retrieved chunks), or switch to a model with a larger context window like gpt-4o-mini (128K tokens).
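A rough budget check can tell you in advance whether a given k and chunk_size will fit. The sketch below assumes about 4 characters per token, which is only a rule of thumb for English text; estimated_prompt_tokens and its default question and template sizes are hypothetical, and an exact count requires the model's tokenizer.

```python
def estimated_prompt_tokens(k, chunk_size, question_chars=100, template_chars=200, chars_per_token=4):
    """Rough token estimate for a "stuff" prompt holding k chunks of chunk_size characters."""
    total_chars = k * chunk_size + question_chars + template_chars
    return total_chars // chars_per_token


limit = 4097
for k, chunk_size in [(4, 500), (6, 1000), (20, 2000)]:
    tokens = estimated_prompt_tokens(k, chunk_size)
    status = "fits" if tokens < limit else "too large"
    print(f"k={k}, chunk_size={chunk_size}: ~{tokens} tokens ({status})")
```

Remember to leave room for the answer as well, since the completion counts against the same window.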
5. Irrelevant or hallucinated answers
The retrieved chunks may not contain the relevant information. Increase k to retrieve more chunks, adjust chunk overlap, or use search_type="similarity_score_threshold" with a minimum score to filter out low-relevance results.
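The idea behind a score threshold is simple: drop anything below a minimum relevance before it reaches the prompt. A minimal sketch with made-up scores follows; filter_by_score is a hypothetical helper, and the 0.75 cutoff is an arbitrary assumption you would tune on your own data.

```python
def filter_by_score(scored_chunks, min_score):
    """Keep only (text, score) pairs whose relevance score reaches min_score."""
    return [(text, score) for text, score in scored_chunks if score >= min_score]


# Made-up relevance scores in the range 0 to 1, higher meaning more relevant.
scored = [
    ("Warranty lasts two years.", 0.91),
    ("Warranty excludes water damage.", 0.78),
    ("Shipping takes five days.", 0.42),
]
kept = filter_by_score(scored, min_score=0.75)
print([text for text, _ in kept])

if not kept:
    print("No chunk is relevant enough; better to answer 'not found'.")
```

An empty result is useful information: it lets the application say the answer is not in the documents instead of passing weak context to the model.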
Practice
- Experiment with chunk sizes: Change chunk_size to 200 and then to 1000. Load a 5-page PDF and observe how the number of chunks changes. Test the same question with each chunk size and compare answer quality.
- Switch to MMR retrieval: Replace as_retriever() with as_retriever(search_type="mmr", search_kwargs={"k": 6, "fetch_k": 20}). MMR (Maximal Marginal Relevance) diversifies the retrieved chunks so you get broader coverage of the document.
- Add source citations: Modify the prompt template to include "Include the page number from the metadata" and display the source page alongside each answer.
- Challenge: Build a multi-file pipeline that loads all PDFs from a directory using DirectoryLoader, indexes them together, and queries across all documents at once.
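For the MMR exercise, it can help to see the selection rule itself. The following is a minimal greedy sketch under simplifying assumptions, not LangChain's implementation: the mmr and cosine functions are hypothetical, the vectors are made up, and lambda_mult weighs relevance to the query against similarity to chunks already picked.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def mmr(query_vec, candidates, k=2, lambda_mult=0.5):
    """Greedily pick k (text, vector) candidates, trading relevance against redundancy."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        def score(item):
            relevance = cosine(query_vec, item[1])
            redundancy = max((cosine(item[1], s[1]) for s in selected), default=0.0)
            return lambda_mult * relevance - (1 - lambda_mult) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [text for text, _ in selected]


# Two near-duplicate chunks and one that covers a different topic.
candidates = [
    ("warranty summary", [1.0, 0.0]),
    ("warranty summary (duplicate wording)", [0.98, 0.1]),
    ("exclusions", [0.5, 0.85]),
]
query_vec = [1.0, 0.2]
print(mmr(query_vec, candidates, k=2, lambda_mult=1.0))  # pure relevance: both warranty chunks
print(mmr(query_vec, candidates, k=2, lambda_mult=0.5))  # balanced: one warranty chunk plus "exclusions"
```

With lambda_mult=1.0 the rule reduces to plain similarity ranking, which is why the two near-duplicates are returned together; lowering it penalizes the second duplicate and lets the different chunk in.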
Summary
You built a complete RAG pipeline: loading PDFs with PyPDFLoader, splitting into coherent chunks, generating embeddings with OpenAI, storing in Chroma, and querying with an LLM via RetrievalQA. The same architecture underlies customer support bots, legal document analysis tools, research paper assistants, and internal knowledge base search. The components (loader, splitter, vector store, retriever, LLM) are modular and swappable, letting you adapt the pipeline to other document formats, scales, or domains.