Building AI Chatbots with RAG — Knowledge-Grounded Conversational Agents
AI chatbots powered by retrieval-augmented generation combine conversational fluency with grounded knowledge from your documents — this guide shows how to build one from scratch.
What You'll Learn
You'll learn to build a production RAG chatbot with session management, document ingestion, conversational memory, and source-grounded answers using LangChain and OpenAI.
Why It Matters
Generic LLM chatbots hallucinate on private data. RAG chatbots retrieve relevant context from your documents before answering, producing accurate, citeable responses that respect your knowledge boundaries.
Real-World Use
Doda Browser's help assistant uses a RAG chatbot to answer user questions from the browser's documentation, release notes, and FAQ pages — returning answers with source links and a confidence score.
RAG Chatbot Architecture
flowchart TD
A[User Query] --> B[Conversation History]
B --> C[Query Rewriting]
C --> D[Embedding Model]
D --> E[Vector Search]
E --> F[Document Retrieval]
F --> G[Context Assembly]
G --> H[LLM Generation]
H --> I[Grounded Answer]
I --> J[Update History]
J --> A
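The Embedding Model, Vector Search, and Document Retrieval steps in the flowchart come down to ranking stored vectors by their similarity to the query vector. The sketch below illustrates that idea in plain Python. It rests on assumptions: the three-dimensional vectors are hand-written stand-ins for real embeddings, and `cosine_similarity` and `top_k` are hypothetical helpers, not part of LangChain or Chroma.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query_vec: list[float], index: dict[str, list[float]], k: int = 2):
    """Return the k (doc_id, score) pairs most similar to query_vec."""
    scored = [(doc_id, cosine_similarity(query_vec, vec))
              for doc_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy "embeddings" (assumed values, for illustration only)
index = {
    "rate-limits.md": [0.9, 0.1, 0.0],
    "install.md": [0.0, 0.2, 0.9],
    "errors.md": [0.7, 0.6, 0.1],
}
query_vec = [1.0, 0.2, 0.0]

results = top_k(query_vec, index, k=2)
print([doc_id for doc_id, _ in results])
```

A real vector store typically replaces the linear scan with an approximate nearest-neighbor index, but the ranking principle is the same.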
Document Ingestion Pipeline
Ingest documents into a vector store for retrieval.
from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Load documents
loader = DirectoryLoader(
    "./docs/",
    glob="**/*.md",
    show_progress=True
)
documents = loader.load()
print(f"Loaded {len(documents)} documents")

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n## ", "\n### ", "\n", ". ", " "]
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

# Create vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
print(f"Vector store created with {vectorstore._collection.count()} vectors")
Expected output:
Loaded 12 documents
Created 184 chunks
Vector store created with 184 vectors
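To build intuition for what `chunk_size` and `chunk_overlap` control, here is a deliberately simplified sketch. It uses a fixed-width character window and is not the separator-aware algorithm that `RecursiveCharacterTextSplitter` applies. `sliding_chunks` is a hypothetical helper introduced only for illustration.

```python
def sliding_chunks(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into fixed-size character windows that overlap by chunk_overlap."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


text = "abcdefghijklmnopqrstuvwxyz"
chunks = sliding_chunks(text, chunk_size=10, chunk_overlap=4)
for chunk in chunks:
    print(chunk)
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. Larger overlap improves recall at the cost of more vectors to store and search.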
Building the RAG Chain with Memory
Combine retrieval, context assembly, and LLM generation with conversation history.
from langchain.chains import create_retrieval_chain
from langchain.chains.history_aware_retriever import (
    create_history_aware_retriever
)
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.3)

# Prompt to rephrase query with history context
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", "Given chat history and a user question, rephrase the question to be standalone."),
    ("placeholder", "{chat_history}"),
    ("human", "{input}")
])
history_aware_retriever = create_history_aware_retriever(
    llm,
    vectorstore.as_retriever(search_kwargs={"k": 4}),
    contextualize_prompt
)

# Prompt to answer with sources
answer_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based on the context. Cite sources. Say 'I cannot find this' if unsure.\n\n{context}"),
    ("placeholder", "{chat_history}"),
    ("human", "{input}")
])
document_chain = create_stuff_documents_chain(llm, answer_prompt)
rag_chain = create_retrieval_chain(
    history_aware_retriever, document_chain
)
print("RAG chain initialized successfully")
Expected output:
RAG chain initialized successfully
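The context assembly step fills the `{context}` slot of the answer prompt with the text of the retrieved chunks. The sketch below shows one way that assembly could look in plain Python, with each chunk tagged by its source so the model has something concrete to cite. `Doc` and `assemble_context` are hypothetical stand-ins, not LangChain classes, and the sample text is illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class Doc:
    """Stand-in for a retrieved document (hypothetical, not LangChain's Document)."""
    page_content: str
    metadata: dict = field(default_factory=dict)


def assemble_context(docs: list[Doc], separator: str = "\n\n") -> str:
    """Join retrieved chunks into one string, tagging each with its source."""
    parts = []
    for doc in docs:
        source = doc.metadata.get("source", "unknown")
        parts.append(f"[source: {source}]\n{doc.page_content}")
    return separator.join(parts)


docs = [
    Doc("The API allows 100 requests per minute.", {"source": "docs/api-reference.md"}),
    Doc("Exceeding the limit returns HTTP 429."),  # no metadata on purpose
]
context = assemble_context(docs)
print(context)
```

If the model is asked to cite sources, the source path has to appear in the text it sees. In LangChain, `create_stuff_documents_chain` accepts a `document_prompt` argument that may serve this purpose; check the API reference for your installed version before relying on it.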
Running the Chatbot
Implement a conversational loop with session management.
from langchain_community.chat_message_histories import ChatMessageHistory

# In-memory session store: maps session_id to its ChatMessageHistory
sessions = {}

def chat(session_id: str, message: str) -> dict:
    # Create a history the first time a session is seen
    if session_id not in sessions:
        sessions[session_id] = ChatMessageHistory()
    history = sessions[session_id]

    result = rag_chain.invoke({
        "input": message,
        "chat_history": history.messages
    })

    # Record the turn so follow-up questions can be rewritten with context
    history.add_user_message(message)
    history.add_ai_message(result["answer"])

    # Deduplicate source paths from the retrieved documents
    sources = list(set(
        doc.metadata.get("source", "unknown")
        for doc in result["context"]
    ))
    return {
        "answer": result["answer"],
        "sources": sources,
        "session_length": len(history.messages) // 2
    }

# Test conversation
response1 = chat("user-1", "What are the API rate limits?")
print("Q: What are the API rate limits?")
print(f"A: {response1['answer'][:150]}...")
print(f"Sources: {response1['sources']}")

response2 = chat("user-1", "What happens if I exceed them?")
print("\nQ: What happens if I exceed them?")
print(f"A: {response2['answer'][:150]}...")
Expected output:
Q: What are the API rate limits?
A: The API allows 100 requests per minute per authentication key. Rate limits reset every 60 seconds from the first request...
Sources: ['docs/api-reference.md']
Q: What happens if I exceed them?
A: Exceeding the rate limit returns a 429 Too Many Requests status. The response includes a Retry-After header indicating when to retry...
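A plain dictionary of histories grows without bound as conversations get longer. The sketch below shows one possible way to cap per-session history, assuming that keeping only the most recent messages is acceptable for your use case. `SessionStore` and its `(role, text)` tuples are hypothetical and independent of LangChain's `ChatMessageHistory`.

```python
from collections import defaultdict, deque


class SessionStore:
    """Keeps only the most recent messages per session (hypothetical helper)."""

    def __init__(self, max_messages: int = 6):
        # deque(maxlen=...) drops the oldest message once the limit is reached
        self._histories = defaultdict(lambda: deque(maxlen=max_messages))

    def add_turn(self, session_id: str, user_msg: str, ai_msg: str) -> None:
        history = self._histories[session_id]
        history.append(("human", user_msg))
        history.append(("ai", ai_msg))

    def messages(self, session_id: str) -> list[tuple[str, str]]:
        return list(self._histories[session_id])


store = SessionStore(max_messages=4)
store.add_turn("user-1", "What are the API rate limits?", "100 requests per minute.")
store.add_turn("user-1", "What happens if I exceed them?", "You get a 429 response.")
store.add_turn("user-1", "How long should I wait?", "Check the Retry-After header.")
store.add_turn("user-2", "Hello", "Hi!")

print(len(store.messages("user-1")), len(store.messages("user-2")))
```

Sessions stay isolated from each other, and the oldest turn of `user-1` is discarded once the cap is hit. A process-local store like this is lost on restart, so a deployed service would likely need an external store instead.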
Adding Guardrails
Prevent off-topic queries and enforce response boundaries.
from pydantic import BaseModel, Field

class GuardrailResult(BaseModel):
    is_allowed: bool = Field(description="Whether the query is allowed")
    reason: str = Field(description="Explanation of the decision")

guardrail_prompt = ChatPromptTemplate.from_messages([
    ("system", """Determine if the query is relevant to the chatbot's
documented knowledge base. Block queries about:
- Personal advice, medical, legal, or financial decisions
- Instructions for harmful or illegal activities
- Queries completely unrelated to the documentation"""),
    ("human", "{query}")
])

# Force the model to reply in the GuardrailResult schema
guardrail_llm = llm.with_structured_output(GuardrailResult)

def guarded_chat(session_id: str, message: str) -> dict:
    # Pass structured chat messages rather than a single flattened string
    check = guardrail_llm.invoke(
        guardrail_prompt.format_messages(query=message)
    )
    if not check.is_allowed:
        return {
            "answer": f"I cannot answer that question. {check.reason}",
            "sources": [],
            "blocked": True
        }
    return chat(session_id, message)

response = guarded_chat("user-1", "How do I hack a website?")
print(response["answer"])
print(f"Blocked: {response.get('blocked', False)}")
Expected output:
I cannot answer that question. This query is about unauthorized access, which is outside the scope of the documentation.
Blocked: True
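The guardrail is a check-then-delegate wrapper: classify the query first, and call the RAG chain only if the query is allowed. The sketch below isolates that control flow so it can be exercised without any API calls. Everything in it is an assumption for illustration: `Verdict`, `make_guarded_chat`, and the two stub functions are hypothetical, with the keyword-based stub standing in for the LLM classifier.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    is_allowed: bool
    reason: str


def make_guarded_chat(
    classify: Callable[[str], Verdict],
    answer: Callable[[str, str], dict],
) -> Callable[[str, str], dict]:
    """Wrap an answer function so blocked queries never reach it."""
    def guarded(session_id: str, message: str) -> dict:
        verdict = classify(message)
        if not verdict.is_allowed:
            return {
                "answer": f"I cannot answer that question. {verdict.reason}",
                "sources": [],
                "blocked": True,
            }
        return answer(session_id, message)
    return guarded


# Stubs standing in for the LLM classifier and the RAG chain (assumptions)
calls = []


def stub_classify(message: str) -> Verdict:
    if "hack" in message.lower():
        return Verdict(False, "Out of scope.")
    return Verdict(True, "In scope.")


def stub_answer(session_id: str, message: str) -> dict:
    calls.append(message)  # record which queries reached the answer step
    return {"answer": "stub answer", "sources": ["docs/api-reference.md"]}


guarded = make_guarded_chat(stub_classify, stub_answer)
blocked = guarded("user-1", "How do I hack a website?")
allowed = guarded("user-1", "What are the API rate limits?")
print(blocked["blocked"], allowed["answer"], calls)
```

Injecting the classifier and the answer function as arguments makes it possible to unit-test the blocking logic with stubs and swap in the real LLM-backed versions later.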
Common Errors
| Error | Cause | Fix |
|---|---|---|
| Bot forgets previous messages | Chat history not passed to the chain | Use create_history_aware_retriever with session memory |
| Retrieved documents are irrelevant | Poor embedding or chunking strategy | Increase chunk overlap and try hybrid search |
| Bot refuses to answer in-scope questions | Guardrails too aggressive | Tune the guardrail prompt with domain-specific examples |
| High latency on every query | Retrieval and generation rerun from scratch for every query | Add caching (for example, an LLM response cache or Redis) for repeated queries |
| Source citations link to wrong chunks | Metadata not preserved during chunking | Copy document metadata to each chunk's metadata dict |
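The caching fix in the table can be sketched as a small in-memory cache keyed on a normalized query string. This is one possible approach under stated assumptions, not a LangChain API: `normalize`, `AnswerCache`, and `slow_answer` are hypothetical names, and `slow_answer` merely stands in for retrieval plus generation.

```python
def normalize(query: str) -> str:
    """Lowercase and collapse whitespace so trivially different queries share a key."""
    return " ".join(query.lower().split())


class AnswerCache:
    """In-memory cache for repeated queries (hypothetical helper)."""

    def __init__(self, compute):
        self._compute = compute  # the expensive function, e.g. a RAG call
        self._store = {}
        self.hits = 0

    def get(self, query: str) -> str:
        key = normalize(query)
        if key in self._store:
            self.hits += 1
            return self._store[key]
        result = self._compute(query)
        self._store[key] = result
        return result


compute_calls = []


def slow_answer(query: str) -> str:
    compute_calls.append(query)  # stands in for retrieval + generation
    return f"answer to: {normalize(query)}"


cache = AnswerCache(slow_answer)
first = cache.get("What are the API rate limits?")
second = cache.get("  what are the API   rate limits? ")
print(cache.hits, len(compute_calls))
```

Caching on the query text alone is only safe for standalone questions. A follow-up such as "What happens if I exceed them?" depends on the chat history, so a cache key for conversational turns would need to include the rewritten standalone question or be skipped entirely.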
Practice Questions
Why does a RAG chatbot need a history-aware retriever? Without it, follow-up questions like "what about the limit?" lose context. The history-aware retriever rewrites queries with conversation context.
How does session memory differ from fine-tuning? Session memory stores conversation history in-memory for the current session; fine-tuning permanently alters the model's weights.
What is the purpose of the guardrail layer in a chatbot? Guardrails filter out-of-scope or harmful queries before they reach the RAG chain, preventing unsafe or off-topic responses.
Why should sources be cited in RAG chatbot responses? Source citations build trust, allow users to verify information, and demonstrate that the answer is grounded in retrieved documents.
Challenge: Build a multi-tenant RAG chatbot where each organization has its own isolated vector store collection, session management, and custom system prompt — all served through a single API endpoint.
Mini Project
Build a customer support chatbot for an e-commerce site. Index product descriptions, return policies, and shipping FAQs into Chroma. Implement session-based conversation memory, a guardrail that blocks pricing discussions and routes them to a pricing API instead, and a feedback mechanism where users rate each answer with a thumbs-up or thumbs-down to improve retrieval over time.
Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.