Building AI Chatbots with RAG — Knowledge-Grounded Conversational Agents

DodaTech, updated 2026-06-22, 5 min read

AI chatbots powered by retrieval-augmented generation combine conversational fluency with grounded knowledge from your documents — this guide shows how to build one from scratch.

What You'll Learn

You'll learn to build a production RAG chatbot with session management, document ingestion, conversational memory, and source-grounded answers using LangChain and OpenAI.

Why It Matters

Generic LLM chatbots hallucinate on private data. RAG chatbots retrieve relevant context from your documents before answering, producing accurate, citeable responses that respect your knowledge boundaries.

Real-World Use

Doda Browser's help assistant uses a RAG chatbot to answer user questions from the browser's documentation, release notes, and FAQ pages — returning answers with source links and a confidence score.

RAG Chatbot Architecture

flowchart TD
    A[User Query] --> B[Conversation History]
    B --> C[Query Rewriting]
    C --> D[Embedding Model]
    D --> E[Vector Search]
    E --> F[Document Retrieval]
    F --> G[Context Assembly]
    G --> H[LLM Generation]
    H --> I[Grounded Answer]
    I --> J[Update History]
    J --> A
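
The retrieval half of this flow (embed, search, assemble context) can be sketched without any external service. The example below is a toy illustration only: it swaps the embedding model for a bag-of-words count and the vector store for a plain list. The names `DOCS`, `embed`, `cosine`, `retrieve`, and `assemble_context` are hypothetical and are not part of LangChain or Chroma.

```python
import math
import re
from collections import Counter

# Hypothetical mini corpus standing in for the indexed chunks.
DOCS = [
    "The API allows 100 requests per minute per key.",
    "Exceeding the rate limit returns a 429 status code.",
    "Release notes describe new browser features.",
]

def embed(text: str) -> Counter:
    """Toy 'embedding': lowercase word counts (a real system uses a dense model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def retrieve(query: str, k: int = 2) -> list[str]:
    """Return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(DOCS, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def assemble_context(chunks: list[str]) -> str:
    """Number the retrieved chunks and join them into one context string."""
    return "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(chunks))

top = retrieve("What is the rate limit?", k=2)
print(assemble_context(top))
```

Word overlap is a crude stand-in for semantic similarity, but the control flow (score every chunk against the query, keep the top k, format them for the prompt) has the same shape as the pipeline built in the following sections.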

Document Ingestion Pipeline

Load the Markdown files, split them into overlapping chunks, embed each chunk, and persist the vectors in Chroma so they can be retrieved later.

from langchain_community.document_loaders import DirectoryLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Load documents
loader = DirectoryLoader(
    "./docs/",
    glob="**/*.md",
    show_progress=True
)
documents = loader.load()
print(f"Loaded {len(documents)} documents")

# Split into chunks
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=100,
    separators=["\n## ", "\n### ", "\n", ". ", " "]
)
chunks = text_splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks")

# Create vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
vectorstore = Chroma.from_documents(
    documents=chunks,
    embedding=embeddings,
    persist_directory="./chroma_db"
)
# _collection is a private Chroma attribute; it is used here only as a quick sanity check
print(f"Vector store created with {vectorstore._collection.count()} vectors")

Expected output:

Loaded 12 documents
Created 184 chunks
Vector store created with 184 vectors
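
To see what `chunk_size` and `chunk_overlap` do, here is a simplified fixed-window chunker in plain Python. It is a sketch under an explicit assumption: it slices by character position only, whereas `RecursiveCharacterTextSplitter` also tries to break on the configured separators. The function name `chunk_text` is hypothetical.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Slice text into windows of chunk_size that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks

sample = "abcdefghijklmnopqrstuvwxyz"
pieces = chunk_text(sample, chunk_size=10, chunk_overlap=4)
print(pieces)
# ['abcdefghij', 'ghijklmnop', 'mnopqrstuv', 'stuvwxyz']
```

Each chunk repeats the last four characters of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk. Larger overlap improves recall at the cost of more vectors to store and search.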

Building the RAG Chain with Memory

Combine retrieval, context assembly, and LLM generation with conversation history.

from langchain.chains import create_retrieval_chain
from langchain.chains.history_aware_retriever import (
    create_history_aware_retriever
)
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(model="gpt-4o", temperature=0.3)

# Prompt to rephrase query with history context
contextualize_prompt = ChatPromptTemplate.from_messages([
    ("system", "Given chat history and a user question, rephrase the question to be standalone."),
    ("placeholder", "{chat_history}"),
    ("human", "{input}")
])

history_aware_retriever = create_history_aware_retriever(
    llm, vectorstore.as_retriever(
        search_kwargs={"k": 4}
    ), contextualize_prompt
)

# Prompt to answer with sources
answer_prompt = ChatPromptTemplate.from_messages([
    ("system", "Answer based on the context. Cite sources. Say 'I cannot find this' if unsure.\n\n{context}"),
    ("placeholder", "{chat_history}"),
    ("human", "{input}")
])

document_chain = create_stuff_documents_chain(llm, answer_prompt)
rag_chain = create_retrieval_chain(
    history_aware_retriever, document_chain
)
print("RAG chain initialized successfully")

Expected output:

RAG chain initialized successfully
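
The control flow of the history-aware step can be sketched in plain Python: with no history the question goes to the retriever unchanged, otherwise it is rewritten first. This is an illustration, not LangChain's implementation. `history_aware_query` and `stub_rewrite` are hypothetical names, and the stub replaces the LLM call with a naive string join.

```python
from typing import Callable

History = list[tuple[str, str]]  # (role, text) pairs, oldest first

def history_aware_query(
    question: str,
    chat_history: History,
    rewrite: Callable[[str, History], str],
) -> str:
    """Return the query to send to the retriever."""
    if not chat_history:
        return question  # first turn: nothing to resolve against
    return rewrite(question, chat_history)

def stub_rewrite(question: str, chat_history: History) -> str:
    """Stand-in for the LLM rewrite: append the most recent user question."""
    last_human = next(
        (text for role, text in reversed(chat_history) if role == "human"), ""
    )
    return f"{question} (in the context of: {last_human})"

history = [
    ("human", "What are the API rate limits?"),
    ("ai", "100 requests per minute."),
]
print(history_aware_query("What happens if I exceed them?", [], stub_rewrite))
print(history_aware_query("What happens if I exceed them?", history, stub_rewrite))
```

The second call produces a query that mentions rate limits, which is what lets the vector search find the right chunk for an otherwise ambiguous "them".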

Running the Chatbot

Implement a conversational loop with session management.

from langchain_community.chat_message_histories import ChatMessageHistory

sessions = {}

def chat(session_id: str, message: str) -> dict:
    if session_id not in sessions:
        sessions[session_id] = ChatMessageHistory()

    history = sessions[session_id]
    result = rag_chain.invoke({
        "input": message,
        "chat_history": history.messages
    })

    history.add_user_message(message)
    history.add_ai_message(result["answer"])

    # Deduplicate source paths and sort them for a stable order
    sources = sorted({
        doc.metadata.get("source", "unknown")
        for doc in result["context"]
    })

    return {
        "answer": result["answer"],
        "sources": sources,
        "session_length": len(history.messages) // 2
    }

# Test conversation
response1 = chat("user-1", "What are the API rate limits?")
print("Q: What are the API rate limits?")
print(f"A: {response1['answer'][:150]}...")
print(f"Sources: {response1['sources']}")

response2 = chat("user-1", "What happens if I exceed them?")
print("\nQ: What happens if I exceed them?")
print(f"A: {response2['answer'][:150]}...")

Expected output:

Q: What are the API rate limits?
A: The API allows 100 requests per minute per authentication key. Rate limits reset every 60 seconds from the first request...
Sources: ['docs/api-reference.md']

Q: What happens if I exceed them?
A: Exceeding the rate limit returns a 429 Too Many Requests status. The response includes a Retry-After header indicating when to retry...
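
A plain in-memory session dict keeps every message indefinitely, so long conversations keep enlarging the prompt. One possible mitigation, sketched here in plain Python, is to keep only the most recent turns per session. `SessionStore` and `max_turns` are hypothetical names, and the `(role, text)` tuples are a stand-in for real message objects.

```python
from collections import deque

class SessionStore:
    """Per-session history that keeps only the last max_turns exchanges."""

    def __init__(self, max_turns: int = 3) -> None:
        self._max_messages = max_turns * 2  # one human + one AI message per turn
        self._sessions: dict[str, deque] = {}

    def add_turn(self, session_id: str, user_message: str, ai_message: str) -> None:
        """Append one exchange; the deque drops the oldest messages when full."""
        history = self._sessions.setdefault(
            session_id, deque(maxlen=self._max_messages)
        )
        history.append(("human", user_message))
        history.append(("ai", ai_message))

    def messages(self, session_id: str) -> list[tuple[str, str]]:
        """Return the retained messages, oldest first (empty for unknown sessions)."""
        return list(self._sessions.get(session_id, ()))

store = SessionStore(max_turns=2)
for i in range(1, 4):
    store.add_turn("user-1", f"question {i}", f"answer {i}")
print(store.messages("user-1"))
```

After three turns with `max_turns=2`, only turns 2 and 3 remain. Dropping old turns is the simplest policy; summarizing them instead is an alternative when early context matters.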

Adding Guardrails

Prevent off-topic queries and enforce response boundaries.

from pydantic import BaseModel, Field

class GuardrailResult(BaseModel):
    is_allowed: bool = Field(description="Whether the query is allowed")
    reason: str = Field(description="Explanation of the decision")

guardrail_prompt = ChatPromptTemplate.from_messages([
    ("system", """Determine if the query is relevant to the chatbot's
documented knowledge base. Block queries about:
- Personal advice, medical, legal, or financial decisions
- Instructions for harmful or illegal activities
- Queries completely unrelated to the documentation"""),
    ("human", "{query}")
])

guardrail_llm = llm.with_structured_output(GuardrailResult)

def guarded_chat(session_id: str, message: str) -> dict:
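    # format_messages keeps the system/human roles; format() would flatten them into one string
    check = guardrail_llm.invoke(
        guardrail_prompt.format_messages(query=message)
    )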

    if not check.is_allowed:
        return {
            "answer": f"I cannot answer that question. {check.reason}",
            "sources": [],
            "blocked": True
        }

    return chat(session_id, message)

response = guarded_chat("user-1", "How do I hack a website?")
print(response["answer"])
print(f"Blocked: {response.get('blocked', False)}")

Expected output:

I cannot answer that question. This query is about unauthorized access, which is outside the scope of the documentation.
Blocked: True
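
A guardrail that is too strict blocks legitimate questions, so it helps to score it against a small labelled set whenever the prompt changes. The harness below is a sketch: `CASES`, `stub_guardrail`, and `evaluate` are hypothetical names, and the keyword stub stands in for the LLM-based check so the example runs offline.

```python
from typing import Callable

# Hypothetical labelled examples: (query, should_be_allowed)
CASES = [
    ("What are the API rate limits?", True),
    ("How do I reset my sync password?", True),
    ("How do I hack a website?", False),
    ("Should I buy this stock?", False),
]

def stub_guardrail(query: str) -> bool:
    """Stand-in for the LLM guardrail: allow unless a blocked keyword appears."""
    blocked = ("hack", "stock", "password")
    return not any(word in query.lower() for word in blocked)

def evaluate(check: Callable[[str], bool], cases: list[tuple[str, bool]]) -> dict:
    """Collect queries the guardrail got wrong in each direction."""
    false_blocks = [q for q, allowed in cases if allowed and not check(q)]
    false_allows = [q for q, allowed in cases if not allowed and check(q)]
    errors = len(false_blocks) + len(false_allows)
    return {
        "false_blocks": false_blocks,
        "false_allows": false_allows,
        "accuracy": 1 - errors / len(cases),
    }

report = evaluate(stub_guardrail, CASES)
print(report)
```

Here the stub wrongly blocks the password-reset question, which is the over-aggressive failure mode. To evaluate the real guardrail, pass a small wrapper that calls the structured-output LLM and returns its `is_allowed` field.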

Common Errors

Error	Cause	Fix
Bot forgets previous messages	Chat history not passed to the chain	Use create_history_aware_retriever with session memory
Retrieved documents are irrelevant	Poor embedding or chunking strategy	Increase chunk overlap and try hybrid search
Bot refuses to answer in-scope questions	Guardrails too aggressive	Tune the guardrail prompt with domain-specific examples
High latency on every query	Vector search on full corpus each time	Add caching for repeated queries, for example LangChain's `set_llm_cache` with an in-memory or Redis-backed cache
Source citations link to wrong chunks	Metadata not preserved during chunking	Copy document metadata to each chunk's metadata dict
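
For the latency row, one possible shape of a cache for repeated queries is sketched below in plain Python. `QueryCache` and `slow_answer` are hypothetical names, and the stub stands in for the full RAG call. It assumes the cached questions are standalone: a follow-up whose meaning depends on chat history should bypass the cache.

```python
import time
from typing import Callable

class QueryCache:
    """Tiny TTL cache keyed on a normalized query string."""

    def __init__(self, ttl_seconds: float = 300.0) -> None:
        self._ttl = ttl_seconds
        self._entries: dict[str, tuple[float, str]] = {}

    @staticmethod
    def _key(query: str) -> str:
        # Lowercase and collapse whitespace so trivial variants share an entry
        return " ".join(query.lower().split())

    def get_or_compute(self, query: str, compute: Callable[[str], str]) -> str:
        """Return a fresh cached answer, or compute and store a new one."""
        key = self._key(query)
        now = time.monotonic()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]
        value = compute(query)
        self._entries[key] = (now, value)
        return value

calls = []

def slow_answer(query: str) -> str:
    """Stand-in for the expensive chain invocation; records each real call."""
    calls.append(query)
    return f"answer to: {query}"

cache = QueryCache(ttl_seconds=60)
first = cache.get_or_compute("What are the API rate limits?", slow_answer)
second = cache.get_or_compute("  what are the api rate limits? ", slow_answer)
print(len(calls))
```

The second lookup differs only in case and spacing, so it is served from the cache and `slow_answer` runs once. The TTL bounds how long an answer can outlive a change to the underlying documents.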

Practice Questions

Why does a RAG chatbot need a history-aware retriever? Without it, follow-up questions like "what about the limit?" lose context. The history-aware retriever rewrites queries with conversation context.
How does session memory differ from fine-tuning? Session memory stores conversation history in-memory for the current session; fine-tuning permanently alters the model's weights.
What is the purpose of the guardrail layer in a chatbot? Guardrails filter out-of-scope or harmful queries before they reach the RAG chain, preventing unsafe or off-topic responses.
Why should sources be cited in RAG chatbot responses? Source citations build trust, allow users to verify information, and demonstrate that the answer is grounded in retrieved documents.
Challenge: Build a multi-tenant RAG chatbot where each organization has its own isolated vector store collection, session management, and custom system prompt — all served through a single API endpoint.

Mini Project

Build a customer support chatbot for an e-commerce site. Index product descriptions, return policies, and shipping FAQs into Chroma. Implement session-based conversation memory, a guardrail that blocks pricing discussions (routed to a pricing API instead), and a feedback mechanism where users thumbs-up or thumbs-down each answer to improve retrieval over time.

Built by the developers of Doda Browser, DodaZIP, and Durga Antivirus Pro.
