Building a RAG Pipeline: Step‑by‑Step Tutorial

The most practical way to ground LLMs in your own data – with code, explanations, and production tips.

📖 Plain‑English summary: RAG (Retrieval‑Augmented Generation) is a technique that lets an LLM answer questions about your private documents without retraining. It works by: (1) chopping your documents into small chunks, (2) converting each chunk into a “vector” (a mathematical fingerprint), (3) storing those vectors in a database, and (4) when a user asks a question, finding the most similar chunks and feeding them to the LLM as context. Grounding answers in retrieved text greatly reduces hallucinations about your data, but it does not eliminate them: the model can still misread a chunk or answer from a poorly matched one. This tutorial walks you through building a working RAG prototype from scratch, then covers the enhancements needed to take it to production.

🧩 The Four Components of RAG

Document store – Your private files (PDFs, Word docs, Confluence pages, Slack exports).
Embedding model – Converts text into vectors (arrays of numbers). Examples: OpenAI text‑embedding‑3‑small, all‑MiniLM‑L6‑v2 (free).
Vector database – Stores embeddings and supports similarity search. Options: Chroma (lightweight), Pinecone (cloud), Qdrant (open source).
LLM – The generator. Takes the user’s query + retrieved chunks + a prompt and returns a grounded answer.

Simple analogy: Imagine you have a huge library (documents). You write a summary card for every paragraph (embedding). When someone asks a question, you find the most relevant cards (retrieval), then hand those full paragraphs to a brilliant librarian (LLM) who reads them and gives a concise answer. That’s RAG.

🛠️ Step‑by‑Step Implementation (Python)

We’ll build a RAG pipeline that can answer questions from a folder of PDFs. We’ll use LlamaIndex (high‑level orchestration), OpenAI for embeddings and LLM, and Chroma as the vector DB. The code runs after pip install once the OPENAI_API_KEY environment variable is set, because both the embedding and the LLM calls go to the OpenAI API.

Step 0: Install dependencies

pip install llama-index chromadb llama-index-vector-stores-chroma llama-index-llms-openai llama-index-embeddings-openai pypdf

Step 1: Load and chunk your documents

from llama_index.core import SimpleDirectoryReader
from llama_index.core.node_parser import SentenceSplitter

# Load all PDFs from a folder
documents = SimpleDirectoryReader("./my_docs").load_data()

# Split into chunks of ~512 tokens with 50 token overlap
parser = SentenceSplitter(chunk_size=512, chunk_overlap=50)
nodes = parser.get_nodes_from_documents(documents)

print(f"Created {len(nodes)} chunks")

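Why chunk? LLMs have limited context windows, and a single embedding of a long document blurs all of its topics together. If you feed a 100‑page document directly, it may not fit at all, and even when it does the model tends to overlook details buried in the middle. By chunking, you retrieve only the most relevant paragraphs.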
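To make the overlap idea concrete, here is a minimal sketch of sliding‑window chunking. It is a simplification, not what SentenceSplitter does internally: it counts whitespace‑separated words instead of tokens and ignores sentence boundaries. The function name chunk_words is hypothetical.

```python
def chunk_words(text: str, chunk_size: int = 512, overlap: int = 50) -> list[str]:
    """Split text into windows of `chunk_size` words that share `overlap` words."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # this window already reached the end of the text
    return chunks


# Tiny demo: 10 words, windows of 4 words, 1 word shared between neighbours.
sample = " ".join(f"w{i}" for i in range(10))
chunks = chunk_words(sample, chunk_size=4, overlap=1)
print(chunks)
```

Each chunk repeats the last word of the previous one, so a sentence that straddles a boundary still appears whole in at least one chunk when the overlap is large enough.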

Step 2: Create embeddings and store in a vector database

from llama_index.vector_stores.chroma import ChromaVectorStore
from llama_index.core import StorageContext
import chromadb

# Initialize Chroma client (in‑memory for prototyping)
chroma_client = chromadb.EphemeralClient()
# get_or_create_collection avoids an error when the cell is re-run in a notebook
chroma_collection = chroma_client.get_or_create_collection("my_knowledge")

# Wrap Chroma collection as a LlamaIndex vector store
vector_store = ChromaVectorStore(chroma_collection=chroma_collection)
storage_context = StorageContext.from_defaults(vector_store=vector_store)

# Create an index that will embed each node and store it
from llama_index.core import VectorStoreIndex
index = VectorStoreIndex(nodes, storage_context=storage_context)

What’s happening? LlamaIndex takes each chunk, sends it to the embedding model (OpenAI by default), gets a vector (e.g., 1536 numbers), and stores that vector in Chroma along with the original text. Later, we can search by vector similarity.
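If the vector store feels like a black box, the following sketch may help. It is a toy in‑memory store that keeps each chunk's vector next to its text and ranks by brute‑force cosine similarity. The class name TinyVectorStore, the three‑dimensional vectors, and the sample texts are all made up for illustration; Chroma uses an approximate index and real embeddings have hundreds or thousands of dimensions.

```python
import math
from dataclasses import dataclass, field


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class TinyVectorStore:
    """Keeps (id, vector, text) records and ranks them by brute force."""
    records: list[tuple[str, list[float], str]] = field(default_factory=list)

    def add(self, doc_id: str, vector: list[float], text: str) -> None:
        self.records.append((doc_id, vector, text))

    def query(self, vector: list[float], k: int = 2) -> list[tuple[str, float]]:
        scored = [(doc_id, cosine(vector, vec)) for doc_id, vec, _ in self.records]
        return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


store = TinyVectorStore()
# Hand-written 3-dimensional vectors stand in for real embeddings.
store.add("vacation", [0.9, 0.1, 0.0], "Toy text about vacation days.")
store.add("expenses", [0.1, 0.9, 0.0], "Toy text about expense reports.")
store.add("security", [0.0, 0.1, 0.9], "Toy text about passwords.")

top = store.query([0.8, 0.2, 0.0], k=2)
print(top)
```

The query vector points mostly in the same direction as the "vacation" record, so that record ranks first. A real vector database does the same job with an index that avoids comparing against every record.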

Step 3: Query the index (retrieve + generate)

# Create a query engine
query_engine = index.as_query_engine()

# Ask a question
response = query_engine.query("What is the vacation policy for full‑time employees?")
print(response)

The query engine does this internally:

Converts your question into an embedding vector using the same embedding model.
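Finds the top‑k most similar chunks in Chroma (LlamaIndex retrieves 2 by default). Chroma ranks by squared L2 distance unless the collection is configured for cosine; for unit‑length embeddings such as OpenAI’s, both give the same ordering.
Builds a prompt along the lines of: “Answer the question based only on the following context: [chunk1] [chunk2] … Question: … Answer:”
Sends that prompt to the LLM (an OpenAI chat model by default; which one depends on your LlamaIndex version, so set Settings.llm explicitly if you want to pin it).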
Returns the LLM’s answer.
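The steps above can be sketched in plain Python. This is an illustration of the flow, not LlamaIndex's actual implementation: the function names are hypothetical, and the embed, search, and generate callables are stubs so the example runs without an API key.

```python
from typing import Callable, Sequence


def build_prompt(question: str, chunks: Sequence[str]) -> str:
    """Concatenate retrieved chunks and the question into one grounded prompt."""
    context = "\n\n".join(chunks)
    return (
        "Answer the question based only on the following context:\n"
        f"{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer(
    question: str,
    embed: Callable[[str], list[float]],
    search: Callable[[list[float], int], list[str]],
    generate: Callable[[str], str],
    top_k: int = 2,
) -> str:
    """Run embed -> retrieve -> build prompt -> generate."""
    query_vector = embed(question)
    chunks = search(query_vector, top_k)
    return generate(build_prompt(question, chunks))


# Stubs that stand in for the embedding model, the vector DB, and the LLM.
def fake_embed(text: str) -> list[float]:
    return [float(len(text))]


def fake_search(vector: list[float], k: int) -> list[str]:
    return ["Chunk A", "Chunk B", "Chunk C"][:k]


def fake_generate(prompt: str) -> str:
    return f"(model saw {len(prompt.splitlines())} prompt lines)"


prompt = build_prompt("What is the vacation policy?", ["Chunk A", "Chunk B"])
print(prompt)
result = answer("What is the vacation policy?", fake_embed, fake_search, fake_generate)
print(result)
```

Passing the three stages in as callables keeps the orchestration testable: you can swap a stub for a real client one stage at a time.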

Step 4: Advanced – customizing the prompt and retrieval

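from llama_index.core import PromptTemplate

# Retrieval: fetch 5 chunks per query instead of LlamaIndex's default of 2
query_engine = index.as_query_engine(similarity_top_k=5)

# Prompt: replace the default question-answering template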
custom_prompt = PromptTemplate(
    "You are a helpful HR assistant. Use only the following context to answer the question.\n"
    "Context: {context_str}\n"
    "Question: {query_str}\n"
    "Answer in bullet points:"
)

query_engine.update_prompts({"response_synthesizer:text_qa_template": custom_prompt})

🚀 Production‑Ready Enhancements

The above works for a prototype. For production, add these:

Hybrid search: Combine vector similarity with keyword (BM25) to catch proper names and exact phrases. Many vector DBs support this natively.
Re‑ranking: Retrieve 20 chunks, then use a cross‑encoder (e.g., cross‑encoder/ms‑marco‑MiniLM‑L‑6‑v2) to re‑rank them by relevance to the query, and keep only the top 3–5 for the prompt.
Chat history: Instead of re‑embedding the whole conversation, store past turns and retrieve chunks based on the last user query + summarised history.
Observability: Log which chunks were retrieved and their similarity scores. This helps debug why an answer was wrong.
Caching: Cache embeddings for common queries to reduce API costs.
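For hybrid search, one common way to merge a vector ranking with a keyword ranking is reciprocal rank fusion (RRF). The tutorial does not prescribe a fusion method, so treat this as one reasonable option rather than the required one. The chunk IDs are made up, and k=60 is the conventional constant rather than a tuned value.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists; each list adds 1 / (k + rank) to an ID's score."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


vector_hits = ["c1", "c2", "c3"]   # hypothetical result of vector search
keyword_hits = ["c3", "c1", "c4"]  # hypothetical result of BM25 search
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

RRF only needs ranks, not raw scores, which is convenient because cosine similarities and BM25 scores live on different scales. Chunks that appear in both lists rise to the top.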

📊 Cost Breakdown (Using OpenAI)

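For 10,000 documents of about 10 pages each (assuming ~500 tokens per page, so ~5,000 tokens or roughly 10 chunks of 512 tokens per document):

Embedding (one‑time): 5,000 * 10,000 = 50M tokens. At $0.02 per 1M tokens for text‑embedding‑3‑small (OpenAI's list price when the model launched; check current pricing), that is about $1.00.
Storage: Chroma (free self‑hosted) or Pinecone (has a free tier; check its current limits against your ~100k vectors).
Per query: ~1,500 input tokens (question + retrieved chunks) + 300 output tokens. Assuming ~$0.003 per query (the exact figure depends on the model and current pricing), 10,000 queries/day comes to about $30/day.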
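A small calculator makes it easy to redo this estimate with your own numbers. The inputs below are assumptions, not measurements: roughly 5,000 tokens per 10‑page document, $0.02 per 1M embedding tokens, and $0.003 per query. The function names are hypothetical.

```python
def embedding_cost(num_docs: int, tokens_per_doc: int, usd_per_million: float) -> float:
    """One-time cost of embedding the whole corpus, in USD."""
    return num_docs * tokens_per_doc / 1_000_000 * usd_per_million


def daily_query_cost(queries_per_day: int, usd_per_query: float) -> float:
    """Recurring LLM cost per day, in USD."""
    return queries_per_day * usd_per_query


# Assumed inputs; replace them with your corpus size and current prices.
one_time = embedding_cost(num_docs=10_000, tokens_per_doc=5_000, usd_per_million=0.02)
per_day = daily_query_cost(queries_per_day=10_000, usd_per_query=0.003)
print(f"Embedding (one-time): ${one_time:.2f}")
print(f"Queries: ${per_day:.2f}/day")
```

Whatever the exact prices, the shape of the result tends to hold: embedding is a small one‑time cost, and per‑query LLM usage dominates the bill.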

Tip: A smaller LLM (e.g., GPT‑3.5‑turbo or Llama 3 8B) is often enough for RAG, because the retrieved context does most of the work. Evaluate it on your own questions before reaching for a larger model such as GPT‑4.

🧪 Try it without OpenAI (fully open source)
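Replace embeddings with BAAI/bge‑base‑en (via llama-index-embeddings-huggingface) and the LLM with Llama 3 8B running locally through llama-cpp-python (via llama-index-llms-llama-cpp). The rest of the pipeline stays the same, but you must re‑embed your documents into a fresh collection, because vectors from different embedding models are not comparable. This removes API fees and keeps your data on your own hardware; you still pay for the compute.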

🐞 Common Pitfalls and Solutions

Retrieved chunks are irrelevant: Experiment with chunk size (for example 1024 tokens if answers need more surrounding context, or 256 if chunks mix several topics), or use a better embedding model (Voyage AI, Cohere). Also try hybrid search.
LLM still hallucinates: Strengthen your prompt with “ONLY use the context. If the answer is not in the context, say ‘I don’t know’.”
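Slow retrieval: Use an approximate nearest neighbor index such as HNSW, which most vector DBs support (Chroma and Qdrant use it by default). If generation rather than search is the slow part, reduce the number of chunks sent to the LLM per query.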
Outdated documents: Implement a refresh pipeline that re‑embeds documents when they change.
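For the refresh pipeline, a simple approach is to store a content hash per document and re‑embed only when the hash changes. The sketch below shows just the change detection; the function names and file names are hypothetical, and the actual re‑embedding and deletion calls depend on your vector store.

```python
import hashlib


def fingerprint(text: str) -> str:
    """Stable content hash used to detect edits."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def plan_refresh(
    current_docs: dict[str, str], indexed_hashes: dict[str, str]
) -> dict[str, list[str]]:
    """Compare current documents with stored hashes and list what must change."""
    to_embed = sorted(
        doc_id
        for doc_id, text in current_docs.items()
        if indexed_hashes.get(doc_id) != fingerprint(text)  # new or edited
    )
    to_delete = sorted(
        doc_id for doc_id in indexed_hashes if doc_id not in current_docs
    )
    return {"embed": to_embed, "delete": to_delete}


# Hashes recorded at the last indexing run (toy data).
indexed = {
    "policy.pdf": fingerprint("old policy"),
    "faq.pdf": fingerprint("faq"),
    "gone.pdf": fingerprint("x"),
}
# Documents as they exist now: one edited, one unchanged, one new, one removed.
current = {"policy.pdf": "new policy", "faq.pdf": "faq", "new.pdf": "hello"}

plan = plan_refresh(current, indexed)
print(plan)
```

Unchanged documents are skipped entirely, so a nightly refresh only pays embedding costs for what actually changed. Remember to delete the old chunks of an edited document before inserting the new ones.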

RAG is one of the most valuable patterns for enterprise AI. Once you are comfortable with it, you can prototype a company‑specific Q&A system in an afternoon. Try it with your own HR policies, product documentation, or codebase.