Introduction
Retrieval-Augmented Generation (RAG) is one of the most practical ways to give LLMs access to your proprietary data without expensive fine-tuning. In this tutorial, we will build a production-grade RAG application using LangChain for orchestration and Pinecone as the vector database.
Our application will:
- Ingest documents from multiple sources (PDF, web, markdown)
- Chunk and embed them with best practices
- Store vectors in Pinecone for fast retrieval
- Generate accurate, grounded answers using Claude or GPT
Prerequisites
- Python 3.10+
- Pinecone account (free tier works)
- OpenAI or Anthropic API key
- Basic understanding of embeddings and vector databases
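If embeddings are new to you, the core idea is this: an embedding model maps text to a vector, and retrieval ranks documents by vector similarity to the query. A minimal sketch with made-up 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": similar texts get vectors pointing in similar directions
query = [0.9, 0.1, 0.0]
doc_about_refunds = [0.8, 0.2, 0.1]
doc_about_shipping = [0.0, 0.3, 0.9]

print(cosine_similarity(query, doc_about_refunds))   # high (~0.98): likely relevant
print(cosine_similarity(query, doc_about_shipping))  # low (~0.03): likely irrelevant
```

This is exactly the comparison Pinecone performs at scale when we set `metric="cosine"` later on.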
Step 1: Project Setup
```shell
mkdir rag-app && cd rag-app
python -m venv venv
source venv/bin/activate
pip install langchain langchain-community langchain-openai \
    langchain-pinecone langchain-cohere pinecone-client \
    rank_bm25 beautifulsoup4 pypdf tiktoken python-dotenv
```

Note the extras beyond the core LangChain packages: `langchain-pinecone` and `langchain-cohere` provide the vector store and reranker integrations used below, `rank_bm25` backs the BM25 retriever, and `beautifulsoup4` is needed by the web loader.

Create a `.env` file:
```
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=pc-...
COHERE_API_KEY=...
PINECONE_INDEX=rag-production
```

(The Cohere key is only needed for the optional reranking step.)

Step 2: Document Ingestion
```python
from langchain_community.document_loaders import (
    DirectoryLoader,
    PyPDFLoader,
    TextLoader,
    WebBaseLoader,
)

# Load PDFs
pdf_loader = DirectoryLoader("./docs/", glob="**/*.pdf", loader_cls=PyPDFLoader)
pdf_docs = pdf_loader.load()

# Load markdown files (TextLoader treats them as plain text)
md_loader = DirectoryLoader("./docs/", glob="**/*.md", loader_cls=TextLoader)
md_docs = md_loader.load()

# Load web pages
web_loader = WebBaseLoader([
    "https://example.com/docs/page1",
    "https://example.com/docs/page2",
])
web_docs = web_loader.load()

all_docs = pdf_docs + md_docs + web_docs
print(f"Loaded {len(all_docs)} documents")
```

Step 3: Smart Chunking
The chunking strategy is critical for RAG quality. A recursive splitter that falls back through progressively smaller natural boundaries (paragraphs, then sentences, then words) is a strong default:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = text_splitter.split_documents(all_docs)
print(f"Created {len(chunks)} chunks")
```

Chunking Best Practices
- Chunk size: 500-1500 tokens works well for most use cases (note that with `length_function=len` the splitter above measures characters, not tokens; pass a tokenizer-based length function to measure in tokens)
- Overlap: 10-20% overlap prevents context loss at boundaries
- Separators: Prioritize natural boundaries (paragraphs > sentences > words)
- Metadata: Preserve source information for citation
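To see why overlap matters, here is a stripped-down sliding-window chunker in plain Python. The real splitter above is much smarter about boundaries; this only illustrates the size/overlap mechanics:

```python
def sliding_chunks(text: str, chunk_size: int = 20, overlap: int = 4) -> list[str]:
    """Naive character-based chunking: each chunk repeats the tail of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance by less than a full chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "The refund window is 30 days from delivery."
for chunk in sliding_chunks(text):
    print(repr(chunk))
```

Without the repeated region, a sentence cut at a chunk boundary would be unrecoverable from either side; the overlap guarantees every boundary appears whole in at least one chunk.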
Step 4: Embedding and Storage
```python
import os

from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone, ServerlessSpec

load_dotenv()  # loads the API keys from .env into the environment

# Initialize embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Initialize Pinecone (reads PINECONE_API_KEY from the environment)
pc = Pinecone()

# Create index if it doesn't exist
index_name = os.environ.get("PINECONE_INDEX", "rag-production")
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=3072,  # text-embedding-3-large dimension
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

# Store vectors
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name=index_name,
)
```

Step 5: Build the Retrieval Chain
```python
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Custom prompt for grounded answers
prompt_template = """Use the following context to answer the question.
If you cannot find the answer in the context, say "I don't have enough information to answer this."
Always cite which document the information comes from.

Context: {context}

Question: {question}

Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"],
)

# Build the chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance for diversity
    search_kwargs={"k": 5, "fetch_k": 20},
)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True,
)

# Query
result = qa_chain.invoke({"query": "What is the refund policy?"})
print(result["result"])
print("Sources:", [doc.metadata["source"] for doc in result["source_documents"]])
```

Step 6: Advanced Retrieval Techniques
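Step 5 already used one such technique: the `mmr` search type. Maximal Marginal Relevance greedily picks documents that are relevant to the query but dissimilar to what has already been selected, so near-duplicate chunks don't crowd out other evidence. A toy sketch of the greedy loop over precomputed similarity scores (the real implementation works directly on embedding vectors):

```python
def mmr_select(query_sims, doc_sims, k, lambda_mult=0.5):
    """Greedy MMR: trade query relevance against redundancy with already-picked docs.

    query_sims[i]  = similarity(query, doc_i)
    doc_sims[i][j] = similarity(doc_i, doc_j)
    """
    selected = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; doc 2 is less relevant but distinct
query_sims = [0.95, 0.94, 0.70]
doc_sims = [
    [1.0, 0.99, 0.10],
    [0.99, 1.0, 0.15],
    [0.10, 0.15, 1.0],
]
print(mmr_select(query_sims, doc_sims, k=2))  # [0, 2]: skips the duplicate
```

With `lambda_mult=1.0` the redundancy term vanishes and MMR degenerates to plain top-k by relevance, which would return the two near-duplicates.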
Hybrid Search (BM25 + Vector)
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# BM25 for keyword matching (requires the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Combine with vector retrieval
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, retriever],
    weights=[0.4, 0.6],
)
```

Reranking for Better Precision
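The idea behind reranking: the first-stage retriever casts a wide, cheap net, and a stronger but slower model re-orders just those candidates. A toy sketch of the two-stage pipeline, with hypothetical scoring functions standing in for the retriever and the cross-encoder:

```python
def rerank(query, documents, cheap_score, strong_score, fetch_k=20, top_n=3):
    """Two-stage retrieval: cheap recall pass, then precise rerank of the survivors."""
    # Stage 1: rank everything with the cheap scorer, keep the top fetch_k
    candidates = sorted(documents, key=lambda d: cheap_score(query, d), reverse=True)[:fetch_k]
    # Stage 2: re-order only those candidates with the expensive scorer
    return sorted(candidates, key=lambda d: strong_score(query, d), reverse=True)[:top_n]

# Hypothetical scorers: word overlap (cheap) vs. an exact-phrase bonus
# standing in for a cross-encoder's deeper understanding
def cheap_score(query, doc):
    return len(set(query.split()) & set(doc.split()))

def strong_score(query, doc):
    return cheap_score(query, doc) + (10 if query in doc else 0)

docs = [
    "policy details about refund processing",   # keyword match, no exact phrase
    "our refund policy is 30 days",             # contains the exact phrase
    "shipping information",
]
print(rerank("refund policy", docs, cheap_score, strong_score, fetch_k=2, top_n=1))
```

In the application itself, a hosted cross-encoder performs that second stage: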
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Requires COHERE_API_KEY; recent langchain-cohere versions also
# require the model to be named explicitly
reranker = CohereRerank(model="rerank-english-v3.0", top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=ensemble_retriever,
)
```

Troubleshooting
- Low quality answers: Increase chunk overlap, try smaller chunk sizes, or add reranking
- Slow retrieval: Use approximate nearest neighbor search, reduce the number of retrieved documents
- Hallucinations: Lower the temperature, use a stricter prompt, or add a verification step
- Missing information: Increase k in the retriever, check that the source data is properly chunked
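For the hallucination point, a verification step can be as simple as checking that the answer is actually supported by the retrieved context before returning it. A crude token-overlap check is sketched below; a real implementation would use an LLM judge or an NLI model, and the threshold here is an arbitrary assumption:

```python
def is_grounded(answer: str, context: str, threshold: float = 0.6) -> bool:
    """Flag answers whose content words mostly don't appear in the retrieved context."""
    stopwords = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "from"}
    answer_words = {w.lower().strip(".,") for w in answer.split()} - stopwords
    if not answer_words:
        return True
    context_words = {w.lower().strip(".,") for w in context.split()}
    supported = len(answer_words & context_words) / len(answer_words)
    return supported >= threshold

context = "Refunds are accepted within 30 days of delivery."
print(is_grounded("Refunds are accepted within 30 days.", context))  # True
print(is_grounded("You can get a lifetime warranty.", context))      # False
```

Answers that fail the check can be regenerated, routed to a stricter prompt, or replaced with the "I don't have enough information" fallback.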
Conclusion
You have built a production-grade RAG application with proper chunking, hybrid retrieval, and reranking. This architecture scales well and can be extended with features like streaming responses, caching, and user feedback loops.
Key Takeaways
- Chunking strategy has the biggest impact on RAG quality
- Hybrid search (BM25 + vector) outperforms pure vector search
- Reranking significantly improves precision
- Always include source citations for trustworthiness