Introduction
Retrieval-Augmented Generation (RAG) is one of the most practical ways to give LLMs access to your proprietary data without expensive fine-tuning. In this tutorial, we will build a production-grade RAG application using LangChain for orchestration and Pinecone as the vector database.
Our application will:
- Ingest documents from multiple sources (PDF, web, markdown)
- Chunk and embed them with best practices
- Store vectors in Pinecone for fast retrieval
- Generate accurate, grounded answers using Claude or GPT
Prerequisites
- Python 3.10+
- Pinecone account (free tier works)
- OpenAI or Anthropic API key
- Basic understanding of embeddings and vector databases
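If embeddings are new to you, the core idea is this: an embedding model maps text to a vector, and retrieval ranks documents by vector similarity to the query. A minimal sketch with made-up 3-dimensional vectors (real embedding models produce hundreds or thousands of dimensions):

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy "embeddings": similar texts get vectors pointing in similar directions
query = [0.9, 0.1, 0.0]
doc_about_refunds = [0.8, 0.2, 0.1]
doc_about_shipping = [0.0, 0.3, 0.9]

print(cosine_similarity(query, doc_about_refunds))   # high (~0.98): likely relevant
print(cosine_similarity(query, doc_about_shipping))  # low (~0.03): likely irrelevant
```

This is exactly the comparison Pinecone performs at scale when we set `metric="cosine"` later on.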
Step 1: Project Setup
```shell
mkdir rag-app && cd rag-app
python -m venv venv
source venv/bin/activate
pip install langchain langchain-community langchain-openai \
    langchain-pinecone langchain-cohere pinecone-client \
    rank_bm25 beautifulsoup4 pypdf tiktoken python-dotenv
```

Note the extras beyond the core LangChain packages: `langchain-pinecone` and `langchain-cohere` provide the vector store and reranker integrations used below, `rank_bm25` backs the BM25 retriever, and `beautifulsoup4` is needed by the web loader.

Create a `.env` file:
```
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=pc-...
COHERE_API_KEY=...
PINECONE_INDEX=rag-production
```

(The Cohere key is only needed for the optional reranking step.)

Step 2: Document Ingestion
```python
from langchain_community.document_loaders import (
    DirectoryLoader,
    PyPDFLoader,
    TextLoader,
    WebBaseLoader,
)

# Load PDFs
pdf_loader = DirectoryLoader("./docs/", glob="**/*.pdf", loader_cls=PyPDFLoader)
pdf_docs = pdf_loader.load()

# Load markdown files (TextLoader treats them as plain text)
md_loader = DirectoryLoader("./docs/", glob="**/*.md", loader_cls=TextLoader)
md_docs = md_loader.load()

# Load web pages
web_loader = WebBaseLoader([
    "https://example.com/docs/page1",
    "https://example.com/docs/page2",
])
web_docs = web_loader.load()

all_docs = pdf_docs + md_docs + web_docs
print(f"Loaded {len(all_docs)} documents")
```

Step 3: Smart Chunking
The chunking strategy is critical for RAG quality. A recursive splitter that falls back through progressively smaller natural boundaries (paragraphs, then sentences, then words) is a strong default:
```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],
)
chunks = text_splitter.split_documents(all_docs)
print(f"Created {len(chunks)} chunks")
```

Chunking Best Practices
- Chunk size: 500-1500 tokens works well for most use cases (note that with `length_function=len` the splitter above measures characters, not tokens; pass a tokenizer-based length function to measure in tokens)
- Overlap: 10-20% overlap prevents context loss at boundaries
- Separators: Prioritize natural boundaries (paragraphs > sentences > words)
- Metadata: Preserve source information for citation
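To see why overlap matters, here is a stripped-down sliding-window chunker in plain Python. The real splitter above is much smarter about boundaries; this only illustrates the size/overlap mechanics:

```python
def sliding_chunks(text: str, chunk_size: int = 20, overlap: int = 4) -> list[str]:
    """Naive character-based chunking: each chunk repeats the tail of the previous one."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # advance by less than a full chunk
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

text = "The refund window is 30 days from delivery."
for chunk in sliding_chunks(text):
    print(repr(chunk))
```

Without the repeated region, a sentence cut at a chunk boundary would be unrecoverable from either side; the overlap guarantees every boundary appears whole in at least one chunk.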
Step 4: Embedding and Storage
```python
import os

from dotenv import load_dotenv
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from pinecone import Pinecone, ServerlessSpec

load_dotenv()  # loads the API keys from .env into the environment

# Initialize embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

# Initialize Pinecone (reads PINECONE_API_KEY from the environment)
pc = Pinecone()

# Create index if it doesn't exist
index_name = os.environ.get("PINECONE_INDEX", "rag-production")
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=3072,  # text-embedding-3-large dimension
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

# Store vectors
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name=index_name,
)
```

Step 5: Build the Retrieval Chain
```python
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Custom prompt for grounded answers
prompt_template = """Use the following context to answer the question.
If you cannot find the answer in the context, say "I don't have enough information to answer this."
Always cite which document the information comes from.

Context: {context}

Question: {question}

Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"],
)

# Build the chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximal Marginal Relevance for diversity
    search_kwargs={"k": 5, "fetch_k": 20},
)
qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True,
)

# Query
result = qa_chain.invoke({"query": "What is the refund policy?"})
print(result["result"])
print("Sources:", [doc.metadata["source"] for doc in result["source_documents"]])
```

Step 6: Advanced Retrieval Techniques
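Step 5 already used one such technique: the `mmr` search type. Maximal Marginal Relevance greedily picks documents that are relevant to the query but dissimilar to what has already been selected, so near-duplicate chunks don't crowd out other evidence. A toy sketch of the greedy loop over precomputed similarity scores (the real implementation works directly on embedding vectors):

```python
def mmr_select(query_sims, doc_sims, k, lambda_mult=0.5):
    """Greedy MMR: trade query relevance against redundancy with already-picked docs.

    query_sims[i]  = similarity(query, doc_i)
    doc_sims[i][j] = similarity(doc_i, doc_j)
    """
    selected = []
    candidates = list(range(len(query_sims)))
    while candidates and len(selected) < k:
        def mmr_score(i):
            redundancy = max((doc_sims[i][j] for j in selected), default=0.0)
            return lambda_mult * query_sims[i] - (1 - lambda_mult) * redundancy
        best = max(candidates, key=mmr_score)
        selected.append(best)
        candidates.remove(best)
    return selected

# Docs 0 and 1 are near-duplicates; doc 2 is less relevant but distinct
query_sims = [0.95, 0.94, 0.70]
doc_sims = [
    [1.0, 0.99, 0.10],
    [0.99, 1.0, 0.15],
    [0.10, 0.15, 1.0],
]
print(mmr_select(query_sims, doc_sims, k=2))  # [0, 2]: skips the duplicate
```

With `lambda_mult=1.0` the redundancy term vanishes and MMR degenerates to plain top-k by relevance, which would return the two near-duplicates.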
Hybrid Search (BM25 + Vector)
```python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# BM25 for keyword matching (requires the rank_bm25 package)
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Combine with vector retrieval
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, retriever],
    weights=[0.4, 0.6],
)
```

Reranking for Better Precision
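The idea behind reranking: the first-stage retriever casts a wide, cheap net, and a stronger but slower model re-orders just those candidates. A toy sketch of the two-stage pipeline, with hypothetical scoring functions standing in for the retriever and the cross-encoder:

```python
def rerank(query, documents, cheap_score, strong_score, fetch_k=20, top_n=3):
    """Two-stage retrieval: cheap recall pass, then precise rerank of the survivors."""
    # Stage 1: rank everything with the cheap scorer, keep the top fetch_k
    candidates = sorted(documents, key=lambda d: cheap_score(query, d), reverse=True)[:fetch_k]
    # Stage 2: re-order only those candidates with the expensive scorer
    return sorted(candidates, key=lambda d: strong_score(query, d), reverse=True)[:top_n]

# Hypothetical scorers: word overlap (cheap) vs. an exact-phrase bonus
# standing in for a cross-encoder's deeper understanding
def cheap_score(query, doc):
    return len(set(query.split()) & set(doc.split()))

def strong_score(query, doc):
    return cheap_score(query, doc) + (10 if query in doc else 0)

docs = [
    "policy details about refund processing",   # keyword match, no exact phrase
    "our refund policy is 30 days",             # contains the exact phrase
    "shipping information",
]
print(rerank("refund policy", docs, cheap_score, strong_score, fetch_k=2, top_n=1))
```

In the application itself, a hosted cross-encoder performs that second stage: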
```python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Requires COHERE_API_KEY; recent langchain-cohere versions also
# require the model to be named explicitly
reranker = CohereRerank(model="rerank-english-v3.0", top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=ensemble_retriever,
)
```

Troubleshooting
- Low quality answers: Increase chunk overlap, try smaller chunk sizes, or add reranking
- Slow retrieval: Use approximate nearest neighbor search, reduce the number of retrieved documents
- Hallucinations: Lower the temperature, use a stricter prompt, or add a verification step
- Missing information: Increase k in the retriever, check that the source data is properly chunked
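For the hallucination point, a verification step can be as simple as checking that the answer is actually supported by the retrieved context before returning it. A crude token-overlap check is sketched below; a real implementation would use an LLM judge or an NLI model, and the threshold here is an arbitrary assumption:

```python
def is_grounded(answer: str, context: str, threshold: float = 0.6) -> bool:
    """Flag answers whose content words mostly don't appear in the retrieved context."""
    stopwords = {"the", "a", "an", "is", "are", "of", "to", "in", "and", "from"}
    answer_words = {w.lower().strip(".,") for w in answer.split()} - stopwords
    if not answer_words:
        return True
    context_words = {w.lower().strip(".,") for w in context.split()}
    supported = len(answer_words & context_words) / len(answer_words)
    return supported >= threshold

context = "Refunds are accepted within 30 days of delivery."
print(is_grounded("Refunds are accepted within 30 days.", context))  # True
print(is_grounded("You can get a lifetime warranty.", context))      # False
```

Answers that fail the check can be regenerated, routed to a stricter prompt, or replaced with the "I don't have enough information" fallback.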
Conclusion
You have built a production-grade RAG application with proper chunking, hybrid retrieval, and reranking. This architecture scales well and can be extended with features like streaming responses, caching, and user feedback loops.
Key Takeaways
- Chunking strategy has the biggest impact on RAG quality
- Hybrid search (BM25 + vector) outperforms pure vector search
- Reranking significantly improves precision
- Always include source citations for trustworthiness