Tutorial · Advanced

Building a Production RAG Application with LangChain and Pinecone

Complete guide to building a Retrieval-Augmented Generation app with proper chunking, embedding, and retrieval strategies.

AI · Cloud · 2026-02-04 · 18 min read

Introduction

Retrieval-Augmented Generation (RAG) is often the most practical way to give LLMs access to your proprietary data without expensive fine-tuning. In this tutorial, we will build a production-grade RAG application using LangChain for orchestration and Pinecone as our vector database.

Our application will:

  • Ingest documents from multiple sources (PDF, web, markdown)
  • Chunk and embed them with best practices
  • Store vectors in Pinecone for fast retrieval
  • Generate accurate, grounded answers using Claude or GPT

Prerequisites

  • Python 3.10+
  • Pinecone account (free tier works)
  • OpenAI or Anthropic API key
  • Basic understanding of embeddings and vector databases

Step 1: Project Setup

bash
mkdir rag-app && cd rag-app
python -m venv venv
source venv/bin/activate

pip install langchain langchain-community langchain-openai \
  langchain-pinecone langchain-cohere pinecone rank_bm25 \
  pypdf tiktoken python-dotenv

Create a .env file:

code
OPENAI_API_KEY=sk-...
PINECONE_API_KEY=pc-...
PINECONE_INDEX=rag-production

Step 2: Document Ingestion

python
from langchain_community.document_loaders import (
    PyPDFLoader,
    WebBaseLoader,
    DirectoryLoader,
)

# Load PDFs
pdf_loader = DirectoryLoader("./docs/", glob="**/*.pdf", loader_cls=PyPDFLoader)
pdf_docs = pdf_loader.load()

# Load web pages
web_loader = WebBaseLoader([
    "https://example.com/docs/page1",
    "https://example.com/docs/page2",
])
web_docs = web_loader.load()

all_docs = pdf_docs + web_docs
print(f"Loaded {len(all_docs)} documents")

Step 3: Smart Chunking

The chunking strategy is critical for RAG quality. Recursive character splitting with natural-language separators is a solid default: it keeps paragraphs and sentences intact wherever possible:

python
from langchain.text_splitter import RecursiveCharacterTextSplitter

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],
)

chunks = text_splitter.split_documents(all_docs)
print(f"Created {len(chunks)} chunks")

Chunking Best Practices

  • Chunk size: 500-1500 tokens works well for most use cases (note that the splitter above uses len, which counts characters, not tokens — roughly 4 characters per token for English text)
  • Overlap: 10-20% overlap prevents context loss at boundaries
  • Separators: Prioritize natural boundaries (paragraphs > sentences > words)
  • Metadata: Preserve source information for citation

Step 4: Embedding and Storage

python
from langchain_openai import OpenAIEmbeddings
from pinecone import Pinecone, ServerlessSpec
from langchain_pinecone import PineconeVectorStore

# Initialize embeddings
embeddings = OpenAIEmbeddings(model="text-embedding-3-large")

import os

# Initialize Pinecone (reads PINECONE_API_KEY from the environment)
pc = Pinecone()

index_name = os.environ.get("PINECONE_INDEX", "rag-production")

# Create the index if it doesn't exist
if index_name not in pc.list_indexes().names():
    pc.create_index(
        name=index_name,
        dimension=3072,  # output dimension of text-embedding-3-large
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1"),
    )

# Embed the chunks and upsert them into the index
vectorstore = PineconeVectorStore.from_documents(
    documents=chunks,
    embedding=embeddings,
    index_name=index_name,
)

Step 5: Build the Retrieval Chain

python
from langchain_openai import ChatOpenAI
from langchain.chains import RetrievalQA
from langchain.prompts import PromptTemplate

# Custom prompt for grounded answers
prompt_template = """Use the following context to answer the question.
If you cannot find the answer in the context, say "I don't have enough information to answer this."
Always cite which document the information comes from.

Context: {context}

Question: {question}

Answer:"""

PROMPT = PromptTemplate(
    template=prompt_template,
    input_variables=["context", "question"]
)

# Build the chain
llm = ChatOpenAI(model="gpt-4o", temperature=0)
retriever = vectorstore.as_retriever(
    search_type="mmr",  # Maximum Marginal Relevance for diversity
    search_kwargs={"k": 5, "fetch_k": 20}
)

qa_chain = RetrievalQA.from_chain_type(
    llm=llm,
    chain_type="stuff",
    retriever=retriever,
    chain_type_kwargs={"prompt": PROMPT},
    return_source_documents=True,
)

# Query
result = qa_chain.invoke({"query": "What is the refund policy?"})
print(result["result"])
print("Sources:", [doc.metadata["source"] for doc in result["source_documents"]])

Step 6: Advanced Retrieval Techniques

Hybrid Search (BM25 + Vector)

python
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever

# BM25 for keyword matching
bm25_retriever = BM25Retriever.from_documents(chunks)
bm25_retriever.k = 5

# Combine with vector retrieval
ensemble_retriever = EnsembleRetriever(
    retrievers=[bm25_retriever, retriever],
    weights=[0.4, 0.6]
)
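Under the hood, EnsembleRetriever fuses the two ranked lists with weighted Reciprocal Rank Fusion (RRF), which is why the weights matter more than the raw scores. A pure-Python sketch of the idea (the document IDs are made up for illustration):

```python
def weighted_rrf(rankings, weights, c=60):
    """Fuse ranked lists of doc IDs with weighted Reciprocal Rank Fusion.

    rankings: list of ranked lists of document IDs
    weights:  one weight per ranking
    c:        smoothing constant (60 is a common default)
    """
    scores = {}
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking):
            # Higher-ranked docs (small rank) contribute more score
            scores[doc_id] = scores.get(doc_id, 0.0) + weight / (c + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

bm25_ranked = ["d3", "d1", "d7"]    # keyword hits
vector_ranked = ["d1", "d4", "d3"]  # embedding hits
fused = weighted_rrf([bm25_ranked, vector_ranked], weights=[0.4, 0.6])
# "d1" wins: it ranks highly in both lists
```

Documents that appear in both lists get credit from each, so the fused ranking favors results that keyword and semantic search agree on.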

Reranking for Better Precision

python
from langchain.retrievers import ContextualCompressionRetriever
from langchain_cohere import CohereRerank

# Requires COHERE_API_KEY in the environment
reranker = CohereRerank(model="rerank-english-v3.0", top_n=3)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=reranker,
    base_retriever=ensemble_retriever,
)

Troubleshooting

  • Low quality answers: Increase chunk overlap, try smaller chunk sizes, or add reranking
  • Slow retrieval: Use approximate nearest neighbor search, reduce the number of retrieved documents
  • Hallucinations: Lower the temperature, use a stricter prompt, or add a verification step
  • Missing information: Increase k in the retriever, check that the source data is properly chunked
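One cheap form of the verification step mentioned above is a lexical grounding check before returning an answer. This is a crude heuristic sketch, not a substitute for an LLM-based verifier — a low score is a signal to re-retrieve or flag the answer, not proof of hallucination:

```python
def grounding_score(answer: str, sources: list[str]) -> float:
    """Fraction of substantive answer words that appear in any source."""
    words = {w.lower().strip(".,:;!?") for w in answer.split()}
    words = {w for w in words if len(w) > 3}  # skip short stopword-ish tokens
    if not words:
        return 1.0
    source_text = " ".join(sources).lower()
    grounded = sum(1 for w in words if w in source_text)
    return grounded / len(words)

score = grounding_score(
    "Refunds are issued within 30 days of purchase.",
    ["Our policy: refunds are available within 30 days of purchase."],
)
# 0.8 — "issued" is the only substantive word unsupported by the source
```

In practice you would run this on result["result"] against the page_content of the returned source documents and only surface answers above a threshold.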

Conclusion

You have built a production-grade RAG application with proper chunking, hybrid retrieval, and reranking. This architecture scales well and can be extended with features like streaming responses, caching, and user feedback loops.

Key Takeaways

  • Chunking strategy has the biggest impact on RAG quality
  • Hybrid search (BM25 + vector) outperforms pure vector search
  • Reranking significantly improves precision
  • Always include source citations for trustworthiness
Tags: RAG · LangChain · Pinecone · Vector Database
