End-to-end Implementation of a RAG Pipeline using LangChain v0.3

Sougat Dey
13 min read · Oct 29, 2024


In the tech world, it's clear that Generative AI is the current hot topic. Staying ahead in this fiercely competitive job market means diving into GenAI and exploring its practical applications. For those aiming to keep their edge, getting acquainted with a handful of GenAI use cases is becoming increasingly essential.


Find the GitHub repo here. Connect with me on LinkedIn.

Retrieval-Augmented Generation, or RAG, is an AI framework that combines information retrieval systems with generative large language models (LLMs). I won’t spend much time on an introduction to RAG. Instead, I’ll explain the necessary concepts step-by-step as we progress and provide a detailed walkthrough of setting up an end-to-end RAG pipeline using Llama-3.1-8b-instant and HuggingFace embeddings.

Here’s a quick demo of the app —

demo

Note: Even though this is a beginner-friendly project, I strongly advise readers to familiarize themselves with the LangChain v0.3 documentation, as the framework is updated frequently and often introduces breaking changes.

Problem Statement

The idea is that we’ll take a few PDFs on Indian law and order and build a conversational web app where users enter a query and our application generates a response based ONLY on the contents of those PDFs.


Why do we need RAG?

Before we dive into how RAG works, we have to ask ourselves why we need RAG in the first place. The main problem with Large Language Models (LLMs) is that they are essentially probabilistic models: they predict one token at a time based on probability. These models use an autoregressive approach, taking the previous words as context and predicting the most likely next word. Crucially, those ‘next’ words are learned from the huge corpus the model was trained on. So, when you try to solve a very specific problem that is outside the scope of its training data, the model starts to hallucinate.

Now, it’s easy to deal with this issue. We just have to provide the LLM with the context required to solve our specific problem. It could be anything from a bunch of PDFs (e.g., return policies for an e-commerce platform if you intend to build a RAG-based Chatbot or AI Agent) to websites (i.e., scraping is required; beautifulsoup4 comes in handy in such cases) or all sorts of parsable files. But there’s an issue with the context window.

So, what is a context window? A context window refers to the amount of text data a language model can consider at one time when generating responses. It includes all the tokens from the input text that the model looks at to gather context before replying. The size of this window directly influences the model’s understanding and response accuracy.
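To make that concrete, here’s a tiny throwaway sketch (not part of the project) that counts roughly how many tokens a prompt would occupy. The tokenizer (tiktoken’s cl100k_base) and the 8,000-token budget are illustrative assumptions; Llama models ship their own tokenizer.

## illustrative only: counting tokens against a hypothetical context budget
import tiktoken

encoder = tiktoken.get_encoding("cl100k_base") ## <- generic tokenizer, used here just for counting
context_budget = 8_000 ## <- hypothetical context window size

prompt = "Summarise the key provisions of the Consumer Protection Act. " * 500
token_count = len(encoder.encode(prompt))

print(f"Prompt uses {token_count} tokens out of a {context_budget}-token window")
if token_count > context_budget:
    print("This prompt would overflow the model's context window.")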

source: https://s10251.pcdn.co/pdf/2023-Alan-D-Thompson-2023-Context-Windows-Rev-0.pdf

That’s why stuffing entire PDFs or DOCX files into the prompt in the hope of coherent responses is neither sensible nor scalable. This is where RAG comes into the picture.

How does RAG work?

RAG simply solves the issue of context window when we’re trying to solve a domain-specific problem as we previously discussed. Given that the ‘G’ in RAG stands for ‘Generative,’ which involves a large language model (LLM), RAG empowers us to create a dynamic conversational experience based on the context we provide.

At its core, RAG has two simple components (a minimal sketch follows the list) —

  1. Retriever: retrieves the documents relevant to the user query and injects them as context into the generator (LLM).
  2. Generator: after receiving the context from the retriever, it parses both the context and the user query and generates a viable response, enabling a conversational experience.
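To make those two roles concrete, here is a minimal, framework-free sketch of the retrieve-then-generate loop. The tiny keyword-based retrieve and the f-string generate are toy stand-ins for the vector-store lookup and the LLM call we’ll build with LangChain below.

## a bare-bones illustration of the RAG loop
## (the toy knowledge base, keyword scoring, and f-string "generation" are stand-ins)
KNOWLEDGE_BASE = [
    "Consumers can file complaints under the Consumer Protection Act, 2019.",
    "Bail provisions are covered under the Code of Criminal Procedure.",
    "Section 420 of the IPC deals with cheating and dishonest inducement.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Stand-in for a vector-store similarity search: naive keyword-overlap scoring."""
    scored = [(sum(word.lower() in doc.lower() for word in query.split()), doc) for doc in KNOWLEDGE_BASE]
    return [doc for score, doc in sorted(scored, reverse=True)[:top_k] if score > 0]

def generate(query: str, context: list[str]) -> str:
    """Stand-in for the LLM call that answers the query using ONLY the supplied context."""
    return f"Answering {query!r} using context: {context}"

query = "How can consumers file complaints?"
print(generate(query, retrieve(query)))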

This project is broadly separated into two parts — the data ingestion part and the conversational retrieval part. But before we jump straight into coding, we have to understand what LangChain is.

What is LangChain?

LangChain is a framework that simplifies building applications with LLMs by providing modular components and abstractions. For RAG (Retrieval Augmented Generation) pipelines specifically, it offers pre-built components for document loading, text chunking, vector storage, and retrieval — allowing us, the developers, to quickly set up a working system without having to implement these pieces from scratch.

Data Ingestion

The flow is simple — we’ll take a bunch of PDFs, put ’em all in a directory, load ’em into our script, chunk ’em into pieces, create embeddings, and lastly, store the documents and their embeddings in a vector store (e.g., Chroma or FAISS) so that we can perform similarity search later on.

generated using napkin.ai
## functional dependencies
import time
## setting up the env
import os
from dotenv import load_dotenv
load_dotenv()

## langchain dependencies
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

LangChain’s PyPDFLoader uses the popular pypdf library under the hood to load PDF documents into structured formats, leveraging pypdf’s capabilities for extracting text and metadata from PDF files.

LangChain’s RecursiveCharacterTextSplitter is used to break down large bodies of text into smaller, manageable chunks based on a list of separator characters or strings. By default, this list is ["\n\n", "\n", " ", ""], which lets the splitter maintain logical groupings like paragraphs and sentences while splitting. The splitter first attempts to divide the text on the first separator in the list (e.g., "\n\n"); if the resulting chunks are still larger than the specified chunk_size, it moves on to the next separator (e.g., "\n"), and so on. For more detail, click here.
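Here’s a tiny standalone example of that fallback behaviour (the sample text and the deliberately small chunk_size are arbitrary, chosen only so the splitting is easy to see):

from langchain.text_splitter import RecursiveCharacterTextSplitter

sample_text = (
    "Section 1. Every citizen has the right to legal aid.\n\n"
    "Section 2. Courts shall ensure speedy trials.\n\n"
    "Section 3. Appeals must be filed within ninety days."
)

## tiny chunk_size so the fallback through "\n\n" -> "\n" -> " " is easy to observe
demo_splitter = RecursiveCharacterTextSplitter(chunk_size=60, chunk_overlap=10)
for i, chunk in enumerate(demo_splitter.split_text(sample_text), start=1):
    print(f"chunk {i}: {chunk!r}")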

LangChain’s HuggingFaceEmbeddings is the official HuggingFace wrapper that centralizes the usage of different embedding models available on the HuggingFace platform. Wondering what embeddings are? Read this detailed article on embeddings.
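As a quick intuition check (purely illustrative, not part of the pipeline), sentences with similar meaning should land closer together in the embedding space than unrelated ones:

from langchain_huggingface import HuggingFaceEmbeddings
import numpy as np

emb = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

def cosine(a, b) -> float:
    a, b = np.array(a), np.array(b)
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

v1 = emb.embed_query("What is the punishment for theft?")
v2 = emb.embed_query("What penalty does a thief face?")
v3 = emb.embed_query("How do I bake a chocolate cake?")

print(cosine(v1, v2)) ## <- expected to be noticeably higher ...
print(cosine(v1, v3)) ## <- ... than this one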

Note: If you’re not already familiar with basic concepts like embedding and text splitting, I strongly suggest taking the time to learn these foundational ideas before diving into building a RAG pipeline.

Lastly, we’ll use the Chroma vector database for our project, via ChromaDB’s official LangChain wrapper (langchain_chroma). Read the docs for more info.

## setting up directories
current_dir_path = os.path.dirname(os.path.abspath(__file__)) ## <- extracting the directory name from the absolute path of this file
data_path = os.path.join(current_dir_path, "data") ## creating a path for the `data` folder
persistent_directory = os.path.join(current_dir_path, "data-ingestion-local") ## creating a directory to save the vector store locally

Here, we’ve set up the path to the data directory where the PDFs are stored, and a persistent_directory called data-ingestion-local where we’ll save the vector store locally.

In the next few steps, we’ll ingest the data, but before that we have to make sure the data directory exists; otherwise, the code will break.

## checking if the folder that contains the required PDFs exists
if not os.path.exists(data_path):
    raise FileNotFoundError(
        f"[ALERT] {data_path} doesn't exist. ⚠️⚠️"
    )

Now, we’ll collect all the PDF file names in a list using a Python list comprehension —

## list of all the PDFs
pdfs = [pdf for pdf in os.listdir(data_path) if pdf.endswith(".pdf")] ## <- making a list of all file names as str

Since there is more than one PDF file in the data directory, we can either take advantage of LangChain’s DirectoryLoader (more info here) or use a simple for loop with PyPDFLoader to load the documents.
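For reference, the DirectoryLoader route would look roughly like this (a sketch of the alternative we’re not taking; the glob pattern is an assumption):

## alternative (not used below): let DirectoryLoader drive PyPDFLoader for every PDF in the folder
from langchain_community.document_loaders import DirectoryLoader

dir_loader = DirectoryLoader(
    path=data_path, ## <- same `data` folder as before
    glob="*.pdf", ## <- only pick up PDF files
    loader_cls=PyPDFLoader, ## <- delegate the actual parsing to PyPDFLoader
)
doc_container = dir_loader.load() ## <- one `Document` per page, across all files

In this project, though, we’ll stick with the explicit for loop —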

doc_container = [] ## <- list of loaded documents aka container

## taking each item from `pdfs` and loading it using PyPDFLoader
for pdf in pdfs:
    loader = PyPDFLoader(file_path=os.path.join(data_path, pdf), ## <- joining with `data_path` so the script works regardless of the current working directory
                         extract_images=False)
    docsRaw = loader.load() ## <- returns a list of `Document` objects. Each such object has - 1. Page Content // 2. Metadata
    for doc in docsRaw:
        doc_container.append(doc) ## <- appending each `Document` object to the previously declared container (list)

Note: The .load() method of LangChain’s PyPDFLoader loads data into Document objects. It processes the specified PDF file and returns a list of Document instances, each representing a single page of the document.
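If you’re curious what a Document actually holds, a quick peek like this (purely for inspection) shows both attributes:

## inspecting the first loaded page
sample_doc = doc_container[0]
print(sample_doc.metadata) ## <- e.g. the source file path and the page number
print(sample_doc.page_content[:300]) ## <- first 300 characters of that page's text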

Now, we’ll split the documents into chunks —

## splitting the document into chunks
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
docs_split = splitter.split_documents(documents=doc_container)

## displaying information about the split documents
print(f"[INFO] Number of document chunks: {len(docs_split)}", end="\n\n")

Lastly, we’ll create the embeddings for our documents and save ’em in a vector store.

For embeddings, we’ll use all-MiniLM-L6-v2 which you can learn more about here. We chose this particular embedding model because it’s part of the SentenceTransformers framework and maps sentences and paragraphs to a 384-dimensional dense vector space, making it suitable for applications like semantic search, clustering, and sentence similarity assessments. This model is approximately 5x faster than its larger counterparts while still maintaining good quality in embeddings, enabling real-time applications.
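If you want, you can verify the 384-dimension claim with a one-off check like this (not part of the pipeline):

from langchain_huggingface import HuggingFaceEmbeddings

probe = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
print(len(probe.embed_query("a quick sanity check"))) ## <- should print 384 for all-MiniLM-L6-v2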

## embedding and vector store
embedF = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2") ## <- open-source embedding model from HuggingFace
print("[INFO] Started embedding", end="\n")
start = time.time() ## <- noting the starting time

"""
creating the embeddings for the documents and
then storing them in a vector database
"""
vectorDB = Chroma.from_documents(documents=docs_split,
                                 embedding=embedF,
                                 persist_directory=persistent_directory)

end = time.time() ## <- noting the end time
print("[INFO] Finished embedding", end="\n")
print(f"[ADD. INFO] Time taken: {end - start}")

Now that we’ve successfully saved the documents and embeddings in a local vector store, we’re all set to build the main web application. Before we move on, though, note that there’s no need to recreate the embeddings and store them again if they already exist on disk. We can guarantee that with a simple if-else statement wrapping the entire logic. Here’s the entire code for data ingestion —

import time
## setting up the env
import os
from dotenv import load_dotenv
load_dotenv()

## langchain dependencies
from langchain_community.document_loaders import PyPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_chroma import Chroma

## setting up directories
current_dir_path = os.path.dirname(os.path.abspath(__file__)) ## <- extracting the directory name from the absolute path of this file
data_path = os.path.join(current_dir_path, "data") ## creating a path for the `data` folder
persistent_directory = os.path.join(current_dir_path, "data-ingestion-local") ## creating a directory to save the vector store locally

## checking if the vector store directory already exists
if not os.path.exists(persistent_directory):
    print("[INFO] Initiating the build of Vector Database .. 📌📌", end="\n\n")

    ## checking if the folder that contains the required PDFs exists
    if not os.path.exists(data_path):
        raise FileNotFoundError(
            f"[ALERT] {data_path} doesn't exist. ⚠️⚠️"
        )

    ## list of all the PDFs
    pdfs = [pdf for pdf in os.listdir(data_path) if pdf.endswith(".pdf")] ## <- making a list of all file names as str

    doc_container = [] ## <- list of loaded documents aka container

    ## taking each item from `pdfs` and loading it using PyPDFLoader
    for pdf in pdfs:
        loader = PyPDFLoader(file_path=os.path.join(data_path, pdf),
                             extract_images=False)
        docsRaw = loader.load() ## <- returns a list of `Document` objects. Each such object has - 1. Page Content // 2. Metadata
        for doc in docsRaw:
            doc_container.append(doc) ## <- appending each `Document` object to the previously declared container (list)

    ## splitting the documents into chunks
    splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=50)
    docs_split = splitter.split_documents(documents=doc_container)

    ## displaying information about the split documents
    print("\n--- Document Chunks Information ---", end="\n")
    print(f"Number of document chunks: {len(docs_split)}", end="\n\n")

    ## embedding and vector store
    embedF = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2") ## <- open-source embedding model from HuggingFace
    print("[INFO] Started embedding", end="\n")
    start = time.time() ## <- noting the starting time

    ## creating the embeddings for the documents and
    ## then storing them in a vector database
    vectorDB = Chroma.from_documents(documents=docs_split,
                                     embedding=embedF,
                                     persist_directory=persistent_directory)

    end = time.time() ## <- noting the end time
    print("[INFO] Finished embedding", end="\n")
    print(f"[ADD. INFO] Time taken: {end - start}")

else:
    print("[ALERT] Vector Database already exists. ⚠️")

And the output looks something like this —

vscode output screenshot

Main Web App

The logic of how the application will function is as follows —

  1. Set up the initial UI such as titles and stuff.
  2. Initialize an empty list to store conversation turns.
  3. Users will enter their queries.
  4. Create a prompt that uses the user query and chat history to rephrase the query contextually with an LLM, ensuring precise and relevant responses.
  5. Using this prompt and the chat model, we’ll instantiate a history-aware retriever with LangChain’s create_history_aware_retriever.
  6. Formulate a QA prompt to pass to an LLM, which will be responsible for generating a contextually aware response to the user query.
  7. Instantiate an LCEL Runnable with create_stuff_documents_chain, which stuffs the retrieved documents into a single prompt for the LLM to process.
  8. Finally, instantiate another LCEL Runnable with create_retrieval_chain, which retrieves documents based on a query and then passes them to the document chain to generate an answer.
  9. Set up the UI for user interaction.

Note: create_history_aware_retriever returns a Runnable that behaves like a retriever. It’s designed to create a retriever that uses the conversation history to enhance document retrieval.

Setting Up

## functional dependencies
import time
import streamlit as st

## initializing the UI
st.set_page_config(page_title="RAG-Based Legal Assistant")
col1, col2, col3 = st.columns([1, 25, 1])
with col2:
    st.title("RAG-Based Legal Assistant")

## setting up env
import os
from dotenv import load_dotenv
load_dotenv()

## LangChain dependencies
from langchain_huggingface import HuggingFaceEmbeddings
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage
from langchain_groq import ChatGroq
from langchain_chroma import Chroma
from langchain.prompts import ChatPromptTemplate, MessagesPlaceholder
## LCEL implementation of LangChain ConversationalRetrievalChain
from langchain.chains import create_history_aware_retriever, create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain

Here’s a quick explanation — you might be wondering why I set up the page instead of importing the dependencies first. I set up the page first because importing LangChain dependencies is slower than traditional Python libraries. By displaying the title first, users aren’t left staring at a blank screen while the dependencies load.

Now, we’ll set up paths to data and data-ingestion-local directories and initiate the LLM.

## setting up file paths
current_dir = os.path.dirname(os.path.abspath(__file__))
data_path = os.path.join(current_dir, "data")
persistent_directory = os.path.join(current_dir, "data-ingestion-local")

## setting-up the LLM
chatmodel = ChatGroq(model="llama-3.1-8b-instant", temperature=0.15)

Then, we’ll initialize a streamlit.session_state key. session_state in Streamlit allows us to store and manage variables across different interactions within a single user session. It helps maintain state, such as user inputs or intermediate results, without resetting on each rerun, ensuring a smoother and more interactive user experience. We’ll use st.session_state to store the conversation turns.

## setting up -> streamlit session state
if "messages" not in st.session_state:
    st.session_state["messages"] = []

# resetting the entire conversation
def reset_conversation():
    st.session_state["messages"] = []

Now, we’ll load the previously saved vector store into our script. For that, we must use the exact same embedding model that we used during data ingestion.

## the same open-source embedding model from HuggingFace that we used during ingestion
embedF = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

## loading the vector database from local
vectorDB = Chroma(embedding_function=embedF, persist_directory=persistent_directory)

## setting up the retriever
kb_retriever = vectorDB.as_retriever(search_type="similarity", search_kwargs={"k": 3})

Note: It’s important that you use the same embedding model in both the cases because you DON’T want the user query embeddings to be different from the document embeddings, do you?
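Before wiring the retriever into any chain, it’s worth a quick sanity check that similarity search returns something sensible; the query below is just an example:

## optional sanity check: does the retriever surface relevant chunks?
sample_hits = kb_retriever.invoke("What rights does a consumer have?") ## <- example query
for hit in sample_hits: ## <- at most k=3 `Document` objects
    print(hit.metadata.get("source"), "->", hit.page_content[:120])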

We’ll set up the rephrasing prompt and instantiate the history-aware retriever.

## initiating the history_aware_retriever
rephrasing_template = (
"""
TASK: Convert context-dependent questions into standalone queries.

INPUT:
- chat_history: Previous messages
- question: Current user query

RULES:
1. Replace pronouns (it/they/this) with specific referents
2. Expand contextual phrases ("the above", "previous")
3. Return original if already standalone
4. NEVER answer or explain - only reformulate

OUTPUT: Single reformulated question, preserving original intent and style.

Example:
History: "Let's discuss Python."
Question: "How do I use it?"
Returns: "How do I use Python?"
"""
)

rephrasing_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", rephrasing_template),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

history_aware_retriever = create_history_aware_retriever(
    llm=chatmodel,
    retriever=kb_retriever,
    prompt=rephrasing_prompt
)
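That covers steps 4 and 5 of the plan. Steps 6 through 8 (the QA prompt, create_stuff_documents_chain, and create_retrieval_chain) produce the conversational_rag_chain we invoke in the chat interface below. The exact QA prompt in the repo may differ; here is a sketch using the standard LangChain API, where the system prompt wording is my own placeholder:

## QA prompt: answer ONLY from the retrieved context (the wording here is a placeholder)
qa_system_template = (
    "You are a legal assistant. Answer the user's question ONLY from the context given below. "
    "If the answer isn't present in the context, say that you don't know."
    "\n\nCONTEXT:\n{context}"
)

qa_prompt = ChatPromptTemplate.from_messages(
    [
        ("system", qa_system_template),
        MessagesPlaceholder("chat_history"),
        ("human", "{input}"),
    ]
)

## stuffs the retrieved documents into the {context} slot of the QA prompt
qa_chain = create_stuff_documents_chain(llm=chatmodel, prompt=qa_prompt)

## ties retrieval and generation together; this is what we invoke below
conversational_rag_chain = create_retrieval_chain(history_aware_retriever, qa_chain)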

Finally, we now have to set up the chat interface. So, first, we’ll have to print the chat history (if any) and take the input from the user —

## printing all (if any) messages stored in the session_state `messages` key
for message in st.session_state.messages:
    with st.chat_message(message.type):
        st.write(message.content)

user_query = st.chat_input("Ask me anything ..")

The .type attribute of LangChain’s HumanMessage or AIMessage returns the role of the message, e.g. human or ai. Streamlit’s chat_message inserts a chat message container, and you can even pass an emoji like 🦖 as its avatar.

Now, when a user submits a query, the retrieval chain gets invoked, fetching and saving the result. But instead of just displaying it with st.write, which is quite mundane, we’ll make the response engaging and dynamic.

Imagine the result appearing on the screen letter by letter, just like in ChatGPT and other GenAI interfaces. This makes the interaction more dynamic and engaging.

if user_query:
    with st.chat_message("user"):
        st.write(user_query)

    with st.chat_message("assistant"):
        with st.status("Generating 💡...", expanded=True):
            ## invoking the chain to fetch the result
            result = conversational_rag_chain.invoke({"input": user_query, "chat_history": st.session_state['messages']})

            message_placeholder = st.empty()

            full_response = (
                "⚠️ **_This information is not intended as a substitute for legal advice. "
                "We recommend consulting with an attorney for a more comprehensive and"
                " tailored response._** \n\n\n"
            )

            ## displaying the output on the dashboard, chunk by chunk
            for chunk in result["answer"]:
                full_response += chunk
                time.sleep(0.02) ## <- simulate the typing effect of ChatGPT
                message_placeholder.markdown(full_response + " ▌") ## <- re-render the partial response on every iteration

        st.button('Reset Conversation 🗑️', on_click=reset_conversation)

    ## appending conversation turns
    st.session_state.messages.extend(
        [
            HumanMessage(content=user_query),
            AIMessage(content=result['answer'])
        ]
    )

Note: Make sure you read the documentation for st.status and st.empty.

Finally, we added a button to clear the current session’s conversation, and we appended the latest turn back to st.session_state.messages, which is declared as a list.

Thanks for taking the time to read! Should you notice any typos or conceptual mistakes, feel free to point them out in the comments. I truly value your honest feedback. Cheers!

Don’t forget to connect with me on LinkedIn.
