Searching PDF documents using LLMs with RAG

One of the interesting things about embeddings¹ is that they let you search documents with results that reflect the meaning of your query rather than the exact words and terms it contains.

I am going to use Latent Space’s The 2025 AI Engineer Reading List, searching through the PDFs from its RAG and Agents sections to see what happens when I search for RAG-related terms.

The technique explored below may already be redundant for smaller collections of files like the ones used here, thanks to models like Gemini 2.0 Pro Experimental. Its context window of 2 million tokens lets you put far more data straight into the prompt instead of retrieving it first.

Using LangChain to Connect LLMs with Your Data

I’m using LangChain, an open source framework that makes life easier for people writing LLM applications. We start by looping through the documents, loading each PDF with PyPDFLoader and then splitting the text with RecursiveCharacterTextSplitter.

import os

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

pdf_directory = "../../sample_data/The 2025 AI Engineer Reading List/"
pdf_files = [f for f in os.listdir(pdf_directory) if f.endswith('.pdf')]

documents = []
for pdf_file in pdf_files:
    file_path = os.path.join(pdf_directory, pdf_file)
    loader = PyPDFLoader(file_path)
    documents.extend(loader.load())

# Split the documents into manageable, overlapping chunks.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
docs = text_splitter.split_documents(documents)

The RecursiveCharacterTextSplitter produces a list of Document objects, each carrying its metadata (source file name, page number, etc.) along with the actual content. Note that at this stage the content is still stored as plain text.
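
To get a feel for what the splitter produced, we can peek at one of the chunks. A quick check might look like this (the exact metadata keys depend on the loader; PyPDFLoader typically records the source file and the page number):

print(docs[0].metadata)            # e.g. {'source': '.../some.pdf', 'page': 0}
print(docs[0].page_content[:200])  # the first 200 characters of the chunk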

The chunk_size parameter controls how thin the slices of our cake are (and how many slices we end up with). chunk_overlap controls how much content from one slice carries over into the next, preserving continuity of meaning across chunk boundaries. Both need to be tuned to the kind of search being implemented; frankly, at this stage I have no idea what the best settings are for my PDF search.
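
One cheap way to build intuition is to re-run the splitter with a few different settings and compare how many chunks come out. This is just an exploratory sketch, not part of the pipeline above:

for size, overlap in [(500, 100), (1000, 200), (2000, 400)]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: "
          f"{len(splitter.split_documents(documents))} chunks")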

Create embeddings for the documents

We need our AI models to be able to work with the documents, so we create embeddings. This is where we start hitting the OpenAI API. I would prefer to use a local model, but for now I want the best results possible, so that if there are problems I know they are my fault and not the model’s.

We use embeddings from OpenAI and store them in FAISS, a library developed by Meta for efficient similarity search. With the default index, the scores returned are Euclidean (L2) distances: they start at 0, have no fixed upper bound, and lower scores signify a closer match.

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()  # needs the OPENAI_API_KEY environment variable
vectorstore = FAISS.from_documents(docs, embeddings)

And if we peek into the vectorstore object and look at our first embedding, we see how the text is represented as a vector/array.

vectorstore.index.reconstruct(0)
array([-0.01694826, -0.00412454, -0.00136518, ..., -0.00767243,
        0.00234885, -0.04821463], dtype=float32)
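
With the index built, we can run the kind of RAG-related query this post set out to try. A minimal sketch (the query string is just an example, and the metadata keys depend on the loader):

query = "retrieval augmented generation"
for doc, score in vectorstore.similarity_search_with_score(query, k=3):
    # Lower score = smaller Euclidean distance = closer match.
    print(f"{score:.3f}  {doc.metadata.get('source')}  page {doc.metadata.get('page')}")
    print(doc.page_content[:150])
    print()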

Final thoughts

At this point, I’m not entirely sure how good the search actually is. Some results seem relevant, but I don’t know how much of that is due to the tech itself versus my choice of parameters. The documents all cover a very similar topic, which makes it even harder to evaluate the quality of the results.

The next step would be to try a larger set of documents, tweaking parameters like chunk size and overlap and experimenting with cosine similarity rather than Euclidean distance.
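
If I understand the LangChain FAISS wrapper correctly, switching to cosine similarity should only require building the index with a different distance strategy. A sketch of that experiment (untested here):

from langchain_community.vectorstores.utils import DistanceStrategy

cosine_store = FAISS.from_documents(
    docs, embeddings, distance_strategy=DistanceStrategy.COSINE
)
results = cosine_store.similarity_search_with_score("retrieval augmented generation", k=3)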

Footnotes

  1. AWS explains embeddings here and I play around with them here.