Searching PDF documents using LLMs with RAG

One of the interesting things about embeddings¹ is that they let you search documents with results that reflect the meaning of your query rather than the exact words and terms it contains.

I am going to use Latent Space’s The 2025 AI Engineer Reading List, searching through the PDFs from its RAG and Agents sections to see what happens when I search for RAG-related terms.

The technique explored below may already be redundant for smaller collections of files like the ones used here, thanks to models like Gemini 2.0 Pro Experimental. Its context window of 2 million tokens lets you put far more data straight into the prompt instead of retrieving it first.

Using LangChain to Connect LLMs with Your Data

I’m using LangChain, an open source framework that makes life easier for people writing LLM applications. We start by looping through the documents, loading each PDF with PyPDFLoader and then splitting the text with RecursiveCharacterTextSplitter.

import os

from langchain_community.document_loaders import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter

pdf_directory = "../../sample_data/The 2025 AI Engineer Reading List/"
pdf_files = [f for f in os.listdir(pdf_directory) if f.endswith('.pdf')]

documents = []
for pdf_file in pdf_files:
    file_path = os.path.join(pdf_directory, pdf_file)
    loader = PyPDFLoader(file_path)
    documents.extend(loader.load())

# Split the documents into manageable, overlapping chunks.
text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200
)
docs = text_splitter.split_documents(documents)

The RecursiveCharacterTextSplitter produces a list of Document objects, each carrying its metadata (source file name, page number, etc.) along with the actual content. Note that at this stage the content is still stored as plain text.
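
To get a feel for what the splitter produced, we can peek at one of the chunks. A quick check might look like this (the exact metadata keys depend on the loader; PyPDFLoader typically records the source file and the page number):

print(docs[0].metadata)            # e.g. {'source': '.../some.pdf', 'page': 0}
print(docs[0].page_content[:200])  # the first 200 characters of the chunk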

The chunk_size parameter controls how thin the slices of our cake are (and how many slices we end up with). chunk_overlap controls how much content from one slice carries over into the next, preserving continuity of meaning across chunk boundaries. Both need to be tuned to the kind of search being implemented; frankly, at this stage I have no idea what the best settings are for my PDF search.
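
One cheap way to build intuition is to re-run the splitter with a few different settings and compare how many chunks come out. This is just an exploratory sketch, not part of the pipeline above:

for size, overlap in [(500, 100), (1000, 200), (2000, 400)]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=size, chunk_overlap=overlap)
    print(f"chunk_size={size}, chunk_overlap={overlap}: "
          f"{len(splitter.split_documents(documents))} chunks")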

Create embeddings for the documents

We need our AI models to be able to work with the documents, so we create embeddings. This is where we start hitting the OpenAI API. I would prefer to use a local model, but for now I want the best results possible, so that if there are problems I know they are my fault and not the model’s.

We use embeddings from OpenAI and store them in FAISS, a library developed by Meta for efficient similarity search. With the default index, the scores returned are Euclidean (L2) distances: they start at 0, have no fixed upper bound, and lower scores signify a closer match.

from langchain_community.vectorstores import FAISS
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings()  # needs the OPENAI_API_KEY environment variable
vectorstore = FAISS.from_documents(docs, embeddings)

And if we peek into the vectorstore object and look at our first embedding, we see how the text is represented as a vector/array.

vectorstore.index.reconstruct(0)
array([-0.01694826, -0.00412454, -0.00136518, ..., -0.00767243,
        0.00234885, -0.04821463], dtype=float32)
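
With the index built, we can run the kind of RAG-related query this post set out to try. A minimal sketch (the query string is just an example, and the metadata keys depend on the loader):

query = "retrieval augmented generation"
for doc, score in vectorstore.similarity_search_with_score(query, k=3):
    # Lower score = smaller Euclidean distance = closer match.
    print(f"{score:.3f}  {doc.metadata.get('source')}  page {doc.metadata.get('page')}")
    print(doc.page_content[:150])
    print()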

Final thoughts

At this point, I’m not entirely sure how good the search actually is. Some results seem relevant, but I don’t know how much of that is due to the tech itself versus my choice of parameters. The documents all cover a very similar topic, which makes it even harder to evaluate the quality of the results.

The next step would be to try a larger set of documents, tweaking parameters like chunk size and overlap and experimenting with cosine similarity rather than Euclidean distance.
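
If I understand the LangChain FAISS wrapper correctly, switching to cosine similarity should only require building the index with a different distance strategy. A sketch of that experiment (untested here):

from langchain_community.vectorstores.utils import DistanceStrategy

cosine_store = FAISS.from_documents(
    docs, embeddings, distance_strategy=DistanceStrategy.COSINE
)
results = cosine_store.similarity_search_with_score("retrieval augmented generation", k=3)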

Footnotes

  1. AWS explains embeddings here and I play around with them here.