I have a list of sentences and a search query. I want to find the sentence in the list that is most similar to the query sentence.
The query is: “A well dressed man became a Member of Parliament.”
The list of sentences to search is, in order of similarity:
“Vel klæddur maður varð þingmaður.” - the query sentence in Icelandic.
“A well dressed man ran for office.” - a similar sentence, dude becomes a politician.
“The woman in the dress became a politician.” - a slightly less similar sentence, a well dressed woman becomes a politician.
“A well dressed man became a member of a football club.” - similar wording, but a different semantic meaning.
One common approach to this problem is to use a vector embedding model to represent each sentence as a vector, then use cosine similarity to compare them. The closer the vectors are, the more similar the sentences are. I covered this in a previous post: Comparing AI embedding models.
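Cosine similarity is just the dot product of the two vectors scaled by their lengths. Here is a minimal sketch in numpy, using made-up three-dimensional vectors standing in for the real 1536-dimensional embeddings:

import numpy as np

def cosine(a, b):
    """Cosine similarity: dot product divided by the product of the vector lengths."""
    a, b = np.asarray(a), np.asarray(b)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Toy vectors for illustration only; real embeddings have 1536 dimensions
cosine([1.0, 0.0, 1.0], [0.5, 0.2, 0.9])  # ~0.94, the vectors point in roughly the same direction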
In this post, however, I am going to compare this similarity search with simply sending the query sentence along with the other sentences directly to the AI model and asking it to find the most similar sentence.
Comparing the sentences using vector embeddings
I start by creating a function called find_most_similar_sentence (thanks Claude!). It takes a target sentence and a list of other sentences, and returns the most similar sentence along with all the similarity scores.
Code
import numpy as np
from openai import OpenAI
from sklearn.metrics.pairwise import cosine_similarity

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def get_embedding(text, model="text-embedding-3-small"):
    """Get the embedding for a text using OpenAI's API"""
    response = client.embeddings.create(
        model=model,
        input=text,
        dimensions=1536  # Dimensionality of the embedding (default for text-embedding-3-small)
    )
    return response.data[0].embedding

def find_most_similar_sentence(target_sentence, sentence_list):
    """Find the sentence in sentence_list most similar to target_sentence"""
    # Get embeddings for all sentences
    target_embedding = get_embedding(target_sentence)
    sentence_embeddings = [get_embedding(sentence) for sentence in sentence_list]

    # Calculate cosine similarity between target and each sentence
    similarities = []
    for embedding in sentence_embeddings:
        # Reshape for sklearn's cosine_similarity function
        target_embedding_reshaped = np.array(target_embedding).reshape(1, -1)
        embedding_reshaped = np.array(embedding).reshape(1, -1)
        similarity = cosine_similarity(target_embedding_reshaped, embedding_reshaped)[0][0]
        similarities.append(similarity)

    # Find the index of the most similar sentence
    most_similar_index = np.argmax(similarities)

    return {
        "most_similar_sentence": sentence_list[most_similar_index],
        "similarity_score": similarities[most_similar_index],
        "all_similarities": list(zip(sentence_list, similarities))
    }
sentences = [
    "A well dressed man ran for office.",
    "The woman in the dress became a politician.",
    "A well dressed man became a member of a football club.",
    "Vel klæddur maður varð þingmaður.",  # the query sentence, but in Icelandic
]

query_sentence = "A well dressed man became a Member of Parliament."

result = find_most_similar_sentence(query_sentence, sentences)
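A quick sanity check of the returned dict before building the table (the keys are the ones defined in find_most_similar_sentence above):

print(result["most_similar_sentence"])        # "A well dressed man ran for office."
print(round(result["similarity_score"], 3))   # 0.677 on this run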
If we put the results in a table and highlight closer matches in a stronger green, we can see the similarity scores more clearly:
Code
import pandas as pd
import great_tables as gt

# Create DataFrame
df = pd.DataFrame(result['all_similarities'], columns=["Sentence", "Similarity Score"])

# Build a nice table
table = (
    gt.GT(df)
    .tab_header(
        title="Similarity Scores Using Vector Cosine Similarity",
        subtitle=f"Most similar sentence to '{query_sentence}'"
    )
    .fmt_number(columns="Similarity Score", decimals=3)
    .data_color(
        columns="Similarity Score",
        palette="Greens",
        reverse=False
    )
)
table
Similarity Scores Using Vector Cosine Similarity
Most similar sentence to 'A well dressed man became a Member of Parliament.'

Sentence                                                  Similarity Score
A well dressed man ran for office.                        0.677
The woman in the dress became a politician.               0.537
A well dressed man became a member of a football club.    0.611
Vel klæddur maður varð þingmaður.                          0.242
The most similar sentence was correctly found: a well dressed man running for office. But the dude becoming a member of a football club gets a higher score than the woman becoming a politician, even though the football club sentence means something quite different.
And the Icelandic sentence, a direct translation of the query, is at the bottom.
I might have gotten better results if I had used one of the multilingual embedding models. I’m not sure whether these are trained on Icelandic, and the research I did indicated that OpenAI’s default models are pretty good at multilingual tasks anyway. Try it for yourself and let me know - a starting point is sketched below.
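Here is a rough sketch using sentence-transformers with one of its multilingual checkpoints. This is an assumption on my part - I haven’t verified that this particular model covers Icelandic:

from sentence_transformers import SentenceTransformer, util

# paraphrase-multilingual-MiniLM-L12-v2 is trained on 50+ languages;
# whether Icelandic is among them is something I haven't checked
model = SentenceTransformer("paraphrase-multilingual-MiniLM-L12-v2")

query_vec = model.encode(query_sentence)
sentence_vecs = model.encode(sentences)

# util.cos_sim returns a tensor of cosine similarities, one per sentence
print(util.cos_sim(query_vec, sentence_vecs))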
Comparing the sentences using the AI model
Next, I ask the AI model directly which sentences are most similar, using this prompt:
I will send you a list of sentences and you are to give me the list back along with a score of how closely they match my query sentence semantically, i.e. do they mean the same thing.
Code
import json

prompt = f"""I will send you a list of sentences and you are to give me the list back
along with a score of how closely they match my query sentence semantically,
i.e. do they mean the same thing.

The query sentence is: {query_sentence}

The list of sentences to compare it to is: {sentences}

Return your result as a json object with a "results" list, where for each entry
- "sentence" is the sentence from the list
- "score" is the score of how closely it matches the query sentence
"""

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": prompt},
    ],
    response_format={"type": "json_object"}  # <- This ensures valid JSON is returned
)

data = json.loads(response.choices[0].message.content)

# Create DataFrame from the response
df2 = pd.DataFrame(data['results'])

table2 = (
    gt.GT(df2)
    .tab_header(
        title="Similarity Scores Using Direct Model Response",
        subtitle=f"Most similar sentence to '{query_sentence}'"
    )
    .fmt_number(columns="score", decimals=3)
    .data_color(
        columns="score",
        palette="Greens"
    )
)
table2
Similarity Scores Using Direct Model Response
Most similar sentence to 'A well dressed man became a Member of Parliament.'

sentence                                                  score
A well dressed man ran for office.                        0.700
The woman in the dress became a politician.               0.600
A well dressed man became a member of a football club.    0.400
Vel klæddur maður varð þingmaður.                          0.900
Conclusion
The vector-based approach found the most similar sentence, but it gave a false positive as the second-closest match and failed to identify the sentence that was a direct translation.
Sending the sentences directly to the LLM worked better: it ranked the sentences in the right order and put the direct translation at the top.
The LLM method was a bit more fiddly - it took me a while to find the right prompt. I originally forgot to add “how closely they match semantically”. This method is more expensive, but calling the current foundation models is still pretty cheap, so it depends on how many calls you have to make.
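One way to reduce the prompt fiddling, assuming a recent version of the openai Python SDK, is to have the SDK enforce the response shape with a pydantic model instead of describing the JSON by hand. This is a sketch, not what I ran above:

from pydantic import BaseModel

class SentenceScore(BaseModel):
    sentence: str
    score: float

class SimilarityScores(BaseModel):
    results: list[SentenceScore]

# The parse helper validates the model's output against the pydantic schema
completion = client.beta.chat.completions.parse(
    model="gpt-4o",
    messages=[{"role": "user", "content": prompt}],
    response_format=SimilarityScores,
)

for item in completion.choices[0].message.parsed.results:
    print(item.sentence, item.score)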