Comparing AI embedding models

In my last post I used Meta’s Llama to create vector embeddings from different words and sentences. My editor, ChatGPT of course, suggested I use a model that specializes in embedding rather than Llama, which is built for text generation (chatting). I did, and the results are a bit surprising.

Choosing models and sentences

The models I want to compare are the Llama3.1 model I used in my previous post and the models mentioned in Ollama’s blog post about embeddings.

models = [
    'all-minilm',
    'mxbai-embed-large',
    'nomic-embed-text',
    'llama3.1'
]
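
Each of these has to be pulled locally before it can produce embeddings. A small sketch, assuming the official ollama Python client is installed and the Ollama server is running (you can equally run "ollama pull" from the command line):

import ollama

# Download each model so the embedding calls below have it available locally.
for m in models:
    ollama.pull(m)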

I want to calculate the similarities between three sentences: two that sound very different but mean something similar (a person goes into politics), and one that is about public transportation.

sentences = [
    "The man with the tie ran for office.",
    "The woman in the dress became a politician.",
    "The man with the tie ran for a bus."
]

The results

Next, we loop through the models and use each one to embed the three sentences into vectors, then compare the vectors pairwise to build a cosine similarity matrix.
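
For reference, the cosine similarity of two vectors is their dot product divided by the product of their norms: 1 means the vectors point in the same direction, 0 means they are orthogonal. A minimal sketch with NumPy, using two invented toy vectors rather than real embeddings:

import numpy as np

a = np.array([1.0, 2.0, 3.0])  # toy vectors, not real model output
b = np.array([2.0, 4.0, 6.5])

# Dot product divided by the product of the norms.
cos_sim = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print(round(cos_sim, 4))  # close to 1, since b is nearly a scaled copy of a

And here is the comparison code itself: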

import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
from langchain_ollama import OllamaEmbeddings
from great_tables import GT, loc, style

def compare_words(words, embeddings, model_name):
    # Embed each sentence, then compute the pairwise cosine similarity matrix.
    vectors = [embeddings.embed_query(s) for s in words]
    matrix = cosine_similarity(vectors)
    # Keep each pair once (lower triangle), label the pairs with the original
    # sentences, and name the similarity column after the model.
    df = (
        pd.DataFrame(matrix)
        .stack()
        .reset_index()
        .query("level_0 > level_1")
        .replace({"level_0": dict(enumerate(words)), "level_1": dict(enumerate(words))})
        .rename(columns={"level_0": "First", "level_1": "Second", 0: model_name.split("/")[-1]})
        .round(2)
    )
    return df

# Run the comparison once per model, then join the per-model similarity
# columns next to the First/Second pairs from the first result.
results = []
for m in models:
    embeddings = OllamaEmbeddings(model=m)
    results.append(compare_words(sentences, embeddings=embeddings, model_name=m))

result = pd.concat([results[0]] + [r.iloc[:, -1] for r in results[1:]], axis=1)
GT(
    result
).tab_header(
    "Cosine similarities between First and Second phrases"
).tab_spanner(
    label="model",
    columns=[2, 3, 4, 5]
).tab_style(
    # Highlight each model's highest-scoring pair: the three embedding models
    # rate "ran for a bus" closest to "ran for office"...
    style=style.fill(color="lightgreen"),
    locations=[loc.body(columns=[2, 3, 4], rows=[1])]
).tab_style(
    # ...while llama3.1 rates the two politics sentences closest.
    style=style.fill(color="lightgreen"),
    locations=loc.body(columns=[5], rows=[0])
)
Cosine similarities between First and Second phrases

First                                        Second                                       all-minilm  mxbai-embed-large  nomic-embed-text  llama3.1
The woman in the dress became a politician.  The man with the tie ran for office.               0.29               0.59              0.64      0.91
The man with the tie ran for a bus.          The man with the tie ran for office.               0.64               0.69              0.84      0.87
The man with the tie ran for a bus.          The woman in the dress became a politician.        0.09               0.31              0.50      0.85

Llama is the only model that correctly identifies the man and the woman going into politics as more similar to each other than the two sentences about the man with the tie (one running for a bus, the other for office).

My takeaways are:

  • If I were creating a semantic search, Llama would show me the correct result at the top (see the sketch after this list).
  • However, Llama’s representations of a man running for a bus and a woman becoming a politician are also quite close (0.85), so maybe I just lucked out.
  • nomic-embed-text has the second highest score for the politicians being similar.
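
To make that first takeaway concrete, here is a toy semantic search that ranks the three sentences by cosine similarity to a query. It reuses OllamaEmbeddings and cosine_similarity from above; the query string is invented for illustration:

# Rank the candidate sentences by similarity to a query.
embeddings = OllamaEmbeddings(model="llama3.1")

query = "Someone pursued a political career."  # hypothetical query
query_vec = embeddings.embed_query(query)
sentence_vecs = [embeddings.embed_query(s) for s in sentences]

# cosine_similarity expects 2D inputs; take the single row of scores.
scores = cosine_similarity([query_vec], sentence_vecs)[0]
for score, sentence in sorted(zip(scores, sentences), reverse=True):
    print(f"{score:.2f}  {sentence}")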

So, which model is the best one for embedding? The answer is probably: it depends. Sigh.