 

OpenAI Embeddings API: How embeddings work?

There are quite a few tutorials on embeddings in OpenAI. I can't understand how they work.

Referring to https://platform.openai.com/docs/guides/embeddings/what-are-embeddings , an embedding is a vector (a list of numbers). A string is passed to an embedding model and, in the simplest terms, the model returns a list of numbers that I can then use.

If I use a simple string to get its embedding, I get a massive list:

result = get_embedding("I live in space", engine="text-search-curie-doc-001")

The result when printed:

[5.4967957112239674e-05,
 -0.01301578339189291,
 -0.002223075833171606,
 0.013594076968729496,
 -0.027540158480405807,
 0.008867159485816956,
 0.009403547272086143,
 -0.010987567715346813,
 0.01919262297451496,
 0.022209804505109787,
 -0.01397960539907217,
 -0.012806257233023643,
 -0.027908924967050552,
 0.013074451126158237,
 0.024942029267549515,
 0.0200139675289392, ...]  (truncated; the full list is much, much longer)

Question 1 - How is this massive list correlated with my 4-word text?

Question 2 -

I create embeddings of the text I want to use in the query. Note that it is exactly the same as the text of the original content: I live in space

queryembedding = get_embedding(
        'I live in space',
        engine="text-search-curie-query-001"
    )
queryembedding

When I run cosine similarity:

similarity = cosine_similarity(embeddings_of_i_live,queryembedding)
similarity

I get the value 0.42056650555103214.

Shouldn't the value be 1 to indicate identical values?

asked Sep 12 '25 by Manu Chadha


1 Answer

Q1:

How is this massive list correlated with my 4-word text?

A1: Let's say you want to use the OpenAI text-embedding-ada-002 model. No matter what your input is, you will always get a 1536-dimensional embedding vector (i.e., there are 1536 numbers inside). You are probably familiar with 3-dimensional space (i.e., X, Y, Z). Well, this is a 1536-dimensional space, which is very hard to imagine. Why are there exactly 1536 numbers inside the embedding vector? Because the text-embedding-ada-002 model has an output dimension of 1536. It's pre-defined.
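For illustration, here is a minimal sketch of getting an embedding and checking its dimension. It assumes the openai Python package at a version below 1.0 (the `openai.Embedding.create` call style of the ada-002 era; the client API changed in later releases) and an OPENAI_API_KEY set in the environment:

import openai  # assumes openai<1.0; reads OPENAI_API_KEY from the environment

response = openai.Embedding.create(
    model="text-embedding-ada-002",
    input="I live in space",
)
embedding = response["data"][0]["embedding"]
print(len(embedding))  # always 1536 for this model, regardless of input length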


Q2:

I create embeddings of the text I want to use in the query. Note that it is exactly the same as the text of the original content: I live in space. When I run cosine similarity, the value is 0.42056650555103214. Shouldn't the value be 1 to indicate identical values?

A2: Yes, the value should be 1 if you calculate cosine similarity between two identical texts embedded with the same model. See an example here. Note that in your code the document was embedded with text-search-curie-doc-001 while the query was embedded with text-search-curie-query-001; these are two different models and they produce different vectors for the same text, which is why your similarity is below 1.
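As a quick sanity check, a minimal sketch in plain numpy (with a hypothetical stand-in vector, not a real embedding) showing that the cosine similarity of any vector with itself is 1:

import numpy as np

def cosine_similarity(a, b):
    # Cosine of the angle between two vectors: 1.0 means identical direction
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v = [0.1, -0.2, 0.3]             # hypothetical stand-in for an embedding
print(cosine_similarity(v, v))   # 1.0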

For an example of semantic search based on embeddings, see this answer.
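To give a rough sense of the idea, here is a toy sketch of semantic search: embed the query, then rank documents by cosine similarity. The 3-dimensional vectors are hypothetical stand-ins for real 1536-dimensional embeddings:

import numpy as np

def cosine_similarity(a, b):
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy "embeddings" keyed by the document text they represent
documents = {
    "I live in space": [0.9, 0.1, 0.0],
    "Cats sleep a lot": [0.1, 0.8, 0.3],
}
query_vector = [0.85, 0.15, 0.05]  # pretend this came from the query model

# Rank documents from most to least similar to the query
ranked = sorted(documents.items(),
                key=lambda kv: cosine_similarity(query_vector, kv[1]),
                reverse=True)
for text, vec in ranked:
    print(text, cosine_similarity(query_vector, vec))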

answered Sep 13 '25 by Rok Benko