There are quite a few tutorials on OpenAI embeddings, but I can't understand how they work.
Referring to https://platform.openai.com/docs/guides/embeddings/what-are-embeddings , an embedding is a vector, i.e., a list of floating-point numbers. A string is passed to an embedding model and, in the simplest terms, the model returns a list of numbers that I can then use.
If I use a simple string to get its embedding, I get a massive list:

result = get_embedding("I live in space", engine="text-search-curie-doc-001")
result

When printed:
[5.4967957112239674e-05,
-0.01301578339189291,
-0.002223075833171606,
0.013594076968729496,
-0.027540158480405807,
0.008867159485816956,
0.009403547272086143,
-0.010987567715346813,
0.01919262297451496,
0.022209804505109787,
-0.01397960539907217,
-0.012806257233023643,
-0.027908924967050552,
0.013074451126158237,
0.024942029267549515,
0.0200139675289392, ...] (truncated; the full list is much, much longer)
Question 1 - how is this massive list correlated with my 4-word text?
Question 2 - I create an embedding of the text I want to use in the query. Note that it is exactly the same as the text of the original content, "I live in space":

queryembedding = get_embedding(
    'I live in space',
    engine="text-search-curie-query-001"
)
queryembedding
When I run cosine similarity, I get the value 0.42056650555103214:

similarity = cosine_similarity(embeddings_of_i_live, queryembedding)
similarity

Shouldn't the value be 1 to indicate identical values?
Q1:
How is this massive list correlated with my 4-word text?
A1: Let's say you want to use the OpenAI text-embedding-ada-002
model. No matter what your input is, you will always get a 1536-dimensional embedding vector (i.e., there are 1536 numbers inside). You are probably familiar with 3-dimensional space (i.e., X, Y, Z). Well, this is a 1536-dimensional space, which is very hard to imagine. Why are there exactly 1536 numbers inside the embedding vector? Because the text-embedding-ada-002
model has an output dimension of 1536. It's pre-defined.
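To see what a fixed output dimension means in practice, here is a small sketch using a `fake_embedding` function, which is purely hypothetical and only stands in for a real API call: no matter how long the input text is, the returned vector always has the model's fixed length.

```python
import hashlib
import random

EMBEDDING_DIM = 1536  # output dimension of text-embedding-ada-002

def fake_embedding(text: str) -> list[float]:
    """Hypothetical stand-in for a real embedding call: a deterministic
    pseudo-random vector whose length is fixed by the model, not the input."""
    seed = int.from_bytes(hashlib.sha256(text.encode()).digest()[:8], "big")
    rng = random.Random(seed)
    return [rng.uniform(-1.0, 1.0) for _ in range(EMBEDDING_DIM)]

short_vec = fake_embedding("I live in space")
long_vec = fake_embedding("I live in space " * 100)

# Both are 1536 numbers long: the dimension is pre-defined by the model.
print(len(short_vec), len(long_vec))
```

A real model, of course, places semantically similar texts near each other in that 1536-dimensional space, which is what makes the vectors useful.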
Q2:
I create embeddings of the text I want to use in the query. Note that it is exactly the same as the text of the original content:
I live in space
. When I run cosine similarity, the value is 0.42056650555103214. Shouldn't the value be 1 to indicate an identical value?
A2: Yes, the value should be 1 if you calculate cosine similarity between two identical texts embedded with the same model. In your code, however, the content was embedded with the doc engine and the query with the query engine; two different models produce different vectors for the same text, so the similarity comes out below 1.
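As a quick illustration of the math (a plain-NumPy sketch, not the OpenAI helper itself), cosine similarity between a vector and itself is 1, while two different vectors for the same text, here toy stand-ins for the doc-model and query-model outputs, score lower:

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

v_doc = [0.1, -0.3, 0.7, 0.2]    # toy "doc model" embedding of a text
v_query = [0.4, 0.1, 0.5, -0.2]  # toy "query model" embedding of the same text

print(cosine_similarity(v_doc, v_doc))    # ~1.0: identical vectors
print(cosine_similarity(v_doc, v_query))  # below 1: different vectors
```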
For an example of semantic search based on embeddings, see this answer.
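The steps of such a search can be sketched as follows, again with a hypothetical `fake_embedding` stand-in for a real model call (texts sharing words get similar vectors here, which loosely mimics what a trained model learns):

```python
import hashlib
import math
import random

def fake_embedding(text, dim=16):
    """Hypothetical stand-in for an embedding model call: sums a
    deterministic pseudo-random vector per word, so overlapping
    vocabularies yield more similar vectors."""
    vec = [0.0] * dim
    for word in text.lower().split():
        seed = int.from_bytes(hashlib.sha256(word.encode()).digest()[:8], "big")
        rng = random.Random(seed)
        for i in range(dim):
            vec[i] += rng.uniform(-1.0, 1.0)
    return vec

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

documents = [
    "I live in space",
    "The cat sat on the mat",
    "Astronauts live in space stations",
]
query = "living in space"

# Embed the query once, then rank every document by similarity to it.
q_vec = fake_embedding(query)
ranked = sorted(documents,
                key=lambda d: cosine_similarity(fake_embedding(d), q_vec),
                reverse=True)
print(ranked)
```

In a real application the document embeddings would be computed once and stored (e.g., in a vector database), and only the query would be embedded at search time.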