I'm working on an NLP project where I have to compare the similarity between many sentences, e.g. from this dataframe:
The first thing I tried was to join the dataframe with itself to get the format below and compare row by row:
The problem with this is that I run out of memory quickly for medium/big datasets: a self-join of 10k rows produces 100M rows, which I cannot fit in RAM.
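Roughly, that join was a full cross join of the dataframe with itself, something like this sketch (illustrative only, exact column names aside):
# cross join the dataframe with itself (pandas >= 1.2); n rows become n**2 rows,
# hence ~100M rows for a 10k-row dataframe
pairs = df_sample.merge(df_sample, how="cross", suffixes=("_a", "_b"))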
My current approach is to iterate over the dataframe as follows:
import copy
import pandas as pd

final = pd.DataFrame()
### for each row
for i in range(len(df_sample)):
    ### select the corresponding vector to compare with
    v = df_sample[df_sample.index.isin([i])]["use_vector"].values
    ### compare all cases against the selected vector
    similarities = df_sample.apply(lambda x: cosine_similarity_numba(x.use_vector, v[0]), axis=1)
    ### keep the cases with a similarity over a given threshold, in this case 0.6
    temp = df_sample[similarities > 0.6]
    ### filter out the base case
    temp = temp[~temp.index.isin([i])]
    temp["original_question"] = copy.copy(df_sample[df_sample.index.isin([i])]["questions"].values[0])
    ### append the result
    final = pd.concat([final, temp])
But this approach is not fast either. How can I improve the performance of this process?
One possible trick you can employ is to switch from a sparse TF-IDF representation to dense word embeddings from Facebook's fastText:
import fasttext
# download and unzip first:
# wget https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.en.300.bin.gz
model = fasttext.load_model("./cc.en.300.bin")
Then you can proceed to calculate cosine similarity with more space-efficient, context-aware and (arguably) better-performing dense word embeddings:
import pandas as pd

df = pd.DataFrame({"questions": ["This is a question",
                                 "This is a similar questin",
                                 "And this one is absolutely different"]})
df["vecs"] = df["questions"].apply(model.get_sentence_vector)
import numpy as np
from scipy.spatial.distance import pdist, squareform

# pairwise cosine *distance* between all rows (distance = 1 - similarity);
# vectorized, and only the upper triangle is computed, so no doubling of data
out = pdist(np.stack(df["vecs"]), metric="cosine")
cosine_distance = squareform(out)
print(cosine_distance)
[[0. 0.08294727 0.25305626]
[0.08294727 0. 0.23575631]
[0.25305626 0.23575631 0. ]]
Note as well that, on top of the memory efficiency, you also gain roughly a 10x speed increase due to using the vectorized cosine distance from SciPy instead of a row-by-row apply.
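If you need the actual similarities back, for example to keep only the pairs above the 0.6 threshold from the question, cosine similarity is simply 1 minus the cosine distance. A minimal sketch building on the matrix above:
import numpy as np

# cosine similarity = 1 - cosine distance
sim = 1 - cosine_distance
# look at the upper triangle only, so each pair appears once
# and self-comparisons are skipped
i_upper, j_upper = np.triu_indices(len(sim), k=1)
mask = sim[i_upper, j_upper] > 0.6  # the question's threshold
pairs = [(int(i), int(j)) for i, j in zip(i_upper[mask], j_upper[mask])]
print(pairs)  # all three toy pairs pass: [(0, 1), (0, 2), (1, 2)]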
Another possible trick is to cast your vectors from the default float64 down to float32 or float16:
df["vecs"] = df["vecs"].apply(np.float16)
which will give you both speed and memory gains.
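You can check the effect directly on the stacked array (a quick sketch reusing the df from above):
import numpy as np

vecs = np.stack(df["vecs"])
print(vecs.dtype, vecs.nbytes)  # float16 uses 2 bytes per value,
                                # vs 4 for float32 and 8 for float64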
I just wrote an answer yesterday to a problem similar to yours: Top-K Cosine Similarity rows in a dataframe of pandas.
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

data = {"use_vector": [[-0.1, -0.2, 0.3], [0.1, -0.2, -0.3], [-0.1, 0.2, -0.3]]}
df = pd.DataFrame(data)
print("Data: \n{}\n".format(df))

# stack the list column into a single 2D array
A = np.array(df["use_vector"].tolist())
vectors_num = len(A)
# get the similarities matrix; the value for each pair sits at the
# corresponding index of the upper triangle of the matrix
similarities = cosine_similarity(A)
# set symmetrical (repeated) and diagonal (similarity to self) entries to -2
similarities[np.tril_indices(vectors_num)] = -2
print("Similarities: \n{}\n".format(similarities))
Outputs:
Data:
use_vector
0 [-0.1, -0.2, 0.3]
1 [0.1, -0.2, -0.3]
2 [-0.1, 0.2, -0.3]
Similarities:
[[-2. -0.42857143 -0.85714286] # vector 0 & 1, 2
[-2. -2. 0.28571429] # vector 1 & 2
[-2. -2. -2. ]]
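From that matrix you can then pull out the top-K pairs (or everything above a threshold such as the question's 0.6); a small sketch continuing from the code above:
# top-k most similar pairs; the masked lower triangle (-2) never wins
k = 2
flat_idx = np.argsort(similarities, axis=None)[::-1][:k]
for idx in flat_idx:
    r, c = divmod(int(idx), vectors_num)
    print("vectors {} & {}: similarity {:.4f}".format(r, c, similarities[r, c]))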