I want to find the cosine similarity (or Euclidean distance, if that is easier) between one query row and 10 other rows. The rows are full of NaN values, so any column that is NaN should be ignored.
For example, query :
A B C D E F
3 2 NaN 5 NaN 4
df =
A B C D E F
2 1 3 NaN 4 5
1 NaN 2 4 NaN 3
. . . . . .
. . . . . .
So I just want to get the cosine similarity over every column that is non-null in both the query and the row from df. For row 0 in df, for example, A, B, and F are non-null in both the query and df.
I then want to print the cosine similarity for each row.
Thanks in advance
For Euclidean distance, see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.nan_euclidean_distances.html. It ignores NaNs in its calculations.
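As a rough sketch of the Euclidean option, the snippet below runs `nan_euclidean_distances` on the example data from the question. The variable names `query`, `df`, and `distances` are illustrative. Note that scikit-learn does not just skip the missing coordinates: it also scales the squared distance by (total number of columns / number of columns present in both rows), so the values will differ from a plain distance over the shared columns.

```python
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import nan_euclidean_distances

cols = list("ABCDEF")
# Example data copied from the question; NaN marks a missing value.
query = pd.DataFrame([[3, 2, np.nan, 5, np.nan, 4]], columns=cols)
df = pd.DataFrame(
    [[2, 1, 3, np.nan, 4, 5],
     [1, np.nan, 2, 4, np.nan, 3]],
    columns=cols,
)

# Result has shape (1, n_rows); take the single row to get one distance per df row.
distances = nan_euclidean_distances(query, df)[0]
for idx, dist in zip(df.index, distances):
    print(idx, dist)
```

Smaller values mean the row is closer to the query, so sorting `distances` in ascending order gives a ranking.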
For cosine similarity, you cannot simply fillna as this will change your similarity score. Instead, take subsets of your df and calculate the cosine similarity across columns that do not contain null values.
For your example dataframe, this would calculate cosine similarity across all rows using just columns A and F, between the query and the first row of df using A, B, and F, and between the query and the second row of df using A, D, and F. You would then need some sort of ranking to decide which score to use.
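from scipy.spatial.distance import pdist

# df is assumed here to also contain the query row
# collect the non-null column labels of every row
combinations = [list(row.dropna().index) for _, row in df.iterrows()]
# remove duplicate null combinations
combinations = [list(item) for item in set(tuple(row) for row in combinations)]
for cols in combinations:
    # pdist returns cosine *distance*; similarity is 1 - distance
    print(cols, 1 - pdist(df[cols].dropna(), metric='cosine'))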
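As a more direct alternative for the one-query-versus-many-rows case, here is a minimal sketch that masks each pair down to the columns that are non-null in both the query and the row, and then applies the usual cosine formula. The helper name `masked_cosine_similarity` is hypothetical, not a library function, and the sketch assumes `query` and `df` share the same column order.

```python
import numpy as np
import pandas as pd

def masked_cosine_similarity(query, row):
    """Cosine similarity over the columns that are non-null in both inputs."""
    q = np.asarray(query, dtype=float)
    r = np.asarray(row, dtype=float)
    mask = ~np.isnan(q) & ~np.isnan(r)   # columns present in both
    if not mask.any():
        return np.nan                    # no shared columns to compare
    q, r = q[mask], r[mask]
    denom = np.linalg.norm(q) * np.linalg.norm(r)
    return float(q @ r / denom) if denom else np.nan

cols = list("ABCDEF")
# Example data copied from the question.
query = pd.Series([3, 2, np.nan, 5, np.nan, 4], index=cols)
df = pd.DataFrame(
    [[2, 1, 3, np.nan, 4, 5],
     [1, np.nan, 2, 4, np.nan, 3]],
    columns=cols,
)

# One similarity per row of df, indexed like df.
similarities = df.apply(lambda row: masked_cosine_similarity(query, row), axis=1)
print(similarities)
```

Each score here is based on a different set of columns (A, B, F for row 0; A, D, F for row 1), so scores built on very few shared columns may deserve less trust when ranking.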