
Can I use cosine similarity between rows using only non null values?

I want to find the cosine similarity (or Euclidean distance, if that is easier) between one query row and 10 other rows. The rows contain many NaN values, and any column that is NaN should be ignored.

For example, query =

A   B   C   D   E   F
3   2  NaN  5  NaN  4

df =

A   B   C   D   E   F
2   1   3  NaN  4   5
1  NaN  2   4  NaN  3
.   .   .   .   .   .
.   .   .   .   .   .

So I just want the cosine similarity over every non-null column that the query and each row of df have in common. For row 0 in df, columns A, B, and F are non-null in both the query and df.

I then want to print the cosine similarity for each row.

Thanks in advance

toothsie asked Oct 17 '22 07:10



1 Answer

For Euclidean distance, see https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.nan_euclidean_distances.html. This ignores NaNs in its calculations.
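As a rough sketch of how that function could be applied to the example from the question: the arrays below just copy the query and the two visible rows of df, and the variable names are my own. Note that, per the scikit-learn documentation, the squared distance over the shared coordinates is scaled up by (total columns / shared columns), so the result is not the plain Euclidean distance on the subset.

```python
import numpy as np
from sklearn.metrics.pairwise import nan_euclidean_distances

# Query and the two visible rows of df from the question (columns A..F).
query = np.array([[3, 2, np.nan, 5, np.nan, 4]])
rows = np.array([
    [2, 1, 3, np.nan, 4, 5],
    [1, np.nan, 2, 4, np.nan, 3],
])

# Result has shape (1, n_rows); take the single row of distances.
# Only columns that are non-NaN in both vectors contribute.
dists = nan_euclidean_distances(query, rows)[0]
print(dists)
```

Smaller values mean the row is closer to the query, so sorting dists ascending gives a ranking.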

For cosine similarity, you cannot simply fillna as this will change your similarity score. Instead, take subsets of your df and calculate the cosine similarity across columns that do not contain null values.
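To illustrate why filling changes the score, here is a small sketch using the query and row 0 from the question. The helper cosine and the variable names are hypothetical, and filling with 0 is just one assumed choice of fill value.

```python
import numpy as np

def cosine(u, v):
    """Plain cosine similarity of two 1-D arrays that contain no NaNs."""
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

query = np.array([3, 2, np.nan, 5, np.nan, 4])
row0 = np.array([2, 1, 3, np.nan, 4, 5])

# Keep only the positions where both vectors have a value (A, B, F here).
shared = ~np.isnan(query) & ~np.isnan(row0)
masked_score = cosine(query[shared], row0[shared])

# Replacing NaN with 0 leaves the dot product unchanged, but the values
# in non-shared columns still enlarge both norms and lower the score.
filled_score = cosine(np.nan_to_num(query), np.nan_to_num(row0))

print(masked_score, filled_score)
```

The two printed numbers differ because the filled version divides the same dot product by larger norms.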

For your example dataframe, the subset approach would calculate cosine similarity across all rows using just columns A and F, across the query and row 1 using A, B, and F, and across the query and row 2 using A, D, and F. You would then need some way of ranking the resulting scores to decide which one to use.

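from scipy.spatial.distance import pdist

# collect the non-null column names of each row
combinations = [list(row.dropna().index) for _, row in df.iterrows()]

# remove duplicate null combinations
combinations = [list(item) for item in set(tuple(row) for row in combinations)]

# pdist returns cosine distance (1 - similarity) for every pair of complete rows
for cols in combinations:
    print(cols, 1 - pdist(df[cols].dropna(), metric='cosine'))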
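If what you want is one score per row against the query, an alternative is to mask each pair separately. This is only a sketch under my own assumptions: masked_cosine_similarity is a hypothetical helper, not a pandas or scikit-learn function, and the data simply repeats the two visible rows from the question.

```python
import numpy as np
import pandas as pd

def masked_cosine_similarity(query: pd.Series, df: pd.DataFrame) -> pd.Series:
    """Cosine similarity of query to each row of df over shared non-null columns."""
    scores = {}
    for idx, row in df.iterrows():
        # Columns that are non-null in both the query and this row.
        shared = query.notna() & row.notna()
        q = query[shared].to_numpy(dtype=float)
        r = row[shared].to_numpy(dtype=float)
        denom = np.linalg.norm(q) * np.linalg.norm(r)
        # No shared columns or an all-zero vector: similarity is undefined.
        scores[idx] = float(q @ r / denom) if denom > 0 else np.nan
    return pd.Series(scores, name="cosine_similarity")

cols = list("ABCDEF")
query = pd.Series([3, 2, np.nan, 5, np.nan, 4], index=cols)
df = pd.DataFrame(
    [[2, 1, 3, np.nan, 4, 5],
     [1, np.nan, 2, 4, np.nan, 3]],
    columns=cols,
)

similarities = masked_cosine_similarity(query, df)
print(similarities)
```

Keep in mind that each score is computed over a different set of columns, so rows with very few shared columns can score high by chance. You may want to require a minimum number of shared columns before comparing scores.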
Mattie Knebel-Langford answered Oct 21 '22 03:10
