
Cosine similarity between each row in a DataFrame in Python

I have a DataFrame containing multiple vectors, each with 3 entries. Each row is one vector in my representation. I need to calculate the cosine similarity between every pair of these vectors. Is it better to convert this to a matrix representation, or is there a cleaner approach within the DataFrame itself?

Here is the code that I have tried.

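import pandas as pd
from scipy import spatial

# X, Y, Z hold the three vector components (their definitions are not shown)
df = pd.DataFrame([X, Y, Z]).T
similarities = df.values.tolist()

# Compare every row against every row
for x in similarities:
    for y in similarities:
        # cosine similarity = 1 - cosine distance
        result = 1 - spatial.distance.cosine(x, y)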
Jayanth Prakash Kulkarni asked Jul 29 '17 09:07


People also ask

How do I compare two rows in a data frame?

During data analysis, you may need to compute the difference between two rows for comparison purposes. This can be done with the pandas DataFrame.diff() function.

How do you find cosine similarity in Python?

Use the scipy module to calculate the cosine similarity between two lists in Python. The scipy.spatial.distance.cosine() function returns the cosine distance rather than the cosine similarity, so subtract the distance from 1 to get the similarity.
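As a minimal sketch of that subtraction, assuming scipy is installed, the two vectors below are rows 0 and 1 of the demo DataFrame in the answer further down; the variable names are only illustrative.

```python
from scipy.spatial.distance import cosine

a = [1, 1, 1, 0, 0]
b = [0, 0, 1, 1, 1]

# cosine() returns the cosine *distance*, so subtract it from 1
# to recover the cosine similarity.
similarity = 1 - cosine(a, b)
print(similarity)
```

The vectors share one non-zero position and each has norm sqrt(3), so the similarity is 1/3, matching the corresponding entry in the answer's output matrix.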

How do you find the cosine similarity between two documents?

Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.
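The definition can be written out directly with NumPy as the dot product divided by the product of the norms. This is a rough sketch; `cosine_sim` is a hypothetical helper name, and it assumes neither vector is all zeros (the ratio is undefined in that case).

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between u and v (both assumed non-zero)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

# Vectors pointing the same way score 1, perpendicular vectors score 0.
same_direction = cosine_sim([1, 2, 3], [2, 4, 6])
orthogonal = cosine_sim([1, 0], [0, 1])
print(same_direction, orthogonal)
```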

How do you find the similarity between two columns in Python?

First, concatenate the two columns of interest into a new DataFrame, then drop the rows that contain NaN. The two columns then hold only corresponding rows, and you can compare them with cosine distance or any other pairwise distance you wish.
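Those steps might look like the following sketch. The column names `a`, `b`, `c` and the values are made up for illustration, and scipy is assumed to be available.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cosine

# Hypothetical data with gaps in different rows of each column.
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [2.0, 4.0, 6.0, np.nan],
    "c": [0.0, 1.0, 2.0, 3.0],
})

# Keep only the two columns of interest and drop rows where either is NaN.
pair = pd.concat([df["a"], df["b"]], axis=1).dropna()

# Cosine similarity = 1 - cosine distance on the aligned values.
similarity = 1 - cosine(pair["a"].to_numpy(), pair["b"].to_numpy())
print(len(pair), similarity)
```

Only the first two rows survive the `dropna()`, and since column `b` is exactly twice column `a` there, the similarity comes out as 1.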


1 Answer

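You can use sklearn.metrics.pairwise.cosine_similarity directly on the DataFrame. It computes the similarity between every pair of rows and returns the result as a square array.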

Demo

import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

df = pd.DataFrame(np.random.randint(0, 2, (3, 5)))

df
##     0  1  2  3  4
##  0  1  1  1  0  0
##  1  0  0  1  1  1
##  2  0  1  0  1  0

cosine_similarity(df)
##  array([[ 1.        ,  0.33333333,  0.40824829],
##         [ 0.33333333,  1.        ,  0.40824829],
##         [ 0.40824829,  0.40824829,  1.        ]])
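If scikit-learn is not available, roughly the same matrix can be obtained with NumPy alone by scaling each row to unit length and multiplying by the transpose. This is a sketch under the assumption that no row is all zeros; it hard-codes the demo rows shown above (the original used random data) and wraps the result in a DataFrame so the row labels are kept.

```python
import numpy as np
import pandas as pd

# Same rows as the demo output above, fixed here for reproducibility.
df = pd.DataFrame([[1, 1, 1, 0, 0],
                   [0, 0, 1, 1, 1],
                   [0, 1, 0, 1, 0]])

values = df.to_numpy(dtype=float)

# Divide each row by its Euclidean norm (assumes no all-zero rows).
norms = np.linalg.norm(values, axis=1, keepdims=True)
unit_rows = values / norms

# Dot products of unit vectors are the pairwise cosine similarities.
sim = pd.DataFrame(unit_rows @ unit_rows.T, index=df.index, columns=df.index)
print(sim)
```

Labelling the result with `df.index` makes it possible to look up a pair by row label, for example `sim.loc[0, 2]`.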
miradulo answered Oct 11 '22 16:10
