I have a DataFrame containing multiple vectors, each with 3 entries. Each row is a vector in my representation. I need to calculate the cosine similarity between each pair of these vectors. Is it better to convert this to a matrix representation, or is there a cleaner approach within the DataFrame itself?
Here is the code that I have tried.
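import pandas as pd
from scipy import spatial

# X, Y, Z are the input sequences (defined elsewhere); after .T each row is one 3-entry vector
df = pd.DataFrame([X, Y, Z]).T
similarities = df.values.tolist()

# Compare every row with every other row
for x in similarities:
    for y in similarities:
        # scipy returns the cosine distance, so similarity = 1 - distance
        result = 1 - spatial.distance.cosine(x, y)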
During data analysis, one might need to compute the difference between two rows for comparison purposes. This can be done with the pandas DataFrame.diff() function.
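As a small illustration of the row-difference idea, here is a minimal sketch. The column names and values are made up for the example and are not from the original question.

```python
import pandas as pd

# Hypothetical data, chosen only to make the differences easy to read
df = pd.DataFrame({"a": [1, 4, 9], "b": [10, 20, 40]})

# diff() subtracts the previous row from each row;
# the first row has no predecessor, so it becomes NaN
row_diff = df.diff()
print(row_diff)
```

Note that this gives element-wise differences between consecutive rows, which is a different comparison from cosine similarity.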
Use the scipy module to calculate the cosine similarity between two lists in Python. The spatial.distance.cosine() function from the scipy module returns the cosine distance rather than the cosine similarity, so to get the similarity, subtract the distance from 1.
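A minimal sketch of the scipy approach for two plain lists. The vectors a, b and c are made-up examples: b is a multiple of a, and c is chosen to be orthogonal to a.

```python
from scipy.spatial import distance

a = [1, 2, 3]
b = [2, 4, 6]    # same direction as a
c = [3, -3, 1]   # dot product with a is 3 - 6 + 3 = 0, so orthogonal to a

# distance.cosine returns the cosine distance, so similarity = 1 - distance
sim_ab = 1 - distance.cosine(a, b)
sim_ac = 1 - distance.cosine(a, c)

print(sim_ab, sim_ac)
```

Up to floating-point rounding, sim_ab should come out as 1 and sim_ac as 0.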
Cosine similarity measures the similarity between two vectors of an inner product space. It is measured by the cosine of the angle between two vectors and determines whether two vectors are pointing in roughly the same direction. It is often used to measure document similarity in text analysis.
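To make the definition concrete, here is a sketch that computes the cosine of the angle directly with NumPy, as the dot product divided by the product of the norms. The helper name cosine_sim is hypothetical, and the sketch assumes neither vector is all zeros.

```python
import numpy as np

def cosine_sim(u, v):
    """Cosine of the angle between u and v (assumes both are non-zero)."""
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

same_direction = cosine_sim([1, 0], [3, 0])   # vectors point the same way
orthogonal = cosine_sim([1, 0], [0, 2])       # vectors at a right angle
opposite = cosine_sim([1, 0], [-2, 0])        # vectors point in opposite directions

print(same_direction, orthogonal, opposite)
```

The length of the vectors does not matter here, only their direction, which is why the measure is popular for comparing documents of different sizes.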
First, concatenate the 2 columns of interest into a new data frame. Then drop the rows containing NaN. After that, the 2 columns contain only rows where both have values, and you can compare them with cosine distance or any other pairwise distance you wish.
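The steps above could look roughly like this. The column names "a" and "b" and their values are assumptions made for the sketch.

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance

# Hypothetical frame with gaps in both columns
df = pd.DataFrame({
    "a": [1.0, 2.0, np.nan, 4.0],
    "b": [2.0, np.nan, 3.0, 8.0],
})

# Step 1: concatenate the two columns of interest into a new frame
# Step 2: drop every row where either column is NaN
pair = pd.concat([df["a"], df["b"]], axis=1).dropna()

# Step 3: compare the aligned columns (similarity = 1 - cosine distance)
similarity = 1 - distance.cosine(pair["a"].to_numpy(), pair["b"].to_numpy())

print(pair)
print(similarity)
```

Only the first and last rows survive dropna() here, and in those rows b is exactly twice a, so the similarity should be 1 up to rounding.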
You can use sklearn.metrics.pairwise.cosine_similarity directly.
Demo
import numpy as np
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity
# Random 0/1 entries, so your values will differ from the output shown here
df = pd.DataFrame(np.random.randint(0, 2, (3, 5)))
df
## 0 1 2 3 4
## 0 1 1 1 0 0
## 1 0 0 1 1 1
## 2 0 1 0 1 0
cosine_similarity(df)
## array([[ 1. , 0.33333333, 0.40824829],
## [ 0.33333333, 1. , 0.40824829],
## [ 0.40824829, 0.40824829, 1. ]])
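If you would rather keep the result in a DataFrame than in a bare array, one option is to wrap the output and reuse the row labels. This is only a sketch: it hard-codes the rows printed in the demo so the numbers are reproducible, and the labels v0, v1 and v2 are made up.

```python
import pandas as pd
from sklearn.metrics.pairwise import cosine_similarity

# Same rows as the demo output, with hypothetical row labels
df = pd.DataFrame(
    [[1, 1, 1, 0, 0],
     [0, 0, 1, 1, 1],
     [0, 1, 0, 1, 0]],
    index=["v0", "v1", "v2"],
)

# cosine_similarity compares every row with every other row;
# wrapping the array keeps the labels on both axes
sim = pd.DataFrame(
    cosine_similarity(df.to_numpy()),
    index=df.index,
    columns=df.index,
)

print(sim)
print(sim.loc["v0", "v1"])  # look up one pair by label
```

The labeled frame makes it easy to pick out a single pair, and the values should match the array shown in the demo.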