Best way to identify dissimilarity: Euclidean Distance, Cosine Distance, or Simple Subtraction?

I'm new to data science and am learning different techniques that I can apply with Python. At the moment, I'm trying them out on Spotify's API with my own playlists.

The goal is to find the most dissimilar features between two different playlists.

My question is what is the best way to identify the most dissimilar features between these two playlists?

I started off by getting all the tracks in each playlist and their respective features. I then computed the mean of each of the features.

Here is the DataFrame I ended up with; each value is the mean of that feature over all tracks in the respective playlist (a minimal sketch of this step follows the table):

                   playlist1  playlist2
                   --------------------
danceability      | 0.667509   0.592140
energy            | 0.598873   0.468020
acousticness      | 0.114511   0.398372
valence           | 0.376920   0.287250
instrumentalness  | 0.005238   0.227783
speechiness       | 0.243587   0.088612
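Roughly, the aggregation step looks like this (a simplified sketch, not my exact code; `features_p1` / `features_p2` are placeholders for the lists of per-track audio-feature dicts returned by the API, and the API calls themselves are omitted):

    import pandas as pd

    FEATURES = ["danceability", "energy", "acousticness",
                "valence", "instrumentalness", "speechiness"]

    def playlist_means(track_features, name):
        """Average each audio feature over all tracks of one playlist."""
        df = pd.DataFrame(track_features)[FEATURES]
        return df.mean().rename(name)

    # features_p1 / features_p2: lists of audio-feature dicts, one per track
    # means = pd.concat([playlist_means(features_p1, "playlist1"),
    #                    playlist_means(features_p2, "playlist2")], axis=1)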

I did some digging and found two common procedures (a quick sketch of both, applied to the means above, follows the list):

1. Euclidean Distance

2. Cosine Similarity
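For concreteness, here is how both would look on the playlist-mean vectors from the table above, using SciPy (note that SciPy's `cosine` returns the cosine distance, i.e. 1 minus the similarity):

    import numpy as np
    from scipy.spatial.distance import euclidean, cosine

    # Playlist means in the order: danceability, energy, acousticness,
    # valence, instrumentalness, speechiness
    p1 = np.array([0.667509, 0.598873, 0.114511, 0.376920, 0.005238, 0.243587])
    p2 = np.array([0.592140, 0.468020, 0.398372, 0.287250, 0.227783, 0.088612])

    print("Euclidean distance:", euclidean(p1, p2))  # straight-line distance
    print("Cosine distance:   ", cosine(p1, p2))     # 1 - cosine similarity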

For some reason I couldn't wrap my head around which one to use, so instead I computed the absolute difference between each pair of feature means; simple subtraction made sense to me intuitively. The feature with the greatest difference would be the 'most dissimilar'.

With this approach, I got the results below and concluded that energy and acousticness are the most dissimilar (the short snippet after the table reproduces this ranking):

                     playlist1   playlist2   absoluteDifference
                     ------------------------------------------
danceability       | 0.531071    0.592140    0.061069
energy             | 0.871310    0.468020    0.403290
acousticness       | 0.041479    0.398372    0.356893
valence            | 0.501890    0.287250    0.214640
instrumentalness   | 0.049012    0.227783    0.178771
speechiness        | 0.109587    0.088612    0.020975
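The ranking itself is short, assuming `means` is the two-column DataFrame of per-playlist feature means sketched earlier:

    # "Simple subtraction": absolute difference of the per-playlist means,
    # sorted so the most dissimilar features come first.
    diff = (means["playlist1"] - means["playlist2"]).abs()
    ranking = diff.sort_values(ascending=False).rename("absoluteDifference")
    print(ranking)  # energy and acousticness end up on top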

Is my intuition correct, and when would we use the aforementioned techniques? Would either of them be applicable in a situation such as this?

Eventually, I want to take the top two dissimilarities and make them my axes for KNN. My intuition is that if I can identify the most dissimilar features of the two playlists, I'll have cleaner and better-defined features for each playlist and can more accurately predict which playlist a song ought to belong to.
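For reference, the follow-up step I have in mind would look roughly like this with scikit-learn (a sketch, not working code yet; `tracks_df` is a placeholder for a per-track DataFrame with the audio features and a "playlist" label column):

    from sklearn.model_selection import train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    top_features = ["energy", "acousticness"]   # the two largest differences
    X = tracks_df[top_features]                 # tracks_df: one row per track
    y = tracks_df["playlist"]                   # which playlist the track is in

    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
    knn = KNeighborsClassifier(n_neighbors=5).fit(X_train, y_train)
    print("held-out accuracy:", knn.score(X_test, y_test))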

asked Jan 02 '23 by Mustafa
1 Answer

Let me start off with a few short remarks on Euclidean Distance and Cosine Similarity:

Euclidean Distance measures how far apart two points in an n-dimensional space are, i.e. it measures the length of a straight line from point A to point B.

Cosine Similarity measures their similarity in orientation, i.e. the angle between the lines from the origin to points A and B.

Let me add an image to underline my thoughts (an illustration of the different metrics). The Euclidean distance between points A and B is depicted in red, the cosine similarity in green; by that I don't literally mean the actual values of the measures, but rather what is relevant to their calculation.

Now let me talk about measures in general: any and all measures depict some kind of similarity. There is no such thing as a universal "best metric". The metric that suits your problem best is always determined by the problem.

I have added some extra points to the image to show that fact (a small numeric check follows the list):

  • The points D and E have the same cosine similarity as A and B, but a vastly different Euclidean distance
  • Conversely, the points A and F have a vastly different cosine similarity from A and B, but the same Euclidean distance
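You can verify that effect numerically; the points below are chosen purely for illustration and are not the exact ones in the figure:

    import numpy as np
    from scipy.spatial.distance import euclidean, cosine

    A, B = np.array([1.0, 1.0]), np.array([2.0, 1.0])
    D, E = 3 * A, 3 * B                      # same directions, just farther out

    print(cosine(A, B), cosine(D, E))        # identical cosine distance
    print(euclidean(A, B), euclidean(D, E))  # Euclidean distance triples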

Now, let me make a remark about the appropriate choice of metric for your specific problem: you wish to evaluate how far apart features are. The bigger the difference, the farther apart the features; you don't care about angles between points at all. That is a clear point for Euclidean distance. You may not have realized it, but you actually used Euclidean distance in your example: each feature is compared on its own, one-dimensionally, and in 1D the Euclidean distance is equal to the absolute difference.
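A quick check with the energy means from your second table:

    import math

    p1_energy, p2_energy = 0.871310, 0.468020
    print(math.dist([p1_energy], [p2_energy]))  # 0.40329 (1-D Euclidean distance)
    print(abs(p1_energy - p2_energy))           # 0.40329 (absolute difference)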

answered May 01 '23 by Lukas Thaler