
How to calculate distance for every row in a pandas dataframe from a single point efficiently?

Tags:

python

pandas

I have a point:

point = np.array([0.07852388, 0.60007135, 0.92925712, 0.62700219, 0.16943809,
       0.34235233])

And a pandas dataframe:

           a           b           c           d           e           f
0   0.025641    0.554686    0.988809    0.176905    0.050028    0.333333
1   0.027151    0.520914    0.985590    0.409572    0.163980    0.424242
2   0.028788    0.478810    0.970480    0.288557    0.095053    0.939394
3   0.018692    0.450573    0.985910    0.178048    0.118399    0.484848
4   0.023256    0.787253    0.865287    0.217591    0.205670    0.303030

I would like to calculate the distance from every row in the dataframe to that specific point.

I tried:

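import numpy as np

# df_scaled is the dataframe shown above; cols_list holds its columns a to f.
d_all = []
for index, row in df_scaled[cols_list].iterrows():
    # Euclidean distance between the point and the current row
    d = np.linalg.norm(point - row.to_numpy())
    d_all.append(d)
df_scaled['distance_cluster'] = d_all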

My solution is really slow, though, especially since I also want to calculate the distance from other points.

Is there a way to do these calculations more efficiently?

quant asked Oct 15 '20 15:10



3 Answers

Another option is to use cdist, which is faster in the timings below:

from scipy.spatial.distance import cdist
cdist(point[None,], df.values)

Output:

array([[0.47468985, 0.25707985, 0.70385676, 0.5035961 , 0.46115096]])

Timing comparison on a dataframe with 100k rows:

%%timeit -n 10
cdist([point], df.values)
645 µs ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
np.linalg.norm(df.to_numpy() - point, axis=1)
5.16 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n 10
df.sub(point, axis=1).pow(2).sum(axis=1).pow(.5)
16.8 ms ± 444 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
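
Since the question mentions distances to other points as well, here is a minimal sketch of how cdist could handle several reference points in one call. The second point (`other`) and the `distance_0`, `distance_1` column names are made up for illustration; they are not from the question.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

# Example data copied from the question.
df = pd.DataFrame({
    "a": [0.025641, 0.027151, 0.028788, 0.018692, 0.023256],
    "b": [0.554686, 0.520914, 0.478810, 0.450573, 0.787253],
    "c": [0.988809, 0.985590, 0.970480, 0.985910, 0.865287],
    "d": [0.176905, 0.409572, 0.288557, 0.178048, 0.217591],
    "e": [0.050028, 0.163980, 0.095053, 0.118399, 0.205670],
    "f": [0.333333, 0.424242, 0.939394, 0.484848, 0.303030],
})
point = np.array([0.07852388, 0.60007135, 0.92925712,
                  0.62700219, 0.16943809, 0.34235233])

# Hypothetical second reference point, used only for illustration.
other = np.zeros(6)
points = np.vstack([point, other])       # shape (2, 6)

# dists[i, j] is the Euclidean distance from row i of df to points[j].
dists = cdist(df.to_numpy(), points)     # shape (5, 2)

# Store one column per reference point on a copy of the dataframe.
out = df.copy()
for j in range(points.shape[0]):
    out[f"distance_{j}"] = dists[:, j]
```

The whole distance matrix comes from a single call, so adding more reference points only means adding rows to `points`.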
Quang Hoang answered Sep 27 '22 17:09


You can compute vectorized Euclidean distance (L2 norm) using the formula

sqrt((a1 - b1)^2 + (a2 - b2)^2 + ...)

df.sub(point, axis=1).pow(2).sum(axis=1).pow(.5)

0    0.474690
1    0.257080
2    0.703857
3    0.503596
4    0.461151
dtype: float64

This gives the same output as your current code.


Or, using linalg.norm:

np.linalg.norm(df.to_numpy() - point, axis=1)
# array([0.47468985, 0.25707985, 0.70385676, 0.5035961 , 0.46115096])
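
If you want to convince yourself that the vectorized versions match the row-by-row loop, a self-contained check along these lines should work. It rebuilds the example dataframe from the question; the variable names `loop_result`, `pandas_result` and `numpy_result` are just for this sketch.

```python
import numpy as np
import pandas as pd

# Example data copied from the question.
df = pd.DataFrame({
    "a": [0.025641, 0.027151, 0.028788, 0.018692, 0.023256],
    "b": [0.554686, 0.520914, 0.478810, 0.450573, 0.787253],
    "c": [0.988809, 0.985590, 0.970480, 0.985910, 0.865287],
    "d": [0.176905, 0.409572, 0.288557, 0.178048, 0.217591],
    "e": [0.050028, 0.163980, 0.095053, 0.118399, 0.205670],
    "f": [0.333333, 0.424242, 0.939394, 0.484848, 0.303030],
})
point = np.array([0.07852388, 0.60007135, 0.92925712,
                  0.62700219, 0.16943809, 0.34235233])

# Row-by-row loop, as in the question.
loop_result = np.array([np.linalg.norm(point - row.to_numpy())
                        for _, row in df.iterrows()])

# Vectorized pandas version: subtract, square, sum per row, square root.
pandas_result = df.sub(point, axis=1).pow(2).sum(axis=1).pow(.5)

# Vectorized NumPy version: broadcasting plus a row-wise norm.
numpy_result = np.linalg.norm(df.to_numpy() - point, axis=1)

# All three approaches agree up to floating-point error.
assert np.allclose(loop_result, pandas_result.to_numpy())
assert np.allclose(loop_result, numpy_result)
```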
cs95 answered Sep 27 '22 17:09


You can also do this with SciPy's cdist:

from scipy.spatial import distance
ary = distance.cdist(df.values, np.array([point]), metric='euclidean')
ary
Out[57]: 
array([[0.47468985],
       [0.25707985],
       [0.70385676],
       [0.5035961 ],
       [0.46115096]])
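
Note that the result here has shape (n, 1), so it needs flattening before it can be stored as a column. A rough sketch, using random stand-in data rather than the dataframe from the question, and a hypothetical column name `distance`:

```python
import numpy as np
import pandas as pd
from scipy.spatial import distance

# Random stand-in data with the same layout as the question (5 rows, columns a to f).
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((5, 6)), columns=list("abcdef"))
point = rng.random(6)

# cdist expects two 2-D arrays, so the single point is wrapped in an outer array.
ary = distance.cdist(df.to_numpy(), np.array([point]), metric="euclidean")

# ary has shape (5, 1); ravel() flattens it to shape (5,) so it fits a column.
flat = ary.ravel()

# Assign on a copy so the feature columns stay untouched.
df_out = df.copy()
df_out["distance"] = flat
```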
BENY answered Sep 27 '22 18:09