I have a point
point = np.array([0.07852388, 0.60007135, 0.92925712, 0.62700219, 0.16943809,
                  0.34235233])
And a pandas dataframe
          a         b         c         d         e         f
0  0.025641  0.554686  0.988809  0.176905  0.050028  0.333333
1  0.027151  0.520914  0.985590  0.409572  0.163980  0.424242
2  0.028788  0.478810  0.970480  0.288557  0.095053  0.939394
3  0.018692  0.450573  0.985910  0.178048  0.118399  0.484848
4  0.023256  0.787253  0.865287  0.217591  0.205670  0.303030
I would like to calculate the distance from every row in the pandas dataframe to that specific point.
I tried
import numpy as np

d_all = list()
for index, row in df_scaled[cols_list].iterrows():
    d = np.linalg.norm(centroid - np.array(list(row[cols_list])))
    d_all += [d]

df_scaled['distance_cluster'] = d_all
My solution is really slow though (especially since I also want to calculate the distance from other points as well).
Is there a way to do my calculations more efficiently?
Vectorization is always the first and best choice here: operate on the DataFrame's underlying NumPy array in one shot instead of iterating over the rows with iterrows().
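For example, the loop in the question collapses to a single vectorized expression. A minimal sketch reusing the question's own names (df_scaled, cols_list, centroid are assumed to be defined as in the original code):

import numpy as np

# Subtract the point from every row at once (NumPy broadcasting),
# then take the row-wise Euclidean norm -- no Python-level loop.
diff = df_scaled[cols_list].to_numpy() - centroid
df_scaled['distance_cluster'] = np.linalg.norm(diff, axis=1)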
Another option is to use cdist, which is a bit faster:
from scipy.spatial.distance import cdist
cdist(point[None,], df.values)
Output:
array([[0.47468985, 0.25707985, 0.70385676, 0.5035961 , 0.46115096]])
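If the goal is to store the result as a DataFrame column (as in the question), note that cdist returns a 2-D array of shape (1, n). A short sketch (the column name is taken from the question):

from scipy.spatial.distance import cdist

# cdist returns a (1, n) array; take row 0 (or use .ravel())
# before assigning it back as a column.
df['distance_cluster'] = cdist(point[None, :], df.values)[0]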
Some comparison with 100k rows:
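The benchmark DataFrame itself isn't shown; a minimal setup along these lines (random data and names assumed) would reproduce a comparable 100k-row test:

import numpy as np
import pandas as pd

# 100k rows of uniform random values in 6 columns, plus a random query point.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100_000, 6)), columns=list('abcdef'))
point = rng.random(6)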
%%timeit -n 10
cdist([point], df.values)
645 µs ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
np.linalg.norm(df.to_numpy() - point, axis=1)
5.16 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
df.sub(point, axis=1).pow(2).sum(axis=1).pow(.5)
16.8 ms ± 444 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can compute the vectorized Euclidean distance (L2 norm) using the formula
sqrt((a1 - b1)^2 + (a2 - b2)^2 + ...)
df.sub(point, axis=1).pow(2).sum(axis=1).pow(.5)
0    0.474690
1    0.257080
2    0.703857
3    0.503596
4    0.461151
dtype: float64
This gives the same output as your current code.
Or, using np.linalg.norm:
np.linalg.norm(df.to_numpy() - point, axis=1)
# array([0.47468985, 0.25707985, 0.70385676, 0.5035961 , 0.46115096])
Let us do it with scipy:
from scipy.spatial import distance
ary = distance.cdist(df.values, np.array([point]), metric='euclidean')
ary
Out[57]:
array([[0.47468985],
[0.25707985],
[0.70385676],
[0.5035961 ],
[0.46115096]])
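Since the question also mentions computing distances to other points, it is worth noting that cdist accepts a whole 2-D array of points at once. A sketch with a few hypothetical centroids (the stacked values here are made up for illustration):

import numpy as np
from scipy.spatial import distance

# Stack several centroids into one (k, 6) array.
centroids = np.vstack([point, point * 0.5, point + 0.1])

# One call gives an (n_rows, k) matrix: entry (i, j) is the distance
# from DataFrame row i to centroid j.
dists = distance.cdist(df.values, centroids, metric='euclidean')

# e.g. distances of every row to the first centroid:
# dists[:, 0]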