I have a point
point = np.array([0.07852388, 0.60007135, 0.92925712, 0.62700219, 0.16943809,
                  0.34235233])
And a pandas dataframe
          a         b         c         d         e         f
0  0.025641  0.554686  0.988809  0.176905  0.050028  0.333333
1  0.027151  0.520914  0.985590  0.409572  0.163980  0.424242
2  0.028788  0.478810  0.970480  0.288557  0.095053  0.939394
3  0.018692  0.450573  0.985910  0.178048  0.118399  0.484848
4  0.023256  0.787253  0.865287  0.217591  0.205670  0.303030
I would like to calculate the distance from every row in the pandas dataframe to that specific point.
I tried
import numpy as np

d_all = list()
for index, row in df_scaled[cols_list].iterrows():
    d = np.linalg.norm(centroid - np.array(list(row[cols_list])))
    d_all += [d]

df_scaled['distance_cluster'] = d_all
My solution is really slow though (especially since I also want to calculate the distance from other points as well).
Is there a way to do my calculations more efficiently?
Vectorization is always the first and best choice here: operate on the DataFrame's underlying NumPy array in one shot instead of iterating over the rows with iterrows().
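For example, the loop in the question collapses to a single vectorized expression. A minimal sketch reusing the question's own names (df_scaled, cols_list, centroid are assumed to be defined as in the original code):

import numpy as np

# Subtract the point from every row at once (NumPy broadcasting),
# then take the row-wise Euclidean norm -- no Python-level loop.
diff = df_scaled[cols_list].to_numpy() - centroid
df_scaled['distance_cluster'] = np.linalg.norm(diff, axis=1)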
Another option is to use cdist, which is a bit faster:
from scipy.spatial.distance import cdist
cdist(point[None,], df.values)
Output:
array([[0.47468985, 0.25707985, 0.70385676, 0.5035961 , 0.46115096]])
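If the goal is to store the result as a DataFrame column (as in the question), note that cdist returns a 2-D array of shape (1, n). A short sketch (the column name is taken from the question):

from scipy.spatial.distance import cdist

# cdist returns a (1, n) array; take row 0 (or use .ravel())
# before assigning it back as a column.
df['distance_cluster'] = cdist(point[None, :], df.values)[0]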
Some comparison with 100k rows:
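The benchmark DataFrame itself isn't shown; a minimal setup along these lines (random data and names assumed) would reproduce a comparable 100k-row test:

import numpy as np
import pandas as pd

# 100k rows of uniform random values in 6 columns, plus a random query point.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100_000, 6)), columns=list('abcdef'))
point = rng.random(6)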
%%timeit -n 10
cdist([point], df.values)
645 µs ± 36.4 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
np.linalg.norm(df.to_numpy() - point, axis=1)
5.16 ms ± 227 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
%%timeit -n 10
df.sub(point, axis=1).pow(2).sum(axis=1).pow(.5)
16.8 ms ± 444 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
You can compute the vectorized Euclidean distance (L2 norm) using the formula
sqrt((a1 - b1)^2 + (a2 - b2)^2 + ...)
df.sub(point, axis=1).pow(2).sum(axis=1).pow(.5)
0    0.474690
1    0.257080
2    0.703857
3    0.503596
4    0.461151
dtype: float64
This gives the same output as your current code.
Or, using np.linalg.norm:
np.linalg.norm(df.to_numpy() - point, axis=1)
# array([0.47468985, 0.25707985, 0.70385676, 0.5035961 , 0.46115096])
Let us do it with scipy:
from scipy.spatial import distance
ary = distance.cdist(df.values, np.array([point]), metric='euclidean')
ary
Out[57]:
array([[0.47468985],
[0.25707985],
[0.70385676],
[0.5035961 ],
[0.46115096]])
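Since the question also mentions computing distances to other points, it is worth noting that cdist accepts a whole 2-D array of points at once. A sketch with a few hypothetical centroids (the stacked values here are made up for illustration):

import numpy as np
from scipy.spatial import distance

# Stack several centroids into one (k, 6) array.
centroids = np.vstack([point, point * 0.5, point + 0.1])

# One call gives an (n_rows, k) matrix: entry (i, j) is the distance
# from DataFrame row i to centroid j.
dists = distance.cdist(df.values, centroids, metric='euclidean')

# e.g. distances of every row to the first centroid:
# dists[:, 0]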