How can I calculate the Euclidean distance between all the rows of a dataframe? I am trying this code, but it is not working:
zero_data = data
distance = lambda column1, column2: pd.np.linalg.norm(column1 - column2)
result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
result.head()
My dataframe has 44062 rows and 278 columns.
Euclidean distance is calculated as the square root of the sum of the squared differences between the two vectors.
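For example, with two small made-up vectors (the numbers here are only for illustration), the definition can be checked directly in numpy:
import numpy as np
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
# square root of the sum of squared differences
np.sqrt(np.sum((a - b) ** 2))   # 5.0
# equivalent one-liner
np.linalg.norm(a - b)           # 5.0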
sklearn.metrics.pairwise.euclidean_distances(X, Y=None, *, Y_norm_squared=None, squared=False, X_norm_squared=None) considers the rows of X (and Y=X) as vectors and computes the distance matrix between each pair of vectors. For efficiency reasons, the Euclidean distance between a pair of row vectors x and y is computed as dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y)). This formulation has two advantages over other ways of computing distances.
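As a quick sketch (not from the original answer), that function could be applied to the rows of the dataframe like this; note that a full 44062 x 44062 float64 matrix needs roughly 16 GB of memory, just like cdist further below:
from sklearn.metrics.pairwise import euclidean_distances
# n_rows x n_rows matrix of distances between all rows of df
dist_matrix = euclidean_distances(df.values)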
Note that the same np.linalg.norm approach can also be used to calculate the Euclidean distance between two columns of a pandas DataFrame rather than two rows.
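For instance, with two hypothetical columns 'x' and 'y':
col_dist = np.linalg.norm(df['x'] - df['y'])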
To compute the Euclidean distance between two rows i and j of a dataframe df:
np.linalg.norm(df.loc[i] - df.loc[j])
To compute it between consecutive rows, i.e. 0 and 1, 1 and 2, 2 and 3, ...
np.linalg.norm(df.diff(axis=0).drop(0), axis=1)
If you want to compute it between all the rows, i.e. 0 and 1, 0 and 2, ..., 1 and 2, 1 and 3, ..., then you have to loop through all the combinations of i and j (keep in mind that for 44062 rows there are 970707891 such combinations, so using a for-loop will be very slow):
import itertools
for i, j in itertools.combinations(df.index, 2):
    d_ij = np.linalg.norm(df.loc[i] - df.loc[j])
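If you need to keep the results of that loop, a minimal variant (still very slow for 44062 rows) could store them in a dict keyed by the row pair:
import itertools
distances = {}
for i, j in itertools.combinations(df.index, 2):
    distances[(i, j)] = np.linalg.norm(df.loc[i] - df.loc[j])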
Edit:
Instead, you can use scipy.spatial.distance.cdist, which computes the distance between each pair of points from two collections of inputs:
from scipy.spatial.distance import cdist
cdist(df, df, 'euclid')
This will return a symmetric (44062 by 44062) matrix of Euclidean distances between all the rows of your dataframe. The problem is that you need a lot of memory for it to work (at least 8*44062**2 bytes, i.e. ~16 GB). So a better option is to use pdist:
from scipy.spatial.distance import pdist
pdist(df.values, 'euclid')
which will return an array (of size 970707891) of all the pairwise Euclidean distances between the rows of df.
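pdist returns the distances in "condensed" form (only the upper triangle, pair by pair). If you later need the full square matrix, and your data is small enough to afford it, scipy's squareform converts between the two representations:
from scipy.spatial.distance import pdist, squareform
condensed = pdist(df.values, 'euclid')   # shape (n*(n-1)/2,)
square = squareform(condensed)           # shape (n, n), zeros on the diagonal
Keep in mind that for 44062 rows the square form brings back the ~16 GB memory problem.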
P.S. Don't forget to exclude the 'Actual_Data' column when computing the distances. E.g. you can do the following:
data = df.drop('Actual_Data', axis=1).values
and then cdist(data, data, 'euclid') or pdist(data, 'euclid'). You can also create another dataframe with the distances like this:
data = df.drop('Actual_Data', axis=1).values
d = pd.DataFrame(list(itertools.combinations(df.index, 2)), columns=['i', 'j'])
d['dist'] = pdist(data, 'euclid')
i j dist
0 0 1 ...
1 0 2 ...
2 0 3 ...
3 0 4 ...
...
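With that dataframe you can then look up individual pairs or, for example, the closest pair of rows:
# distance between rows 0 and 3
d.loc[(d['i'] == 0) & (d['j'] == 3), 'dist']
# pair of rows with the smallest distance
d.loc[d['dist'].idxmin()]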
Working with a subset of your data, e.g.:
import numpy as np
import pandas as pd

df_data = [[888888, 3, 0, 0],
[677767, 0, 2, 1],
[212341212, 0, 0, 0],
[141414141414, 0, 0, 0],
[1112224, 0, 0, 0]]
# Creating the dataframe
df = pd.DataFrame(data=df_data, columns=['Actual_Data', '8,8', '6,6', '7,7'], dtype=np.float64)
# Which looks like
# Actual_Data 8,8 6,6 7,7
# 0 8.888880e+05 3.0 0.0 0.0
# 1 6.777670e+05 0.0 2.0 1.0
# 2 2.123412e+08 0.0 0.0 0.0
# 3 1.414141e+11 0.0 0.0 0.0
# 4 1.112224e+06 0.0 0.0 0.0
# Computing the distance matrix
dist_matrix = df.apply(lambda row: [np.linalg.norm(row.values - df.loc[[_id], :].values, 2) for _id in df.index.values], axis=1)
# Which looks like
# 0 [0.0, 211121.00003315636, 211452324.0, 141413252526.0, 223336.000020149]
# 1 [211121.00003315636, 0.0, 211663445.0, 141413463647.0, 434457.0000057543]
# 2 [211452324.0, 211663445.0, 0.0, 141201800202.0, 211228988.0]
# 3 [141413252526.0, 141413463647.0, 141201800202.0, 0.0, 141413029190.0]
# 4 [223336.000020149, 434457.0000057543, 211228988.0, 141413029190.0, 0.0]
# Reformatting the above into readable format
dist_matrix = pd.DataFrame(
data=dist_matrix.values.tolist(),
columns=df.index.tolist(),
index=df.index.tolist())
# Which gives you
# 0 1 2 3 4
# 0 0.000000e+00 2.111210e+05 2.114523e+08 1.414133e+11 2.233360e+05
# 1 2.111210e+05 0.000000e+00 2.116634e+08 1.414135e+11 4.344570e+05
# 2 2.114523e+08 2.116634e+08 0.000000e+00 1.412018e+11 2.112290e+08
# 3 1.414133e+11 1.414135e+11 1.412018e+11 0.000000e+00 1.414130e+11
# 4 2.233360e+05 4.344570e+05 2.112290e+08 1.414130e+11 0.000000e+00
As pointed out in the comments, the issue is memory overflow, so we have to process the problem in batches.
# Collecting the data
# df = ....
# `batch` is the number of chunks the rows are split into.
# Increase it (more, smaller chunks) if you still get `memory` errors.
batch = 200
# To be conservative, let's write the intermediate results to disk.
dffname = []
for ifile, _slice in enumerate(np.array_split(range(df.shape[0]), batch)):
    # Compute the distances for this slice of rows against all rows
    tmp_df = df.iloc[_slice, :].apply(lambda row: [np.linalg.norm(row.values - df.loc[[_id], :].values, 2) for _id in df.index.values], axis=1)
    tmp_df = pd.DataFrame(tmp_df.values.tolist(), index=df.index.values[_slice], columns=df.index.values)
    # You can change it from csv to any other file format
    tmp_df.to_csv(f"{ifile+1}.csv")
    dffname.append(f"{ifile+1}.csv")
# Reading back the dataFrames
dflist = []
for f in dffname:
    dflist.append(pd.read_csv(f, dtype=np.float64, index_col=0))
res = pd.concat(dflist)
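If writing one CSV per batch is not needed, a similar batching idea can be sketched with cdist on blocks of rows (this is only a sketch, assuming df already excludes the 'Actual_Data' column; the block size and file names are placeholders):
import numpy as np
from scipy.spatial.distance import cdist
block = 1000  # rows per block; adjust depending on available memory
values = df.values
for start in range(0, values.shape[0], block):
    stop = min(start + block, values.shape[0])
    # (stop - start) x n_rows slice of the full distance matrix
    block_dist = cdist(values[start:stop], values, 'euclidean')
    # persist each block instead of keeping everything in memory
    np.save(f"dist_{start}_{stop}.npy", block_dist)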