Calculating pairwise Euclidean distance between all the rows of a dataframe

How can I calculate the Euclidean distance between all the rows of a dataframe? I am trying this code, but it is not working:

zero_data = data
distance = lambda column1, column2: pd.np.linalg.norm(column1 - column2)
result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
result.head()

This is what my (44062 by 278) dataframe looks like:

Please see sample data here

Quicklearner.gk asked Mar 07 '20 06:03



2 Answers

To compute the Euclidean distance between two rows i and j of a dataframe df:

np.linalg.norm(df.loc[i] - df.loc[j])

To compute it between consecutive rows, i.e. 0 and 1, 1 and 2, 2 and 3, ...

np.linalg.norm(df.diff(axis=0).drop(0), axis=1)
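
A minimal sketch of the consecutive-row case on a made-up three-row frame (the name `toy`, the column names `x` and `y`, and the values are hypothetical, chosen so the distances are easy to check by hand). It uses `.iloc[1:]` rather than `.drop(0)` so it does not assume the first index label is 0:

```python
import numpy as np
import pandas as pd

# Hypothetical toy data: three points in the plane.
toy = pd.DataFrame({"x": [0.0, 3.0, 3.0], "y": [0.0, 4.0, 10.0]})

# Row-to-row differences; the first row has no predecessor, so skip it by position.
steps = toy.diff(axis=0).iloc[1:]

# One Euclidean norm per remaining row: distances 0->1 and 1->2.
consecutive = np.linalg.norm(steps.to_numpy(), axis=1)
print(consecutive)
```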

If you want to compute it between all pairs of rows, i.e. 0 and 1, 0 and 2, ..., 1 and 2, 1 and 3, ..., you have to loop through all the combinations of i and j (keep in mind that for 44062 rows there are 970707891 such combinations, so a for-loop will be very slow):

import itertools
import numpy as np

for i, j in itertools.combinations(df.index, 2):
    # Distance between rows i and j; store or process it here
    d_ij = np.linalg.norm(df.loc[i] - df.loc[j])

Edit:

Instead, you can use scipy.spatial.distance.cdist which computes distance between each pair of two collections of inputs:

from scipy.spatial.distance import cdist

cdist(df, df, 'euclid')

This returns a symmetric (44062 by 44062) matrix of Euclidean distances between all the rows of your dataframe. The problem is that it needs a lot of memory: at least 8*44062**2 bytes, i.e. about 15.5 GB. A better option is pdist, which stores each pair only once and so needs about half of that (8*970707891 bytes, about 7.8 GB):

from scipy.spatial.distance import pdist

pdist(df.values, 'euclid')

which will return an array (of size 970707891) of all the pairwise Euclidean distances between the rows of df.
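
To make the condensed output concrete, here is a small sketch on made-up data (the array `pts` is hypothetical). It shows the length of the `pdist` result, how `scipy.spatial.distance.squareform` expands it into the full symmetric matrix when that fits in memory, and the storage arithmetic for the 44062-row case, assuming 8-byte floats:

```python
import numpy as np
from scipy.spatial.distance import cdist, pdist, squareform

# Hypothetical toy data: 4 points in 2 dimensions.
pts = np.array([[0.0, 0.0], [3.0, 4.0], [6.0, 8.0], [0.0, 1.0]])

# Pair order in the condensed array: (0,1), (0,2), (0,3), (1,2), (1,3), (2,3)
condensed = pdist(pts, "euclidean")
# Expand to an n x n symmetric matrix with a zero diagonal.
full = squareform(condensed)

print(condensed.shape)  # one entry per unordered pair: n * (n - 1) / 2
print(np.allclose(full, cdist(pts, pts, "euclidean")))

# Storage estimate for the question's 44062 rows, assuming float64 (8 bytes each).
rows = 44062
pairs = rows * (rows - 1) // 2
print(pairs)               # number of pdist entries
print(8 * rows**2 / 1e9)   # full matrix, in GB
print(8 * pairs / 1e9)     # condensed array, in GB
```

The condensed form halves the storage, but for this many rows it is still several gigabytes, so it only helps if that much memory is available.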

P.S. Don't forget to exclude the 'Actual_Data' column from the distance computation. For example, take data = df.drop('Actual_Data', axis=1).values and then call cdist(data, data, 'euclid') or pdist(data, 'euclid'). You can also build another dataframe holding the distances like this:

data = df.drop('Actual_Data', axis=1).values

d = pd.DataFrame(itertools.combinations(df.index, 2), columns=['i','j'])
d['dist'] = pdist(data, 'euclid')


   i  j  dist
0  0  1  ...
1  0  2  ...
2  0  3  ...
3  0  4  ...
...
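
The same table can be sketched without materialising Python tuples, assuming the rows are addressed by position rather than by index label: `np.triu_indices(n, k=1)` yields the row pairs in the same order as `pdist`. The frame `small` and its columns `a` and `b` are made up for illustration; for the full 44062-row data the table would still hold about 970 million rows.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import pdist

# Hypothetical stand-in for the real data, including the identifier column to drop.
small = pd.DataFrame({
    "Actual_Data": [11, 22, 33],
    "a": [0.0, 3.0, 0.0],
    "b": [0.0, 4.0, 1.0],
})
features = small.drop("Actual_Data", axis=1).to_numpy()

# Positions (i, j) with i < j, in the same order pdist uses.
i, j = np.triu_indices(len(features), k=1)

pairs = pd.DataFrame({"i": i, "j": j, "dist": pdist(features, "euclidean")})
print(pairs)
```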
Andreas K. answered Oct 24 '22 10:10


Working with a subset of your data, for example:

import numpy as np
import pandas as pd

df_data = [[888888, 3, 0, 0],
 [677767, 0, 2, 1],
 [212341212, 0, 0, 0],
 [141414141414, 0, 0, 0],
 [1112224, 0, 0, 0]]

# Creating the data
df = pd.DataFrame(data=df_data, columns=['Actual_Data', '8,8', '6,6', '7,7'], dtype=np.float64)

# Which looks like
#     Actual_Data  8,8  6,6  7,7
# 0  8.888880e+05  3.0  0.0  0.0
# 1  6.777670e+05  0.0  2.0  1.0
# 2  2.123412e+08  0.0  0.0  0.0
# 3  1.414141e+11  0.0  0.0  0.0
# 4  1.112224e+06  0.0  0.0  0.0

# Computing the distance matrix
dist_matrix = df.apply(lambda row: [np.linalg.norm(row.values - df.loc[[_id], :].values, 2) for _id in df.index.values], axis=1)

# Which looks like
# 0     [0.0, 211121.00003315636, 211452324.0, 141413252526.0, 223336.000020149]
# 1    [211121.00003315636, 0.0, 211663445.0, 141413463647.0, 434457.0000057543]
# 2                 [211452324.0, 211663445.0, 0.0, 141201800202.0, 211228988.0]
# 3        [141413252526.0, 141413463647.0, 141201800202.0, 0.0, 141413029190.0]
# 4      [223336.000020149, 434457.0000057543, 211228988.0, 141413029190.0, 0.0]

# Reformatting the above into readable format
dist_matrix = pd.DataFrame(
  data=dist_matrix.values.tolist(), 
  columns=df.index.tolist(), 
  index=df.index.tolist())

# Which gives you
#               0             1             2             3             4
# 0  0.000000e+00  2.111210e+05  2.114523e+08  1.414133e+11  2.233360e+05
# 1  2.111210e+05  0.000000e+00  2.116634e+08  1.414135e+11  4.344570e+05
# 2  2.114523e+08  2.116634e+08  0.000000e+00  1.412018e+11  2.112290e+08
# 3  1.414133e+11  1.414135e+11  1.412018e+11  0.000000e+00  1.414130e+11
# 4  2.233360e+05  4.344570e+05  2.112290e+08  1.414130e+11  0.000000e+00
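
For comparison, a hedged sketch of the same 5 by 5 matrix computed in one call with `scipy.spatial.distance.cdist` instead of the nested `apply`. This swaps in a different technique from the one above. Like the snippet above, it keeps `Actual_Data` in the distance, so drop that column first if it is only an identifier.

```python
import numpy as np
import pandas as pd
from scipy.spatial.distance import cdist

df_data = [[888888, 3, 0, 0],
           [677767, 0, 2, 1],
           [212341212, 0, 0, 0],
           [141414141414, 0, 0, 0],
           [1112224, 0, 0, 0]]
df = pd.DataFrame(df_data, columns=["Actual_Data", "8,8", "6,6", "7,7"], dtype=np.float64)

# All row-to-row distances in a single vectorised call.
values = df.to_numpy()
dist_matrix = pd.DataFrame(cdist(values, values, "euclidean"),
                           index=df.index, columns=df.index)
print(dist_matrix)
```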

Update

As pointed out in the comments, the issue is running out of memory, so we have to process the problem in batches.

# Collecting the data
# df = ....

# Number of chunks to split the rows into (np.array_split takes a chunk count, not a chunk size).
# Set this to a higher value if you still get `memory` errors: more chunks means fewer rows per chunk.
batch = 200

# To be conservative, let's write the intermediate results to files.
dffname = []

for ifile,_slice in enumerate(np.array_split(range(df.shape[0]), batch)):

  # Compute distances from the rows in this chunk to every row of the data frame
  tmp_df = df.iloc[_slice, :].apply(lambda row: [np.linalg.norm(row.values - df.loc[[_id], :].values, 2) for _id in df.index.values], axis=1)

  tmp_df = pd.DataFrame(tmp_df.values.tolist(), index=df.index.values[_slice], columns=df.index.values)

  # CSV is used here; any other file format works too
  tmp_df.to_csv(f"{ifile+1}.csv")
  dffname.append(f"{ifile+1}.csv")

# Reading back the dataFrames
dflist = []
for f in dffname:
  dflist.append(pd.read_csv(f, dtype=np.float64, index_col=0))

res = pd.concat(dflist)
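
A lighter-weight variant of the batching idea, sketched under the assumption that each block of rows can be reduced or stored on its own, so the full matrix never has to exist at once. It computes one block per `cdist` call. The helper name `iter_distance_blocks`, the chunk size, and the random test data are all hypothetical.

```python
import numpy as np
from scipy.spatial.distance import cdist

def iter_distance_blocks(values, chunk_rows):
    """Yield (start, block), where block holds the distances from rows
    start .. start + chunk_rows - 1 to every row of `values`."""
    for start in range(0, len(values), chunk_rows):
        block = cdist(values[start:start + chunk_rows], values, "euclidean")
        yield start, block

# Hypothetical data standing in for the real feature matrix.
rng = np.random.default_rng(0)
values = rng.random((10, 3))

# Example reduction: nearest other row for each row, without keeping the full matrix.
nearest = np.empty(len(values), dtype=int)
for start, block in iter_distance_blocks(values, chunk_rows=4):
    rows = np.arange(block.shape[0])
    block[rows, start + rows] = np.inf  # ignore each row's distance to itself
    nearest[start:start + block.shape[0]] = block.argmin(axis=1)

print(nearest)
```

With 44062 rows, a block of 1000 rows holds 8 * 1000 * 44062 bytes, roughly 350 MB, so the chunk size can be tuned to the memory available.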
DOOM answered Oct 24 '22 10:10