How can I calculate the Euclidean distance between all the rows of a dataframe? I am trying this code, but it is not working:
zero_data = data
distance = lambda column1, column2: pd.np.linalg.norm(column1 - column2)
result = zero_data.apply(lambda col1: zero_data.apply(lambda col2: distance(col1, col2)))
result.head()
My dataframe has 44062 rows and 278 columns.
Euclidean distance is calculated as the square root of the sum of the squared differences between the two vectors.
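For example, with two small made-up vectors (the numbers here are only for illustration), the definition can be checked directly in numpy:
import numpy as np
a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 6.0, 3.0])
# square root of the sum of squared differences
np.sqrt(np.sum((a - b) ** 2))   # 5.0
# equivalent one-liner
np.linalg.norm(a - b)           # 5.0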
sklearn.metrics.pairwise.euclidean_distances(X, Y=None, *, Y_norm_squared=None, squared=False, X_norm_squared=None) considers the rows of X (and Y=X) as vectors and computes the distance matrix between each pair of vectors. For efficiency reasons, the Euclidean distance between a pair of row vectors x and y is computed as dist(x, y) = sqrt(dot(x, x) - 2 * dot(x, y) + dot(y, y)). This formulation has two advantages over other ways of computing distances.
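As a quick sketch (not from the original answer), that function could be applied to the rows of the dataframe like this; note that a full 44062 x 44062 float64 matrix needs roughly 16 GB of memory, just like cdist further below:
from sklearn.metrics.pairwise import euclidean_distances
# n_rows x n_rows matrix of distances between all rows of df
dist_matrix = euclidean_distances(df.values)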
Note that the same np.linalg.norm approach can also be used to calculate the Euclidean distance between two columns of a pandas DataFrame rather than two rows.
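For instance, with two hypothetical columns 'x' and 'y':
col_dist = np.linalg.norm(df['x'] - df['y'])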
To compute the Euclidean distance between two rows i and j of a dataframe df:
np.linalg.norm(df.loc[i] - df.loc[j])
To compute it between consecutive rows, i.e. 0 and 1, 1 and 2, 2 and 3, ...
np.linalg.norm(df.diff(axis=0).drop(0), axis=1)
If you want to compute it between all the rows, i.e. 0 and 1, 0 and 2, ..., 1 and 2, 1 and 3, ..., then you have to loop through all the combinations of i and j (keep in mind that for 44062 rows there are 970707891 such combinations, so using a for-loop will be very slow):
import itertools
for i, j in itertools.combinations(df.index, 2):
    d_ij = np.linalg.norm(df.loc[i] - df.loc[j])
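If you need to keep the results of that loop, a minimal variant (still very slow for 44062 rows) could store them in a dict keyed by the row pair:
import itertools
distances = {}
for i, j in itertools.combinations(df.index, 2):
    distances[(i, j)] = np.linalg.norm(df.loc[i] - df.loc[j])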
Edit:
Instead, you can use scipy.spatial.distance.cdist, which computes the distance between each pair of points from two collections of inputs:
from scipy.spatial.distance import cdist
cdist(df, df, 'euclid')
This will return a symmetric (44062 by 44062) matrix of Euclidean distances between all the rows of your dataframe. The problem is that you need a lot of memory for it to work (at least 8*44062**2 bytes, i.e. ~16 GB). So a better option is to use pdist:
from scipy.spatial.distance import pdist
pdist(df.values, 'euclid')
which will return an array (of size 970707891) of all the pairwise Euclidean distances between the rows of df.
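pdist returns the distances in "condensed" form (only the upper triangle, pair by pair). If you later need the full square matrix, and your data is small enough to afford it, scipy's squareform converts between the two representations:
from scipy.spatial.distance import pdist, squareform
condensed = pdist(df.values, 'euclid')   # shape (n*(n-1)/2,)
square = squareform(condensed)           # shape (n, n), zeros on the diagonal
Keep in mind that for 44062 rows the square form brings back the ~16 GB memory problem.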
P.S. Don't forget to exclude the 'Actual_Data' column when computing the distances. E.g. you can do the following:
data = df.drop('Actual_Data', axis=1).values
and then cdist(data, data, 'euclid') or pdist(data, 'euclid'). You can also create another dataframe with the distances like this:
data = df.drop('Actual_Data', axis=1).values
d = pd.DataFrame(list(itertools.combinations(df.index, 2)), columns=['i', 'j'])
d['dist'] = pdist(data, 'euclid')
i j dist
0 0 1 ...
1 0 2 ...
2 0 3 ...
3 0 4 ...
...
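With that dataframe you can then look up individual pairs or, for example, the closest pair of rows:
# distance between rows 0 and 3
d.loc[(d['i'] == 0) & (d['j'] == 3), 'dist']
# pair of rows with the smallest distance
d.loc[d['dist'].idxmin()]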
Working with a subset of your data, e.g.:
import numpy as np
import pandas as pd

df_data = [[888888, 3, 0, 0],
[677767, 0, 2, 1],
[212341212, 0, 0, 0],
[141414141414, 0, 0, 0],
[1112224, 0, 0, 0]]
# Creating the dataframe
df = pd.DataFrame(data=df_data, columns=['Actual_Data', '8,8', '6,6', '7,7'], dtype=np.float64)
# Which looks like
# Actual_Data 8,8 6,6 7,7
# 0 8.888880e+05 3.0 0.0 0.0
# 1 6.777670e+05 0.0 2.0 1.0
# 2 2.123412e+08 0.0 0.0 0.0
# 3 1.414141e+11 0.0 0.0 0.0
# 4 1.112224e+06 0.0 0.0 0.0
# Computing the distance matrix
dist_matrix = df.apply(lambda row: [np.linalg.norm(row.values - df.loc[[_id], :].values, 2) for _id in df.index.values], axis=1)
# Which looks like
# 0 [0.0, 211121.00003315636, 211452324.0, 141413252526.0, 223336.000020149]
# 1 [211121.00003315636, 0.0, 211663445.0, 141413463647.0, 434457.0000057543]
# 2 [211452324.0, 211663445.0, 0.0, 141201800202.0, 211228988.0]
# 3 [141413252526.0, 141413463647.0, 141201800202.0, 0.0, 141413029190.0]
# 4 [223336.000020149, 434457.0000057543, 211228988.0, 141413029190.0, 0.0]
# Reformatting the above into readable format
dist_matrix = pd.DataFrame(
data=dist_matrix.values.tolist(),
columns=df.index.tolist(),
index=df.index.tolist())
# Which gives you
# 0 1 2 3 4
# 0 0.000000e+00 2.111210e+05 2.114523e+08 1.414133e+11 2.233360e+05
# 1 2.111210e+05 0.000000e+00 2.116634e+08 1.414135e+11 4.344570e+05
# 2 2.114523e+08 2.116634e+08 0.000000e+00 1.412018e+11 2.112290e+08
# 3 1.414133e+11 1.414135e+11 1.412018e+11 0.000000e+00 1.414130e+11
# 4 2.233360e+05 4.344570e+05 2.112290e+08 1.414130e+11 0.000000e+00
As pointed out in the comments, the issue is memory overflow, so we have to process the problem in batches.
# Collecting the data
# df = ....
# `batch` is the number of chunks the rows are split into.
# Increase it (more, smaller chunks) if you still get `memory` errors.
batch = 200
# To be conservative, let's write the intermediate results to disk.
dffname = []
for ifile, _slice in enumerate(np.array_split(range(df.shape[0]), batch)):
    # Compute the distances for this slice of rows against all rows
    tmp_df = df.iloc[_slice, :].apply(lambda row: [np.linalg.norm(row.values - df.loc[[_id], :].values, 2) for _id in df.index.values], axis=1)
    tmp_df = pd.DataFrame(tmp_df.values.tolist(), index=df.index.values[_slice], columns=df.index.values)
    # You can change it from csv to any other file format
    tmp_df.to_csv(f"{ifile+1}.csv")
    dffname.append(f"{ifile+1}.csv")
# Reading back the dataFrames
dflist = []
for f in dffname:
    dflist.append(pd.read_csv(f, dtype=np.float64, index_col=0))
res = pd.concat(dflist)
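If writing one CSV per batch is not needed, a similar batching idea can be sketched with cdist on blocks of rows (this is only a sketch, assuming df already excludes the 'Actual_Data' column; the block size and file names are placeholders):
import numpy as np
from scipy.spatial.distance import cdist
block = 1000  # rows per block; adjust depending on available memory
values = df.values
for start in range(0, values.shape[0], block):
    stop = min(start + block, values.shape[0])
    # (stop - start) x n_rows slice of the full distance matrix
    block_dist = cdist(values[start:stop], values, 'euclidean')
    # persist each block instead of keeping everything in memory
    np.save(f"dist_{start}_{stop}.npy", block_dist)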