My dataset has 500,000+ rows. I need the Hausdorff distance between every id and every other id, across the whole dataset. Here is a small part of it:
df =
id_easy ordinal latitude longitude epoch day_of_week
0 aaa 1.0 22.0701 2.6685 01-01-11 07:45 Friday
1 aaa 2.0 22.0716 2.6695 01-01-11 07:45 Friday
2 aaa 3.0 22.0722 2.6696 01-01-11 07:46 Friday
3 bbb 1.0 22.1166 2.6898 01-01-11 07:58 Friday
4 bbb 2.0 22.1162 2.6951 01-01-11 07:59 Friday
5 ccc 1.0 22.1166 2.6898 01-01-11 07:58 Friday
6 ccc 2.0 22.1162 2.6951 01-01-11 07:59 Friday
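For reference, the sample above can be reconstructed like this (a minimal sketch; only the columns used in the distance computation are included, since ordinal, epoch and day_of_week play no role in it):
import pandas as pd

df = pd.DataFrame({
    'id_easy':   ['aaa', 'aaa', 'aaa', 'bbb', 'bbb', 'ccc', 'ccc'],
    'latitude':  [22.0701, 22.0716, 22.0722, 22.1166, 22.1162, 22.1166, 22.1162],
    'longitude': [2.6685, 2.6695, 2.6696, 2.6898, 2.6951, 2.6898, 2.6951],
})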
I want to calculate the Hausdorff distance:
import pandas as pd
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from scipy.spatial.distance import pdist, squareform
u = np.array([(2.6685,22.0701),(2.6695,22.0716),(2.6696,22.0722)]) # coordinates of `id_easy` of taxi `aaa`
v = np.array([(2.6898,22.1166),(2.6951,22.1162)]) # coordinates of `id_easy` of taxi `bbb`
directed_hausdorff(u, v)[0]
Output is 0.05114626086039758
Now I want to calculate this distance for the whole dataset, for all id_easys. The desired output is a matrix with 0 on the diagonal (because the distance between aaa and aaa is 0):
     aaa      bbb      ccc
aaa  0        0.05114  ...
bbb  ...      0        ...
ccc  ...      ...      0
You're talking about calculating 500000^2+ distances. If you calculate 1000 of these distances every second, it will take you 7.93 years to complete your matrix. The directed Hausdorff distance is not symmetric, but the full Hausdorff distance (the maximum of the two directed distances) is; even exploiting that symmetry only saves you a factor of two (3.96 years).
The matrix will also take about two terabytes of memory at 8 bytes per entry.
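A quick back-of-the-envelope check of those numbers (assuming one float64 per matrix entry):
n = 500_000
print(n**2 / 1000 / (3600 * 24 * 365))  # ~7.93 years at 1000 distances/second
print(n**2 * 8 / 1e12)                  # ~2 TB for a dense float64 matrix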
I recommend calculating this only when needed, or, if you really need the whole matrix, parallelizing the calculations. On the bright side, this problem can easily be broken up. For example, with four cores you can split the problem as follows (in pseudocode):
n = len(u) // 2   # split each point set in half
m = len(v) // 2
A = hausdorff_distance_matrix(u[:n], v[:m])
B = hausdorff_distance_matrix(u[:n], v[m:])
C = hausdorff_distance_matrix(u[n:], v[:m])
D = hausdorff_distance_matrix(u[n:], v[m:])
results = [[A, B],
           [C, D]]
where hausdorff_distance_matrix(u, v) returns all distance combinations between u and v. You'll probably need to split it into a lot more than four segments, though.
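A minimal sketch of that block splitting with multiprocessing. Here hausdorff_distance_matrix is a hypothetical implementation of the helper named above, and groups is assumed to be a list of per-id coordinate arrays (one (N, 2) array per id):
import numpy as np
from multiprocessing import Pool
from scipy.spatial.distance import directed_hausdorff

def hausdorff_distance_matrix(groups_a, groups_b):
    # all directed distances between two lists of point sets
    out = np.empty((len(groups_a), len(groups_b)))
    for i, a in enumerate(groups_a):
        for j, b in enumerate(groups_b):
            out[i, j] = directed_hausdorff(a, b)[0]
    return out

def parallel_blocks(groups, n_workers=4):
    # split the list of ids into chunks; one block per pair of chunks
    chunks = np.array_split(np.arange(len(groups)), n_workers)
    tasks = [([groups[i] for i in ca], [groups[j] for j in cb])
             for ca in chunks for cb in chunks]
    with Pool(n_workers) as pool:
        blocks = pool.starmap(hausdorff_distance_matrix, tasks)
    # stitch the n_workers x n_workers grid of blocks back together
    rows = [np.hstack(blocks[k * n_workers:(k + 1) * n_workers])
            for k in range(n_workers)]
    return np.vstack(rows)
On platforms that spawn worker processes (Windows, macOS), the usual if __name__ == '__main__': guard applies.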
What is the application? Can you get away with calculating these only piecewise, as needed?
First I define a method that provides some sample data. It would be a lot easier if you provided something like that in the question; in most performance-related problems, the size of the real problem is needed to find an optimal solution.
In the following answer I assume that the average number of points per id_easy is 17 and that there are 30,000 different ids, which results in a dataset of 510,000 rows.
Create sample data
import numpy as np
import numba as nb
N_ids=30_000
av_id_size=17
#create_data (pre sorting according to id assumed)
lat_lon=np.random.rand(N_ids*av_id_size,2)
#create_ids (sorted array with ids)
ids=np.empty(N_ids*av_id_size,dtype=np.int64)
ind=0
for i in range(N_ids):
    for j in range(av_id_size):
        ids[i*av_id_size+j]=ind
    ind+=1
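As an aside (not part of the original answer), the nested loop above is equivalent to this one-liner:
ids = np.repeat(np.arange(N_ids, dtype=np.int64), av_id_size)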
Hausdorff function
The following function is a slightly modified version of the scipy source. Two modifications were made: the distance loop is unrolled for the (N,2) case, and the random shuffling of the inputs (worthwhile for larger arrays) is commented out.
#Modified Code from Scipy-source
#https://github.com/scipy/scipy/blob/master/scipy/spatial/_hausdorff.pyx
#Copyright (C) Tyler Reddy, Richard Gowers, and Max Linke, 2016
#Copyright © 2001, 2002 Enthought, Inc.
#All rights reserved.
#Copyright © 2003-2013 SciPy Developers.
#All rights reserved.
#Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
#Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
#Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following
#disclaimer in the documentation and/or other materials provided with the distribution.
#Neither the name of Enthought nor the names of the SciPy Developers may be used to endorse or promote products derived
#from this software without specific prior written permission.
#THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING,
#BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED.
#IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY,
#OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS;
#OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
#(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
@nb.njit()
def directed_hausdorff_nb(ar1, ar2):
    N1 = ar1.shape[0]
    N2 = ar2.shape[0]
    data_dims = ar1.shape[1]

    # Shuffling for very small arrays disabled
    # Enable it for larger arrays
    #resort1 = np.arange(N1)
    #resort2 = np.arange(N2)
    #np.random.shuffle(resort1)
    #np.random.shuffle(resort2)
    #ar1 = ar1[resort1]
    #ar2 = ar2[resort2]

    cmax = 0
    for i in range(N1):
        no_break_occurred = True
        cmin = np.inf
        for j in range(N2):
            # faster performance with square of distance
            # avoid sqrt until very end
            # simplification (loop unrolling) for (n,2) arrays
            d = (ar1[i, 0] - ar2[j, 0])**2 + (ar1[i, 1] - ar2[j, 1])**2
            if d < cmax:  # break out of `for j` loop
                no_break_occurred = False
                break
            if d < cmin:  # always true on first iteration of for-j loop
                cmin = d
        # always true on first iteration of for-j loop, after that only
        # if d >= cmax
        if cmin != np.inf and cmin > cmax and no_break_occurred == True:
            cmax = cmin
    return np.sqrt(cmax)
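As a quick sanity check, the numba version reproduces the scipy result for the u and v from the question:
u = np.array([(2.6685, 22.0701), (2.6695, 22.0716), (2.6696, 22.0722)])
v = np.array([(2.6898, 22.1166), (2.6951, 22.1162)])
print(directed_hausdorff_nb(u, v))  # 0.05114626086039758, same as directed_hausdorff(u, v)[0]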
Calculating Hausdorff distance on subsets
@nb.njit(parallel=True)
def get_distance_mat(def_slice, lat_lon):
    Num_ids = def_slice.shape[0] - 1
    out = np.empty((Num_ids, Num_ids), dtype=np.float64)
    for i in nb.prange(Num_ids):
        # slice out all coordinates belonging to id i
        ar1 = lat_lon[def_slice[i]:def_slice[i+1], :]
        for j in range(i, Num_ids):
            ar2 = lat_lon[def_slice[j]:def_slice[j+1], :]
            dist = directed_hausdorff_nb(ar1, ar2)
            # store the directed distance in both halves of the matrix
            out[i, j] = dist
            out[j, i] = dist
    return out
Example and Timings
#def_slice defines the start and end of the slices
_,def_slice=np.unique(ids,return_index=True)
def_slice=np.append(def_slice,ids.shape[0])
%timeit res_1=get_distance_mat(def_slice,lat_lon)
#1min 2s ± 301 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
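To run get_distance_mat on the real dataframe rather than the synthetic data, a bridge along these lines should work (a sketch: it reuses the names above and uses pd.factorize to turn the string ids into the integer codes the code expects, sorting rows by id as assumed):
import pandas as pd

codes, uniques = pd.factorize(df['id_easy'])   # integer code per row
order = np.argsort(codes, kind='stable')       # pre-sorting by id, as assumed above
lat_lon = df[['latitude', 'longitude']].to_numpy()[order]
ids = codes[order]

_, def_slice = np.unique(ids, return_index=True)
def_slice = np.append(def_slice, ids.shape[0])
res_df = pd.DataFrame(get_distance_mat(def_slice, lat_lon), index=uniques, columns=uniques)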
Try using cdist from scipy with a custom metric:
from scipy.spatial.distance import cdist

# compare whole trajectories (grouped by id), not individual rows
keys, grps = zip(*((k, g[['latitude', 'longitude']].to_numpy()) for k, g in df.groupby('id_easy')))
idx = np.arange(len(grps)).reshape(-1, 1)
hausdorff_distance = cdist(idx, idx, lambda u, v: directed_hausdorff(grps[int(u[0])], grps[int(v[0])])[0])
hausdorff_distance_df = pd.DataFrame(hausdorff_distance, index=keys, columns=keys)
As a note though, whatever method you end up using will take a long time to compute, simply due to the sheer volume of the data. Ask yourself whether you genuinely need every single pairwise distance.
In practice, these kinds of problems are solved by restricting the number of pairings to a manageable number. For example, slice your data frame into smaller sets, each restricted to a geographical area, and then find the pairwise distances within each area.
This approach is used by supermarkets to identify spots for new stores. They do not compute pairwise distances between every store they own and every competitor's store; first they restrict the area to one containing only 5-10 stores in total, and only then do they compute the distances.
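A rough sketch of that restriction, bucketing trajectories into a coarse latitude/longitude grid (both the 0.1-degree cell size and assigning a trajectory to the cell of its first point are arbitrary choices for illustration):
# assign each id to a coarse grid cell based on its first point
first = df.groupby('id_easy').first()
cell = ((first['latitude'] // 0.1).astype(int).astype(str) + '_'
        + (first['longitude'] // 0.1).astype(int).astype(str))

# only compute distance matrices within each cell
for c, ids_in_cell in first.groupby(cell).groups.items():
    sub = df[df['id_easy'].isin(ids_in_cell)]
    # ... compute the (much smaller) pairwise matrix for `sub` here ...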