I am new to numpy/pandas and vectorized computation. I am doing a data task where I have two datasets. Dataset 1 contains a list of places with their longitude and latitude and a variable A. Dataset 2 also contains a list of places with their longitude and latitude. For each place in dataset 1, I would like to calculate its distances to all the places in dataset 2 but I would only like to get a count of places in dataset 2 that are less than the value of variable A. Note also both of the datasets are very large, so that I need to use vectorized operations to expedite the computation.
For example, my dataset1 may look like below:
id lon lat varA
1 20.11 19.88 100
2 20.87 18.65 90
3 18.99 20.75 120
and my dataset2 may look like below:
placeid lon lat
a 18.75 20.77
b 19.77 22.56
c 20.86 23.76
d 17.55 20.74
Then for id == 1 in dataset1, I would like to calculate its distances to all four points (a,c,c,d) in dataset2 and I would like to have a count of how many of the distances are less than the corresponding value of varA. For example, the four distances calculated are 90, 70, 120, 110 and varA is 100. Then the value should be 2.
I already have a vectorized function to calculate distance between the two pair of coordinates. Suppose the function (haversine(x,y)) is properly implemented, I have the following code.
dataset2['count'] = dataset1.apply(lambda x:
haversine(x['lon'],x['lat'],dataset2['lon'], dataset2['lat']).shape[0], axis
= 1)
However, this gives the total number of rows, but not the ones that satisfy my requirements.
Would anyone be able to point me how to make the code work?
If we can instead vectorize our operation and just perform one (or a few) large vector or matrix operation using something like NumPy , our code will run a lot faster . Ok, so now we know why vectorizing our operations might be a good idea. But how can we do so? Let’s look at two examples.
Vectorization is one of the most useful techniques to make your machine learning code more efficient. In this post, you will learn everything you need to know to start using vectorization efficiently in your machine learning projects. Fixed some issues in the math examples. Thanks to stumpeschach for pointing them out! Post published.
There are many methods to calculate distances in machine learning. Here we are going to discuss some of them. It is the distance between x and y in n dimension. Here, we are calculating distance d between to data points p1 and p2. array ( [ [1. ], It is the sum of absolute differences of all coordinates.
Typically, if you want to vectorize a pairwise similarity metric between a set of M D -dimensional points with shape (M, D) and a set of N D -dimensional points with shape (N, D), you need to perform two steps: Expand.
If you can project the coordinates to a local projection (e.g. UTM), which is pretty straight forward with pyproj
and generally more favorable than lon/lat for measurement, then there is a much much MUCH faster way using scipy.spatial
. Neither of df['something'] = df.apply(...)
and np.vectorize()
are not truly vectorized, under the hood, they use looping.
ds1
id lon lat varA
0 1 20.11 19.88 100
1 2 20.87 18.65 90
2 3 18.99 20.75 120
ds2
placeid lon lat
0 a 18.75 20.77
1 b 19.77 22.56
2 c 20.86 23.76
3 d 17.55 20.74
from scipy.spatial import distance
# gey coordinates of each set of points as numpy array
coords_a = ds1.values[:,(1,2)]
coords_b = ds2.values[:, (1,2)]
coords_a
#out: array([[ 20.11, 19.88],
# [ 20.87, 18.65],
# [ 18.99, 20.75]])
distances = distance.cdist(coords_a, coords_b)
#out: array([[ 1.62533074, 2.70148108, 3.95182236, 2.70059253],
# [ 2.99813275, 4.06178532, 5.11000978, 3.92307278],
# [ 0.24083189, 1.97091349, 3.54358575, 1.44003472]])
distances
is in fact distance between every pair of points. coords_a.shape
is (3, 2)
and coords_b.shape
is (4, 2)
, so the result is (3,4)
. The default metric for np.distance
is eculidean
, but there are other metrics as well.
For the sake of this example, let's assume vara
is:
vara = np.array([2,4.5,2])
(instead of 100 90 120
). We need to identify which value in distances
in row one is smaller than 2
, in row two smaller that 4.5
,..., one way to solve this problem is subtracting each value in vara
from corresponding row (note that we must resize vara
):
vara.resize(3,1)
res = res - vara
#out: array([[-0.37466926, 0.70148108, 1.95182236, 0.70059253],
# [-1.50186725, -0.43821468, 0.61000978, -0.57692722],
# [-1.75916811, -0.02908651, 1.54358575, -0.55996528]])
then setting positive values to zero and making negative values positive will give us the final array:
res[res>0] = 0
res = np.absolute(res)
#out: array([[ 0.37466926, 0. , 0. , 0. ],
# [ 1.50186725, 0.43821468, 0. , 0.57692722],
# [ 1.75916811, 0.02908651, 0. , 0.55996528]])
Now, to sum over each row:
sum_ = res.sum(axis=1)
#out: array([ 0.37466926, 2.51700915, 2.34821989])
and to count the items in each row:
count = np.count_nonzero(res, axis=1)
#out: array([1, 3, 3])
This is a fully vectorized (custom) solution which you can tweak to your liking and should accommodate any level of complexity. yet another solution is cKDTree
. the code is from documentation. it should be fairly easy to adopt it to your problem, but in case you need assistance don't hesitate to ask.
x, y = np.mgrid[0:4, 0:4]
points = zip(x.ravel(), y.ravel())
tree = spatial.cKDTree(points)
tree.query_ball_point([2, 0], 1)
[4, 8, 9, 12]
query_ball_point()
finds all points within distance r of point(s) x, and it is amazingly fast.
one final note: don't use these algorithms with lon/lat input, particularly if your area of interest is far from equator, because the error can get huge.
UPDATE:
To project your coordinates, you need to convert from WGS84 (lon/lat)
to appropriate UTM
. To find out which utm zone you should project to use epsg.io.
lon = -122.67598
lat = 45.52168
WGS84 = "+init=EPSG:4326"
EPSG3740 = "+init=EPSG:3740"
Proj_to_EPSG3740 = pyproj.Proj(EPSG3740)
Proj_to_EPSG3740(lon,lat)
# out: (525304.9265963673, 5040956.147893889)
You can do df.apply()
and use Proj_to_...
to project df.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With