I have an input of 36,742 points, which means that if I wanted to calculate the lower triangle of a distance matrix (using the Vincenty approximation) I would need to generate 36,742 * 36,741 * 0.5 = 674,968,911 distances.
I want to keep the pair combinations which are within 50 km of each other. My current set-up is as follows:
import time
from geopy.distance import vincenty

shops = [[id, lat, lon], ...]  # one row per shop

def lower_triangle_mat(points):
    for i in range(len(points) - 1):
        for j in range(i + 1, len(points)):
            yield [points[i], points[j]]

def return_stores_cutoff(points, cutoff_km=0):
    below_cut = []
    counter = 0
    for x in lower_triangle_mat(points):
        dist_km = vincenty(x[0][1:3], x[1][1:3]).km
        counter += 1
        if counter % 1000000 == 0:
            print("%d out of %d" % (counter, len(points) * (len(points) - 1) * 0.5))
        if dist_km <= cutoff_km:
            below_cut.append([x[0][0], x[1][0], dist_km])
    return below_cut
start = time.clock()
stores = return_stores_cutoff(points=shops,cutoff_km=50)
print(time.clock() - start)
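For reference, lower_triangle_mat can also be written with itertools.combinations, which yields the same pairs without the index bookkeeping (it yields tuples rather than two-element lists, but the indexing above handles both identically):

from itertools import combinations

def lower_triangle_mat(points):
    # Same pairs as the nested-loop version, yielded as tuples.
    return combinations(points, 2)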
This will obviously take hours and hours, so I started thinking about ways to speed it up.
Edit: I think geohashing is definitely needed here. An example using the geoindex package:
import random
from geoindex import GeoGridIndex, GeoPoint

geo_index = GeoGridIndex()
for _ in range(10000):
    lat = random.random() * 180 - 90
    lng = random.random() * 360 - 180
    geo_index.add_point(GeoPoint(lat, lng))
center_point = GeoPoint(37.7772448, -122.3955118)
for distance, point in geo_index.get_nearest_points(center_point, 10, 'km'):
    print("We found {0} in {1} km".format(point, distance))
However, I would also like to vectorise (instead of loop) the distance calculations for the stores returned by the geo-hash.
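For instance, a minimal sketch of what that vectorised step could look like, assuming haversine in place of Vincenty (an assumption on my part: haversine is a closed-form formula that broadcasts over NumPy arrays, and its sub-percent error is irrelevant at a 50 km cutoff):

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # All inputs in degrees; any argument may be a NumPy array.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

With that, the distances from one shop to every candidate returned by the geohash lookup come from a single call, e.g. haversine_km(lat0, lon0, cand_lats, cand_lons).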
Edit 2: Following Pouria Hadjibagheri's suggestion, I tried using lambda and map:
# [B]: Mapping approach
lwr_tr_mat = ((shops[i], shops[j]) for i in range(len(shops) - 1) for j in range(i + 1, len(shops)))
func = lambda x: (x[0][0], x[1][0], vincenty(x[0][1:3], x[1][1:3]).km)
# Trying to see if conditional statements slow this down
func_cond = lambda x: (x[0][0], x[1][0], vincenty(x[0][1:3], x[1][1:3]).km) if vincenty(x[0][1:3], x[1][1:3]).km <= 50 else None

start = time.clock()
out_dist = list(map(func, lwr_tr_mat))
print(time.clock() - start)

# Re-create the generator: the first map() exhausted it
lwr_tr_mat = ((shops[i], shops[j]) for i in range(len(shops) - 1) for j in range(i + 1, len(shops)))
start = time.clock()
out_dist = list(map(func_cond, lwr_tr_mat))
print(time.clock() - start)
And they were all around 61 seconds (I restricted the number of stores to 2,000 from 32,000). Perhaps I used map incorrectly?
This sounds like a classic use case for k-D trees.
If you first transform your points into Euclidean space then you can use the query_pairs method of scipy.spatial.cKDTree:
from scipy.spatial import cKDTree
tree = cKDTree(data)
# where data is (nshops, ndim) containing the Euclidean coordinates of each shop
# in units of km
pairs = tree.query_pairs(50, p=2) # 50km radius, L2 (Euclidean) norm
pairs will be a set of (i, j) tuples corresponding to the row indices of pairs of shops that are ≤ 50 km from each other.
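The one step this leaves implicit is the transform into Euclidean space. A minimal sketch, assuming a spherical Earth: map each (lat, lon) to 3-D Earth-centred Cartesian coordinates in km, so the tree's straight-line (chord) distances stand in for great-circle distances (the chord is shorter than the arc, but at 50 km the difference is well under a metre):

import numpy as np

def latlon_to_ecef_km(lat_deg, lon_deg, radius_km=6371.0):
    # Spherical-Earth approximation: degrees -> 3-D Cartesian km.
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.column_stack([
        radius_km * np.cos(lat) * np.cos(lon),
        radius_km * np.cos(lat) * np.sin(lon),
        radius_km * np.sin(lat),
    ])

coords = np.array([[s[1], s[2]] for s in shops], dtype=float)  # shops as in the question
data = latlon_to_ecef_km(coords[:, 0], coords[:, 1])           # feed this to cKDTree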
Alternatively, the output of tree.sparse_distance_matrix is a scipy.sparse.dok_matrix. Since the matrix will be symmetric and you're only interested in unique row/column pairs, you could use scipy.sparse.tril to zero out the upper triangle, giving you a scipy.sparse.coo_matrix. From there you can access the nonzero row and column indices and their corresponding distance values via its .row, .col and .data attributes:
from scipy import sparse

tree_dist = tree.sparse_distance_matrix(tree, max_distance=50, p=2)  # distances in km
udist = sparse.tril(tree_dist, k=-1)  # strictly lower triangle: drops the diagonal and the upper triangle
ridx = udist.row   # row indices
cidx = udist.col   # column indices
dist = udist.data  # distance values
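To get back to the question's [id_i, id_j, dist_km] triples, those indices can be mapped through the original shops list (assuming shops as defined in the question):

below_cut = [[shops[i][0], shops[j][0], d]
             for i, j, d in zip(ridx, cidx, dist)]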
Have you tried mapping entire arrays and functions instead of iterating through them? An example would be as follows:
from numpy.random import rand
my_array = rand(int(5e7), 1)  # An array of 50,000,000 random doubles.
Now what is normally done is:
squared_list_iter = [value**2 for value in my_array]
This works, of course, but is far from optimal.
The alternative would be to map the array with a function. This is done as follows:
func = lambda x: x**2  # Here is what I want to do on my array.
squared_list_map = map(func, my_array)  # Here I am doing it!
Now, one might ask: how is this any different, or even better for that matter, given that we have added a function call too? Here is your answer:
For the former solution (via iteration):
1 loop: 1.11 minutes.
Compared to the latter solution (mapping):
500 loops, 560 ns per loop on average.
Converting the map() to a list via list(map(func, my_array)) increases the time by a factor of ten, to approximately 500 ms.
You choose!
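If you want to reproduce the comparison yourself, here is a rough timing sketch (the array size is scaled down so the eager runs finish quickly; note that a bare map() in Python 3 only builds a lazy iterator, which is why its timing comes out in nanoseconds):

from timeit import timeit
from numpy.random import rand

my_array = rand(int(1e6), 1)   # smaller than 5e7 so the eager versions finish quickly
func = lambda x: x ** 2

# List comprehension: eagerly computes every element.
t_iter = timeit(lambda: [v ** 2 for v in my_array], number=1)

# Bare map(): only constructs the lazy iterator, no squaring happens yet.
t_map = timeit(lambda: map(func, my_array), number=1000) / 1000

# list(map(...)): forces every element to be computed.
t_list_map = timeit(lambda: list(map(func, my_array)), number=1)

print(t_iter, t_map, t_list_map)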