I have an input of 36,742 points, which means that if I wanted to calculate the lower triangle of a distance matrix (using the Vincenty approximation) I would need to generate 36,742 * 36,741 * 0.5 = 674,968,911 distances.
I want to keep the pair combinations which are within 50 km of each other. My current set-up is as follows:
import time
from geopy.distance import vincenty

shops = [[id, lat, lon], ...]  # one row per shop

def lower_triangle_mat(points):
    for i in range(len(points) - 1):
        for j in range(i + 1, len(points)):
            yield [points[i], points[j]]

def return_stores_cutoff(points, cutoff_km=0):
    below_cut = []
    counter = 0
    for x in lower_triangle_mat(points):
        dist_km = vincenty(x[0][1:3], x[1][1:3]).km
        counter += 1
        if counter % 1000000 == 0:
            print("%d out of %d" % (counter, len(points) * (len(points) - 1) * 0.5))
        if dist_km <= cutoff_km:
            below_cut.append([x[0][0], x[1][0], dist_km])
    return below_cut
start = time.clock()
stores = return_stores_cutoff(points=shops,cutoff_km=50)
print(time.clock() - start)
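For reference, lower_triangle_mat can also be written with itertools.combinations, which yields the same pairs without the index bookkeeping (it yields tuples rather than two-element lists, but the indexing above handles both identically):

from itertools import combinations

def lower_triangle_mat(points):
    # Same pairs as the nested-loop version, yielded as tuples.
    return combinations(points, 2)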
This will obviously take hours and hours, so I started thinking about ways to speed it up.
Edit: I think geohashing is definitely needed here. An example using the geoindex package:
import random
from geoindex import GeoGridIndex, GeoPoint

geo_index = GeoGridIndex()
for _ in range(10000):
    lat = random.random() * 180 - 90
    lng = random.random() * 360 - 180
    geo_index.add_point(GeoPoint(lat, lng))
center_point = GeoPoint(37.7772448, -122.3955118)
for distance, point in geo_index.get_nearest_points(center_point, 10, 'km'):
    print("We found {0} in {1} km".format(point, distance))
However, I would also like to vectorise (instead of loop) the distance calculations for the stores returned by the geo-hash.
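For instance, a minimal sketch of what that vectorised step could look like, assuming haversine in place of Vincenty (an assumption on my part: haversine is a closed-form formula that broadcasts over NumPy arrays, and its sub-percent error is irrelevant at a 50 km cutoff):

import numpy as np

def haversine_km(lat1, lon1, lat2, lon2):
    # All inputs in degrees; any argument may be a NumPy array.
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * np.arcsin(np.sqrt(a))

With that, the distances from one shop to every candidate returned by the geohash lookup come from a single call, e.g. haversine_km(lat0, lon0, cand_lats, cand_lons).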
Edit 2: Following Pouria Hadjibagheri's suggestion, I tried using lambda and map:
# [B]: Mapping approach
lwr_tr_mat = ((shops[i], shops[j]) for i in range(len(shops) - 1) for j in range(i + 1, len(shops)))
func = lambda x: (x[0][0], x[1][0], vincenty(x[0][1:3], x[1][1:3]).km)
# Trying to see if conditional statements slow this down
func_cond = lambda x: (x[0][0], x[1][0], vincenty(x[0][1:3], x[1][1:3]).km) if vincenty(x[0][1:3], x[1][1:3]).km <= 50 else None

start = time.clock()
out_dist = list(map(func, lwr_tr_mat))
print(time.clock() - start)

# Re-create the generator: the first map() exhausted it
lwr_tr_mat = ((shops[i], shops[j]) for i in range(len(shops) - 1) for j in range(i + 1, len(shops)))
start = time.clock()
out_dist = list(map(func_cond, lwr_tr_mat))
print(time.clock() - start)
And they were all around 61 seconds (I restricted the number of stores to 2,000 from 32,000). Perhaps I used map incorrectly?
This sounds like a classic use case for k-D trees.
If you first transform your points into Euclidean space then you can use the query_pairs method of scipy.spatial.cKDTree:
from scipy.spatial import cKDTree
tree = cKDTree(data)
# where data is (nshops, ndim) containing the Euclidean coordinates of each shop
# in units of km
pairs = tree.query_pairs(50, p=2) # 50km radius, L2 (Euclidean) norm
pairs will be a set of (i, j) tuples corresponding to the row indices of pairs of shops that are ≤ 50 km from each other.
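The one step this leaves implicit is the transform into Euclidean space. A minimal sketch, assuming a spherical Earth: map each (lat, lon) to 3-D Earth-centred Cartesian coordinates in km, so the tree's straight-line (chord) distances stand in for great-circle distances (the chord is shorter than the arc, but at 50 km the difference is well under a metre):

import numpy as np

def latlon_to_ecef_km(lat_deg, lon_deg, radius_km=6371.0):
    # Spherical-Earth approximation: degrees -> 3-D Cartesian km.
    lat, lon = np.radians(lat_deg), np.radians(lon_deg)
    return np.column_stack([
        radius_km * np.cos(lat) * np.cos(lon),
        radius_km * np.cos(lat) * np.sin(lon),
        radius_km * np.sin(lat),
    ])

coords = np.array([[s[1], s[2]] for s in shops], dtype=float)  # shops as in the question
data = latlon_to_ecef_km(coords[:, 0], coords[:, 1])           # feed this to cKDTree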
Alternatively, the output of tree.sparse_distance_matrix is a scipy.sparse.dok_matrix. Since the matrix will be symmetric and you're only interested in unique row/column pairs, you could use scipy.sparse.tril to zero out the upper triangle, giving you a scipy.sparse.coo_matrix. From there you can access the nonzero row and column indices and their corresponding distance values via its .row, .col and .data attributes:
from scipy import sparse

tree_dist = tree.sparse_distance_matrix(tree, max_distance=50, p=2)  # distances in km
udist = sparse.tril(tree_dist, k=-1)  # strictly lower triangle: drops the diagonal and the upper triangle
ridx = udist.row   # row indices
cidx = udist.col   # column indices
dist = udist.data  # distance values
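To get back to the question's [id_i, id_j, dist_km] triples, those indices can be mapped through the original shops list (assuming shops as defined in the question):

below_cut = [[shops[i][0], shops[j][0], d]
             for i, j, d in zip(ridx, cidx, dist)]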
Have you tried mapping entire arrays and functions instead of iterating through them? An example would be as follows:
from numpy.random import rand
my_array = rand(int(5e7), 1)  # An array of 50,000,000 random doubles.
Now what is normally done is:
squared_list_iter = [value**2 for value in my_array]
This works, of course, but is far from optimal.
The alternative would be to map the array with a function. This is done as follows:
func = lambda x: x**2  # Here is what I want to do on my array.
squared_list_map = map(func, my_array)  # Here I am doing it!
Now, one might ask: how is this any different, or even better for that matter, given that we have added a function call too? Here is your answer:
For the former solution (via iteration):
1 loop: 1.11 minutes.
Compared to the latter solution (mapping):
500 loops, 560 ns per loop on average.
Converting the map() to a list via list(map(func, my_array)) increases the time by a factor of ten, to approximately 500 ms.
You choose!
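If you want to reproduce the comparison yourself, here is a rough timing sketch (the array size is scaled down so the eager runs finish quickly; note that a bare map() in Python 3 only builds a lazy iterator, which is why its timing comes out in nanoseconds):

from timeit import timeit
from numpy.random import rand

my_array = rand(int(1e6), 1)   # smaller than 5e7 so the eager versions finish quickly
func = lambda x: x ** 2

# List comprehension: eagerly computes every element.
t_iter = timeit(lambda: [v ** 2 for v in my_array], number=1)

# Bare map(): only constructs the lazy iterator, no squaring happens yet.
t_map = timeit(lambda: map(func, my_array), number=1000) / 1000

# list(map(...)): forces every element to be computed.
t_list_map = timeit(lambda: list(map(func, my_array)), number=1)

print(t_iter, t_map, t_list_map)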