 

speeding up processing 5 million rows of coordinate data

Tags:

python

csv

geopy

I have a CSV file with two columns (latitude, longitude) that contains over 5 million rows of geolocation data. I need to identify the points which are not within 5 miles of any other point in the list, and output everything back into another CSV that has an extra column (CloseToAnotherPoint) which is True if another point is within 5 miles, and False if there isn't.

Here is my current solution using geopy (not making any web calls, just using the function to calculate distance):

from geopy.point import Point
from geopy.distance import vincenty  # note: removed in geopy 2.0; newer code should use geodesic
import csv


class CustomGeoPoint(object):
    def __init__(self, latitude, longitude):
        self.location = Point(latitude, longitude)
        self.close_to_another_point = False


try:
    output = open('output.csv', 'w', newline='')
    writer = csv.writer(output, delimiter=',', quoting=csv.QUOTE_ALL)
    writer.writerow(['Latitude', 'Longitude', 'CloseToAnotherPoint'])

    # 5 miles
    close_limit = 5
    geo_points = []

    with open('geo_input.csv', newline='') as geo_csv:
        reader = csv.reader(geo_csv)
        next(reader, None) # skip the headers
        for row in reader:
            geo_points.append(CustomGeoPoint(float(row[0]), float(row[1])))

    # for every point, look at every point until one is found within 5 miles
    for geo_point in geo_points:
        for geo_point2 in geo_points:
            dist = vincenty(geo_point.location, geo_point2.location).miles
            if 0 < dist <= close_limit: # (0,close_limit]
                geo_point.close_to_another_point = True
                break
        writer.writerow([geo_point.location.latitude, geo_point.location.longitude,
                         geo_point.close_to_another_point])

finally:
    output.close()

As you might be able to tell from looking at it, this solution is extremely slow. So slow, in fact, that I let it run for 3 days and it still didn't finish!

I've thought about trying to split up the data into chunks (multiple CSV files or something) so that the inner loop doesn't have to look at every other point, but then I would have to figure out how to make sure the borders of each section were checked against the borders of adjacent sections, and that just seems overly complex; I'm afraid it would be more of a headache than it's worth.

So any pointers on how to make this faster?

asked Mar 14 '16 by secondbreakfast

4 Answers

Let's look at what you're doing.

  1. You read all the points into a list named geo_points.

    Now, can you tell me whether the list is sorted? Because if it was sorted, we definitely want to know that. Sorting is valuable information, especially when you're dealing with 5 million of anything.

  2. You loop over all the geo_points. That's 5 million, according to you.

  3. Within the outer loop, you loop again over all 5 million geo_points.

  4. You compute the distance in miles between the two loop items.

  5. If the distance is less than your threshold, you record that information on the first point, and stop the inner loop.

  6. When the inner loop stops, you write information about the outer loop item to a CSV file.

Notice a couple of things. First, you're looping 5 million times in the outer loop. And then you're looping 5 million times in the inner loop.

This is what O(n²) means.

The next time you see someone talking about "Oh, this is O(log n) but that other thing is O(n log n)," remember this experience - you're running an n² algorithm where n in this case is 5,000,000. Sucks, dunnit?

Anyway, you have some problems.

Problem 1: You wind up comparing every point against itself. Your 0 < dist guard keeps a point from matching itself, but you still pay for 5 million useless self-comparisons, and the same guard means two distinct points sitting at identical coordinates will never count as close to each other.

Problem 2: When you compare point #1 with, say, point #12345, and they are within the threshold distance of each other, you are recording that information about point #1. But you don't record the same information about the other point. The relation is symmetric: you know that point #12345 (geo_point2) is within the threshold of point #1, but you don't write that down. So you're missing a chance to simply skip over millions of comparisons.

Problem 3: If you compare point #1 and point #2, and they are not within the threshold distance, what happens when you compare point #2 with point #1? Your inner loop is starting from the beginning of the list every time, but you know that you have already compared the start of the list with the end of the list. You can reduce your problem space by half just by making your outer loop go i in range(0, 5million) and your inner loop go j in range(i+1, 5million).
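That halved loop structure is just this skeleton (the actual comparison is elided):

n = len(geo_points)
for i in range(n):
    for j in range(i + 1, n):
        # Each unordered pair (i, j) is visited exactly once,
        # and no point is ever compared with itself.
        ...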

Answers?

Consider your latitude and longitude on a flat plane. You want to know if there's a point within 5 miles. Let's think about a 10 mile square, centered on your point #1. That's a square centered on (X1, Y1), with a top left corner at (X1 - 5miles, Y1 + 5miles) and a bottom right corner at (X1 + 5miles, Y1 - 5miles). Now, if a point is within that square, it might not be within 5 miles of your point #1. But you can bet that if it's outside that square, it's more than 5 miles away.

As @SeverinPappadeaux points out, distance on a spheroid like Earth is not quite the same as distance on a flat plane. But so what? Set your square a little bigger to allow for the difference, and proceed!
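To make that concrete, here's a minimal sketch of such an oversized bounding box (the helper name and the 5% fudge factor are my illustration, not a surveyed value):

from math import cos, radians

MILES_PER_DEG_LAT = 69.0   # roughly constant everywhere on Earth
FUDGE = 1.05               # oversize the square to absorb spheroid error

def bounding_box(lat, lon, miles=5.0):
    # Return (lat_min, lat_max, lon_min, lon_max) of a square guaranteed
    # to contain every point within `miles` of (lat, lon).
    dlat = miles / MILES_PER_DEG_LAT * FUDGE
    # A degree of longitude shrinks by cos(latitude); use the latitude
    # farthest from the equator inside the box to stay conservative.
    dlon = dlat / max(cos(radians(min(abs(lat) + dlat, 89.9))), 1e-6)
    return lat - dlat, lat + dlat, lon - dlon, lon + dlon

Anything outside that box is certainly more than 5 miles away; only points inside it need the expensive distance check.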

Sorted List

This is why sorting is important. If all the points were sorted by X, then Y (or Y, then X - whatever) and you knew it, you could really speed things up. Because you could simply stop scanning when the X (or Y) coordinate got too big, and you wouldn't have to go through 5 million points.

How would that work? Same way as before, except your inner loop would have some checks like this:

five_miles = ... # Whatever math, plus an error allowance!
list_len = len(geo_points) # Don't call len() 5 million times

for i, pi in enumerate(geo_points):

    if pi.close_to_another_point:
        continue   # Already flagged by an earlier point

    pi_lat_max = pi.lat + five_miles
    pi_lon_min = pi.lon - five_miles
    pi_lon_max = pi.lon + five_miles

    for j in range(i+1, list_len):
        pj = geo_points[j]
        # Assumes geo_points is sorted by lat, then lon
        if pj.lat > pi_lat_max:
            # Can't possibly be close enough, nor can any later point
            break
        if pj.lon < pi_lon_min or pj.lon > pi_lon_max:
            # Can't be close enough, but a later point might be
            continue

        # Now do the "real" comparison using an accurate distance function.
        if ...:
            pi.close_to_another_point = True
            pj.close_to_another_point = True
            break

What am I doing there? First, I'm getting some numbers into local variables. Then I'm using enumerate to give me an i value and a reference to the outer point (what you called geo_point). Then I'm quickly checking to see if we already know that this point is close to another one.

If not, we'll have to scan. So I'm only scanning "later" points in the list, because I know the outer loop scans the early ones, and I definitely don't want to compare a point against itself. I'm using a few temporary variables to cache the result of computations involving the outer loop. Within the inner loop, I do some stupid comparisons against the temporaries. They can't tell me if the two points are close to each other, but I can check if they're definitely not close and skip ahead.

Finally, if the simple checks pass then go ahead and do the expensive checks. If a check actually passes, be sure to record the result on both points, so we can skip doing the second point later.

Unsorted List

But what if the list is not sorted?

@RootTwo points you at a k-d tree (where "d" is for "dimensional" and k in this case is 2). The idea is really simple, if you already know about binary search trees: you cycle through the dimensions, comparing X at even levels in the tree and comparing Y at odd levels (or vice versa). The idea would be this:

def insert_node(node, treenode, depth=0):
    # Assumes nodes with .coord (a lat/long pair), .left, and .right
    dimension = depth % 2  # even/odd -> lat/long
    dn = node.coord[dimension]
    dt = treenode.coord[dimension]

    if dn < dt:
        # go left
        if treenode.left is None:
            treenode.left = node
        else:
            insert_node(node, treenode.left, depth+1)
    else:
        # go right
        if treenode.right is None:
            treenode.right = node
        else:
            insert_node(node, treenode.right, depth+1)

What would this do? This would get you a searchable tree where points can be inserted in O(log n) time on average (randomly ordered input keeps the tree roughly balanced). That means O(n log n) for the whole list, which is way better than n squared! (The log base 2 of 5 million is basically 23. So n log n is 5 million times 23, compared with 5 million times 5 million!)

It also means you can do a targeted search. Since the tree is ordered, it's fairly straightforward to look for "close" points (the Wikipedia link from @RootTwo provides an algorithm).
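To make the targeted search concrete, here is a hedged sketch (the names and the planar stand-in distance are mine, not from @RootTwo): walk the tree, and prune any subtree whose splitting coordinate alone is farther away than the radius.

def planar_distance(a, b):
    # Stand-in distance for illustration; swap in an accurate
    # geodesic function (scaled to the same units) for real use.
    return ((a.coord[0] - b.coord[0])**2 + (a.coord[1] - b.coord[1])**2) ** 0.5

def find_close(query, treenode, radius, depth=0, hits=None):
    # Collect tree nodes within `radius` of `query`.
    if hits is None:
        hits = []
    if treenode is None:
        return hits

    dimension = depth % 2
    diff = query.coord[dimension] - treenode.coord[dimension]

    # Always search the side of the split that contains the query point.
    near, far = ((treenode.left, treenode.right) if diff < 0
                 else (treenode.right, treenode.left))
    find_close(query, near, radius, depth + 1, hits)

    # This node and the far side can only qualify if the splitting
    # plane is within `radius` of the query.
    if abs(diff) <= radius:
        if treenode is not query and planar_distance(query, treenode) <= radius:
            hits.append(treenode)
        find_close(query, far, radius, depth + 1, hits)
    return hits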

Advice

My advice is to just write code to sort the list, if needed. It's easier to write, and easier to check by hand, and it's a separate pass you will only need to make one time.

Once you have the list sorted, try the approach I showed above. It's close to what you were doing, and it should be easy for you to understand and code.
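Assuming point objects with lat and lon attributes, as in the sketch above, that sort is a one-liner:

geo_points.sort(key=lambda p: (p.lat, p.lon))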

answered by aghast

As the answer to Python calculate lots of distances quickly points out, this is a classic use case for k-D trees.

An alternative is to use a sweep line algorithm, as shown in the answer to How do I match similar coordinates using Python?

Here's the sweep line algorithm adapted for your question. On my laptop, it takes < 5 minutes to run through 5M random points.

import operator as op
import sortedcontainers     # handy library on PyPI
import time

from collections import namedtuple
from math import cos, radians
from random import uniform

Point = namedtuple("Point", "lat long has_close_neighbor")

miles_per_degree = 69

number_of_points = 5000000
data = [Point(uniform( -88.0,  88.0),     # lat
              uniform(-180.0, 180.0),     # long
              True                        # assume a close neighbor; cleared below
             )
        for _ in range(number_of_points)
       ]

start = time.time()
# Note: lat is first in Point, so data is sorted by .lat then .long.
data.sort()

print(time.time() - start)

# Parameter that determines the size of a sliding latitude window
# and therefore how close two points need to be to get flagged.
threshold = 5.0  # miles
lat_span = threshold / miles_per_degree      # threshold in degrees of latitude
coarse_threshold = (.98 * lat_span)**2       # squared degrees, for the coarse check below

# Sliding latitude window.  Within the window, observations are
# ordered by longitude.
window = sortedcontainers.SortedListWithKey(key=op.attrgetter('long'))

# lag_pt is the 'southernmost' point within the sliding window.
points = iter(data)
lag_pt = next(points)

milepost = len(data)//10

# lead_pt is the 'northernmost' point in the sliding window.
for i, lead_pt in enumerate(data):
    if i == milepost:
        print('.', end=' ')
        milepost += len(data)//10

    # The lat of lead_pt is the leading edge of the window.
    window.add(lead_pt)

    # Remove observations behind the trailing edge of the window.
    while lead_pt.lat - lag_pt.lat > lat_span:
        window.discard(lag_pt)
        lag_pt = next(points)

    # Calculate the 'east-west' width of the window at the latitude of lead_pt
    long_span = lat_span / cos(radians(lead_pt.lat))
    east_long = lead_pt.long + long_span
    west_long = lead_pt.long - long_span

    # Check all observations in the sliding window within
    # long_span of lead_pt.
    for other_pt in window.irange_key(west_long, east_long):

        if other_pt != lead_pt:
            # lead_pt is at the top center of a box 2 * long_span wide
            # by lat_span tall.  other_pt is in that box.  If desired,
            # put additional fine-grained 'closeness' tests here.

            # Coarse check: if this point is within 98% of the threshold
            # distance, we don't need to check any more neighbors.
            average_lat = (other_pt.lat + lead_pt.lat) / 2
            delta_lat   = other_pt.lat - lead_pt.lat
            delta_long  = (other_pt.long - lead_pt.long) * cos(radians(average_lat))

            if delta_lat**2 + delta_long**2 <= coarse_threshold:
                break

            # put vincenty test here
            #if 0 < vincenty(lead_pt, other_pt).miles <= close_limit:
            #    break

    else:
        # Inner loop found nothing close enough: clear the flag.
        data[i] = data[i]._replace(has_close_neighbor=False)

print()      
print(time.time() - start)
answered by RootTwo


If you sort the list by latitude (an O(n log n) operation), and the points are roughly evenly distributed, it will bring it down to about 1000 candidate points near each point in latitude (napkin math, not exact). By only looking at the points that are near in latitude, the runtime goes from n^2 to n*log(n) + .0004n^2. Hopefully this speeds it up enough.
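A minimal sketch of that latitude-band lookup, assuming the list has already been sorted (the names are illustrative, not from the answer):

import bisect

LAT_SPAN = 5.0 / 69.0   # ~5 miles expressed in degrees of latitude

def latitude_band(lats, i):
    # Index range of points whose latitude is within LAT_SPAN of lats[i];
    # `lats` is the sorted list of latitudes, computed once up front.
    lo = bisect.bisect_left(lats, lats[i] - LAT_SPAN)
    hi = bisect.bisect_right(lats, lats[i] + LAT_SPAN)
    return range(lo, hi)

Only the points in that range need a real distance check.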

answered by Oscar Smith


I would give pandas a try. Pandas is made for efficient handling of large amounts of data, so it may at least help with the efficiency of the csv portion. But from the sounds of it, you've got yourself an inherently inefficient problem to solve. You take point 1 and compare it against 4,999,999 other points. Then you take point 2 and compare it against 4,999,998 other points, and so on. Do the math: that's roughly 12.5 trillion comparisons. If you can do 1,000,000 comparisons per second, that's 144 days of computation. If you can do 10,000,000 comparisons per second, that's 14 days. For plain additions in straight Python, 10,000,000 operations can take something like 1.1 seconds, and I doubt your comparisons are as fast as an add operation. So give it at least a fortnight or two.
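For the CSV portion, a hedged sketch of what the pandas version might look like (the column names are assumed to match the question's header row, and the flag computation is a placeholder for one of the algorithms above):

import numpy as np
import pandas as pd

df = pd.read_csv('geo_input.csv')   # parses 5M rows far faster than a csv.reader loop
lats = df['Latitude'].to_numpy()
lons = df['Longitude'].to_numpy()

# Placeholder: compute the real flags with one of the algorithms above.
flags = np.zeros(len(df), dtype=bool)

df['CloseToAnotherPoint'] = flags
df.to_csv('output.csv', index=False)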

Alternatively, you could come up with a different algorithm, though I don't have any particular one in mind.

answered by Mike Lane