Logo Questions Linux Laravel Mysql Ubuntu Git Menu

How to avoid running a deadly function 30,000 times in a list comprehension?

I have two lists of identical length. I want to check conditions in one list. If the conditions are true then run a very memory/processing intensive function on the other list.

My first attempt was as so:

records = [(a, deadly_func(b)) for a, b in zip(listA, listB) if a == "condition"]

This immediately allocated all the memory on my desktop and went on for a while before I killed it. Evidently, it ran deadly_func(b) on all 30,000 items in listB, whereas the intention was to use the 'if' statement to filter listB down to about 30 items.

I was able to make a working version with:

records = [(a, i) for a, i in zip(listA, range(len(listB)) if a == "condition"]
records = [(a, deadly_func(listB[i]) for a, i in records] 

Why did my first attempt not work? Is there a more pythonic way to make this work?

Edit: Thank you for the responses. Here is the actual code for both versions

Did not work:

import shapefile, shapely.geometry as shpgeo

lat = 42.3968243
lon = -71.0313479

sf = shapefile.Reader("/opt/ziplfs/tl_2014_us_zcta510.shp")

records = [(r[0], shpgeo.shape(s.__geo_interface__)) for r, s in zip(sf.records(), sf.shapes()) if haversine(lon, lat, float(r[8]), float(r[7])) < 10]

haversine() is a user-made haversine function, taking two pairs of lat and long and returns a distance in km.

from math import sqrt, sin, cos, radians, asin
def haversine(lon1, lat1, lon2, lat2):
    Calculate the great circle distance between two points 
    on the earth (specified in decimal degrees). Return is in kilometers
    # convert decimal degrees to radians 
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])

    # haversine formula 
    dlon = lon2 - lon1 
    dlat = lat2 - lat1 
    a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
    c = 2 * asin(sqrt(a)) 
    r = 6371 # Radius of earth in kilometers. Use 3956 for miles
    return c * r

The shapefile ('tl_2014_us_zcta510.shp') is all zipcodes in the US, from the Census Bureau. Download here if you really love shapefiles, and have 800 MB on your hard drive you don't know what to do with.

This script should return a list of tuples representing all zipcodes in the US with a centroid within 10 km of Chelsea, MA.

For the working version, replace the records line with:

records = [(r[0], i) for r, i in zip(sf.records(), range(len(sf.records()))) if haversine(lon, lat, float(r[8]), float(r[7])) < 10]
shapes = [shpgeo.shape(sf.shape(i).__geo_interface__) for r, i in records]

I did some timing tests. The 'non-working' version:

$ python test.py 
Time Elapsed: 0:00:14.221533
$ python test.py 
Time Elapsed: 0:00:14.637827
$ python test.py 
Time Elapsed: 0:00:14.253425

and the working version:

$ python test.py 
Time Elapsed: 0:00:01.887987
$ python test.py 
Time Elapsed: 0:00:01.886635
$ python test.py 
Time Elapsed: 0:00:01.982547

Maybe not 'deadly' per say, but significant when you repeat 30k times.

like image 590
kingledion Avatar asked Mar 07 '23 22:03


1 Answers

No repro? This code does not run deadly_func on all the elements of listB. Just the ones where the corresponding listA value is True:

listA = [True, False, True, False]
listB = [1, 2, 3, 4]

def deadly_func(x):
    print("Called with {}".format(x))
    return x

print([(a, deadly_func(b)) for a, b in zip(listA, listB) if a])

# Output:
# Called with 1
# Called with 3
# [(True, 1), (True, 3)]


Based on the updated question, my guess is that sf.shapes() is the expensive part. Thus calling sf.shape(i) on only the subset of elements you want is more efficient.

If my guess is correct, this should do the trick:

records = [(r[0], shpgeo.shape(sf.shape(i).__geo_interface__)) for i, r in enumerate(sf.records()) if haversine(lon, lat, float(r[8]), float(r[7])) < 10]

(Of course, this is mostly what you already did.)

like image 199
user94559 Avatar answered Mar 18 '23 02:03