I have two lists of identical length. I want to check conditions in one list. If the conditions are true then run a very memory/processing intensive function on the other list.
My first attempt was as so:
records = [(a, deadly_func(b)) for a, b in zip(listA, listB) if a == "condition"]
This immediately allocated all the memory on my desktop and went on for a while before I killed it. Evidently, it ran deadly_func(b) on all 30,000 items in listB, whereas the intention was to use the 'if' statement to filter listB down to about 30 items.
I was able to make a working version with:
records = [(a, i) for a, i in zip(listA, range(len(listB)) if a == "condition"]
records = [(a, deadly_func(listB[i]) for a, i in records]
Why did my first attempt not work? Is there a more pythonic way to make this work?
Edit: Thank you for the responses. Here is the actual code for both versions
Did not work:
import shapefile, shapely.geometry as shpgeo
lat = 42.3968243
lon = -71.0313479
sf = shapefile.Reader("/opt/ziplfs/tl_2014_us_zcta510.shp")
records = [(r[0], shpgeo.shape(s.__geo_interface__)) for r, s in zip(sf.records(), sf.shapes()) if haversine(lon, lat, float(r[8]), float(r[7])) < 10]
haversine() is a user-made haversine function, taking two pairs of lat and long and returns a distance in km.
from math import sqrt, sin, cos, radians, asin
def haversine(lon1, lat1, lon2, lat2):
"""
Calculate the great circle distance between two points
on the earth (specified in decimal degrees). Return is in kilometers
"""
# convert decimal degrees to radians
lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
# haversine formula
dlon = lon2 - lon1
dlat = lat2 - lat1
a = sin(dlat/2)**2 + cos(lat1) * cos(lat2) * sin(dlon/2)**2
c = 2 * asin(sqrt(a))
r = 6371 # Radius of earth in kilometers. Use 3956 for miles
return c * r
The shapefile ('tl_2014_us_zcta510.shp') is all zipcodes in the US, from the Census Bureau. Download here if you really love shapefiles, and have 800 MB on your hard drive you don't know what to do with.
This script should return a list of tuples representing all zipcodes in the US with a centroid within 10 km of Chelsea, MA.
For the working version, replace the records line with:
records = [(r[0], i) for r, i in zip(sf.records(), range(len(sf.records()))) if haversine(lon, lat, float(r[8]), float(r[7])) < 10]
shapes = [shpgeo.shape(sf.shape(i).__geo_interface__) for r, i in records]
I did some timing tests. The 'non-working' version:
$ python test.py
Time Elapsed: 0:00:14.221533
$ python test.py
Time Elapsed: 0:00:14.637827
$ python test.py
Time Elapsed: 0:00:14.253425
and the working version:
$ python test.py
Time Elapsed: 0:00:01.887987
$ python test.py
Time Elapsed: 0:00:01.886635
$ python test.py
Time Elapsed: 0:00:01.982547
Maybe not 'deadly' per say, but significant when you repeat 30k times.
No repro? This code does not run deadly_func
on all the elements of listB
. Just the ones where the corresponding listA
value is True
:
listA = [True, False, True, False]
listB = [1, 2, 3, 4]
def deadly_func(x):
print("Called with {}".format(x))
return x
print([(a, deadly_func(b)) for a, b in zip(listA, listB) if a])
# Output:
# Called with 1
# Called with 3
# [(True, 1), (True, 3)]
EDIT
Based on the updated question, my guess is that sf.shapes()
is the expensive part. Thus calling sf.shape(i)
on only the subset of elements you want is more efficient.
If my guess is correct, this should do the trick:
records = [(r[0], shpgeo.shape(sf.shape(i).__geo_interface__)) for i, r in enumerate(sf.records()) if haversine(lon, lat, float(r[8]), float(r[7])) < 10]
(Of course, this is mostly what you already did.)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With