I'm looking to geocode a large list of addresses using multiprocessing. I have the following code:
import multiprocessing
import geocoder

addresses = ['New York City, NY', 'Austin, TX', 'Los Angeles, CA', 'Boston, MA']  # and on and on

def geocode_worker(address):
    return geocoder.arcgis(address)

def main_process():
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    return pool.map(geocode_worker, addresses)

if __name__ == '__main__':
    main_process()
But it gives me this error:
Traceback (most recent call last):
  File "/opt/anaconda3/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/opt/anaconda3/lib/python3.7/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/opt/anaconda3/lib/python3.7/multiprocessing/pool.py", line 470, in _handle_results
    task = get()
  File "/opt/anaconda3/lib/python3.7/multiprocessing/connection.py", line 251, in recv
    return _ForkingPickler.loads(buf.getbuffer())
  File "/opt/anaconda3/lib/python3.7/site-packages/geocoder/base.py", line 599, in __getattr__
    if not self.ok:
  File "/opt/anaconda3/lib/python3.7/site-packages/geocoder/base.py", line 536, in ok
    return len(self) > 0
  File "/opt/anaconda3/lib/python3.7/site-packages/geocoder/base.py", line 422, in __len__
    return len(self._list)
The last three lines of the traceback repeat over and over, and the final line is:
RecursionError: maximum recursion depth exceeded while calling a Python object
Can anyone help me figure out why?
The problem is that the ArcgisQuery object returned by geocoder is not picklable - or rather, it's not unpicklable. Unpickling recreates the object without calling __init__, so self._list is never set. Any attribute access on the half-built object then falls through to the class's __getattr__, which checks self.ok; ok in turn calls len(self), which reads self._list. Since _list is still missing, that lookup falls through to __getattr__ again, and the cycle repeats until Python's recursion limit is exceeded.
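You can reproduce the recursion without geocoder at all. The Fragile class below is a hypothetical toy that mimics the pattern in geocoder/base.py (an attribute only set in __init__, a __getattr__ that consults self.ok, and an ok that needs that attribute):

```python
import pickle

class Fragile:
    """Hypothetical toy class mimicking geocoder's ArcgisQuery lookup pattern."""

    def __init__(self):
        self._list = []          # only ever set here; unpickling skips __init__

    def __getattr__(self, name):
        # Runs only when normal lookup fails. Mirroring geocoder/base.py, it
        # checks self.ok first; ok needs self._list, which does not exist yet
        # during unpickling, so looking up '_list' re-enters __getattr__.
        if not self.ok:
            raise AttributeError(name)
        return None              # (the real class returns a parsed field here)

    @property
    def ok(self):
        return len(self._list) > 0

data = pickle.dumps(Fragile())   # pickling itself succeeds
try:
    pickle.loads(data)           # unpickling recurses while looking for __setstate__
except RecursionError:
    print('RecursionError: maximum recursion depth exceeded')
```

Unpickling fails before the state dict is applied: pickle looks up __setstate__ on the fresh instance, that lookup falls into __getattr__, and the loop from the traceback begins.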
You can work around this by not passing the ArcgisQuery object itself from the worker processes back to the main process. Instead, pass just its underlying __dict__, then rebuild the ArcgisQuery objects in your main process:
import multiprocessing
import geocoder
from geocoder.arcgis import ArcgisQuery

addresses = ['New York City, NY', 'Austin, TX', 'Los Angeles, CA', 'Boston, MA']  # and on and on

def geocode_worker(address):
    out = geocoder.arcgis(address)
    return out.__dict__  # Only return the object's __dict__, which pickles cleanly

def main_process():
    pool = multiprocessing.Pool(processes=multiprocessing.cpu_count())
    l = pool.map(geocode_worker, addresses)
    out = []
    for d in l:
        q = ArcgisQuery(d['location'])  # location is a required constructor arg
        q.__dict__.update(d)            # Load the rest of our state into the new object
        out.append(q)
    return out

if __name__ == '__main__':
    print(main_process())
If you don't actually need the whole ArcgisQuery object and only need some parts of it, you could just return those parts from the worker processes and avoid this hack entirely.
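For example, if all you need are the coordinates, the worker can return plain Python types, which pickle without any trouble. In this sketch the extract_fields helper and its field choices are hypothetical (latlng and ok are attributes geocoder result objects expose); a stand-in object replaces the live geocoder.arcgis call just to show the shape:

```python
from types import SimpleNamespace

def extract_fields(address, result):
    """Pull only plain, picklable fields off a geocoder result.

    In the worker you would call this as:
        return extract_fields(address, geocoder.arcgis(address))
    """
    return {'address': address, 'latlng': result.latlng, 'ok': result.ok}

# Stand-in for a live geocoder result, just to show the returned shape:
fake = SimpleNamespace(latlng=[30.26, -97.74], ok=True)
print(extract_fields('Austin, TX', fake))
```

Since the worker now returns an ordinary dict of builtins, pool.map can ship the results back to the main process with no rebuilding step.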
For what it's worth, it looks like geocoder could fix its pickling problem by implementing __getstate__ and __setstate__ on ArcgisQuery or its base class, like this:
def __getstate__(self):
    return self.__dict__

def __setstate__(self, state):
    self.__dict__.update(state)