I have a large list of sentences (~7 millions), and I want to extract the nouns from them.
I used joblib
library to parallelize the extracting process, like in the following:
import spacy
from tqdm import tqdm
from joblib import Parallel, delayed
nlp = spacy.load('en_core_web_sm')
class nouns:
def get_nouns(self, text):
doc = nlp(u"{}".format(text))
return [token.text for token in doc if token.tag_ in ['NN', 'NNP', 'NNS', 'NNPS']]
def parallelize(self, sentences):
results = Parallel(n_jobs=1)(delayed(self.get_nouns)(sent) for sent in tqdm(sentences))
return results
if __name__ == '__main__':
sentences = ['we went to the school yesterday',
'The weather is really cold',
'Can we catch the dog?',
'How old are you John?',
'I like diving and swimming',
'Can the world become united?']
obj = nouns()
print(obj.parallelize(sentences))
when n_jobs
in parallelize function is more than 1, I get this long error:
100%|██████████| 6/6 [00:00<00:00, 200.00it/s]
joblib.externals.loky.process_executor._RemoteTraceback:
"""
Traceback (most recent call last):
File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\queues.py", line 150, in _feed
obj_ = dumps(obj, reducers=reducers)
File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 243, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 236, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 267, in dump
return Pickler.dump(self, obj)
File "C:\Python35\lib\pickle.py", line 408, in dump
self.save(obj)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 841, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 770, in save_list
self._batch_appends(obj)
File "C:\Python35\lib\pickle.py", line 797, in _batch_appends
save(tmp[0])
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 725, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 718, in save_instancemethod
self.save_reduce(types.MethodType, (obj.__func__, obj.__self__), obj=obj)
File "C:\Python35\lib\pickle.py", line 599, in save_reduce
save(args)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 725, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 395, in save_function
self.save_function_tuple(obj)
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 594, in save_function_tuple
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 841, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 599, in save_reduce
save(args)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 740, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 740, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 495, in save
rv = reduce(self.proto)
File "stringsource", line 2, in preshed.maps.PreshMap.__reduce_cython__
TypeError: self.c_map cannot be converted to a Python object for pickling
"""Exception in thread QueueFeederThread:
Traceback (most recent call last):
File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\queues.py", line 150, in _feed
obj_ = dumps(obj, reducers=reducers)
File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 243, in dumps
dump(obj, buf, reducers=reducers, protocol=protocol)
File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\reduction.py", line 236, in dump
_LokyPickler(file, reducers=reducers, protocol=protocol).dump(obj)
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 267, in dump
return Pickler.dump(self, obj)
File "C:\Python35\lib\pickle.py", line 408, in dump
self.save(obj)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 841, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 770, in save_list
self._batch_appends(obj)
File "C:\Python35\lib\pickle.py", line 797, in _batch_appends
save(tmp[0])
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 725, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 718, in save_instancemethod
self.save_reduce(types.MethodType, (obj.__func__, obj.__self__), obj=obj)
File "C:\Python35\lib\pickle.py", line 599, in save_reduce
save(args)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 725, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 395, in save_function
self.save_function_tuple(obj)
File "C:\Python35\lib\site-packages\joblib\externals\cloudpickle\cloudpickle.py", line 594, in save_function_tuple
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 841, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 810, in save_dict
self._batch_setitems(obj.items())
File "C:\Python35\lib\pickle.py", line 836, in _batch_setitems
save(v)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 599, in save_reduce
save(args)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 740, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 520, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python35\lib\pickle.py", line 623, in save_reduce
save(state)
File "C:\Python35\lib\pickle.py", line 475, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python35\lib\pickle.py", line 740, in save_tuple
save(element)
File "C:\Python35\lib\pickle.py", line 495, in save
rv = reduce(self.proto)
File "stringsource", line 2, in preshed.maps.PreshMap.__reduce_cython__
TypeError: self.c_map cannot be converted to a Python object for pickling
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Python35\lib\threading.py", line 914, in _bootstrap_inner
self.run()
File "C:\Python35\lib\threading.py", line 862, in run
self._target(*self._args, **self._kwargs)
File "C:\Python35\lib\site-packages\joblib\externals\loky\backend\queues.py", line 175, in _feed
onerror(e, obj)
File "C:\Python35\lib\site-packages\joblib\externals\loky\process_executor.py", line 310, in _on_queue_feeder_error
self.thread_wakeup.wakeup()
File "C:\Python35\lib\site-packages\joblib\externals\loky\process_executor.py", line 155, in wakeup
self._writer.send_bytes(b"")
File "C:\Python35\lib\multiprocessing\connection.py", line 183, in send_bytes
self._check_closed()
File "C:\Python35\lib\multiprocessing\connection.py", line 136, in _check_closed
raise OSError("handle is closed")
OSError: handle is closed
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
File ".../playground.py", line 43, in <module>
print(obj.Paralize(sentences))
File ".../playground.py", line 32, in Paralize
results = Parallel(n_jobs=2)(delayed(self.get_nouns)(sent) for sent in tqdm(sentences))
File "C:\Python35\lib\site-packages\joblib\parallel.py", line 934, in __call__
self.retrieve()
File "C:\Python35\lib\site-packages\joblib\parallel.py", line 833, in retrieve
self._output.extend(job.get(timeout=self.timeout))
File "C:\Python35\lib\site-packages\joblib\_parallel_backends.py", line 521, in wrap_future_result
return future.result(timeout=timeout)
File "C:\Python35\lib\concurrent\futures\_base.py", line 405, in result
return self.__get_result()
File "C:\Python35\lib\concurrent\futures\_base.py", line 357, in __get_result
raise self._exception
_pickle.PicklingError: Could not pickle the task to send it to the workers.
What is the problem in my code?
Joblib is a set of tools to provide lightweight pipelining in Python. In particular: transparent disk-caching of functions and lazy re-evaluation (memoize pattern) easy simple parallel computing.
TL;DR - it preserves order for both backends. Extending @Chris Farr's answer, I implemented a simple test. I make a function wait for some random amount of time (you can check these wait times are not identical). I get that the order is preserved every time, with both backends.
The delayed function is a simple trick to be able to create a tuple (function, args, kwargs) with a function-call syntax. Warning. Under Windows, the use of multiprocessing. Pool requires to protect the main loop of code to avoid recursive spawning of subprocesses when using joblib.
Below is a list of simple steps to use "Joblib" for parallel computing. Wrap normal python function calls into delayed() method of joblib. Create Parallel object with a number of processes/threads to use for parallel computing. Pass the list of delayed wrapped functions to an instance of Parallel.
Same issue. I solved by changing the backend from loky
to threading
in Parallel
.
Q: What is the problem in my code?
Well, most probably the issue comes not from the code, but from the "hidden" processing, that appears, once n_jobs
directs ( and joblib
internally orchestrates ) to prepare that many exact copies of the main process, so as to let them work independently one of each other ( effectively thus escaping from GIL-locking and mapping the multiple process-flows onto physical hardware resources )
This step is responsible for making copies of all pythonic objects and was known to use Pickle
for doing this. The Pickle
module was known for its historical principal limitations on what can be pickled and what cannot.
The error message confirms this:
TypeError: self.c_map cannot be converted to a Python object for pickling
One may try a trick to supply Mike McKearns dill
module instead of Pickle
and test, if your "problematic" python objects will get pickled with this module without throwing this error.
dill
has the same API signatures, so a pure import dill as pickle
may help with leaving all the other code the same.
I had the same problems, with large models to get distributed into and back from multiple processes and the dill
was a way to go. Also the performance has increased.
Bonus:
dill
allows to save / restore the full python interpreter state!
This was a cool side-effect of finding dill
, once import dill as pickle
was done, pickle.dump_session( <aFile> )
will save ones complete state-full copy of the python interpreter session. This can be restored, if needed ( post-crash restores, trained trained and optimised ML-model state-fully saved / restored, incremental learning ML-model state-fully saved and re-distributed for remote restores for the deployed user-bases, etc. )
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With