I'm getting the following error when setting the `n_jobs` parameter > 1 for the random forest regressor. If I set `n_jobs=1`, everything works:

```
AttributeError: 'Thread' object has no attribute '_children'
```
I'm running this code in a Flask service. What's interesting is that it does not happen when run outside of the Flask service. I've only reproduced this on a freshly installed Ubuntu box; on my Mac it works just fine.
This is a thread that discussed the same issue, but it didn't seem to go anywhere past the workaround: 'Thread' object has no attribute '_children' - django + scikit-learn
Any thoughts on this?
Here is my test code:
```python
@test.route('/testfun')
def testfun():
    from sklearn.ensemble import RandomForestRegressor
    import numpy as np

    train_data = np.array([[1, 2, 3], [2, 1, 3]])
    target_data = np.array([1, 1])

    model = RandomForestRegressor(n_jobs=2)
    model.fit(train_data, target_data)
    return "yey"
```
Stacktrace:
```
Traceback (most recent call last):
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1836, in __call__
    return self.wsgi_app(environ, start_response)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1820, in wsgi_app
    response = self.make_response(self.handle_exception(e))
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1403, in handle_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1817, in wsgi_app
    response = self.full_dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1477, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1381, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1475, in full_dispatch_request
    rv = self.dispatch_request()
  File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1461, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "/home/vagrant/flask.global-relevance-engine/global_relevance_engine/routes/test.py", line 47, in testfun
    model.fit(train_data, target_data)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/forest.py", line 273, in fit
    for i, t in enumerate(trees))
  File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 574, in __call__
    self._pool = ThreadPool(n_jobs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 685, in __init__
    Pool.__init__(self, processes, initializer, initargs)
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 136, in __init__
    self._repopulate_pool()
  File "/usr/lib/python2.7/multiprocessing/pool.py", line 199, in _repopulate_pool
    w.start()
  File "/usr/lib/python2.7/multiprocessing/dummy/__init__.py", line 73, in start
    self._parent._children[self] = None
AttributeError: 'Thread' object has no attribute '_children'
```
This probably happens due to a bug in `multiprocessing.dummy` (see here and here) that existed before Python 2.7.5 and 3.3.2. See the comments for confirmation that a newer version worked for the OP.
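A quick way to see whether your interpreter already carries the fix is to print the version and inspect the installed `DummyProcess.start` for the `hasattr` guard. This is only a heuristic sketch (the substring check assumes the fix looks like the guarded line shown below), but it avoids opening the stdlib file by hand:

```python
import sys
import inspect
from multiprocessing import dummy

# The bug was fixed in CPython 2.7.5 / 3.3.2.
print(sys.version)

# Heuristic: the fixed start() guards the _children access with hasattr().
source = inspect.getsource(dummy.DummyProcess.start)
print("fixed" if "hasattr" in source else "likely affected")
```

Running this inside the Flask service (rather than a shell on the same box) is the important part, since the two may resolve to different interpreters.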
If you can't upgrade but have access to .../py/Lib/multiprocessing/dummy/__init__.py, edit the `start` method within the `DummyProcess` class as follows (it should be around line 73):

```python
if hasattr(self._parent, '_children'):  # add this line
    self._parent._children[self] = None  # indent this existing line
```
`DummyProcess` is where this bug lives. Let's see where it sits in your imported code to make sure we patch it in the right place. The presence of `DummyProcess` in that call chain guarantees that it has already been imported once `RandomForestRegressor` has been imported.
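Stripping away the Flask and sklearn layers, the failing call boils down to constructing a thread pool from `multiprocessing.dummy`, which you can exercise directly. The snippet below is a minimal stand-in for what joblib's `Parallel` does here (the `grow_tree` function is a hypothetical placeholder, not the OP's code):

```python
# Minimal stand-in for what joblib does when n_jobs=2: it builds a
# multiprocessing.dummy.Pool (a thread pool) whose workers are
# DummyProcess objects; DummyProcess.start() is where the buggy
# _children access happens on old interpreters.
from multiprocessing.dummy import Pool as ThreadPool

def grow_tree(seed):
    # Placeholder for fitting one tree of the forest.
    return seed * 2

pool = ThreadPool(2)
results = pool.map(grow_tree, [1, 2, 3])
pool.close()
pool.join()
print(results)  # [2, 4, 6]
```

On an affected interpreter, the `ThreadPool(2)` line alone should reproduce the `AttributeError` without Flask or sklearn in the picture.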
Also, I think we have access to the `DummyProcess` class before any instances of it are made, so we can patch the class once instead of needing to hunt down instances to patch.
```python
# Let's make it available in our namespace:
from sklearn.ensemble import RandomForestRegressor
from multiprocessing import dummy as __mp_dummy

# Now we can define a replacement and patch DummyProcess:
def __DummyProcess_start_patch(self):  # pulled from an updated version of Python
    assert self._parent is __mp_dummy.current_process()  # modified to avoid further imports
    self._start_called = True
    if hasattr(self._parent, '_children'):
        self._parent._children[self] = None
    __mp_dummy.threading.Thread.start(self)  # modified to avoid further imports

__mp_dummy.DummyProcess.start = __DummyProcess_start_patch
```
Unless I've missed something, every `DummyProcess` instance created from now on will be patched, so that error won't occur.
For anyone making more extensive use of sklearn, you can accomplish this in reverse and make it work for all of sklearn instead of focusing on one module: import `DummyProcess` and patch it as above before you do any sklearn imports. Then sklearn will be using the patched class from the beginning.
Original answer:
As I wrote the comment, I realized I may have found your problem: I think your Flask environment is using an older version of Python.
The reason is that in the latest versions of Python's multiprocessing, the line where you are receiving that error is protected by a condition:

```python
if hasattr(self._parent, '_children'):
    self._parent._children[self] = None
```
It looks like this bug was fixed during the Python 2.7 series (I think from 2.7.5). Perhaps your Flask service is running an older 2.7, or 2.6? Can you check your environment? If you can't update the interpreter, perhaps we can find a way to monkey-patch multiprocessing to keep it from crashing.