
scikit - random forest regressor - AttributeError: 'Thread' object has no attribute '_children'

I get the following error when I set the n_jobs parameter to a value greater than 1 for the random forest regressor. With n_jobs=1, everything works.

AttributeError: 'Thread' object has no attribute '_children'

I'm running this code in a Flask service. Interestingly, the error does not happen when the same code is run outside of the Flask service. I've only reproduced this on a freshly installed Ubuntu box; on my Mac it works just fine.

This thread discusses the same error, but it didn't go anywhere past a workaround: 'Thread' object has no attribute '_children' - django + scikit-learn

Any thoughts on this?

Here is my test code:

    @test.route('/testfun')
    def testfun():
        from sklearn.ensemble import RandomForestRegressor
        import numpy as np

        train_data = np.array([[1,2,3], [2,1,3]])
        target_data = np.array([1,1])

        model = RandomForestRegressor(n_jobs=2)
        model.fit(train_data, target_data)
        return "yey"

Stacktrace:


    Traceback (most recent call last):
      File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1836, in __call__
        return self.wsgi_app(environ, start_response)
      File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1820, in wsgi_app
        response = self.make_response(self.handle_exception(e))
      File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1403, in handle_exception
        reraise(exc_type, exc_value, tb)
      File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1817, in wsgi_app
        response = self.full_dispatch_request()
      File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1477, in full_dispatch_request
        rv = self.handle_user_exception(e)
      File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1381, in handle_user_exception
        reraise(exc_type, exc_value, tb)
      File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1475, in full_dispatch_request
        rv = self.dispatch_request()
      File "/usr/local/lib/python2.7/dist-packages/flask/app.py", line 1461, in dispatch_request
        return self.view_functions[rule.endpoint](**req.view_args)
      File "/home/vagrant/flask.global-relevance-engine/global_relevance_engine/routes/test.py", line 47, in testfun
        model.fit(train_data, target_data)
      File "/usr/local/lib/python2.7/dist-packages/sklearn/ensemble/forest.py", line 273, in fit
        for i, t in enumerate(trees))
      File "/usr/local/lib/python2.7/dist-packages/sklearn/externals/joblib/parallel.py", line 574, in __call__
        self._pool = ThreadPool(n_jobs)
      File "/usr/lib/python2.7/multiprocessing/pool.py", line 685, in __init__
        Pool.__init__(self, processes, initializer, initargs)
      File "/usr/lib/python2.7/multiprocessing/pool.py", line 136, in __init__
        self._repopulate_pool()
      File "/usr/lib/python2.7/multiprocessing/pool.py", line 199, in _repopulate_pool
        w.start()
      File "/usr/lib/python2.7/multiprocessing/dummy/__init__.py", line 73, in start
        self._parent._children[self] = None

HappyCamper asked Sep 30 '15 23:09



1 Answer

The Problem

This is probably caused by a bug in multiprocessing.dummy (see here and here) that existed before Python 2.7.5 and 3.3.2.
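One plausible reason the error shows up only under Flask: multiprocessing.dummy attaches a `_children` attribute to the thread that first imports it, but an ordinary `threading.Thread` has no such attribute. If the service handles requests on plain threads (an assumption about this setup, not something the question confirms), the parent thread lacks `_children` when the pool starts its workers. A minimal sketch of that difference; the helper name `plain_thread_has_children` is made up for illustration:

```python
import threading
from multiprocessing import dummy as mp_dummy  # imported for its side effect on the importing thread


def plain_thread_has_children():
    """Report whether an ordinary threading.Thread carries a `_children` attribute."""
    seen = []

    def probe():
        # Runs inside the plain thread and inspects that thread's own attributes.
        seen.append(hasattr(threading.current_thread(), '_children'))

    worker = threading.Thread(target=probe)
    worker.start()
    worker.join()
    return seen[0]


if __name__ == '__main__':
    print(plain_thread_has_children())
```

On an affected interpreter, starting a DummyProcess from such a thread is the step that raises the AttributeError.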

Solution A - Upgrade Python

See the comments for confirmation that a newer version worked for OP.

Solution B - Modify dummy

If you can't upgrade but have access to .../py/Lib/multiprocessing/dummy/__init__.py, edit the start method within the DummyProcess class as follows (should be ~line 73):

if hasattr(self._parent, '_children'):  # add this line
    self._parent._children[self] = None  # indent this existing line

Solution C - Monkey Patch

DummyProcess is where this bug exists. Let's see where it exists in your imported code to make sure we patch it in the right place.

  • RandomForestRegressor
  • inherits: ForestRegressor
  • inherits: BaseForest
  • created in: sklearn.ensemble.forest
  • which imports: Parallel from sklearn.externals.joblib
  • which imports ThreadPool from multiprocessing.pool
  • which imports and stores Process from multiprocessing.dummy
  • which has been assigned to: DummyProcess also in multiprocessing.dummy

Because DummyProcess sits in that chain, it is guaranteed to have been imported by the time RandomForestRegressor has been imported. I also think we have access to the DummyProcess class before any instances of it are created. Therefore we can patch the class once instead of hunting down instances to patch.
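The last link of that chain can be checked directly with the standard library alone. This is only a sanity-check sketch; the two function names are made up for illustration:

```python
from multiprocessing import dummy as mp_dummy


def process_is_dummy_process():
    """True when multiprocessing.dummy.Process is just another name for DummyProcess."""
    return mp_dummy.Process is mp_dummy.DummyProcess


def start_is_defined_on_dummy_process():
    """True when DummyProcess defines its own start(), the method the patch replaces."""
    return 'start' in vars(mp_dummy.DummyProcess)


if __name__ == '__main__':
    print(process_is_dummy_process(), start_is_defined_on_dummy_process())
```

If both checks hold, replacing `DummyProcess.start` on the class affects every thread-pool worker created afterwards.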

# Let's make it available in our namespace:
from sklearn.ensemble import RandomForestRegressor
from multiprocessing import dummy as __mp_dummy

# Now we can define a replacement and patch DummyProcess:
def __DummyProcess_start_patch(self):  # pulled from an updated version of Python
    assert self._parent is __mp_dummy.current_process()  # modified to avoid further imports
    self._start_called = True
    if hasattr(self._parent, '_children'):
        self._parent._children[self] = None
    __mp_dummy.threading.Thread.start(self)  # modified to avoid further imports
__mp_dummy.DummyProcess.start = __DummyProcess_start_patch

Unless I've missed something, from now on all instances of DummyProcess created will be patched and therefore that error won't occur.

For anyone making more extensive use of sklearn, I think you can accomplish this in reverse and make it work for all of sklearn instead of focusing on one module. You will want to import DummyProcess and patch it as above before you do any sklearn imports. Then sklearn will be using the patched class from the beginning.
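A rough sketch of that reversed order, assuming the same replacement method as above. The names `patched_start`, `square`, and `run_pool_in_plain_thread` are illustrative, and the sklearn import is left as a comment so the snippet runs with the standard library only:

```python
import threading
from multiprocessing import dummy as mp_dummy
from multiprocessing.pool import ThreadPool


def patched_start(self):
    # Same logic as the patch above: only register with the parent
    # thread when it actually has a `_children` mapping.
    assert self._parent is mp_dummy.current_process()
    self._start_called = True
    if hasattr(self._parent, '_children'):
        self._parent._children[self] = None
    threading.Thread.start(self)


# Patch first ...
mp_dummy.DummyProcess.start = patched_start

# ... and only then import anything from sklearn, for example:
# from sklearn.ensemble import RandomForestRegressor


def square(x):
    return x * x


def run_pool_in_plain_thread():
    """Build a ThreadPool from a thread that has no `_children` attribute."""
    results = []

    def work():
        pool = ThreadPool(2)
        try:
            results.extend(pool.map(square, [1, 2, 3]))
        finally:
            pool.close()
            pool.join()

    worker = threading.Thread(target=work)
    worker.start()
    worker.join()
    return results
```

`run_pool_in_plain_thread()` mimics the failing situation (a pool created from an ordinary thread), so it can serve as a quick check that the patch is in effect before sklearn is loaded.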


Original answer:

As I wrote the comment, I realized that I may have found your problem: I think your Flask environment is using an older version of Python.

The reason is that in the latest version of Python's multiprocessing, the line where you are receiving that error is protected by a condition:

if hasattr(self._parent, '_children'):
    self._parent._children[self] = None

It looks like this bug was fixed during the Python 2.7 series (I think from 2.7.5 on). Perhaps your Flask environment runs an older 2.7 or 2.6?

Can you check your environment? If you can't update the interpreter, perhaps we can find a way to monkey patch multiprocessing to keep it from crashing.
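One way to check is to compare the interpreter version against the thresholds mentioned in this answer. The sketch below treats 2.7.5 and 3.3.2 as the cut-offs only because those are this answer's estimates; the function name `is_probably_affected` is made up for illustration:

```python
import sys


def is_probably_affected(version_info=None):
    """Apply this answer's estimated cut-offs (2.7.5 and 3.3.2) to a version tuple."""
    if version_info is None:
        version_info = sys.version_info
    version = tuple(version_info[:3])
    if version[0] == 2:
        return version < (2, 7, 5)
    return version < (3, 3, 2)


if __name__ == '__main__':
    print(sys.version)
    print(is_probably_affected())
```

Calling it from inside the Flask route, rather than from a shell, shows which interpreter the service actually runs under.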

KobeJohn answered Oct 31 '22 09:10
