 

Error in GridSearchCV (sklearn)

I am trying to tune a GradientBoostingClassifier in sklearn using GridSearchCV. Here is the code:

from sklearn.grid_search import GridSearchCV
from sklearn.ensemble import GradientBoostingClassifier

param_grid = {'learning_rate': [0.1, 0.01, 0.001],
              'max_depth': [4, 6],
              'min_samples_leaf': [9, 17],
              'max_features': [0.3, 0.1]}

est = GradientBoostingClassifier(n_estimators=3000)
# this may take some minutes
gs_cv = GridSearchCV(est, param_grid, scoring='f1', n_jobs=-1, verbose=1, pre_dispatch=5).fit(X.values, y)

# best hyperparameter setting
print 'Best hyperparameters: %r' % gs_cv.best_params_

The dataset X has 1 million rows and 245 features, and I am running on a machine with close to 32 cores. When I run the code above, I get the following error:

error                                     Traceback (most recent call last)
<ipython-input-22-cb545fec9989> in <module>()
      9 est = GradientBoostingClassifier(n_estimators=3000)
     10 # this may take some minutes
---> 11 gs_cv = GridSearchCV(est, param_grid, scoring='f1', n_jobs=-1, verbose=1, pre_dispatch=5).fit(X.values, y)
     12 
     13 # best hyperparameter setting

/var/webeng/opensource/aetna-anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in fit(self, X, y)
    594 
    595         """
--> 596         return self._fit(X, y, ParameterGrid(self.param_grid))
    597 
    598 

/var/webeng/opensource/aetna-anaconda/lib/python2.7/site-packages/sklearn/grid_search.pyc in _fit(self, X, y, parameter_iterable)
    376                                     train, test, self.verbose, parameters,
    377                                     self.fit_params, return_parameters=True)
--> 378             for parameters in parameter_iterable
    379             for train, test in cv)
    380 

/var/webeng/opensource/aetna-anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in __call__(self, iterable)
    658                 # consumption.
    659                 self._iterating = False
--> 660             self.retrieve()
    661             # Make sure that we get a last message telling us we are done
    662             elapsed_time = time.time() - self._start_time

/var/webeng/opensource/aetna-anaconda/lib/python2.7/site-packages/sklearn/externals/joblib/parallel.pyc in retrieve(self)
    510                 self._lock.release()
    511             try:
--> 512                 self._output.append(job.get())
    513             except tuple(self.exceptions) as exception:
    514                 try:

/var/webeng/opensource/aetna-anaconda/lib/python2.7/multiprocessing/pool.pyc in get(self, timeout)
    556             return self._value
    557         else:
--> 558             raise self._value
    559 
    560     def _set(self, i, obj):

error: 'i' format requires -2147483648 <= number <= 2147483647

When I run the same code on a subset of 1000 rows, it works. I tried varying pre_dispatch but still get the error. Is this caused by the data size or by something else? Thanks.

Using sklearn 0.15.2 on Python 2.7.9

Nitin asked Oct 31 '22 06:10



1 Answer

I see 3 possible ways to solve this:

1) Update sklearn to the latest version.

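2) Replace the old import (sklearn.grid_search was deprecated in 0.18 and removed in 0.20, and sklearn.model_selection needs 0.18 or later, so this depends on step 1):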

from sklearn.grid_search import GridSearchCV

with:

from sklearn.model_selection import GridSearchCV

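3) If you use n_jobs > 1 in GridSearchCV, protect the script's entry point with if __name__ == '__main__':. This is required on platforms that start worker processes by spawning a fresh interpreter, such as Windows.

For example: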

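from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

if __name__ == '__main__':
    # X, y: your feature matrix and labels, loaded beforehand
    clf = MLPClassifier()
    my_param_grid = {'activation': ('tanh', 'relu')}
    # n_jobs=-1 uses all available cores
    grid = GridSearchCV(clf, param_grid=my_param_grid, n_jobs=-1)
    grid.fit(X, y)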

Consider applying all three steps.
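On the data-size question: the wording of the error message is what Python's struct module produces when a number does not fit the signed 32-bit 'i' format. One plausible reading, not confirmed by the traceback, is that some object serialized for the worker processes had a length above that limit. The sketch below reproduces the message's limit and compares it with the raw size of X. The helper name fits_in_int32 is made up for illustration, and the 8 bytes per value assumes float64 features, which the question does not state.

```python
import struct

INT32_MAX = 2**31 - 1  # upper bound quoted in the error message


def fits_in_int32(n):
    """Return True if n can be packed with struct's 'i' (signed 32-bit) format."""
    try:
        struct.pack('i', n)
        return True
    except struct.error:
        return False


# Assumption: X holds float64 values (8 bytes each); the question does not say.
rows, cols, bytes_per_value = 1000000, 245, 8
full_bytes = rows * cols * bytes_per_value    # raw size of the full array
subset_bytes = 1000 * cols * bytes_per_value  # raw size of the 1000-row subset

print(fits_in_int32(INT32_MAX))      # the largest value 'i' accepts
print(fits_in_int32(INT32_MAX + 1))  # one past the limit raises struct.error
print(full_bytes)                    # bytes in the full array
print(INT32_MAX - full_bytes)        # headroom left below the limit
print(subset_bytes)                  # bytes in the subset that worked
```

Under that float64 assumption the raw array alone is about 1.96 GB, which is close to the 2 GiB limit, while the 1000-row subset is under 2 MB. That would be consistent with the subset working and the full data failing, but it does not prove the array itself is what crossed the limit.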

seralouk answered Nov 13 '22 18:11
