
Scikit-Learn: Several X-Vals in parallel?

I would like to try several different models for my data and crossvalidate them, so the results are somewhat reliable.

For my crossvalidation I call:

cross_val_score(model, X, y, scoring = 'mean_squared_error', cv=kf, n_jobs = -1)

which does my 10-fold crossvalidation in parallel. Since the machine I'm running on has 40 cores and enough memory, I would like to try four different values for "model" in parallel, each doing a 10-fold crossvalidation.

However, when I try to do it using joblib in the following way, I get an error:

results = Parallel(n_jobs = num_jobs)(delayed(crossVal)(model) for model in models)

/usr/local/lib/python2.7/dist-packages/sklearn/cross_validation.py:1433: UserWarning: Multiprocessing-backed parallel loops cannot be nested, setting n_jobs=1
  for train, test in cv)

where crossVal is a method I defined, which consists mainly of calling cross_val_score.

Is there an elegant way to do this without starting several different python files manually?

SGer asked Dec 13 '25 20:12



1 Answer

Joblib supports both a multiprocessing and a threading backend; by default it uses multiprocessing. (This is because of the CPython implementation, where threading is faster only in some particular cases. I won't go into details here; you can find tons of articles about CPython and the Python GIL.)

It's not an error, just a warning telling you that you tried to create processes from within processes. That is, with this line:

results = Parallel(n_jobs = num_jobs)(delayed(crossVal)(model) for model in models)

you have already spawned some number of processes (n_jobs), and then each cross_val_score call inside your crossVal tries to do the same thing (spawn its own processes), because with n_jobs set to anything other than 1 cross_val_score also uses joblib's multiprocessing backend. Joblib does not allow this kind of nesting with the multiprocessing backend. So, AFAIK, it issues this warning and runs the nested Parallel loop in a single process: the internals of cross_val_score now run sequentially, while your crossVal calls still run in parallel across processes.

You can avoid this warning by removing either of the two levels of multiprocessing. You can drop the nested one by calling:

cross_val_score(..., n_jobs=1)

in your crossVal function, or you can call cross_val_score as before (with n_jobs = -1) several times in a plain loop, without the outer Parallel, and then aggregate the results, e.g.:

results = [cross_val_score(estimator = est, ...) for est in estimators]
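
To make the two options concrete, here is a minimal sketch of both side by side. It assumes a recent scikit-learn, where cross_val_score lives in sklearn.model_selection and the scorer is named 'neg_mean_squared_error' (the question's older release used sklearn.cross_validation and 'mean_squared_error'). The synthetic data, the three models and n_jobs=2 are placeholders chosen for illustration, not taken from the question.

```python
import numpy as np
from joblib import Parallel, delayed
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeRegressor

# Placeholder data and candidate models (assumptions, not from the question).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)
models = [Ridge(), Lasso(), DecisionTreeRegressor(random_state=0)]
kf = KFold(n_splits=10, shuffle=True, random_state=0)
SCORING = "neg_mean_squared_error"

# Option 1: parallel over models; the folds of each model run sequentially
# because the inner call is pinned to n_jobs=1 (no nesting, no warning).
per_model = Parallel(n_jobs=2)(
    delayed(cross_val_score)(m, X, y, scoring=SCORING, cv=kf, n_jobs=1)
    for m in models
)

# Option 2: plain loop over models; the folds of each model run in parallel.
per_fold = [
    cross_val_score(m, X, y, scoring=SCORING, cv=kf, n_jobs=2) for m in models
]

# Both options produce one array of 10 fold scores per model.
for model, scores in zip(models, per_model):
    print(type(model).__name__, np.mean(scores))
```

Either way you get one array of fold scores per model; only the level at which the work is split across processes differs.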

In the first case, up to min(n_models, n_jobs) fits run simultaneously (your original code already does this implicitly once joblib issues the warning); in the second, up to min(n_folds, n_cores). If you want up to min(n_jobs, n_models*n_folds) running at once, you should use GridSearchCV, because internally it dispatches one job per (parameter setting, fold) pair:

    out = Parallel(
        n_jobs=self.n_jobs, verbose=self.verbose,
        pre_dispatch=pre_dispatch
    )(
        delayed(_fit_and_score)(clone(base_estimator), X, y, self.scorer_,
                                train, test, self.verbose, parameters,
                                self.fit_params, return_parameters=True,
                                error_score=self.error_score)
            for parameters in parameter_iterable
            for train, test in cv)
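
One way to use this for comparing different models, rather than different hyperparameters of one model, is to wrap the model in a one-step Pipeline and let the grid swap out that step. The sketch below assumes a recent scikit-learn (sklearn.model_selection, 'neg_mean_squared_error'); the step name "model", the synthetic data, the candidate models and n_jobs=2 are illustrative assumptions.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso, Ridge
from sklearn.model_selection import GridSearchCV, KFold
from sklearn.pipeline import Pipeline
from sklearn.tree import DecisionTreeRegressor

# Placeholder data (an assumption, not from the question).
X, y = make_regression(n_samples=200, n_features=10, noise=5.0, random_state=0)

# One-step pipeline; the grid replaces the "model" step with each candidate.
pipe = Pipeline([("model", Ridge())])
param_grid = {"model": [Ridge(), Lasso(), DecisionTreeRegressor(random_state=0)]}

search = GridSearchCV(
    pipe,
    param_grid,
    scoring="neg_mean_squared_error",
    cv=KFold(n_splits=10, shuffle=True, random_state=0),
    n_jobs=2,      # use -1 to spread the 3 x 10 fits over all cores
    refit=False,   # only the cross-validation scores are needed here
)
search.fit(X, y)

# One mean score per candidate model, in the order given in param_grid.
results = search.cv_results_
for model, score in zip(results["param_model"], results["mean_test_score"]):
    print(type(model).__name__, score)
```

Per-fold scores are also available in cv_results_ under keys such as "split0_test_score", so nothing is lost compared with calling cross_val_score once per model.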
Ibraim Ganiev answered Dec 15 '25 09:12



