 

Train multiple models in parallel with sklearn?

I want to train multiple LinearSVC models with different random states, but I would prefer to do it in parallel. Is there a mechanism supporting this in sklearn? I know GridSearchCV and some ensemble methods do this implicitly, but what is the thing under the hood?

asked Apr 12 '15 by erogol

People also ask

Is scikit-learn parallel?

Scikit-learn uses joblib for single-machine parallelism. This lets you train most estimators (anything that accepts an n_jobs parameter) using all the cores of your laptop or workstation.

Is sklearn multithreaded?

Scikit-learn relies heavily on NumPy and SciPy, which internally call multi-threaded linear algebra routines implemented in libraries such as MKL, OpenBLAS or BLIS.

What does n_jobs=-1 mean?

n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer.
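As a minimal sketch of the n_jobs parameter in action (the dataset and estimator choice here are illustrative, not from the original question):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy data; in practice, use your own feature matrix and labels.
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# n_jobs=-1 asks scikit-learn to dispatch the work across all available
# CPU cores; n_jobs=1 keeps the computation in a single process.
clf = RandomForestClassifier(n_estimators=50, n_jobs=-1, random_state=0)
clf.fit(X, y)
print(clf.score(X, y))
```

The same n_jobs convention applies to other estimators and utilities that support it, such as GridSearchCV and cross_val_score.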

How do you combine two classification models?

The most common method to combine models is by averaging multiple models, where taking a weighted average improves the accuracy. Bagging, boosting, and concatenation are other methods used to combine deep learning models. Stacked ensemble learning uses different combining techniques to build a model.
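In scikit-learn terms, the voting approach described above can be sketched with VotingClassifier (the estimators and data below are assumptions for illustration):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=200, n_features=10, random_state=0)

# Hard voting: each fitted model casts one vote per sample,
# and the majority class wins.
ensemble = VotingClassifier(
    estimators=[
        ("lr", LogisticRegression(max_iter=1000)),
        ("dt", DecisionTreeClassifier(random_state=0)),
    ],
    voting="hard",
)
ensemble.fit(X, y)
print(ensemble.score(X, y))
```

Passing voting="soft" instead averages the predicted class probabilities, which is the weighted-averaging variant mentioned above.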


1 Answer

The "thing" under the hood is the library joblib, which powers, for example, the multiprocessing in GridSearchCV and some ensemble methods. Its Parallel helper class is a very handy Swiss Army knife for embarrassingly parallel for loops.

Here is an example that trains multiple LinearSVC models with different random states in parallel across 4 processes using joblib:

from joblib import Parallel, delayed
from sklearn.svm import LinearSVC
import numpy as np

def train_model(X, y, seed):
    # Each call builds and fits an independent model,
    # so the calls can run in separate worker processes.
    model = LinearSVC(random_state=seed)
    return model.fit(X, y)

X = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([0, 1])

# delayed() captures the function and its arguments without calling it;
# Parallel dispatches the resulting tasks to 4 worker processes.
result = Parallel(n_jobs=4)(delayed(train_model)(X, y, seed) for seed in range(10))
# result is a list of 10 models trained with different seeds
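Once trained, the models can be combined, for instance by majority vote. This sketch repeats the training so it is self-contained; the aggregation step is an illustration, not part of the original answer:

```python
from joblib import Parallel, delayed
from sklearn.svm import LinearSVC
import numpy as np

def train_model(X, y, seed):
    return LinearSVC(random_state=seed).fit(X, y)

X = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([0, 1])
models = Parallel(n_jobs=4)(delayed(train_model)(X, y, seed) for seed in range(10))

# Stack each model's predictions: one row per model, one column per sample.
votes = np.array([m.predict(X) for m in models])

# Majority vote across the ten models for each sample.
majority = (votes.mean(axis=0) >= 0.5).astype(int)
print(majority)
```

With a single machine, Parallel can also be told to prefer threads over processes via its backend options, which avoids pickling the data for each worker.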
answered Dec 28 '22 by YS-L