I want to train multiple LinearSVC models with different random states, but I would prefer to do it in parallel. Is there a mechanism supporting this in sklearn? I know GridSearchCV and some ensemble methods do it implicitly, but what is the thing under the hood?
Scikit-learn uses joblib for single-machine parallelism. This lets you train most estimators (anything that accepts the n_jobs parameter) using all the cores of your laptop or workstation. Joblib also supports pluggable backends, so the same code can be dispatched to other systems, such as a Spark cluster, when a single machine is not enough.
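For example, the backend used by scikit-learn's internal parallelism can be selected with joblib's parallel_backend context manager. A minimal sketch follows; "threading" is just one of joblib's built-in backends, and a Spark backend would need to be registered by the separate joblib-spark package first:

from joblib import parallel_backend
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)

# Route the estimator's internal n_jobs parallelism through a chosen backend.
# "threading" and "loky" ship with joblib; cluster backends (e.g. Spark)
# have to be registered by an extra package.
with parallel_backend("threading", n_jobs=2):
    clf = RandomForestClassifier(n_estimators=50, n_jobs=2, random_state=0)
    clf.fit(X, y)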
Scikit-learn relies heavily on NumPy and SciPy, which internally call multi-threaded linear algebra routines implemented in libraries such as MKL, OpenBLAS or BLIS.
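Because of this, stacking joblib worker processes on top of multi-threaded BLAS can oversubscribe the CPU. A common workaround is to cap the number of BLAS threads; a sketch, assuming the threadpoolctl package is installed:

from threadpoolctl import threadpool_limits
import numpy as np

a = np.random.rand(1000, 1000)

# Temporarily limit MKL/OpenBLAS/BLIS to a single thread, e.g. while
# joblib already keeps every core busy with separate processes.
with threadpool_limits(limits=1, user_api="blas"):
    b = a @ a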
n_jobs=-1 means that the computation will be dispatched on all the CPUs of the computer.
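The GridSearchCV case from the question already exposes this: passing n_jobs=-1 fans the candidate fits out over all cores. A minimal sketch (the parameter grid is just an example):

from sklearn.model_selection import GridSearchCV
from sklearn.svm import LinearSVC
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)

# Each candidate value of C is fitted in its own joblib worker;
# n_jobs=-1 uses every available CPU core.
grid = GridSearchCV(
    LinearSVC(),
    param_grid={"C": [0.1, 1.0, 10.0]},
    n_jobs=-1,
)
grid.fit(X, y)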
The most common method to combine models is averaging, where taking a weighted average of the individual predictions often improves accuracy. Bagging, boosting, and concatenation are other methods used to combine (deep learning) models, and stacked ensemble learning combines several of these techniques to build a single model.
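In scikit-learn the averaging idea maps onto VotingClassifier. A minimal sketch of a weighted soft-voting ensemble; the base estimators and weights are arbitrary examples:

from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, random_state=0)

# Soft voting averages the predicted class probabilities,
# weighting the forest twice as heavily as the logistic model.
ensemble = VotingClassifier(
    estimators=[("lr", LogisticRegression(max_iter=1000)),
                ("rf", RandomForestClassifier(n_estimators=50, random_state=0))],
    voting="soft",
    weights=[1, 2],
    n_jobs=-1,
)
ensemble.fit(X, y)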
The "thing" under the hood is the library joblib
, which powers for example the multi-processing in GridSearchCV
and some ensemble methods. It's Parallel
helper class is a very handy Swiss knife for embarrassingly parallel for loops.
Here is an example that trains multiple LinearSVC models with different random states in parallel, using 4 processes via joblib:
from joblib import Parallel, delayed
from sklearn.svm import LinearSVC
import numpy as np

def train_model(X, y, seed):
    # Each call fits an independent LinearSVC with its own random_state.
    model = LinearSVC(random_state=seed)
    return model.fit(X, y)

# Tiny toy dataset; replace with your real data.
X = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([0, 1])

# Dispatch the 10 fits to a pool of 4 worker processes.
result = Parallel(n_jobs=4)(delayed(train_model)(X, y, seed) for seed in range(10))
# result is a list of 10 models trained using different seeds
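Once the list is back, the fitted models can be used like any other estimator, for instance to check how stable the solution is across seeds. A sketch reusing the toy X and y from above:

# Score every fitted model on the (toy) training data and inspect
# how much the learned coefficients vary across random states.
scores = [model.score(X, y) for model in result]
coef_spread = np.ptp([model.coef_ for model in result], axis=0)
print(scores)
print(coef_spread)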