Parallel Sklearn Model Building with Dask or Joblib

I have a large set of sklearn pipelines that I'd like to build in parallel with Dask. Here's a simple but naive sequential approach:

from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.2)

pipe_nb = Pipeline([('clf', MultinomialNB())])
pipe_lr = Pipeline([('clf', LogisticRegression())])
pipe_rf = Pipeline([('clf', RandomForestClassifier())])

pipelines = [pipe_nb, pipe_lr, pipe_rf]  # In reality, this would include many more different types of models with varying but specific parameters

for pl in pipelines:
    pl.fit(X_train, Y_train)

Note that this is not a GridSearchCV or RandomizedSearchCV problem.

In the case of RandomizedSearchCV, I know how to parallelize it with Dask:

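import joblib
import scipy.stats
from dask.distributed import Client
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

dask_client = Client('tcp://some.host.com:8786')

clf_rf = RandomForestClassifier()
param_dist = {'n_estimators': scipy.stats.randint(100, 500)}
search_rf = RandomizedSearchCV(
    clf_rf,
    param_distributions=param_dist,
    n_iter=100,
    scoring='f1_macro',  # plain 'f1' only supports binary targets; iris has three classes
    cv=10,
    error_score=0,
    verbose=3,
)

# Route joblib's parallelism to the Dask cluster
with joblib.parallel_backend('dask'):
    search_rf.fit(X_train, Y_train)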

However, I'm not interested in hyperparameter tuning, and it isn't clear how to modify this code to fit a set of different models, each with its own specific parameters, in parallel with Dask.

asked Jan 24 '19 21:01

slaw

1 Answer

dask.delayed is probably the easiest solution here.

from sklearn.naive_bayes import MultinomialNB 
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split

iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(iris.data, iris.target, test_size=0.2)

pipe_nb = Pipeline([('clf', MultinomialNB())])
pipe_lr = Pipeline([('clf', LogisticRegression())])
pipe_rf = Pipeline([('clf', RandomForestClassifier())])

pipelines = [pipe_nb, pipe_lr, pipe_rf]  # In reality, this would include many more different types of models with varying but specific parameters

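# Use dask.delayed instead of a for loop: each call below only builds a lazy task.
import dask

pipelines_ = [dask.delayed(pl).fit(X_train, Y_train) for pl in pipelines]
# dask.compute runs the tasks in parallel and returns the fitted pipelines as a tuple.
fit_pipelines = dask.compute(*pipelines_)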
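
Since the question also mentions Joblib, the same idea can probably be expressed with joblib.Parallel directly. The following is only a minimal sketch, not part of the original answer: the helpers fit_one and fit_all are hypothetical names, and max_iter, n_estimators, random_state and n_jobs=2 are arbitrary values chosen for illustration.

```python
from joblib import Parallel, delayed
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline


def fit_one(pipeline, X, y):
    """Fit a fresh copy of one pipeline and return it (hypothetical helper)."""
    return clone(pipeline).fit(X, y)


def fit_all(pipelines, X, y, n_jobs=2):
    """Fit each pipeline as its own joblib task; results keep the input order."""
    return Parallel(n_jobs=n_jobs)(delayed(fit_one)(pl, X, y) for pl in pipelines)


iris = load_iris()
X_train, X_test, Y_train, Y_test = train_test_split(
    iris.data, iris.target, test_size=0.2, random_state=0
)

# Illustrative parameter values; substitute your own model-specific settings.
pipelines = [
    Pipeline([('clf', MultinomialNB())]),
    Pipeline([('clf', LogisticRegression(max_iter=1000))]),
    Pipeline([('clf', RandomForestClassifier(n_estimators=50, random_state=0))]),
]

fit_pipelines = fit_all(pipelines, X_train, Y_train)

# Report held-out accuracy for each fitted pipeline.
for pl in fit_pipelines:
    print(type(pl.named_steps['clf']).__name__, pl.score(X_test, Y_test))
```

Wrapping the fit_all call in `with joblib.parallel_backend('dask'):`, as in the question's RandomizedSearchCV snippet, should send these tasks to the Dask cluster instead of local workers, assuming a Client is already connected. That part is not exercised in this sketch.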

answered Nov 15 '22 02:11

TomAugspurger

Tags: python, scikit-learn, dask, dask-distributed
