
How to convert a sklearn pipeline into a pyspark pipeline?

We have a machine learning classifier model that we have trained with a pandas dataframe and a standard sklearn pipeline (StandardScaler, RandomForestClassifier, GridSearchCV etc). We are working on Databricks and would like to scale up this pipeline to a large dataset using the parallel computation spark offers.

What is the quickest way to convert our sklearn pipeline into something that computes in parallel? (We can easily switch between pandas and spark DFs as required.)
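
For concreteness, the sklearn side of such a workflow looks roughly like this (a minimal sketch; the features, labels, and parameter grid are hypothetical placeholders):

from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# chain scaling and the classifier, then tune with 5-fold cross-validation
pipe = Pipeline([("scale", StandardScaler()),
                 ("rf", RandomForestClassifier())])
param_grid = {"rf__n_estimators": [50, 100], "rf__max_depth": [5, 10]}
search = GridSearchCV(pipe, param_grid, cv=5)
# search.fit(X, y)  # X, y: features and labels from the pandas DataFrame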

For context, our options seem to be:

  1. Rewrite the pipeline using Spark MLlib (time-consuming; see the sketch after this list)
  2. Use a sklearn-spark bridging library
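
To give a sense of what option 1 involves, here is a minimal sketch of a comparable MLlib pipeline; the column names and parameter grid are hypothetical placeholders:

from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StandardScaler
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# assemble raw columns into a vector, scale it, then classify
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features_raw")
scaler = StandardScaler(inputCol="features_raw", outputCol="features")
rf = RandomForestClassifier(featuresCol="features", labelCol="label")
pipeline = Pipeline(stages=[assembler, scaler, rf])

grid = (ParamGridBuilder()
        .addGrid(rf.numTrees, [50, 100])
        .addGrid(rf.maxDepth, [5, 10])
        .build())
cv = CrossValidator(estimator=pipeline, estimatorParamMaps=grid,
                    evaluator=MulticlassClassificationEvaluator(labelCol="label"),
                    numFolds=5)
# cv_model = cv.fit(spark_df)  # spark_df: a Spark DataFrame with f1, f2, label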

On option 2, Spark-Sklearn seems to be deprecated, but Databricks instead recommends that we use joblibspark. However, the following code raises an exception on Databricks:

from sklearn import svm, datasets
from sklearn.model_selection import GridSearchCV
from joblibspark import register_spark
from sklearn.utils import parallel_backend
register_spark() # register spark backend

iris = datasets.load_iris()
parameters = {'kernel':('linear', 'rbf'), 'C':[1, 10]}
svr = svm.SVC(gamma='auto')

clf = GridSearchCV(svr, parameters, cv=5)
with parallel_backend('spark', n_jobs=3):
    clf.fit(iris.data, iris.target)

raises

py4j.security.Py4JSecurityException: Method public int org.apache.spark.SparkContext.maxNumConcurrentTasks() is not whitelisted on class class org.apache.spark.SparkContext
asked Sep 01 '20 by anonuser9674123


People also ask

Can we use Sklearn in PySpark?

No, not directly: scikit-learn is a package designed to run on a single machine, whereas Spark is a distributed computing environment. A bridging library such as joblibspark is needed to distribute scikit-learn workloads across a cluster.

Can I use scikit-learn in spark?

In addition to distributing ML tasks in Python across a cluster, the scikit-learn integration package for Spark provides tools to move data between Spark and Python, including methods to convert Spark DataFrames to pandas DataFrames and NumPy arrays.
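
As an illustration, moving data between the two is a one-liner in each direction (a sketch assuming spark_df is an existing Spark DataFrame and spark is an active SparkSession):

pdf = spark_df.toPandas()               # Spark DataFrame -> pandas DataFrame
arr = pdf.to_numpy()                    # pandas DataFrame -> NumPy array
spark_df2 = spark.createDataFrame(pdf)  # pandas DataFrame -> Spark DataFrame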

How do you save the pipeline model on PySpark?

You can save a fitted pipeline model with model.save("/tmp/rf") and load it back later with PipelineModel.load("/tmp/rf").
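
A minimal sketch of saving and reloading a fitted pipeline model (the pipeline, column names, and train_df are hypothetical):

from pyspark.ml import Pipeline, PipelineModel
from pyspark.ml.feature import VectorAssembler
from pyspark.ml.classification import RandomForestClassifier

# train_df: a hypothetical Spark DataFrame with columns f1, f2 and label
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
rf = RandomForestClassifier(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, rf]).fit(train_df)

model.save("/tmp/rf")                   # persist the fitted PipelineModel
loaded = PipelineModel.load("/tmp/rf")  # reload it later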

What is a PySpark pipeline?

A Spark Pipeline is specified as a sequence of stages, and each stage is either a Transformer or an Estimator . These stages are run in order, and the input DataFrame is transformed as it passes through each stage.


1 Answer

According to the Databricks instructions (here and here), the necessary requirements are:

  • Python 3.6+
  • pyspark>=2.4
  • scikit-learn>=0.21
  • joblib>=0.14
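
If joblibspark is not already available on the cluster, it can be installed from a notebook cell (assuming a Databricks notebook environment):

%pip install joblibspark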

I cannot reproduce your issue in a community Databricks cluster running Python 3.7.5, Spark 3.0.0, scikit-learn 0.22.1, and joblib 0.14.1:

import sys
import sklearn
import joblib

spark.version
# '3.0.0'

sys.version
# '3.7.5 (default, Nov  7 2019, 10:50:52) \n[GCC 8.3.0]'

sklearn.__version__
# '0.22.1'

joblib.__version__
# '0.14.1'

With the above settings, your code snippet runs smoothly and indeed produces a fitted classifier clf:

GridSearchCV(cv=5, error_score=nan,
             estimator=SVC(C=1.0, break_ties=False, cache_size=200,
                           class_weight=None, coef0=0.0,
                           decision_function_shape='ovr', degree=3,
                           gamma='auto', kernel='rbf', max_iter=-1,
                           probability=False, random_state=None, shrinking=True,
                           tol=0.001, verbose=False),
             iid='deprecated', n_jobs=None,
             param_grid={'C': [1, 10], 'kernel': ('linear', 'rbf')},
             pre_dispatch='2*n_jobs', refit=True, return_train_score=False,
             scoring=None, verbose=0)
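
Once fitted, the usual GridSearchCV attributes are available for inspection, for example:

clf.best_params_  # the winning parameter combination
clf.best_score_   # its mean cross-validated score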

The alternative example from here runs smoothly as well:

from sklearn.utils import parallel_backend
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn import svm
from joblibspark import register_spark

register_spark() # register spark backend

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)

giving

[0.96666667 1.         0.96666667 0.96666667 1.        ]
answered Nov 15 '22 by desertnaut