I'm exploring PySpark and the possibilities of integrating scikit-learn with it. I'd like to train a model on each partition using scikit-learn. That means, once my RDD is defined and gets distributed among the worker nodes, I'd like to train a model (say, a simple k-means) with scikit-learn on each partition that lives on a worker node. Since scikit-learn algorithms take a Pandas DataFrame, my initial idea was to call toPandas for each partition and then train my model. However, toPandas collects the DataFrame into the driver, which is not what I'm looking for. Is there any other way to achieve this?
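A rough sketch of what I mean (hypothetical code; it assumes an existing Spark DataFrame df with only numeric columns and uses mapPartitions so the data never leaves the workers) would be something like:
# Hypothetical sketch only: train one scikit-learn k-means per partition
# via mapPartitions, so nothing is collected to the driver.
# "df" is assumed to be an existing Spark DataFrame with numeric columns.
import pandas as pd
from sklearn.cluster import KMeans

def train_partition(rows):
    # rows is an iterator over the Row objects of a single partition
    pdf = pd.DataFrame([r.asDict() for r in rows])
    if pdf.empty:
        return iter([])
    model = KMeans(n_clusters=3).fit(pdf)  # n_clusters chosen arbitrarily
    # return something small (the cluster centers), not the partition data itself
    return iter([model.cluster_centers_])

centers_per_partition = df.rdd.mapPartitions(train_partition).collect()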
scikit-learn can't be fully integrated with Spark for now, because scikit-learn's algorithms aren't implemented to be distributed; they work only on a single machine.
Nevertheless, you can find ready-to-use Spark/scikit-learn integration tools in spark-sklearn, which (for the moment) supports executing a GridSearch on Spark for cross-validation.
Edit
As of 2020, spark-sklearn is deprecated and joblib-spark is its recommended successor. Based on the documentation, you can easily distribute a cross-validation to a Spark cluster like this:
from sklearn.utils import parallel_backend
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn import svm
from joblibspark import register_spark
register_spark() # register spark backend
iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)
A GridSearchCV can be distributed in the same way.
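For example, a minimal sketch (assuming the Spark backend has already been registered with register_spark() as above; GridSearchCV parallelizes its candidate fits through joblib, so the same backend applies) could look like:
from sklearn.utils import parallel_backend
from sklearn.model_selection import GridSearchCV
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}  # example grid
search = GridSearchCV(svm.SVC(), param_grid, cv=5)

with parallel_backend('spark', n_jobs=3):
    search.fit(iris.data, iris.target)  # candidate fits are dispatched to the Spark cluster

print(search.best_params_)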