
Integrating scikit-learn with PySpark

I'm exploring PySpark and the possibilities of integrating scikit-learn with it. I'd like to train a model on each partition using scikit-learn. That means that, when my RDD is defined and gets distributed among different worker nodes, I'd like to use scikit-learn to train a model (let's say a simple k-means) on each partition that exists on each worker node. Since scikit-learn algorithms take a pandas DataFrame, my initial idea was to call toPandas for each partition and then train my model. However, the toPandas function collects the DataFrame into the driver, and this is not something that I'm looking for. Is there any other way to achieve such a goal?
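To make the goal concrete, this is roughly the shape of what I have in mind (just a sketch of the idea; I'm assuming rdd is my already-distributed RDD of numeric feature tuples, and the KMeans settings are placeholders):

import pandas as pd
from sklearn.cluster import KMeans

def train_partition(rows):
    # Build a pandas DataFrame from this partition's rows only;
    # the data stays on the worker node.
    pdf = pd.DataFrame(list(rows))
    if pdf.empty:
        return iter([])
    model = KMeans(n_clusters=3).fit(pdf)
    return iter([model.cluster_centers_])

# one fitted result per partition, collected back to the driver
centers = rdd.mapPartitions(train_partition).collect()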

asked Mar 12 '23 by HHH


1 Answer

scikit-learn can't be fully integrated with Spark as of now, because scikit-learn algorithms aren't implemented to be distributed; they work only on a single machine.

Nevertheless, you can find ready-to-use Spark/scikit-learn integration tools in spark-sklearn, which (for the moment) supports executing a grid search on Spark for cross-validation.

Edit

As of 2020, spark-sklearn is deprecated and joblib-spark is its recommended successor. Based on the documentation, you can easily distribute a cross-validation to a Spark cluster like this:

from sklearn.utils import parallel_backend
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn import svm
from joblibspark import register_spark

register_spark()  # register the Spark backend with joblib

iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)

# each of the 5 cross-validation fits is submitted as a Spark task
with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)

A GridSearchCV can be distributed in the same way.
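For instance, a minimal sketch continuing from the snippet above (the parameter grid here is just illustrative):

from sklearn.model_selection import GridSearchCV

param_grid = {'kernel': ['linear', 'rbf'], 'C': [0.1, 1, 10]}
search = GridSearchCV(svm.SVC(), param_grid, cv=5)

# each parameter combination / fold is fitted as a Spark task
with parallel_backend('spark', n_jobs=3):
    search.fit(iris.data, iris.target)

print(search.best_params_)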

answered Mar 20 '23 by 4 revs, 4 users 83%