I'm exploring PySpark and the possibilities of integrating scikit-learn with it. I'd like to train a model on each partition using scikit-learn. That means, once my RDD is defined and gets distributed among the worker nodes, I'd like to train a model (say, a simple k-means) with scikit-learn on each partition that lives on a worker node. Since scikit-learn algorithms take a Pandas DataFrame, my initial idea was to call toPandas for each partition and then train my model. However, toPandas collects the DataFrame into the driver, which is not what I'm looking for. Is there any other way to achieve this?
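A rough sketch of what I mean (hypothetical code; it assumes an existing Spark DataFrame df with only numeric columns and uses mapPartitions so the data never leaves the workers) would be something like:
# Hypothetical sketch only: train one scikit-learn k-means per partition
# via mapPartitions, so nothing is collected to the driver.
# "df" is assumed to be an existing Spark DataFrame with numeric columns.
import pandas as pd
from sklearn.cluster import KMeans

def train_partition(rows):
    # rows is an iterator over the Row objects of a single partition
    pdf = pd.DataFrame([r.asDict() for r in rows])
    if pdf.empty:
        return iter([])
    model = KMeans(n_clusters=3).fit(pdf)  # n_clusters chosen arbitrarily
    # return something small (the cluster centers), not the partition data itself
    return iter([model.cluster_centers_])

centers_per_partition = df.rdd.mapPartitions(train_partition).collect()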
scikit-learn can't be fully integrated with Spark for now, because scikit-learn's algorithms aren't implemented to be distributed; they work only on a single machine.
Nevertheless, you can find ready-to-use Spark/scikit-learn integration tools in spark-sklearn, which (for the moment) supports executing a GridSearch on Spark for cross-validation.
Edit
As of 2020, spark-sklearn is deprecated and joblib-spark is its recommended successor. Based on the documentation, you can easily distribute a cross-validation to a Spark cluster like this:
from sklearn.utils import parallel_backend
from sklearn.model_selection import cross_val_score
from sklearn import datasets
from sklearn import svm
from joblibspark import register_spark
register_spark() # register spark backend
iris = datasets.load_iris()
clf = svm.SVC(kernel='linear', C=1)
with parallel_backend('spark', n_jobs=3):
    scores = cross_val_score(clf, iris.data, iris.target, cv=5)

print(scores)
A GridSearchCV can be distributed in the same way.
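For example, a minimal sketch (assuming the Spark backend has already been registered with register_spark() as above; GridSearchCV parallelizes its candidate fits through joblib, so the same backend applies) could look like:
from sklearn.utils import parallel_backend
from sklearn.model_selection import GridSearchCV
from sklearn import datasets
from sklearn import svm

iris = datasets.load_iris()
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}  # example grid
search = GridSearchCV(svm.SVC(), param_grid, cv=5)

with parallel_backend('spark', n_jobs=3):
    search.fit(iris.data, iris.target)  # candidate fits are dispatched to the Spark cluster

print(search.best_params_)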