Apply sklearn trained model on a dataframe with PySpark

Tags: python, apache-spark, scikit-learn, pyspark

I trained a random forest model with Python and would like to apply it to a large dataset with PySpark.

I first loaded the trained sklearn RF model (with joblib), loaded the data containing the features into a Spark dataframe, and then added a column with the predictions using a user-defined function like this:

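from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

def predictClass(features):
    # predict() expects a 2D input, so wrap the single row and unwrap the result
    return str(rf.predict([features])[0])

udfFunction = udf(predictClass, StringType())
new_dataframe = dataframe.withColumn('prediction', udfFunction('features'))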

It takes a very long time to run, though. Is there a more efficient way to do the same thing (without using Spark ML)?

asked May 31 '17 13:05

Pierre

2 Answers

I had to do the same thing in a recent project. The problem with applying a UDF to each row is that PySpark has to read the sklearn model each time, which is why it takes ages to finish. The best solution I found was to use the .mapPartitions or foreachPartition method on the RDD. There is a really good explanation here:

https://github.com/mahmoudparsian/pyspark-tutorial/blob/master/tutorial/map-partitions/README.md

It is fast because there is no shuffling, and for each partition PySpark has to read the model and call predict only once. So the flow would be:

convert the DF to an RDD
broadcast the model to the nodes so it is accessible to the workers
write a function that takes an iterator (containing all rows within a partition) as its argument
iterate through the rows and build a proper matrix of your features (order matters)
call .predict only once
return the predictions
transform the RDD back to a DF if needed
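
The flow above could be sketched roughly as follows. This is a minimal sketch, not the answerer's actual code: the names predict_partition and score_dataframe are hypothetical, and it assumes the 'features' column is an array of numbers (not an ML Vector) and that the model exposes a scikit-learn style .predict method. Note that it materializes each partition in memory before predicting.

```python
def predict_partition(rows, model, feature_col="features"):
    """Score one partition: build the feature matrix, then call predict once."""
    rows = list(rows)  # materialize the partition iterator
    if not rows:       # partitions can be empty; skip them instead of predicting
        return
    # One row of the matrix per input row; feature order must match training
    matrix = [list(row[feature_col]) for row in rows]
    predictions = model.predict(matrix)
    for features, prediction in zip(matrix, predictions):
        # str() mirrors the StringType used in the question
        yield features, str(prediction)


def score_dataframe(spark, dataframe, model):
    """Hypothetical wiring: broadcast the model, then score partition by partition."""
    bc_model = spark.sparkContext.broadcast(model)
    scored = dataframe.rdd.mapPartitions(
        lambda rows: predict_partition(rows, bc_model.value)
    )
    return scored.toDF(["features", "prediction"])
```

With an existing SparkSession named spark and the model loaded as rf, this would be called as score_dataframe(spark, dataframe, rf). Only the features and the prediction are carried through here; any other columns would need to be added to the yielded tuples.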

answered Sep 23 '22 03:09

Jacek Placek

A sklearn RF model can be quite large when pickled. It is possible that frequent pickling and unpickling of the model during task dispatch causes the problem. You could consider using broadcast variables.

From the official documentation:

Broadcast variables allow the programmer to keep a read-only variable cached on each machine rather than shipping a copy of it with tasks. They can be used, for example, to give every node a copy of a large input dataset in an efficient manner. Spark also attempts to distribute broadcast variables using efficient broadcast algorithms to reduce communication cost.
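
As an illustration, the question's UDF could read the model from a broadcast handle instead of capturing rf directly. This is only a sketch under assumptions: make_row_predictor is a hypothetical helper, the 'features' column is assumed to be an array of numbers, and the Spark wiring is shown as comments because it needs a running session with spark, rf and dataframe already defined.

```python
def make_row_predictor(bc_model):
    """Return a per-row function that reads the model from a broadcast handle."""
    def predict_class(features):
        # bc_model.value is the cached model on the worker, so the model is
        # not re-shipped with every task; predict() still runs once per row
        return str(bc_model.value.predict([list(features)])[0])
    return predict_class


# Wiring with Spark (assumes `spark`, `rf` and `dataframe` already exist):
# from pyspark.sql.functions import udf
# from pyspark.sql.types import StringType
# bc_rf = spark.sparkContext.broadcast(rf)
# predict_udf = udf(make_row_predictor(bc_rf), StringType())
# new_dataframe = dataframe.withColumn("prediction", predict_udf("features"))
```

This keeps the row-by-row structure of the original code, so it addresses the cost of shipping the model but not the cost of calling predict once per row.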

answered Sep 23 '22 03:09

peter
