How to use a PySpark UDF in a Scala Spark project?

Several people (1, 2, 3) have discussed using a Scala UDF in a PySpark application, usually for performance reasons. I am interested in the opposite: using a Python UDF in a Scala Spark project.

I am particularly interested in building a model using sklearn (and MLflow) and then efficiently applying it to records in a Spark streaming job. I know I could also host the Python model behind a REST API and call that API from the Spark streaming application in mapPartitions, but managing concurrency for that task and setting up the API for the hosted model isn't something I'm super excited about.

Is this possible without too much custom development with something like Py4J? Is this just a bad idea?

Thanks!

asked Aug 18 '18 by turtlemonvh

People also ask

How do I register a function as UDF in Spark Scala?

In Spark, you create a UDF by writing a function in whichever language you prefer to use with Spark. For example, if you are using Spark with Scala, you create the UDF in Scala and either wrap it with udf() to use it on a DataFrame, or register it as a UDF to use it in SQL.
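For reference, a minimal PySpark sketch of those two styles (the function name add_one and the sample data are illustrative, not from this page) might look like:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["x"])

def add_one(n):
    return n + 1

# Style 1: wrap with udf() for use in the DataFrame API
add_one_udf = udf(add_one, IntegerType())
df.withColumn("x_plus_one", add_one_udf(col("x"))).show()

# Style 2: register by name for use in SQL
spark.udf.register("add_one", add_one, IntegerType())
df.createOrReplaceTempView("numbers")
spark.sql("SELECT add_one(x) AS x_plus_one FROM numbers").show()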

Why UDF are not recommended in Spark?

It is well known that the use of UDFs (User Defined Functions) in Apache Spark, and especially in the Python API, can compromise application performance. For this reason, at Damavis we try to avoid their use as much as possible in favour of native functions or SQL.


1 Answer

Maybe I'm late to the party, but at least I can help with this for posterity. This is actually achievable by creating your Python UDF and registering it with spark.udf.register("my_python_udf", foo). You can view the docs here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.UDFRegistration.register
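As a rough sketch of the Python side (the model file, the UDF body, and the helper names here are hypothetical, and this assumes the Scala code shares the same session/sqlContext):

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
import joblib

spark = SparkSession.builder.getOrCreate()

# Hypothetical: a scikit-learn model saved earlier (e.g. trained and logged via MLflow)
model = joblib.load("model.pkl")

def predict(feature):
    # sklearn expects a 2-D array; return a plain float so Spark can serialize it
    return float(model.predict([[feature]])[0])

# Register under the name the Scala side will use in SQL
spark.udf.register("my_python_udf", predict, DoubleType())

Once registered, any query issued against that sqlContext can invoke my_python_udf, which is what makes the call below possible.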

This function can then be called from the sqlContext in Python, Scala, Java, R, or any language really, because you're accessing the sqlContext directly (where the UDF is registered). For example, you would call something like:

spark.sql("SELECT my_python_udf(...)").show()

PROS - You get to call your sklearn model from Scala.

CONS - You have to go through the sqlContext and write SQL-style queries.

I hope this helps, at least for any future visitors.

answered Oct 09 '22 by napoleon_borntoparty