How to use a PySpark UDF in a Scala Spark project?

Several people (1, 2, 3) have discussed using a Scala UDF in a PySpark application, usually for performance reasons. I am interested in the opposite: using a Python UDF in a Scala Spark project.

I am particularly interested in building a model using sklearn (and MLflow) and then efficiently applying it to records in a Spark streaming job. I know I could also host the Python model behind a REST API and call that API from the Spark streaming application in mapPartitions, but managing concurrency for that task and setting up the API for the hosted model isn't something I'm super excited about.

Is this possible without too much custom development with something like Py4J? Is this just a bad idea?

Thanks!

asked Aug 18 '18 by turtlemonvh

People also ask

How do I register a function as UDF in Spark Scala?

In Spark, you create a UDF by writing a function in whichever language you prefer to use with Spark. For example, if you are using Spark with Scala, you create the UDF in Scala and either wrap it with udf() to use it on a DataFrame, or register it as a UDF to use it in SQL.
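For reference, a minimal PySpark sketch of those two styles (the function name add_one and the sample data are illustrative, not from this page) might look like:

from pyspark.sql import SparkSession
from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

spark = SparkSession.builder.appName("udf-demo").getOrCreate()
df = spark.createDataFrame([(1,), (2,)], ["x"])

def add_one(n):
    return n + 1

# Style 1: wrap with udf() for use in the DataFrame API
add_one_udf = udf(add_one, IntegerType())
df.withColumn("x_plus_one", add_one_udf(col("x"))).show()

# Style 2: register by name for use in SQL
spark.udf.register("add_one", add_one, IntegerType())
df.createOrReplaceTempView("numbers")
spark.sql("SELECT add_one(x) AS x_plus_one FROM numbers").show()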

Why UDF are not recommended in Spark?

It is well known that the use of UDFs (User Defined Functions) in Apache Spark, and especially in the Python API, can compromise application performance. For this reason, at Damavis we try to avoid their use as much as possible in favour of native functions or SQL.


1 Answer

Maybe I'm late to the party, but at least I can help with this for posterity. This is actually achievable by creating your Python UDF and registering it with spark.udf.register("my_python_udf", foo). You can view the docs here: https://spark.apache.org/docs/latest/api/python/pyspark.sql.html#pyspark.sql.UDFRegistration.register
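As a rough sketch of the Python side (the model file, the UDF body, and the helper names here are hypothetical, and this assumes the Scala code shares the same session/sqlContext):

from pyspark.sql import SparkSession
from pyspark.sql.types import DoubleType
import joblib

spark = SparkSession.builder.getOrCreate()

# Hypothetical: a scikit-learn model saved earlier (e.g. trained and logged via MLflow)
model = joblib.load("model.pkl")

def predict(feature):
    # sklearn expects a 2-D array; return a plain float so Spark can serialize it
    return float(model.predict([[feature]])[0])

# Register under the name the Scala side will use in SQL
spark.udf.register("my_python_udf", predict, DoubleType())

Once registered, any query issued against that sqlContext can invoke my_python_udf, which is what makes the call below possible.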

This function can then be called from the sqlContext in Python, Scala, Java, R, or any language really, because you're accessing the sqlContext directly (where the UDF is registered). For example, you would call something like:

spark.sql("SELECT my_python_udf(...)").show()

PROS - You get to call your sklearn model from Scala.

CONS - You have to go through the sqlContext and write SQL-style queries.

I hope this helps, at least for any future visitors.

answered Oct 09 '22 by napoleon_borntoparty