
When to use a UDF versus a function in PySpark? [duplicate]

I'm using Spark with Databricks and have the following code:

from pyspark.sql.functions import col, when

def replaceBlanksWithNulls(column):
    return when(col(column) != "", col(column)).otherwise(None)

Both of the following statements work:

x = rawSmallDf.withColumn("z", replaceBlanksWithNulls("z"))

and using a UDF:

from pyspark.sql.functions import udf

replaceBlanksWithNulls_Udf = udf(replaceBlanksWithNulls)
y = rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z"))

It is unclear to me from the documentation when I should use one over the other, and why.

asked May 09 '19 by Rodney
2 Answers

A UDF can essentially be any sort of function (there are exceptions, of course); it does not have to use Spark constructs such as when, col, etc. Using a UDF, the replaceBlanksWithNulls function can be written as normal Python code:

def replaceBlanksWithNulls(s):
    return s if s != "" else None

which can be used on a dataframe column after registering it:

from pyspark.sql.functions import udf

replaceBlanksWithNulls_Udf = udf(replaceBlanksWithNulls)
y = rawSmallDf.withColumn("z", replaceBlanksWithNulls_Udf("z"))

Note: the default return type of a UDF is StringType. If another type is required, it must be specified when registering the UDF, e.g.:

from pyspark.sql.types import LongType

# squared here is assumed to be an ordinary Python function, e.g.:
# def squared(s): return s * s
squared_udf = udf(squared, LongType())

In this case, the column operation is not complex and there are built-in Spark functions that can achieve the same thing (i.e., replaceBlanksWithNulls as in the question):

x = rawSmallDf.withColumn("z", when(col("z") != "", col("z")).otherwise(None))

This is always preferred whenever possible, since it allows Spark to optimize the query; see e.g. Spark functions vs UDF performance?
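To see the difference yourself, you can compare the physical plans Spark produces for the two approaches. A minimal sketch (the sample data and the name blank_to_null_udf are made up for illustration): the UDF version shows up as an opaque BatchEvalPython step that Catalyst cannot look inside or optimize.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, udf

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("a",), ("",)], ["z"])

# Built-in expressions stay inside Catalyst, so the optimizer can see them.
df.withColumn("z", when(col("z") != "", col("z")).otherwise(None)).explain()

# The UDF appears as an opaque BatchEvalPython step in the plan instead.
blank_to_null_udf = udf(lambda s: s if s != "" else None)
df.withColumn("z", blank_to_null_udf("z")).explain()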

answered Oct 16 '22 by Shaido

Another difference shows up in Spark SQL (as mentioned in the documentation). For example, a query like:

spark.sql("select replaceBlanksWithNulls(column_name) from dataframe")

does not work unless you have registered the function replaceBlanksWithNulls as a UDF. In Spark SQL, the return type of the function must be known for execution, so a custom function has to be registered as a user-defined function (UDF) before it can be used in Spark SQL.
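For completeness, a minimal sketch of such a registration, reusing the rawSmallDf DataFrame and column z from the question (the view name dataframe simply mirrors the query above):

from pyspark.sql.types import StringType

def replaceBlanksWithNulls(s):
    return s if s != "" else None

# Register the Python function for use in SQL, declaring its return type.
spark.udf.register("replaceBlanksWithNulls", replaceBlanksWithNulls, StringType())

# Expose the DataFrame as a temporary view so SQL can reference it.
rawSmallDf.createOrReplaceTempView("dataframe")
spark.sql("select replaceBlanksWithNulls(z) from dataframe").show()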

answered Oct 16 '22 by OmG