I have a DataFrame and the following code:
import org.apache.spark.sql.functions.{udf, col}

def test(lat: Double, lon: Double) = {
  println(s"testing ${lat / lon}")
  Map("one" -> "one", "two" -> "two")
}

val testUDF = udf(test _)

df.withColumn("test", testUDF(col("lat"), col("lon")))
  .withColumn("test1", col("test.one"))
  .withColumn("test2", col("test.two"))
Checking the logs, I found that the UDF is executed 3 times for each row. If I add a "test3" column from "test.three", the UDF is executed once more.
Can someone explain why?
Can this be avoided properly (without caching the DataFrame after "test" is added, even though that works)?
It is well known that the use of UDFs (User Defined Functions) in Apache Spark, especially through the Python API, can compromise application performance. For this reason, at Damavis we try to avoid them as much as possible in favour of native functions or SQL. Python UDFs are slow largely because they are opaque to Spark's Catalyst optimizer and require shipping data between the JVM and the Python worker processes; Spark has supported user-defined functions since it added a Python API in version 0.7.
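As a hypothetical illustration of that advice (the "name" column is assumed; both udf and upper come from org.apache.spark.sql.functions), here is a trivial UDF next to its native equivalent:

import org.apache.spark.sql.functions.{udf, upper, col}

// UDF: opaque to the Catalyst optimizer, runs as a black box per row
val upperUDF = udf((s: String) => s.toUpperCase)
df.withColumn("u", upperUDF(col("name")))

// native function: Catalyst can optimize and code-generate this
df.withColumn("u", upper(col("name")))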
1 Answer
Creating multiple top-level columns from a single UDF call isn't possible, but you can return a struct and extract fields from it. The repeated execution happens because Spark inlines the UDF expression into every column that references it, so each extraction of "test.one", "test.two", and so on evaluates the UDF again.
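For example (a minimal sketch; the case class and its fields are illustrative, not from the original post), a Scala UDF can return a case class, which Spark exposes as a single struct column:

import org.apache.spark.sql.functions.{udf, col}

case class TestResult(one: String, two: String)

// one UDF producing one struct column
val structUDF = udf((lat: Double, lon: Double) => TestResult("one", "two"))

df.withColumn("test", structUDF(col("lat"), col("lon")))
  .select(col("*"), col("test.one").as("test1"), col("test.two").as("test2"))

Note that, without the asNondeterministic trick below, each field extraction can still re-evaluate the UDF.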
If you want to avoid multiple calls to a UDF (which is especially useful if the UDF is a bottleneck in your job), you can do it as follows:
val testUDF = udf(test _).asNondeterministic()
Basically, you tell Spark that your function is not deterministic, and Spark then makes sure it is called only once, because it is not safe to call it multiple times (each call could return a different result).
Also be aware that this trick is not free: it puts constraints on the optimizer. One side effect is that the Spark optimizer does not push filters through non-deterministic expressions, so you become responsible for placing filters optimally in your query.
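For instance (a sketch reusing testUDF from above; the filter condition is illustrative), apply filters before the non-deterministic UDF yourself:

// Spark will not push this filter below the UDF projection on its own,
// so filter first to avoid evaluating the UDF on rows you discard anyway
df.filter(col("lat") > 0)
  .withColumn("test", testUDF(col("lat"), col("lon")))
  .withColumn("test1", col("test.one"))
  .withColumn("test2", col("test.two"))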