Spark UDF called more than once per record when DF has too many columns

I'm using Spark 1.6.1 and encountering strange behaviour: I'm running a UDF with some heavy computations (a physics simulation) on a DataFrame containing some input data, and building up a result DataFrame containing many columns (~40).

Strangely, my UDF is called more than once per record of my input DataFrame in this case (1.6 times as often), which I find unacceptable because it's very expensive. If I reduce the number of columns (e.g. to 20), this behaviour disappears.

I managed to write a small script which demonstrates this:

import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.functions.udf


object Demo {

  case class Result(a: Double)

  def main(args: Array[String]): Unit = {

    val sc = new SparkContext(new SparkConf().setAppName("Demo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val numRuns = sc.accumulator(0) // to count the number of udf calls

    val myUdf = udf((i: Int) => { numRuns.add(1); Result(i.toDouble) })

    val data = sc.parallelize((1 to 100), numSlices = 5).toDF("id")

    // get results of UDF
    var results = data
      .withColumn("tmp", myUdf($"id"))
      .withColumn("result", $"tmp.a")


    // add many columns to dataframe (must depend on the UDF's result)
    for (i <- 1 to 42) {
      results = results.withColumn(s"col_$i", $"result")
    }

    // trigger action
    val res = results.collect()
    println(res.size) // prints 100

    println(numRuns.value) // prints 160

  }
}

Now, is there a way to solve this without reducing the number of columns?

asked Oct 29 '16 15:10

Raphael Roth

2 Answers

I can't really explain this behavior, but evidently the query plan somehow chooses a path where some of the records are calculated twice. This means that if we cache the intermediate result (right after applying the UDF), we might be able to "force" Spark not to recompute the UDF. And indeed, once caching is added it behaves as expected: the UDF is called exactly 100 times:

// get results of UDF
var results = data
  .withColumn("tmp", myUdf($"id"))
  .withColumn("result", $"tmp.a").cache()

Of course, caching has its own costs (memory...), but it might be beneficial in your case if it saves many UDF calls.
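To see what the query plan actually does with and without the cache, one option is to print the plans with `explain(true)` and look for where the UDF shows up. The following is a minimal sketch against the Spark 1.6 API, reusing the setup from the question; the object name `ExplainDemo` and the helper `widen` are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.udf
import org.apache.spark.{SparkConf, SparkContext}

object ExplainDemo {

  case class Result(a: Double)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("ExplainDemo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val myUdf = udf((i: Int) => Result(i.toDouble))
    val data = sc.parallelize(1 to 100, numSlices = 5).toDF("id")

    // Hypothetical helper: adds the 42 derived columns from the question
    def widen(base: DataFrame): DataFrame =
      (1 to 42).foldLeft(base)((df, i) => df.withColumn(s"col_$i", $"result"))

    val base = data
      .withColumn("tmp", myUdf($"id"))
      .withColumn("result", $"tmp.a")

    // explain(true) prints the logical and physical plans without running the job
    println("=== without cache ===")
    widen(base).explain(true)

    println("=== with cache ===")
    widen(base.cache()).explain(true)

    sc.stop()
  }
}
```

Comparing the two outputs should show whether the UDF expression ends up inside the final projection or is read back from the cached relation.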

answered Sep 22 '22 21:09

Tzach Zohar

We had the same problem about a year ago and spent a lot of time before we finally figured out what was going on.

We also had a very expensive UDF, and we found out that it gets calculated again and again, once for every time we refer to its column. It just happened to us again a few days ago, so I decided to open a bug on this: SPARK-18748

We also opened a question here then, but now I see the title wasn't so good: Trying to turn a blob into multiple columns in Spark

I agree with Tzach about somehow "forcing" the plan to calculate the UDF. Our approach is uglier, but we had no choice: we couldn't cache() the data because it was too big:

val df = data.withColumn("tmp", myUdf($"id"))
val results = sqlContext.createDataFrame(df.rdd, df.schema)
             .withColumn("result", $"tmp.a")
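The same trick can be wrapped in a small helper and checked with the accumulator from the question. This is only a sketch against the Spark 1.6 API; `cutLineage` and `LineageCutDemo` are hypothetical names, not part of Spark. As far as I understand it, the rebuilt DataFrame starts from an existing RDD, so later projections have nothing to merge the UDF into. The round trip through `df.rdd` has its own conversion cost.

```scala
import org.apache.spark.sql.{DataFrame, SQLContext}
import org.apache.spark.sql.functions.udf
import org.apache.spark.{SparkConf, SparkContext}

object LineageCutDemo {

  case class Result(a: Double)

  // Hypothetical helper (not part of Spark): rebuild a DataFrame from its own
  // RDD and schema so that later columns are derived from a fresh plan.
  def cutLineage(df: DataFrame): DataFrame =
    df.sqlContext.createDataFrame(df.rdd, df.schema)

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("LineageCutDemo").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val numRuns = sc.accumulator(0) // counts UDF invocations
    val myUdf = udf((i: Int) => { numRuns.add(1); Result(i.toDouble) })

    val data = sc.parallelize(1 to 100, numSlices = 5).toDF("id")

    // Apply the UDF, cut the lineage, then derive the remaining columns
    val base = cutLineage(data.withColumn("tmp", myUdf($"id")))
      .withColumn("result", $"tmp.a")

    val results = (1 to 42).foldLeft(base) { (df, i) =>
      df.withColumn(s"col_$i", $"result")
    }

    println(results.collect().length) // number of rows
    println(numRuns.value)            // number of UDF calls, to compare with the row count

    sc.stop()
  }
}
```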

Update:

Now I see that my JIRA ticket was linked to another one, SPARK-17728, which still doesn't really handle this issue the right way, but it does offer one more possible workaround:

import org.apache.spark.sql.functions.{array, explode}

val results = data.withColumn("tmp", explode(array(myUdf($"id"))))
                  .withColumn("result", $"tmp.a")
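For readers on a newer Spark version, there may be a less hacky option: since Spark 2.3, a UDF can be marked with `asNondeterministic()`, which tells the optimizer not to assume that repeated evaluation is harmless. This does not apply to the 1.6.1 setup in the question, and I have not verified it against the exact case above, so treat the following as a sketch that assumes Spark 2.3 or later with the `SparkSession` API. The object name is made up.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

object NondeterministicUdfDemo {

  case class Result(a: Double)

  def main(args: Array[String]): Unit = {
    // Assumption: Spark 2.3+ (asNondeterministic does not exist in 1.6.1)
    val spark = SparkSession.builder()
      .appName("NondeterministicUdfDemo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val numRuns = spark.sparkContext.longAccumulator("udfCalls") // counts UDF invocations

    // Mark the UDF as non-deterministic so the optimizer treats it conservatively
    val myUdf = udf((i: Int) => { numRuns.add(1); Result(i.toDouble) }).asNondeterministic()

    val data = spark.sparkContext.parallelize(1 to 100, numSlices = 5).toDF("id")

    val base = data
      .withColumn("tmp", myUdf($"id"))
      .withColumn("result", $"tmp.a")

    val results = (1 to 42).foldLeft(base) { (df, i) =>
      df.withColumn(s"col_$i", $"result")
    }

    println(results.collect().length) // number of rows
    println(numRuns.value)            // number of UDF calls, to compare with the row count

    spark.stop()
  }
}
```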

answered Sep 26 '22 21:09

uzadude
