 

How does Spark send closures to workers?

Tags:

apache-spark

When I write an RDD transformation, e.g.

val rdd = sc.parallelize(1 to 1000)
rdd.map(x => x * 3)

I understand that the closure (x => x * 3), which is simply a Function1, needs to be Serializable, and I think I read somewhere (EDIT: it's implied right there in the documentation: http://spark.apache.org/docs/latest/programming-guide.html#passing-functions-to-spark) that it is "sent" to the workers for execution (e.g. Akka sending an "executable piece of code" down the wire to a worker to run).
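
For concreteness, the lambda above is just an object implementing Function1[Int, Int]; written out by hand (the name triple is invented for illustration) it is roughly equivalent to:

val triple = new Function1[Int, Int] with Serializable {
  def apply(x: Int): Int = x * 3
}
rdd.map(triple) // behaves the same as rdd.map(x => x * 3)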

Is that how it works?

Someone at a meetup I attended commented that it is not actually sending any serialized code; since each worker gets a "copy" of the jar anyway, it just needs a reference to which function to run, or something like that (though I'm not sure I'm quoting that person correctly).

I'm now utterly confused about how it actually works.

So my questions are:

  1. How are transformation closures sent to workers? Serialized via Akka? Or are they "already there" because Spark sends the entire uber-jar to each worker (sounds unlikely to me...)?

  2. If so, then how is the rest of the jar sent to the workers? Is this what "cleaning" the closure (ClosureCleaner) does, e.g. sending only the bytecode the closure depends on to the worker instead of the entire uber-jar?

  3. To summarise: does Spark, at any point, sync the jars on the --jars classpath with the workers somehow, or does it send "just the right amount" of code to the workers? And if it does send closures, are they cached in case recomputation is needed, or is the closure sent with the task every time a task is scheduled? Sorry if these are silly questions, but I really don't know.

Please add sources for your answer if you can; I couldn't find this stated explicitly in the documentation, and I'm too wary to try to conclude it just by reading the code.

asked Aug 14 '15 by Eran Medan



1 Answer

The closures most certainly are serialized at runtime; I have seen plenty of "Task not serializable" exceptions at runtime, from both PySpark and Scala. There is complex code, in ClosureCleaner.scala, that attempts to minify the code being serialized:

def clean(
    closure: AnyRef,
    checkSerializable: Boolean = true,
    cleanTransitively: Boolean = true): Unit = {
  clean(closure, checkSerializable, cleanTransitively, Map.empty)
}

The cleaned closure is then sent across the wire, provided it is serializable; otherwise an exception is thrown.
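
To make that concrete, here is a minimal sketch of the failure mode and the usual workaround (the Repo class and its field are invented for illustration): referencing a field pulls the whole enclosing object into the closure, while copying the field to a local val keeps the serialized closure small.

import org.apache.spark.rdd.RDD

class Repo {                                  // note: Repo is NOT Serializable
  val factor = 3

  def scaleBroken(rdd: RDD[Int]): RDD[Int] =
    rdd.map(x => x * factor)                  // captures `this` (the whole Repo)
                                              // => SparkException: Task not serializable

  def scaleOk(rdd: RDD[Int]): RDD[Int] = {
    val f = factor                            // copy to a local val; only an Int is captured
    rdd.map(x => x * f)
  }
}

This is the same pattern described in the "Passing Functions to Spark" section of the programming guide linked in the question.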

Here is another excerpt from ClosureCleaner that checks whether an incoming function can be serialized:

  private def ensureSerializable(func: AnyRef) {
    try {
      if (SparkEnv.get != null) {
        SparkEnv.get.closureSerializer.newInstance().serialize(func)
      }
    } catch {
      case ex: Exception => throw new SparkException("Task not serializable", ex)
    }
  }
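
As a rough illustration only (Spark's actual check goes through SparkEnv.get.closureSerializer, a JavaSerializer by default), you can mimic this with plain Java serialization, since a Scala closure is just a serializable object; the SerializabilityCheck object below is invented for the sketch:

import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

object SerializabilityCheck {
  // Mimics ensureSerializable: try to serialize the function and fail loudly if we cannot.
  def ensureSerializable(func: AnyRef): Unit = {
    val out = new ObjectOutputStream(new ByteArrayOutputStream())
    try out.writeObject(func)
    catch {
      case e: NotSerializableException =>
        throw new RuntimeException("Task not serializable", e)
    } finally out.close()
  }

  def main(args: Array[String]): Unit = {
    ensureSerializable((x: Int) => x * 3)          // fine: captures nothing

    class NotSer { val factor = 3 }                // NotSer is not Serializable
    val n = new NotSer
    ensureSerializable((x: Int) => x * n.factor)   // throws: the closure captures `n`
  }
}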
answered Sep 19 '22 by WestCoastProjects