I have some jobs where tasks are dominated by task deserialization time. The tasks themselves complete in about 10 seconds after 3 minutes of task deserialization.
What are the exact boundaries of this metric? What resource limitations most often contribute to long deserialization times?
Summary metrics for all tasks are represented in a table and in a timeline in the Spark UI. The relevant ones here are:

- Task Deserialization Time: time spent deserializing the task (its binary and its dependencies) on the executor.
- Duration: time the task spent actually executing.
- GC Time: total JVM garbage collection time while the task was running.
- Result Serialization Time: time spent serializing the task result on the executor before sending it back to the driver.
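Deserialization cost grows with the size of the serialized object graph. Here is a minimal, Spark-free sketch using plain Java serialization to illustrate the effect; this is not Spark's internal code path (Spark uses its configured serializer), just a demonstration of the concept that the metric measures.

```scala
import java.io._

// Build a payload roughly analogous to a task whose closure drags in a
// large object graph. Purely illustrative sizes.
val payload: Array[Array[Byte]] = Array.fill(1000)(new Array[Byte](10000))

// Serialize it with plain Java serialization.
val bos = new ByteArrayOutputStream()
val oos = new ObjectOutputStream(bos)
oos.writeObject(payload)
oos.close()
val bytes = bos.toByteArray

// Time the deserialization step, which is what "Task Deserialization
// Time" captures on the executor side.
val start = System.nanoTime()
val ois = new ObjectInputStream(new ByteArrayInputStream(bytes))
val restored = ois.readObject().asInstanceOf[Array[Array[Byte]]]
ois.close()
val elapsedMs = (System.nanoTime() - start) / 1e6

println(s"serialized size: ${bytes.length} bytes, deserialized in $elapsedMs ms")
```

Scaling the payload up makes the deserialization time grow accordingly, which is why a task that drags along a lot of serialized state can spend far longer deserializing than computing.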
Scheduler Delay: Spark relies on data locality and tries to execute tasks as close to the data as possible to minimize data transfer. A task's preferred location can be either a host or a (host, executor) pair. If no available executor satisfies a task's locality preference, the scheduler keeps the task waiting until a locality timeout (spark.locality.wait, 3s by default) is reached, then falls back to a less local level.
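If tasks sit idle waiting for a preferred location, the locality wait can be tuned. A sketch of how you might set it at submit time; the class and JAR names are placeholders:

```shell
# Placeholder class/JAR names. spark.locality.wait (default 3s) caps how
# long the scheduler waits for a more local slot before falling back;
# per-level variants (spark.locality.wait.node, .rack) also exist.
spark-submit \
  --class com.example.MyJob \
  --conf spark.locality.wait=1s \
  my-job.jar
```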
Number of tasks executed in parallel: the number of CPU cores available to an executor determines how many tasks the application can run in parallel at any given time.
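The arithmetic above can be sketched directly; the cluster sizes here are assumptions for illustration, not values from the question:

```scala
// Hypothetical cluster sizing; the numbers are assumptions.
val numExecutors = 10
val coresPerExecutor = 4

// Tasks that can run simultaneously for the application at any given time.
val maxParallelTasks = numExecutors * coresPerExecutor

// A stage with more tasks than that runs in successive "waves".
val numTasks = 500
val waves = math.ceil(numTasks.toDouble / maxParallelTasks).toInt

println(s"max parallel tasks = $maxParallelTasks, stage runs in $waves waves")
```

Note that if every task pays a fixed deserialization cost, that cost is paid once per task, so a stage with many small tasks multiplies the overhead.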
A quick look at the source code on master (https://github.com/kayousterhout/spark-1/blob/master/core/src/main/scala/org/apache/spark/executor/Executor.scala#L179) shows that it's essentially this:
val (taskFiles, taskJars, taskBytes) = Task.deserializeWithDependencies(serializedTask)
updateDependencies(taskFiles, taskJars)
task = ser.deserialize[Task[Any]](taskBytes, Thread.currentThread.getContextClassLoader)

// If this task has been killed before we deserialized it, let's quit now. Otherwise,
// continue executing the task.
if (killed) {
  // Throw an exception rather than returning, because returning within a try{} block
  // causes a NonLocalReturnControl exception to be thrown. The NonLocalReturnControl
  // exception will be caught by the catch block, leading to an incorrect ExceptionFailure
  // for the task.
  throw new TaskKilledException
}

attemptedTask = Some(task)
logDebug("Task " + taskId + "'s epoch is " + task.epoch)
env.mapOutputTracker.updateEpoch(task.epoch)
From the line val (taskFiles, taskJars, taskBytes) = Task.deserializeWithDependencies(serializedTask), I suspect that each task deserializes its file and JAR dependencies, and updateDependencies then fetches any JARs the executor does not already have; in my case a 136 MB fat JAR isn't helping.
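If the fat JAR really is the culprit, one commonly suggested mitigation is to pre-deploy it to every worker and put it on the executor classpath, so it is not shipped and fetched as a task dependency. A sketch with hypothetical paths; it assumes your deployment tooling copies the JAR to the same location on every node:

```shell
# Hypothetical paths; assumes the assembly JAR has already been copied to
# /opt/myapp/ on every worker node before the job is submitted.
spark-submit \
  --class com.example.MyJob \
  --conf spark.executor.extraClassPath=/opt/myapp/myapp-assembly.jar \
  --conf spark.driver.extraClassPath=/opt/myapp/myapp-assembly.jar \
  /opt/myapp/myapp-assembly.jar
```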