I've seen from this question that Spark nodes effectively "communicate directly", but I'm less concerned with the theory and more with the implementation. Here it shows, in the "Encryption" section near the bottom of the page, that you can configure Spark to use a number of SSL protocols for security, which suggests, to me at least, that it uses some form of HTTP(S) for communication. My question is effectively two parts: what protocol do Spark nodes use to communicate, and how is the data formatted for this transfer?
The shuffle service is a proxy through which Spark executors fetch blocks, so its lifecycle is independent of the lifecycle of any executor. Apache Spark provides an extensible framework for plugging in different shuffle service implementations. A shuffle service can run on a Spark worker node or even outside the Spark worker entirely.
Spark uses a master/slave architecture, with one central coordinator (the driver) that communicates with many distributed workers (executors). The driver and each of the executors run in their own Java processes.
You start the ExternalShuffleService using the start-shuffle-service.sh shell script, and enable its use by the driver and executors via spark.shuffle.service.enabled.
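For illustration, a minimal sketch of setting that property programmatically (setting it in spark-defaults.conf or on the spark-submit command line works just as well):

import org.apache.spark.SparkConf

// Sketch: enabling the external shuffle service for an application.
val conf = new SparkConf()
  .set("spark.shuffle.service.enabled", "true")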
Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors. Finally, SparkContext sends tasks for the executors to run.
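To make that concrete, here is a minimal, self-contained driver program (the object name DriverExample is made up for illustration). The closure passed to map is part of the application code that the driver ships to the executors:

import org.apache.spark.{SparkConf, SparkContext}

object DriverExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("driver-example"))
    // The function literal below travels from the driver to the executors,
    // which run it against their partitions of the data.
    val squares = sc.parallelize(1 to 10).map(x => x * x).collect()
    println(squares.mkString(", "))
    sc.stop()
  }
}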
Spark uses RPC (Netty) to communicate between the executor processes. You can look into the NettyRpcEndpointRef class to see the actual implementation.
For shuffling data, we start from the BlockManager, which is responsible for providing data blocks; there is one per executor process. Internally it uses a BlockStoreShuffleReader, which manages the reads from different executors using a SerializerManager. This manager holds the actual serializer, which is defined by the spark.serializer property:
val serializer = instantiateClassFromConf[Serializer](
  "spark.serializer", "org.apache.spark.serializer.JavaSerializer")
logDebug(s"Using serializer: ${serializer.getClass}")
When the BlockManager attempts to read a block, it uses the serializer from that underlying configuration. It can be either a KryoSerializer or a JavaSerializer, depending on your setting.
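For example, switching to Kryo is a one-line configuration change (a sketch; registering classes is optional but lets Kryo encode them more compactly):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  // Optional: pre-register classes that will travel through the shuffle.
  .registerKryoClasses(Array(classOf[Array[Int]], classOf[Array[String]]))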
Bottom line: for reading and writing shuffled data, Spark uses the user-defined serializer.
For task serialization, this is a little different.
Spark uses a variable called closureSerializer, which defaults to JavaSerializerInstance, meaning Java serialization. You can see this inside the DAGScheduler.submitMissingTasks method:
val taskBinaryBytes: Array[Byte] = stage match {
  case stage: ShuffleMapStage =>
    JavaUtils.bufferToArray(
      closureSerializer.serialize((stage.rdd, stage.shuffleDep): AnyRef))
  case stage: ResultStage =>
    JavaUtils.bufferToArray(closureSerializer.serialize((stage.rdd, stage.func): AnyRef))
}
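To get a feel for what JavaSerializerInstance does with a closure, here is a plain Java-serialization round trip of a Scala function; this is only an illustration of the mechanism, not Spark's actual code path:

import java.io.{ByteArrayInputStream, ByteArrayOutputStream, ObjectInputStream, ObjectOutputStream}

// Illustration only: Java-serialize a closure, then bring it back to life.
val closure: Int => Int = x => x * 2

val bytesOut = new ByteArrayOutputStream()
val objOut = new ObjectOutputStream(bytesOut)
objOut.writeObject(closure) // works because Scala function literals are serializable
objOut.close()

val objIn = new ObjectInputStream(new ByteArrayInputStream(bytesOut.toByteArray))
val restored = objIn.readObject().asInstanceOf[Int => Int]
println(restored(21)) // 42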
The actual object that gets serialized and sent to each executor is called TaskDescription:
def encode(taskDescription: TaskDescription): ByteBuffer = {
  val bytesOut = new ByteBufferOutputStream(4096)
  val dataOut = new DataOutputStream(bytesOut)

  dataOut.writeLong(taskDescription.taskId)
  dataOut.writeInt(taskDescription.attemptNumber)
  dataOut.writeUTF(taskDescription.executorId)
  dataOut.writeUTF(taskDescription.name)
  dataOut.writeInt(taskDescription.index)

  // Write files.
  serializeStringLongMap(taskDescription.addedFiles, dataOut)

  // Write jars.
  serializeStringLongMap(taskDescription.addedJars, dataOut)

  // Write properties.
  dataOut.writeInt(taskDescription.properties.size())
  taskDescription.properties.asScala.foreach { case (key, value) =>
    dataOut.writeUTF(key)
    // SPARK-19796 -- writeUTF doesn't work for long strings, which can happen for property values
    val bytes = value.getBytes(StandardCharsets.UTF_8)
    dataOut.writeInt(bytes.length)
    dataOut.write(bytes)
  }

  // Write the task. The task is already serialized, so write it directly to the byte buffer.
  Utils.writeByteBuffer(taskDescription.serializedTask, bytesOut)

  dataOut.close()
  bytesOut.close()
  bytesOut.toByteBuffer
}
It is then sent over RPC from the CoarseGrainedSchedulerBackend.launchTasks method:
executorData.executorEndpoint.send(LaunchTask(new SerializableBuffer(serializedTask)))
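On the executor side there is a matching decode step that reads the fields back in the same order encode wrote them. A minimal sketch of reading just the header fields (the helper name decodeHeader is mine, not Spark's):

import java.io.{ByteArrayInputStream, DataInputStream}

// Sketch: read back the fixed header fields in the order encode wrote them.
def decodeHeader(bytes: Array[Byte]): (Long, Int, String, String, Int) = {
  val in = new DataInputStream(new ByteArrayInputStream(bytes))
  val taskId        = in.readLong()
  val attemptNumber = in.readInt()
  val executorId    = in.readUTF()
  val name          = in.readUTF()
  val index         = in.readInt()
  (taskId, attemptNumber, executorId, name, index)
}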
What I've shown so far covers launching tasks. For shuffling data, Spark holds a BlockStoreShuffleReader, which manages the reads from the different executors.