In Spark, is it possible to share data between two executors?

I have some really large read-only data that I want all the executors on the same node to use. Is that possible in Spark? I know you can broadcast variables, but can you broadcast really big arrays? Under the hood, does broadcasting share data between executors on the same node? How would it be able to share data between the JVMs of the executors running on the same node?

pythonic asked Oct 22 '16 09:10

People also ask

What are shared variables in Spark?

Spark supports two types of shared variables: broadcast variables, which can be used to cache a value in memory on all nodes, and accumulators, which are variables that are only “added” to, such as counters and sums. This guide shows each of these features in each of Spark's supported languages.
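
For illustration, here is a minimal Scala sketch of both kinds of shared variables; the lookup map and miss counter are hypothetical examples, not part of any particular application:

    import org.apache.spark.sql.SparkSession

    object SharedVariablesSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder.appName("shared-vars").master("local[*]").getOrCreate()
        val sc = spark.sparkContext

        // Broadcast variable: a read-only value cached once per executor.
        val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))

        // Accumulator: tasks may only "add" to it; the driver reads the result.
        val misses = sc.longAccumulator("misses")

        val resolved = sc.parallelize(Seq("a", "b", "c")).map { key =>
          lookup.value.getOrElse(key, { misses.add(1); -1 })
        }.collect()

        println(s"resolved: ${resolved.mkString(",")}; misses: ${misses.value}")
        spark.stop()
      }
    }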

How many executors can you have in Spark?

The first Spark job starts with two executors (because the minimum number of nodes is set to two in this example). The cluster can autoscale to a maximum of ten executors (because the maximum number of nodes is set to ten).
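
At the Spark level, a related mechanism (distinct from cluster-node autoscaling, though usually paired with it) is dynamic allocation. A sketch of the two-to-ten bounds described above, as you might write it in spark-shell; the same keys can be passed as --conf flags to spark-submit:

    import org.apache.spark.sql.SparkSession

    // Let Spark itself grow and shrink the executor count within bounds.
    val spark = SparkSession.builder
      .appName("autoscaling-sketch")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.dynamicAllocation.minExecutors", "2")
      .config("spark.dynamicAllocation.maxExecutors", "10")
      .config("spark.shuffle.service.enabled", "true") // required for dynamic allocation on YARN
      .getOrCreate()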

What is sent to executors in Spark?

Once connected, Spark acquires executors on nodes in the cluster, which are processes that run computations and store data for your application. Next, it sends your application code (defined by JAR or Python files passed to SparkContext) to the executors.

What is the use of executor memory in Spark?

An executor is a process that is launched for a Spark application on a worker node. Each executor's memory is the sum of the YARN overhead memory and the JVM heap memory; the JVM heap memory in turn includes, among other regions, the RDD cache memory.

What is the difference between worker nodes and executors in Spark?

A single node can run multiple executors, and the executors for an application can span multiple worker nodes. An executor stays up for the duration of the Spark application and runs its tasks in multiple threads.
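
As a sketch of the sizing knobs involved (the values here are made up): with static allocation, a six-executor application on a three-node cluster would place more than one executor JVM per node, and each executor would run tasks in up to four threads:

    import org.apache.spark.sql.SparkSession

    // Static allocation: a fixed number of executor JVMs, each with a
    // fixed number of task threads.
    val spark = SparkSession.builder
      .appName("static-executors-sketch")
      .config("spark.executor.instances", "6")
      .config("spark.executor.cores", "4")
      .getOrCreate()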

How do Spark drivers launch the executors?

Let's say a user submits a job using "spark-submit". "spark-submit" will in turn launch the driver, which executes the main() method of our code. The driver contacts the cluster manager and requests resources to launch the executors. The cluster manager then launches the executors on behalf of the driver.

How much memory does a Spark executor use?

spark.executor.memory + spark.yarn.executor.memoryOverhead. So, if we request 20GB per executor, the AM will actually get 20GB + memoryOverhead = 20GB + 7% of 20GB ≈ 21.4GB of memory for us. Running executors with too much memory often results in excessive garbage collection delays.
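
Expressed as configuration, a sketch of that request might look like this (the 20 GB figure comes from the example above; in older YARN deployments the overhead defaulted to roughly max(384 MB, 7% of executor memory), and newer Spark versions use the key spark.executor.memoryOverhead instead):

    import org.apache.spark.sql.SparkSession

    // Sketch: a 20 GB executor heap plus an explicit ~7% YARN overhead.
    val spark = SparkSession.builder
      .appName("executor-memory-sketch")
      .config("spark.executor.memory", "20g")
      .config("spark.yarn.executor.memoryOverhead", "1434") // in MB, ~7% of 20 GB
      .getOrCreate()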

Why can't shared variables be used across tasks in Spark?

Also, updates to those variables on a remote machine are not sent back to the driver program. Therefore, it would be inefficient to support general, read-write shared variables across tasks. However, for two common usage patterns, Spark provides two types of shared variables: broadcast variables and accumulators.
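
A minimal sketch of the pitfall: a plain variable captured in a closure is copied into each task, so its updates never reach the driver, while an accumulator's additions are merged back:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("closure-pitfall").master("local[2]").getOrCreate()
    val sc = spark.sparkContext

    var plainCounter = 0                          // captured by the closure: each task gets a copy
    val accCounter = sc.longAccumulator("events") // merged back to the driver

    sc.parallelize(1 to 100).foreach { _ =>
      plainCounter += 1 // updates a per-task copy; never sent back
      accCounter.add(1) // safe: Spark aggregates task-side additions
    }

    println(plainCounter)     // still 0 on a real cluster (local mode may differ)
    println(accCounter.value) // 100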


1 Answer

Yes, you can use broadcast variables here, since your data is read-only (immutable). A broadcast variable must satisfy the following properties:

  • Fit in memory
  • Immutable
  • Distributed to the cluster

So the only condition here is that your data must fit in memory on a single node. That means the data should NOT be anything super large, beyond the node's memory limits, like a massive table.

Each executor receives a copy of the broadcast variable, and all the tasks in that particular executor read/use that data. It's like shipping a large, read-only dataset to all the worker nodes in the cluster: the data is shipped to each worker only once, instead of with each task, and the executor's tasks read it from there.
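
As a sketch of that pattern for the question's use case (the array size and contents are made up for illustration):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder.appName("broadcast-lookup").getOrCreate()
    val sc = spark.sparkContext

    // A large read-only structure built (or loaded) on the driver.
    val bigArray: Array[Double] = Array.fill(10 * 1000 * 1000)(scala.util.Random.nextDouble())

    // Shipped to each executor once, not once per task; every task in
    // that executor's JVM then reads the same deserialized copy.
    val bcast = sc.broadcast(bigArray)

    val doubled = sc.parallelize(0 until 100).map { i =>
      bcast.value(i) * 2.0 // tasks read, but never modify, the shared copy
    }.collect()

    bcast.destroy() // free executor-side copies when no longer needed
    spark.stop()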

Kris answered Sep 29 '22 12:09