Spark coalesce relationship with number of executors and cores

I'm asking a very basic question about Spark to clear up my confusion. I'm very new to Spark and still trying to understand how it works internally.

Say I have a list of input files (assume 1000) which I want to process or write somewhere, and I want to use coalesce to reduce my partition number to 100.

Now I run this job with 12 executors and 5 cores per executor, which means 60 tasks can run at once. Does that mean each of those tasks will work on one single partition independently?

Round 1: 12 executors, each with 5 cores => 60 tasks process 60 partitions
Round 2: 8 executors, each with 5 cores => 40 tasks process the remaining 40 partitions (4 executors never get work a second time)

Or will all tasks from the same executor work on the same partition?

Round 1: 12 executors => process 12 partitions
Round 2: 12 executors => process 12 partitions
Round 3: 12 executors => process 12 partitions
...
Round 9 (96 partitions already processed): 4 executors => process the remaining 4 partitions

asked Jul 19 '16 by sikara tijuhara

People also ask

Can coalesce increase number of partitions?

You can try to increase the number of partitions with coalesce, but it won't work! numbersDf3 keeps four partitions even though we attempted to create 6 partitions with coalesce(6). The coalesce algorithm changes the number of partitions by moving data from some partitions into existing partitions.
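As a rough illustration of that behaviour, here is a small sketch intended for spark-shell (where `spark` is already defined); the DataFrame names mirror the quote, but the data itself is made up:

```scala
import spark.implicits._

// Start with a DataFrame that has 4 partitions.
val numbersDf = (1 to 100).toDF("number").repartition(4)
println(numbersDf.rdd.getNumPartitions)   // 4

// Asking coalesce for MORE partitions has no effect...
val numbersDf3 = numbersDf.coalesce(6)
println(numbersDf3.rdd.getNumPartitions)  // still 4

// ...while asking for fewer partitions works as expected.
println(numbersDf.coalesce(2).rdd.getNumPartitions)  // 2
```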

How do you determine the number of executors and cores in Spark?

According to the recommendations discussed above: leave 1 core per node for Hadoop/YARN daemons => number of cores available per node = 16 - 1 = 15. So the total number of available cores in the cluster = 15 x 10 = 150. Number of available executors = (total cores / num-cores-per-executor) = 150 / 5 = 30.
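The arithmetic above can be written out explicitly. A minimal sketch, assuming the cluster from the quote (10 nodes with 16 cores each):

```scala
// Sizing rule from the quote: 10 nodes, 16 cores each (assumed numbers).
val nodes        = 10
val coresPerNode = 16

val usableCoresPerNode = coresPerNode - 1                    // leave 1 core for Hadoop/YARN daemons => 15
val totalUsableCores   = usableCoresPerNode * nodes          // 15 * 10 = 150
val coresPerExecutor   = 5                                   // commonly recommended value
val numExecutors       = totalUsableCores / coresPerExecutor // 150 / 5 = 30

println(s"Executors available across the cluster: $numExecutors")
```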

Why coalesce is better than repartition in Spark?

The repartition() can be used to increase or decrease the number of partitions, but it involves heavy data shuffling across the cluster. On the other hand, coalesce() can be used only to decrease the number of partitions. In most cases, coalesce() does not trigger a shuffle.
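A quick way to see the difference is to compare the physical plans. A sketch for spark-shell, with a made-up DataFrame and partition counts:

```scala
import spark.implicits._

val df = (1 to 1000).toDF("n").repartition(8)

// repartition() can increase or decrease, but always does a full shuffle:
// an Exchange node appears in the physical plan.
df.repartition(16).explain()
df.repartition(4).explain()

// coalesce() only decreases and, in most cases, merges existing partitions
// without a shuffle (a Coalesce node appears instead of an Exchange).
df.coalesce(4).explain()
println(df.coalesce(4).rdd.getNumPartitions)  // 4
```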

How would you set the number of executors in any Spark based application?

The number of executors for a Spark application can be specified inside the SparkConf or via the flag --num-executors from the command line. Cluster manager: an external service for acquiring resources on the cluster (e.g. standalone manager, Mesos, YARN).
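For example, the 12 executors and 5 cores from the question could be requested programmatically like this (a sketch; the app name is made up, `spark.executor.instances` is honoured on YARN and is ignored when dynamic allocation is enabled):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// Programmatic equivalent of `--num-executors 12 --executor-cores 5`.
val conf = new SparkConf()
  .setAppName("executor-config-demo")
  .set("spark.executor.instances", "12")
  .set("spark.executor.cores", "5")

val spark = SparkSession.builder().config(conf).getOrCreate()
```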


1 Answer

Say, if I have a list of input files(assume 1000) which I want to process or write somewhere and I want to use coalesce to reduce my partition number to 100.

In Spark, by default, the number of partitions equals the number of HDFS blocks. Since coalesce(100) is specified, Spark will divide the input data into 100 partitions.
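For illustration, the scenario from the question might look like this sketch for spark-shell (the paths and file counts are hypothetical):

```scala
// ~1000 input files => roughly one partition per HDFS block by default.
val input = spark.read.textFile("hdfs:///data/input/*")
println(input.rdd.getNumPartitions)

// coalesce(100) merges the existing partitions down to 100 without a shuffle,
// so the write stage below runs 100 tasks and produces 100 output files.
val coalesced = input.coalesce(100)
println(coalesced.rdd.getNumPartitions)   // 100

coalesced.write.text("hdfs:///data/output")
```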

Now I run this job with 12 executors with 5 cores per executor, that means 60 tasks when it runs. Does that mean, each of the tasks will work on one single partition independently?

(Diagram: worker node with its executors)

Given the options you passed:

--num-executors 12: Number of executors to launch in an application.

--executor-cores 5: Number of cores per executor. 1 core = 1 task at a time.

So the execution of partitions will go like this.

Round 1

12 partitions will be processed by 12 executors, with 5 tasks (threads) each.

Round 2

12 partitions will be processed by 12 executors, with 5 tasks (threads) each.
.
.
.

Round 9 (96 partitions already processed):

4 partitions will be processed by 4 executors, with 5 tasks (threads) each.

NOTE: Usually, some executors complete their assigned work faster than others (depending on factors like data locality, network I/O, and CPU). Such an executor will pick up the next pending partition after waiting for the configured amount of scheduling time.
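That waiting time is Spark's locality wait. A minimal sketch of how it can be tuned (the app name is made up; the 3s values shown are simply Spark's defaults):

```scala
import org.apache.spark.sql.SparkSession

// How long a free task slot waits for a data-local partition before falling
// back to a less-local one (process -> node -> rack -> any).
val spark = SparkSession.builder()
  .appName("locality-wait-demo")
  .config("spark.locality.wait", "3s")        // global default
  .config("spark.locality.wait.node", "3s")   // can be overridden per locality level
  .getOrCreate()
```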

answered Oct 11 '22 by mrsrinivas