I'm bringing up a very silly question about Spark because I want to clear up my confusion. I'm very new to Spark and still trying to understand how it works internally.
Say I have a list of input files (assume 1000) which I want to process or write somewhere, and I want to use coalesce to reduce my partition count to 100 (a rough sketch of the job I have in mind follows the two scenarios below).
Now I run this job with 12 executors with 5 cores per executor, which means 60 tasks can run at a time. Does that mean each task will work on one single partition independently?
Round 1: 12 executors each with 5 cores => 60 tasks process 60 partitions
Round 2: 8 executors each with 5 cores => 40 tasks process the remaining 40 partitions, and 4 executors never get a task a 2nd time
Or will all tasks from the same executor work on the same partition?
Round 1: 12 executors => process 12 partitions
Round 2: 12 executors => process 12 partitions
Round 3: 12 executors => process 12 partitions
...
Round 9 (96 partitions already processed): 4 executors => process the remaining 4 partitions
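For concreteness, here is a minimal sketch of the kind of job I have in mind (the paths, app name, and submit flags are just placeholders):

import org.apache.spark.sql.SparkSession

// Submitted roughly like:
//   spark-submit --num-executors 12 --executor-cores 5 ... my-job.jar
object CoalesceJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-example").getOrCreate()

    // ~1000 input files => roughly one partition per file/block to start with
    val input = spark.read.text("hdfs:///data/input/*")

    // reduce to 100 partitions before writing
    input.coalesce(100)
      .write
      .text("hdfs:///data/output")

    spark.stop()
  }
}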
You can try to increase the number of partitions with coalesce, but it won't work: numbersDf3 keeps its four partitions even though we attempted to create 6 partitions with coalesce(6). The coalesce algorithm changes the number of partitions only by moving data from some partitions into existing partitions.
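For example, a rough sketch (numbersDf3 here is just a small DataFrame built for illustration, and spark is an existing SparkSession):

val numbersDf3 = spark.range(16).repartition(4)  // a DataFrame with 4 partitions
println(numbersDf3.rdd.getNumPartitions)         // 4

val attempted = numbersDf3.coalesce(6)           // try to "increase" to 6
println(attempted.rdd.getNumPartitions)          // still 4 -- coalesce cannot add partitions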
A common sizing recommendation is to leave 1 core per node for Hadoop/YARN daemons. For example, on a 10-node cluster with 16 cores per node: cores available per node = 16 - 1 = 15, so the total available cores in the cluster = 15 x 10 = 150, and the number of available executors = (total cores / num-cores-per-executor) = 150 / 5 = 30.
repartition() can be used to increase or decrease the number of partitions, but it involves heavy data shuffling across the cluster. On the other hand, coalesce() can be used only to decrease the number of partitions, and in most cases it does not trigger a shuffle.
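A rough sketch of the difference (assuming an existing SparkSession named spark):

val df = spark.range(1000000).toDF("id")

// repartition can go up or down in partition count, but always performs a full shuffle
val wider = df.repartition(200)   // 200 partitions, data shuffled across the cluster
val fewer = df.repartition(50)    // 50 partitions, also shuffled

// coalesce only goes down, merging existing partitions without a full shuffle
val merged = df.coalesce(50)      // 50 partitions, typically no shuffle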
The number of executors for a Spark application can be specified inside the SparkConf or via the flag --num-executors on the command line. Cluster manager: an external service for acquiring resources on the cluster (e.g. the standalone manager, Mesos, YARN).
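As a sketch, the SparkConf counterpart of --num-executors is the spark.executor.instances property (this applies when dynamic allocation is not in use):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// spark.executor.instances is the SparkConf equivalent of --num-executors
val conf = new SparkConf()
  .setAppName("executor-config-example")
  .set("spark.executor.instances", "12")
  .set("spark.executor.cores", "5")

val spark = SparkSession.builder().config(conf).getOrCreate()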
Say I have a list of input files (assume 1000) which I want to process or write somewhere, and I want to use coalesce to reduce my partition count to 100.
In Spark, the default number of partitions for HDFS input equals the number of HDFS blocks; since coalesce(100) is specified, Spark will reduce the input data to 100 partitions.
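You can verify this yourself, roughly like the sketch below (assuming the input lives on HDFS):

val input = spark.read.text("hdfs:///data/input/*")
println(input.rdd.getNumPartitions)    // roughly one partition per HDFS block / input file

val reduced = input.coalesce(100)
println(reduced.rdd.getNumPartitions)  // 100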
Now I run this job with 12 executors with 5 cores per executor, which means 60 tasks can run at a time. Does that mean each of the tasks will work on one single partition independently?
As you might have passed:
--num-executors 12 : the number of executors to launch for the application.
--executor-cores 5 : the number of cores per executor; 1 core = 1 task at a time.
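So the number of tasks that can run concurrently is simply executors x cores per executor (assuming the default spark.task.cpus = 1); a tiny sketch of that arithmetic:

val numExecutors     = 12
val coresPerExecutor = 5
val taskCpus         = 1  // spark.task.cpus, defaults to 1

// maximum number of tasks (and therefore partitions) processed at the same time
val concurrentTasks = numExecutors * coresPerExecutor / taskCpus  // 60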
So the execution of partitions will go like this.
60 partitions will be processed first, by 12 executors running 5 tasks (threads) each, one task per partition.
The remaining 40 partitions will be picked up by whichever task slots free up first (roughly 8 executors' worth), so about 4 executors' worth of slots never receive a second partition.
NOTE: Usually, some executors complete their assigned work more quickly than others (depending on factors like data locality, network I/O, CPU, etc.). A free task slot will then pick up the next unprocessed partition, possibly after waiting for the configured scheduling/locality delay.
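The wait mentioned in the note is governed by the locality wait setting; a hedged sketch of adjusting it (spark.locality.wait is a real Spark setting, the value below is only an example):

import org.apache.spark.sql.SparkSession

// spark.locality.wait: how long the scheduler waits for a data-local task slot
// before falling back to a less-local one (default 3s; 1s here is just an example)
val spark = SparkSession.builder()
  .appName("locality-wait-example")
  .config("spark.locality.wait", "1s")
  .getOrCreate()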