I'm bringing up a very silly question about Spark because I want to clear up my confusion. I'm very new to Spark and still trying to understand how it works internally.
Say I have a list of input files (assume 1000) which I want to process or write somewhere, and I want to use coalesce to reduce my partition count to 100 (a rough sketch of the job I have in mind follows the two scenarios below).
Now I run this job with 12 executors with 5 cores per executor, which means 60 tasks can run at a time. Does that mean each task will work on one single partition independently?
Round 1: 12 executors each with 5 cores => 60 tasks process 60 partitions
Round 2: 8 executors each with 5 cores => 40 tasks process the remaining 40 partitions, and 4 executors never get a task a 2nd time
Or will all tasks from the same executor work on the same partition?
Round 1: 12 executors => process 12 partitions
Round 2: 12 executors => process 12 partitions
Round 3: 12 executors => process 12 partitions
...
Round 9 (96 partitions already processed): 4 executors => process the remaining 4 partitions
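For concreteness, here is a minimal sketch of the kind of job I have in mind (the paths, app name, and submit flags are just placeholders):

import org.apache.spark.sql.SparkSession

// Submitted roughly like:
//   spark-submit --num-executors 12 --executor-cores 5 ... my-job.jar
object CoalesceJob {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("coalesce-example").getOrCreate()

    // ~1000 input files => roughly one partition per file/block to start with
    val input = spark.read.text("hdfs:///data/input/*")

    // reduce to 100 partitions before writing
    input.coalesce(100)
      .write
      .text("hdfs:///data/output")

    spark.stop()
  }
}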
You can try to increase the number of partitions with coalesce, but it won't work: numbersDf3 keeps its four partitions even though we attempted to create 6 partitions with coalesce(6). The coalesce algorithm changes the number of partitions only by moving data from some partitions into existing partitions.
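For example, a rough sketch (numbersDf3 here is just a small DataFrame built for illustration, and spark is an existing SparkSession):

val numbersDf3 = spark.range(16).repartition(4)  // a DataFrame with 4 partitions
println(numbersDf3.rdd.getNumPartitions)         // 4

val attempted = numbersDf3.coalesce(6)           // try to "increase" to 6
println(attempted.rdd.getNumPartitions)          // still 4 -- coalesce cannot add partitions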
A common sizing recommendation is to leave 1 core per node for Hadoop/YARN daemons. For example, on a 10-node cluster with 16 cores per node: cores available per node = 16 - 1 = 15, so the total available cores in the cluster = 15 x 10 = 150, and the number of available executors = (total cores / num-cores-per-executor) = 150 / 5 = 30.
repartition() can be used to increase or decrease the number of partitions, but it involves heavy data shuffling across the cluster. On the other hand, coalesce() can be used only to decrease the number of partitions, and in most cases it does not trigger a shuffle.
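A rough sketch of the difference (assuming an existing SparkSession named spark):

val df = spark.range(1000000).toDF("id")

// repartition can go up or down in partition count, but always performs a full shuffle
val wider = df.repartition(200)   // 200 partitions, data shuffled across the cluster
val fewer = df.repartition(50)    // 50 partitions, also shuffled

// coalesce only goes down, merging existing partitions without a full shuffle
val merged = df.coalesce(50)      // 50 partitions, typically no shuffle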
The number of executors for a Spark application can be specified inside the SparkConf or via the flag --num-executors on the command line. Cluster manager: an external service for acquiring resources on the cluster (e.g. the standalone manager, Mesos, YARN).
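As a sketch, the SparkConf counterpart of --num-executors is the spark.executor.instances property (this applies when dynamic allocation is not in use):

import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

// spark.executor.instances is the SparkConf equivalent of --num-executors
val conf = new SparkConf()
  .setAppName("executor-config-example")
  .set("spark.executor.instances", "12")
  .set("spark.executor.cores", "5")

val spark = SparkSession.builder().config(conf).getOrCreate()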
Say I have a list of input files (assume 1000) which I want to process or write somewhere, and I want to use coalesce to reduce my partition count to 100.
In Spark, the default number of partitions for HDFS input equals the number of HDFS blocks; since coalesce(100) is specified, Spark will reduce the input data to 100 partitions.
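You can verify this yourself, roughly like the sketch below (assuming the input lives on HDFS):

val input = spark.read.text("hdfs:///data/input/*")
println(input.rdd.getNumPartitions)    // roughly one partition per HDFS block / input file

val reduced = input.coalesce(100)
println(reduced.rdd.getNumPartitions)  // 100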
Now I run this job with 12 executors with 5 cores per executor, which means 60 tasks can run at a time. Does that mean each of the tasks will work on one single partition independently?
As you might have passed:
--num-executors 12 : the number of executors to launch for the application.
--executor-cores 5 : the number of cores per executor; 1 core = 1 task at a time.
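So the number of tasks that can run concurrently is simply executors x cores per executor (assuming the default spark.task.cpus = 1); a tiny sketch of that arithmetic:

val numExecutors     = 12
val coresPerExecutor = 5
val taskCpus         = 1  // spark.task.cpus, defaults to 1

// maximum number of tasks (and therefore partitions) processed at the same time
val concurrentTasks = numExecutors * coresPerExecutor / taskCpus  // 60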
So the execution of partitions will go like this.
60 partitions will be processed first, by 12 executors running 5 tasks (threads) each, one task per partition.
The remaining 40 partitions will be picked up by whichever task slots free up first (roughly 8 executors' worth), so about 4 executors' worth of slots never receive a second partition.
NOTE: Usually, some executors complete their assigned work more quickly than others (depending on factors like data locality, network I/O, CPU, etc.). A free task slot will then pick up the next unprocessed partition, possibly after waiting for the configured scheduling/locality delay.
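The wait mentioned in the note is governed by the locality wait setting; a hedged sketch of adjusting it (spark.locality.wait is a real Spark setting, the value below is only an example):

import org.apache.spark.sql.SparkSession

// spark.locality.wait: how long the scheduler waits for a data-local task slot
// before falling back to a less-local one (default 3s; 1s here is just an example)
val spark = SparkSession.builder()
  .appName("locality-wait-example")
  .config("spark.locality.wait", "1s")
  .getOrCreate()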