 

What factors decide the number of executors in standalone mode?

Given a Spark application

  1. What factors decide the number of executors in standalone mode? For Mesos and YARN, according to the documentation, we can specify the number of executors/cores and the memory.

  2. Once a number of executors are started, does Spark start the tasks in a round-robin fashion, or is it smart enough to see whether some executors are idle/busy and schedule the tasks accordingly?

  3. Also, how does Spark decide on the number of tasks? I wrote a simple max-temperature program with a small dataset, and Spark spawned two tasks in a single executor. This was in Spark standalone mode.

asked Sep 19 '14 by Praveen Sripati




3 Answers

Answering your questions:

  1. Standalone mode uses the same configuration variables as the Mesos and YARN modes to set the number of executors. The variable spark.cores.max defines the maximum number of cores used in the Spark context. The default value is infinity, so Spark will use all the cores in the cluster. The spark.task.cpus variable defines how many CPUs Spark allocates for a single task; the default value is 1. Together, these two variables determine the maximum number of parallel tasks in your cluster (see the configuration sketch after this list).

  2. When you create an RDD subclass, you can define on which machines your tasks should run. This is done in the getPreferredLocations method. But, as the method signature suggests, this is only a preference, so if Spark detects that a machine is idle, it will launch the task on that machine. I don't know the exact mechanism Spark uses to determine which machines are idle, however. To achieve locality, we (Stratio) decided to make each partition smaller, so each task takes less time and locality is easier to achieve.

  3. The number of tasks of each Spark operation is defined by the length of the RDD's partitions array. This array is the result of the getPartitions method, which you have to override if you want to develop a new RDD subclass. This method describes how the RDD is split, where the data is, and the partitions themselves (see the RDD sketch at the end of this answer). When you combine two or more RDDs, the partition count of the result depends on the operation: union concatenates the parents' partitions (an RDD with 100 partitions unioned with one with 1000 yields 1100), while join typically uses the partition count of the largest parent (here 1000), unless an explicit partitioner or spark.default.parallelism dictates otherwise. Note that a high number of partitions is not necessarily synonymous with more data.
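
To illustrate point 1, here is a minimal sketch of setting both properties on a SparkConf; the master URL and application name are placeholders:

  import org.apache.spark.{SparkConf, SparkContext}

  // Minimal sketch: the master URL and application name are placeholders.
  val conf = new SparkConf()
    .setMaster("spark://master:7077")     // standalone cluster manager
    .setAppName("MaxTemperature")
    .set("spark.cores.max", "8")          // cap on the total cores taken by this application
    .set("spark.task.cpus", "1")          // cores reserved per task (the default)

  val sc = new SparkContext(conf)
  // With these values, at most 8 / 1 = 8 tasks can run in parallel across the cluster.

The same properties can also be set at submit time, e.g. --conf spark.cores.max=8 with spark-submit.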

I hope this will help.
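
As a purely hypothetical illustration of points 2 and 3: a minimal RDD subclass only needs getPartitions (one partition = one task) and compute, plus an optional getPreferredLocations hint. The class, partition type, and host names below are invented for this sketch:

  import org.apache.spark.{Partition, SparkContext, TaskContext}
  import org.apache.spark.rdd.RDD

  // Hypothetical partition type: each partition covers a range of integers.
  class RangePartition(override val index: Int, val start: Int, val end: Int)
    extends Partition

  // Hypothetical RDD producing the numbers 0 until n, split into numSlices partitions.
  class RangeRDD(sc: SparkContext, n: Int, numSlices: Int)
    extends RDD[Int](sc, Nil) {

    // One entry here == one task when this RDD is computed (point 3).
    override def getPartitions: Array[Partition] =
      Array.tabulate[Partition](numSlices) { i =>
        new RangePartition(i, i * n / numSlices, (i + 1) * n / numSlices)
      }

    override def compute(split: Partition, context: TaskContext): Iterator[Int] = {
      val p = split.asInstanceOf[RangePartition]
      (p.start until p.end).iterator
    }

    // Only a preference (point 2): Spark may still run the task on another, idle
    // executor. The host names are invented for the sketch.
    override def getPreferredLocations(split: Partition): Seq[String] =
      Seq(s"worker-${split.index % 2}.example.com")
  }

The scheduler will wait up to spark.locality.wait for a slot on a preferred host before falling back to any free executor.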

answered Oct 17 '22 by jlopezmat


I agree with @jlopezmat about how Spark chooses its configuration. With respect to your test code, you are seeing two tasks due to the way textFile is implemented. From SparkContext.scala:

  /**
   * Read a text file from HDFS, a local file system (available on all nodes), or any
   * Hadoop-supported file system URI, and return it as an RDD of Strings.
   */
  def textFile(path: String, minPartitions: Int = defaultMinPartitions): RDD[String] = {
    hadoopFile(path, classOf[TextInputFormat], classOf[LongWritable], classOf[Text],
      minPartitions).map(pair => pair._2.toString)
  }

and if we check the value of defaultMinPartitions:

  /** Default min number of partitions for Hadoop RDDs when not given by user */
  def defaultMinPartitions: Int = math.min(defaultParallelism, 2)
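
So with the default settings and a small input file, textFile produces just two partitions, and each partition becomes one task, which matches the two tasks observed in the question. If you want more input parallelism, you can pass minPartitions explicitly; a small sketch, assuming sc is your SparkContext and with a placeholder path:

  // Placeholder path; ask for at least 8 input splits instead of the default 2.
  val temps = sc.textFile("hdfs:///data/temperatures.txt", 8)
  println(temps.partitions.length)   // typically 8 or more, depending on split boundaries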

answered Oct 17 '22 by Daniel H.


Spark chooses the number of tasks based on the number of partitions in the original data set. If you are using HDFS as your data source, then the number of partitions will, by default, be equal to the number of HDFS blocks. You can change the number of partitions in several ways; the top two are passing an extra argument to the SparkContext.textFile method and calling the RDD.repartition method, as sketched below.
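
A brief sketch of both techniques, assuming sc is your SparkContext and using a placeholder path:

  // Ask textFile for at least 16 partitions up front...
  val lines = sc.textFile("hdfs:///data/input.txt", 16)

  // ...or reshuffle an existing RDD into exactly 4 partitions afterwards.
  val fewer = lines.repartition(4)
  println(fewer.partitions.length)   // 4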

answered Oct 17 '22 by David