
What's the meaning of "Locality Level"on Spark cluster

What is the meaning of the title "Locality Level", and of the 5 statuses data local --> process local --> node local --> rack local --> any?

[Screenshot of the Spark UI showing the Locality Level column]

asked Nov 18 '14 by fanhk


People also ask

How does data locality work in Spark?

Data locality in Spark helps the scheduler run compute or caching tasks on the machines where the data is already available. The concept comes from Hadoop MapReduce, where the location of data in HDFS is used to place map operations, avoiding data movement over the network.

What is Spark locality wait?

By default, spark.locality.wait is set to 3 seconds. Spark will wait this long to launch a task on an executor local to the data. If a data-local slot is still unavailable after this period, Spark will give up and launch the task at a less-local level.

What is node local and rack local?

NODE_LOCAL - data and processing are on the same node but in different executors. This level is slower than the previous one because the data has to be moved between processes. RACK_LOCAL - data is located on a different node than the processing, but both nodes are on the same rack.


2 Answers

As far as I know, the locality level indicates which type of access to the data has been performed. When a node finishes all its work and its CPU becomes idle, Spark may decide to start other pending tasks that require obtaining data from other places. So ideally, all your tasks should be process local, as that is associated with the lowest data access latency.
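To see which level each task actually ran at (beyond the Spark UI column), here is a minimal sketch using the standard SparkListener API; it assumes an existing SparkContext named sc:

    import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskStart}

    // Log the locality level of every task as it starts.
    sc.addSparkListener(new SparkListener {
      override def onTaskStart(taskStart: SparkListenerTaskStart): Unit = {
        val info = taskStart.taskInfo
        println(s"Task ${info.taskId} started at locality ${info.taskLocality}")
      }
    })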

You can configure the wait time before moving to other locality levels using:

spark.locality.wait 

More information about these parameters can be found in the Spark configuration docs.
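For example, here is a minimal sketch of setting these waits when building a SparkSession (the app name and durations are illustrative, not recommendations):

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("locality-wait-demo")
      // Global fallback wait before stepping down a locality level (default: 3s).
      .config("spark.locality.wait", "3s")
      // Per-level overrides for PROCESS_LOCAL, NODE_LOCAL, and RACK_LOCAL.
      .config("spark.locality.wait.process", "3s")
      .config("spark.locality.wait.node", "3s")
      .config("spark.locality.wait.rack", "3s")
      .getOrCreate()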

With respect to the different levels PROCESS_LOCAL, NODE_LOCAL, RACK_LOCAL, and ANY, I think the methods findTask and findSpeculativeTask in org.apache.spark.scheduler.TaskSetManager illustrate how Spark chooses tasks based on their locality level. It will first check for PROCESS_LOCAL tasks, which are launched in the same executor process as the data. If there are none, it will check for NODE_LOCAL tasks, whose data may be in another executor on the same node or may need to be retrieved from a node-local system such as HDFS or a cache. RACK_LOCAL means that the data is on another node in the same rack and therefore needs to be transferred over the network prior to execution. Finally, ANY just takes any pending task that may run on the current node.

    /**
     * Dequeue a pending task for a given node and return its index and locality level.
     * Only search for tasks matching the given locality constraint.
     */
    private def findTask(execId: String, host: String, locality: TaskLocality.Value)
      : Option[(Int, TaskLocality.Value)] =
    {
      for (index <- findTaskFromList(execId, getPendingTasksForExecutor(execId))) {
        return Some((index, TaskLocality.PROCESS_LOCAL))
      }

      if (TaskLocality.isAllowed(locality, TaskLocality.NODE_LOCAL)) {
        for (index <- findTaskFromList(execId, getPendingTasksForHost(host))) {
          return Some((index, TaskLocality.NODE_LOCAL))
        }
      }

      if (TaskLocality.isAllowed(locality, TaskLocality.RACK_LOCAL)) {
        for {
          rack <- sched.getRackForHost(host)
          index <- findTaskFromList(execId, getPendingTasksForRack(rack))
        } {
          return Some((index, TaskLocality.RACK_LOCAL))
        }
      }

      // Look for no-pref tasks after rack-local tasks since they can run anywhere.
      for (index <- findTaskFromList(execId, pendingTasksWithNoPrefs)) {
        return Some((index, TaskLocality.PROCESS_LOCAL))
      }

      if (TaskLocality.isAllowed(locality, TaskLocality.ANY)) {
        for (index <- findTaskFromList(execId, allPendingTasks)) {
          return Some((index, TaskLocality.ANY))
        }
      }

      // Finally, if all else has failed, find a speculative task
      findSpeculativeTask(execId, host, locality)
    }
answered Oct 03 '22 by Daniel H.


Here are my two cents, summarized mostly from the official Spark guide.

First, I want to add one more locality level, NO_PREF, which has been discussed in this thread.
Then let's put those levels together in a single table:

    PROCESS_LOCAL  data is in the same JVM as the running code (best)
    NODE_LOCAL     data is on the same node, e.g. in HDFS or in another executor on the node
    NO_PREF        data is accessed equally quickly from anywhere; no locality preference
    RACK_LOCAL     data is on the same rack and travels over the rack switch
    ANY            data is elsewhere on the network, not on the same rack (worst)

Note that a specific level can be skipped, following the guidance in the Spark configuration docs.

For instance, if you want to skip NODE_LOCAL, just set spark.locality.wait.node to 0.
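As a minimal sketch (the app name is illustrative), that looks like:

    import org.apache.spark.SparkConf

    // Disable the NODE_LOCAL wait so the scheduler falls through to the
    // next locality level immediately instead of waiting for a node-local slot.
    val conf = new SparkConf()
      .setAppName("skip-node-local") // hypothetical app name
      .set("spark.locality.wait.node", "0")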

answered Oct 03 '22 by Eugene