
Apache Hadoop Yarn - Underutilization of cores

No matter how much I tinker with the settings in yarn-site.xml, i.e. using all of the options below:

yarn.scheduler.minimum-allocation-vcores
yarn.nodemanager.resource.memory-mb
yarn.nodemanager.resource.cpu-vcores
yarn.scheduler.maximum-allocation-mb
yarn.scheduler.maximum-allocation-vcores

I still cannot get my application, i.e. Spark, to utilize all the cores on the cluster. The Spark executors correctly take up all the available memory, but each executor keeps taking only a single core and no more.
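For reference, here is roughly how those properties look in yarn-site.xml. The values below are only illustrative placeholders (they depend on the node hardware), not the actual values from this cluster.

<!-- illustrative values only -->
<property>
    <name>yarn.nodemanager.resource.memory-mb</name>
    <value>163840</value>   <!-- total memory a NodeManager may hand out -->
</property>
<property>
    <name>yarn.nodemanager.resource.cpu-vcores</name>
    <value>24</value>       <!-- total vcores a NodeManager may hand out -->
</property>
<property>
    <name>yarn.scheduler.minimum-allocation-vcores</name>
    <value>1</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-vcores</name>
    <value>4</value>
</property>
<property>
    <name>yarn.scheduler.maximum-allocation-mb</name>
    <value>6144</value>
</property>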

Here are the options configured in spark-defaults.conf

spark.executor.cores                    3
spark.executor.memory                   5100m
spark.yarn.executor.memoryOverhead      800
spark.driver.memory                     2g
spark.yarn.driver.memoryOverhead        400
spark.executor.instances                28
spark.reducer.maxMbInFlight             120
spark.shuffle.file.buffer.kb            200

Notice that spark.executor.cores is set to 3, but it has no effect. How do I fix this?

asked Apr 30 '15 by Abbas Gadhia

People also ask

What are cores in Hadoop?

HDFS (storage) and YARN (processing) are the two core components of Apache Hadoop.

How does YARN work in Hadoop?

YARN is the main component of Hadoop v2.0. It opens up Hadoop by allowing data stored in HDFS to be processed not only as batch jobs but also with stream, interactive, and graph processing engines. In this way, it lets different types of distributed applications beyond MapReduce run on the cluster.

What is YARN ResourceManager?

As previously described, ResourceManager (RM) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system.

What is Apache YARN explain its functioning?

YARN extends the power of Hadoop to new technologies found within the data center so that you can take advantage of cost-effective linear-scale storage and processing. It provides independent software vendors and developers a consistent framework for writing data access applications that run in Hadoop.


1 Answer

The problem lies not with yarn-site.xml or spark-defaults.conf but with the resource calculator that assigns cores to the executors (or, in the case of MapReduce jobs, to the Mappers/Reducers).

The default resource calculator, i.e. org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, uses only memory information when allocating containers; CPU scheduling is not enabled by default. To take both memory and CPU into account, the resource calculator needs to be changed to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator in the capacity-scheduler.xml file.

Here's what needs to change.

<property>
    <name>yarn.scheduler.capacity.resource-calculator</name>
    <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
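As a side note, capacity-scheduler.xml is read by the ResourceManager, so the ResourceManager typically needs to be restarted for the change to take effect. Once DominantResourceCalculator is in use, each executor container is allocated the number of vcores it requests, so with the spark-defaults.conf shown in the question (28 executors x 3 cores each) the executors alone should occupy 84 vcores, plus whatever the driver/ApplicationMaster requests.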
answered Oct 08 '22 by Abbas Gadhia