No matter how much I tinker with the settings in yarn-site.xml, i.e. using all of the below options:

yarn.scheduler.minimum-allocation-vcores
yarn.nodemanager.resource.memory-mb
yarn.nodemanager.resource.cpu-vcores
yarn.scheduler.maximum-allocation-mb
yarn.scheduler.maximum-allocation-vcores

I still cannot get my application, i.e. Spark, to utilize all the cores on the cluster. The Spark executors seem to be taking up all the available memory correctly, but each executor keeps taking only a single core and no more.
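For reference, this is roughly how those properties are laid out in yarn-site.xml; the values below are placeholders for illustration, not my actual cluster settings:

<!-- Sketch only: values are placeholders, not the real cluster configuration -->
<property>
  <name>yarn.nodemanager.resource.memory-mb</name>
  <value>57344</value>
</property>
<property>
  <name>yarn.nodemanager.resource.cpu-vcores</name>
  <value>16</value>
</property>
<property>
  <name>yarn.scheduler.minimum-allocation-vcores</name>
  <value>1</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-mb</name>
  <value>57344</value>
</property>
<property>
  <name>yarn.scheduler.maximum-allocation-vcores</name>
  <value>16</value>
</property>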
Here are the options configured in spark-defaults.conf
spark.executor.cores                 3
spark.executor.memory                5100m
spark.yarn.executor.memoryOverhead   800
spark.driver.memory                  2g
spark.yarn.driver.memoryOverhead     400
spark.executor.instances             28
spark.reducer.maxMbInFlight          120
spark.shuffle.file.buffer.kb         200
Notice that spark.executor.cores is set to 3, but it doesn't work. How do I fix this?
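For completeness, the same settings expressed as spark-submit flags would look something like the following. This is only a sketch: the application jar is a placeholder, and flag spellings can differ slightly between Spark versions.

spark-submit \
  --master yarn \
  --executor-cores 3 \
  --executor-memory 5100m \
  --num-executors 28 \
  --driver-memory 2g \
  --conf spark.yarn.executor.memoryOverhead=800 \
  --conf spark.yarn.driver.memoryOverhead=400 \
  my-app.jar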
HDFS (storage) and YARN (processing) are the two core components of Apache Hadoop.
YARN is the main component of Hadoop v2.0. YARN opens up Hadoop by allowing data stored in HDFS to be processed not only by batch jobs but also by stream, interactive, and graph processing workloads. In this way, it supports running many types of distributed applications beyond MapReduce.
The ResourceManager (RM) is the master that arbitrates all the available cluster resources and thus helps manage the distributed applications running on the YARN system.
YARN extends the power of Hadoop to new technologies found within the data center so that you can take advantage of cost-effective linear-scale storage and processing. It provides independent software vendors and developers a consistent framework for writing data access applications that run in Hadoop.
The problem lies not with yarn-site.xml or spark-defaults.conf, but with the resource calculator that assigns cores to the executors (or, in the case of MapReduce jobs, to the mappers/reducers).
The default resource calculator, i.e. org.apache.hadoop.yarn.util.resource.DefaultResourceCalculator, uses only memory information when allocating containers; CPU scheduling is not enabled by default. To take both memory and CPU into account, the resource calculator needs to be changed to org.apache.hadoop.yarn.util.resource.DominantResourceCalculator in the capacity-scheduler.xml file.
Here's what needs to change.
<property>
  <name>yarn.scheduler.capacity.resource-calculator</name>
  <value>org.apache.hadoop.yarn.util.resource.DominantResourceCalculator</value>
</property>
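After making that change, the scheduler configuration has to be reloaded before it takes effect. Here is a minimal sketch of the follow-up steps; the exact procedure can vary by Hadoop version and distribution, and some setups require a full ResourceManager restart rather than a refresh:

# Reload the capacity scheduler configuration (or restart the ResourceManager)
yarn rmadmin -refreshQueues

# Resubmit the Spark job, then confirm in the ResourceManager web UI (or the
# Spark executors page) that each container is now allocated more than 1 vcore.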