I'm working on setting up a Hadoop cluster where the nodes are all fairly heterogeneous, i.e. they each have a different number of cores. Currently I have to manually edit the mapred-site.xml on each node to fill in {cores}:
<property>
<name>mapred.tasktracker.map.tasks.maximum</name>
<value>{cores}</value>
</property>
Is there an easier way to do this when I add new nodes? Most of the other values are defaults, and the maximum number of map tasks is the only thing that changes from node to node.
If you're comfortable with some scripting, the following will give you the number of 'processors' on each machine (which means different things on different architectures, but is more or less what you want):
grep -c '^processor' /proc/cpuinfo
Then you can use sed or some equivalent to update your mapred-site.xml file according to the output of this.
So putting this all together:
CORES=$(grep -c '^processor' /proc/cpuinfo)
sed -i "s/{cores}/$CORES/g" mapred-site.xml
As a footnote, you probably don't want to set both the number of mappers and the number of reducers to the number of cores. More likely you want to split the cores between the two types, and keep a core spare for the data node, task tracker, etc.
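To make that footnote concrete, here is a rough sketch of one way to split the cores. The helper name split_slots, the even split between map and reduce slots, the {map_slots} and {reduce_slots} placeholders, and the mapred-site.xml.template file are all assumptions made for illustration, not anything Hadoop prescribes. Tune the ratio to your workload.

```shell
#!/bin/sh
# Hypothetical helper: reserve one core for the data node / task tracker,
# then split the rest roughly evenly between map and reduce slots.
# Results are left in MAP_SLOTS and REDUCE_SLOTS.
split_slots() {
    cores=$1
    usable=$((cores - 1))
    # Never go below one usable core, even on a single-core machine.
    if [ "$usable" -lt 1 ]; then usable=1; fi
    MAP_SLOTS=$(( (usable + 1) / 2 ))   # map slots get the odd core, if any
    REDUCE_SLOTS=$(( usable / 2 ))
    if [ "$REDUCE_SLOTS" -lt 1 ]; then REDUCE_SLOTS=1; fi
}

# Count processors as above; fall back to 1 if /proc/cpuinfo is unavailable.
CORES=$(grep -c '^processor' /proc/cpuinfo 2>/dev/null || true)
split_slots "${CORES:-1}"
echo "cores=${CORES:-1} map=$MAP_SLOTS reduce=$REDUCE_SLOTS"

# Assumed layout: a template holding {map_slots} and {reduce_slots}
# placeholders sits next to the real config. Writing from a template
# (rather than sed -i) keeps the script re-runnable.
if [ -f mapred-site.xml.template ]; then
    sed -e "s/{map_slots}/$MAP_SLOTS/g" \
        -e "s/{reduce_slots}/$REDUCE_SLOTS/g" \
        mapred-site.xml.template > mapred-site.xml
fi
```

In the template, {map_slots} would go in mapred.tasktracker.map.tasks.maximum and {reduce_slots} in mapred.tasktracker.reduce.tasks.maximum. With this split, an 8-core node reserves one core and gets 4 map slots and 3 reduce slots.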