I remember reading somewhere that Hadoop's performance deteriorates significantly if the machines it runs on are very different from one another, but I can't seem to find that comment anymore. I am considering running a Hadoop cluster on an array of VMs that is not directly managed by my group, and I need to know if this is a requirement that I should put in my request.
So, should I insist on all of my machines having identical hardware, or is it okay to run on different machines in different hardware configurations?
Thanks.
The ideal setup for running Hadoop is machines with a dual-core configuration (physical cores, preferably) and 4 GB to 8 GB of ECC memory per server/node. Good memory specifications matter because HDFS's smooth operation relies heavily on memory efficiency and robustness.
HDFS is a distributed file system (or distributed storage) that runs on commodity hardware and can manage massive amounts of data. You may extend a Hadoop cluster to hundreds or thousands of nodes using HDFS.
Hadoop clusters are composed of a network of master and worker nodes that orchestrate and execute the various jobs across the Hadoop distributed file system. The master nodes typically utilize higher quality hardware and include a NameNode, Secondary NameNode, and JobTracker, with each running on a separate machine.
The following paper describes how a heterogeneous cluster affects the performance of Hadoop MapReduce:
In a heterogeneous cluster, the computing capacities of nodes may vary significantly. A high-speed node can finish processing data stored in its local disk faster than low-speed counterparts. After a fast node completes the processing of its local input data, it must support load sharing by handling unprocessed data located in one or more remote slow nodes. When the amount of data transferred due to load sharing is very large, the overhead of moving unprocessed data from slow nodes to fast nodes becomes a critical issue affecting Hadoop's performance.
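The effect described above can be illustrated with a toy model. The sketch below (all node speeds and data sizes are made-up numbers, not measurements from any real cluster) compares the map-phase makespan when data is spread evenly across nodes versus proportionally to node speed; with an even spread, the slowest node dominates the job, which is exactly the straggler problem the paper discusses:

```python
# Toy model of map-phase makespan on a heterogeneous 3-node cluster.
# Node speeds and data sizes are hypothetical, for illustration only;
# the model ignores network transfer and speculative execution.

def makespan(data_per_node, speeds):
    """Time until the slowest node finishes its local data."""
    return max(d / s for d, s in zip(data_per_node, speeds))

speeds = [4.0, 2.0, 1.0]   # relative processing rates: fast, medium, slow
total_data = 700.0         # arbitrary units, e.g. GB

# Naive placement: an equal share of blocks on every node.
equal = [total_data / len(speeds)] * len(speeds)

# Capacity-aware placement: data proportional to node speed, so all
# nodes finish together and no data needs shipping between nodes.
proportional = [total_data * s / sum(speeds) for s in speeds]

print(makespan(equal, speeds))         # ~233.3: the slow node dominates
print(makespan(proportional, speeds))  # 100.0: balanced finish times
```

The gap between the two numbers is the performance penalty that load sharing (moving data from slow to fast nodes at run time) tries to claw back, at the cost of extra network traffic.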
The following references have more details:
They also describe ways to improve performance on a heterogeneous cluster, or to avoid this performance penalty.
It is widely suggested that you use homogeneous machines in your cluster, but if your machines do not have wildly different specifications and performance, you can go ahead and build the cluster with them.
For production systems, you should request homogeneous machines. For development, performance is not critical.
However, you should benchmark your Hadoop cluster once you have built it.
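Once you have per-node throughput numbers from a benchmark such as Hadoop's bundled TestDFSIO or TeraSort, a quick way to quantify how heterogeneous the cluster really is is the coefficient of variation across nodes. The snippet below is a sketch with hypothetical throughput figures; substitute your own measurements:

```python
# Quantify cluster heterogeneity from per-node benchmark results.
# The throughput values below are hypothetical placeholders; replace
# them with your own measurements (e.g. from TestDFSIO runs).
from statistics import mean, pstdev

throughput_mb_s = {
    "node1": 210.0,
    "node2": 195.0,
    "node3": 90.0,   # a noticeably slower node
}

values = list(throughput_mb_s.values())
cv = pstdev(values) / mean(values)  # 0 means identical nodes

print(f"coefficient of variation: {cv:.2f}")
for node, t in throughput_mb_s.items():
    if t < 0.5 * max(values):
        print(f"{node} is less than half as fast as the best node")
```

A high coefficient of variation (or any node running at a fraction of the best node's speed) is a signal that the straggler effects described above are likely to show up in real jobs.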
A homogenous cluster is certainly ideal, but it's not strictly necessary. Yahoo!, Inc., for example, runs heterogeneous clusters in their production environments. From talking with researchers there, they find that there is a performance hit due to scheduling issues (a big enough hit that they're working hard to add performance-aware scheduling to their tools), but the penalty is not crippling.