
Should hadoop clusters run on identical hardware?

Tags:

hadoop

I remember reading somewhere that Hadoop's performance deteriorates significantly if the machines it runs on are very different from one another, but I can't seem to find that comment anymore. I am considering running a Hadoop cluster on an array of VMs that is not directly managed by my group, and I need to know if this is a requirement that I should put in my request.

So, should I insist on all of my machines having identical hardware, or is it okay to run on different machines in different hardware configurations?

Thanks.

asked Jun 25 '12 by ILikeFood

People also ask

What is the best hardware configuration to run Hadoop?

The ideal setup for running Hadoop is machines with a dual-core configuration (physical cores, preferably) and 4GB to 8GB of ECC memory per server/node. Good memory specifications matter because HDFS's smooth operation relies heavily on memory efficiency and robustness.

What type of hardware is used for data nodes in HDFS?

HDFS is a distributed file system (or distributed storage) that runs on commodity hardware and can manage massive amounts of data. You may extend a Hadoop cluster to hundreds or thousands of nodes using HDFS.

How do Hadoop clusters work?

Hadoop clusters are composed of a network of master and worker nodes that orchestrate and execute the various jobs across the Hadoop distributed file system. The master nodes typically utilize higher quality hardware and include a NameNode, Secondary NameNode, and JobTracker, with each running on a separate machine.


2 Answers

The following papers describe how a heterogeneous cluster affects the performance of Hadoop MapReduce:

In a heterogeneous cluster, the computing capacities of nodes may vary significantly. A high-speed node can finish processing data stored in a local disk of the node faster than low-speed counterparts. After a fast node completes the processing of its local input data, the node must support load sharing by handling unprocessed data located in one or more remote slow nodes. When the amount of transferred data due to load sharing is very large, the overhead of moving unprocessed data from slow nodes to fast nodes becomes a critical issue affecting Hadoop's performance.
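To get a feel for how much of this remote reading happens on a given cluster, one rough check (a sketch added here for illustration, using the Hadoop 2.x `mapreduce` API; the `LocalityReport` class name is made up) is to read a finished job's locality counters:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.JobCounter;

public class LocalityReport {
    // Assumes `job` has already been configured and run,
    // e.g. via job.waitForCompletion(true).
    public static void print(Job job) throws Exception {
        long dataLocal = job.getCounters()
                .findCounter(JobCounter.DATA_LOCAL_MAPS).getValue();
        long rackLocal = job.getCounters()
                .findCounter(JobCounter.RACK_LOCAL_MAPS).getValue();
        long totalMaps = job.getCounters()
                .findCounter(JobCounter.TOTAL_LAUNCHED_MAPS).getValue();

        // Maps that were neither node-local nor rack-local had to read
        // their input over the network -- the overhead described above.
        // This is only approximate, since re-launched and speculative
        // attempts also count toward the total.
        System.out.printf("node-local: %d, rack-local: %d, other: %d%n",
                dataLocal, rackLocal, totalMaps - dataLocal - rackLocal);
    }
}
```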

The following references have more details:

  1. http://computerresearch.org/stpr/index.php/gjcst/article/view/749/658
  2. http://www.usenix.org/event/osdi08/tech/full_papers/zaharia/zaharia.pdf

These papers also describe ways to improve performance on a heterogeneous cluster, or to avoid this penalty altogether.
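The second reference, for example, is about making speculative execution (re-running straggler tasks on other, ideally faster, nodes) behave well in heterogeneous environments. As a minimal sketch of the knob itself, here is how you could toggle it with the Hadoop 2.x MapReduce API (the class and job name are made up for illustration; in Hadoop 1.x the property names were mapred.map.tasks.speculative.execution and mapred.reduce.tasks.speculative.execution):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class SpeculativeConfig {
    public static Job configuredJob() throws Exception {
        Configuration conf = new Configuration();
        // Re-run straggling tasks on other (ideally faster) nodes.
        // Turn these off if duplicate attempts waste more capacity
        // than they recover on your particular cluster.
        conf.setBoolean("mapreduce.map.speculative", true);
        conf.setBoolean("mapreduce.reduce.speculative", true);

        Job job = Job.getInstance(conf, "heterogeneity-test");
        // ... set mapper, reducer, input and output paths as usual ...
        return job;
    }
}
```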

Homogeneous machines are the sensible recommendation for a cluster, but if your machines do not differ wildly in specifications and performance, you should go ahead and build it.

For production systems, you should push for homogeneous machines. For development, performance is not critical.

In any case, you should benchmark your Hadoop cluster after you have built it.
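As a rough sketch of such a benchmark (not part of the original answer; the `CrudeBenchmark` class is made up), you could time a representative job, for instance the TeraSort example that ships with hadoop-mapreduce-examples, and compare runs across configurations:

```java
import org.apache.hadoop.mapreduce.Job;

public class CrudeBenchmark {
    // Runs an already-configured job and reports wall-clock time.
    // `job` is assumed to be a representative workload.
    public static void run(Job job) throws Exception {
        long start = System.currentTimeMillis();
        boolean ok = job.waitForCompletion(true); // blocks until the job finishes
        long elapsed = System.currentTimeMillis() - start;
        System.out.printf("job %s: success=%b, wall-clock=%d ms%n",
                job.getJobName(), ok, elapsed);
    }
}
```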

answered Sep 20 '22 by pyfunc


A homogenous cluster is certainly ideal, but it's not strictly necessary. Yahoo!, Inc., for example, runs heterogeneous clusters in their production environments. From talking with researchers there, they find that there is a performance hit due to scheduling issues (a big enough hit that they're working hard to add performance-aware scheduling to their tools), but the penalty is not crippling.

answered Sep 23 '22 by s3cur3