I remember reading somewhere that Hadoop's performance deteriorates significantly if the machines it runs on are very different from one another, but I can't seem to find that comment anymore. I am considering running a Hadoop cluster on an array of VMs that is not directly managed by my group, and I need to know if this is a requirement that I should put in my request.
So, should I insist on all of my machines having identical hardware, or is it okay to run on different machines in different hardware configurations?
Thanks.
The ideal setup for running Hadoop is machines with a dual-core configuration (physical cores, preferably) and 4 GB to 8 GB of ECC memory per server/node. Good memory specifications matter because HDFS's smooth operation relies heavily on memory efficiency and robustness.
HDFS is a distributed file system (or distributed storage) that runs on commodity hardware and can manage massive amounts of data. You may extend a Hadoop cluster to hundreds or thousands of nodes using HDFS.
Hadoop clusters are composed of a network of master and worker nodes that orchestrate and execute the various jobs across the Hadoop distributed file system. The master nodes typically utilize higher quality hardware and include a NameNode, Secondary NameNode, and JobTracker, with each running on a separate machine.
The following paper describes how a heterogeneous cluster affects the performance of Hadoop MapReduce:
In a heterogeneous cluster, the computing capacities of nodes may vary significantly. A high-speed node can finish processing data stored in its local disk faster than low-speed counterparts. After a fast node completes the processing of its local input data, it must support load sharing by handling unprocessed data located in one or more remote slow nodes. When the amount of data transferred due to load sharing is very large, the overhead of moving unprocessed data from slow nodes to fast nodes becomes a critical issue affecting Hadoop's performance.
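The effect described above can be illustrated with a toy model. The sketch below (all node speeds and data sizes are made-up numbers, not measurements from any real cluster) compares the map-phase makespan when data is spread evenly across nodes versus proportionally to node speed; with an even spread, the slowest node dominates the job, which is exactly the straggler problem the paper discusses:

```python
# Toy model of map-phase makespan on a heterogeneous 3-node cluster.
# Node speeds and data sizes are hypothetical, for illustration only;
# the model ignores network transfer and speculative execution.

def makespan(data_per_node, speeds):
    """Time until the slowest node finishes its local data."""
    return max(d / s for d, s in zip(data_per_node, speeds))

speeds = [4.0, 2.0, 1.0]   # relative processing rates: fast, medium, slow
total_data = 700.0         # arbitrary units, e.g. GB

# Naive placement: an equal share of blocks on every node.
equal = [total_data / len(speeds)] * len(speeds)

# Capacity-aware placement: data proportional to node speed, so all
# nodes finish together and no data needs shipping between nodes.
proportional = [total_data * s / sum(speeds) for s in speeds]

print(makespan(equal, speeds))         # ~233.3: the slow node dominates
print(makespan(proportional, speeds))  # 100.0: balanced finish times
```

The gap between the two numbers is the performance penalty that load sharing (moving data from slow to fast nodes at run time) tries to claw back, at the cost of extra network traffic.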
The following references have more details:
They also describe ways to improve performance on a heterogeneous cluster, or to avoid this performance penalty.
It is widely suggested that you use homogeneous machines in your cluster, but if your machines do not have wildly different specifications and performance, you can go ahead and build the cluster with them.
For production systems, you should request homogeneous machines. For development, performance is not critical.
However, you should benchmark your Hadoop cluster once you have built it.
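Once you have per-node throughput numbers from a benchmark such as Hadoop's bundled TestDFSIO or TeraSort, a quick way to quantify how heterogeneous the cluster really is is the coefficient of variation across nodes. The snippet below is a sketch with hypothetical throughput figures; substitute your own measurements:

```python
# Quantify cluster heterogeneity from per-node benchmark results.
# The throughput values below are hypothetical placeholders; replace
# them with your own measurements (e.g. from TestDFSIO runs).
from statistics import mean, pstdev

throughput_mb_s = {
    "node1": 210.0,
    "node2": 195.0,
    "node3": 90.0,   # a noticeably slower node
}

values = list(throughput_mb_s.values())
cv = pstdev(values) / mean(values)  # 0 means identical nodes

print(f"coefficient of variation: {cv:.2f}")
for node, t in throughput_mb_s.items():
    if t < 0.5 * max(values):
        print(f"{node} is less than half as fast as the best node")
```

A high coefficient of variation (or any node running at a fraction of the best node's speed) is a signal that the straggler effects described above are likely to show up in real jobs.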
A homogenous cluster is certainly ideal, but it's not strictly necessary. Yahoo!, Inc., for example, runs heterogeneous clusters in their production environments. From talking with researchers there, they find that there is a performance hit due to scheduling issues (a big enough hit that they're working hard to add performance-aware scheduling to their tools), but the penalty is not crippling.