
Hadoop on EC2 vs Elastic Map Reduce

I'm trying to evaluate the differences between these two options. Here are some pros and cons I can think of:

Elastic Map Reduce => Better support from Amazon, No need to administer cluster, More Expensive (?)

EC2 + Hadoop => More control of your hadoop configuration, Cheaper (?)

I'm wondering if anyone has benchmarked the performance of EC2 + Hadoop vis-à-vis EMR. Is there any significant difference in cost for large cluster deployments? What other differences exist?
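To make the cost question concrete, here is a rough sketch of the comparison, assuming EMR is billed as the underlying EC2 instance price plus a per-instance-hour EMR fee. The function name `cluster_cost` and both rates are hypothetical placeholders, not real AWS prices.

```python
def cluster_cost(nodes, hours, ec2_rate, emr_fee=0.0):
    """Return the total cost of running `nodes` instances for `hours` hours.

    ec2_rate: hourly EC2 price per instance (hypothetical, check current pricing)
    emr_fee:  hourly EMR surcharge per instance; 0.0 models self-managed Hadoop
    """
    return nodes * hours * (ec2_rate + emr_fee)


# Hypothetical rates, for illustration only; these are NOT real AWS prices.
EC2_RATE = 0.25
EMR_FEE = 0.0625

plain_ec2 = cluster_cost(nodes=100, hours=24, ec2_rate=EC2_RATE)
with_emr = cluster_cost(nodes=100, hours=24, ec2_rate=EC2_RATE, emr_fee=EMR_FEE)

print(f"EC2 + Hadoop: ${plain_ec2:.2f}")
print(f"EMR:          ${with_emr:.2f}")
print(f"EMR premium:  {100 * (with_emr / plain_ec2 - 1):.0f}%")
```

This only covers instance charges. It leaves out the engineering time spent administering a self-managed cluster, which may well outweigh the surcharge.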

OckhamsRazor asked Mar 02 '13 18:03



2 Answers

We use both approaches (EMR and EC2) at my job.

The advantages of EMR that Amar mentioned are more or less true, so if you want simplicity it may be the way to go.

But there are other considerations:

  • The Hadoop version on EMR is far behind Apache head. It is approximately 0.20.205, whereas head is at 2.x, which is essentially three versions up (1.0, 1.1, 2.0, ...).

hadoop@domU-12-31-39-07-B9-97:~$ ll hadoop*.jar
lrwxrwxrwx 1 hadoop hadoop 73 Feb 5 12:00 hadoop-examples-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-examples-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 69 Feb 5 12:00 hadoop-test-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-test-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 69 Feb 5 12:00 hadoop-core-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-core-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 70 Feb 5 12:00 hadoop-tools-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-tools-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 68 Feb 5 12:00 hadoop-ant-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-ant-0.20.205.jar

  • As a direct consequence, I had to re-code and restructure my Map/Reduce program because of contrib modules missing from the older version running on EMR.

  • You do not have as much opportunity to use non-Map/Reduce algorithms as you would with an updated version of M/R.

  • On EC2 you have the flexibility to mix and match versions of the Hadoop ecosystem components.
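If your code depends on modules that only exist in newer releases, one way to find out early is to read the version from the jar names in a listing like the one above. This is only a sketch: the helper `hadoop_version` is a made-up name, and it assumes the jar file names end in `-<version>.jar` as they do in that listing.

```python
import re


def hadoop_version(jar_name):
    """Extract the numeric version tuple from a Hadoop jar file name."""
    match = re.search(r"-(\d+(?:\.\d+)*)\.jar$", jar_name)
    if match is None:
        raise ValueError(f"no version found in {jar_name!r}")
    return tuple(int(part) for part in match.group(1).split("."))


emr_version = hadoop_version("hadoop-core-0.20.205.jar")
print(emr_version)           # (0, 20, 205)
print(emr_version < (2, 0))  # True: older than the 2.x line
```

Comparing tuples rather than strings avoids the usual trap where "0.20.205" sorts incorrectly against something like "0.9".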

WestCoastProjects answered Oct 12 '22 04:10



Well, administering, monitoring and maintaining a cluster isn't a small task in itself. With EMR you can get machines configured and up and running with your custom bootstrap code in no time. Apart from all this, EMR provides a lot of other tools, options and facilities too.

With EMR you don't have to worry about terminating the cluster after the jobs are done. You can certainly implement this yourself in the EC2+Hadoop setup, but EMR does it for you in a neat way.
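For the EC2+Hadoop case, a minimal sketch of doing the termination yourself could look like this. The names `run_then_terminate`, `fake_job` and `fake_terminate` are hypothetical stand-ins. In a real setup the first callable would submit your Hadoop job and the second would call whatever you use to stop the instances (for example a wrapper around boto3's `terminate_instances`).

```python
def run_then_terminate(run_job, terminate_cluster):
    """Run a job and always terminate the cluster afterwards, even on failure."""
    try:
        return run_job()
    finally:
        # Runs whether run_job() returned normally or raised an exception.
        terminate_cluster()


# Stand-ins for illustration only; they record what happened instead of
# talking to Hadoop or EC2.
events = []


def fake_job():
    events.append("job finished")
    return 0


def fake_terminate():
    events.append("cluster terminated")


exit_code = run_then_terminate(fake_job, fake_terminate)
print(exit_code, events)
```

The `try`/`finally` is the important part: without it, a failed job leaves the cluster running and billing.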

You can also resize the cluster even while your jobs are running!

The Pig and Hive versions available on EMR also contain patches that make it easier to work with files in S3.

Even here in this answer you may find that EMR has been given an upper hand.

Amar answered Oct 12 '22 05:10
