I'm trying to evaluate the differences between these two options. Here are some pros and cons I can think of:
Elastic Map Reduce => Better support from Amazon, no need to administer the cluster, more expensive (?)
EC2 + Hadoop => More control over your Hadoop configuration, cheaper (?)
I'm wondering if anyone might have benchmarked the performance of EC2 + Hadoop vis a vis EMR? Is there any significant difference in cost for large cluster deployments? What other differences exist?
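On the cost question, one rough way to frame it: EMR bills a per-instance-hour service fee on top of the underlying EC2 price, so the comparison comes down to that fee versus whatever it costs you to administer the cluster yourself. A minimal sketch of the arithmetic, where `cluster_cost` is a hypothetical helper and every rate is a made-up placeholder rather than a real AWS price:

```python
def cluster_cost(nodes, hours, ec2_rate, emr_fee=0.0):
    """Total cost in dollars for `nodes` instances running for `hours` hours.

    ec2_rate and emr_fee are both per instance-hour. The values used below
    are hypothetical placeholders, not actual AWS prices.
    """
    return nodes * hours * (ec2_rate + emr_fee)


# Hypothetical numbers, purely for illustration.
NODES, HOURS = 100, 10
EC2_RATE = 0.50  # placeholder $/instance-hour for the EC2 instance
EMR_FEE = 0.25   # placeholder $/instance-hour EMR service fee

plain_ec2 = cluster_cost(NODES, HOURS, EC2_RATE)
with_emr = cluster_cost(NODES, HOURS, EC2_RATE, EMR_FEE)

print(f"EC2 + Hadoop: ${plain_ec2:.2f}")
print(f"EMR:          ${with_emr:.2f}")
print(f"EMR premium:  ${with_emr - plain_ec2:.2f}")
```

Plug in the current rates for your instance type and region; the sketch leaves out admin time on the EC2 side, which is the main thing the EMR fee is buying.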
We use both approaches (EMR and EC2) at my job.
The advantages of EMR that Amar mentioned are more or less true, so if you want simplicity it may be the way to go.
But there are other considerations:
hadoop@domU-12-31-39-07-B9-97:~$ ll hadoop*.jar
lrwxrwxrwx 1 hadoop hadoop 73 Feb 5 12:00 hadoop-examples-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-examples-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 69 Feb 5 12:00 hadoop-test-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-test-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 69 Feb 5 12:00 hadoop-core-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-core-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 70 Feb 5 12:00 hadoop-tools-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-tools-0.20.205.jar
lrwxrwxrwx 1 hadoop hadoop 68 Feb 5 12:00 hadoop-ant-0.20.205.jar -> /home/hadoop/.versions/0.20.205/share/hadoop/hadoop-ant-0.20.205.jar
As a direct consequence, I had to recode and restructure my Map/Reduce program because contrib modules were missing from the older version running on EMR.
You do not have as much opportunity to use non-Map/Reduce algorithms as you would with an up-to-date version of M/R.
Running your own cluster on EC2 gives you the flexibility to mix and match versions of the Hadoop ecosystem.
Well, administering, monitoring, and maintaining a cluster isn't a small task in itself. With EMR you can get machines configured and up and running with your custom bootstrap code in no time. Beyond that, EMR provides a lot of other tools, options, and facilities too.
With EMR you don't have to worry about terminating the cluster after the jobs are done. You can certainly implement this yourself in an EC2 + Hadoop setup, but EMR does it for you in a neat way.
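To make the bootstrap and auto-terminate points concrete, here is a rough sketch of what such a request could look like if you drive EMR from Python. It only builds the keyword arguments for boto3's EMR `run_job_flow` call and does not contact AWS. The helper name `build_transient_cluster_request`, the S3 paths, the instance type and count, and the release label are all placeholders I made up for illustration; the part that matters is `KeepJobFlowAliveWhenNoSteps` set to `False`, which asks EMR to shut the cluster down once the steps finish.

```python
import json


def build_transient_cluster_request(name, bootstrap_script, step_jar, step_args):
    """Build keyword arguments for boto3's EMR client run_job_flow call.

    Hypothetical helper: it just assembles a dict. Instance type, count,
    and release label below are illustrative placeholders.
    """
    return {
        "Name": name,
        "ReleaseLabel": "emr-x.y.z",  # placeholder, substitute a real release label
        "Instances": {
            "MasterInstanceType": "m5.xlarge",  # example instance type
            "SlaveInstanceType": "m5.xlarge",
            "InstanceCount": 3,
            # False => EMR terminates the cluster when no steps remain.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Custom bootstrap code that runs on each node at startup.
        "BootstrapActions": [
            {
                "Name": "custom-setup",
                "ScriptBootstrapAction": {"Path": bootstrap_script, "Args": []},
            }
        ],
        "Steps": [
            {
                "Name": "main-job",
                "ActionOnFailure": "TERMINATE_CLUSTER",
                "HadoopJarStep": {"Jar": step_jar, "Args": list(step_args)},
            }
        ],
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "ServiceRole": "EMR_DefaultRole",
    }


request = build_transient_cluster_request(
    name="nightly-job",
    bootstrap_script="s3://my-bucket/bootstrap.sh",  # hypothetical path
    step_jar="s3://my-bucket/my-job.jar",            # hypothetical path
    step_args=["s3://my-bucket/input", "s3://my-bucket/output"],
)
print(json.dumps(request, indent=2))

# With boto3 installed and credentials configured, this would be submitted as:
#   boto3.client("emr").run_job_flow(**request)
```

On a self-managed EC2 + Hadoop cluster you would have to script the equivalent shutdown yourself once the last job completes.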
You can also resize the cluster even while your jobs are running!
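As a sketch of what a resize looks like from code: boto3's EMR client has a `modify_instance_groups` call that takes an instance group ID and a new instance count. The helper below (`build_resize_request`, a name I made up) only assembles the arguments and does not contact AWS; the group ID shown is a hypothetical placeholder.

```python
def build_resize_request(instance_group_id, new_count):
    """Build keyword arguments for boto3's EMR modify_instance_groups call.

    Hypothetical helper: it validates the count and assembles a dict.
    """
    if new_count < 0:
        raise ValueError("new_count must not be negative")
    return {
        "InstanceGroups": [
            {"InstanceGroupId": instance_group_id, "InstanceCount": new_count}
        ]
    }


# "ig-EXAMPLE" is a placeholder, not a real instance group ID.
resize = build_resize_request("ig-EXAMPLE", 10)
print(resize)

# With boto3 installed and credentials configured, this would be sent as:
#   boto3.client("emr").modify_instance_groups(**resize)
```

Doing the same on a hand-built cluster means launching the instances, installing and configuring Hadoop on them, and registering them with the running cluster yourself.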
The versions of Pig and Hive available with EMR also contain patches that make it easier to work with files in S3.
Even here in this answer you may find that EMR has been given an upper hand.