
MapReduce on AWS

Anybody played around with MapReduce on AWS yet? Any thoughts? How's the implementation?

Lou Kosak asked Apr 02 '09 14:04


3 Answers

It's easy to get started.

Here's a FAQ: http://aws.amazon.com/elasticmapreduce/faqs/

And here's the Getting Started Guide: http://docs.amazonwebservices.com/ElasticMapReduce/latest/GettingStartedGuide/

If you have an EC2 account already, you can enable MapReduce and have a sample application up and running in less than 10 minutes using the AWS Management Console.

I did the pre-packaged Word Count sample application, which returns a count of each word contained in about 20 MB of text. You can provision up to 20 instances to run concurrently, though I just used 2 instances and the job completed in about 3 minutes.

The job returns a 300 KB alphabetized list of words and how often each word appears in the sample corpus.

I really like that MapReduce jobs can be written in my choice of Perl, Python, Ruby, PHP, C++, R, or Java. The process was painless and straightforward, and the interface gives good feedback on the status of your instances and the job flow.
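To give a feel for what a word count job looks like in Python, here is a minimal sketch. It is not the code of Amazon's pre-packaged sample; the names `mapper`, `reducer`, and `word_count` are made up for illustration, and `sorted()` stands in for the shuffle-and-sort step that Hadoop performs between the map and reduce phases.

```python
from itertools import groupby


def mapper(lines):
    """Emit one (word, 1) pair per whitespace-separated word, lowercased."""
    for line in lines:
        for word in line.split():
            yield word.lower(), 1


def reducer(pairs):
    """Sum the counts for each word. Pairs must arrive sorted by word."""
    for word, group in groupby(pairs, key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)


def word_count(lines):
    # sorted() is a local stand-in (assumption) for Hadoop's shuffle-and-sort,
    # which delivers all pairs for the same word to the reducer together.
    return list(reducer(sorted(mapper(lines))))
```

Calling `word_count(["the cat", "The dog the"])` gives an alphabetized list of `(word, count)` pairs, the same shape as the output the sample job returns. In a real streaming job the mapper and reducer would typically be separate scripts reading from standard input.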

Be aware that, since AWS charges for a full hour when an instance is created, and since the MapReduce instances are automatically terminated at the end of the job flow, the cost of multiple fast-running job flows can add up quickly.

For example, if I create a job flow that uses 20 instances and returns results in 15 minutes, and then re-run the job flow 3 more times, I'll be charged for 80 hours of machine time even though I only had 20 instances running for 1 hour.
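The arithmetic behind that example can be sketched as follows. This assumes the billing model described above (each instance in each job flow is billed for whole hours, rounded up); the function name `billed_instance_hours` is hypothetical and the model is deliberately simplified.

```python
import math


def billed_instance_hours(instances, minutes_per_run, runs):
    """Instance-hours billed when every run is rounded up to whole hours.

    Assumption: each job flow starts fresh instances, so each run is
    rounded up separately for every instance.
    """
    hours_per_run = math.ceil(minutes_per_run / 60)
    return instances * hours_per_run * runs


# 20 instances, 15 minutes per run, 4 runs in total (1 run plus 3 re-runs)
total = billed_instance_hours(20, 15, 4)
```

Here `total` is 80, while a single 60-minute job flow on the same 20 instances would be billed as 20 instance-hours for the same amount of wall-clock compute.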

mb. answered Oct 12 '22 01:10


You can also run MapReduce (Hadoop) on AWS with StarCluster. This tool configures the cluster for you and has two advantages: you don't pay the extra Amazon Elastic MapReduce charge (useful if you want to reduce your costs), and you can create your own image (AMI) with your tools preinstalled (useful if the tools can't be installed by a bootstrap script).

martin s answered Oct 12 '22 02:10


It is very convenient because you don't have to administer your own cluster. You pay only for what you use, so I think it is a good idea if you have a job that needs to run once in a while. We run Amazon MapReduce just once a month, so for our usage it is worth it.

However, as far as I can tell, a drawback of Amazon MapReduce is that you can't tell which operating system is running, or even its version. This caused me problems running C++ code compiled with g++ 4.44, some of the OS images do not include the cURL library, and so on.

If you don't need any special libraries for your use case, I would say go for it.

sagie answered Oct 12 '22 02:10