 

Hadoop or Hadoop Streaming for MapReduce on AWS

I'm about to start a MapReduce project that will run on AWS, and I have to choose between two languages: Java or C++.

I understand that writing the project in Java would make more Hadoop functionality available to me; however, C++ could pull it off too, through Hadoop Streaming.

Mind you, I have little background in either language. A similar project has been done in C++, and its code is available to me.

So my question: is this extra functionality available through AWS, or is it only relevant if you have more control over the cloud? Is there anything else I should bear in mind when making the decision, such as the availability of plugins for Hadoop that work better with one language or the other?

Thanks in advance

asked Dec 28 '09 by aeolist


1 Answer

You have a few options for running Hadoop on AWS. The simplest is to run your MapReduce jobs via their Elastic MapReduce service: http://aws.amazon.com/elasticmapreduce. You could also run a Hadoop cluster on EC2, as described at http://archive.cloudera.com/docs/ec2.html.

If you suspect you'll need to write your own input/output formats, partitioners, and combiners, I'd recommend using Java with the latter system. If your job is relatively simple and you don't plan to use your Hadoop cluster for any other purpose, I'd recommend choosing the language with which you are most comfortable and using EMR.

Either way, good luck!

Disclosure: I am a founder of Cloudera.

Regards, Jeff

answered Oct 07 '22 by Jeff Hammerbacher