
EMR vs EC2/Hadoop on AWS

I know that running Hadoop directly on EC2 is more flexible than EMR, but also more work. In terms of cost, though, a plain EC2 setup probably requires EBS volumes attached to the instances, whereas EMR just streams data in from S3. Crunching the numbers in the AWS calculator, EMR comes out cheaper than EC2, even though with EMR one must also pay for the underlying EC2 instances. Am I wrong here? Of course EC2 with EBS is probably faster, but is it worth the cost?

thanks, Matt

matthieu lieber asked Oct 02 '13 03:10



2 Answers

EMR does a lot of things for you that you won't get from standard Hadoop on EC2. Some particularly important ones include:

  • Copying Hadoop logs from your machines to S3. This is very useful for debugging errors after the cluster has been shut down.
  • Running job flows of multiple MapReduce, Pig, or Hive jobs
  • Setting sensible configuration defaults based on hardware size you choose
  • Access to spot instances for cheaper compute
  • Ability to resize clusters dynamically
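As a rough illustration of how some of these features surface in practice, here is a minimal sketch that builds a partial cluster request of the shape accepted by the boto3 EMR client's `run_job_flow` call. It only constructs a dictionary and sends nothing to AWS. The function name, bucket name, instance type, and bid price are hypothetical placeholders, and a real request would need further fields (IAM roles, release label, and so on).

```python
def build_cluster_request(name, log_bucket, instance_type,
                          core_count, task_count, task_bid_price):
    """Build a partial EMR cluster request dict (illustrative only, never sent)."""
    def group(group_name, role, market, count):
        # One instance group: a set of identical nodes sharing a role.
        return {
            "Name": group_name,
            "InstanceRole": role,      # MASTER, CORE, or TASK
            "Market": market,          # ON_DEMAND or SPOT
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    # Task nodes run on spot capacity for cheaper compute.
    spot_tasks = group("task-spot", "TASK", "SPOT", task_count)
    spot_tasks["BidPrice"] = str(task_bid_price)  # the API expects a string

    return {
        "Name": name,
        # EMR copies Hadoop logs to this S3 location, so they outlive the cluster.
        "LogUri": f"s3://{log_bucket}/emr-logs/",
        "Instances": {
            "InstanceGroups": [
                group("master", "MASTER", "ON_DEMAND", 1),
                group("core", "CORE", "ON_DEMAND", core_count),
                spot_tasks,
            ],
            "KeepJobFlowAliveWhenNoSteps": True,
        },
    }


# Hypothetical values, chosen only to show the shape of the request.
request = build_cluster_request(
    "example-cluster", "my-example-bucket", "m5.xlarge", 2, 4, 0.10
)
```

Resizing a running cluster is a separate API call and is not shown here.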

You'll also find that the EMR S3 filesystem is faster and more reliable than the standard one packaged with Apache Hadoop. It supports multipart upload and streams writes directly to S3 rather than buffering to disk first. For a bit more on this, see Tip #5.

Additionally, if you do decide to use EC2 directly, I'd recommend using instance-storage instead of EBS for your nodes. There's really no reason to pay the extra cost of EBS for Hadoop; you'll notice that EMR clusters all run on instance-storage nodes as well.

ddaniels888 answered Nov 15 '22 05:11


You are correct that EMR uses instance-store backed EC2 instances rather than EBS. However, there's nothing stopping you from creating an instance-store based instance, packaging it as an AMI, and using that for your own Hadoop cluster. Using EBS also might not add much cost, depending on your workload and how often you run it. Keep in mind that EMR adds its own per-instance charge on top of the regular EC2 price.
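To make the trade-off concrete, here is a minimal back-of-the-envelope sketch of the comparison. Every price in it is a made-up placeholder, not a real AWS rate, and the function name and parameters are hypothetical; plug in figures from the AWS calculator for your own region and instance type.

```python
def hourly_cluster_cost(nodes, ec2_hourly, emr_fee_hourly=0.0,
                        ebs_gb_per_node=0, ebs_gb_month=0.0,
                        hours_per_month=730):
    """Approximate hourly cost of a cluster in dollars.

    ec2_hourly      -- EC2 price per node per hour
    emr_fee_hourly  -- EMR surcharge per node per hour (0 for plain EC2)
    ebs_gb_per_node -- EBS storage attached to each node, in GB
    ebs_gb_month    -- EBS price per GB per month
    """
    # Spread the monthly EBS storage charge over the hours in a month.
    ebs_hourly = ebs_gb_per_node * ebs_gb_month / hours_per_month
    return nodes * (ec2_hourly + emr_fee_hourly + ebs_hourly)


# Placeholder numbers only: 10 nodes at $0.24/h each.
emr_cost = hourly_cluster_cost(10, ec2_hourly=0.24, emr_fee_hourly=0.06)
ec2_ebs_cost = hourly_cluster_cost(10, ec2_hourly=0.24,
                                   ebs_gb_per_node=500, ebs_gb_month=0.10)

print(f"EMR (instance store): ${emr_cost:.3f}/h")
print(f"EC2 + EBS:            ${ec2_ebs_cost:.3f}/h")
```

Which side comes out cheaper depends entirely on the inputs: the EMR surcharge scales with instance hours, while the EBS charge scales with the amount of storage you keep attached. The sketch also ignores costs that apply to both setups, such as S3 storage and requests.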

I've been using EMR for two years now and would highly recommend the service, since you don't need to invest time in managing and updating your Hadoop distribution. If your workload is compatible with EMR (for example, reading data from DynamoDB or S3), I would go for EMR over EC2/Hadoop.

andreimarinescu answered Nov 15 '22 03:11