
S3 and EMR data locality [closed]

Data locality is very important with MapReduce and HDFS (the same goes for Spark and HBase). I've been researching AWS and the two options for deploying a cluster in their cloud:

  • EC2
  • EMR + S3

The second option seems more appealing for several reasons, the most interesting being the ability to scale storage and processing separately and to shut down processing when you don't need it (more precisely, to turn it on only when needed). This is an example explaining the advantages of using S3.

What bugs me is the issue of data locality. If the data is stored in S3, it will need to be pulled to HDFS every time a job is run. My question is: how big can this issue be, and is it still worth it?

What comforts me is that I'll be pulling the data only the first time; all subsequent jobs will have the intermediate results locally.
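One rough way to frame the trade-off is a back-of-envelope model comparing "every job reads from S3" with "only the first job reads from S3, later jobs read locally". The sketch below is illustrative only: the function name, the throughput figures, and the simplification that later jobs read the same volume of data are all assumptions, not measurements of S3 or EMR.

```python
def total_read_seconds(data_gb: float,
                       jobs: int,
                       s3_gb_per_s: float,
                       local_gb_per_s: float,
                       local_after_first: bool) -> float:
    """Estimate the total time spent reading input across a series of jobs.

    All throughput values are hypothetical aggregate cluster figures.
    """
    if jobs <= 0:
        return 0.0
    s3_read = data_gb / s3_gb_per_s
    if not local_after_first:
        # Every job pays the full S3 read cost.
        return jobs * s3_read
    # First job reads from S3; the remaining jobs read from local storage.
    local_read = data_gb / local_gb_per_s
    return s3_read + (jobs - 1) * local_read


# Made-up numbers: 1000 GB of input, 10 jobs, 2 GB/s from S3, 5 GB/s locally.
always_s3 = total_read_seconds(1000, 10, 2.0, 5.0, local_after_first=False)
first_only = total_read_seconds(1000, 10, 2.0, 5.0, local_after_first=True)
print(always_s3, first_only)
```

With these assumed inputs the gap grows with the number of jobs, so the answer depends heavily on how often the same data is re-read and on the real throughput of the cluster.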

I'm hoping for an answer from someone with practical experience with this. Thank you.

Kobe-Wan Kenobi asked Jun 01 '17 09:06



2 Answers

EMR does not pull data from S3 into HDFS. It uses EMRFS, its own implementation of HDFS support on top of S3, so jobs read and write S3 as if they were operating on an actual HDFS. See https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html

As for data locality, S3 is RACK_LOCAL to EMR Spark clusters.
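To illustrate the first point, here is a minimal PySpark sketch of a job on an EMR cluster reading its input directly from an `s3://` path, with no separate copy step into HDFS. The bucket name, key prefix, and helper names are hypothetical, and the sketch assumes PySpark is available on the cluster.

```python
def s3_uri(bucket: str, key_prefix: str) -> str:
    """Build an s3:// URI; on EMR this scheme is handled by EMRFS."""
    return f"s3://{bucket}/{key_prefix.lstrip('/')}"


def count_records(input_uri: str) -> int:
    """Count the lines under input_uri, reading straight from S3."""
    # Imported lazily so s3_uri() can be used without Spark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("s3-direct-read").getOrCreate()
    # No copy into HDFS beforehand: Spark reads the objects through the
    # file system registered for the s3:// scheme.
    return spark.read.text(input_uri).count()


# Hypothetical bucket and prefix, for illustration only.
uri = s3_uri("example-bucket", "/logs/2017/")
print(uri)
```

On a cluster, `count_records(uri)` would run as a normal Spark job; the only difference from an HDFS-backed job is the URI scheme.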

Fermat's Little Student answered Oct 15 '22 04:10


According to the sources below, EMR + S3 with EMRFS doesn't maintain data locality and is not well suited to analytics processing with SQL-based tools. Redshift is the right choice for such use cases, where compute and data are in one place. See 39:00 to 42:00 in the following video:

https://youtu.be/08G9NfDETVE

This is also mentioned in https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html (see the "performance per dollar" section).

To see how EMR works with S3, refer to the book Programming Elastic MapReduce by Kevin Schmidt and Christopher Phillips (Chapter 1, section "Amazon Elastic MapReduce Versus Traditional Hadoop Installs").

ravi answered Oct 15 '22 03:10