S3 and EMR data locality [closed]

Tags: amazon-web-services, amazon-s3, amazon-ec2, hadoop, amazon-emr

Data locality is very important for MapReduce on HDFS (the same goes for Spark and HBase). I've been researching AWS and the two options for deploying a cluster in their cloud:

EC2
EMR + S3

The second option seems more appealing for several reasons, the most interesting being the ability to scale storage and processing separately and to shut down processing when you don't need it (more precisely, to turn it on only when needed). This is an example explaining the advantages of using S3.

What bugs me is the issue of data locality. If the data is stored in S3, it will need to be pulled into HDFS every time a job is run. My question is: how big can this issue be, and is it still worth it?

What comforts me is that I'll be pulling the data only the first time, and all subsequent jobs will have the intermediate results locally.

I'm hoping for an answer from someone with practical experience with this. Thank you.

asked Jun 01 '17 09:06

Kobe-Wan Kenobi

2 Answers

EMR does not pull data from S3 into HDFS. It uses EMRFS, its own implementation of the Hadoop file system interface on top of S3, so jobs read and write S3 as if they were operating on an actual HDFS. https://docs.aws.amazon.com/emr/latest/ManagementGuide/emr-fs.html

As for data locality, S3 is RACK_LOCAL to EMR Spark clusters.
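To illustrate the first point, here is a minimal sketch of an EMR step definition in which the job is pointed at `s3://` URIs directly, with no copy-to-HDFS stage in between. The bucket name, script path, and the `spark_step` helper are hypothetical and only for illustration; the dict layout follows the step structure that EMR accepts (`command-runner.jar` running `spark-submit`).

```python
def spark_step(name, script_uri, input_uri, output_uri):
    """Build an EMR step dict that runs spark-submit against S3 URIs.

    Hypothetical helper: it only assembles the step definition and
    does not talk to AWS.
    """
    for uri in (script_uri, input_uri, output_uri):
        if not uri.startswith("s3://"):
            raise ValueError(f"expected an s3:// URI, got {uri!r}")
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            # command-runner.jar lets a step run a command such as spark-submit
            "Jar": "command-runner.jar",
            "Args": [
                "spark-submit",
                "--deploy-mode", "cluster",
                script_uri,   # the job script itself lives in S3
                input_uri,    # read through EMRFS, not copied to HDFS first
                output_uri,   # results written straight back to S3
            ],
        },
    }


# Placeholder bucket and paths, not real resources.
step = spark_step(
    "wordcount",
    "s3://example-bucket/jobs/wordcount.py",
    "s3://example-bucket/input/",
    "s3://example-bucket/output/",
)
print(step["HadoopJarStep"]["Args"])
```

With boto3, a dict like this could be passed in the `Steps` list of the EMR client's `add_job_flow_steps` call for an existing cluster.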

answered Oct 15 '22 04:10

Fermat's Little Student

According to the source below, EMR + S3 with EMRFS does not maintain data locality and is not well suited to SQL-based analytics processing. Redshift is the right choice for such use cases, where compute and data are in one place. Please refer to 39:00 to 42:00 in the video linked below:

https://youtu.be/08G9NfDETVE

This is also mentioned in https://databricks.com/blog/2017/05/31/top-5-reasons-for-choosing-s3-over-hdfs.html. Please refer to the performance per dollar section.
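As a rough way to reason about "performance per dollar", the sketch below compares two storage options by dividing read throughput by storage cost. This is only one simple way to define the metric, and every number in it is a made-up placeholder, not a measurement and not a figure from the linked article; substitute your own benchmark results and pricing.

```python
from dataclasses import dataclass


@dataclass
class StorageOption:
    name: str
    read_throughput_mb_s: float  # placeholder: measured read throughput
    cost_per_tb_month: float     # placeholder: storage cost in USD


def perf_per_dollar(opt: StorageOption) -> float:
    """Throughput obtained per dollar of monthly storage cost."""
    return opt.read_throughput_mb_s / opt.cost_per_tb_month


def relative_perf_per_dollar(a: StorageOption, b: StorageOption) -> float:
    """How many times better (or worse) option a is than option b."""
    return perf_per_dollar(a) / perf_per_dollar(b)


# Hypothetical inputs chosen only to show the arithmetic.
hdfs = StorageOption("hdfs-on-ec2", read_throughput_mb_s=400.0, cost_per_tb_month=80.0)
s3 = StorageOption("s3", read_throughput_mb_s=100.0, cost_per_tb_month=40.0)

ratio = relative_perf_per_dollar(s3, hdfs)
print(f"{s3.name} vs {hdfs.name}: {ratio:.2f}x performance per dollar")
```

A ratio above 1 would favor the first option and a ratio below 1 the second; the outcome depends entirely on the inputs you plug in.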

To see how EMR works with S3, please refer to the book Programming Elastic MapReduce by Kevin Schmidt and Christopher Phillips (Chapter 1, section "Amazon Elastic MapReduce Versus Traditional Hadoop Installs").

answered Oct 15 '22 03:10

ravi
