 

AWS EMR performance: HDFS vs S3

In Big Data, the code is pushed towards the data for execution. This makes sense, since the data is huge and the code to be executed is relatively small. Coming to AWS EMR, the data can be either in HDFS or in S3. In the case of S3, the data has to be pulled over the network to the core/task nodes before it can be processed. This can be a bit of overhead compared to data that is already in HDFS.
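To make the difference concrete, here is a minimal sketch (hypothetical bucket, paths, and example-jar location) of the same MR job being pointed at HDFS or S3 input purely by changing the URI scheme; with an s3:// URI the input is pulled to the nodes through EMRFS at read time:

```python
# Hypothetical sketch: the same MR job can read from HDFS or S3 simply by
# changing the input URI scheme. Bucket and paths are placeholders; the
# wordcount example jar ships with Hadoop on EMR, but its path may differ
# per release.
import subprocess

hdfs_input = "hdfs:///user/hadoop/input/"    # data already on the cluster's DataNodes
s3_input = "s3://my-example-bucket/input/"   # data pulled from S3 via EMRFS at read time

subprocess.run(
    [
        "hadoop", "jar", "/usr/lib/hadoop-mapreduce/hadoop-mapreduce-examples.jar",
        "wordcount",
        hdfs_input,                          # swap in s3_input to read straight from S3
        "hdfs:///user/hadoop/output/",
    ],
    check=True,
)
```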

Recently I noticed that while an MR job was executing, there was significant latency in getting the log files into S3. Sometimes it took a couple of minutes for the log files to appear, even after the job had completed.

Any thoughts on this? Does anyone have metrics comparing MR job completion times with the data in HDFS vs S3?

asked Nov 22 '13 by Praveen Sripati


People also ask

Is HDFS faster than S3?

The good news is that HDFS performance is excellent. Because data is stored and processed on the same machines, access and processing speed are lightning-fast. Unfortunately, S3 doesn't perform as well as HDFS. The latency is obviously higher and the data throughput is lower.

Is S3 slower than HDFS?

S3 is slower to work with than HDFS, even on virtual clusters running on Amazon EC2. From a performance perspective, the key points to remember are that S3 throttles bucket access across all callers (so adding workers can make things worse) and that EC2 VMs have their network I/O throttled based on the instance type.

Does EMR use HDFS or S3?

HDFS and the EMR File System (EMRFS), which uses Amazon S3, are both compatible with Amazon EMR, but they're not interchangeable. HDFS is an implementation of the Hadoop FileSystem API, which models POSIX file system behavior. EMRFS is backed by Amazon S3, which is an object store rather than a file system.

Does AWS EMR use HDFS?

HDFS and EMRFS are the two main file systems used with Amazon EMR.


2 Answers

That's problematic on a different level.

S3 has only eventual consistency. After something is written by your code (e.g. via a close() or flush()), you can't necessarily see or read it immediately, because the write propagates with a delay. I think this might be due to the allocation of free resources for the data you write. So it is not a problem of performance, but of the consistency you really want/need.
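For illustration, here is a minimal boto3 sketch (placeholder bucket and key names) of the kind of polling a reader could do to wait until a freshly written object actually becomes visible:

```python
# Hypothetical sketch: poll S3 until a freshly written object becomes visible.
# Under the old eventual-consistency model, a read right after the write could
# fail or return stale data, so readers sometimes had to wait like this.
import boto3

s3 = boto3.client("s3")

waiter = s3.get_waiter("object_exists")
waiter.wait(
    Bucket="my-example-bucket",                    # placeholder bucket
    Key="logs/j-EXAMPLE/steps/s-EXAMPLE/syslog",   # placeholder key
    WaiterConfig={"Delay": 5, "MaxAttempts": 24},  # poll every 5 s, up to 2 minutes
)
print("object is now readable")
```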

What do I do on EMR? I start up a Hadoop cluster and put everything the job(s) need into HDFS. Reads from S3 are much more expensive in time, and the eventual consistency makes it basically useless for buffering intermediate data between jobs.
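A minimal sketch of that staging step, assuming the s3-dist-cp tool that ships with EMR and placeholder bucket/paths (on a plain Hadoop cluster, hadoop distcp plays the same role):

```python
# Hypothetical sketch: copy job input from S3 into HDFS on the EMR master node
# before running the actual MR job, so the job reads from local HDFS instead of S3.
# Bucket and paths are placeholders; s3-dist-cp ships with Amazon EMR.
import subprocess

subprocess.run(
    [
        "s3-dist-cp",
        "--src", "s3://my-example-bucket/input/",
        "--dest", "hdfs:///user/hadoop/input/",
    ],
    check=True,  # raise if the copy step fails
)
```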

However, S3 is great for backing up files from your HDFS or making them available to other instances or services (e.g. CloudFront).

Added:

As of 8 Dec 2020, S3 supports strong read-after-write consistency across all Regions by default: https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/

answered Sep 19 '22 by Thomas Jungblut


In terms of performance, HDFS is better than S3.

HDFS is the better choice if your requirement is long-term, you need high performance, and you want to run iterative machine learning algorithms.

S3 is the better choice if your load is variable and you need high durability and persistence at lower cost.

For more information, see http://www.nithinkanil.com/2015/05/hdfs-vs-s3.html

answered Sep 19 '22 by Nithin K Anil