In big data systems, the code is pushed towards the data for execution. This makes sense, since the data is huge and the code is comparatively small. On AWS EMR, the data can live either in HDFS or in S3. In the S3 case, the data has to be pulled over the network to the core/task nodes before it can be processed, which adds some overhead compared to data already in HDFS.
Recently, I noticed that while an MR job was executing there was huge latency in getting the log files into S3. Sometimes it took a couple of minutes for the log files to appear even after the job had completed.
Any thoughts on this? Does anyone have metrics comparing MR job completion times with data in HDFS vs S3?
The good news is that HDFS performance is excellent: because data is stored and processed on the same machines, access and processing are very fast. Unfortunately, S3 doesn't perform as well as HDFS — latency is noticeably higher and data throughput is lower.
S3 is slower to work with than HDFS, even on virtual clusters running on Amazon EC2. From a performance perspective, the key points to remember are:

- S3 throttles bucket access across all callers, so adding workers can actually make things worse.
- EC2 VMs have network IO throttled based on the instance type.
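When S3 throttling does bite, the standard client-side mitigation is to retry with exponential backoff and jitter rather than adding more workers. Here is a minimal sketch in Python — the `SlowDown` exception and `flaky_put` function are toy stand-ins for an S3 client's throttling error and upload call, not a real AWS API:

```python
import random
import time

class SlowDown(Exception):
    """Stand-in for an S3 HTTP 503 SlowDown throttling error."""

def with_backoff(call, max_attempts=5, base_delay=0.1):
    """Retry `call` with exponential backoff plus jitter on throttling."""
    for attempt in range(max_attempts):
        try:
            return call()
        except SlowDown:
            if attempt == max_attempts - 1:
                raise
            # Sleep base, 2*base, 4*base, ... plus up to 100% random jitter.
            delay = base_delay * (2 ** attempt)
            time.sleep(delay + random.uniform(0, delay))

# Example: a fake "upload" that is throttled twice, then succeeds.
attempts = {"n": 0}
def flaky_put():
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise SlowDown()
    return "ok"

print(with_backoff(flaky_put))  # succeeds on the third attempt
```

Real S3 SDKs (e.g. boto3) already do this internally, but the same idea applies when you drive S3 from your own code.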
HDFS and the EMR File System (EMRFS), which uses Amazon S3, are both compatible with Amazon EMR, but they're not interchangeable. HDFS is an implementation of the Hadoop FileSystem API, which models POSIX file system behavior. EMRFS is backed by Amazon S3, which is an object store, not a file system.
HDFS and EMRFS are the two main file systems used with Amazon EMR.
That's problematic on a different level.
S3 has only eventual consistency: you can't necessarily read data immediately after your code has written it (e.g. after a `close()` or `flush()`), because the write becomes visible only after a delay. So it is not just a problem of performance, but of the consistency you actually want/need.
What do I do on EMR? I start up a Hadoop cluster and copy everything the job(s) need into HDFS. Reads from S3 are much more expensive in time, and the eventual consistency makes it basically useless for buffering items between jobs.
However, S3 is great for backing up files from your HDFS or making them available to other instances or services (e.g. CloudFront).
Added:
As of 8/Dec/2020, S3 supports strong read-after-write consistency across all Regions by default. https://aws.amazon.com/blogs/aws/amazon-s3-update-strong-read-after-write-consistency/
In terms of performance, HDFS is better than S3.

HDFS is better if you have a long-term requirement, need high performance, and want to run iterative machine learning algorithms.

S3 is better if your load is variable and you need high durability and persistence at lower cost.
For more information, see http://www.nithinkanil.com/2015/05/hdfs-vs-s3.html