Consider a scenario where Spark (or any other Hadoop framework) reads a large (say 1 TB) file from S3. How do multiple Spark executors read such a very large file in parallel from S3? In HDFS this very large file would be distributed across multiple nodes, with each node holding a block of the data. In object storage I presume the entire file sits on a single node (ignoring replicas). This should drastically reduce the read throughput/performance.
Similarly, large file writes should also be much faster in HDFS than in S3, because writes in HDFS would be spread across multiple hosts, whereas all the data has to go through one host (ignoring replication for brevity) in S3.
So does this mean the performance of S3 is significantly worse than HDFS in the big data world?
Server-side encryption slightly slows down performance when reading data from S3, both in the reading of data during the execution of a query and in scanning the files prior to the actual scheduling of work.
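As a rough illustration, here is how server-side encryption might be switched on through the S3A connector from a SparkSession. The bucket and path are placeholders, and the property name should be checked against your Hadoop version; this is a sketch, not a definitive recipe.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable SSE-S3 (AES256) for the s3a connector via Spark's
// Hadoop configuration. "my-bucket" and the path are hypothetical.
val spark = SparkSession.builder()
  .appName("s3a-sse-read")
  .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "AES256")
  .getOrCreate()

// Reads of encrypted objects work transparently, but every GET pays a
// small server-side decryption cost.
val df = spark.read.parquet("s3a://my-bucket/warehouse/table/")
df.show(10)
```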
You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. Similarly, you can scale write operations by writing to multiple prefixes.
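One common way to get multiple prefixes from Spark is to partition the output by a column, so each distinct value lands under its own prefix. A minimal sketch, with a made-up dataset and bucket name:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-prefix-parallelism").getOrCreate()
import spark.implicits._

// Hypothetical dataset; "event_date" is just an illustrative column.
val events = Seq(
  ("2024-01-01", "click"),
  ("2024-01-02", "view")
).toDF("event_date", "event_type")

// partitionBy spreads the output across one prefix per distinct value
// (s3a://my-bucket/events/event_date=2024-01-01/, ...), so S3's
// per-prefix request-rate limits are multiplied rather than shared.
events.write
  .partitionBy("event_date")
  .mode("overwrite")
  .parquet("s3a://my-bucket/events/")
```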
Admittedly, Amazon makes quite a few claims about S3 performance: 55,000 read requests per second, 100-200 millisecond latencies for small objects, and more.
Unlike other cloud providers, Amazon S3 delivers strong read-after-write consistency for any storage request, without changes to performance or availability, without sacrificing regional isolation for applications, and at no additional cost.
Yes, S3 is slower than HDFS. But it's interesting to look at why, and how to mitigate the impact. Key thing: if you are reading a lot more data than writing, then read performance is critical; the S3A connector in Hadoop 2.8+ really helps there, as it was tuned for reading Parquet/ORC files based on traces of real benchmarks. Write performance also suffers, and the more data you generate, the worse it gets. People complain about that, when they should really be worrying about the fact that without special effort you may actually end up with invalid output. That's generally the more important issue; it's just less obvious.
Reading from S3 suffers due to:

- seek() being awful if the HTTP connection for a read is aborted and a new one renegotiated. Without a connector which has optimised seek() for this, ORC and Parquet input suffers badly. The s3a connector in Hadoop 2.8+ does precisely this if you set fs.s3a.experimental.fadvise to random (see the configuration sketch after this list).
- Spark will split up work on a file if the format is splittable and whatever compression format is used is also splittable (gz isn't, snappy is). It will do this on block size, which is something you can configure/tune for a specific job (fs.s3a.block.size).
- If more than one client reads the same file, then yes, you get some overload of the disk IO to that file, but generally it's minor compared to the rest. One little secret: for multipart-uploaded files, reading separate parts seems to avoid this, so upload and download with the same configured block size.
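Here is a minimal sketch of that read-side tuning applied through a SparkSession. The bucket, path and block size are placeholders, and the values are illustrative rather than recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Read-side tuning for the Hadoop 2.8+ s3a connector.
val spark = SparkSession.builder()
  .appName("s3a-random-io")
  // Use ranged GETs on seek() instead of aborting and reopening the
  // HTTP connection -- best for columnar formats like ORC/Parquet.
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  // "Block size" reported by s3a and used when computing splits;
  // 134217728 bytes = 128 MB, tune per job.
  .config("spark.hadoop.fs.s3a.block.size", "134217728")
  .getOrCreate()

val orders = spark.read.parquet("s3a://my-bucket/orders/")
println(s"input partitions: ${orders.rdd.getNumPartitions}")
```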
Write performance suffers from:

- Data being buffered locally and only uploaded once the output stream is closed, unless you set fs.s3a.fast.upload = true (Hadoop 2.8+), which uploads blocks as they are written (a configuration sketch follows this list).
- The commit itself: when output is committed by rename() of the files written to a temporary location, the time to copy each object to its final path is 6-10 MB/s.
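A minimal sketch of those write-side settings applied through Spark's Hadoop configuration; the buffer type and part size are illustrative values, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Write-side tuning for the Hadoop 2.8+ s3a connector.
val spark = SparkSession.builder()
  .appName("s3a-fast-upload")
  // Upload blocks as they are produced instead of buffering the whole
  // file locally and uploading it on close().
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  // Buffer pending blocks on local disk ("disk"), or in memory
  // ("array" / "bytebuffer").
  .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")
  // Size of each multipart upload block, in bytes (here 64 MB).
  .config("spark.hadoop.fs.s3a.multipart.size", "67108864")
  .getOrCreate()
```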
A bigger issue is that this rename-based commit is very bad at handling inconsistent directory listings or failures of tasks during the commit process. You cannot safely use S3 as a direct destination of work with the normal commit-by-rename algorithm without something to give you a consistent view of the store (consistent EMRFS, S3mper, S3Guard).
For maximum performance and safe committing of work, you need an output committer optimised for S3. Databricks have their own thing there; Apache Hadoop 3.1 adds the "S3A output committers", and EMR now apparently has something here too.
See "A Zero-Rename Committer" for the details on that problem. After which, hopefully, you'll either move to a safe commit mechanism or use HDFS as a destination of work.
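For reference, wiring one of the Hadoop 3.1+ S3A committers into Spark might look roughly like the sketch below. It assumes the spark-hadoop-cloud module is on the classpath, and the property and class names should be verified against your Spark/Hadoop versions.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: route Spark SQL output through an S3-aware committer instead
// of the rename-based commit.
val spark = SparkSession.builder()
  .appName("s3a-committer")
  // Pick one of the S3A committers: "directory", "partitioned" or "magic".
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  // Bind Spark's commit protocol and Parquet committer to the Hadoop
  // PathOutputCommitter machinery (from the spark-hadoop-cloud module).
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

// With a committer in place, writing directly to S3 avoids the slow,
// unsafe copy-on-rename step:
// df.write.parquet("s3a://my-bucket/output/")
```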