 

S3 parallel read and write performance?

Consider a scenario where Spark (or any other Hadoop framework) reads a large (say 1 TB) file from S3. How do multiple Spark executors read this very large file in parallel from S3? In HDFS this very large file will be distributed across multiple nodes, with each node holding a block of the data. In object storage I presume this entire file will be on a single node (ignoring replicas). This should drastically reduce the read throughput/performance.

Similarly, large file writes should also be much faster in HDFS than in S3, because writes in HDFS are spread across multiple hosts, whereas in S3 all the data has to go through one host (ignoring replication for brevity).

So does this mean the performance of S3 is significantly worse than HDFS in the big data world?

rogue-one asked Jan 15 '19


People also ask

Does S3 encryption affect performance?

Server-side encryption slightly slows down reads from S3, both when reading data during the execution of a query and when scanning the files prior to the actual scheduling of work.

How can I improve my S3 performance?

You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. Similarly, you can scale write operations by writing to multiple prefixes.
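
In a Spark job, one common way to end up with many prefixes is simply to write partitioned output. A minimal Scala sketch under assumed names (the bucket, paths and the "dt" column are hypothetical, not from the answer):

    val spark = org.apache.spark.sql.SparkSession.builder()
      .appName("prefix-parallel-write")                 // hypothetical app name
      .getOrCreate()

    // partitionBy spreads objects across many key prefixes
    // (s3a://my-bucket/events/dt=2019-01-01/, dt=2019-01-02/, ...),
    // so later readers can issue requests against many prefixes in parallel.
    spark.read.parquet("s3a://my-bucket/raw/events/")   // hypothetical source path
      .write
      .partitionBy("dt")                                // one prefix per date value
      .parquet("s3a://my-bucket/events/")               // hypothetical destination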

How fast are reads from S3?

Amazon publishes some ambitious figures for S3 performance: 55,000 read requests per second, 100–200 millisecond latencies for small objects, and more.

Does S3 provides read after write consistency?

Unlike other cloud providers, Amazon S3 delivers strong read-after-write consistency for any storage request, without changes to performance or availability, without sacrificing regional isolation for applications, and at no additional cost.


1 Answer

Yes, S3 is slower than HDFS, but it's interesting to look at why, and how to mitigate the impact. Key thing: if you are reading a lot more data than you are writing, then read performance is critical; the S3A connector in Hadoop 2.8+ really helps there, as it was tuned for reading Parquet/ORC files based on traces of real benchmarks. Write performance also suffers, and the more data you generate the worse it gets. People complain about that, when they should really be worrying about the fact that, without special effort, you may actually end up with invalid output. That's generally the more important issue, just less obvious.

Read performance

Reading from S3 suffers due to

  • bandwidth between S3 and your VM. The more you pay for an EC2 instance, the more network bandwidth you get, and the better reads perform.
  • latency of HEAD/GET/LIST requests, especially all those used to make the object store look like a filesystem with directories. This can particularly hurt the partitioning phase of a query, when all the source files are listed and the ones to actually read are identified.
  • the cost of seek() being awful if the HTTP connection for a read is aborted and a new one has to be renegotiated. Without a connector whose seek() is optimised for this, ORC and Parquet input suffers badly. The S3A connector in Hadoop 2.8+ does precisely this if you set fs.s3a.experimental.fadvise to random (see the sketch after this list).

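As a rough illustration of that last point, here is how the setting might be applied from Spark. Only fs.s3a.experimental.fadvise = random comes from the answer; the app name and path are hypothetical:

    val spark = org.apache.spark.sql.SparkSession.builder()
      .appName("s3a-random-io")                                      // hypothetical app name
      // random-IO mode keeps the HTTP connection usable after a seek(),
      // which is what columnar formats such as ORC and Parquet need
      .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
      .getOrCreate()

    val clicks = spark.read.parquet("s3a://my-bucket/tables/clicks/") // hypothetical path
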
Spark will split up work on a file if the format is splittable and whatever compression is used is also splittable (gz isn't, snappy is). It splits on block size, which is something you can configure/tune for a specific job (fs.s3a.block.size; example below). If more than one client reads the same file, then yes, you get some overload of the disk IO to that file, but that's generally minor compared to the rest. One little secret: for multipart-uploaded files, reading the separate parts seems to avoid this, so upload and download with the same configured block size.
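
A minimal sketch of tuning the split size for one job; the 128 MB value is illustrative only, not a recommendation from the answer:

    val spark = org.apache.spark.sql.SparkSession.builder().getOrCreate()

    // Larger fs.s3a.block.size means fewer, bigger splits per S3 object;
    // at 128 MB, a 1 TB splittable file is read as roughly 8000 tasks
    // spread across the executors.
    spark.sparkContext.hadoopConfiguration
      .set("fs.s3a.block.size", (128L * 1024 * 1024).toString)

    val logs = spark.read.text("s3a://my-bucket/logs/huge-file.txt")  // hypothetical path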

Write Performance

Write performance suffers from

  • caching of many MB of data in blocks before upload, with the upload not starting until the write is completed. On S3A with Hadoop 2.8+: set fs.s3a.fast.upload = true (see the sketch after this list).
  • Network upload bandwidth, again a function of the VM type you pay for.

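A minimal sketch of enabling incremental block upload from Spark; the buffering option is an assumption beyond what the answer states, so check the fs.s3a.fast.upload.buffer documentation for your Hadoop version:

    val spark = org.apache.spark.sql.SparkSession.builder()
      // start uploading blocks as they are written, instead of waiting for close()
      .config("spark.hadoop.fs.s3a.fast.upload", "true")
      // buffer pending blocks on local disk rather than in heap memory
      // (assumption: "disk" is the mode you want)
      .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")
      .getOrCreate()
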
Commit performance and correctness

When output is committed by rename() of files written to a temporary location, each object is copied to its final path at roughly 6-10 MB/s.
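
As a rough worked example using those figures: committing 100 GB of output by rename means copying about 100,000 MB at 6-10 MB/s, i.e. on the order of 10,000-17,000 seconds, or roughly three to five hours spent just on the commit.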

A bigger issue is that this commit process handles inconsistent directory listings, or failures of tasks during the commit, very badly. You cannot safely use S3 as a direct destination of work with the normal commit-by-rename algorithm without something to give you a consistent view of the store (consistent EMRFS, s3mper, S3Guard).
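
For the S3Guard option specifically, a hedged sketch of the relevant settings as applied from Spark; the table name and region are hypothetical, and the exact property names should be checked against your Hadoop version:

    val spark = org.apache.spark.sql.SparkSession.builder()
      // keep listings in DynamoDB so the committer sees a consistent directory view
      .config("spark.hadoop.fs.s3a.metadatastore.impl",
        "org.apache.hadoop.fs.s3a.s3guard.DynamoDBMetadataStore")
      .config("spark.hadoop.fs.s3a.s3guard.ddb.table", "my-s3guard-table")  // hypothetical table
      .config("spark.hadoop.fs.s3a.s3guard.ddb.region", "us-east-1")        // hypothetical region
      .getOrCreate()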

For maximum performance and safe committing of work, you need an output committer optimised for S3. Databricks have their own thing there; Apache Hadoop 3.1 adds the "S3A output committer"; EMR now apparently has something here too.
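
With Hadoop 3.1+ and a Spark build that includes the cloud committer bindings (the spark-hadoop-cloud module), enabling one of the S3A committers looks roughly like this; a sketch only, and the committer choice and class names should be checked against the S3A committer documentation for your versions:

    val spark = org.apache.spark.sql.SparkSession.builder()
      // pick one of the S3A committers ("directory", "partitioned" or "magic")
      .config("spark.hadoop.fs.s3a.committer.name", "directory")
      // route Spark's commit protocol through the Hadoop PathOutputCommitter bindings
      .config("spark.sql.sources.commitProtocolClass",
        "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
      .config("spark.sql.parquet.output.committer.class",
        "org.apache.spark.internal.io.cloud.BinaryParquetOutputCommitter")
      .getOrCreate()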

See "A Zero-Rename Committer" for the details of that problem. After which, hopefully, you'll either move to a safe commit mechanism or use HDFS as the destination of work.

stevel answered Oct 19 '22