Consider a scenario where Spark (or any other Hadoop framework) reads a large (say 1 TB) file from S3. How do multiple Spark executors read such a very large file in parallel from S3? In HDFS this very large file would be distributed across multiple nodes, with each node holding a block of the data. In object storage I presume the entire file sits on a single node (ignoring replicas). This should drastically reduce the read throughput/performance.
Similarly, large file writes should also be much faster in HDFS than in S3, because writes in HDFS would be spread across multiple hosts, whereas all the data has to go through one host (ignoring replication for brevity) in S3.
So does this mean the performance of S3 is significantly worse than HDFS in the big data world?
Server-side encryption slightly slows down performance when reading data from S3, both in the reading of data during the execution of a query and in scanning the files prior to the actual scheduling of work.
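As a rough illustration, here is how server-side encryption might be switched on through the S3A connector from a SparkSession. The bucket and path are placeholders, and the property name should be checked against your Hadoop version; this is a sketch, not a definitive recipe.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: enable SSE-S3 (AES256) for the s3a connector via Spark's
// Hadoop configuration. "my-bucket" and the path are hypothetical.
val spark = SparkSession.builder()
  .appName("s3a-sse-read")
  .config("spark.hadoop.fs.s3a.server-side-encryption-algorithm", "AES256")
  .getOrCreate()

// Reads of encrypted objects work transparently, but every GET pays a
// small server-side decryption cost.
val df = spark.read.parquet("s3a://my-bucket/warehouse/table/")
df.show(10)
```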
You can increase your read or write performance by using parallelization. For example, if you create 10 prefixes in an Amazon S3 bucket to parallelize reads, you could scale your read performance to 55,000 read requests per second. Similarly, you can scale write operations by writing to multiple prefixes.
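One common way to get multiple prefixes from Spark is to partition the output by a column, so each distinct value lands under its own prefix. A minimal sketch, with a made-up dataset and bucket name:

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("s3-prefix-parallelism").getOrCreate()
import spark.implicits._

// Hypothetical dataset; "event_date" is just an illustrative column.
val events = Seq(
  ("2024-01-01", "click"),
  ("2024-01-02", "view")
).toDF("event_date", "event_type")

// partitionBy spreads the output across one prefix per distinct value
// (s3a://my-bucket/events/event_date=2024-01-01/, ...), so S3's
// per-prefix request-rate limits are multiplied rather than shared.
events.write
  .partitionBy("event_date")
  .mode("overwrite")
  .parquet("s3a://my-bucket/events/")
```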
Admittedly, Amazon makes quite a few claims about S3 performance: 55,000 read requests per second, 100-200 millisecond latencies for small objects, and more.
Unlike other cloud providers, Amazon S3 delivers strong read-after-write consistency for any storage request, without changes to performance or availability, without sacrificing regional isolation for applications, and at no additional cost.
Yes, S3 is slower than HDFS. But it's interesting to look at why, and how to mitigate the impact. Key thing: if you are reading a lot more data than writing, then read performance is critical; the S3A connector in Hadoop 2.8+ really helps there, as it was tuned for reading Parquet/ORC files based on traces of real benchmarks. Write performance also suffers, and the more data you generate, the worse it gets. People complain about that, when they should really be worrying about the fact that without special effort you may actually end up with invalid output. That's generally the more important issue; it's just less obvious.
Reading from S3 suffers due to:

- seek() being awful if the HTTP connection for a read is aborted and a new one renegotiated. Without a connector which has optimised seek() for this, ORC and Parquet input suffers badly. The s3a connector in Hadoop 2.8+ does precisely this if you set fs.s3a.experimental.fadvise to random (see the configuration sketch after this list).
- Spark will split up work on a file if the format is splittable and whatever compression format is used is also splittable (gz isn't, snappy is). It will do this on block size, which is something you can configure/tune for a specific job (fs.s3a.block.size).
- If more than one client reads the same file, then yes, you get some overload of the disk IO to that file, but generally it's minor compared to the rest. One little secret: for multipart-uploaded files, reading separate parts seems to avoid this, so upload and download with the same configured block size.
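Here is a minimal sketch of that read-side tuning applied through a SparkSession. The bucket, path and block size are placeholders, and the values are illustrative rather than recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Read-side tuning for the Hadoop 2.8+ s3a connector.
val spark = SparkSession.builder()
  .appName("s3a-random-io")
  // Use ranged GETs on seek() instead of aborting and reopening the
  // HTTP connection -- best for columnar formats like ORC/Parquet.
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  // "Block size" reported by s3a and used when computing splits;
  // 134217728 bytes = 128 MB, tune per job.
  .config("spark.hadoop.fs.s3a.block.size", "134217728")
  .getOrCreate()

val orders = spark.read.parquet("s3a://my-bucket/orders/")
println(s"input partitions: ${orders.rdd.getNumPartitions}")
```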
Write performance suffers from:

- Data being buffered locally and only uploaded once the output stream is closed, unless you set fs.s3a.fast.upload = true (Hadoop 2.8+), which uploads blocks as they are written (a configuration sketch follows this list).
- The commit itself: when output is committed by rename() of the files written to a temporary location, the time to copy each object to its final path is 6-10 MB/s.
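A minimal sketch of those write-side settings applied through Spark's Hadoop configuration; the buffer type and part size are illustrative values, not recommendations:

```scala
import org.apache.spark.sql.SparkSession

// Write-side tuning for the Hadoop 2.8+ s3a connector.
val spark = SparkSession.builder()
  .appName("s3a-fast-upload")
  // Upload blocks as they are produced instead of buffering the whole
  // file locally and uploading it on close().
  .config("spark.hadoop.fs.s3a.fast.upload", "true")
  // Buffer pending blocks on local disk ("disk"), or in memory
  // ("array" / "bytebuffer").
  .config("spark.hadoop.fs.s3a.fast.upload.buffer", "disk")
  // Size of each multipart upload block, in bytes (here 64 MB).
  .config("spark.hadoop.fs.s3a.multipart.size", "67108864")
  .getOrCreate()
```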
A bigger issue is that this rename-based commit is very bad at handling inconsistent directory listings or failures of tasks during the commit process. You cannot safely use S3 as a direct destination of work with the normal commit-by-rename algorithm without something to give you a consistent view of the store (consistent EMRFS, S3mper, S3Guard).
For maximum performance and safe committing of work, you need an output committer optimised for S3. Databricks have their own thing there; Apache Hadoop 3.1 adds the "S3A output committers", and EMR now apparently has something here too.
See "A Zero-Rename Committer" for the details on that problem. After which, hopefully, you'll either move to a safe commit mechanism or use HDFS as a destination of work.
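For reference, wiring one of the Hadoop 3.1+ S3A committers into Spark might look roughly like the sketch below. It assumes the spark-hadoop-cloud module is on the classpath, and the property and class names should be verified against your Spark/Hadoop versions.

```scala
import org.apache.spark.sql.SparkSession

// Sketch: route Spark SQL output through an S3-aware committer instead
// of the rename-based commit.
val spark = SparkSession.builder()
  .appName("s3a-committer")
  // Pick one of the S3A committers: "directory", "partitioned" or "magic".
  .config("spark.hadoop.fs.s3a.committer.name", "directory")
  // Bind Spark's commit protocol and Parquet committer to the Hadoop
  // PathOutputCommitter machinery (from the spark-hadoop-cloud module).
  .config("spark.sql.sources.commitProtocolClass",
    "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
  .config("spark.sql.parquet.output.committer.class",
    "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
  .getOrCreate()

// With a committer in place, writing directly to S3 avoids the slow,
// unsafe copy-on-rename step:
// df.write.parquet("s3a://my-bucket/output/")
```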