What is the performance difference in Spark when reading a file from S3 vs. HDFS on EC2? Also, please explain how it works in both cases.
Reading S3 is a matter of performing authenticated HTTPS requests with the content-range header set to point to the start of the read (0 or the location you've just done a seek to), and the end (historically the end of the file; this is now optional and should be avoided for the seek-heavy ORC and Parquet inputs).
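For illustration, here is a minimal sketch of how that looks through the Hadoop FileSystem API (the bucket and object name are placeholders); each seek()-then-read() against an s3a:// path ends up as one of those ranged HTTPS requests.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Placeholder bucket/object; any s3a:// path behaves the same way.
val path = new Path("s3a://my-bucket/data/part-00000.orc")
val fs = FileSystem.get(path.toUri, new Configuration())

val in = fs.open(path)       // with the Hadoop 2.8 lazy seek, nothing is fetched yet
in.seek(1024L)               // just records the new position
val buf = new Array[Byte](4096)
val n = in.read(buf)         // triggers an authenticated HTTPS GET whose range starts at offset 1024
in.close()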
Key performance points:
- getFileStatus(): multiple HTTPS requests per call. This hurts job setup.
- close(): this is the call that does the upload, which may add delays.
- rename(): really a COPY; as rename() is used for committing tasks and jobs, this hurts performance when using S3 as a destination of work. As S3 is eventually consistent, you could lose data anyway. Write to hdfs:// then copy to s3a:// (see the sketch after this list).
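A minimal sketch of that last point, with hypothetical paths: let the job write its output to HDFS, then copy the finished result up to S3, so the commit never relies on rename() against S3.

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

val conf = new Configuration()
val hdfsOut = new Path("hdfs:///user/etl/output/run-0001")   // hypothetical job output on HDFS
val s3Dest  = new Path("s3a://my-bucket/output/run-0001")    // hypothetical final destination on S3

val hdfs = FileSystem.get(hdfsOut.toUri, conf)
val s3   = FileSystem.get(s3Dest.toUri, conf)

// ... run the Spark job with hdfsOut as its output directory ...

// One bulk upload per file; no COPY-based rename happens on the S3 side.
// deleteSource = false keeps the HDFS copy around.
FileUtil.copy(hdfs, hdfsOut, s3, s3Dest, false, conf)

For large outputs, hadoop distcp does the same copy in parallel across the cluster.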
How is this implemented? Look in the Apache Hadoop source tree for the implementations of the abstract org.apache.hadoop.fs.FileSystem class; HDFS and S3A are both examples. Here's the S3A one. The input stream, with the Hadoop 2.8 lazy seek and fadvise=random option for faster random IO, is S3AInputStream.
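If you want that behaviour from Spark, a minimal sketch (assuming an existing SparkContext named sc, as in the textFile example further down) is to set the S3A input policy before reading:

// Hadoop 2.8+ only: switch the S3A client to random IO, so seek-heavy ORC/Parquet
// reads issue short ranged GETs instead of reading ahead towards the end of the file.
sc.hadoopConfiguration.set("fs.s3a.experimental.input.fadvise", "random")

The same setting can be passed at submit time as --conf spark.hadoop.fs.s3a.experimental.input.fadvise=random.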
Looking at the article the other answer covers: it's a three-year-old article talking about S3 when it was limited to 5GB, and it misses out some key points on both sides of the argument.
I think the author had some bias towards S3 in the first place ("S3 supports compression!"), as well as some ignorance of aspects of both. (Hint: while both Parquet and ORC need seek(), we do this in the s3n and s3a S3 clients by way of the Content-Range HTTP header.)
S3 is, on non-EMR systems, a dangerous place to store intermediate data and, performance-wise, an inefficient destination of work. This is due to its eventual consistency, meaning newly created data may not be picked up by the next stage in the workflow, and because committing work with rename() doesn't work with big datasets. It all seems to work well in development, but production is where the scale problems hit.
Looking at the example code: the AWS secrets end up visible to anyone who can run a ps command, and you don't want that.

If you are planning to use Spark SQL, then you might want to consider the points below:
- When your external tables are pointing to S3, Spark SQL regresses considerably. You might even encounter memory issues like org.apache.spark.shuffle.FetchFailedException: Too large frame or java.lang.OutOfMemoryError.
- Another observation: if a shuffle block is over 2GB, the shuffle fails. This issue occurs when external tables are pointing to S3 (see the sketch after this list).
- Spark SQL performance on HDFS is 50% faster on a 50MM / 10GB dataset compared to S3.
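For context, here is a minimal sketch of the kind of setup those observations describe (table name, schema, and bucket are hypothetical), together with one common mitigation for the oversized-shuffle-block errors: raising spark.sql.shuffle.partitions so that individual shuffle blocks stay well under 2GB.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("external-table-on-s3")
  .enableHiveSupport()
  .getOrCreate()

// An external table whose data lives on S3 -- the setup the observations above describe.
spark.sql(
  """CREATE EXTERNAL TABLE IF NOT EXISTS sales (id BIGINT, amount DOUBLE)
    |STORED AS PARQUET
    |LOCATION 's3a://my-bucket/warehouse/sales'""".stripMargin)

// More shuffle partitions mean smaller individual shuffle blocks, which helps
// avoid the "Too large frame" fetch failures mentioned above.
spark.conf.set("spark.sql.shuffle.partitions", "2000")

spark.sql("SELECT id, SUM(amount) AS total FROM sales GROUP BY id").show()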
Here is a good article on this topic that is worth going through:
storing-apache-hadoop-data-cloud-hdfs-vs-s3
To conclude: with better scalability, built-in persistence, and lower prices, S3 is the winner! Nonetheless, for better performance and no limits on file sizes or storage formats, HDFS is the way to go.
When accessing files from S3, the s3a URI scheme gives better performance than s3n, and with s3a there is also no 5GB file size limit.
val data = sc.textFile("s3a://bucket-name/key")
You can submit the Scala jar file for Spark like this, for example:
spark-submit \
--master local[2] \
--packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11,org.apache.hadoop:hadoop-aws:2.7.3 \
--conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
--conf spark.hadoop.fs.s3a.access.key=xxxx \
--conf spark.hadoop.fs.s3a.secret.key=xxxxxxx \
--class org.etl.jobs.sprint.SprintBatchEtl \
target/scala-2.11/test-ingestion-assembly-0.0.1-SNAPSHOT.jar