 

Apache Spark performance on AWS S3 vs EC2 HDFS

Tags:

apache-spark

What is the performance difference when Spark reads a file from S3 versus HDFS on EC2? Also, please explain how it works in both cases.

asked Mar 14 '17 by Nizamudeen


3 Answers

Reading from S3 is a matter of performing authenticated HTTPS requests with the Range header set to point to the start of the read (0, or the location you've just seeked to) and the end (historically the end of the file; this is now optional and should be avoided for the seek-heavy ORC and Parquet inputs).
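
To make the read path concrete, here is a minimal Scala sketch against the plain Hadoop FileSystem API (the bucket and object names are placeholders, and it assumes hadoop-aws plus the matching AWS SDK are on the classpath): every seek()/read() on an s3a:// stream ends up as a ranged HTTPS GET against S3.

    // Minimal sketch only: bucket/key below are placeholders, and hadoop-aws
    // plus the matching AWS SDK JAR must be on the classpath.
    import java.net.URI
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val conf = new Configuration()
    val fs   = FileSystem.get(new URI("s3a://my-bucket/"), conf)

    val in = fs.open(new Path("s3a://my-bucket/data/part-00000.orc"))
    in.seek(1024L)                     // positions the stream; the next read issues a ranged GET
    val buf = new Array[Byte](4096)
    val n   = in.read(buf)             // bytes fetched from the current offset onwards
    in.close()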

Key performance points:

  • Read: you don't get locality of access; network bandwidth is limited by the VMs you rent.
  • S3 is way slower on seeks, partly addressed in the forthcoming Hadoop 2.8.
  • S3 is way, way slower on metadata operations (list, getFileStatus()). This hurts job setup.
  • Write: not so bad, except that pre-Hadoop-2.8 the client waits until the close() call to do the upload, which may add delays.
  • rename(): really a COPY; as rename() is used for committing tasks and jobs, this hurts performance when using S3 as a destination of work. As S3 is eventually consistent, you could lose data anyway. Write to hdfs:// then copy to s3a:// (a sketch of that pattern follows this list).
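
A sketch of that last point, i.e. commit to HDFS first and only copy the finished output to S3; the paths and application name here are placeholders, not anything from the question.

    // Write the job output to HDFS, where rename()-based commits are cheap metadata operations...
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("hdfs-then-s3").getOrCreate()
    val df = spark.read.parquet("hdfs:///data/input")

    df.write.mode("overwrite").parquet("hdfs:///data/output/run-001")

    // ...then push the completed output to S3 as a separate step, for example:
    //   hadoop distcp hdfs:///data/output/run-001 s3a://my-bucket/output/run-001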

How is this implemented? Look in the Apache Hadoop source tree for the implementations of the abstract org.apache.hadoop.fs.FileSystem class; HDFS and S3A are both examples. Here's the S3A one. The input stream, with the Hadoop 2.8 lazy seek and fadvise=random option for faster random IO, is S3AInputStream.
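
If you want to try the random-IO read policy from Spark, something like the following should work on Hadoop 2.8+; the property below is the Hadoop 2.8 fs.s3a.experimental.input.fadvise option, and the paths are placeholders.

    // Sketch: turn on the S3A random-IO read policy, which suits seek-heavy ORC/Parquet reads.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("s3a-random-io")
      .config("spark.hadoop.fs.s3a.experimental.input.fadvise", "random")
      .getOrCreate()

    val events = spark.read.orc("s3a://my-bucket/tables/events/")   // placeholder path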


Looking at the article the other answer covers: it's a three-year-old article talking about S3 when it was limited to 5GB, and it misses some key points on both sides of the argument.

I think the author had some bias towards S3 in the first place ("S3 supports compression!"), as well as some ignorance of aspects of both. (Hint: while both Parquet and ORC need seek(), we do this in the s3n and s3a S3 clients by way of the HTTP Range header.)

S3 is, on non-EMR systems, a dangerous place to store intermediate data and, performance-wise, an inefficient destination for work. This is due to its eventual consistency, which means newly created data may not be picked up by the next stage in the workflow, and because committing work with rename() doesn't work with big datasets. It all seems to work well in development, but production is where the scale problems hit.

Looking at the example code,

  1. You'll need the version of the Amazon S3 SDK JAR that matches your Hadoop version; for Hadoop 2.7 that's 1.7.4. That pairing has proven to be very brittle.
  2. Best to put the s3a secrets into spark-defaults.conf, or leave them as AWS_ environment variables and let spark-submit propagate them automatically. Putting them on the command line makes them visible in a ps listing, and you don't want that. (A sketch of the spark-defaults.conf route follows this list.)
  3. S3A will actually use IAM authentication: if you are submitting to an EC2 VM, you should not need to provide any secrets, as it will pick up the credentials given to the VM at launch time.
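
For point 2, a sketch of what the spark-defaults.conf entries could look like; the values are placeholders, and spark.hadoop.* properties are passed straight through to the Hadoop configuration.

    # spark-defaults.conf -- values below are placeholders
    spark.hadoop.fs.s3a.access.key   xxxx
    spark.hadoop.fs.s3a.secret.key   xxxxxxx
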
answered Sep 19 '22 by stevel


If you are planning to use Spark SQL, then you might want to consider the points below:

  • When your external tables are pointing to S3, Spark SQL regresses considerably. You might even encounter memory issues such as org.apache.spark.shuffle.FetchFailedException: Too large frame, or java.lang.OutOfMemoryError.

  • Another observation: if a shuffle block is over 2GB, the shuffle fails. This issue occurs when external tables are pointing to S3. (A mitigation sketch follows this list.)

  • Spark SQL on HDFS was 50% faster than on S3 for a 50MM-row / 10GB dataset.
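
One common mitigation for the 2GB shuffle-block limit mentioned above is to increase parallelism so each shuffle block stays well under 2GB. A rough Scala sketch follows; the table name, partition count, and output path are illustrative only.

    // Raise the shuffle partition count so individual shuffle blocks stay small.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder().appName("s3-external-tables").getOrCreate()
    spark.conf.set("spark.sql.shuffle.partitions", "2000")   // illustrative value

    val events = spark.table("events_s3")                    // external table backed by S3 (placeholder)
    val rolled = events.groupBy("event_date").count()
    rolled.write.mode("overwrite").parquet("hdfs:///tmp/rollup")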

answered Sep 21 '22 by Vikrame


Here is a good article on this topic that is worth going through:

storing-apache-hadoop-data-cloud-hdfs-vs-s3

To conclude: with better scalability, built-in persistence, and lower prices, S3 is the winner! Nonetheless, for better performance and no limitations on file sizes or storage formats, HDFS is the way to go.

While accessing files from S3, using the s3a URI scheme gives better performance than s3n, and with s3a there is no 5GB file size limit.

val data = sc.textFile("s3a://bucket-name/key")

You can submit the Scala jar file to Spark like this, for example:

  spark-submit \
  --master local[2] \
  --packages datastax:spark-cassandra-connector:2.0.0-M2-s_2.11,org.apache.hadoop:hadoop-aws:2.7.3 \
  --conf spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem \
  --conf spark.hadoop.fs.s3a.access.key=xxxx \
  --conf spark.hadoop.fs.s3a.secret.key=xxxxxxx \
  --class org.etl.jobs.sprint.SprintBatchEtl \
  target/scala-2.11/test-ingestion-assembly-0.0.1-SNAPSHOT.jar
answered Sep 21 '22 by sasubillis