
S3A: fails while S3: works in Spark EMR

I'm using EMR 5.5.0 with Spark. If I write a simple file to S3 using an s3://... URL, it writes fine. But if I use an s3a://... address, it fails with: Service: Amazon S3; Status Code: 403; Error Code: AccessDenied

Using the AWS command line I can cp, mv, and rm any file in the path I'm writing to. But from Spark, s3a fails on the PUT request.

We have server-side encryption enabled, and I know Spark is aware of it because the s3:// URLs work. Any ideas?

Failed PUT DEBUG logs here. Maybe it's important to note: I'm calling rdd.saveAsTextFile(path), but the PUT says it's trying to write to /my-bucket/tmp/carlos/testWrite/4/_temporary/0/, which I thought it should only do for Parquet. Not sure if that detail is relevant, but I thought I would mention it.

asked Aug 11 '17 at 14:08 by Carlos Bribiescas




1 Answer

s3a is the actively maintained S3 client in Apache Hadoop. AWS forked its own client from the Apache s3n:// client many years ago and has (presumably) reworked it massively since.

Both can read and write the same data, but some parts of EMR expect extra methods in the filesystem client that only EMR's s3:// client supports, so you cannot safely use s3a on EMR.
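
On EMR, one way to act on this is to normalize output paths to the s3:// scheme before handing them to Spark. A minimal sketch in Python follows; to_emrfs_uri is a hypothetical helper written for illustration, not part of Spark, Hadoop, or EMR, and the bucket path is the one from the question.

```python
def to_emrfs_uri(uri: str) -> str:
    """Rewrite s3a:// or s3n:// URIs to s3:// so that EMR's own client is used.

    Hypothetical helper for illustration only. Any other URI (hdfs://,
    file://, an existing s3://, or a bare path) is returned unchanged.
    """
    scheme, sep, rest = uri.partition("://")
    if sep and scheme.lower() in {"s3a", "s3n"}:
        return "s3://" + rest
    return uri


# Intended use inside a PySpark job on EMR (assumes an existing RDD named rdd):
#   rdd.saveAsTextFile(to_emrfs_uri("s3a://my-bucket/tmp/carlos/testWrite/4"))
```

This only changes which filesystem client handles the path; it does not change bucket policy or encryption settings.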

There's also the original ASF s3:// client, which is incompatible with everything else but was the first code used to connect Hadoop to S3, well before EMR was an Amazon product.

Which is better? As of Aug 2017, S3A is probably faster on aggressive read IO of columnar formats like ORC and Parquet. EMR's s3://, with EMRFS, probably has the edge in resilience and consistency, but the open source ASF S3A client is moving to address those gaps.

answered Oct 06 '22 at 18:10 by stevel