Spark/Hadoop - Not able to save to s3 with server side encryption

Question

I am running an AWS EMR cluster to run Spark jobs. To work with S3 buckets, the Hadoop configuration is set with the access key, the secret key, enableServerSideEncryption, and the algorithm to use for encryption. Please see the code below.

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", "xxx")
hadoopConf.set("fs.s3.awsSecretAccessKey", "xxx")
hadoopConf.set("fs.s3.enableServerSideEncryption", "true")
hadoopConf.set("fs.s3.serverSideEncryptionAlgorithm", "AES256")

With the above configuration, the Spark program can read from the S3 bucket and perform the processing, but it fails when it tries to save results to a bucket that enforces encryption. If the bucket allows unencrypted data, the results are saved successfully, but unencrypted.

This happens even if the cluster is built with the option that enforces server side encryption --emrfs Encryption=ServerSide,Args=[fs.s3.serverSideEncryptionAlgorithm=AES256].

hadoop distcp from HDFS on the EMR cluster to S3 also fails. However, s3-dist-cp (the AWS version of distcp) works successfully when run with the --s3ServerSideEncryption option.

The EC2 instance, however, has the required role permission to upload data to the same bucket with server-side encryption, without using any user access keys. Please see the example command below. If --sse is omitted from the command, it throws an "Access Denied" error.

aws s3 cp test.txt s3://encrypted-bucket/ --sse

It would be helpful if someone could point out the configuration required in Spark/Hadoop to save data to AWS S3 with server-side encryption.

stash · Accepted Answer

This is now solved. The --emrfs option didn't apply the configuration correctly, but the option below, passed to aws emr create-cluster, works with both Spark and hadoop distcp.

--configurations '[{"Classification":"emrfs-site","Properties":{"fs.s3.enableServerSideEncryption":"true"},"Configurations":[]}]'
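For anyone who prefers not to quote the JSON inline, here is a minimal sketch that writes the same emrfs-site classification to a file and passes it with file://. The cluster name, release label, instance type, and instance count are placeholders I made up for illustration, not values from the original setup. The command is only printed, so nothing is created by accident.

```shell
# Write the emrfs-site classification from the answer to a local file.
cat > emrfs-sse.json <<'EOF'
[
  {
    "Classification": "emrfs-site",
    "Properties": {
      "fs.s3.enableServerSideEncryption": "true"
    },
    "Configurations": []
  }
]
EOF

# Placeholder cluster settings (name, release label, instance type and count
# are assumptions). The command is printed rather than executed; remove the
# leading "echo" to actually create the cluster.
echo aws emr create-cluster \
  --name "sse-test" \
  --release-label emr-4.2.0 \
  --applications Name=Spark \
  --instance-type m3.xlarge \
  --instance-count 3 \
  --use-default-roles \
  --configurations file://./emrfs-sse.json
```

Keeping the classification in a file also makes it easier to add further properties later without fighting shell quoting.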

As the EC2 instances have been set up with the role profile to read from and write to the bucket, my Spark code worked without having to provide the AWS access keys.
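One way to confirm that the saved objects really are encrypted is to look at the object metadata, which reports a ServerSideEncryption field for SSE objects. The sketch below is a small helper that checks the JSON printed by aws s3api head-object; the helper name is_sse_aes256 and the object key in the usage comment are hypothetical.

```shell
# Succeeds (exit status 0) if the head-object JSON on stdin reports
# server-side encryption with AES256, and fails otherwise.
is_sse_aes256() {
  grep -Eq '"ServerSideEncryption"[[:space:]]*:[[:space:]]*"AES256"'
}

# Usage against a real object (bucket and key are placeholders):
#   aws s3api head-object --bucket encrypted-bucket --key output/part-00000 \
#     | is_sse_aes256 && echo "encrypted" || echo "NOT encrypted"
```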

More EMR configuration options, which can be used with the --configurations option of aws emr create-cluster, are documented at http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html

I am not sure why AWS EMR offers two options for doing the same thing when one works and the other doesn't.

Tags: amazon-s3, encryption, apache-spark, hadoop, emr
