
Spark/Hadoop - Not able to save to s3 with server side encryption

I am running an AWS EMR cluster to run spark jobs. To work with s3 buckets, the hadoop configuration is set with the access key, secret key, enableServerSideEncryption, and the algorithm to be used for encryption. Please see the code below:

val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
hadoopConf.set("fs.s3.awsAccessKeyId", "xxx")
hadoopConf.set("fs.s3.awsSecretAccessKey", "xxx")
hadoopConf.set("fs.s3.enableServerSideEncryption", "true")
hadoopConf.set("fs.s3.serverSideEncryptionAlgorithm", "AES256")

Under the above configuration, the spark program is able to read from the s3 bucket and perform the processing, but it fails when it tries to save the results to an s3 bucket that enforces server side encryption. If the bucket allows unencrypted data, the results are saved successfully, but unencrypted.
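
For reference, here is a minimal sketch of the kind of read-process-save job that fails (the bucket names and the processing step are hypothetical placeholders, not my actual job):

val input = sc.textFile("s3://some-input-bucket/data/")   // reading works fine
val results = input.map(line => line.toUpperCase)         // any processing step
results.saveAsTextFile("s3://encrypted-bucket/output/")   // fails when the bucket enforces encryption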

The save fails even if the cluster is built with the option that enforces server side encryption: --emrfs Encryption=ServerSide,Args=[fs.s3.serverSideEncryptionAlgorithm=AES256].

hadoop distcp from hdfs on the emr cluster to s3 also fails. But s3-dist-cp (the aws version of hadoop distcp), when run with the --s3ServerSideEncryption option, works successfully.
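
For comparison, a sketch of the s3-dist-cp invocation that worked (the source and destination paths are placeholders):

s3-dist-cp --src hdfs:///user/hadoop/output \
           --dest s3://encrypted-bucket/output \
           --s3ServerSideEncryption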

Meanwhile, the ec2 instances have the required role permissions to upload data to the same bucket with server side encryption, without using any user access keys. Please see the example command below. If --sse is omitted from the command, it throws an "Access Denied" error.

aws s3 cp test.txt s3://encrypted-bucket/ --sse

It would be helpful if someone could point out the configuration required in spark/hadoop to save data to aws s3 with server side encryption.

asked Feb 22 '16 by stash

1 Answer

This is now solved. The --emrfs option didn't apply the configuration correctly, but the option below, passed to aws emr create-cluster, works with both spark and hadoop distcp.

--configurations '[{"Classification":"emrfs-site","Properties":{"fs.s3.enableServerSideEncryption":"true"},"Configurations":[]}]'
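
For example, a full create-cluster invocation using this option might look like the sketch below (the cluster name, release label, and instance settings are placeholders, not my exact values):

aws emr create-cluster --name "encrypted-cluster" \
    --release-label emr-4.3.0 \
    --applications Name=Hadoop Name=Spark \
    --instance-type m3.xlarge --instance-count 3 \
    --use-default-roles \
    --configurations '[{"Classification":"emrfs-site","Properties":{"fs.s3.enableServerSideEncryption":"true"},"Configurations":[]}]'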

As the ec2 instances are set up with an instance role profile that can read/write the bucket, my spark code worked without having to provide the aws access keys.
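
So the spark code reduces to something like this sketch, with no credential configuration at all (bucket names are placeholders):

// No fs.s3.awsAccessKeyId / fs.s3.awsSecretAccessKey settings needed:
// credentials come from the EC2 instance profile, and encryption is
// enabled cluster-wide via the emrfs-site classification above.
val results = sc.textFile("s3://some-input-bucket/data/")
results.saveAsTextFile("s3://encrypted-bucket/output/")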

More emr configuration options, which can be used with the --configurations option of emr create-cluster, are documented at http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html

I am not sure why aws emr provides two options for doing the same thing. One works and the other doesn't.

answered Oct 13 '22 by stash