I am running an AWS EMR cluster to run Spark jobs. To work with S3 buckets, the Hadoop configuration is set with the access key, the secret key, enableServerSideEncryption, and the algorithm to be used for encryption. Please see the code below.
// Hadoop configuration of the running SparkContext
val hadoopConf = sc.hadoopConfiguration
hadoopConf.set("fs.s3.impl", "org.apache.hadoop.fs.s3native.NativeS3FileSystem")
// Credentials (placeholders)
hadoopConf.set("fs.s3.awsAccessKeyId", "xxx")
hadoopConf.set("fs.s3.awsSecretAccessKey", "xxx")
// Request server-side encryption with AES256 on writes
hadoopConf.set("fs.s3.enableServerSideEncryption", "true")
hadoopConf.set("fs.s3.serverSideEncryptionAlgorithm", "AES256")
With the above configuration, the Spark program is able to read from the S3 bucket and perform the processing, but it fails when it tries to save the results to an S3 bucket that enforces encryption. If the bucket allows unencrypted data, the results are saved successfully, but unencrypted.
This happens even if the cluster is built with the option that enforces server-side encryption:
--emrfs Encryption=ServerSide,Args=[fs.s3.serverSideEncryptionAlgorithm=AES256]
hadoop distcp from HDFS on the EMR cluster to S3 also fails. However, s3-dist-cp (the AWS version of Hadoop distcp) works successfully when run with the --s3ServerSideEncryption option.
On the other hand, the EC2 instance has the required role permission to upload data to the same bucket with server-side encryption, without using any user access keys. Please see the example command below. If --sse is omitted from the command, it fails with an "Access Denied" error.
aws s3 cp test.txt s3://encrypted-bucket/ --sse
It would be helpful if someone could point out the configuration required in Spark/Hadoop to save data to AWS S3 with server-side encryption.
This is now solved. The --emrfs option didn't apply the configuration correctly, but the option below, passed to aws emr create-cluster, works with both Spark and hadoop distcp:
--configurations '[{"Classification":"emrfs-site","Properties":{"fs.s3.enableServerSideEncryption":"true"},"Configurations":[]}]'
As the EC2 instances have been set up with a role profile that can read from and write to the bucket, my Spark code worked without having to provide the AWS access keys.
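For illustration, a minimal sketch of what such a key-free job might look like, assuming it runs in spark-shell (so sc already exists) on a cluster created with the emrfs-site setting above. The bucket name is taken from the aws s3 cp example in the question; the input/ and output/ prefixes and the transformation are hypothetical.

```scala
// Assumes spark-shell on the EMR cluster, where `sc` is already defined.
// No access keys are set here: credentials are expected to come from the
// EC2 instance role, and server-side encryption from the cluster-level
// emrfs-site configuration.

// Hypothetical input and output prefixes in the encrypted bucket
val input  = "s3://encrypted-bucket/input/"
val output = "s3://encrypted-bucket/output/"

// Read, apply a placeholder transformation, and write the results back
val lines = sc.textFile(input)
val processed = lines.map(_.toUpperCase)
processed.saveAsTextFile(output)
```

If the write still fails with an access-denied error, it may be worth checking whether the job itself overrides fs.s3.impl or the credential properties, since the sketch relies on the cluster defaults.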
More EMR configuration options, which can be used with the --configurations option of aws emr create-cluster, are documented at:
http://docs.aws.amazon.com/ElasticMapReduce/latest/ReleaseGuide/emr-configure-apps.html
I am not sure why AWS EMR offers two options for doing the same thing when one works and the other doesn't.