I'm writing a pyspark job that needs to read from two different S3 buckets. Each bucket has different credentials, which are stored on my machine as separate profiles in ~/.aws/credentials.
Is there a way to tell pyspark which profile to use when connecting to s3?
When using a single bucket, I had set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables in conf/spark-env.sh. Naturally, this only works for accessing one of the two buckets.
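For context, the relevant part of my conf/spark-env.sh looks roughly like this (the values are placeholders):

export AWS_ACCESS_KEY_ID=ABCD
export AWS_SECRET_ACCESS_KEY=EFGH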
I am aware that I could set these values manually in pyspark when required, using:
sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "ABCD")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "EFGH")
But I would prefer a solution where these values are not hard-coded. Is that possible?
Different S3 buckets can be accessed with different S3A client configurations. This allows for different endpoints, data read and write strategies, as well as login details.
Source: http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets
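For example, here is a minimal sketch of that approach in pyspark (not from the original post), assuming Hadoop 2.8+ and the s3a:// connector. The profile names (profile-a, profile-b) and bucket names (bucket-a, bucket-b) are placeholders, and the keys are read from ~/.aws/credentials with configparser instead of being hard-coded:

import configparser
import os

from pyspark import SparkContext

# Read both profiles from the standard AWS credentials file.
creds = configparser.ConfigParser()
creds.read(os.path.expanduser("~/.aws/credentials"))

sc = SparkContext(appName="two-bucket-job")
hconf = sc._jsc.hadoopConfiguration()

# Map each profile to the bucket it should authenticate against,
# using Hadoop's per-bucket S3A properties.
for profile, bucket in [("profile-a", "bucket-a"), ("profile-b", "bucket-b")]:
    hconf.set("fs.s3a.bucket.%s.access.key" % bucket, creds[profile]["aws_access_key_id"])
    hconf.set("fs.s3a.bucket.%s.secret.key" % bucket, creds[profile]["aws_secret_access_key"])

rdd_a = sc.textFile("s3a://bucket-a/some/path")
rdd_b = sc.textFile("s3a://bucket-b/other/path")

The same per-bucket properties can also be set outside the job, e.g. in spark-defaults.conf via the spark.hadoop. prefix (spark.hadoop.fs.s3a.bucket.bucket-a.access.key=...), so nothing needs to be hard-coded in the code at all.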