 

PySpark S3 Access with Multiple AWS Credential Profiles?

I'm writing a PySpark job that needs to read from two different S3 buckets. Each bucket has different credentials, which are stored on my machine as separate profiles in ~/.aws/credentials.

Is there a way to tell PySpark which profile to use when connecting to S3?

When using a single bucket, I had set the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables in conf/spark-env.sh. Naturally, that only works for one of the two buckets.

I am aware that I could set these values manually in PySpark when required, using:

sc._jsc.hadoopConfiguration().set("fs.s3n.awsAccessKeyId", "ABCD")
sc._jsc.hadoopConfiguration().set("fs.s3n.awsSecretAccessKey", "EFGH")

But I'd prefer a solution where these values are not hard-coded. Is that possible?

asked May 27 '16 by neal

1 Answer

Different S3 buckets can be accessed with different S3A client configurations, allowing for different endpoints, data read and write strategies, and login details.

  1. All fs.s3a options other than a small set of unmodifiable values (currently fs.s3a.impl) can be set on a per bucket basis.
  2. The bucket specific option is set by replacing the fs.s3a. prefix on an option with fs.s3a.bucket.BUCKETNAME., where BUCKETNAME is the name of the bucket.
  3. When connecting to a bucket, all options explicitly set will override the base fs.s3a. values.

Source: http://hadoop.apache.org/docs/r2.8.0/hadoop-aws/tools/hadoop-aws/index.html#Configurations_different_S3_buckets
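A minimal PySpark sketch of the above, assuming Hadoop 2.8.0+ with the S3A connector (hadoop-aws plus the AWS SDK) on the classpath. The bucket names (bucket-a, bucket-b) and profile names (profile-a, profile-b) are placeholders; substitute your own. Rather than hard-coding the keys, it reads them from ~/.aws/credentials with configparser:

import configparser
import os

from pyspark import SparkContext

def profile_keys(profile):
    # Read an access key pair from a named profile in ~/.aws/credentials.
    parser = configparser.ConfigParser()
    parser.read(os.path.expanduser("~/.aws/credentials"))
    section = parser[profile]
    return section["aws_access_key_id"], section["aws_secret_access_key"]

sc = SparkContext.getOrCreate()
hconf = sc._jsc.hadoopConfiguration()

# Hypothetical bucket/profile pairs -- substitute your own.
for bucket, profile in [("bucket-a", "profile-a"), ("bucket-b", "profile-b")]:
    access_key, secret_key = profile_keys(profile)
    # fs.s3a.bucket.BUCKETNAME.* overrides the base fs.s3a.* option for that bucket.
    hconf.set("fs.s3a.bucket.%s.access.key" % bucket, access_key)
    hconf.set("fs.s3a.bucket.%s.secret.key" % bucket, secret_key)

# Each bucket is now read with its own credentials, with nothing hard-coded in the job.
rdd_a = sc.textFile("s3a://bucket-a/some/path")
rdd_b = sc.textFile("s3a://bucket-b/some/path")

Note the s3a:// scheme: per-bucket configuration is a feature of the S3A connector, so the s3n keys shown in the question would not pick these values up.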

answered Nov 15 '22 by Mohamad Shaker