Assume I'm doing this:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf() \
.setMaster("local[2]") \
.setAppName("pyspark-unittests") \
.set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
s3File = sc.textFile("s3a://myrepo/test.csv")
print(s3File.count())
print(s3File.id())
I know that, in theory, I can do this before the 'sc.textFile(...)' call to set my credentials:
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
However, I don't have a key/secret pair; instead, I have a key/secret/token triplet (they are temporary credentials that are refreshed periodically via AssumeRole; see here for details on getting those credentials: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html)
How can I use the triplet to authenticate to AWS S3, rather than just the key and secret?
My preference would be to use com.amazonaws.auth.profile.ProfileCredentialsProvider as the credentials provider (and have the key/secret/token in ~/.aws/credentials). I would settle for providing them on the command line or hard-coded.
If I try this (with my credentials in ~/.aws/credentials):
sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
I still get this:
py4j.protocol.Py4JJavaError: An error occurred while calling o37.partitions.
: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
How can I either load credentials from ~/.aws/credentials or otherwise use a SessionToken?
AWS uses the session token to validate the temporary security credentials. Temporary credentials expire after a specified interval. After temporary credentials expire, any calls that you make with those credentials will fail, so you must generate a new set of temporary credentials.
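For reference, a minimal boto3 sketch of requesting such a triplet from STS via GetSessionToken (the 15-minute duration is an arbitrary example, and the call must be made with long-term IAM user credentials, not with another set of temporary ones):
import boto3
# Ask STS for a temporary key/secret/token triplet, valid for 15 minutes here.
response = boto3.client("sts").get_session_token(DurationSeconds=900)
creds = response["Credentials"]
# The response contains AccessKeyId, SecretAccessKey, SessionToken and Expiration.
print(sorted(creds))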
I don't see com.amazonaws.auth.profile.ProfileCredentialsProvider in the documentation. There is, however, org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider, which allows you to use the key and secret along with fs.s3a.session.token, which is where the token should go.
The instructions on that page say:
To authenticate with these:
- Declare org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as the provider.
- Set the session key in the property fs.s3a.session.token, and the access and secret key properties to those of this temporary session.
Example:
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>SESSION-ACCESS-KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>SESSION-SECRET-KEY</value>
</property>
<property>
  <name>fs.s3a.session.token</name>
  <value>SECRET-SESSION-TOKEN</value>
</property>
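In PySpark you don't need the XML file; the same properties can be set on the SparkContext's Hadoop configuration. Since the preference is to keep the triplet in ~/.aws/credentials, one option is a minimal sketch that reads it back with boto3 and hands it to the S3A connector (the "default" profile name is an assumption, the bucket path is the placeholder from the question, and this is not something spelled out in the Hadoop docs):
import os
import boto3
from pyspark import SparkConf, SparkContext

# hadoop-aws must be on the classpath, as in the question
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'

# Resolve the key/secret/token triplet that boto3 reads from ~/.aws/credentials
# (assumes a "default" profile that contains aws_session_token).
creds = boto3.Session(profile_name="default").get_credentials().get_frozen_credentials()

conf = SparkConf().setMaster("local[2]").setAppName("pyspark-unittests")
sc = SparkContext(conf=conf)

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", creds.access_key)
hadoop_conf.set("fs.s3a.secret.key", creds.secret_key)
hadoop_conf.set("fs.s3a.session.token", creds.token)

print(sc.textFile("s3a://myrepo/test.csv").count())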
If your current AWS role is allowed to assume the cross-account role, you can use boto3 to get temporary session credentials:
import boto3
role_session_name = "test-s3-read"
role_arn = "arn:aws:iam::1234567890:role/crossaccount-role"
duration_seconds = 60*15 # duration of the session in seconds
credentials = boto3.client("sts").assume_role(
RoleArn=role_arn,
RoleSessionName=role_session_name,
DurationSeconds=duration_seconds
)['Credentials']
How can I use the triplet to authenticate to AWS S3, rather than just the key and secret?
You can pass the AccessKeyId, SecretAccessKey and SessionToken to Spark like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
.appName("test bucket access") \
.config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider") \
.config("spark.hadoop.fs.s3a.access.key", credentials['AccessKeyId']) \
.config("spark.hadoop.fs.s3a.secret.key", credentials['SecretAccessKey']) \
.config("spark.hadoop.fs.s3a.session.token", credentials['SessionToken']) \
.getOrCreate()
Verified with Spark 2.4.4; it might not work with older versions.
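With the session built this way, a read against the bucket from the question should then work, for example:
# Read the CSV from the question as plain text lines via the S3A connector.
df = spark.read.text("s3a://myrepo/test.csv")
print(df.count())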