 

How do I use an AWS SessionToken to read from S3 in pyspark?

Assume I'm doing this:

import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'
from pyspark import SparkConf
from pyspark import SparkContext

conf = SparkConf() \
        .setMaster("local[2]") \
        .setAppName("pyspark-unittests") \
        .set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
s3File = sc.textFile("s3a://myrepo/test.csv")
print(s3File.count())
print(s3File.id())

I know that, in theory, I can do this before the 'sc.textFile(...)' call to set my credentials:

sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')

However, I don't have a key/secret pair. Instead, I have a key/secret/token triplet (temporary credentials that are refreshed periodically via AssumeRole; see https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html for details on obtaining them).

How can I use the triplet to authenticate to AWS S3, rather than just the key and secret?

My preference would be to use com.amazonaws.auth.profile.ProfileCredentialsProvider as the credentials provider (and keep the key/secret/token in ~/.aws/credentials), but I would settle for providing them on the command line or hard-coding them.
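
For reference, a ~/.aws/credentials profile holding temporary credentials would look roughly like this (the profile name and values below are placeholders):

[default]
aws_access_key_id = TEMP-ACCESS-KEY
aws_secret_access_key = TEMP-SECRET-KEY
aws_session_token = TEMP-SESSION-TOKEN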

If I try this (with my credentials in ~/.aws/credentials):

sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")

I still get this:

py4j.protocol.Py4JJavaError: An error occurred while calling o37.partitions.
: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain

How can I either load credentials from ~/.aws/credentials or otherwise use a SessionToken?

asked May 08 '18 by Jared




2 Answers

I don't see com.amazonaws.auth.profile.ProfileCredentialsProvider in the Hadoop S3A documentation. There is, however, org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider, which accepts the access key and secret key along with a session token supplied in fs.s3a.session.token.

The instructions on that page say:

To authenticate with these:

  1. Declare org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as the provider.
  2. Set the session key in the property fs.s3a.session.token, and the access and secret key properties to those of this temporary session.

Example:

<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>

<property>
  <name>fs.s3a.access.key</name>
  <value>SESSION-ACCESS-KEY</value>
</property>

<property>
  <name>fs.s3a.secret.key</name>
  <value>SESSION-SECRET-KEY</value>
</property>

<property>
  <name>fs.s3a.session.token</name>
  <value>SECRET-SESSION-TOKEN</value>
</property>
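
Since the question is about PySpark rather than an XML config file, here is a minimal sketch of the same four settings applied on the SparkContext's Hadoop configuration, using placeholder values and the path from the question:

# Same four properties as the XML above, set programmatically (placeholder values)
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoopConf.set("fs.s3a.access.key", "SESSION-ACCESS-KEY")
hadoopConf.set("fs.s3a.secret.key", "SESSION-SECRET-KEY")
hadoopConf.set("fs.s3a.session.token", "SECRET-SESSION-TOKEN")

# Then read as in the question
s3File = sc.textFile("s3a://myrepo/test.csv")
print(s3File.count())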
answered Sep 22 '22 by kichik


If your current AWS role is allowed to assume the cross-account role, you can use boto3 to get temporary session credentials:

import boto3

role_session_name = "test-s3-read"
role_arn = "arn:aws:iam::1234567890:role/crossaccount-role"
duration_seconds = 60*15 # duration of the session in seconds

credentials = boto3.client("sts").assume_role(
    RoleArn=role_arn,
    RoleSessionName=role_session_name,
    DurationSeconds=duration_seconds
)['Credentials']

"How can I use the triplet to authenticate to AWS S3, rather than just the key and secret?"

You can pass the AccessKeyId, SecretAccessKey and SessionToken to Spark like this:

spark = SparkSession.builder \
    .appName("test bucket access") \
    .config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
    .config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider") \
    .config("spark.hadoop.fs.s3a.access.key", credentials['AccessKeyId']) \
    .config("spark.hadoop.fs.s3a.secret.key", credentials['SecretAccessKey']) \
    .config("spark.hadoop.fs.s3a.session.token", credentials['SessionToken']) \
    .getOrCreate()

Verified with Spark 2.4.4; it might not work with older versions.
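
Once the session exists, reading from the bucket works as usual; for example, with the test file from the question:

lines = spark.read.text("s3a://myrepo/test.csv")
print(lines.count())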

answered Sep 24 '22 by Joost Döbken