Assume I'm doing this:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'
from pyspark import SparkConf
from pyspark import SparkContext
conf = SparkConf() \
.setMaster("local[2]") \
.setAppName("pyspark-unittests") \
.set("spark.sql.parquet.compression.codec", "snappy")
sc = SparkContext(conf=conf)
s3File = sc.textFile("s3a://myrepo/test.csv")
print(s3File.count())
print(s3File.id())
I know that, in theory, I can do this before the 'sc.textFile(...)' call to set my credentials:
sc._jsc.hadoopConfiguration().set('fs.s3a.access.key', 'awsKey')
sc._jsc.hadoopConfiguration().set('fs.s3a.secret.key', 'awsSecret')
However, I don't have a key/secret pair; instead, I have a key/secret/token triplet (they are temporary credentials that are refreshed periodically via AssumeRole; see here for details on getting those credentials: https://docs.aws.amazon.com/IAM/latest/UserGuide/id_credentials_temp_request.html)
How can I use the triplet to authenticate to AWS S3, rather than just the key and secret?
My preference would be to use com.amazonaws.auth.profile.ProfileCredentialsProvider as the credentials provider (and have the key/secret/token in ~/.aws/credentials). I would settle for providing them on the command line or hard-coded.
If I try this (with my credentials in ~/.aws/credentials):
sc._jsc.hadoopConfiguration().set("fs.s3a.aws.credentials.provider", "com.amazonaws.auth.profile.ProfileCredentialsProvider")
I still get this:
py4j.protocol.Py4JJavaError: An error occurred while calling o37.partitions.
: com.amazonaws.AmazonClientException: Unable to load AWS credentials from any provider in the chain
How can I either load credentials from ~/.aws/credentials or otherwise use a SessionToken?
AWS uses the session token to validate the temporary security credentials. Temporary credentials expire after a specified interval. After temporary credentials expire, any calls that you make with those credentials will fail, so you must generate a new set of temporary credentials.
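For reference, a minimal boto3 sketch of requesting such a triplet from STS via GetSessionToken (the 15-minute duration is an arbitrary example, and the call must be made with long-term IAM user credentials, not with another set of temporary ones):
import boto3
# Ask STS for a temporary key/secret/token triplet, valid for 15 minutes here.
response = boto3.client("sts").get_session_token(DurationSeconds=900)
creds = response["Credentials"]
# The response contains AccessKeyId, SecretAccessKey, SessionToken and Expiration.
print(sorted(creds))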
I don't see com.amazonaws.auth.profile.ProfileCredentialsProvider in the documentation. There is, however, org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider, which allows you to use the key and secret along with fs.s3a.session.token, which is where the token should go.
The instructions on that page say:
To authenticate with these:
- Declare org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider as the provider.
- Set the session key in the property fs.s3a.session.token, and the access and secret key properties to those of this temporary session.
Example:
<property>
  <name>fs.s3a.aws.credentials.provider</name>
  <value>org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider</value>
</property>
<property>
  <name>fs.s3a.access.key</name>
  <value>SESSION-ACCESS-KEY</value>
</property>
<property>
  <name>fs.s3a.secret.key</name>
  <value>SESSION-SECRET-KEY</value>
</property>
<property>
  <name>fs.s3a.session.token</name>
  <value>SECRET-SESSION-TOKEN</value>
</property>
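In PySpark you don't need the XML file; the same properties can be set on the SparkContext's Hadoop configuration. Since the preference is to keep the triplet in ~/.aws/credentials, one option is a minimal sketch that reads it back with boto3 and hands it to the S3A connector (the "default" profile name is an assumption, the bucket path is the placeholder from the question, and this is not something spelled out in the Hadoop docs):
import os
import boto3
from pyspark import SparkConf, SparkContext

# hadoop-aws must be on the classpath, as in the question
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages "org.apache.hadoop:hadoop-aws:2.7.3" pyspark-shell'

# Resolve the key/secret/token triplet that boto3 reads from ~/.aws/credentials
# (assumes a "default" profile that contains aws_session_token).
creds = boto3.Session(profile_name="default").get_credentials().get_frozen_credentials()

conf = SparkConf().setMaster("local[2]").setAppName("pyspark-unittests")
sc = SparkContext(conf=conf)

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set("fs.s3a.aws.credentials.provider",
                "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider")
hadoop_conf.set("fs.s3a.access.key", creds.access_key)
hadoop_conf.set("fs.s3a.secret.key", creds.secret_key)
hadoop_conf.set("fs.s3a.session.token", creds.token)

print(sc.textFile("s3a://myrepo/test.csv").count())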
If your current AWS role is allowed to assume the cross-account role, you can use boto3 to get temporary session credentials:
import boto3
role_session_name = "test-s3-read"
role_arn = "arn:aws:iam::1234567890:role/crossaccount-role"
duration_seconds = 60*15 # duration of the session in seconds
credentials = boto3.client("sts").assume_role(
RoleArn=role_arn,
RoleSessionName=role_session_name,
DurationSeconds=duration_seconds
)['Credentials']
How can I use the triplet to authenticate to AWS S3, rather than just the key and secret?
You can pass the AccessKeyId, SecretAccessKey and SessionToken to Spark like this:
from pyspark.sql import SparkSession

spark = SparkSession.builder \
.appName("test bucket access") \
.config("spark.hadoop.fs.s3.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem") \
.config("spark.hadoop.fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider") \
.config("spark.hadoop.fs.s3a.access.key", credentials['AccessKeyId']) \
.config("spark.hadoop.fs.s3a.secret.key", credentials['SecretAccessKey']) \
.config("spark.hadoop.fs.s3a.session.token", credentials['SessionToken']) \
.getOrCreate()
Verified with Spark 2.4.4; it might not work with older versions.
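With the session built this way, a read against the bucket from the question should then work, for example:
# Read the CSV from the question as plain text lines via the S3A connector.
df = spark.read.text("s3a://myrepo/test.csv")
print(df.count())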