
PySpark using IAM roles to access S3

I'm wondering if PySpark supports S3 access using IAM roles. Specifically, I have a business constraint where I have to assume an AWS role in order to access a given bucket. This is fine when using boto (as it's part of the API), but I can't find a definitive answer as to whether PySpark supports this out of the box.

Ideally, I'd like to be able to assume a role when running in standalone mode locally and point my SparkContext to that S3 path. I've seen that non-IAM calls usually follow this pattern:

spark_conf = SparkConf().setMaster('local[*]').setAppName('MyApp')
sc = SparkContext(conf=spark_conf)
rdd = sc.textFile('s3://<MY-ID>:<MY-KEY>@some-bucket/some-key')

Does something like this exist for providing IAM info?

rdd = sc.textFile('s3://<MY-ID>:<MY-KEY>:<MY-SESSION>@some-bucket/some-key')

or

rdd = sc.textFile('s3://<ROLE-ARN>:<ROLE-SESSION-NAME>@some-bucket/some-key')

If not, what are the best practices for working with IAM creds? Is it even possible?

I'm using Python 2.7 and PySpark 1.6.0.

Thanks!

Nick asked Mar 22 '16 at 21:03


3 Answers

IAM-role access to S3 is only supported by s3a, because it uses the AWS SDK.

You need to put the hadoop-aws JAR and the aws-java-sdk JAR (and the third-party JARs in its package) on your CLASSPATH.

hadoop-aws link.

aws-java-sdk link.

Then set this in core-site.xml:

<property>
    <name>fs.s3.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
    <name>fs.s3a.impl</name>
    <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
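
If you'd rather not edit core-site.xml, the same two properties can usually be set on the Hadoop configuration from PySpark before anything is read. A minimal sketch, assuming hadoop-aws and aws-java-sdk are already on the classpath (the bucket path is just a placeholder):

from pyspark import SparkConf, SparkContext

# Sketch only: route both s3:// and s3a:// URIs through the s3a connector.
spark_conf = SparkConf().setMaster('local[*]').setAppName('MyApp')
sc = SparkContext(conf=spark_conf)

hadoop_conf = sc._jsc.hadoopConfiguration()
hadoop_conf.set('fs.s3.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')
hadoop_conf.set('fs.s3a.impl', 'org.apache.hadoop.fs.s3a.S3AFileSystem')

rdd = sc.textFile('s3a://some-bucket/some-key')  # placeholder path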
chutium answered Sep 28 '22 at 08:09


Hadoop 2.8+'s s3a connector supports IAM roles via a new credential provider; it's not in the Hadoop 2.7 release.

To use it, you need to change the credential provider:

fs.s3a.aws.credentials.provider = org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider
fs.s3a.access.key = <your access key>
fs.s3a.secret.key = <session secret>
fs.s3a.session.token = <session token>
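
Since the question already uses boto to assume the role, one way to wire the two together is to call STS yourself and hand the temporary credentials to s3a through the Hadoop configuration. A hedged sketch using boto3 (the role ARN, session name, and bucket path are placeholders; it assumes a Hadoop 2.8+ build with the provider above on the classpath):

import boto3
from pyspark import SparkConf, SparkContext

# Sketch only: assume the role via STS, then feed the temporary credentials
# to the s3a connector.
creds = boto3.client('sts').assume_role(
    RoleArn='arn:aws:iam::123456789012:role/my-role',  # placeholder ARN
    RoleSessionName='my-session'                       # placeholder name
)['Credentials']

sc = SparkContext(conf=SparkConf().setMaster('local[*]').setAppName('MyApp'))
hc = sc._jsc.hadoopConfiguration()
hc.set('fs.s3a.aws.credentials.provider',
       'org.apache.hadoop.fs.s3a.TemporaryAWSCredentialsProvider')
hc.set('fs.s3a.access.key', creds['AccessKeyId'])
hc.set('fs.s3a.secret.key', creds['SecretAccessKey'])
hc.set('fs.s3a.session.token', creds['SessionToken'])

rdd = sc.textFile('s3a://some-bucket/some-key')  # placeholder path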

What is in Hadoop 2.7 (and enabled by default) is picking up the AWS_ environment variables.

If you set the AWS environment variables for the session login on your local system and on the remote ones, then they should get picked up.
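
For example, in local standalone mode one option is to set the standard AWS_ variables from Python before the SparkContext (and its JVM) is created, so the launched JVM inherits them. A rough sketch; the values are placeholders, and it assumes the driver JVM is started as a child of the Python process, as it is in local mode:

import os
from pyspark import SparkConf, SparkContext

# Sketch only: export the session credentials before the JVM starts.
os.environ['AWS_ACCESS_KEY_ID'] = '<temporary access key>'      # placeholder
os.environ['AWS_SECRET_ACCESS_KEY'] = '<temporary secret key>'  # placeholder
os.environ['AWS_SESSION_TOKEN'] = '<session token>'             # placeholder

sc = SparkContext(conf=SparkConf().setMaster('local[*]').setAppName('MyApp'))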

I know it's a pain, but as far as the Hadoop team are concerned, Hadoop 2.7 shipped mid-2016 and we've done a lot since then, stuff which we aren't going to backport.

stevel answered Sep 28 '22 at 06:09


IAM role-based access to files in S3 is supported by Spark; you just need to be careful with your config. Specifically, you need:

  • Compatible versions of aws-java-sdk and hadoop-aws. This is quite brittle, so only specific combinations work.
  • You must use the S3AFileSystem, not NativeS3FileSystem. The former permits role-based access, whereas the latter only allows user credentials.

To find out which combinations work, go to hadoop-aws on mvnrepository here. Click through to the version of hadoop-aws you have and look for the version of the aws-java-sdk compile dependency.

To find out what version of hadoop-aws you are using, in PySpark you can execute:

sc._gateway.jvm.org.apache.hadoop.util.VersionInfo.getVersion()

where sc is the SparkContext

This is what worked for me:

import os
import pyspark
from pyspark import SparkContext
from pyspark.sql import SparkSession

# Pull in the matching aws-java-sdk / hadoop-aws pair before the JVM is launched.
os.environ['PYSPARK_SUBMIT_ARGS'] = '--packages com.amazonaws:aws-java-sdk:1.7.4,org.apache.hadoop:hadoop-aws:2.7.1 pyspark-shell'

sc = SparkContext.getOrCreate()

# Use the S3AFileSystem, which supports role-based access.
hadoopConf = sc._jsc.hadoopConfiguration()
hadoopConf.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")

spark = SparkSession(sc)

df = spark.read.csv("s3a://mybucket/spark/iris/", header=True)
df.show()

It's the specific combination of aws-java-sdk:1.7.4 and hadoop-aws:2.7.1 that made it work. There is good guidance on troubleshooting s3a access here.

In particular, note that:

Randomly changing hadoop- and aws- JARs in the hope of making a problem "go away" or to gain access to a feature you want, will not lead to the outcome you desire.

Here is a useful post containing further information.

Here's some more useful information about compatibility between the Java libraries.

I was trying to get this to work in the Jupyter PySpark notebook. Note that the hadoop-aws version had to match the Hadoop install in the Dockerfile, i.e. here.

RobinL answered Sep 28 '22 at 06:09