Running EMR Spark With Multiple S3 Accounts

I have an EMR Spark job that needs to read data from S3 in one account and write it to S3 in another.
I split my job into two steps.

  1. Read data from S3 (no credentials required because my EMR cluster is in the same account).

  2. Read the data that step 1 stored in the local HDFS and write it to an S3 bucket in another account.

I've attempted setting the hadoopConfiguration:

sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<your access key>")
sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<your secret key>")

And exporting the keys on the cluster:

$ export AWS_SECRET_ACCESS_KEY=
$ export AWS_ACCESS_KEY_ID=

I've tried both cluster and client mode, as well as spark-shell, with no luck.

Each of them returns an error:

ERROR ApplicationMaster: User class threw exception: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: Access Denied
jspooner asked Nov 01 '16 16:11



1 Answer

The solution is actually quite simple.

Firstly, EMR clusters have two roles:

  • A service role (EMR_DefaultRole) that grants permissions to the EMR service (e.g., for launching Amazon EC2 instances)
  • An EC2 role (EMR_EC2_DefaultRole) that is attached to the EC2 instances launched in the cluster, giving them access to AWS credentials (see Using an IAM Role to Grant Permissions to Applications Running on Amazon EC2 Instances)

These roles are explained in: Default IAM Roles for Amazon EMR

Therefore, each EC2 instance launched in the cluster is assigned the EMR_EC2_DefaultRole role, which makes temporary credentials available via the Instance Metadata service. (For an explanation of how this works, see: IAM Roles for Amazon EC2.) Amazon EMR nodes use these credentials to access AWS services such as S3, SNS, SQS, CloudWatch and DynamoDB.

Secondly, you will need to add permissions to the Amazon S3 bucket in the other account to permit access via the EMR_EC2_DefaultRole role. This can be done by adding a bucket policy to the S3 bucket (here named other-account-bucket) like this:

{
    "Id": "Policy1",
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1",
            "Action": "s3:*",
            "Effect": "Allow",
            "Resource": [
                "arn:aws:s3:::other-account-bucket",
                "arn:aws:s3:::other-account-bucket/*"
            ],
            "Principal": {
                "AWS": [
                    "arn:aws:iam::ACCOUNT-NUMBER:role/EMR_EC2_DefaultRole"
                ]
            }
        }
    ]
}

This policy grants all S3 permissions (s3:*) to the EMR_EC2_DefaultRole role that belongs to the account matching the ACCOUNT-NUMBER in the policy, which should be the account in which the EMR cluster was launched. Be careful when granting such broad permissions: you will probably want to grant only the actions the job needs (for example, s3:PutObject for writing or s3:GetObject for reading) rather than all S3 permissions.
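
As a rough sketch of a narrower grant, the shell function below prints a bucket policy that allows listing the bucket plus reading, writing and deleting objects, instead of s3:*. The function name render_bucket_policy and the account number used in the usage note are hypothetical. The action list is an assumption: Spark jobs that write to S3 typically create and clean up temporary objects, which is why GetObject and DeleteObject are included alongside PutObject, but your job may need more or fewer actions.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of any AWS tool): prints a bucket policy
# restricted to the actions a Spark write job is ASSUMED to need.
# $1 = account ID that owns the EMR cluster, $2 = bucket in the other account
render_bucket_policy() {
  local account_id="$1"
  local bucket="$2"
  local principal="arn:aws:iam::${account_id}:role/EMR_EC2_DefaultRole"
  cat <<EOF
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListBucket",
            "Effect": "Allow",
            "Action": "s3:ListBucket",
            "Resource": "arn:aws:s3:::${bucket}",
            "Principal": { "AWS": "${principal}" }
        },
        {
            "Sid": "ReadWriteObjects",
            "Effect": "Allow",
            "Action": ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
            "Resource": "arn:aws:s3:::${bucket}/*",
            "Principal": { "AWS": "${principal}" }
        }
    ]
}
EOF
}
```

For example, render_bucket_policy 123456789012 other-account-bucket > policy.json writes the policy to a file, which the bucket owner could then apply with aws s3api put-bucket-policy --bucket other-account-bucket --policy file://policy.json.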

That's all! The bucket in the other account will now accept requests from the EMR nodes because they are using the EMR_EC2_DefaultRole role.

Disclaimer: I tested the above by creating a bucket in Account-A and assigning permissions (as shown above) to a role in Account-B. An EC2 instance was launched in Account-B with that role. I was able to access the bucket from the EC2 instance via the AWS Command-Line Interface (CLI). I did not test it within EMR; however, it should work the same way.
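
A similar check can be run from a cluster node before submitting the Spark job. The sketch below is only an illustration of that kind of CLI test: the function name check_cross_account_access and the object key emr-access-probe.txt are hypothetical, and it assumes the AWS CLI is installed on the node and picks up the instance role credentials.

```shell
#!/usr/bin/env bash
# Hypothetical smoke test, meant to be run on an EMR node.
# $1 = name of the bucket in the other account
check_cross_account_access() {
  local bucket="$1"
  local probe status=0
  probe="$(mktemp)" || return 1
  echo "emr cross-account probe" > "$probe"

  # 1. Show which identity (role) the node's credentials resolve to.
  # 2. List the bucket (needs s3:ListBucket on the bucket).
  # 3. Upload a small object (needs s3:PutObject on the bucket's objects).
  aws sts get-caller-identity &&
    aws s3 ls "s3://${bucket}/" &&
    aws s3 cp "$probe" "s3://${bucket}/emr-access-probe.txt" || status=1

  rm -f "$probe"
  return "$status"
}
```

Calling check_cross_account_access other-account-bucket returns a non-zero status as soon as one of the three steps fails, which helps tell a bucket policy problem apart from a problem in the Spark job itself.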

John Rotenstein answered Sep 18 '22 14:09