org.apache.hadoop.security.AccessControlException: Permission denied when trying to access S3 bucket through s3n URI using Hadoop Java APIs on EC2

Scenario

I create an AWS IAM role called "my-role" specifying EC2 as trusted entity, i.e. using the trust relationship policy document:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "",
      "Effect": "Allow",
      "Principal": {
        "Service": "ec2.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

The role has the following policy:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "s3:AbortMultipartUpload",
        "s3:DeleteObject",
        "s3:DeleteObjectVersion",
        "s3:GetBucketAcl",
        "s3:GetBucketCORS",
        "s3:GetBucketLocation",
        "s3:GetBucketLogging",
        "s3:GetBucketNotification",
        "s3:GetBucketPolicy",
        "s3:GetBucketRequestPayment",
        "s3:GetBucketTagging",
        "s3:GetBucketVersioning",
        "s3:GetBucketWebsite",
        "s3:GetLifecycleConfiguration",
        "s3:GetObject",
        "s3:GetObjectAcl",
        "s3:GetObjectTorrent",
        "s3:GetObjectVersion",
        "s3:GetObjectVersionAcl",
        "s3:GetObjectVersionTorrent",
        "s3:ListBucket",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions",
        "s3:ListMultipartUploadParts",
        "s3:PutObject",
        "s3:PutObjectAcl",
        "s3:PutObjectVersionAcl",
        "s3:RestoreObject"
      ],
      "Resource": [
        "arn:aws:s3:::my-bucket/*"
      ]
    }
  ]
}

I launch an EC2 instance (Amazon Linux 2014.09.1) from the command line using the AWS CLI, specifying "my-role" as the instance profile, and everything works out fine. I verify that the instance effectively assumes "my-role" by running:

  • curl http://169.254.169.254/latest/meta-data/iam/security-credentials/ to query the instance metadata, from which I get the response my-role;
  • curl http://169.254.169.254/latest/meta-data/iam/security-credentials/my-role, from which I get the temporary credentials associated with "my-role" (see the Java sketch after this list for doing the same programmatically).

An example of such a credential-retrieval response looks like this:

{
  "Code" : "Success",
  "LastUpdated" : "2015-01-19T10:37:35Z",
  "Type" : "AWS-HMAC",
  "AccessKeyId" : "an-access-key-id",
  "SecretAccessKey" : "a-secret-access-key",
  "Token" : "a-token",
  "Expiration" : "2015-01-19T16:47:09Z"
}
  • aws s3 ls s3://my-bucket/, from which I correctly get a list of the first-level subdirectories under "my-bucket". (The AWS CLI comes installed and configured by default on this AMI; the EC2 instance and the S3 bucket are within the same AWS account.)
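
For reference, here is a minimal Java sketch of fetching the same credentials document from inside the servlet instead of via curl (the class and method names are mine, and the JSON parsing is left to whatever library you prefer):

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical helper: reads the JSON credentials document for a given role
// from the EC2 instance metadata service (same endpoint as the curl calls above).
public class InstanceCredentialsFetcher {

    private static final String METADATA_BASE =
            "http://169.254.169.254/latest/meta-data/iam/security-credentials/";

    public static String fetchCredentialsJson(String roleName) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL(METADATA_BASE + roleName).openConnection();
        conn.setRequestMethod("GET");
        StringBuilder body = new StringBuilder();
        try (BufferedReader reader =
                     new BufferedReader(new InputStreamReader(conn.getInputStream()))) {
            String line;
            while ((line = reader.readLine()) != null) {
                body.append(line);
            }
        } finally {
            conn.disconnect();
        }
        // The JSON contains AccessKeyId, SecretAccessKey, Token and Expiration;
        // parse it and pass the three values to the Hadoop Configuration below.
        return body.toString();
    }
}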

I install and run a Tomcat 7 server/servlet container on that instance, on which I deploy a J2EE 1.7 servlet with no issues.

The servlet should download a file from an S3 bucket to the local file system, specifically s3://my-bucket/custom-path/file.tar.gz, using the Hadoop Java APIs. (Please note that I tried the hadoop-common artifact versions 2.4.x, 2.5.x and 2.6.x with no positive results. Below I post the exception I get when using 2.5.x.)

Within the servlet, I retrieve fresh credentials from the instance metadata URL mentioned above and use them to configure the Hadoop Java API:

... 
Path path = new Path("s3n://my-bucket/");
Configuration conf = new Configuration();
conf.set("fs.defaultFS", path.toString());
conf.set("fs.s3n.awsAccessKeyId", myAwsAccessKeyId);
conf.set("fs.s3n.awsSecretAccessKey", myAwsSecretAccessKey);
conf.set("fs.s3n.awsSessionToken", mySessionToken);
...

Obviously, myAwsAccessKeyId, myAwsSecretAccessKey, and mySessionToken are Java variables that I previously set to the actual values. Then I effectively get a FileSystem instance using:

FileSystem fs = path.getFileSystem(conf);

I am able to retrieve all the configuration related to the FileSystem (fs.getConf().get(<key-name>)) and verify everything is configured as assumed.
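
For clarity, here is the same logic consolidated into one compilable sketch (my own simplification of the snippets above; the credential values are the ones obtained from the metadata endpoint):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3nDownloadSketch {
    // Consolidated version of the snippets above.
    public static void download(String myAwsAccessKeyId,
                                String myAwsSecretAccessKey,
                                String mySessionToken,
                                Path outputLocalPath) throws Exception {
        Path path = new Path("s3n://my-bucket/");
        Configuration conf = new Configuration();
        conf.set("fs.defaultFS", path.toString());
        conf.set("fs.s3n.awsAccessKeyId", myAwsAccessKeyId);
        conf.set("fs.s3n.awsSecretAccessKey", myAwsSecretAccessKey);
        conf.set("fs.s3n.awsSessionToken", mySessionToken);

        FileSystem fs = path.getFileSystem(conf);
        // The copy below is exactly the call that fails (see "Problem").
        fs.copyToLocalFile(false, new Path(path.toString() + "custom-path/file.tar.gz"), outputLocalPath);
    }
}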

Problem

I cannot download s3://my-bucket/custom-path/file.tar.gz using:

...
fs.copyToLocalFile(false, new Path(path.toString()+"custom-path/file.tar.gz"), outputLocalPath);
...

If I use hadoop-common 2.5.x, I get the following IOException:

org.apache.hadoop.security.AccessControlException: Permission denied: s3n://my-bucket/custom-path/file.tar.gz
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:449)
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.processException(Jets3tNativeFileSystemStore.java:427)
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.handleException(Jets3tNativeFileSystemStore.java:411)
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.retrieveMetadata(Jets3tNativeFileSystemStore.java:181)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:187)
    at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
    at org.apache.hadoop.fs.s3native.$Proxy12.retrieveMetadata(Unknown Source)
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:467)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
    at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1968)
    at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1937)
    ...

If I use hadoop-common 2.4.x, I get a NullPointerException:

java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.getFileStatus(NativeS3FileSystem.java:433)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:337)
    at org.apache.hadoop.fs.FileUtil.copy(FileUtil.java:289)
    at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1968)
    at org.apache.hadoop.fs.FileSystem.copyToLocalFile(FileSystem.java:1937)
    ...

Just for the record, if I DON'T set any AWS credentials, I get:

AWS Access Key ID and Secret Access Key must be specified as the username or password (respectively) of a s3n URL, or by setting the fs.s3n.awsAccessKeyId or fs.s3n.awsSecretAccessKey properties (respectively).

Final notes

  • If I try to download the file from the very same URI (but with s3 in place of s3n) using the AWS CLI from the instance, I have NO PROBLEMS AT ALL.
  • If I download a Hadoop distribution (e.g. 2.4.1 from https://archive.apache.org/dist/hadoop/core/hadoop-2.4.1/), unzip it, retrieve the temporary AWS credentials from the instance metadata URL and try to run <hadoop-dir>/bin/hadoop fs -cp s3n://<aws-access-key-id>:<aws-secret-access-key>@my-bucket/custom-path/file.tar.gz . I get, once again, an NPE:

Fatal internal error
java.lang.NullPointerException
    at org.apache.hadoop.fs.s3native.NativeS3FileSystem.listStatus(NativeS3FileSystem.java:479)
    at org.apache.hadoop.fs.shell.PathData.getDirectoryContents(PathData.java:268)
    at org.apache.hadoop.fs.shell.Command.recursePath(Command.java:347)
    at org.apache.hadoop.fs.shell.Ls.processPathArgument(Ls.java:96)
    at org.apache.hadoop.fs.shell.Command.processArgument(Command.java:260)
    at org.apache.hadoop.fs.shell.Command.processArguments(Command.java:244)
    at org.apache.hadoop.fs.shell.Command.processRawArguments(Command.java:190)
    at org.apache.hadoop.fs.shell.Command.run(Command.java:154)
    at org.apache.hadoop.fs.FsShell.run(FsShell.java:255)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.fs.FsShell.main(FsShell.java:308)

Sorry for the long post, I just tried to be as detailed as I could. Thanks for any help.

1 Answer

You are using STS/temporary AWS credentials; these do not appear to be currently supported by the s3 or s3n FileSystem implementations in Hadoop.

AWS STS/temporary credentials include not only an access key and a secret key, but also a session token. The Hadoop s3 and s3n FileSystems do not yet support the session token (i.e. your fs.s3n.awsSessionToken setting is unsupported and ignored by the s3n FileSystem).

From the AmazonS3 page on the Hadoop Wiki (note that there is no mention of fs.s3.awsSessionToken):

Configuring to use s3/ s3n filesystems

Edit your core-site.xml file to include your S3 keys

   <property>
     <name>fs.s3.awsAccessKeyId</name>
     <value>ID</value>
   </property>

   <property>
     <name>fs.s3.awsSecretAccessKey</name>
     <value>SECRET</value>
   </property>


If you take a look at S3Credentials.java in apache/hadoop on github.com, you'll notice that the notion of a session token is completely missing from the representation of S3 credentials.
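
A heavily abridged paraphrase of that class (not the literal source) illustrates the point:

// Abridged paraphrase of org.apache.hadoop.fs.s3.S3Credentials in Hadoop 2.x,
// not the literal source: only an access key and a secret key are modelled.
class S3Credentials {
    private String accessKey;        // from fs.s3n.awsAccessKeyId (or the s3 variant) or the URI user-info
    private String secretAccessKey;  // from fs.s3n.awsSecretAccessKey (or the s3 variant) or the URI user-info
    // There is no field for a session token, so STS temporary credentials
    // cannot be expressed through this class.
}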

There was a patch submitted to address this limitation (detailed here); however, it hasn't been integrated.


If you are using AWS IAM instance roles, you may want to explore the new s3a FileSystem added in Hadoop 2.6.0. It claims to support IAM role-based authentication (i.e. you wouldn't have to explicitly specify the keys at all).

The Hadoop JIRA ticket that introduced it, https://issues.apache.org/jira/browse/HADOOP-10400, describes how to configure the s3a FileSystem:

fs.s3a.access.key - Your AWS access key ID (omit for role authentication)
fs.s3a.secret.key - Your AWS secret key (omit for role authentication)
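
For completeness, a minimal sketch of the s3a route (my own, assuming Hadoop 2.6.0 with the hadoop-aws module and the AWS SDK on the classpath, and relying on the instance profile for credentials):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class S3aDownloadSketch {
    public static void main(String[] args) throws Exception {
        // Bucket, key and local destination taken from the question; adjust as needed.
        Path src = new Path("s3a://my-bucket/custom-path/file.tar.gz");
        Path dst = new Path("/tmp/file.tar.gz");

        Configuration conf = new Configuration();
        // fs.s3a.access.key / fs.s3a.secret.key are deliberately NOT set:
        // with an IAM instance role the s3a credential chain should fall back
        // to the EC2 instance profile credentials.

        FileSystem fs = src.getFileSystem(conf);
        fs.copyToLocalFile(false, src, dst);
        fs.close();
    }
}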
