
Hadoop distcp No AWS Credentials provided

I have a huge bucket of S3 files that I want to put on HDFS. Given the number of files involved, my preferred solution is to use 'distributed copy'. However, for some reason I can't get hadoop distcp to take my Amazon S3 credentials. The command I use is:

hadoop distcp -update s3a://[bucket]/[folder]/[filename] hdfs:///some/path/ -D fs.s3a.awsAccessKeyId=[keyid] -D fs.s3a.awsSecretAccessKey=[secretkey] -D fs.s3a.fast.upload=true

However, that behaves exactly as if the '-D' arguments weren't there:

ERROR tools.DistCp: Exception encountered
java.io.InterruptedIOException: doesBucketExist on [bucket]: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Unable to load credentials from service endpoint

I've looked at the hadoop distcp documentation, but can't find an explanation there for why this isn't working. I've tried -Dfs.s3n.awsAccessKeyId as a flag, which didn't work either. I've read that explicitly passing credentials isn't good practice, so maybe this is just a gentle suggestion to do it some other way?
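Side note: the provider chain in the error mentions EnvironmentVariableCredentialsProvider, so I assume exporting the standard AWS SDK environment variables on the machine that launches the job would be one such alternative. Untested sketch, with placeholder values:

# Untested sketch: the AWS SDK's EnvironmentVariableCredentialsProvider
# (listed in the error above) reads these standard variable names.
export AWS_ACCESS_KEY_ID=[keyid]
export AWS_SECRET_ACCESS_KEY=[secretkey]

# With the variables exported, the credentials would not need to be
# passed on the distcp command line at all.
hadoop distcp -update s3a://[bucket]/[folder]/[filename] hdfs:///some/path/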

How is one supposed to pass S3 credentials to distcp? Does anyone know?

asked Nov 23 '17 by KDC


People also ask

How do you connect Hadoop to AWS S3, and which client should you use?

Introducing the Hadoop S3A client. Hadoop's "S3A" client offers high-performance IO against the Amazon S3 object store and compatible implementations. It reads and writes S3 objects directly, is compatible with standard S3 clients, and works with files created by the older s3n:// client and Amazon EMR's s3:// client.
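As a quick illustration (not taken from the docs quoted above; the bucket, folder and key values are placeholders), the S3A client is exercised through the ordinary Hadoop filesystem commands once the fs.s3a.* credential properties are set:

# Hypothetical example: list and fetch objects through the S3A connector.
# [bucket], [accesskey] and [secretkey] are placeholders, not real values.
hadoop fs -Dfs.s3a.access.key=[accesskey] -Dfs.s3a.secret.key=[secretkey] \
  -ls s3a://[bucket]/[folder]/
hadoop fs -Dfs.s3a.access.key=[accesskey] -Dfs.s3a.secret.key=[secretkey] \
  -get s3a://[bucket]/[folder]/[filename] /tmp/[filename]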

Is S3 Hadoop compatible?

There are several options to access S3 as a Hadoop filesystem (see the Apache doc). The S3 dataset in DSS has native support for using Hadoop software layers whenever needed, including for fast read/write from Spark and Parquet support. Using a Hadoop dataset for accessing S3 is not usually required.

What is S3A in AWS?

S3A (URI scheme: s3a) is the successor to S3 Native (s3n). It uses Amazon's own libraries to interact with S3, which allows it to support larger files (no more 5 GB limit), higher-performance operations, and more.
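To make the size point concrete, here is an illustrative sketch (not from the quoted doc; the file, bucket and credentials are placeholders, and the fs.s3a.multipart.size tuning property is an assumption about which knob matters): a file over the old 5 GB single-PUT limit can be written through s3a because it is split into multipart uploads.

# Illustrative only: copy a local file larger than 5 GB to S3 via s3a.
hadoop fs -Dfs.s3a.access.key=[accesskey] \
          -Dfs.s3a.secret.key=[secretkey] \
          -Dfs.s3a.multipart.size=134217728 \
          -put /data/huge-10gb.file s3a://[bucket]/[folder]/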


1 Answer

It appears the credential property names have changed since earlier versions (fs.s3a.access.key and fs.s3a.secret.key instead of fs.s3a.awsAccessKeyId and fs.s3a.awsSecretAccessKey), and the -D generic options have to come before the other arguments rather than after the paths. The following command works:

hadoop distcp \
  -Dfs.s3a.access.key=[accesskey] \
  -Dfs.s3a.secret.key=[secretkey] \
  -Dfs.s3a.fast.upload=true \
  -update \
  s3a://[bucket]/[folder]/[filename] hdfs:///some/path
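
If you'd rather keep the secret off the command line and out of shell history, one option is the Hadoop credential provider. This is a sketch, assuming a Hadoop version with the credential provider API and S3A support for it; the jceks path is a placeholder:

# Create a credential store on HDFS and add the two S3A secrets to it
# (you are prompted for the values, so they never appear in history).
hadoop credential create fs.s3a.access.key \
  -provider jceks://hdfs/user/$USER/s3.jceks
hadoop credential create fs.s3a.secret.key \
  -provider jceks://hdfs/user/$USER/s3.jceks

# Then reference the store instead of passing the keys directly.
hadoop distcp \
  -Dhadoop.security.credential.provider.path=jceks://hdfs/user/$USER/s3.jceks \
  -Dfs.s3a.fast.upload=true \
  -update \
  s3a://[bucket]/[folder]/[filename] hdfs:///some/path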
answered Sep 26 '22 by KDC