I have a huge bucket of S3 files that I want to put on HDFS. Given the number of files involved, my preferred solution is to use distributed copy (distcp). However, for some reason I can't get hadoop distcp to take my Amazon S3 credentials. The command I use is:
hadoop distcp -update s3a://[bucket]/[folder]/[filename] hdfs:///some/path/ -D fs.s3a.awsAccessKeyId=[keyid] -D fs.s3a.awsSecretAccessKey=[secretkey] -D fs.s3a.fast.upload=true
However, that behaves the same as if the '-D' arguments weren't there at all:
ERROR tools.DistCp: Exception encountered
java.io.InterruptedIOException: doesBucketExist on [bucket]: com.amazonaws.AmazonClientException: No AWS Credentials provided by BasicAWSCredentialsProvider EnvironmentVariableCredentialsProvider SharedInstanceProfileCredentialsProvider : com.amazonaws.SdkClientException: Unable to load credentials from service endpoint
I've looked at the hadoop distcp documentation but can't find an explanation there of why this isn't working. I've tried -Dfs.s3n.awsAccessKeyId as a flag, which didn't work either. I've read that explicitly passing credentials isn't good practice, so maybe this is just a gentle suggestion to do it some other way?
How is one supposed to pass S3 credentials to distcp? Does anyone know?
Some background first: Hadoop's S3A client (URI scheme s3a://) offers high-performance IO against the Amazon S3 object store and compatible implementations. It reads and writes S3 objects directly, works with standard S3 clients, and is compatible with files created by the older s3n:// client and Amazon EMR's s3:// client. There are several options for accessing S3 as a Hadoop filesystem (see the Apache documentation), but S3A is the successor to S3 Native (s3n): it uses Amazon's own libraries to interact with S3, which lets it support larger files (no more 5 GB limit), higher-performance operations, and more.
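For what it's worth, the error above lists the providers S3A tries in order, including EnvironmentVariableCredentialsProvider, so one alternative to command-line flags is to export the standard AWS environment variables before launching distcp. This is only a sketch with placeholder values, and on a real cluster the variables are only guaranteed to be visible to the client process that plans the copy, not necessarily to the map tasks that do the copying:

# Assumed placeholders: [accesskey], [secretkey], [bucket], [folder], [filename]
export AWS_ACCESS_KEY_ID=[accesskey]
export AWS_SECRET_ACCESS_KEY=[secretkey]

hadoop distcp -update \
    s3a://[bucket]/[folder]/[filename] hdfs:///some/path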
It turns out the format of the credential properties has changed from the older s3n client: for s3a the correct keys are fs.s3a.access.key and fs.s3a.secret.key, not fs.s3a.awsAccessKeyId. Note also that -D is a generic Hadoop option and must come before the distcp-specific arguments. The following command works:
hadoop distcp \
-Dfs.s3a.access.key=[accesskey] \
-Dfs.s3a.secret.key=[secretkey] \
-Dfs.s3a.fast.upload=true \
-update \
s3a://[bucket]/[folder]/[filename] hdfs:///some/path
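If you'd rather keep the secrets off the command line entirely (the "not good practice" concern from the question), one approach described in the Hadoop S3A documentation is to store them in a Hadoop credential provider and point distcp at it. A rough sketch, where the jceks path and placeholders are assumptions you'd adapt to your cluster:

# Store both keys in a JCEKS keystore on HDFS; each command prompts for the value.
hadoop credential create fs.s3a.access.key \
    -provider jceks://hdfs/user/[username]/s3.jceks
hadoop credential create fs.s3a.secret.key \
    -provider jceks://hdfs/user/[username]/s3.jceks

# Reference the keystore instead of passing the keys as -D properties.
hadoop distcp \
    -Dhadoop.security.credential.provider.path=jceks://hdfs/user/[username]/s3.jceks \
    -update \
    s3a://[bucket]/[folder]/[filename] hdfs:///some/path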