DistCp from Local Hadoop to Amazon S3

I'm trying to use DistCp to copy a folder from my local Hadoop cluster (CDH4) to my Amazon S3 bucket.

I use the following command:

hadoop distcp -log /tmp/distcplog-s3/ hdfs://nameserv1/tmp/data/sampledata  s3n://hdfsbackup/

hdfsbackup is the name of my Amazon S3 Bucket.

DistCp fails with an UnknownHostException:

13/05/31 11:22:33 INFO tools.DistCp: srcPaths=[hdfs://nameserv1/tmp/data/sampledata]
13/05/31 11:22:33 INFO tools.DistCp: destPath=s3n://hdfsbackup/
        No encryption was performed by peer.
        No encryption was performed by peer.
13/05/31 11:22:35 INFO hdfs.DFSClient: Created HDFS_DELEGATION_TOKEN token 54 for hadoopuser on ha-hdfs:nameserv1
13/05/31 11:22:35 INFO security.TokenCache: Got dt for hdfs://nameserv1; Kind: HDFS_DELEGATION_TOKEN, Service: ha-hdfs:nameserv1, Ident: (HDFS_DELEGATION_TOKEN token 54 for hadoopuser)
        No encryption was performed by peer.
java.lang.IllegalArgumentException: java.net.UnknownHostException: hdfsbackup
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:414)
    at org.apache.hadoop.security.SecurityUtil.buildDTServiceName(SecurityUtil.java:295)
    at org.apache.hadoop.fs.FileSystem.getCanonicalServiceName(FileSystem.java:282)
    at org.apache.hadoop.fs.FileSystem.collectDelegationTokens(FileSystem.java:503)
    at org.apache.hadoop.fs.FileSystem.addDelegationTokens(FileSystem.java:487)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:130)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodesInternal(TokenCache.java:111)
    at org.apache.hadoop.mapreduce.security.TokenCache.obtainTokensForNamenodes(TokenCache.java:85)
    at org.apache.hadoop.tools.DistCp.setup(DistCp.java:1046)
    at org.apache.hadoop.tools.DistCp.copy(DistCp.java:666)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:881)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:908)
Caused by: java.net.UnknownHostException: hdfsbackup
    ... 14 more

I have the AWS access key ID and secret key configured in the core-site.xml of all nodes:

<!-- Amazon S3 -->
<property>
    <name>fs.s3.awsAccessKeyId</name>
    <value>MY-ID</value>
</property>

<property>
    <name>fs.s3.awsSecretAccessKey</name>
    <value>MY-SECRET</value>
</property>


<!-- Amazon S3N -->
<property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>MY-ID</value>
</property>

<property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>MY-SECRET</value>
</property>
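
For what it's worth, I believe the same credentials can also be passed per-invocation through Hadoop's generic -D options (DistCp runs through ToolRunner, so it should accept them), rather than only via core-site.xml:

hadoop distcp -Dfs.s3n.awsAccessKeyId=MY-ID -Dfs.s3n.awsSecretAccessKey=MY-SECRET hdfs://nameserv1/tmp/data/sampledata s3n://hdfsbackup/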

I'm able to copy files from HDFS using the cp command without any problem. The command below successfully copied the HDFS folder to S3:

hadoop fs -cp hdfs://nameserv1/tmp/data/sampledata  s3n://hdfsbackup/

I know there is an Amazon S3-optimized DistCp (s3distcp) available, but I don't want to use it, as it doesn't support update/overwrite options.


Mohamed


1 Answer

It looks like you are using Kerberos security, and unfortunately MapReduce jobs cannot currently access Amazon S3 when Kerberos is enabled: at job setup, TokenCache tries to obtain a delegation token for every file system involved, and resolving the S3 bucket name as if it were a hostname is what throws the UnknownHostException you see. (That is also why your plain hadoop fs -cp works: it copies client-side without submitting a job.) You can see more details in MAPREDUCE-4548.

There is actually a patch attached to that ticket that should fix it, but it is not currently part of any Hadoop distribution, so if you have an opportunity to modify and build Hadoop from source, here is what you would apply:


Index: core/org/apache/hadoop/security/SecurityUtil.java
===================================================================
--- core/org/apache/hadoop/security/SecurityUtil.java   (revision 1305278)
+++ core/org/apache/hadoop/security/SecurityUtil.java   (working copy)
@@ -313,6 +313,9 @@
     if (authority == null || authority.isEmpty()) {
       return null;
     }
+    if (uri.getScheme().equals("s3n") || uri.getScheme().equals("s3")) {
+      return null;
+    }
     InetSocketAddress addr = NetUtils.createSocketAddr(authority, defPort);
     return buildTokenService(addr).toString();
    }
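
If you do go down that route, applying it is a standard patch-and-rebuild. A rough sketch, assuming the diff above is saved as MAPREDUCE-4548.patch and that you run it from the root of the source tree (the exact path prefix and build command depend on your distribution):

cd /path/to/hadoop-source         # hypothetical location of your Hadoop/CDH4 source checkout
patch -p0 < MAPREDUCE-4548.patch  # -p0 because the paths in the diff are relative to the source root
# rebuild, then redeploy the patched core jar to every node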

The ticket was last updated a couple days ago, so hopefully this will be officially patched soon.

An easier solution would be to just disable Kerberos, but that might not be possible in your environment.
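
If that is an option, disabling it amounts to switching the cluster back to simple authentication in core-site.xml on every node (sketched here for completeness; this is a cluster-wide change with obvious security implications):

<property>
    <name>hadoop.security.authentication</name>
    <value>simple</value>
</property>

<property>
    <name>hadoop.security.authorization</name>
    <value>false</value>
</property>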

I've seen that you might be able to work around this if your bucket is named like a domain name, but I haven't tried it, and even if it works, it sounds like a hack.
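
The idea being that the exception comes from failing to resolve the bucket name as a hostname, so a bucket whose name happens to resolve in DNS might not trip it. A purely hypothetical sketch, with hdfsbackup.example.com standing in for such a bucket:

hadoop distcp -log /tmp/distcplog-s3/ hdfs://nameserv1/tmp/data/sampledata s3n://hdfsbackup.example.com/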


Charles Menguy