I am trying to move data from HDFS to S3 using distcp. The distcp job seems to succeed, but the files are not being created correctly on S3. There are two issues: the original file names and paths are not preserved, and the data ends up as a set of block_<some number> files at the root of the bucket. I could not find any documentation/examples for this. What am I missing? How can I debug?
Here are some more details:
$ hadoop version
Hadoop 0.20.2-cdh3u0
Subversion -r
Compiled by diego on Sun May 1 15:42:11 PDT 2011
From source with checksum
hadoop fs -ls hdfs://hadoopmaster/data/paramesh/
…<bunch of files>…
hadoop distcp hdfs://hadoopmaster/data/paramesh/ s3://<id>:<key>@paramesh-test/
$ ./s3cmd-1.1.0-beta3/s3cmd ls s3://paramesh-test
DIR s3://paramesh-test//
DIR s3://paramesh-test/test/
2012-05-10 02:20 0 s3://paramesh-test/block_-1067032400066050484
2012-05-10 02:20 8953 s3://paramesh-test/block_-183772151151054731
2012-05-10 02:20 11209 s3://paramesh-test/block_-2049242382445148749
2012-05-10 01:40 1916 s3://paramesh-test/block_-5404926129840434651
2012-05-10 01:40 8953 s3://paramesh-test/block_-6515202635859543492
2012-05-10 02:20 48051 s3://paramesh-test/block_1132982570595970987
2012-05-10 01:40 48052 s3://paramesh-test/block_3632190765594848890
2012-05-10 02:20 1160 s3://paramesh-test/block_363439138801598558
2012-05-10 01:40 1160 s3://paramesh-test/block_3786390805575657892
2012-05-10 01:40 11876 s3://paramesh-test/block_4393980661686993969
You should use s3n instead of s3.
s3n is the native filesystem implementation (i.e., it stores regular files). Using s3 imposes an HDFS block structure on the files, so you can't really read them without going through the HDFS libraries.
Thus:
hadoop distcp hdfs://file/1 s3n://bucket/destination
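If you would rather not embed the AWS key pair in the URI (as in the command above), the credentials can go in core-site.xml instead. A minimal sketch, assuming the standard s3n credential property names and placeholder values:

<!-- core-site.xml: s3n credentials (placeholder values) -->
<property>
  <name>fs.s3n.awsAccessKeyId</name>
  <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
  <name>fs.s3n.awsSecretAccessKey</name>
  <value>YOUR_SECRET_ACCESS_KEY</value>
</property>

With that in place, the destination can simply be written as s3n://paramesh-test/ without putting the keys in the URI.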
If your files in HDFS are larger than 5 GB, your distcp job will fail with errors that look like:
Caused by: org.jets3t.service.S3ServiceException: S3 Error Message. -- ResponseCode: 400, ResponseStatus: Bad Request, XML Error Message:
<?xml version="1.0" encoding="UTF-8"?><Error><Code>EntityTooLarge</Code><Message>Your proposed upload exceeds the maximum allowed size</Message><ProposedSize>23472570134</ProposedSize><MaxSizeAllowed>5368709120</MaxSizeAllowed><RequestId>5BDA6B12B9E99CE9</RequestId><HostId>vmWvS3Ynp35bpIi7IjB7mv1waJSBu5gfrqF9U2JzUYsXg0L7/sy42liEO4m0+lh8V6CqU7PU2uo=</HostId></Error>
    at org.jets3t.service.S3Service.putObject(S3Service.java:2267)
    at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:122)
    ... 27 more
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143
To fix this, either use the s3n filesystem as @matthew-rathbone suggested, adding -Dfs.s3n.multipart.uploads.enabled=true, like:
hadoop distcp -Dfs.s3n.multipart.uploads.enabled=true hdfs://file/1 s3n://bucket/destination
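If the default multipart part size is not a good fit, it can be tuned as well. A sketch, assuming your Hadoop build exposes the fs.s3n.multipart.uploads.block.size property (the part size in bytes; 134217728 below is just a 128 MB placeholder):

hadoop distcp -Dfs.s3n.multipart.uploads.enabled=true -Dfs.s3n.multipart.uploads.block.size=134217728 hdfs://file/1 s3n://bucket/destination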
OR
Use the "next generation" s3 filesystem, s3a
like:
hadoop distcp -Dfs.s3a.endpoint=s3.us-east-1.amazonaws.com hdfs://file/1 s3a://bucket/destination
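As with s3n, the s3a credentials can be passed on the command line (or in core-site.xml) instead of being embedded in the URI. A sketch using the standard s3a property names with placeholder values; the multipart size shown is only an illustrative setting, not a requirement:

hadoop distcp -Dfs.s3a.access.key=YOUR_ACCESS_KEY_ID -Dfs.s3a.secret.key=YOUR_SECRET_ACCESS_KEY -Dfs.s3a.multipart.size=134217728 hdfs://file/1 s3a://bucket/destination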
Options and documentation for these live here: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html