Problems with Hadoop distcp from HDFS to Amazon S3

I am trying to move data from HDFS to S3 using distcp. The distcp job seems to succeed, but on S3 the files are not being created correctly. There are two issues:

  1. The file names and directory paths are not preserved. All files end up as block_<some number> at the root of the bucket.
  2. It creates a bunch of extra files on S3 containing metadata and logs.

I could not find any documentation/examples for this. What am I missing? How can I debug?

Here are some more details:

$ hadoop version 
Hadoop 0.20.2-cdh3u0
Subversion  -r 
Compiled by diego on Sun May  1 15:42:11 PDT 2011
From source with checksum 
hadoop fs -ls hdfs://hadoopmaster/data/paramesh/
…<bunch of files>…

hadoop distcp hdfs://hadoopmaster/data/paramesh/ s3://<id>:<key>@paramesh-test/
$ ./s3cmd-1.1.0-beta3/s3cmd ls s3://paramesh-test

                       DIR   s3://paramesh-test//
                       DIR   s3://paramesh-test/test/
2012-05-10 02:20         0   s3://paramesh-test/block_-1067032400066050484
2012-05-10 02:20      8953   s3://paramesh-test/block_-183772151151054731
2012-05-10 02:20     11209   s3://paramesh-test/block_-2049242382445148749
2012-05-10 01:40      1916   s3://paramesh-test/block_-5404926129840434651
2012-05-10 01:40      8953   s3://paramesh-test/block_-6515202635859543492
2012-05-10 02:20     48051   s3://paramesh-test/block_1132982570595970987
2012-05-10 01:40     48052   s3://paramesh-test/block_3632190765594848890
2012-05-10 02:20      1160   s3://paramesh-test/block_363439138801598558
2012-05-10 01:40      1160   s3://paramesh-test/block_3786390805575657892
2012-05-10 01:40     11876   s3://paramesh-test/block_4393980661686993969
Paramesh


People also ask

Can I use Amazon S3 for Hadoop storage instead of HDFS?

You can't configure Amazon EMR to use Amazon S3 instead of HDFS for the Hadoop storage layer. HDFS and the EMR File System (EMRFS), which uses Amazon S3, are both compatible with Amazon EMR, but they're not interchangeable.

Does S3 support HDFS?

While Apache Hadoop has traditionally worked with HDFS, S3 also meets Hadoop's file system requirements. Companies such as Netflix have used this compatibility to build Hadoop data warehouses that store information in S3, rather than HDFS.

Is S3 better than HDFS?

To summarize, S3 and cloud storage provide elasticity, with an order of magnitude better availability and durability and 2X better performance, at 10X lower cost than traditional HDFS data storage clusters. Hadoop and HDFS commoditized big data storage by making it cheap to store and distribute a large amount of data.


2 Answers

You should use s3n instead of s3.

s3n is the native filesystem implementation (i.e., regular files); using s3 imposes the HDFS block structure on the files, so you can't really read them without going through the HDFS libraries.

Thus:

hadoop distcp hdfs://file/1 s3n://bucket/destination
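
If you would rather not embed the access key and secret in the URI (where they end up in logs and shell history), the s3n credentials can also be passed as configuration properties. A minimal sketch, assuming the standard s3n properties fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey and reusing the paths from the question; the key values are placeholders:

# pass s3n credentials via -D properties instead of the s3n:// URI (placeholder values)
hadoop distcp \
  -Dfs.s3n.awsAccessKeyId=YOUR_ACCESS_KEY \
  -Dfs.s3n.awsSecretAccessKey=YOUR_SECRET_KEY \
  hdfs://hadoopmaster/data/paramesh/ s3n://paramesh-test/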
Matthew Rathbone


In the event that your files in HDFS are larger than 5GB, you will encounter errors in your distcp job that look like:

Caused by: org.jets3t.service.S3ServiceException: S3 Error Message. -- ResponseCode: 400, ResponseStatus: Bad Request, XML Error Message: <?xml version="1.0" encoding="UTF-8"?><Error><Code>EntityTooLarge</Code><Message>Your proposed upload exceeds the maximum allowed size</Message><ProposedSize>23472570134</ProposedSize><MaxSizeAllowed>5368709120</MaxSizeAllowed><RequestId>5BDA6B12B9E99CE9</RequestId><HostId>vmWvS3Ynp35bpIi7IjB7mv1waJSBu5gfrqF9U2JzUYsXg0L7/sy42liEO4m0+lh8V6CqU7PU2uo=</HostId></Error>
        at org.jets3t.service.S3Service.putObject(S3Service.java:2267)
        at org.apache.hadoop.fs.s3native.Jets3tNativeFileSystemStore.storeFile(Jets3tNativeFileSystemStore.java:122)
        ... 27 more
Container killed by the ApplicationMaster. Container killed on request. Exit code is 143
Container exited with a non-zero exit code 143

To fix this, you can keep the s3n filesystem as @matthew-rathbone suggested, but enable multipart uploads with -Dfs.s3n.multipart.uploads.enabled=true:

hadoop distcp -Dfs.s3n.multipart.uploads.enabled=true hdfs://file/1 s3n://bucket/destination

OR

Use the "next generation" s3 filesystem, s3a:

hadoop distcp -Dfs.s3a.endpoint=apigateway.us-east-1.amazonaws.com hdfs://file/1 s3a://bucket/destination

Options and documentation for these live here: https://hadoop.apache.org/docs/stable/hadoop-aws/tools/hadoop-aws/index.html
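
If you go the s3a route, credentials can likewise be supplied via configuration rather than the URI. A minimal sketch (not taken from the answer above), assuming the standard s3a properties fs.s3a.access.key, fs.s3a.secret.key, and fs.s3a.multipart.size, and reusing the bucket and HDFS path from the question; the 104857600 (100 MB) part size is only an illustrative value:

# pass s3a credentials via -D properties; part size controls multipart upload chunking (placeholder values)
hadoop distcp \
  -Dfs.s3a.access.key=YOUR_ACCESS_KEY \
  -Dfs.s3a.secret.key=YOUR_SECRET_KEY \
  -Dfs.s3a.multipart.size=104857600 \
  hdfs://hadoopmaster/data/paramesh/ s3a://paramesh-test/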

Sean