
Remove directory level when transferring from HDFS to S3 using S3DistCp

I have a Pig script (using a slightly modified MultiStorage) that transforms some data. Once the script runs, I have data in the following format on HDFS:

/tmp/data/identifier1/identifier1-0,0001
/tmp/data/identifier1/identifier1-0,0002
/tmp/data/identifier2/identifier2-0,0001
/tmp/data/identifier3/identifier3-0,0001

I'm attempting to use S3DistCp to copy these files to S3, using the --groupBy .*(identifier[0-9]).* option to combine files based on the identifier. The grouping works, but when copying to S3, the directory level is copied as well. The end output is:

/s3bucket/identifier1/identifier1
/s3bucket/identifier2/identifier2
/s3bucket/identifier3/identifier3
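For reference, the s3-dist-cp step looks roughly like this (the exact invocation may differ by EMR version; the bucket name and paths are taken from the examples above):

s3-dist-cp --src hdfs:///tmp/data --dest s3://s3bucket/ --groupBy '.*(identifier[0-9]).*'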

Is there a way to copy these files without that first folder? Ideally, my output in S3 would look like:

/s3bucket/identifier1
/s3bucket/identifier2
/s3bucket/identifier3

Another solution I've considered is to use HDFS commands to pull those files out of their directories before copying to S3. Is that a reasonable solution?
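For example, something along these lines (the flat target directory is hypothetical):

hdfs dfs -mkdir /tmp/flat
hdfs dfs -mv /tmp/data/identifier1/* /tmp/flat/
hdfs dfs -mv /tmp/data/identifier2/* /tmp/flat/
hdfs dfs -mv /tmp/data/identifier3/* /tmp/flat/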

Thanks!

asked by NolanDC

2 Answers

The solution I've arrived at is to use distcp to pull these files out of their directories before running s3distcp:

hadoop distcp -update /tmp/data/** /tmp/grouped

Then I changed the s3distcp step to copy the data from /tmp/grouped into my S3 bucket.
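Putting it together, the two steps look roughly like this (bucket name as in the question; the s3-dist-cp invocation is a sketch and may vary by EMR version):

hadoop distcp -update /tmp/data/** /tmp/grouped
s3-dist-cp --src hdfs:///tmp/grouped --dest s3://s3bucket/ --groupBy '.*(identifier[0-9]).*'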

answered by NolanDC

Using distcp before s3distcp is really expensive. One other option you have is to create a manifest file listing all of your files and give its path to s3distcp. In this manifest you can define the "base name" of each file. If you need an example of a manifest file, just run s3distcp on any folder with the --outputManifest argument. More information can be found here.
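A sketch of that flow, assuming the --outputManifest, --previousManifest, and --copyFromManifest options of s3-dist-cp (all paths here are placeholders):

# 1. Run a copy with --outputManifest to see the expected manifest format
#    (a gzip file with one JSON entry per copied file):
s3-dist-cp --src hdfs:///tmp/data --dest s3://s3bucket/tmp/ --outputManifest=manifest.gz

# 2. Edit the entries, setting each file's "baseName" so it lands directly
#    under the bucket, then copy using that manifest:
s3-dist-cp --src hdfs:///tmp/data --dest s3://s3bucket/ --previousManifest=s3://s3bucket/tmp/manifest.gz --copyFromManifest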

answered by Eitan Illuz