I have a Pig script (using a slightly modified MultiStorage) that transforms some data. Once the script runs, I have data in the following format on HDFS:
/tmp/data/identifier1/identifier1-0,0001
/tmp/data/identifier1/identifier1-0,0002
/tmp/data/identifier2/identifier2-0,0001
/tmp/data/identifier3/identifier3-0,0001
I'm attempting to use S3DistCp to copy these files to S3, using the --groupBy option with the pattern .*(identifier[0-9]).* to combine files by identifier (a sketch of the invocation follows the listing below). The grouping works, but when copying to S3 the parent folders are copied as well, so the end output is:
/s3bucket/identifier1/identifier1
/s3bucket/identifier2/identifier2
/s3bucket/identifier3/identifier3
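For reference, the invocation looks roughly like this (run as an EMR step; the bucket name and paths are placeholders for my actual values):

s3-dist-cp --src hdfs:///tmp/data \
           --dest s3://s3bucket/ \
           --groupBy '.*(identifier[0-9]).*'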
Is there a way to copy these files without that extra folder level? Ideally, my output in S3 would look like:
/s3bucket/identifier1
/s3bucket/identifier2
/s3bucket/identifier3
Another solution I've considered is to use HDFS commands to pull those files out of their directories before copying to S3, roughly along the lines of the sketch below. Is that a reasonable approach?
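The staging directory name here is just an example; the glob is meant to flatten one directory level:

hdfs dfs -mkdir -p /tmp/grouped
hdfs dfs -mv '/tmp/data/*/*' /tmp/grouped/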
Thanks!
The solution I've arrived at is to use distcp to pull these files out of their directories before running s3distcp:
hadoop distcp -update /tmp/data/** /tmp/grouped
Then I changed the s3distcp invocation to move data from /tmp/grouped into my S3 bucket (roughly as sketched below).
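The second step looks approximately like this (the bucket name is a placeholder for mine):

s3-dist-cp --src hdfs:///tmp/grouped \
           --dest s3://s3bucket/ \
           --groupBy '.*(identifier[0-9]).*'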
Using distcp before s3distcp is really expensive. Another option is to create a manifest file listing all of your files and pass its path to s3distcp. In this manifest you can define the "base name" of each file. If you need an example of a manifest file, just run s3distcp on any folder with the --outputManifest argument (see the sketch below).
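A rough sketch of that workflow (the bucket name, output paths, and manifest location are examples, not exact commands):

# First run: generate a manifest to use as a template.
s3-dist-cp --src hdfs:///tmp/data \
           --dest s3://s3bucket/staging/ \
           --outputManifest=manifest.gz

# After editing the baseName entries, copy according to the edited manifest.
s3-dist-cp --src hdfs:///tmp/data \
           --dest s3://s3bucket/ \
           --previousManifest=s3://s3bucket/staging/manifest.gz \
           --copyFromManifest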
More information can be found in the AWS S3DistCp documentation.