I have a Pig script (using a slightly modified MultiStorage) that transforms some data. Once the script runs, I have data in the following format on HDFS:
/tmp/data/identifier1/identifier1-0,0001
/tmp/data/identifier1/identifier1-0,0002
/tmp/data/identifier2/identifier2-0,0001
/tmp/data/identifier3/identifier3-0,0001
I'm attempting to use S3DistCp to copy these files to S3, using the --groupBy option with the pattern .*(identifier[0-9]).* to combine files by identifier (a sketch of the invocation follows the listing below). The grouping works, but when copying to S3 the parent folders are copied as well, so the end output is:
/s3bucket/identifier1/identifier1
/s3bucket/identifier2/identifier2
/s3bucket/identifier3/identifier3
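For reference, the invocation looks roughly like this (run as an EMR step; the bucket name and paths are placeholders for my actual values):

s3-dist-cp --src hdfs:///tmp/data \
           --dest s3://s3bucket/ \
           --groupBy '.*(identifier[0-9]).*'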
Is there a way to copy these files without that extra folder level? Ideally, my output in S3 would look like:
/s3bucket/identifier1
/s3bucket/identifier2
/s3bucket/identifier3
Another solution I've considered is to use HDFS commands to pull those files out of their directories before copying to S3, roughly along the lines of the sketch below. Is that a reasonable approach?
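The staging directory name here is just an example; the glob is meant to flatten one directory level:

hdfs dfs -mkdir -p /tmp/grouped
hdfs dfs -mv '/tmp/data/*/*' /tmp/grouped/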
Thanks!
The solution I've arrived at is to use distcp to pull these files out of their directories before running s3distcp:
hadoop distcp -update /tmp/data/** /tmp/grouped
Then I changed the s3distcp invocation to move data from /tmp/grouped into my S3 bucket (roughly as sketched below).
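The second step looks approximately like this (the bucket name is a placeholder for mine):

s3-dist-cp --src hdfs:///tmp/grouped \
           --dest s3://s3bucket/ \
           --groupBy '.*(identifier[0-9]).*'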
Using distcp before s3distcp is really expensive. Another option is to create a manifest file listing all of your files and pass its path to s3distcp. In this manifest you can define the "base name" of each file. If you need an example of a manifest file, just run s3distcp on any folder with the --outputManifest argument (see the sketch below).
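A rough sketch of that workflow (the bucket name, output paths, and manifest location are examples, not exact commands):

# First run: generate a manifest to use as a template.
s3-dist-cp --src hdfs:///tmp/data \
           --dest s3://s3bucket/staging/ \
           --outputManifest=manifest.gz

# After editing the baseName entries, copy according to the edited manifest.
s3-dist-cp --src hdfs:///tmp/data \
           --dest s3://s3bucket/ \
           --previousManifest=s3://s3bucket/staging/manifest.gz \
           --copyFromManifest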
More information can be found in the AWS S3DistCp documentation.