I want to copy a test.tar.gz file from S3 to HDFS. This can be done with distcp or s3distcp, but my requirement is that the file be extracted on the fly during the transfer, so that HDFS ends up with only the extracted files, not the tar.gz itself.
Any suggestions, please?
When you transfer over the network, it is usually best to keep the files compressed. Imagine transferring 100 GB of raw data instead of a 20 GB bz2-compressed file. I would suggest using Hadoop API based code or a MapReduce program to extract the compressed files once the transfer to HDFS is done. Once the archive is in HDFS, you have everything you need to extract it without copying it over to the local file system.
One solution would be to use simple Hadoop API based code, or a MapReduce job that decompresses in parallel.
Addendum: For ZIP files you can follow this link, and you can come up with something similar for tar.gz.
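As a minimal sketch of the tar.gz case (not a full implementation): this assumes Apache Commons Compress is on the classpath, and the class name and the input/output paths passed as arguments are placeholders you would replace with your own.

    import java.io.BufferedInputStream;
    import java.io.IOException;
    import org.apache.commons.compress.archivers.tar.TarArchiveEntry;
    import org.apache.commons.compress.archivers.tar.TarArchiveInputStream;
    import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class TarGzExtractor {
        public static void main(String[] args) throws IOException {
            // args[0]: HDFS path of the tar.gz, args[1]: HDFS output directory (both hypothetical)
            Path archive = new Path(args[0]);
            Path outDir  = new Path(args[1]);

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Stream the archive straight out of HDFS: gunzip, then untar, entry by entry.
            try (TarArchiveInputStream tarIn = new TarArchiveInputStream(
                    new GzipCompressorInputStream(
                            new BufferedInputStream(fs.open(archive))))) {
                TarArchiveEntry entry;
                while ((entry = tarIn.getNextTarEntry()) != null) {
                    if (entry.isDirectory()) {
                        fs.mkdirs(new Path(outDir, entry.getName()));
                        continue;
                    }
                    // Write each file entry back to HDFS under the output directory.
                    try (FSDataOutputStream out = fs.create(new Path(outDir, entry.getName()))) {
                        IOUtils.copyBytes(tarIn, out, conf, false);
                    }
                }
            }
        }
    }

Nothing here touches the local file system; the archive is read and the extracted files are written entirely through the HDFS FileSystem API.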
In case your file is huge, say a 100GB.zip, you can probably use a Hadoop API based program which reads the Zip archive as a stream, extracts it (check the ZipFileRecordReader in the link from the addendum above for how this was done), and then writes the contents back to HDFS. I think a single ZIP file is not splittable and cannot be extracted in parallel (if I'm not mistaken). So if you have a single 100GB zip archive, you probably won't be able to unleash the full potential of a MapReduce program anyway; hence there is no point in using one.
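A rough sketch of that single-reader streaming approach, assuming the JDK's ZipInputStream wrapped around an HDFS input stream (again with placeholder class name and paths), could look like this:

    import java.io.IOException;
    import java.util.zip.ZipEntry;
    import java.util.zip.ZipInputStream;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IOUtils;

    public class ZipStreamExtractor {
        public static void main(String[] args) throws IOException {
            Path archive = new Path(args[0]);   // e.g. the 100GB.zip in HDFS
            Path outDir  = new Path(args[1]);   // HDFS directory for the extracted files

            Configuration conf = new Configuration();
            FileSystem fs = FileSystem.get(conf);

            // Read the zip as one sequential stream; entries are written out one at a
            // time, so the archive never has to fit on local disk or in memory.
            try (ZipInputStream zipIn = new ZipInputStream(fs.open(archive))) {
                ZipEntry entry;
                while ((entry = zipIn.getNextEntry()) != null) {
                    if (entry.isDirectory()) {
                        fs.mkdirs(new Path(outDir, entry.getName()));
                        continue;
                    }
                    try (FSDataOutputStream out = fs.create(new Path(outDir, entry.getName()))) {
                        IOUtils.copyBytes(zipIn, out, conf, false);
                    }
                }
            }
        }
    }

Since this is a single sequential reader, it does not parallelize; it simply avoids pulling the whole archive to the local machine.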
Another solution is to not decompress at all. For various built-in compression formats, Hadoop has a command line utility that lets you view compressed files as plain text, which helps if your intention is to keep them compressed in HDFS:
hadoop fs -text /path/fileinHDFS.bz2