I have log files in a tarball (access.logs.tar.gz) loaded into my Hadoop cluster. Is there a way to load it into Pig directly, without untarring it?
Tarballs are often used to back up personal or system files in place to create an archive, especially prior to making changes that might have to be reversed.
tar itself does not support compression directly. It is most commonly used in tandem with an external compression utility such as gzip or bzip2. These compression utilities generally only compress a single file, hence the pairing with tar, which can produce a single file from many files.
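Since neither Pig nor Hadoop ships a tar-aware input format, the usual workaround is to pull the archive out of HDFS, unpack it, and push the individual log files back in. A rough sketch from Pig's grunt shell (the sh command needs Pig 0.8+, and the /data and /tmp paths are just placeholders for illustration):

    -- copy the archive out of HDFS to local disk
    fs -copyToLocal /data/access.logs.tar.gz /tmp/access.logs.tar.gz
    -- unpack it locally (sh runs a shell command from grunt)
    sh mkdir -p /tmp/access_logs
    sh tar -xzf /tmp/access.logs.tar.gz -C /tmp/access_logs
    -- push the extracted log files back into HDFS
    fs -put /tmp/access_logs /data/access_logs

The same steps work just as well from an ordinary shell with hadoop fs -copyToLocal and hadoop fs -put if you are not already in grunt.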
@ChrisWhite's answer is technically correct and you should accept his answer instead of mine (IMO at least).
You need to get away from .tar.gz files with Hadoop. Gzip files are not splittable, so if your gzip files are large you're going to see hotspotting in your mappers: a single mapper has to read each file from start to finish. For example, if you have a .tar.gz file that is 100 GB, you aren't going to be able to split that computation at all.
Let's say, on the other hand, that the files are tiny. In that case Pig will do a nice job of combining them into larger splits and the splitting problem goes away. The downside is that you are now burdening the NameNode with tons of tiny files. Also, since the files are tiny, it should be relatively cheap computationally to reform them into a more reasonable format.
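If you do point Pig at a directory full of small extracted files, that combining is controlled by Pig's split-combination settings. A small illustration (property names as given in the Pig performance docs; the path is hypothetical):

    -- combine many small input files into fewer map splits
    SET pig.splitCombination true;          -- on by default in recent Pig versions
    SET pig.maxCombinedSplitSize 134217728; -- aim for roughly 128 MB per split

    -- /data/access_logs stands in for the directory of extracted log files
    logs = LOAD '/data/access_logs' USING TextLoader() AS (line:chararray);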
So what format should you reformulate the files into? Good question!
I think it would be completely reasonable to write some sort of tarball loader into piggybank, but I personally would just rather lay the data out differently.
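One way to do that re-layout with Pig itself is a single pass that reads the small extracted files and stores them back as a handful of larger files, optionally compressed with a splittable codec such as bzip2. A sketch, with paths made up and the compression properties as I remember them from the Pig docs (worth double-checking against your Pig release):

    -- turn on compressed output for PigStorage (verify these property names for your version)
    SET output.compression.enabled true;
    SET output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;

    raw = LOAD '/data/access_logs' USING TextLoader() AS (line:chararray);
    -- write the logs back out as fewer, larger, block-compressed files
    STORE raw INTO '/data/access_logs_repacked' USING PigStorage();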