
How to load a tarball into Pig

I have log files in a tarball (access.logs.tar.gz) that is loaded into my Hadoop cluster. Is there a way to load it directly into Pig without untarring it?

asked Apr 17 '12 by nashr rafeeg




1 Answer

@ChrisWhite's answer is technically correct and you should accept his answer instead of mine (IMO at least).

You need to get away from tar.gz files with Hadoop. Gzip files are not splittable, so if your gzip files are large you will see hotspotting in your mappers. For example, with a 100 GB .tar.gz file you cannot split the computation: the entire file is handed to a single mapper.
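
For illustration, here is a minimal Pig Latin sketch of the naive approach (the HDFS path is hypothetical). Pig's built-in loaders gunzip .gz files transparently based on the extension, but gzip is not splittable, so the whole archive goes to a single mapper, and the embedded tar headers show up as garbage lines mixed into the records:

    -- minimal sketch; /data/access.logs.tar.gz is a hypothetical path.
    -- Pig decompresses the .gz transparently, but the stream cannot be
    -- split, so one mapper reads all of it; tar headers also appear as
    -- garbage lines inside the relation.
    logs = LOAD '/data/access.logs.tar.gz' USING TextLoader() AS (line:chararray);
    sample = LIMIT logs 10;
    DUMP sample;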

Let's say, on the other hand, that the files are tiny. In that case, Pig will do a nice job of combining them into larger splits and the splitting problem goes away. The downside is that the NameNode now has to track tons of tiny files. Also, since the files are tiny, it should be computationally cheap to rewrite them into a more reasonable format.
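
If you take that route, Pig's split combination does the packing for you; here is a sketch with hypothetical paths (pig.splitCombination is on by default in Pig 0.8+):

    -- sketch with hypothetical paths; pig.maxCombinedSplitSize caps
    -- how much data each combined split (and thus each mapper) gets.
    SET pig.splitCombination true;
    SET pig.maxCombinedSplitSize 134217728;  -- 128 MB per split
    logs = LOAD '/logs/extracted/*.log' USING TextLoader() AS (line:chararray);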

So what format should you reformulate the files into? Good question!

  • Just concatenating them all into one large block-level compressed SequenceFile might be the most challenging option, but the most rewarding in terms of performance.
  • Another is to ignore compression entirely and just explode those files out, or at least concatenate them (you do take a performance hit without compression).
  • Finally, you could blob the files into ~100 MB chunks and then gzip each chunk (the latter two options are sketched below).
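
Here is a rough Pig sketch of the latter two options, assuming the logs have already been extracted from the tarball onto HDFS (all paths are hypothetical). Dropping the two compression properties gives the plain concatenation variant:

    -- sketch with hypothetical paths; each mapper reads ~100 MB of
    -- input and writes a single gzipped part file, so the output
    -- lands in roughly 100 MB chunks (smaller after compression).
    SET pig.maxCombinedSplitSize 104857600;  -- ~100 MB of input per mapper
    SET output.compression.enabled true;
    SET output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

    raw = LOAD '/logs/extracted' USING TextLoader() AS (line:chararray);
    STORE raw INTO '/logs/repacked' USING PigStorage();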

I think it would be completely reasonable to write some sort of tarball loader into piggybank, but personally I would rather just lay the data out differently.

answered Nov 15 '22 by Donald Miner