I have log files in a tarball (access.logs.tar.gz) loaded into my Hadoop cluster. Is there a way to load it into Pig directly, without untarring it?
Tarballs are often used to back up personal or system files in place to create an archive, especially prior to making changes that might have to be reversed.
tar itself does not support compression directly. It is most commonly used in tandem with an external compression utility such as gzip or bzip2. These compression utilities generally only compress a single file, hence the pairing with tar, which can produce a single file from many files.
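Since neither Pig nor Hadoop ships a tar-aware input format, the usual workaround is to pull the archive out of HDFS, unpack it, and push the individual log files back in. A rough sketch from Pig's grunt shell (the sh command needs Pig 0.8+, and the /data and /tmp paths are just placeholders for illustration):

    -- copy the archive out of HDFS to local disk
    fs -copyToLocal /data/access.logs.tar.gz /tmp/access.logs.tar.gz
    -- unpack it locally (sh runs a shell command from grunt)
    sh mkdir -p /tmp/access_logs
    sh tar -xzf /tmp/access.logs.tar.gz -C /tmp/access_logs
    -- push the extracted log files back into HDFS
    fs -put /tmp/access_logs /data/access_logs

The same steps work just as well from an ordinary shell with hadoop fs -copyToLocal and hadoop fs -put if you are not already in grunt.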
@ChrisWhite's answer is technically correct and you should accept his answer instead of mine (IMO at least).
You need to get away from .tar.gz files with Hadoop. Gzip files are not splittable, so if your gzip files are large you're going to see hotspotting in your mappers: a single mapper has to read each file from start to finish. For example, if you have a .tar.gz file that is 100 GB, you aren't going to be able to split that computation at all.
Let's say, on the other hand, that the files are tiny. In that case Pig will do a nice job of combining them into larger splits and the splitting problem goes away. The downside is that you are now burdening the NameNode with tons of tiny files. Also, since the files are tiny, it should be relatively cheap computationally to reform them into a more reasonable format.
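If you do point Pig at a directory full of small extracted files, that combining is controlled by Pig's split-combination settings. A small illustration (property names as given in the Pig performance docs; the path is hypothetical):

    -- combine many small input files into fewer map splits
    SET pig.splitCombination true;          -- on by default in recent Pig versions
    SET pig.maxCombinedSplitSize 134217728; -- aim for roughly 128 MB per split

    -- /data/access_logs stands in for the directory of extracted log files
    logs = LOAD '/data/access_logs' USING TextLoader() AS (line:chararray);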
So what format should you reformulate the files into? Good question!
I think it would be completely reasonable to write some sort of tarball loader into piggybank, but I personally would just rather lay the data out differently.
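One way to do that re-layout with Pig itself is a single pass that reads the small extracted files and stores them back as a handful of larger files, optionally compressed with a splittable codec such as bzip2. A sketch, with paths made up and the compression properties as I remember them from the Pig docs (worth double-checking against your Pig release):

    -- turn on compressed output for PigStorage (verify these property names for your version)
    SET output.compression.enabled true;
    SET output.compression.codec org.apache.hadoop.io.compress.BZip2Codec;

    raw = LOAD '/data/access_logs' USING TextLoader() AS (line:chararray);
    -- write the logs back out as fewer, larger, block-compressed files
    STORE raw INTO '/data/access_logs_repacked' USING PigStorage();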