
How do I store gzipped files using PigStorage in Apache Pig?

Tags:

apache-pig

Apache Pig v0.7 can read gzipped files with no extra effort on my part, e.g.:

MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);

I can process that data and output it to disk okay:

PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');

But the output file isn't compressed:

/tmp/usercount/part-r-00000

Is there a way to tell the STORE command to write its output in gzip format? Ideally I'd like an answer that applies to Pig 0.6, since I want to use Amazon Elastic MapReduce, but if there's a solution for any version of Pig I'd like to hear it.

PP. asked Feb 11 '11 12:02



2 Answers

There are two ways:

  1. Give the output directory in the STORE statement a ".gz" extension, e.g. usercount.gz:

    STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');

  2. Set the compression method in your script:

    set output.compression.enabled true;
    set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
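
Putting the second option together with the script from the question, a minimal sketch might look like the following. It assumes the Pig version in use honors these two properties, which is not confirmed here for Pig 0.6; the paths and relation names are simply the ones from the question.

```pig
-- Sketch only: assumes this Pig version honors the two properties below.
-- Turn on compressed output and select the gzip codec for STORE.
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

-- Same pipeline as in the question: count rows per user.
MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);
PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;

-- The output path keeps its plain name; compression comes from the properties.
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');
```

With this approach the output directory does not need a ".gz" suffix. It is worth listing the output directory afterwards to confirm that the part files were actually written compressed.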

ysr answered Mar 15 '23 20:03


For Pig 0.8.0, the answer is as simple as giving your output path a ".gz" extension (or ".bz" should you prefer bzip).

The last line of your code should be amended to read:

STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');

In your example, the output file would then be found at:

/tmp/usercount.gz/part-r-00000.gz

For more information, see: https://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#PigStorage

medriscoll answered Mar 15 '23 21:03