Apache Pig v0.7 can read gzipped files with no extra effort on my part, e.g.:
MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);
I can process that data and output it to disk okay:
PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');
But the output file isn't compressed:
/tmp/usercount/part-r-00000
Is there a way of telling the STORE command to output content in gzip format? Note that ideally I'd like an answer applicable for Pig 0.6 as I wish to use Amazon Elastic MapReduce; but if there's a solution for any version of Pig I'd like to hear it.
Now load the data from the file student_data.txt into Pig by executing the following Pig Latin statement in the Grunt shell:
grunt> student = LOAD 'hdfs://localhost:9000/pig_data/student_data.txt' USING PigStorage(',') AS (id:int, firstname:chararray, lastname:chararray, phone:chararray, city:chararray);
You can store the loaded data in the file system using the store operator. This chapter explains how to store data in Apache Pig using the Store operator.
PigStorage() is the default load/store function in Pig. It loads and stores data as structured text files and takes one parameter: the delimiter that separates the fields of each tuple. The default delimiter is the tab character ('\t').
There are two ways:
Give the output directory a .gz extension, for example usercount.gz:
STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');
Or set the compression method in your script:
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
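Put together with the statements from the question, the second approach might look like the sketch below. It only reuses the relation names and paths from the question and the two properties named above. Whether a given Pig version (0.6 in particular) honors these properties is not confirmed here, so treat it as something to try rather than a guaranteed recipe.

```pig
-- Sketch only: assumes your Pig version honors the two output.compression.* properties.
set output.compression.enabled true;
set output.compression.codec org.apache.hadoop.io.compress.GzipCodec;

MyData = LOAD '/tmp/data.csv.gz' USING PigStorage(',') AS (timestamp, user, url);
PerUser = GROUP MyData BY user;
UserCount = FOREACH PerUser GENERATE group AS user, COUNT(MyData) AS count;

-- The output path has no .gz extension here; compression is requested by the properties above.
STORE UserCount INTO '/tmp/usercount' USING PigStorage(',');
```

Setting the codec in the script keeps the output directory name unchanged, which can matter if downstream jobs expect the original path.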
For Pig r0.8.0 the answer is as simple as giving your output path an extension of ".gz" (or ".bz" should you prefer bzip).
The last line of your code should be amended to read:
STORE UserCount INTO '/tmp/usercount.gz' USING PigStorage(',');
Per your example, your output file would then be found at:
/tmp/usercount.gz/part-r-00000.gz
For more information, see: https://pig.apache.org/docs/r0.8.1/piglatin_ref2.html#PigStorage