 

Storing results of UNION in PIG in a single file

I have a Pig script which produces four results, and I want to store all of them in a single file. I tried using UNION; however, when I use UNION I get four files: part-m-00000, part-m-00001, part-m-00002, part-m-00003. Can't I get a single file?

Here is the Pig script:

A = UNION Message_1,Message_2,Message_3,Message_4 into 'AA';

Inside the AA folder I get 4 files as mentioned above. Can't I get a single file with all entries in it?

Uno asked Jun 08 '12 19:06

People also ask

Which function is used to store the output in Pig?

Pig's store function is, in many ways, a mirror image of the load function. It is built on top of Hadoop's OutputFormat. It takes Pig Tuples and creates key-value pairs that its associated output format writes to storage.

Which Pig operator is used to save data into a file?

You can store the loaded data in the file system using the store operator. This chapter explains how to store data in Apache Pig using the Store operator.
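For instance, a minimal STORE statement might look like this (the relation name, output path, and delimiter here are illustrative):

STORE A INTO 'output_dir' USING PigStorage(',');
-- writes relation A to the HDFS directory 'output_dir', one comma-delimited record per line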

How do I load multiple files in Pig?

Pig supports glob patterns in file names. So you can provide something like: A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/HLPCA*' USING PigStorage(','); and it will load all files whose names start with 'HLPCA' in the Household directory.


1 Answer

Pig is doing the right thing here and is unioning the data sets. Being in one file doesn't mean one data set in Hadoop; a data set in Hadoop is usually a folder. Since Pig doesn't need to run a reduce phase here, it's not going to.

You need to fool Pig into running a map AND a reduce. The way I usually do this is:

set default_parallel 1

...
A = UNION Message_1, Message_2, Message_3, Message_4;
B = GROUP A BY 1; -- group ALL of the records together
C = FOREACH B GENERATE FLATTEN(A);
...

The GROUP BY groups all of the records together, and then the FLATTEN explodes that list back out.
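Putting the pieces together, a complete script might look like the following sketch. The relation names and the 'AA' output path are taken from the question; with default_parallel set to 1, the forced reduce should leave a single part-r file inside the output folder:

set default_parallel 1;

A = UNION Message_1, Message_2, Message_3, Message_4;
B = GROUP A BY 1;                  -- group ALL of the records together, forcing a reduce
C = FOREACH B GENERATE FLATTEN(A); -- explode the grouped bag back into individual records
STORE C INTO 'AA';                 -- single reducer => single part-r-00000 in 'AA'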


One thing to note here is that this isn't much different from doing:

$ hadoop fs -cat msg1.txt msg2.txt msg3.txt msg4.txt | hadoop fs -put - union.txt

(this is concatenating all of the text, and then writing it back out to HDFS as a new file)

This isn't parallel at all, but neither is funneling all of your data through one reducer.
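The same concatenation idea can be seen with ordinary local files; the HDFS version above just streams through `hadoop fs -cat` and `hadoop fs -put -` instead of plain `cat` and a redirect:

```shell
# Create three small input files (stand-ins for msg1.txt .. msg3.txt on HDFS)
printf 'one\n'   > msg1.txt
printf 'two\n'   > msg2.txt
printf 'three\n' > msg3.txt

# Concatenate them in order into a single output file
cat msg1.txt msg2.txt msg3.txt > union.txt

# union.txt now holds all input lines, in input order
cat union.txt
```

Like the HDFS variant, this funnels everything through one sequential stream, which is exactly the trade-off the single-reducer trick makes.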

Donald Miner answered Sep 23 '22 14:09