 

Storing results of UNION in PIG in a single file

I have a Pig script which produces four results, and I want to store all of them in a single file. I tried using UNION; however, when I use UNION I get four files: part-m-00000, part-m-00001, part-m-00002, part-m-00003. Can't I get a single file?

Here is the Pig script:

A = UNION Message_1,Message_2,Message_3,Message_4 into 'AA';

Inside the AA folder I get 4 files as mentioned above. Can't I get a single file with all entries in it?

Uno asked Jun 08 '12 19:06

People also ask

Which function is used to store the output in Pig?

Pig's store function is, in many ways, a mirror image of the load function. It is built on top of Hadoop's OutputFormat. It takes Pig Tuples and creates key-value pairs that its associated output format writes to storage.

Which Pig operator is used to save data into a file?

You can store the loaded data in the file system using the store operator. This chapter explains how to store data in Apache Pig using the Store operator.
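For instance, a minimal STORE statement might look like this (the relation name, output path, and delimiter here are illustrative):

STORE A INTO 'output_dir' USING PigStorage(',');
-- writes relation A to the HDFS directory 'output_dir', one comma-delimited record per line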

How do I load multiple files in Pig?

Pig supports glob patterns in file names. So you can provide something like: A = LOAD '/home/mrinmoy/Desktop/Sampath Project/Household/HLPCA*' USING PigStorage(','); and it will load all files whose names start with 'HLPCA' in the Household directory.


1 Answer

Pig is doing the right thing here and is unioning the data sets. Being in one file doesn't mean one data set in Hadoop; a data set in Hadoop is usually a folder. Since Pig doesn't need to run a reduce phase here, it's not going to.

You need to fool Pig into running a map AND a reduce. The way I usually do this is:

set default_parallel 1

...
A = UNION Message_1, Message_2, Message_3, Message_4;
B = GROUP A BY 1; -- group ALL of the records together
C = FOREACH B GENERATE FLATTEN(A);
...

The GROUP BY groups all of the records together, and then the FLATTEN explodes that list back out.
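Putting the pieces together, a complete script might look like the following sketch. The relation names and the 'AA' output path are taken from the question; with default_parallel set to 1, the forced reduce should leave a single part-r file inside the output folder:

set default_parallel 1;

A = UNION Message_1, Message_2, Message_3, Message_4;
B = GROUP A BY 1;                  -- group ALL of the records together, forcing a reduce
C = FOREACH B GENERATE FLATTEN(A); -- explode the grouped bag back into individual records
STORE C INTO 'AA';                 -- single reducer => single part-r-00000 in 'AA'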


One thing to note here is that this isn't much different from doing:

$ hadoop fs -cat msg1.txt msg2.txt msg3.txt msg4.txt | hadoop fs -put - union.txt

(this is concatenating all of the text, and then writing it back out to HDFS as a new file)

This isn't parallel at all, but neither is funneling all of your data through one reducer.
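The same concatenation idea can be seen with ordinary local files; the HDFS version above just streams through `hadoop fs -cat` and `hadoop fs -put -` instead of plain `cat` and a redirect:

```shell
# Create three small input files (stand-ins for msg1.txt .. msg3.txt on HDFS)
printf 'one\n'   > msg1.txt
printf 'two\n'   > msg2.txt
printf 'three\n' > msg3.txt

# Concatenate them in order into a single output file
cat msg1.txt msg2.txt msg3.txt > union.txt

# union.txt now holds all input lines, in input order
cat union.txt
```

Like the HDFS variant, this funnels everything through one sequential stream, which is exactly the trade-off the single-reducer trick makes.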

Donald Miner answered Sep 23 '22 14:09