
Merging two datasets in Pig

I have a Pig script in which I load a dataset, divide it into two separate datasets, perform some calculations on each, and finally add another computed field to each. Now I want to merge these two datasets back together.

A = LOAD '/user/hdfs/file1' AS (a:int, b:int);

A1 = FILTER A BY a > 100;
A2 = FILTER A BY a <= 100 AND b > 100;

-- Now I do some calculation on A1 and A2

So essentially, after the calculation, here is the schema for both:

{A1: {a:int, b:int, type:chararray}}
{A2: {a:int, b:int, type:chararray}}

Now, before I write this back to HDFS, I want to merge the two datasets into one, something like UNION ALL in SQL. How can I do that?

divinedragon asked Jan 11 '13 12:01



1 Answer

UNION should work for you. Note, though, that your original schema does not match the output shown (b is loaded as a chararray and later becomes an int); I'm assuming this is a typo.

If the tuples have fields in differing orders, you can use the ONSCHEMA keyword when performing the UNION:

A_MERGED = UNION ONSCHEMA A1, A2;
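Putting the pieces together, a minimal end-to-end sketch might look like the following. The FOREACH steps are only a hypothetical stand-in for the calculation the question leaves out, and the constant labels ('high_a', 'high_b'), the intermediate alias names, and the output path are assumptions made for illustration.

```pig
-- Load the source data (path and schema as given in the question).
A = LOAD '/user/hdfs/file1' AS (a:int, b:int);

-- Split into the two subsets from the question.
A1 = FILTER A BY a > 100;
A2 = FILTER A BY a <= 100 AND b > 100;

-- Hypothetical stand-in for the real calculation: tag each subset
-- with a constant chararray so both relations share one schema.
A1_TYPED = FOREACH A1 GENERATE a, b, 'high_a' AS type:chararray;
A2_TYPED = FOREACH A2 GENERATE a, b, 'high_b' AS type:chararray;

-- Both relations have identical fields in the same order, so a plain
-- UNION is enough; ONSCHEMA is only needed when the field order differs.
A_MERGED = UNION A1_TYPED, A2_TYPED;

-- Hypothetical output path.
STORE A_MERGED INTO '/user/hdfs/merged_output';
```

Pig's UNION does not eliminate duplicate tuples, so it already behaves like UNION ALL in SQL; apply DISTINCT to the result if duplicates should be removed. It also does not preserve the order of tuples.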

EDIT: See the Pig Latin documentation for UNION.

Chris White answered Sep 20 '22 21:09
