Merging two datasets in Pig

Question

I have a Pig script in which I am loading a dataset, dividing it into two separate datasets, performing some calculations on each, and finally adding another computed field to it. Now I want to join these two datasets back together.

A = LOAD '/user/hdfs/file1' AS (a:int, b:int);

A1 = FILTER A BY a > 100;
A2 = FILTER A BY a <= 100 AND b > 100;

-- Now I do some calculation on A1 and A2

So essentially, after the calculation, here is the schema for both:

{A1: {a:int, b:int, type:chararray}}
{A2: {a:int, b:int, type:chararray}}

Now, before I dump this back to HDFS, I want to merge the two datasets back together, something like UNION ALL in SQL. How can I do that?

Chris White · Accepted Answer

UNION should work for you, but your original schema does not match the output shown (b is loaded as a chararray and later becomes an int). I'm assuming this is a typo.
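As a minimal sketch of the plain form, assuming A1 and A2 already carry the same schema (a:int, b:int, type:chararray) with fields in the same order, the merge could look like the following. The output path is hypothetical. Note that Pig's UNION keeps duplicate tuples (so it behaves like UNION ALL in SQL) and does not preserve tuple order.

```pig
-- Sketch only: A1 and A2 are assumed to have identical schemas,
-- with fields in the same order.
A_MERGED = UNION A1, A2;

-- '/user/hdfs/merged' is a hypothetical output path.
STORE A_MERGED INTO '/user/hdfs/merged';
```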

If the tuples have fields in differing orders, you can use the ONSCHEMA keyword when performing the UNION:

A_MERGED = UNION ONSCHEMA A1, A2;
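To show where ONSCHEMA helps, here is a rough end-to-end sketch built on the script from the question. The calculation step is not shown in the question, so the FOREACH statements, the 'high_a' and 'high_b' labels, the A1_TYPED and A2_TYPED aliases, and the output path are all assumptions made for illustration.

```pig
A = LOAD '/user/hdfs/file1' AS (a:int, b:int);

A1 = FILTER A BY a > 100;
A2 = FILTER A BY a <= 100 AND b > 100;

-- Hypothetical calculation: tag each branch with a constant label.
-- A2_TYPED deliberately emits its fields in a different order.
A1_TYPED = FOREACH A1 GENERATE a, b, 'high_a' AS type;
A2_TYPED = FOREACH A2 GENERATE 'high_b' AS type, a, b;

-- ONSCHEMA matches fields by name rather than by position,
-- so the differing field order is not a problem.
A_MERGED = UNION ONSCHEMA A1_TYPED, A2_TYPED;

-- Print the merged schema, then write the result.
-- '/user/hdfs/merged' is a hypothetical output path.
DESCRIBE A_MERGED;
STORE A_MERGED INTO '/user/hdfs/merged';
```

ONSCHEMA requires every input relation to have a known schema, which is the case here because each FOREACH names all of its fields.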

EDIT: See the Pig Latin documentation for UNION.

Tags: hadoop, apache-pig, piglet

divinedragon
