I have a Pig script in which I load a dataset, divide it into two separate datasets, perform some calculations on each, and finally add another computed field to each. Now I want to merge these two datasets back together.
A = LOAD '/user/hdfs/file1' AS (a:int, b:int);
A1 = FILTER A BY a > 100;
A2 = FILTER A BY a <= 100 AND b > 100;
-- Now I do some calculation on A1 and A2
So essentially, after the calculation, here is the schema for both:
{A1 : {a:int, b:int, type:chararray}}
{A2: {a:int, b:int, type:chararray}}
Now, before I dump this back to HDFS, I want to merge the two datasets back together, something like UNION ALL in SQL. How can I do that?
UNION should work for you, but your original schema does not match the output shown (b is loaded as a chararray and later becomes an int). I'm assuming this is a typo.
If the tuples have fields in differing orders, you can use the ONSCHEMA keyword when performing the UNION:
A_MERGED = UNION ONSCHEMA A1, A2;
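For context, here is a minimal sketch of how the whole script might look end to end. The FOREACH steps are only placeholders for your real calculation, and the relation names A1_TYPED and A2_TYPED, the literal values 'high_a' and 'high_b', and the output path are assumptions made up for illustration.

```pig
A = LOAD '/user/hdfs/file1' AS (a:int, b:int);
A1 = FILTER A BY a > 100;
A2 = FILTER A BY a <= 100 AND b > 100;

-- Placeholder for the real calculation: tag each branch with a constant.
A1_TYPED = FOREACH A1 GENERATE a, b, 'high_a' AS type:chararray;
A2_TYPED = FOREACH A2 GENERATE a, b, 'high_b' AS type:chararray;

-- Merge the two relations by field name rather than by position.
A_MERGED = UNION ONSCHEMA A1_TYPED, A2_TYPED;

-- Hypothetical output location; substitute your own path.
STORE A_MERGED INTO '/user/hdfs/merged_output';
```

Note that Pig's UNION does not eliminate duplicate tuples, so it behaves like UNION ALL in SQL, and it does not preserve the order of tuples.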
EDIT: See the Pig Latin documentation for UNION.