I am beginner, learning Pig latin. Need to extract the records from the file. Have created two files T1 and T2, Some tuples are common to both the files, So need to extract the tuples present only in T1 and need to omit the common tuples between T1 & T2. Can someone please help me...
Thanks
Firstly you'll want to take a look at this Venn Diagram
. What you want is everything but the middle bit. So first you need to do a full outer JOIN
on the data. Then, since nulls
are created in an outer JOIN when the key is not common, you will want to filter the result of the JOIN to only contain lines that have one null (the non-intersecting part of the Venn Diagram).
This is how it would look in a pig script:
-- T1 and T2 are the two sets of tuples you are using, their schemas are:
-- T1: {t: (num1: int, num2: int)}
-- T2: {t: (num1: int, num2: int)}
-- Yours will be different, but the principle is the same
B = JOIN T1 BY t FULL, T2 BY t ;
C = FILTER B BY T1::t is null OR T2::t is null ;
D = FOREACH C GENERATE (T1::t is not null? T1::t : A2::t) ;
Walking through the steps using this sample input:
T1: T2:
(1,2) (4,5)
(3,4) (1,2)
B
does the full outer JOIN resulting in:
B: {T1::t: (num1: int,num2: int),T2::t: (num1: int,num2: int)}
((1,2),(1,2))
(,(4,5))
((3.4),)
T1
is the left tuple, and T2
is the right tuple. We have to use ::
to identify which t
, since they have the same name.
Now, C
filters B
so that only lines with a null are kept. Resulting in:
C: {T1::t: (num1: int,num2: int),T2::t: (num1: int,num2: int)}
(,(4,5))
((3.4),)
This is the output you want, but it is a little messy to use. D
uses a bincond
(the ?:) to remove the null. So the final output will be:
D: {T1::t: (num1: int,num2: int)}
((4,5))
((3.4))
Update:
If you want to keep only the left (T1) (or right (T2) if you switch things around) side of the join. You can do this:
-- B is the same
-- We only want to keep tuples where the T2 tuple is null
C = FILTER B BY T2::t is null ;
-- Generate T1::t to get rid of the null T2::t
D = FOREACH C GENERATE T1::t ;
However, looking back at the original Venn Diagram, using a full JOIN
is unnecessary. If you look at this different Venn Diagram
, you can see that this covers the set you want without any extra operations. Therefore, you should change B
to:
B = JOIN T1 BY t LEFT, T2 BY t ;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With