Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

extracting non-matching records between files in Pig Latin

I am beginner, learning Pig latin. Need to extract the records from the file. Have created two files T1 and T2, Some tuples are common to both the files, So need to extract the tuples present only in T1 and need to omit the common tuples between T1 & T2. Can someone please help me...

Thanks

like image 872
user2639239 Avatar asked Feb 16 '23 17:02

user2639239


1 Answers

Firstly you'll want to take a look at this Venn Diagram. What you want is everything but the middle bit. So first you need to do a full outer JOIN on the data. Then, since nulls are created in an outer JOIN when the key is not common, you will want to filter the result of the JOIN to only contain lines that have one null (the non-intersecting part of the Venn Diagram).

This is how it would look in a pig script:

-- T1 and T2 are the two sets of tuples you are using, their schemas are:
-- T1: {t: (num1: int, num2: int)}
-- T2: {t: (num1: int, num2: int)}
-- Yours will be different, but the principle is the same

B = JOIN T1 BY t FULL, T2 BY t ;
C = FILTER B BY T1::t is null OR T2::t is null ;
D = FOREACH C GENERATE (T1::t is not null? T1::t : A2::t) ;

Walking through the steps using this sample input:

T1:      T2:
(1,2)    (4,5)
(3,4)    (1,2)

B does the full outer JOIN resulting in:

B: {T1::t: (num1: int,num2: int),T2::t: (num1: int,num2: int)}
((1,2),(1,2))
(,(4,5))
((3.4),)

T1 is the left tuple, and T2 is the right tuple. We have to use :: to identify which t, since they have the same name.

Now, C filters B so that only lines with a null are kept. Resulting in:

C: {T1::t: (num1: int,num2: int),T2::t: (num1: int,num2: int)}
(,(4,5))
((3.4),)

This is the output you want, but it is a little messy to use. D uses a bincond (the ?:) to remove the null. So the final output will be:

D: {T1::t: (num1: int,num2: int)}
((4,5))
((3.4))

Update:
If you want to keep only the left (T1) (or right (T2) if you switch things around) side of the join. You can do this:

-- B is the same

-- We only want to keep tuples where the T2 tuple is null
C = FILTER B BY T2::t is null ;
-- Generate T1::t to get rid of the null T2::t
D = FOREACH C GENERATE T1::t ;

However, looking back at the original Venn Diagram, using a full JOIN is unnecessary. If you look at this different Venn Diagram, you can see that this covers the set you want without any extra operations. Therefore, you should change B to:

B = JOIN T1 BY t LEFT, T2 BY t ;
like image 124
mr2ert Avatar answered Mar 23 '23 01:03

mr2ert