Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Self cross-join in pig is disregarded

If one have data like those:

A = LOAD 'data' AS (a1:int,a2:int,a3:int);

DUMP A;
(1,2,3)
(4,2,1)

And then a cross-join is done on A, A:

B = CROSS A, A;

DUMP B;
(1,2,3)
(4,2,1)

Why is second A optimized out from the query?

info: pig version 0.11

== UPDATE ==

If I sort A like:

C = ORDER A BY a1;
D = CROSS A, C;

It will give a correct cross-join.

like image 998
Artem Oboturov Avatar asked Mar 06 '13 19:03

Artem Oboturov


People also ask

How to join 2 tables in Pig?

Here is how you can perform a JOIN operation on two tables using multiple keys. grunt> Relation3_name = JOIN Relation2_name BY (key1, key2), Relation3_name BY (key1, key2);

How do you create a relationship with a pig?

Now, you can use 'C' as the 'empty relation' that has one empty tuple. Show activity on this post. DEFINE GenerateRelationFromString(string) RETURNS relation { temp = LOAD 'somefile'; tempLimit1 = LIMIT temp 1; $relation = FOREACH tempLimit1 GENERATE FLATTEN(TOKENIZE('$string', ',')); };

How do you make a pig table?

There is no table concept in PIG. The closest to a table is a Relation. "A Pig relation is similar to a table in a relational database, where the tuples in the bag correspond to the rows in a table.


Video Answer


2 Answers

davek is correct -- you cannot CROSS (or JOIN) a relation with itself. If you wish to do this, you must create a copy of the data. In this case, you can use another LOAD statement. If you want to do this with a relation further down a pipeline, you'll need to duplicate it using FOREACH.

I have several macros that I use frequently and IMPORT by default in all of my Pig scripts in case I need them. One is used for just this purpose:

DEFINE DUPLICATE(in) RETURNS out
{
        $out = FOREACH $in GENERATE *;
};

This will work for you wherever in your pipeline you need a duplicate:

A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = DUPLICATE(A1);
B = CROSS A1, A2;

Note that even though A1 and A2 are identical, you cannot assume that the records are in the same order. But if you are doing a CROSS or JOIN, this probably doesn't matter.

like image 73
reo katoa Avatar answered Sep 25 '22 14:09

reo katoa


I think you have to load the data twice to achieve what you want.

i.e.

A1 = LOAD 'data' AS (a1:int,a2:int,a3:int);
A2 = LOAD 'data' AS (a1:int,a2:int,a3:int);
B = CROSS A1, A2;
like image 26
davek Avatar answered Sep 23 '22 14:09

davek