The manual/documentation uses the terms 'inner bag' and 'outer bag' extensively (e.g. http://pig.apache.org/docs/r0.11.1/basic.html), and yet I haven't been able to pin down a precise definition that separates the two.
e.g. all inherently interrelated:
To make the discussion concrete, here is an example:
grunt> dump A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
grunt> W1 = GROUP A ALL;
grunt> W2 = GROUP W1 ALL;
grunt> W3 = GROUP W2 ALL;
grunt> W4 = GROUP W3 ALL;
grunt> describe W4;
W4: {group: chararray,W3: {(group: chararray,W2: {(group: chararray,W1: {(group: chararray,A: {(f1: int,f2: int,f3: int)})})})}}
grunt> illustrate W4;
(1,2,3)
---------------------------------------------------
| A | f1:int | f2:int | f3:int |
---------------------------------------------------
| | 1 | 2 | 3 |
| | 8 | 3 | 4 |
---------------------------------------------------
------------------------------------------------------------------------------------------------
| W1 | group:chararray | A:bag{:tuple(f1:int,f2:int,f3:int)} |
------------------------------------------------------------------------------------------------
| | all | {(1, 2, 3), (8, 3, 4)} |
------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------
| W2 | group:chararray | W1:bag{:tuple(group:chararray,A:bag{:tuple(f1:int,f2:int,f3:int)})} |
-----------------------------------------------------------------------------------------------------------------------------------------------
| | all | {(all, {(1, 2, 3), (8, 3, 4)})} |
-----------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| W3 | group:chararray | W2:bag{:tuple(group:chararray,W1:bag{:tuple(group:chararray,A:bag{:tuple(f1:int,f2:int,f3:int)})})} |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| | all | {(all, {(all, {(1, 2, 3), (8, 3, 4)})})} |
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| W4 | group:chararray | W3:bag{:tuple(group:chararray,W2:bag{:tuple(group:chararray,W1:bag{:tuple(group:chararray,A:bag{:tuple(f1:int,f2:int,f3:int)})})})} |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| | all | {(all, {(all, {(all, {(1, 2, 3), (8, 3, 4)})})})} |
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
grunt> dump W4;
(all,{(all,{(all,{(all,{(1,2,3),(4,2,1),(8,3,4),(4,3,3)})})})})
Amongst the bags W1, W2, W3 and W4: which is inner, and which is outer?
Outer Bag: An outer bag is nothing but a relation. In the below example, A is a relation or bag of tuples. You can think of this bag as an outer bag.
A = LOAD 'data' as (f1:int, f2:int, f3:int);
DUMP A;
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
Pig Latin – Data Model
A bag is a collection of tuples. A tuple is an ordered set of fields. A field is a piece of data.
The outer bag is actually relation A. This is a little weird, but it'll become clear once you know what an inner bag is. Let's just look at W1, for readability, since having the nested bags does not change the answer.
Schema and output for W1:
W1: {group:chararray, A:bag{:tuple(f1:int,f2:int,f3:int)}}
(all,{(1, 2, 3), (8, 3, 4)})
We can see there is a field in W1 named A, which is a bag. This is an inner bag because the bag is a field in the relation.
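To make "a bag that is a field" concrete, here is a rough Python model of what GROUP ... ALL produces. This is only an analogy, not how Pig is implemented: the helper name group_all is made up for illustration, and a Python list stands in for a bag even though real Pig bags are unordered.

```python
# Rough Python model of Pig's data model (an analogy, not Pig's implementation).
# A bag is modelled as a list of tuples; real Pig bags carry no ordering guarantee.

def group_all(relation):
    """Mimic `GROUP relation ALL`: a single tuple holding the key 'all' and a bag."""
    return [("all", list(relation))]

A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3)]  # relation A
W1 = group_all(A)                                  # W1 = GROUP A ALL;

group_key, inner_bag = W1[0]   # the one tuple in W1 has two fields
print(group_key)               # all
print(len(W1))                 # 1
print(len(inner_bag))          # 4
```

The second field of W1's only tuple is itself a bag, and that position (inside a tuple) is what makes it an inner bag.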
Remember that a bag is just an unordered collection of tuples (duplicates are allowed), which is what we see inside the output for W1. Now, look at the output of relation A:
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)
Pig does not guarantee the order of these tuples (unless you ORDER them or something). So, if you think about it, relation A is really just an unordered collection of tuples. This is an outer bag.
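Applied to the nested example from the question, the same idea can be sketched in Python. Again this is only a model under assumptions: group_all and bag_depths are hypothetical helpers, and lists stand in for bags. Depth 0 is the relation you DUMP (the outer bag), and every deeper bag is an inner bag.

```python
# Python analogy for the nested GROUP ... ALL example (not Pig's implementation).

def group_all(relation):
    """Mimic `GROUP relation ALL`: a single tuple holding the key 'all' and a bag."""
    return [("all", list(relation))]

def bag_depths(bag, depth=0):
    """Yield the nesting depth of every bag reachable from `bag` (0 = the relation)."""
    yield depth
    for tup in bag:
        for field in tup:
            if isinstance(field, list):  # a field that is itself a bag
                yield from bag_depths(field, depth + 1)

A = [(1, 2, 3), (4, 2, 1), (8, 3, 4), (4, 3, 3)]
W4 = group_all(group_all(group_all(group_all(A))))  # W1 .. W4 from the question

depths = list(bag_depths(W4))
print(depths)  # [0, 1, 2, 3, 4]
```

In this model, relation W4 sits at depth 0 and is the outer bag, while the fields W3, W2, W1 and A nested inside it (depths 1 to 4) are all inner bags.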
You can find some examples of this here.