
What is the distinction between an 'outer bag' and an 'inner bag' in Pig Latin?

Tags:

apache-pig

The manual/documentation uses the terms 'inner bag' and 'outer bag' extensively (for example: http://pig.apache.org/docs/r0.11.1/basic.html ), and yet I haven't been able to pin down the precise definition that separates the two.

For example, these closely related questions:

  • If I give you a bag 'foo', what would you need to know to label foo as an 'inner bag' vs. an 'outer bag'?
  • Is any bag that is not the outermost bag an 'inner bag'?
  • Are the labels 'inner' and 'outer' always mutually exclusive?
  • In Pig Latin, are all bags relations, or is only the outermost bag a relation (and inner bags are not relations)?

To create an example to discuss:

grunt> dump A;      
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)


grunt> W1 = GROUP A   ALL;         
grunt> W2 = GROUP W1  ALL;
grunt> W3 = GROUP W2  ALL;
grunt> W4 = GROUP W3  ALL;

grunt> describe W4;
W4: {group: chararray,W3: {(group: chararray,W2: {(group: chararray,W1: {(group: chararray,A: {(f1: int,f2: int,f3: int)})})})}}


grunt> illustrate W4;
(1,2,3)
---------------------------------------------------
| A     | f1:int      | f2:int      | f3:int      | 
---------------------------------------------------
|       | 1           | 2           | 3           | 
|       | 8           | 3           | 4           | 
---------------------------------------------------
------------------------------------------------------------------------------------------------
| W1     | group:chararray      | A:bag{:tuple(f1:int,f2:int,f3:int)}                          | 
------------------------------------------------------------------------------------------------
|        | all                  | {(1, 2, 3), (8, 3, 4)}                                       | 
------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------
| W2     | group:chararray      | W1:bag{:tuple(group:chararray,A:bag{:tuple(f1:int,f2:int,f3:int)})}                                         | 
-----------------------------------------------------------------------------------------------------------------------------------------------
|        | all                  | {(all, {(1, 2, 3), (8, 3, 4)})}                                                                             | 
-----------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| W3     | group:chararray      | W2:bag{:tuple(group:chararray,W1:bag{:tuple(group:chararray,A:bag{:tuple(f1:int,f2:int,f3:int)})})}                                                        | 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|        | all                  | {(all, {(all, {(1, 2, 3), (8, 3, 4)})})}                                                                                                                   | 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| W4     | group:chararray      | W3:bag{:tuple(group:chararray,W2:bag{:tuple(group:chararray,W1:bag{:tuple(group:chararray,A:bag{:tuple(f1:int,f2:int,f3:int)})})})}                                                                       | 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|        | all                  | {(all, {(all, {(all, {(1, 2, 3), (8, 3, 4)})})})}                                                                                                                                                         | 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

grunt> dump W4;
(all,{(all,{(all,{(all,{(1,2,3),(4,2,1),(8,3,4),(4,3,3)})})})})

Among the bags W1, W2, W3 and W4, which is inner and which is outer?

Matt S. asked Oct 08 '13 01:10




1 Answer

An outer bag is just a relation, so relation A is itself an outer bag. This sounds a little weird, but it becomes clear once you know what an inner bag is. Let's look only at W1 for readability, since the extra levels of nesting do not change the answer.

Schema and output for W1:

W1: {group:chararray, A:bag{:tuple(f1:int,f2:int,f3:int)}}
(all,{(1, 2, 3), (8, 3, 4)})

We can see there is a field in W1 named A, which is a bag. This is an inner bag, because the bag is a field in the relation.

Remember that bags are just unordered collections of tuples, as we can see in the output for W1. Now, look at the output of relation A:

(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)

Pig does not guarantee the order of these tuples (unless you use ORDER BY or similar). So, if you think about it, relation A is really just an unordered collection of tuples. This is an outer bag.

You can find some examples of this here.
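
As a rough sketch of the same distinction in Pig Latin itself: the 'data' path and the aliases C and F below are made up for illustration, and the schema is assumed to match the question's relation A. The point is only that an inner bag is a field value you hand to functions and operators inside a FOREACH, while an outer bag is the relation you apply statements to.

```
-- A is a relation, i.e. an outer bag of tuples (the 'data' path is hypothetical)
A = LOAD 'data' AS (f1:int, f2:int, f3:int);

-- W1 is also a relation (outer bag); its field named A is an inner bag
W1 = GROUP A ALL;

-- an inner bag is a value, so it can be passed to a function such as COUNT
C = FOREACH W1 GENERATE COUNT(A);

-- FLATTEN un-nests the inner bag, producing one output tuple per tuple in the bag
F = FOREACH W1 GENERATE FLATTEN(A);
```

Running describe on W1, C and F should make the difference visible: only W1 still carries a bag-typed field.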

mr2ert answered Oct 05 '22 06:10
