
What is the distinction between an 'outer bag' and an 'inner bag' in Pig Latin?

Tags:

apache-pig

The manual uses the terms 'inner bag' and 'outer bag' extensively (see, for example, http://pig.apache.org/docs/r0.11.1/basic.html ), yet I haven't been able to pin down the precise definition that separates the two.

For example, these questions are all interrelated:

If I give you a bag 'foo', what would you need to know to label foo as an 'inner bag' vs. an 'outer bag'?
Is any bag that is not the outermost bag an 'inner bag'?
Are the labels 'inner' and 'outer' always mutually exclusive?
In Pig Latin, are all bags relations, or is only the outermost bag a relation (so that inner bags are not relations)?

To create a concrete example for discussion:

grunt> dump A;      
(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)


grunt> W1 = GROUP A   ALL;         
grunt> W2 = GROUP W1  ALL;
grunt> W3 = GROUP W2  ALL;
grunt> W4 = GROUP W3  ALL;

grunt> describe W4;
W4: {group: chararray,W3: {(group: chararray,W2: {(group: chararray,W1: {(group: chararray,A: {(f1: int,f2: int,f3: int)})})})}}


grunt> illustrate W4;
(1,2,3)
---------------------------------------------------
| A     | f1:int      | f2:int      | f3:int      | 
---------------------------------------------------
|       | 1           | 2           | 3           | 
|       | 8           | 3           | 4           | 
---------------------------------------------------
------------------------------------------------------------------------------------------------
| W1     | group:chararray      | A:bag{:tuple(f1:int,f2:int,f3:int)}                          | 
------------------------------------------------------------------------------------------------
|        | all                  | {(1, 2, 3), (8, 3, 4)}                                       | 
------------------------------------------------------------------------------------------------
-----------------------------------------------------------------------------------------------------------------------------------------------
| W2     | group:chararray      | W1:bag{:tuple(group:chararray,A:bag{:tuple(f1:int,f2:int,f3:int)})}                                         | 
-----------------------------------------------------------------------------------------------------------------------------------------------
|        | all                  | {(all, {(1, 2, 3), (8, 3, 4)})}                                                                             | 
-----------------------------------------------------------------------------------------------------------------------------------------------
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| W3     | group:chararray      | W2:bag{:tuple(group:chararray,W1:bag{:tuple(group:chararray,A:bag{:tuple(f1:int,f2:int,f3:int)})})}                                                        | 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|        | all                  | {(all, {(all, {(1, 2, 3), (8, 3, 4)})})}                                                                                                                   | 
----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
| W4     | group:chararray      | W3:bag{:tuple(group:chararray,W2:bag{:tuple(group:chararray,W1:bag{:tuple(group:chararray,A:bag{:tuple(f1:int,f2:int,f3:int)})})})}                                                                       | 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
|        | all                  | {(all, {(all, {(all, {(1, 2, 3), (8, 3, 4)})})})}                                                                                                                                                         | 
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

grunt> dump W4;
(all,{(all,{(all,{(all,{(1,2,3),(4,2,1),(8,3,4),(4,3,3)})})})})

Among the bags W1, W2, W3, and W4, which is inner and which is outer?


asked Oct 08 '13 01:10

Matt S.

1 Answer

The outer bag is actually relation A itself. This may seem a little odd, but it becomes clear once you know what an inner bag is. Let's just look at W1 for readability, since the extra nesting in W2 through W4 does not change the answer.

Schema and output for W1:

W1: {group:chararray, A:bag{:tuple(f1:int,f2:int,f3:int)}}
(all,{(1, 2, 3), (8, 3, 4)})

We can see there is a field in W1 named A, which is a bag. This is an inner bag because the bag is a field inside a tuple of the relation.
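As a rough illustration of what "a bag that is a field" means in practice, here is a minimal sketch in Pig Latin. It assumes A has the schema shown in the question; the input path 'data' and the aliases C, n, and total_f1 are placeholders I made up, not anything from the original post.

```pig
-- Hypothetical input path; the schema is taken from the DESCRIBE output in the question.
A  = LOAD 'data' AS (f1:int, f2:int, f3:int);

-- W1 is a relation holding a single tuple: (all, {<every tuple of A>}).
W1 = GROUP A ALL;

-- Inside this FOREACH, the name A refers to the bag-valued field of each
-- W1 tuple (the inner bag), not to the original relation A.
-- Because it is a field value, it can be passed to functions that take a bag.
C  = FOREACH W1 GENERATE group, COUNT(A) AS n, SUM(A.f1) AS total_f1;

DUMP C;
```

The point of the sketch is only that the inner bag behaves like any other field: it is projected and passed to functions per tuple of W1, whereas a relation such as W1 is what you apply operators like GROUP and FOREACH to.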

Remember that bags are just unordered collections of tuples, and we can see this in the output for W1. Now, look at the output of relation A:

(1,2,3)
(4,2,1)
(8,3,4)
(4,3,3)

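Pig does not guarantee the order of these tuples (unless you use ORDER BY). So, if you think about it, relation A is really just an unordered collection of tuples. This is an outer bag. The same reasoning applies to W1 through W4: each of them is a relation, and therefore an outer bag, while the bag-valued field nested inside each of them is an inner bag.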

You can find some examples of this here.
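To see the relationship from the other direction, the following sketch un-nests the inner bag again with FLATTEN. As before, the path 'data' and the alias B are hypothetical, and the schema is assumed from the question.

```pig
-- Hypothetical input path; schema assumed from the question.
A  = LOAD 'data' AS (f1:int, f2:int, f3:int);

-- After grouping, the tuples of A live in an inner bag (the field A of W1).
W1 = GROUP A ALL;

-- FLATTEN un-nests that inner bag: each tuple it contains becomes
-- a top-level tuple of the new relation B.
B  = FOREACH W1 GENERATE FLATTEN(A);

DESCRIBE B;
DUMP B;
```

Under these assumptions B should contain the same tuples as the original relation A, in no guaranteed order, which shows that the same collection of tuples is an outer bag when it is the relation itself and an inner bag when it sits in a field of another relation's tuples.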


answered Oct 05 '22 06:10

mr2ert
