How to flatten a group into a single tuple in Pig?

From this:

(1, {(1,2), (1,3), (1,4)} )
(2, {(2,5), (2,6), (2,7)} )

...How could we generate this?

((1,2),(1,3),(1,4))
((2,5),(2,6),(2,7))

...And how could we generate this?

(1, 2, 3, 4)
(2, 5, 6, 7)

I know how to do this for a single row. The problem arises when I have to iterate over many rows and manipulate the inner groups at the same time.

user2730009 asked Aug 31 '13 04:08



2 Answers

For your question, I prepared the following file:

1,2
1,3
1,4
2,5
2,6
2,7

First, I used the following script to build the relation r3 that you described in your question:

r1 = load 'test_file' using PigStorage(',') as (a:int, b:int);
r2 = group r1 by a;
r3 = foreach r2 generate group as a, r1 as b;
describe r3;
-- r3: {a: int,b: {(a: int,b: int)}}
-- r3 is like (1, {(1,2), (1,3), (1,4)} )

If we want to generate the following content,

(1, 2, 3, 4)
(2, 5, 6, 7)

we can use the following script:

r4 = foreach r3 generate a, FLATTEN(BagToTuple(b.b));
dump r4;
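
Pig does not guarantee the order of tuples inside a bag, so the fields after a may come out in any order. A minimal sketch of one way to fix the order, assuming the same 'test_file' as above, is to sort the inner bag in a nested foreach before calling BagToTuple. The aliases sorted and r4_sorted are names introduced here for illustration:

```pig
-- Same input and grouping as above.
r1 = load 'test_file' using PigStorage(',') as (a:int, b:int);
r2 = group r1 by a;
r3 = foreach r2 generate group as a, r1 as b;

r4_sorted = foreach r3 {
    -- $1 is the second field (b) of each tuple in the inner bag.
    sorted = order b by $1;
    -- Project that field, turn the bag into a tuple, then flatten it.
    generate a, FLATTEN(BagToTuple(sorted.$1));
};
dump r4_sorted;
```

Positional references ($1) are used inside the nested block only to avoid confusion between the outer field b (the bag) and the inner field b (the value).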

For the following content,

((1,2),(1,3),(1,4))
((2,5),(2,6),(2,7))

I cannot find a helpful builtin function. You may need to write a custom version of BagToTuple. Here is the source code of the builtin BagToTuple: http://www.grepcode.com/file/repo1.maven.org/maven2/org.apache.pig/pig/0.11.1/org/apache/pig/builtin/BagToTuple.java#BagToTuple.getOuputTupleSize%28org.apache.pig.data.DataBag%29

zsxwing answered Sep 23 '22 07:09


To obtain:

((1,2),(1,3),(1,4))
((2,5),(2,6),(2,7))

you can do this:

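r4 = foreach r3 {
    -- Wrap each (a,b) pair in its own tuple. BagToTuple removes one level of
    -- nesting, so without this step the pairs would be merged into (1,2,1,3,1,4).
    Tmp = foreach $1 generate (a, b);
    -- FLATTEN promotes the pairs to top-level fields: ((1,2),(1,3),(1,4)).
    generate FLATTEN(BagToTuple(Tmp));
};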
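
With FLATTEN, each pair becomes its own top-level field. If a single field holding the whole tuple of pairs is wanted instead, a possible variant is to drop FLATTEN. This is a sketch assuming the same 'test_file' and r3 as in the first answer; the aliases pairs and r5 are introduced here for illustration:

```pig
r1 = load 'test_file' using PigStorage(',') as (a:int, b:int);
r2 = group r1 by a;
r3 = foreach r2 generate group as a, r1 as b;

r5 = foreach r3 {
    -- Wrap each (a,b) pair in its own tuple, as in the script above.
    pairs = foreach $1 generate (a, b);
    -- No FLATTEN: the record keeps one field, a tuple of tuples.
    generate BagToTuple(pairs);
};
describe r5;
dump r5;
```

Running describe on r4 and r5 is a quick way to check which of the two shapes a script actually produces.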

Samoht-Sann answered Sep 25 '22 07:09