From this:
(1, {(1,2), (1,3), (1,4)} )
(2, {(2,5), (2,6), (2,7)} )
...How could we generate this?
((1,2),(1,3),(1,4))
((2,5),(2,6),(2,7))
...And how could we generate this?
(1, 2, 3, 4)
(2, 5, 6, 7)
I know how to do this for a single row. The problem is when I have to iterate over many rows AND manipulate the inner groups at the same time.
The FLATTEN operator looks like a UDF syntactically, but it is actually an operator that changes the structure of tuples and bags in a way that a UDF cannot. FLATTEN un-nests tuples as well as bags. The idea is the same, but the operation and the result are different for each type of structure.
When FLATTEN is applied to a tuple, it does not produce a cross product as it does for a bag; instead, it elevates each field in the tuple to a top-level field. As with empty bags, an empty tuple removes the entire record. If the fields in a bag or tuple that is being flattened have names, Pig will carry those names along.
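The difference between flattening a bag and flattening a tuple can be sketched in plain Python. This is only a model of the behavior described above, not Pig code: a bag is represented as a list of tuples, and the names flatten_bag and flatten_tuple are made up for illustration. The empty-tuple case is not modeled.

```python
# Plain-Python model of FLATTEN; a bag is a list of tuples, a record is a tuple.
# flatten_bag and flatten_tuple are hypothetical helper names, not Pig APIs.

def flatten_bag(prefix, bag):
    """Cross the other fields with the bag: one output record per bag tuple."""
    return [prefix + t for t in bag]

def flatten_tuple(prefix, tup):
    """Elevate each field of the tuple to a top-level field: one record."""
    return [prefix + tup]

bag_result = flatten_bag((1,), [(2,), (3,), (4,)])
tuple_result = flatten_tuple((1,), (2, 3, 4))
print(bag_result)    # three records, one per tuple in the bag
print(tuple_result)  # a single, wider record
```

In this model an empty bag yields no records at all, which matches the rule that flattening an empty bag drops the record.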
A tuple is a fixed-length, ordered collection of Pig data elements. Tuples are divided into fields, with each field containing one data element. These elements can be of any type—they do not all need to be the same type. A tuple is analogous to a row in SQL, with the fields being SQL columns.
GROUP ... ALL collects every tuple of a relation into a single group, for example:
grunt> group_all = GROUP student_details ALL;
For your question, I prepared the following file:
1,2
1,3
1,4
2,5
2,6
2,7
First, I used the following script to produce the input r3 that you described in your question:
r1 = load 'test_file' using PigStorage(',') as (a:int, b:int);
r2 = group r1 by a;
r3 = foreach r2 generate group as a, r1 as b;
describe r3;
-- r3: {a: int,b: {(a: int,b: int)}}
-- r3 is like (1, {(1,2), (1,3), (1,4)} )
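As a rough mental model of what the group step produces, here is a plain-Python sketch. It is not Pig code; group_by_first is a hypothetical helper, and a bag is modeled as a list even though a real Pig bag carries no ordering guarantee.

```python
# Plain-Python model of: r2 = group r1 by a; r3 = foreach r2 generate group, r1;
rows = [(1, 2), (1, 3), (1, 4), (2, 5), (2, 6), (2, 7)]

def group_by_first(rows):
    """Collect the full rows into one list (the 'bag') per value of field a."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    return list(groups.items())

r3 = group_by_first(rows)
for record in r3:
    print(record)
```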
If we want to generate the following content,
(1, 2, 3, 4)
(2, 5, 6, 7)
we can use the following script:
r4 = foreach r3 generate a, FLATTEN(BagToTuple(b.b));
dump r4;
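To see why this works, the three steps (project b.b, BagToTuple, FLATTEN) can be modeled in plain Python. This is a sketch of the semantics only, with hypothetical helper names; a bag is modeled as a list, whereas Pig does not guarantee the order of tuples in a bag, so the order of the values after a is not guaranteed in real Pig output.

```python
# Plain-Python model of: generate a, FLATTEN(BagToTuple(b.b))
r3 = [(1, [(1, 2), (1, 3), (1, 4)]), (2, [(2, 5), (2, 6), (2, 7)])]

def project(bag, index):
    """Model of b.b: keep one column, still as a bag of one-field tuples."""
    return [(t[index],) for t in bag]

def bag_to_tuple(bag):
    """Model of BagToTuple: concatenate the fields of every tuple in the bag."""
    return tuple(field for t in bag for field in t)

# FLATTEN of a tuple lifts its fields to the top level, next to a.
r4 = [(a,) + bag_to_tuple(project(b, 1)) for a, b in r3]
for record in r4:
    print(record)
```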
For the following content,
((1,2),(1,3),(1,4))
((2,5),(2,6),(2,7))
I cannot find a suitable built-in function. You may need to write your own variant of BagToTuple. Here is the source code of the built-in BagToTuple: http://www.grepcode.com/file/repo1.maven.org/maven2/org.apache.pig/pig/0.11.1/org/apache/pig/builtin/BagToTuple.java#BagToTuple.getOuputTupleSize%28org.apache.pig.data.DataBag%29
To obtain:
((1,2),(1,3),(1,4))
((2,5),(2,6),(2,7))
you can do this:
r4 = foreach r3 {
    -- wrap each (a,b) pair of the bag in its own tuple
    Tmp = foreach $1 generate (a, b);
    generate FLATTEN(BagToTuple(Tmp));
};
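The trick in the nested foreach is the extra level of nesting: each pair is wrapped in a one-field tuple, so BagToTuple concatenates whole pairs instead of individual values. A plain-Python model of that idea, with hypothetical helper names and a list standing in for a bag (so the ordering shown is an assumption), might look like this:

```python
# Plain-Python model of the nested foreach with Tmp and BagToTuple.
r3 = [(1, [(1, 2), (1, 3), (1, 4)]), (2, [(2, 5), (2, 6), (2, 7)])]

def wrap_pairs(bag):
    """Model of Tmp: each tuple in the bag now has a single field, the pair."""
    return [(t,) for t in bag]

def bag_to_tuple(bag):
    """Model of BagToTuple: concatenate the fields of every tuple in the bag."""
    return tuple(field for t in bag for field in t)

# FLATTEN then lifts the pairs to top-level fields of the output record.
r4 = [bag_to_tuple(wrap_pairs(b)) for _, b in r3]
for record in r4:
    print(record)
```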