
pig - how to reference columns in a FOREACH after a JOIN?

Tags:

apache-pig

A = load 'a.txt' as (id, a1);
B = load 'b.txt' as (id, b1);
C = join A by id, B by id;
D = foreach C generate id,a1,b1;
dump D;

The 4th line fails with: Invalid field projection. Projected field [id] does not exist in schema

I tried changing it to A.id, but then the last line fails with: ERROR 0: Scalar has more than one row in the output.

ihadanny asked Nov 08 '11 13:11




1 Answer

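What you are looking for is the "Disambiguate Operator" (::). You want A::id, not A.id.

A.id says "there is a relation/bag A and there is a column called id in its schema". Inside a FOREACH over C, Pig treats that as a scalar projection of the whole relation A, which only works when A has exactly one row. That is where the "Scalar has more than one row in the output" error comes from.

A::id says "there is a record from A and that has a column called id". After the JOIN, C carries both A::id and B::id, so a bare id cannot be resolved, while a1 and b1 are unique and need no prefix.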

So, you would do:

A = load 'a.txt' as (id, a1);
B = load 'b.txt' as (id, b1);
C = join A by id, B by id;
D = foreach C generate A::id,a1,b1;
dump D;
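
If you are unsure which names survive the join, describe will print the schema of C. The sketch below reuses the same a.txt and b.txt inputs; the schema shown in the comment is what I would expect for untyped fields, so treat it as approximate and check it against your own Pig version. It also shows renaming the column with "as" so later statements can use a plain id again.

```pig
A = load 'a.txt' as (id, a1);
B = load 'b.txt' as (id, b1);
C = join A by id, B by id;

-- Print the joined schema; expect something along the lines of
-- C: {A::id: bytearray, A::a1: bytearray, B::id: bytearray, B::b1: bytearray}
describe C;

-- Disambiguate the shared column and rename it back to a plain "id".
-- a1 and b1 are unique across both inputs, so they need no prefix.
D = foreach C generate A::id as id, a1, b1;
dump D;
```

Positional references ($0 for the first field of C, and so on) also work, but they are harder to read than A::id.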

A dirty alternative:

Just because I'm lazy, and disambiguation gets really weird when you start doing multiple joins one after another: use unique identifiers.

A = load 'a.txt' as (ida, a1);
B = load 'b.txt' as (idb, b1);
C = join A by ida, B by idb;
D = foreach C generate ida,a1,b1;
dump D;
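
To illustrate why disambiguation gets awkward with chained joins, here is a rough sketch with a third input. The file x.txt and the aliases X, E and F are made up for this example. As far as I know the prefixes nest, so the id that came from A ends up as C::A::id; run describe to confirm the exact names on your version.

```pig
A = load 'a.txt' as (id, a1);
B = load 'b.txt' as (id, b1);
X = load 'x.txt' as (id, x1);   -- hypothetical third input

C = join A by id, B by id;
E = join C by A::id, X by id;

-- Prefixes stack up after the second join (C::A::id, C::B::id, X::id, ...)
describe E;

-- Rename once here so later statements can use plain names again
F = foreach E generate C::A::id as id, a1, b1, x1;
dump F;
```

With unique names such as ida, idb and idx from the start, none of the prefixes would be needed.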

Donald Miner answered Oct 16 '22 08:10