Can I generate nested bags using nested FOREACH statements in Pig Latin?

Tags:

apache-pig

Let's say I have a data set of restaurant reviews:

User,City,Restaurant,Rating
Jim,New York,Mecurials,3
Jim,New York,Whapme,4.5
Jim,London,Pint Size,2
Lisa,London,Pint Size,4
Lisa,London,Rabbit Whole,3.5

I want to produce the average rating per user and city, i.e. this output:

User,City,AverageRating
Jim,New York,3.75
Jim,London,2
Lisa,London,3.75

I could write a Pig script as follows:

Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);

PerUserCity = GROUP Data BY (user, city);

ResultSet = FOREACH PerUserCity {
    GENERATE group.user, group.city, AVG(Data.rating);
}

However, I'm curious whether I can first group at the higher level (the users) and then subgroup at the next level (the cities) afterwards, i.e.:

PerUser = GROUP Data BY user;

Intermediate = FOREACH PerUser {
    B = GROUP Data BY city;
    GENERATE group AS user, B;
}

I get:

Error during parsing.
Invalid alias: GROUP in {
  group: chararray,
  Data: {
    user: chararray,
    city: chararray,
    restaurant: chararray,
    rating: float
  }
}

Has anyone tried this with success? Is it simply not possible to GROUP within a FOREACH?

My goal is to do something like:

ResultSet = FOREACH PerUser {
    FOREACH City {
        GENERATE user, city, AVG(City.rating)
    }
}
Asked by PP. on Feb 08 '11 at 11:02



1 Answer

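Currently, the only operations allowed inside a nested FOREACH block are DISTINCT, FILTER, LIMIT, and ORDER BY. GROUP is not one of them, which is why the parser reports "Invalid alias: GROUP".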
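
To illustrate what a nested block can do, here is a minimal sketch that uses ORDER BY and LIMIT inside a FOREACH to pick each user's highest-rated review. The alias names Sorted, Best and TopPerUser are made up for this example, and it assumes data.txt has the layout from the question with no header line.

```pig
Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);

PerUser = GROUP Data BY user;

-- ORDER BY and LIMIT are among the operations permitted inside the block
TopPerUser = FOREACH PerUser {
    Sorted = ORDER Data BY rating DESC;   -- sort this user's reviews
    Best = LIMIT Sorted 1;                -- keep only the top one
    GENERATE group AS user, FLATTEN(Best.(city, restaurant, rating));
}
```

Swapping the ORDER line for a GROUP reproduces the parse error from the question.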

For now, grouping directly by (user, city), as in your first script, is the right way to do it.
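
If the end goal is one row per user holding a nested bag of per-city averages, one possible workaround is to aggregate first and then regroup with a second top-level GROUP. This is only a sketch: CityAvg, PerUserNested and avg_rating are hypothetical names, and data.txt is assumed to have no header line.

```pig
Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);

-- Step 1: average per (user, city), exactly as in the question's first script
PerUserCity = GROUP Data BY (user, city);
CityAvg = FOREACH PerUserCity GENERATE
    group.user AS user,
    group.city AS city,
    AVG(Data.rating) AS avg_rating;

-- Step 2: regroup the averages by user to get one nested bag per user
PerUserNested = GROUP CityAvg BY user;
```

Each tuple of PerUserNested then carries the user as its group key plus a bag of (user, city, avg_rating) tuples. Running DESCRIBE PerUserNested shows the resulting schema.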

Answered by Romain on Sep 21 '22 at 10:09