Can I generate nested bags using nested FOREACH statements in Pig Latin?

Tags:

apache-pig

Let's say I have a data set of restaurant reviews:

User,City,Restaurant,Rating
Jim,New York,Mecurials,3
Jim,New York,Whapme,4.5
Jim,London,Pint Size,2
Lisa,London,Pint Size,4
Lisa,London,Rabbit Whole,3.5

I want to produce the average rating per user and city, i.e. this output:

User,City,AverageRating
Jim,New York,3.75
Jim,London,2
Lisa,London,3.75

I could write a Pig script as follows:

Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);

PerUserCity = GROUP Data BY (user, city);

ResultSet = FOREACH PerUserCity {
    GENERATE group.user, group.city, AVG(Data.rating);
}

However, I'm curious whether I can first group at the higher level (the users) and then subgroup at the next level (the cities) later, i.e.:

PerUser = GROUP Data BY user;

Intermediate = FOREACH PerUser {
    B = GROUP Data BY city;
    GENERATE group AS user, B;
}

I get:

Error during parsing.
Invalid alias: GROUP in {
  group: chararray,
  Data: {
    user: chararray,
    city: chararray,
    restaurant: chararray,
    rating: float
  }
}

Has anyone tried this with success? Is it simply not possible to GROUP within a FOREACH?

My goal is to do something like:

ResultSet = FOREACH PerUser {
    FOREACH City {
        GENERATE user, city, AVG(City.rating)
    }
}

asked Feb 08 '11 11:02

PP.

1 Answer

Currently the allowed operations are DISTINCT, FILTER, LIMIT, and ORDER BY inside a FOREACH.
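
As a rough sketch of what a nested block does allow, the following filters, sorts, and limits each user's bag inside the FOREACH. It assumes the Data relation from the question and a data.txt without a header row; the aliases Rated, Sorted, Best, and TopPerUser are made up for illustration.

```pig
-- Same LOAD as in the question (schema assumed from there).
Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);

PerUser = GROUP Data BY user;

-- Inside the block, only operators on the nested bag are used:
-- FILTER, ORDER BY and LIMIT, followed by the final GENERATE.
TopPerUser = FOREACH PerUser {
    Rated  = FILTER Data BY rating IS NOT NULL;
    Sorted = ORDER Rated BY rating DESC;
    Best   = LIMIT Sorted 1;
    GENERATE group AS user, FLATTEN(Best.(city, restaurant, rating));
}
```

This yields each user's highest-rated restaurant. A GROUP in place of any of the nested statements is what triggers the parse error shown in the question.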

For now, grouping directly by (user, city) is the right way to do it, as you said.
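
If the nested shape is still wanted (one row per user holding a bag of per-city averages), one possible workaround is to group twice at the top level rather than inside a FOREACH. This is a minimal sketch, assuming the schema from the question and no header row in data.txt; CityAvg, Nested, avg_rating, and cities are illustrative names, not anything from the original script.

```pig
Data = LOAD 'data.txt' USING PigStorage(',') AS (
    user:chararray, city:chararray, restaurant:chararray, rating:float
);

-- Step 1: average per (user, city), exactly as in the question.
PerUserCity = GROUP Data BY (user, city);
CityAvg = FOREACH PerUserCity GENERATE
    group.user AS user,
    group.city AS city,
    AVG(Data.rating) AS avg_rating;

-- Step 2: regroup the averaged rows by user to get one bag per user.
PerUser = GROUP CityAvg BY user;
Nested = FOREACH PerUser GENERATE
    group AS user,
    CityAvg.(city, avg_rating) AS cities;
```

The flat CityAvg relation already matches the desired User,City,AverageRating output; the second GROUP is only needed when a bag per user is the actual goal.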

answered Sep 21 '22 10:09

Romain
