In Apache Pig, select DISTINCT rows based on a single column

Tags: group-by, distinct, apache-pig

Let's say I have a table such as the one below, which may or may not contain duplicates for a given field:

ID     URL
---    ------------------
001    http://example.com/adam
002    http://example.com/beth
002    http://example.com/beth?extra=blah
003    http://example.com/charlie

I would like to write a Pig script to find only DISTINCT rows, based on the value of a single field. For instance, filtering the table above by ID should return something like the following:

ID     URL
---    ------------------
001    http://example.com/adam
002    http://example.com/beth
003    http://example.com/charlie

The Pig GROUP BY operator returns a bag of tuples grouped by ID, which would work if I knew how to get just the first tuple per bag (perhaps a separate question).

The Pig DISTINCT operator works on the entire row, so in this case all four rows would be considered unique, which is not what I want.
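To make the limitation concrete, here is a minimal sketch. The file name and the field names id and url are assumptions for illustration, not part of the original data set:

```pig
-- Assumed file name and schema, for illustration only.
my_table = LOAD 'my_table_file' AS (id:chararray, url:chararray);

-- DISTINCT compares whole tuples, so rows that share an id
-- but differ in url are all kept.
whole_row_distinct = DISTINCT my_table;

-- Projecting to id first does remove duplicates, but the url column is lost.
ids_only = FOREACH my_table GENERATE id;
distinct_ids = DISTINCT ids_only;
```

Neither relation has the shape I want: the first keeps both 002 rows, and the second has no URL at all.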

For my purposes, I do not care which of the rows with ID 002 is returned.


asked May 27 '14 23:05

Arel

2 Answers

I found one way to do this, using the GROUP operator and the TOP function:

my_table = LOAD 'my_table_file' AS (A, B);

my_table_grouped = GROUP my_table BY A;

my_table_distinct = FOREACH my_table_grouped {

    -- For each group, $0 is the group key (A)
    -- and $1 is the bag of entire rows {(A, B), (A, B), ...}.
    -- TOP(1, 0, $1) keeps one row from the bag, ranked by column 0.
    -- Every row in a bag has the same A, so which row is kept is arbitrary.

    result = TOP(1, 0, $1);
    GENERATE FLATTEN(result);

}

DUMP my_table_distinct;

This results in one row per distinct ID:

(001,http://example.com/adam)
(002,http://example.com/beth?extra=blah)
(003,http://example.com/charlie)

I don't know if there is a better approach, but this works for me. I hope this helps others starting out with Pig.

(Reference: http://pig.apache.org/docs/r0.12.1/func.html#topx)

answered Oct 08 '22 14:10

Arel

I have found that you can do this with a nested FOREACH over the grouped data, using LIMIT. Using Arel's example:

my_table = LOAD 'my_table_file' AS (A, B);

-- GROUP produces one bag per distinct A;
-- the nested LIMIT keeps a single row from each bag.

my_table_distinct = FOREACH (GROUP my_table BY A) {
  result = LIMIT my_table 1;
  GENERATE FLATTEN(result);
}

DUMP my_table_distinct;
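
If it ever matters which row is kept, a possible variation is to sort each bag inside the nested block before applying LIMIT. This is only a sketch: the chararray types and the choice to order by B ascending are assumptions, not something the question requires.

```pig
-- Assumed types; the original example loads A and B without a declared type.
my_table = LOAD 'my_table_file' AS (A:chararray, B:chararray);

my_table_distinct = FOREACH (GROUP my_table BY A) {
  -- Sort the rows of this group by B so the kept row no longer depends
  -- on the order in which the rows arrive.
  sorted = ORDER my_table BY B ASC;
  first_row = LIMIT sorted 1;
  GENERATE FLATTEN(first_row);
}

DUMP my_table_distinct;
```

The nested ORDER adds a sort per group, so the plain LIMIT version above is the simpler choice when any row will do.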

answered Oct 08 '22 13:10

Michael Papile
