
In Apache Pig, select DISTINCT rows based on a single column

Let's say I have a table such as the one below, which may or may not contain duplicate values in a given field:

ID     URL
---    ------------------
001    http://example.com/adam
002    http://example.com/beth
002    http://example.com/beth?extra=blah
003    http://example.com/charlie

I would like to write a Pig script that keeps only one row per distinct value of a single field. For instance, deduplicating the table above by ID should return something like the following:

ID     URL
---    ------------------
001    http://example.com/adam
002    http://example.com/beth
003    http://example.com/charlie

The Pig GROUP BY operator returns a bag of tuples grouped by ID, which would work if I knew how to get just the first tuple per bag (perhaps a separate question).

The Pig DISTINCT operator works on the entire row, so in this case all four rows would be considered unique, which is not what I want.
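
To make the limitation concrete, here is a minimal sketch. It assumes the four data rows above are stored tab-delimited in a file called my_table_file, and the field names id and url and their chararray types are chosen here for illustration. DISTINCT on the whole relation removes nothing, while DISTINCT on a projection of the ID removes the duplicate but discards the URL:

```pig
-- Assumption: 'my_table_file' holds the four data rows, tab-delimited, no header lines.
my_table = LOAD 'my_table_file' AS (id:chararray, url:chararray);

-- Compares whole rows: the two 002 rows differ in url, so both remain.
whole_rows = DISTINCT my_table;

-- Projecting to id first gives one row per ID, but the url column is gone.
ids_only = FOREACH my_table GENERATE id;
unique_ids = DISTINCT ids_only;

DUMP unique_ids;
```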

For my purposes, I do not care which of the rows with ID 002 is returned.

Arel asked May 27 '14 23:05





2 Answers

I found one way to do this, using the GROUP operator and the built-in TOP function:

my_table = LOAD 'my_table_file' AS (A, B);

my_table_grouped = GROUP my_table BY A;

my_table_distinct = FOREACH my_table_grouped {

    -- Within each group, $0 is the group key (the value of A)
    -- and $1 is a bag of the complete rows {(A, B), (A, B), ...}.
    -- TOP(1, 0, $1) keeps one row per bag, ranked on column 0. Column 0
    -- is A, so every row in a bag ties and the row kept is arbitrary:

    result = TOP(1, 0, $1);
    GENERATE FLATTEN(result);

}

DUMP my_table_distinct;

This yields one row per distinct ID:

(001,http://example.com/adam)
(002,http://example.com/beth?extra=blah)
(003,http://example.com/charlie)

I don't know if there is a better approach, but this works for me. I hope this helps others starting out with Pig.

(Reference: http://pig.apache.org/docs/r0.12.1/func.html#topx)
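
The same approach can also be written with named fields rather than positional references, which may be easier to follow. This is only a sketch: the field names id and url and the chararray types are assumptions, not part of the script above. Inside the nested block the bag is available under the name of the grouped relation, so my_table can stand in for $1:

```pig
-- Assumption: field names and types are illustrative only.
my_table = LOAD 'my_table_file' AS (id:chararray, url:chararray);

my_table_grouped = GROUP my_table BY id;

my_table_distinct = FOREACH my_table_grouped {
    -- TOP(topN, column, relation): keep 1 row, ranked on column 0 (id).
    -- All rows in a bag share the same id, so which one is kept is arbitrary.
    result = TOP(1, 0, my_table);
    GENERATE FLATTEN(result);
}

DUMP my_table_distinct;
```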

Arel answered Oct 08 '22 14:10



I have found that you can also do this with a nested FOREACH and LIMIT. Using Arel's example:

my_table = LOAD 'my_table_file' AS (A, B);

-- GROUP produces one bag of rows per value of A;
-- the nested LIMIT keeps a single row from each bag.

my_table_distinct = FOREACH (GROUP my_table BY A) {
  result = LIMIT my_table 1;
  GENERATE FLATTEN(result);
}

DUMP my_table_distinct;
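
If it ever matters which row survives, the nested block can sort each bag before limiting it. The sketch below is a variation rather than part of the original answer: it assumes typed fields named id and url (names chosen here for illustration) and keeps the row whose url sorts first. ORDER and LIMIT are both among the operators Pig permits inside a nested FOREACH:

```pig
-- Assumption: field names and types are illustrative only.
my_table = LOAD 'my_table_file' AS (id:chararray, url:chararray);

my_table_distinct = FOREACH (GROUP my_table BY id) {
  -- Sort the rows of each bag by url, then keep only the first one.
  by_url = ORDER my_table BY url ASC;
  result = LIMIT by_url 1;
  GENERATE FLATTEN(result);
}

DUMP my_table_distinct;
```

With the sample data, ID 002 should then resolve to http://example.com/beth, since that string is a prefix of the longer URL and sorts before it.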
Michael Papile answered Oct 08 '22 13:10
