Let's say I have a table such as the one below, that may or may not contain duplicates for a given field:
ID URL
--- ------------------
001 http://example.com/adam
002 http://example.com/beth
002 http://example.com/beth?extra=blah
003 http://example.com/charlie
I would like to write a Pig script to find only DISTINCT rows, based on the value of a single field. For instance, filtering the table above by ID should return something like the following:
ID URL
--- ------------------
001 http://example.com/adam
002 http://example.com/beth
003 http://example.com/charlie
The Pig GROUP BY operator returns a bag of tuples grouped by ID, which would work if I knew how to get just the first tuple per bag (perhaps a separate question).
The Pig DISTINCT operator works on the entire row, so in this case all four rows would be considered unique, which is not what I want.
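To make the whole-row behaviour concrete, here is a minimal sketch of applying DISTINCT directly to the relation. The file name 'my_table_file' and the field names A and B are placeholders, and the default tab-delimited loader is assumed:

```
-- 'my_table_file', A and B are placeholder names for illustration.
my_table = LOAD 'my_table_file' AS (A, B);

-- DISTINCT compares complete tuples, not individual fields, so the
-- two rows with ID 002 are both kept because their URLs differ.
whole_row_distinct = DISTINCT my_table;

DUMP whole_row_distinct;
```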
For my purposes, I do not care which of the rows with ID 002 is returned.
I found one way to do this, using the GROUP operator and the TOP function:
my_table = LOAD 'my_table_file' AS (A, B);
my_table_grouped = GROUP my_table BY A;
my_table_distinct = FOREACH my_table_grouped {
    -- For each group, $0 refers to the group key (A)
    -- and $1 refers to a bag of entire rows {(A, B), (A, B), ...}.
    -- TOP(1, 0, $1) keeps one row from the bag, ranked on column 0 (A).
    -- A is identical within a group, so which row survives is arbitrary.
    result = TOP(1, 0, $1);
    GENERATE FLATTEN(result);
}
DUMP my_table_distinct;
This results in one row per distinct ID:
(001,http://example.com/adam)
(002,http://example.com/beth?extra=blah)
(003,http://example.com/charlie)
I don't know if there is a better approach, but this works for me. I hope this helps others starting out with Pig.
(Reference: http://pig.apache.org/docs/r0.12.1/func.html#topx)
I have found that you can also do this with LIMIT inside a nested FOREACH over the grouped relation. Using Arel's example:
my_table = LOAD 'my_table_file' AS (A, B);
-- GROUP produces one bag per value of A; inside the nested FOREACH,
-- limit each bag to a single tuple.
my_table_distinct = FOREACH (GROUP my_table BY A) {
    result = LIMIT my_table 1;
    GENERATE FLATTEN(result);
}
DUMP my_table_distinct;
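Neither approach here promises which of the duplicate rows is kept. The question says that does not matter, but if it ever does, one option is to sort each bag before limiting it. This is only a sketch: it assumes B is loaded as a chararray and that keeping the row whose B sorts first is acceptable, and the alias names `sorted` and `my_table_first_by_b` are made up for illustration.

```
-- Assumption: both fields are declared as chararray so that the
-- ORDER BY below compares them as strings.
my_table = LOAD 'my_table_file' AS (A:chararray, B:chararray);

my_table_first_by_b = FOREACH (GROUP my_table BY A) {
    -- Sort the rows within each group by B, then keep the first one.
    sorted = ORDER my_table BY B;
    result = LIMIT sorted 1;
    GENERATE FLATTEN(result);
}

DUMP my_table_first_by_b;
```

The extra sort adds work per group, so the plain LIMIT version is probably the better fit when any row will do.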