Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to remove duplicate columns after a JOIN in Pig?

Let's say I JOIN two relations like:

-- part looks like:
-- 1,5.3
-- 2,4.9
-- 3,4.9

-- original looks like:
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,akhila,3.3,IT,C,1.3,0.3

jnd = JOIN part BY $0, original BY $0;

The output will be:

1,5.3,1,Anju,3.6,IT,A,1.6,0.3
2,4.9,2,Remya,3.3,EEE,B,1.6,0.3
3,4.9,3,akhila,3.3,IT,C,1.3,0.3

Notice that $0 is shown twice in each tuple. EG:

1,5.3,1,Anju,3.6,IT,A,1.6,0.3
^     ^
|-----|

I can remove the duplicate key manually by doing:

jnd = foreach jnd generate $0,$1,$3,$4 ..;

Is there a way to remove this dynamically? Like remove(the duplicate key joiner).

like image 224
Unmesha Sreeveni U.B Avatar asked Apr 20 '14 05:04

Unmesha Sreeveni U.B


People also ask

How remove duplicate columns after join spark?

Method 1: Using drop() function dataframe is the first dataframe. dataframe1 is the second dataframe. inner specifies inner join. drop() will delete the common column and delete first dataframe column.

How do I remove duplicates in pig?

The Apache Pig DISTINCT operator is used to remove duplicate tuples in a relation. Initially, Pig sorts the given data and then eliminates duplicates.

How do I drop a column after join?

Removing duplicate columns after join in PySpark If we want to drop the duplicate column, then we have to specify the duplicate column in the join function. Here we are simply using join to join two dataframes and then drop duplicate columns.


1 Answers

Have faced the same kind of issue while working on Data Set Joining and other data processing techniques where in output the column names get repeated.

So was working on UDF which will remove the duplicates column by using schema name of that field and retaining the first unique column occurrence data.

Pre-Requisite:

Name of all the fields should be present

You need to download this UDF file and make it jar so as to use it.

UDF file location from GitHub : GitHub UDF Java File Location

We will take the above question as example.

--Data Set A contains this data
-- 1,5.3
-- 2,4.9
-- 3,4.9

--Data Set B contains this data
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,Akhila,3.3,IT,C,1.3,0.3

PIG Script:

REGISTER /home/user/
DSA = LOAD '/home/user/DSALOC' AS (ROLLNO:int,CGPA:float);
DSB = LOAD '/home/user/DSBLOC' AS (ROLLNO:int,NAME:chararray,SUB1:float,BRANCH:chararray,GRADE:chararray,SUB2:float);
JOINOP = JOIN DSA BY ROLLNO,DSB BY ROLLNO;

We will get column name after joining as DSA::ROLLNO:int,DSA::CGPA:float,DSB::ROLLNO:int,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float

For making it to DSA::ROLLNO:int,DSA::CGPA:float,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float

DSB::ROLLNO:int is removed.

We need to use the UDF as

JOINOP_NODUPLICATES = FOREACH JOINOP GENERATE FLATTEN(org.imagine.REMOVEDUPLICATECOLUMNS(*));

Where org.imagine.REMOVEDUPLICATECOLUMNS is the UDF.

This UDF removes duplicate columns by using Name in schema.So DSA::ROLLNO:int is retained and DSB::ROLLNO:int is removed from the dataset.

like image 198
Ajay Gupta Avatar answered Oct 18 '22 15:10

Ajay Gupta