How to remove duplicate columns after a JOIN in Pig?

Tags: java, join, hadoop, apache-pig

Let's say I JOIN two relations like:

-- part looks like:
-- 1,5.3
-- 2,4.9
-- 3,4.9

-- original looks like:
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,akhila,3.3,IT,C,1.3,0.3

jnd = JOIN part BY $0, original BY $0;

The output will be:

1,5.3,1,Anju,3.6,IT,A,1.6,0.3
2,4.9,2,Remya,3.3,EEE,B,1.6,0.3
3,4.9,3,akhila,3.3,IT,C,1.3,0.3

Notice that the join key appears twice in each tuple (as $0 and $2). For example:

1,5.3,1,Anju,3.6,IT,A,1.6,0.3
^     ^
|-----|

I can remove the duplicate key manually by doing:

jnd = foreach jnd generate $0,$1,$3,$4 ..;

Is there a way to remove it dynamically, with something like remove(the duplicate join key)?

asked Apr 20 '14 05:04

Unmesha Sreeveni U.B

1 Answer

I faced the same issue while joining data sets and doing other data processing, where column names end up repeated in the output.

So I worked on a UDF that removes duplicate columns by using the schema name of each field, retaining the data from the first occurrence of each column name.

Prerequisite:

Every field must have a name in the schema.

You need to download the UDF source file and build it into a jar in order to use it.

UDF file location on GitHub: GitHub UDF Java File Location

We will take the question above as an example.

--Data Set A contains this data
-- 1,5.3
-- 2,4.9
-- 3,4.9

--Data Set B contains this data
-- 1,Anju,3.6,IT,A,1.6,0.3
-- 2,Remya,3.3,EEE,B,1.6,0.3
-- 3,Akhila,3.3,IT,C,1.3,0.3

PIG Script:

REGISTER /home/user/
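-- The sample data is comma-separated, so the delimiter is given explicitly (PigStorage defaults to tab).
DSA = LOAD '/home/user/DSALOC' USING PigStorage(',') AS (ROLLNO:int,CGPA:float);
DSB = LOAD '/home/user/DSBLOC' USING PigStorage(',') AS (ROLLNO:int,NAME:chararray,SUB1:float,BRANCH:chararray,GRADE:chararray,SUB2:float);
JOINOP = JOIN DSA BY ROLLNO, DSB BY ROLLNO;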

After the join, the column names are DSA::ROLLNO:int,DSA::CGPA:float,DSB::ROLLNO:int,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float

The goal is to reduce this to DSA::ROLLNO:int,DSA::CGPA:float,DSB::NAME:chararray,DSB::SUB1:float,DSB::BRANCH:chararray,DSB::GRADE:chararray,DSB::SUB2:float

That is, DSB::ROLLNO:int is removed.

To do that, apply the UDF as follows:

JOINOP_NODUPLICATES = FOREACH JOINOP GENERATE FLATTEN(org.imagine.REMOVEDUPLICATECOLUMNS(*));

Here org.imagine.REMOVEDUPLICATECOLUMNS is the UDF.

The UDF removes duplicate columns by comparing the field names in the schema, so DSA::ROLLNO:int is retained and DSB::ROLLNO:int is removed from the data set.
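To illustrate the name-based idea, here is a minimal plain-Java sketch of how the columns to keep could be chosen. It is not the linked UDF's source. The class and method names (DuplicateColumnFilter, baseName, indexesToKeep) are hypothetical, and it assumes names are compared after stripping the relation prefix that Pig adds before "::", which is what the behaviour described above implies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper: illustrates the selection logic only, not the actual UDF.
class DuplicateColumnFilter {

    // Strip the relation prefix Pig adds after a JOIN, e.g. "DSB::ROLLNO" -> "ROLLNO".
    public static String baseName(String alias) {
        int idx = alias.lastIndexOf("::");
        return idx < 0 ? alias : alias.substring(idx + 2);
    }

    // Return the positions of the columns to keep: the first occurrence of each base name.
    public static List<Integer> indexesToKeep(List<String> aliases) {
        Set<String> seen = new HashSet<>();
        List<Integer> keep = new ArrayList<>();
        for (int i = 0; i < aliases.size(); i++) {
            // Set.add returns false when the base name was already seen.
            if (seen.add(baseName(aliases.get(i)))) {
                keep.add(i);
            }
        }
        return keep;
    }

    public static void main(String[] args) {
        List<String> joined = Arrays.asList(
            "DSA::ROLLNO", "DSA::CGPA", "DSB::ROLLNO", "DSB::NAME",
            "DSB::SUB1", "DSB::BRANCH", "DSB::GRADE", "DSB::SUB2");
        // Position 2 (DSB::ROLLNO) is dropped.
        System.out.println(indexesToKeep(joined)); // prints [0, 1, 3, 4, 5, 6, 7]
    }
}
```

A real Pig UDF would read these aliases from the input schema and copy only the kept positions into the output tuple. This also shows why the prerequisite matters: without field names there is nothing to compare.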

answered Oct 18 '22 15:10

Ajay Gupta
