
Is there an easy way to dedupe a Hive table?

I have a set of Hive tables on Elastic MapReduce that contain some duplicate rows. Is there an easy way to dedupe these tables?

What comes to mind is dumping the data to a set of Pig-digestible files, firing up Pig, and using a DISTINCT query to regenerate each table. That seems like quite a bit of work, though, so I'm wondering if there's an easier way.

rongenre asked Nov 17 '25 17:11


1 Answer

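A single query should remove the duplicates by overwriting the table with its own distinct rows (my_table is a placeholder for your table name, and the column list must name every column in the table):

INSERT OVERWRITE TABLE my_table
SELECT DISTINCT Col1, Col2, ..., ColN FROM my_table;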
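As a concrete illustration of the query above, here is a minimal sketch for a hypothetical table. The table name `events` and the columns `user_id`, `event_type`, and `event_time` are assumptions made up for this example, not anything from the question. The two counts before the overwrite are an optional check of how many rows the dedupe would drop.

```sql
-- Hypothetical table: events(user_id, event_type, event_time).
-- All names here are illustrative assumptions.

-- Optional check: total rows vs. distinct rows before overwriting.
SELECT COUNT(*) AS total_rows FROM events;

SELECT COUNT(*) AS distinct_rows
FROM (
  SELECT DISTINCT user_id, event_type, event_time
  FROM events
) d;

-- Rewrite the table in place, keeping one copy of each distinct row.
INSERT OVERWRITE TABLE events
SELECT DISTINCT user_id, event_type, event_time
FROM events;
```

Because INSERT OVERWRITE replaces the table's existing data, it may be worth running the SELECT DISTINCT into a scratch table first if the original rows cannot easily be regenerated.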
www answered Nov 19 '25 13:11


