I have a set of Hive tables on Elastic MapReduce that contain some duplicate rows. Is there an easy way of deduping these tables?
What comes to mind is dumping them to a set of Pig-digestible files, firing up Pig, and using a DISTINCT query to regenerate each table. That seems like quite a bit of work, though, so I'm wondering if there's an easier way.
One query should remove duplicates:
INSERT OVERWRITE TABLE my_table
SELECT DISTINCT Col1, Col2, ..., ColN FROM my_table;
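If overwriting the table in place feels risky, one more cautious variant is to write the distinct rows to a staging table first, compare row counts, and only then overwrite the original. This is just a sketch: the names my_table, my_table_dedup, and col1 through col3 are placeholders and not from the question, so substitute your own table and its full column list.

```sql
-- Hypothetical names: my_table, my_table_dedup, col1..col3.
-- 1. Materialize the distinct rows in a staging table.
CREATE TABLE my_table_dedup AS
SELECT DISTINCT col1, col2, col3
FROM my_table;

-- 2. Sanity check: compare row counts before and after deduplication.
SELECT COUNT(*) FROM my_table;
SELECT COUNT(*) FROM my_table_dedup;

-- 3. Once the counts look right, overwrite the original table.
INSERT OVERWRITE TABLE my_table
SELECT col1, col2, col3
FROM my_table_dedup;

-- 4. Clean up the staging table.
DROP TABLE my_table_dedup;
```

This costs an extra pass over the data and some temporary storage, but it leaves the original table untouched until you have checked the result.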