
How to delete duplicate records from Hive table?

Tags: hadoop, hive

I am trying to learn about deleting duplicate records from a Hive table.

My Hive table: 'dynpart' with columns: Id, Name, Technology

Id  Name  Technology
1   Abcd  Hadoop
2   Efgh  Java
3   Ijkl  MainFrames
2   Efgh  Java

We have options like DISTINCT to use in a select query, but a select query only retrieves data from the table. Could anyone tell me how to use a delete query to remove the duplicate rows from a Hive table?

I know that deleting or updating records in Hive is not recommended and is not standard practice, but I want to learn how it is done.

asked Apr 07 '17 13:04 by Metadata


People also ask

How do I remove duplicate rows from Hive table?

To remove duplicate rows, you can use INSERT OVERWRITE TABLE in Hive together with the DISTINCT keyword while selecting from the original table. DISTINCT returns only the unique records from the table.

How do I find duplicate records in Hive?

A second approach to finding duplicates is: select primary_key1, primary_key2, count(*) from mytable group by primary_key1, primary_key2 having count(*) > 1; This query lists the key values that are duplicated and how many times each one occurs.
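Applied to the table from the question, the same GROUP BY / HAVING pattern might look like the sketch below. It assumes that a duplicate means all three columns (Id, Name, Technology) match, since the question does not name a key; the alias `copies` is only illustrative.

```sql
-- List every row value that occurs more than once in dynpart,
-- together with the number of copies.
-- Assumption: a "duplicate" is a row whose Id, Name and Technology all match.
SELECT id, name, technology, COUNT(*) AS copies
FROM dynpart
GROUP BY id, name, technology
HAVING COUNT(*) > 1;
```

With the four sample rows shown in the question, this should return one row: 2, Efgh, Java with copies = 2.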

How do you delete duplicate rows in SQL?

To delete the duplicate rows from the table in SQL Server, you follow these steps: Find duplicate rows using GROUP BY clause or ROW_NUMBER() function. Use DELETE statement to remove the duplicate rows.
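The steps above are written for SQL Server. In Hive, DELETE is only available on transactional (ACID) tables, and when the copies are identical in every column a WHERE clause cannot single out just one of them, so the ROW_NUMBER() idea is usually combined with INSERT OVERWRITE instead. A minimal sketch, assuming dynpart is a plain, unpartitioned table and that rows are duplicates when all three columns match (the aliases `rn` and `t` are illustrative):

```sql
-- Number the copies of each distinct row, then rewrite the table
-- keeping only the first copy of each.
-- Assumptions: dynpart is not partitioned; duplicates match on all columns.
INSERT OVERWRITE TABLE dynpart
SELECT id, name, technology
FROM (
  SELECT id, name, technology,
         ROW_NUMBER() OVER (PARTITION BY id, name, technology
                            ORDER BY id) AS rn
  FROM dynpart
) t
WHERE rn = 1;
```

This form is most useful when duplicates are defined by a subset of columns: put only the key columns in PARTITION BY and use ORDER BY to decide which copy survives. If the table is partitioned, the INSERT OVERWRITE would also need a PARTITION clause.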


1 Answer

You can use an INSERT OVERWRITE statement to rewrite the table with only its distinct rows:

insert overwrite table dynpart select distinct * from dynpart;
answered Oct 12 '22 17:10 by fi11er
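Because INSERT OVERWRITE replaces the table's contents, it may be worth comparing row counts before running it. This is only a sketch of one way to check, using the column names from the question; the aliases `total_rows`, `distinct_rows` and `t` are illustrative.

```sql
-- Rows currently in the table (4 for the sample data in the question).
SELECT COUNT(*) AS total_rows FROM dynpart;

-- Rows that would remain after de-duplication (3 for the sample data).
SELECT COUNT(*) AS distinct_rows
FROM (SELECT DISTINCT id, name, technology FROM dynpart) t;
```

If the two counts are equal, the table has no fully duplicated rows and the overwrite is unnecessary.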