I am trying to learn about deleting duplicate records from a Hive table.
My Hive table: 'dynpart' with columns: Id, Name, Technology
Id Name Technology
1 Abcd Hadoop
2 Efgh Java
3 Ijkl MainFrames
2 Efgh Java
We have options like 'Distinct' to use in a select query, but a select query just retrieves data from the table. Could anyone tell how to use a delete query to remove the duplicate rows from a Hive table.
Sure that it is not recommended or not the standard to Delete/Update records in Hive. But I want to learn how do we do it.
To remove duplicate values, you can use insert overwrite table in Hive using the DISTINCT keyword while selecting from the original table. The DISTINCT keyword returns unique records from the table.
My second approach to find duplicate is: select primary_key1, primary_key2, count(*) from mytable group by primary_key1, primary_key2 having count(*) > 1; Above query should list of rows which are duplicated and how many times particular row is duplicated.
To delete the duplicate rows from the table in SQL Server, you follow these steps: Find duplicate rows using GROUP BY clause or ROW_NUMBER() function. Use DELETE statement to remove the duplicate rows.
You can use insert overwrite statement to update data
insert overwrite table dynpart select distinct * from dynpart;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With