How to find duplicate rows in Hive?

Tags:

hive

I want to find duplicate rows in one of my Hive tables, and I was given two approaches.

The first approach is to use the following two queries:

select count(*) from mytable; -- this gives the total row count

The second query, shown below, gives the count of distinct rows:

select count(distinct primary_key1, primary_key2) from mytable;

With this approach, for one of my tables the first query returns a total row count of 3500 and the second returns 2700. That suggests 3500 - 2700 = 800 rows are duplicates, but these queries don't tell me which rows are duplicated.

My second approach to finding duplicates is:

select primary_key1, primary_key2, count(*)
from mytable
group by primary_key1, primary_key2
having count(*) > 1;

The above query should list the rows that are duplicated and how many times each one occurs, but it returns zero rows, which would mean there are no duplicate rows in that table.

So I would like to know:

Is my first approach correct? If yes, how do I find which rows are duplicated?
Why does the second approach not list the rows that are duplicated?
Is there any other way to find the duplicates?

336

asked Oct 14 '17 18:10

Shekhar

2 Answers

Hive does not validate primary and foreign key constraints.

Since these constraints are not validated, an upstream system needs to ensure data integrity before the data is loaded into Hive.

This means Hive allows duplicate values in primary key columns.

To solve your issue, you should do something like this:

select [every column], count(*)
from mytable
group by [every column]
having count(*) > 1;

This way you will get the list of duplicated rows and how many times each one occurs.
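If grouping still returns no rows while the two counts in the question differ, one possible explanation is NULLs in the key columns: in Hive, count(distinct a, b) only counts rows where all of the listed expressions are non-NULL, whereas count(*) counts every row. The query below is a minimal sketch for checking this. It assumes the mytable, primary_key1 and primary_key2 names from the question; the output aliases are illustrative only.

```sql
-- Sketch: compare the total row count, the distinct non-NULL key count,
-- and the number of rows where at least one key column is NULL.
-- total_rows, distinct_non_null_keys and rows_with_null_key are
-- illustrative aliases, not existing columns.
SELECT COUNT(*) AS total_rows,
       COUNT(DISTINCT primary_key1, primary_key2) AS distinct_non_null_keys,
       SUM(CASE WHEN primary_key1 IS NULL OR primary_key2 IS NULL
                THEN 1 ELSE 0 END) AS rows_with_null_key
FROM mytable;
```

If rows_with_null_key accounts for the difference between the first two values, the gap comes from NULL keys rather than from duplicated rows.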

173

answered Sep 20 '22 12:09

Alex

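Suppose you want to find duplicate rows based on a particular column, ID here. The query below returns all the IDs that are duplicated in the Hive table.

-- In Hive, double-quoted "ID" is a string literal, so the column is left unquoted.
-- mytable stands in for your table name (TABLE itself is a reserved word).
SELECT ID
FROM mytable
GROUP BY ID
HAVING count(*) > 1;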
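To see the complete rows behind those IDs rather than only the ID values, a window function is one option. This is a minimal sketch, assuming the table is mytable as in the question and the column is named ID as in this answer; dup_count is an illustrative alias.

```sql
-- Sketch: return every full row whose ID occurs more than once.
-- dup_count is an illustrative alias, not an existing column.
SELECT *
FROM (
  SELECT t.*,
         COUNT(*) OVER (PARTITION BY ID) AS dup_count
  FROM mytable t
) x
WHERE x.dup_count > 1;
```

To check a composite key instead, list all key columns in the PARTITION BY clause.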

answered Sep 19 '22 12:09

Maneesh K Bishnoi
