I want to find duplicate rows from one of the Hive table for which I was given two approaches.
First approach is to use following two queries:
select count(*) from mytable; // this will give total row count
second query is as below which will give count of distinct rows
select count(distinct primary_key1, primary_key2) from mytable;
With this approach, for one of my table total row count derived using first query is 3500 and second query gives row count 2700. So it tells us that 3500 - 2700 = 800
rows are duplicate. But this query doesn't tell which rows are duplicated.
My second approach to find duplicate is:
select primary_key1, primary_key2, count(*)
from mytable
group by primary_key1, primary_key2
having count(*) > 1;
Above query should list of rows which are duplicated and how many times particular row is duplicated. but this query shows zero rows which means there are no duplicate rows in that table.
So I would like to know:
My second approach to find duplicate is: select primary_key1, primary_key2, count(*) from mytable group by primary_key1, primary_key2 having count(*) > 1; Above query should list of rows which are duplicated and how many times particular row is duplicated.
One way to find duplicate records from the table is the GROUP BY statement. The GROUP BY statement in SQL is used to arrange identical data into groups with the help of some functions. i.e if a particular column has the same values in different rows then it will arrange these rows in a group.
To remove duplicate values, you can use insert overwrite table in Hive using the DISTINCT keyword while selecting from the original table. The DISTINCT keyword returns unique records from the table.
Hive does not validate primary and foreign key constraints.
Since these constraints are not validated, an upstream system needs to ensure data integrity before it is loaded into Hive.
That means that Hive allows duplicates in Primary Keys.
To solve your issue, you should do something like this:
select [every column], count(*)
from mytable
group by [every column]
having count(*) > 1;
This way you will get list of duplicated rows.
Suppose your want get duplicate rows based on a particular column ID
here. Below query will give you all the IDs which are duplicate in table in hive.
SELECT "ID"
FROM TABLE
GROUP BY "ID"
HAVING count(ID) > 1
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With