
What is the fastest way to count the duplicate rows in a Redshift table?

The table has millions of records, and I need to count the duplicate rows in it in Redshift. I can do this with the query below:

select
    sum(cnt)
from (
    select <primary_key>
        , count(*) - 1 as cnt
    from
        table_name
    group by
        <primary_key> having count(*) > 1
) as dup

  1. Is there a faster way to achieve the same result?
  2. Is there a way to achieve this in a single query, without using a subquery?

Thanks.

Priyadarshini asked Nov 21 '25 01:11


2 Answers

You can try the following query:

SELECT Column_name, COUNT(*) AS Count_Duplicate
FROM Table_name
GROUP BY Column_name
HAVING COUNT(*) > 1
ORDER BY COUNT(*) DESC
kazzi answered Nov 22 '25 15:11


If the only criterion for duplication is a repeated primary key value, then

SELECT count(1) - count(distinct <primary_key>) FROM your_table

would work, unless you have declared the column as a primary key in Redshift. Redshift doesn't enforce the constraint, but if you mark a column as a primary key, count(distinct <primary_key>) will return the same value as count(1) even if there are duplicate values in that column.
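As a rough sanity check, the single-query form can be compared with the subquery form from the question on a tiny sample. The sketch below uses Python's built-in sqlite3 module as a stand-in for Redshift (an assumption made only so the example is runnable; it says nothing about Redshift performance or its primary key behaviour), and the table contents and the `id` column are made up for illustration.

```python
import sqlite3

# In-memory SQLite database standing in for Redshift (illustrative assumption).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE your_table (id INTEGER, payload TEXT)")

# Hypothetical sample data: id 1 appears twice, id 2 once, id 3 three times.
rows = [(1, "a"), (1, "b"), (2, "c"), (3, "d"), (3, "e"), (3, "f")]
conn.executemany("INSERT INTO your_table VALUES (?, ?)", rows)

# Subquery form from the question: sum the extra copies per duplicated key.
subquery_sql = """
SELECT COALESCE(SUM(cnt), 0) FROM (
    SELECT id, COUNT(*) - 1 AS cnt
    FROM your_table
    GROUP BY id
    HAVING COUNT(*) > 1
) AS dup
"""

# Single-query form from this answer: total rows minus distinct keys.
single_sql = "SELECT COUNT(*) - COUNT(DISTINCT id) FROM your_table"

extra_subquery = conn.execute(subquery_sql).fetchone()[0]
extra_single = conn.execute(single_sql).fetchone()[0]

print(extra_subquery, extra_single)
```

Both queries count the extra copies beyond the first occurrence of each key. Note that the two forms can disagree when the key column contains NULLs, because COUNT(DISTINCT ...) ignores NULLs while GROUP BY puts them in one group.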

AlexYes answered Nov 22 '25 13:11


