What is the faster way to calculate number of duplicate rows present in Redshift Table

Question

There are millions of record in table. And need to calculate number of duplicate rows present in my table in Redshift. I could achieve it by using below query,

select 
    sum(cnt) from (select <primary_key>
    , count(*)-1 as cnt 
from 
    table_name 
group by 
    <primary_key> having count(*)>1

Is there a faster way to achieve the same ?
Is there a way do achieve this in a single query without using subquery ?

Thanks.

kazzi · Accepted Answer

You can try the following query:

SELECT Column_name, COUNT(*) Count_Duplicate
FROM Table_name
 GROUP BY Column_name
 HAVING COUNT(*) > 1
 ORDER BY COUNT(*) DESC

AlexYes · Answer

If the criteria of duplication is only repeating primary key then

SELECT count(1)-count(distinct <primary_key>) FROM your_table

would work, except if you have specified your column as primary key in Redshift (it doesn't enforce constraint but if you mark a column as primary key count(distinct <primary_key>) will return the same as count(1) even if there are duplicate values in this column

What is the faster way to calculate number of duplicate rows present in Redshift Table

Tags:

sql

duplicates

count

amazon-redshift

Priyadarshini

2 Answers

kazzi

AlexYes

Recent Activity

Donate For Us

What is the faster way to calculate number of duplicate rows present in Redshift Table

Tags:

sql

duplicates

count

amazon-redshift

Priyadarshini

2 Answers

kazzi

AlexYes

Related questions

Recent Activity

Donate For Us