Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL query to find duplicate rows, in any table

Tags:

sql

tsql

I'm looking for a schema-independent query. That is, if I have a users table or a purchases table, the query should be equally capable of catching duplicate rows in either table without any modification (other than the from clause, of course).

I'm using T-SQL, but I'm guessing there should be a general solution.

like image 779
Richard Pianka Avatar asked Jul 13 '11 18:07

Richard Pianka


2 Answers

I believe that this should work for you. Keep in mind that CHECKSUM() isn't 100% perfect - it's theoretically possible to get a false positive here (I think), but otherwise you can just change the table name and this should work:

;WITH cte AS (
    SELECT
        *,
        CHECKSUM(*) AS chksum,
        ROW_NUMBER() OVER(ORDER BY GETDATE()) AS row_num
    FROM
        My_Table
)
SELECT
    *
FROM
    CTE T1
INNER JOIN CTE T2 ON
    T2.chksum = T1.chksum AND
    T2.row_num <> T1.row_num

The ROW_NUMBER() is needed so that you have some way of distinguishing rows. It requires an ORDER BY and that can't be a constant, so GETDATE() was my workaround for that.

Simply change the table name in the CTE and it should work without spelling out the columns.

like image 178
Tom H Avatar answered Oct 04 '22 01:10

Tom H


I'm still confused about what "detecting them might be" but I'll give it a shot.

Excluding them is easy

e.g.

SELECT DISTINCT * FROM USERS

However if you wanted to only include them and a duplicate is all the fields than you have to do

SELECT 
   [Each and every field]
FROM
   USERS
GROUP BY
   [Each and every field]
HAVING COUNT(*) > 1  

You can't get away with just using (*) because you can't GROUP BY * so this requirement from your comments is difficult

a schema-independent means I don't want to specify all of the columns in the query

Unless that is you want to use dynamic SQL and read the columns from sys.columns or information_schema.columns

For example

DECLARE @colunns nvarchar(max)
SET  @colunns = ''

SELECT @colunns = @colunns  + '[' +  COLUMN_NAME  +'], ' 
FROM INFORMATION_SCHEMA.columns  
WHERE table_name = 'USERS'

SET  @colunns  = left(@colunns,len(@colunns ) - 1)


DECLARE @SQL nvarchar(max)
SET @SQL = 'SELECT '  + @colunns 
          + 'FROM  USERS' + 'GROUP BY ' 
          + @colunns 
           + ' Having Count(*) > 1'


exec sp_executesql @SQL

Please note you should read this The Curse and Blessings of Dynamic SQL if you haven't already

like image 20
Conrad Frix Avatar answered Oct 04 '22 03:10

Conrad Frix