Deduping SQL Server table

Question

I have an issue. I have a table with almost 2 billion rows (yeah I know...) and has a lot of duplicate data in it which I'd like to delete from it. I was wondering how to do that exactly?

The columns are: first, last, dob, address, city, state, zip, telephone and are in a table called PF_main. Each record does have a unique ID thankfully, and its in column called ID.

How can I dedupe this and leave 1 unique entry (row) within the pf_main table for each person??

Thank you all in advance for your responses...

Mark Sowul · Accepted Answer

SELECT 
   ID, first, last, dob, address, city, state, zip, telephone, 
   ROW_NUMBER() OVER (PARTITION BY first, last, dob, address, city, state, zip, telephone ORDER BY ID) AS RecordInstance
FROM PF_main

will give you the "number" of each unique entry (sorted by Id)

so if you have the following records:

id, last, first, dob, address, city, state, zip, telephone
006, trevelyan, alec, '1954-05-15', '85 Albert Embankment', 'London', 'UK', '1SE1 7TP', 0064
007, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', '1SE1 7TP', 0074
008, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074
009, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074

you will get the following results (note last column)

006, trevelyan, alec, '1954-05-15', '85 Albert Embankment', 'London', 'UK', '1SE1 7TP', 0064, 1
007, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', '1SE1 7TP', 0074, 1
008, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074, 2
009, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074, 3

So you can just delete records with RecordInstance > 1:

WITH Records AS
(
   SELECT 
      ID, first, last, dob, address, city, state, zip, telephone,
      ROW_NUMBER() OVER (PARTITION BY first, last, dob, address, city, state, zip, telephone ORDER BY ID) AS RecordInstance
   FROM PF_main
)
DELETE FROM Records
WHERE RecordInstance > 1

Gordon Linoff · Answer

A 2 billion row table is quite big. Let me assume that first, last, and dob constitutes a "person". My suggestion is to build an index on the "person" and then do the truncate/re-insert approach.

In practice, this looks like:

create index idx_pf_main_first_last_dob on pf_main(first, last, dob);

select m.*
into temp_pf_main
from pf_main m
where not exists (select 1
                  from pf_main m2
                  where m2.first = m.first and m2.last = m.last and m2.dob = m.dob and
                        m2.id < m.id
                 );

truncate table pf_main;

insert into pf_main
    select *
    from temp_pf_main;

Deduping SQL Server table

Tags:

sql

sql-server

deduplication

Sal

2 Answers

Mark Sowul

Gordon Linoff

Recent Activity

Donate For Us

Deduping SQL Server table

Tags:

sql

sql-server

deduplication

Sal

2 Answers

Mark Sowul

Gordon Linoff

Related questions

Recent Activity

Donate For Us