I have an issue. I have a table with almost 2 billion rows (yeah I know...) and has a lot of duplicate data in it which I'd like to delete from it. I was wondering how to do that exactly?
The columns are: first, last, dob, address, city, state, zip, telephone and are in a table called PF_main. Each record does have a unique ID thankfully, and its in column called ID.
How can I dedupe this and leave 1 unique entry (row) within the pf_main table for each person??
Thank you all in advance for your responses...
SELECT
ID, first, last, dob, address, city, state, zip, telephone,
ROW_NUMBER() OVER (PARTITION BY first, last, dob, address, city, state, zip, telephone ORDER BY ID) AS RecordInstance
FROM PF_main
will give you the "number" of each unique entry (sorted by Id)
so if you have the following records:
id, last, first, dob, address, city, state, zip, telephone
006, trevelyan, alec, '1954-05-15', '85 Albert Embankment', 'London', 'UK', '1SE1 7TP', 0064
007, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', '1SE1 7TP', 0074
008, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074
009, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074
you will get the following results (note last column)
006, trevelyan, alec, '1954-05-15', '85 Albert Embankment', 'London', 'UK', '1SE1 7TP', 0064, 1
007, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', '1SE1 7TP', 0074, 1
008, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074, 2
009, bond, james, '1957-02-08', '85 Albert Embankment', 'London', 'UK', 'SE1 7TP', 0074, 3
So you can just delete records with RecordInstance > 1:
WITH Records AS
(
SELECT
ID, first, last, dob, address, city, state, zip, telephone,
ROW_NUMBER() OVER (PARTITION BY first, last, dob, address, city, state, zip, telephone ORDER BY ID) AS RecordInstance
FROM PF_main
)
DELETE FROM Records
WHERE RecordInstance > 1
A 2 billion row table is quite big. Let me assume that first, last, and dob constitutes a "person". My suggestion is to build an index on the "person" and then do the truncate/re-insert approach.
In practice, this looks like:
create index idx_pf_main_first_last_dob on pf_main(first, last, dob);
select m.*
into temp_pf_main
from pf_main m
where not exists (select 1
from pf_main m2
where m2.first = m.first and m2.last = m.last and m2.dob = m.dob and
m2.id < m.id
);
truncate table pf_main;
insert into pf_main
select *
from temp_pf_main;
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With