How to prevent duplicate records being inserted with SqlBulkCopy when there is no primary key

Tags:

I receive a daily XML file that contains thousands of records, each being a business transaction that I need to store in an internal database for use in reporting and billing. I was under the impression that each day's file contained only unique records, but have discovered that my definition of unique is not exactly the same as the provider's.

The current application that imports this data is a C#.Net 3.5 console application, it does so using SqlBulkCopy into a MS SQL Server 2008 database table where the columns exactly match the structure of the XML records. Each record has just over 100 fields, and there is no natural key in the data, or rather the fields I can come up with making sense as a composite key end up also having to allow nulls. Currently the table has several indexes, but no primary key.

Basically the entire row needs to be unique. If one field is different, it is valid enough to be inserted. I looked at creating an MD5 hash of the entire row, inserting that into the database and using a constraint to prevent SqlBulkCopy from inserting the row,but I don't see how to get the MD5 Hash into the BulkCopy operation and I'm not sure if the whole operation would fail and roll back if any one record failed, or if it would continue.

The file contains a very large number of records, going row by row in the XML, querying the database for a record that matches all fields, and then deciding to insert is really the only way I can see being able to do this. I was just hoping not to have to rewrite the application entirely, and the bulk copy operation is so much faster.

Does anyone know of a way to use SqlBulkCopy while preventing duplicate rows, without a primary key? Or any suggestion for a different way to do this?

391

asked Apr 07 '10 15:04

kscott

2 Answers

I'd upload the data into a staging table then deal with duplicates afterwards on copy to the final table.

For example, you can create a (non-unique) index on the staging table to deal with the "key"

165

answered Sep 25 '22 05:09

gbn

Given that you're using SQL 2008, you have two options to solve the problem easily without needing to change your application much (if at all).

The first possible solution is create a second table like the first one but with a surrogate identity key and a uniqueness constraint added using the ignore_dup_key option which will do all the heavy lifting of eliminating the duplicates for you.

Here's an example you can run in SSMS to see what's happening:

if object_id( 'tempdb..#test1' ) is not null drop table #test1;
if object_id( 'tempdb..#test2' ) is not null drop table #test2;
go


-- example heap table with duplicate record

create table #test1
(
     col1 int
    ,col2 varchar(50)
    ,col3 char(3)
);
insert #test1( col1, col2, col3 )
values
     ( 250, 'Joe''s IT Consulting and Bait Shop', null )
    ,( 120, 'Mary''s Dry Cleaning and Taxidermy', 'ACK' )
    ,( 250, 'Joe''s IT Consulting and Bait Shop', null )    -- dup record
    ,( 666, 'The Honest Politician', 'LIE' )
    ,( 100, 'My Invisible Friend', 'WHO' )
;
go


-- secondary table for removing duplicates

create table #test2
(
     sk int not null identity primary key
    ,col1 int
    ,col2 varchar(50)
    ,col3 char(3)

    -- add a uniqueness constraint to filter dups
    ,constraint UQ_test2 unique ( col1, col2, col3 ) with ( ignore_dup_key = on )
);
go


-- insert all records from original table
-- this should generate a warning if duplicate records were ignored

insert #test2( col1, col2, col3 )
select col1, col2, col3
from #test1;
go

Alternatively, you can also remove the duplicates in-place without a second table, but the performance may be too slow for your needs. Here's the code for that example, also runnable in SSMS:

if object_id( 'tempdb..#test1' ) is not null drop table #test1;
go


-- example heap table with duplicate record

create table #test1
(
     col1 int
    ,col2 varchar(50)
    ,col3 char(3)
);
insert #test1( col1, col2, col3 )
values
     ( 250, 'Joe''s IT Consulting and Bait Shop', null )
    ,( 120, 'Mary''s Dry Cleaning and Taxidermy', 'ACK' )
    ,( 250, 'Joe''s IT Consulting and Bait Shop', null )    -- dup record
    ,( 666, 'The Honest Politician', 'LIE' )
    ,( 100, 'My Invisible Friend', 'WHO' )
;
go


-- add temporary PK and index

alter table #test1 add sk int not null identity constraint PK_test1 primary key clustered;
create index IX_test1 on #test1( col1, col2, col3 );
go


-- note: rebuilding the indexes may or may not provide a performance benefit

alter index PK_test1 on #test1 rebuild;
alter index IX_test1 on #test1 rebuild;
go


-- remove duplicates

with ranks as
(
    select
         sk
        ,ordinal = row_number() over 
         ( 
            -- put all the columns composing uniqueness into the partition
            partition by col1, col2, col3
            order by sk
         )
    from #test1
)
delete 
from ranks
where ordinal > 1;
go


-- remove added columns

drop index IX_test1 on #test1;
alter table #test1 drop constraint PK_test1;
alter table #test1 drop column sk;
go

answered Sep 24 '22 05:09

Sean

Related questions
                            
                                How can I post a list of items in MVC
                            
                                How would you detect the current browser in an Api Controller?
                            
                                MSTest refuses to run 64-bit?
                            
                                How to convert Json array to list of objects in c#
                            
                                ServiceStack - Is there a way to force all serialized Dates to use a specific DateTimeKind?
                            
                                Double overflow [duplicate]
                            
                                How to set value of property where there is no setter
                            
                                How can I encode Azure storage table row keys and partition keys?
                            
                                Visual Studio "0 of 4 errors"
                            
                                Configure Unity DI for ASP.NET Identity
                            
                                Web API and OData- Pass Multiple Parameters
                            
                                RoutePrefixAttribute in ASP.NET 5
                            
                                How to attach debugger to step into native (C++) code from a managed (C#) wrapper?
                            
                                What is a "Nested Quantifier" and why is it causing my regex to fail?
                            
                                How to get which page threw an exception to Application_error in aspx
                            
                                How to insert CookieCollection to CookieContainer?
                            
                                How to use use late binding to get excel instance?
                            
                                Why can TimeSpan and Guid Structs be compared to null?
                            
                                Using Static Constructor (Jon Skeet Brainteaser)
                            
                                How can I convert a list of objects to csv?

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With

How to prevent duplicate records being inserted with SqlBulkCopy when there is no primary key

Tags:

c#

sql

sql-server

sql-server-2008

sqlbulkcopy

kscott

People also ask

2 Answers

gbn

Sean

Recent Activity

Donate For Us