primary key duplicate record bypass to next insert

Weird question: I am inserting 10,000 records or so into a table, and the primary key is not an IDENTITY field. So when inserting all 10,000, if some are duplicates, is there a way to skip to the next record in a SQL Server insert and make sure the non-duplicates go in? I really don't care about the duplicates not getting inserted.

asked Dec 07 '22 by uirn

2 Answers

Use the "Ignore duplicate key" option.

The simplest way to do this is to delete the Primary Key in SQL Server Management Studio.

Then create a new index of type "Index", set "Is Unique" to "Yes", and set "Ignore Duplicate Keys" to "Yes". Then insert your records. It will insert them all except the duplicates. When you are done, you can delete this index and recreate your Primary Key.

If you want a TSQL method, see the IGNORE_DUP_KEY option in the CREATE INDEX call:

CREATE INDEX (Transact-SQL)
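For example, a rough sketch of that workflow in TSQL (the table dbo.MyTable, key column Id, and the constraint/index names below are made up for illustration):

-- Drop the existing primary key (the constraint name here is an assumption):
ALTER TABLE dbo.MyTable DROP CONSTRAINT PK_MyTable;

-- Recreate the key as a unique index that discards duplicate inserts:
CREATE UNIQUE INDEX UX_MyTable_Id ON dbo.MyTable (Id)
WITH (IGNORE_DUP_KEY = ON);

-- ...run the 10,000-row insert here; duplicates are skipped with the
-- warning "Duplicate key was ignored."...

-- When done, drop the index and restore the primary key:
DROP INDEX UX_MyTable_Id ON dbo.MyTable;
ALTER TABLE dbo.MyTable ADD CONSTRAINT PK_MyTable PRIMARY KEY (Id);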

EDIT:

Another way would be to LEFT JOIN the records you are going to insert against the target table and only insert the rows that don't already exist there, with a GROUP BY clause to eliminate duplicates within the new records themselves.
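A sketch of that approach (the staging table #NewRecords and target dbo.MyTable are made-up names):

INSERT INTO dbo.MyTable (Id)
SELECT s.Id
FROM #NewRecords AS s
LEFT JOIN dbo.MyTable AS t ON t.Id = s.Id
WHERE t.Id IS NULL   -- keep only rows not already in the target
GROUP BY s.Id        -- collapse duplicates within the new rows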

answered Jan 06 '23 by GalacticJello


You need to define your primary key to ignore duplicates:

CREATE TABLE [dbo].[t2](
      [n] [int] NOT NULL,
PRIMARY KEY CLUSTERED
(
      [n] ASC
) WITH (IGNORE_DUP_KEY = ON) ON [PRIMARY]
) ON [PRIMARY]
GO
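With that definition, duplicates are silently discarded at insert time. For illustration:

INSERT INTO dbo.t2(n) VALUES (1), (1), (2)
-- 2 rows are inserted; the duplicate is dropped with the
-- warning "Duplicate key was ignored."
GO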

Using this option may hurt performance:

If your data has a small percentage of duplicates, then IGNORE_DUP_KEY may speed up your inserts. For larger amounts of duplicates, IGNORE_DUP_KEY may slow them down significantly. I set up two tables, stripping down all the irrelevant details, as follows:

CREATE TABLE t1(n INT NOT NULL PRIMARY KEY)
GO
CREATE TABLE [dbo].[t2](
      [n] [int] NOT NULL,
PRIMARY KEY CLUSTERED
(
      [n] ASC
) WITH (IGNORE_DUP_KEY = ON) ON [PRIMARY]
) ON [PRIMARY]
GO

If the incoming data had no duplicates, the performance of both inserts was consistently the same:

INSERT t1(n)
SELECT n FROM dbo.Numbers

INSERT t2(n)
SELECT n FROM dbo.Numbers

(Note that dbo.Numbers has 1 million rows.) Of course, I always truncated both tables between my tests.
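(The answer does not show how dbo.Numbers was built; a common way to fill such a helper table with 1 million sequential integers, assuming something like it was used here, is:)

CREATE TABLE dbo.Numbers(n INT NOT NULL PRIMARY KEY)
GO
INSERT INTO dbo.Numbers(n)
SELECT TOP (1000000) ROW_NUMBER() OVER (ORDER BY (SELECT NULL))
FROM sys.all_objects AS a CROSS JOIN sys.all_objects AS b
GO
-- between test runs:
TRUNCATE TABLE t1
TRUNCATE TABLE t2
GO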

If the incoming data contained 1% duplicates, the insert with IGNORE_DUP_KEY consistently performed approximately 5% faster:

INSERT t1(n)
SELECT DISTINCT n FROM (
SELECT n FROM dbo.Numbers
UNION ALL
SELECT n FROM dbo.Numbers WHERE n < 10000
) AS t

INSERT t2(n)
SELECT n FROM dbo.Numbers
UNION ALL
SELECT n FROM dbo.Numbers WHERE n < 10000

On the other hand, if the incoming data was 100% duplicates, the insert with IGNORE_DUP_KEY consistently performed at least 300% slower, both for a large set of 2 million rows:

INSERT t1(n)
SELECT DISTINCT n FROM (
SELECT n FROM dbo.Numbers
UNION ALL
SELECT n FROM dbo.Numbers
) AS t

INSERT t2(n)
SELECT n FROM dbo.Numbers
UNION ALL
SELECT n FROM dbo.Numbers

and for a smaller set of 200K rows:

INSERT t1(n)
SELECT DISTINCT n FROM (
SELECT n FROM dbo.Numbers WHERE n < 100000
UNION ALL
SELECT n FROM dbo.Numbers WHERE n < 100000
) AS t

INSERT t2(n)
SELECT n FROM dbo.Numbers WHERE n < 100000
UNION ALL
SELECT n FROM dbo.Numbers WHERE n < 100000

Overall, I decided not to use IGNORE_DUP_KEY in my particular case: the small savings for a small percentage of duplicates did not justify the risk of a huge performance drop for larger amounts of duplicate data.

answered Jan 06 '23 by A-K