Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Can INSERT <table> (x) VALUES (@x) WHERE NOT EXISTS ( SELECT * FROM <table> WHERE x = @x) cause duplicates?

Tags:

sql

sql-server

While browsing SO I found the following Question/Discussion about the "best" approach for inserting records that don't exist yet. One of the statements that struck me was one of [Remus Rusanu] stating:

Both variants are incorrect. You will insert pairs of duplicate @value1, @value2, guaranteed.

Although I do agree about this for the syntax where the check is 'separated' from the INSERT (and no explicit locking/transaction mgmt is present); I'm having a hard time understanding why and when this would be true for the other proposed syntax that looks like this

INSERT INTO mytable (x)
SELECT @x WHERE NOT EXISTS (SELECT * FROM mytable WHERE x = @x);

I do NOT want to start (another) what's best/fastest discussion, nor do I think the syntax can 'replace' a unique index/constraint (or PK) but I really need to know in what situations this construction could cause doubles as I've been using this syntax in the past and wonder if it is unsafe to continue doing so in the future.

What I think that happens is that the INSERT & SELECT are both in the same (implicit) transaction. The query will take an IX lock on the related record (key) and not release it until the entire query has finished, thus only AFTER the record has been inserted. This lock blocks all other connections from making the same INSERT as they can't get a lock themselves until after our insert has finished; only then they get the lock and will start verifying for themselves if the record already exists or not.

As IMHO the best way to find out is by testing, I've been running the following code for a while on my laptop:

Create table

CREATE TABLE t_test (x int NOT NULL PRIMARY KEY (x))

Run below on many, many connections in parallel)

SET NOCOUNT ON

WHILE 1 = 1
    BEGIN
        INSERT t_test (x)
        SELECT x = DatePart(ms, CURRENT_TIMESTAMP)
         WHERE NOT EXISTS ( SELECT *
                              FROM t_test old
                             WHERE old.x = DatePart(ms, CURRENT_TIMESTAMP) )
    END

So far the only things to note are:

  • No errors encountered (yet)
  • CPU is running quite hot =)
  • table held 300 records quickly (due to 3ms 'precision' of datetime) after that no actual inserts are happening any more, as expected.

UPDATE:

Turns out my example above is not doing what I intended it to do. Instead of multiple connections trying to insert the same record simultaneously I simply had it not-inserting already existing records after the first second. As it probably took about a second to copy-paste & execute the query on the next connection there was never a danger of duplicates. I'll be wearing my donkey-ears for the remainder of the day...

Anyway, I've adapted the test to be more in line of the matter at hand (using the same table)

SET NOCOUNT ON

DECLARE @midnight datetime
SELECT @midnight = Convert(datetime, Convert(varchar, CURRENT_TIMESTAMP, 106), 106)

WHILE 1 = 1
    BEGIN
        INSERT t_test (x)
        SELECT x = DateDiff(ms, @midnight, CURRENT_TIMESTAMP)
         WHERE NOT EXISTS ( SELECT *
                              FROM t_test old
                             WHERE old.x = DateDiff(ms, @midnight, CURRENT_TIMESTAMP))
    END

And lo & behold, the output window now holds plenty of errors along the lines of

Msg 2627, Level 14, State 1, Line 8 Violation of PRIMARY KEY constraint 'PK__t_test__3BD019E521C3B7EE'. Cannot insert >duplicate key in object 'dbo.t_test'. The duplicate key value is (57581873).

FYI: As pointed out by Andomar, adding a HOLDLOCK and/or SERIALIZABLE hint indeed 'solves' the problem but then turns out to be causing lots of deadlocks... which isn't great but not unexpected either when I think it through.

Guess I have quite a bit of code review to do...

like image 638
deroby Avatar asked Sep 30 '13 13:09

deroby


People also ask

How do you know if data exists in one table and not in another?

We can get the records in one table that doesn't exist in another table by using NOT IN or NOT EXISTS with the subqueries including the other table in the subqueries.

How do you avoid duplicate queries in SQL insert?

Use the INSERT IGNORE command rather than the INSERT command. If a record doesn't duplicate an existing record, then MySQL inserts it as usual. If the record is a duplicate, then the IGNORE keyword tells MySQL to discard it silently without generating an error.


2 Answers

Thanks for posting separate question. You have several misconceptions:

The query will take an IX lock on the related record (key) and not release it until the entire query has finished

The INSERT will lock the rows inserted, X lock (intent locks like IX can only be requested on parent entities on the lock hierarchy, never on records). This lock must be held until the transaction commits (strict two-phase locking requires X locks always to be released only at the end of the transaction).

Note that the locks acquired by the INSERT will not block more inserts even of the same key. The only way to prevent duplicates is an unique index and the mechanism to enforce the uniqueness is not lock based. Yes, on a primary key, due to its uniqueness, duplicates will be prevented but the forces at play are different, even if locking does play a role.

In your example what will happen is that the operations will serialize because the SELECT blocks on the INSERT, due to the X vs. S lock conflict on the newly inserted row. another think to consider is that 300 records of type INT will fit on a single page and a lot of optimizations will kick in (eg. use a scan instead of multiple seeks) and will alter the test results. Remember, a hypothesis with many positives and no proof is still only a conjecture...

To test the problem you need to ensure that the INSERT does not block concurrent SELECTs. Running under RCSI or under snapshot isolation is one way to achieve this (and may 'achieve' it in production involuntarily and break the app that made all the assumptions above...) A WHERE clause is another way. A significantly big table and secondary indexes is yet another way.

So here is how I tested it:

set nocount on;
go

drop database test;
go

create database test;
go

use test;
go

create table test (id int primary key, filler char(200));
go

-- seed 10000 values, fill some pages
declare @i int = 0;
begin transaction
while @i < 10000
begin
    insert into test (id) values (@i);
    set @i += 1;
end
commit;

Now run this from several parallel connection (I used 3):

use test;
go

set nocount on;
go

declare @i int;
while (1=1)
begin
    -- This is not cheating. This ensures that many concurrent SELECT attempt 
    -- to insert the same values, and all of them believe the values are 'free'
    select @i = max(id) from test with (readpast);
    insert into test (id)
    select id
        from (values (@i), (@i+1), (@i+2), (@i+3), (@i+4), (@i+5)) as t(id)
        where t.id not in (select id from test);
end

Here are some results:

Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__test__3213E83FD9281543'. Cannot insert duplicate key in object 'dbo.test'. The duplicate key value is (130076).
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__test__3213E83FD9281543'. Cannot insert duplicate key in object 'dbo.test'. The duplicate key value is (130096).
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__test__3213E83FD9281543'. Cannot insert duplicate key in object 'dbo.test'. The duplicate key value is (130106).
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__test__3213E83FD9281543'. Cannot insert duplicate key in object 'dbo.test'. The duplicate key value is (130121).
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__test__3213E83FD9281543'. Cannot insert duplicate key in object 'dbo.test'. The duplicate key value is (130141).
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__test__3213E83FD9281543'. Cannot insert duplicate key in object 'dbo.test'. The duplicate key value is (130151).
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 6
Violation of PRIMARY KEY constraint 'PK__test__3213E83FD9281543'. Cannot insert duplicate key in object 'dbo.test'. The duplicate key value is (130176).
The statement has been terminated.
Msg 2627, Level 14, State 1, Line 6

Even with locking, no snapshot isolation, no RCSI. As each SELECT attempts to insert @i+1...@i+5, they'll all discover the values are not existign and then they'll all proceed to INSERT. One lucky winner will succeed, all the rest will cause PK violation. Frequently. I used the @i=MAX(id) intentionally to dramatically increase the chases of conflict, but that is not required. I'll leave the problems of figuring out why all violations occur on values %5+1 as an exercise.

like image 198
Remus Rusanu Avatar answered Oct 17 '22 03:10

Remus Rusanu


You are testing from a single connection, so you are not testing concurrency at all. Run the script twice from different windows and you will start to see conflicts.

There are multiple reasons for the conflicts:

  • By default, a lock is not held until the end of an (implicit) transaction. Use the with (holdlock) query hint to change this behavior.
  • The concurrency problem with your query is called a "phantom read". The default transaction isolation level is "read committed", which does not protect against phantom reads. Use the with (serializable) query hint to increase the isolation level. (Try to avoid the set transaction isolation level command, because the isolation level is not cleared when a connection is returned to the connection pool.)

The primary key constraint is always enforced. So your query will try to insert a duplicate row and fail by throwing a duplicate key error.

A good approach is to use your query (which will work 99% of the time) and make the client handle the occasional duplicate key exception in a graceful manner.

Wikipedia has a great explanation of isolation levels.

like image 26
Andomar Avatar answered Oct 17 '22 02:10

Andomar