Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

IN statement inconsistency with PRIMARY KEY

So I have a simple table called temp that can be created by:

CREATE TABLE temp (value int, id int not null primary key);
INSERT INTO temp
VALUES(0,1),
      (0,2),
      (0,3),
      (0,4),
      (1,5),
      (1,6),
      (1,7),
      (1,8);

I have a second table temp2 that can be created by:

CREATE TABLE temp (value int, id int);
INSERT INTO temp
VALUES(0,1),
      (0,2),
      (0,3),
      (0,4),
      (1,5),
      (1,6),
      (1,7),
      (1,8);

The only difference between temp and temp2 is that the id field is the primary key in temp, and temp2 has no primary key. I'm not sure how, but I am getting differing results with the following query:

select * from temp
where id in (
    select id
    from (
        select id, ROW_NUMBER() over (partition by value order by value) rownum
        from temp
    ) s1
    where rownum = 1
)

This is the result for temp:

value       id
----------- -----------
0           1
0           2
0           3
0           4
1           5
1           6
1           7
1           8

and this is what I get when temp is replaced by temp2 (THE CORRECT RESULT):

value       id
----------- -----------
0           1
1           5

When running the inner-most query (s1), the expected results are retrieved:

id          rownum
----------- --------------------
1           1
2           2
3           3
4           4
5           1
6           2
7           3
8           4

When just running the in statement query on both, I also get the expected result:

id
-----------
1
5

I cannot figure out what the reason for this could possibly be. Is this a bug?

Notes: temp2 was created with a simple select * into temp2 from temp. I am running SQL Server 2008. My apologies if this is a known glitch. It is difficult to search for this since it requires an in statement. An "equivalent" query that uses a join does produce the correct results on both tables.

Edit: dbfiddle showing the differences: Unexpected Results Expected Results

like image 311
Jason Avatar asked Jun 03 '19 20:06

Jason


People also ask

What is a primary key violation?

A Primary Key violation occur when there is an attempt to insert a value into a database table, but the entry already exists.

Can Atable have two primary keys?

Each table can only have one primary key. Access can automatically create a primary key field for you when you create a table, or you can specify the fields that you want to use as the primary key.

What happens if no primary key?

Every table should have some column (or set of columns) that uniquely identifies one and only one row. It makes it much easier to maintain the data. It's true, without a primary key (or some unique key), you don't have an insertion anomaly if you go to insert the same data multiple times.

Is primary key always sorted?

The primary key's main function is to uniquely identify each row in the table - but it doesn't imply any (physical) sorting per se.


1 Answers

I can't specifically answer your question, but changing the ORDER BY fixes the problem. partition by value order by value doesn't really make sense, and it looks like the problem is "fooling" SQL Server; as you're partitioning the rows by the same value you're ordering by, every row is "row number 1" as they could all be at the start. Don't forget, a table is an unordered heap; even when it has a Primary Key (clustered or not).

If you change your ORDER BY to id instead the problem goes away.

SELECT *
FROM temp2 t2
WHERE t2.id IN (SELECT s1.id
                FROM (SELECT sq.id,
                             ROW_NUMBER() OVER (PARTITION BY sq.value ORDER BY sq.id) AS rownum
                      FROM temp2 sq) s1
                WHERE s1.rownum = 1);

In fact, changing the ORDER BY clause to anything else fixes the problem:

SELECT *
FROM temp2 t2
WHERE t2.id IN (SELECT s1.id
                FROM (SELECT sq.id,
                             ROW_NUMBER() OVER (PARTITION BY sq.value ORDER BY (SELECT NULL)) AS rownum
                      FROM temp2 sq) s1
                WHERE s1.rownum = 1);

So the problem is that your are using the same expression (column) for both your PARTITION BY and ORDER BY clause; meaning that any of those rows could be row number 1, and none of them; thus all are returned. It doesn't make sense for both to be the same, so they should be different.

Still, this problem does persist in SQL Server 2017 (and I suspect 2019) so you might want to raise a support ticket with them anyway (but as you're using 2008 don't expect it to get fixed, as your support is about to end).

As comments can be deleted without notice I wanted to add @scsimon's comment and my response:

scsimon: Interesting. Changing rownum = 2 gives expected results without changing order by. I think it's a bug.

Larnu: I agree at @scsimon. I suspect that changing the WHERE to s1.rownum = 2 effectively forces the data engine to actually determine the values of rownum, rather than assume every row is "equal"; as if that were the case none would be returned.
Even so, changing the WHERE to s1.rownum = 2 is still resigning to "return a random row", if the PARTITION BY and ORDER BY clauses are the same

like image 116
Larnu Avatar answered Sep 25 '22 11:09

Larnu