Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Why doesn't this sub-query seem to work?

Before anything, I am not looking for a re-write. This was presented to me, and I can't seem to figure out if this is a bug in general or some kind of syntactic craziness that occurs due to the peculiarity of the script. Okay with that said on with the setup:

  • Microsoft SQL Server Standard Edition (64-bit)

  • Version 10.50.2500.0

On a table located in a generic database, defined as:

CREATE TABLE [dbo].[Regions](
    [RegionID] [int] NOT NULL,
    [RegionGroupID] [int] NOT NULL,
    [IsDefault] [bit] NOT NULL,
 CONSTRAINT [PK_Regions] PRIMARY KEY CLUSTERED
(
    [RegionID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

insert some values:

INSERT INTO [dbo].[Regions]
([RegionID],[RegionGroupID],[IsDefault])
VALUES
(0,1,0),
(1,1,0),
(2,1,0),
(3,2,0),
(4,2,0),
(5,2,0),
(6,3,0),
(7,3,0),
(8,3,0)

Now run the query (to select a single from each group, remember no rewrite suggestions!):

SELECT RXXID FROM (
   SELECT
       RXX.RegionID as RXXID,
       ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM
   FROM Regions as RXX
) AS tmp
WHERE tmp.RXXNUM = 1

You should get:

RXXID
-----------
0
3
6

Now stick that inside an update statement (with a preset to 0 and a select all after):

UPDATE Regions SET IsDefault = 0

UPDATE Regions
SET IsDefault = 1
WHERE RegionID IN (
    SELECT RXXID FROM (
       SELECT
           RXX.RegionID as RXXID,
           ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM
       FROM Regions as RXX
    ) AS tmp
    WHERE tmp.RXXNUM = 1
)


SELECT * FROM Regions
ORDER BY RegionGroupID

and get this result:

RegionID    RegionGroupID IsDefault
----------- ------------- ---------
0           1             1
1           1             1
2           1             1
3           2             1
4           2             1
5           2             1
6           3             1
7           3             1
8           3             1

zomg wtf lamaz?

While I don't claim to be a SQL guru, this seems neither proper nor correct. And to make things more crazy, if you drop the primary key it seems to work:

Drop primary key:

IF  EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[Regions]') AND name = N'PK_Regions')
ALTER TABLE [dbo].[Regions] DROP CONSTRAINT [PK_Regions]

And re-run update statement set, result:

RegionID    RegionGroupID IsDefault
----------- ------------- ---------
0           1             1
1           1             0
2           1             0
3           2             1
4           2             0
5           2             0
6           3             1
7           3             0
8           3             0

Isn't that a b?

Does anyone have any clue what is going on here? My guess is some kind of sub-query caching and is this a bug? It sure doesn't seem like what SQL should be doing?

like image 245
ckozl Avatar asked Mar 23 '12 19:03

ckozl


People also ask

Why a sub query in the SELECT or WHERE clause can slow performance?

For multiple-table subqueries, execution of NULL IN (SELECT ...) is particularly slow because the join optimizer does not optimize for the case where the outer expression is NULL .

WHERE subqueries can not be used?

Subqueries are not allowed in the defining query of a CREATE PROJECTION statement. Subqueries are supported within UPDATE statements with the following exceptions: You cannot use SET column = {expression} to specify a subquery.

What are the restriction of sub query in SQL Server?

It can't include a COMPUTE or FOR BROWSE clause, and may only include an ORDER BY clause when a TOP clause is also specified. A subquery can be nested inside the WHERE or HAVING clause of an outer SELECT , INSERT , UPDATE , or DELETE statement, or inside another subquery.


2 Answers

Just update as a CTE directly:

WITH tmp AS (
SELECT 
       RegionID as RXXID,
       RegionGroupID,
       IsDefault,
       ROW_NUMBER() OVER (PARTITION BY RegionGroupID ORDER BY RegionID) AS RXXNUM
   FROM Regions

) 
UPDATE tmp SET IsDefault = 1 WHERE RXXNUM = 1
select * from Regions

Added more columns to illustrate. You can see this on http://sqlfiddle.com/#!3/03913/9

Not 100% sure what is going on in your example, but since you partition and order by the same column, you're not really certain to get the same order back, since they are all tied. Shouldn't you order by RegionID or some other column, as i did on sqlfiddle?


Back to your question:

If you change your UPDATE (with the clustered index) to a SELECT, you'll get all 9 rows back. If you drop the PK, and do the SELECT, you only get 3 rows. Back to your update statement. Inspecting the execution plans show that they differ slightly:

First (PK) Execution planSecond (No PK) Execution plan

What you can see here is that in the first (with PK) query, you'll scan the clustered index for the outer reference, note that it does not have the alias RXX. Then for each row in the top, do a lookup to the RXX. And yes, because of your row number ordering, every RegionID can be row_number() 1 for each RegionGroupID. SQL Server would know this based on your PK, i guess, and can say that For every RegionID, this RegionID can be row number 1. Therefore the statement is rather valid.

In the second query, there is no index, and you get a table scan on Regions, then it builds a probe table using the RXX, and joins differently (single pass, ROW_NUMBER() can only be 1 for one row per regiongroupid now). This way in that scan, every RegionID has only one ROW_NUMBER(), though you cannot be 100% certain it'll be the same every time.

This means: Using your subquery which doesn't have a deterministic order for every execution, you should avoid using a multiple pass (NESTED LOOP) join type, but a single pass (MERGE OR HASH) join.

To fix this without changing the structure of your query, add OPTION (HASH JOIN) or OPTION (MERGE JOIN) to the first UPDATE:

So, you'll need the following update statement (when you have the PK):

UPDATE Regions SET IsDefault = 0

UPDATE Regions 
SET IsDefault = 1 
WHERE RegionID IN (
    SELECT RXXID FROM (
       SELECT 
           RXX.RegionID as RXXID,
           ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM
       FROM Regions as RXX
    ) AS tmp
    WHERE tmp.RXXNUM = 1 
)
OPTION (HASH JOIN)

SELECT * FROM Regions
ORDER BY RegionGroupID

Here are the execution plans using these two join types (note actual number of rows: 3 in the properties):

Using MERGE JOINUsing HASH JOIN

like image 141
cairnz Avatar answered Sep 29 '22 19:09

cairnz


Your query in plain language is something like:
For each row in Regions check if RegionID exists in some sub query. Meaning that the sub query is executed for each row in Regions. (I know that is not the case but it is the semantics of the query).

Since you are using RegionGroupID as order and partition you really have no idea what RegionID will be returned so it might very well be a new ID for each time the sub-query is checked against.

Update:

Doing the update with a join against the derived table instead instead of using in changes the semantics of the query and it changed the result as well.

This works as expected:

UPDATE R 
SET IsDefault = 1
FROM Regions as R
  inner join 
      (
        SELECT RXXID FROM (
           SELECT 
               RXX.RegionID as RXXID,
               ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM
           FROM Regions as RXX
        ) AS tmp
        WHERE tmp.RXXNUM = 1 
      ) as C
    on R.RegionID = C.RXXID
like image 20
Mikael Eriksson Avatar answered Sep 29 '22 20:09

Mikael Eriksson