Before anything, I am not looking for a re-write. This was presented to me, and I can't seem to figure out if this is a bug in general or some kind of syntactic craziness that occurs due to the peculiarity of the script. Okay with that said on with the setup: <ul> <li>Microsoft SQL Server Standard Edition (64-bit)</li> <li>Version 10.50.2500.0</li> </ul> On a table located in a generic database, defined as: <pre class="prettyprint"><code>CREATE TABLE [dbo].[Regions]( [RegionID] [int] NOT NULL, [RegionGroupID] [int] NOT NULL, [IsDefault] [bit] NOT NULL, CONSTRAINT [PK_Regions] PRIMARY KEY CLUSTERED ( [RegionID] ASC )WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY] ) ON [PRIMARY] </code></pre> insert some values: <pre class="prettyprint"><code>INSERT INTO [dbo].[Regions] ([RegionID],[RegionGroupID],[IsDefault]) VALUES (0,1,0), (1,1,0), (2,1,0), (3,2,0), (4,2,0), (5,2,0), (6,3,0), (7,3,0), (8,3,0) </code></pre> Now run the query (to select a single from each group, remember no rewrite suggestions!): <pre class="prettyprint"><code>SELECT RXXID FROM ( SELECT RXX.RegionID as RXXID, ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM FROM Regions as RXX ) AS tmp WHERE tmp.RXXNUM = 1 </code></pre> You should get: <pre class="prettyprint"><code>RXXID ----------- 0 3 6 </code></pre> Now stick that inside an update statement (with a preset to 0 and a select all after): <pre class="prettyprint"><code>UPDATE Regions SET IsDefault = 0 UPDATE Regions SET IsDefault = 1 WHERE RegionID IN ( SELECT RXXID FROM ( SELECT RXX.RegionID as RXXID, ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM FROM Regions as RXX ) AS tmp WHERE tmp.RXXNUM = 1 ) SELECT * FROM Regions ORDER BY RegionGroupID </code></pre> and get this result: <pre class="prettyprint"><code>RegionID RegionGroupID IsDefault ----------- ------------- --------- 0 1 1 1 1 1 2 1 1 3 2 1 4 2 1 5 2 1 6 3 1 7 3 1 8 3 1 </code></pre> zomg wtf lamaz? While I don't claim to be a SQL guru, this seems neither proper nor correct. And to make things more crazy, if you drop the primary key it seems to work: Drop primary key: <pre class="prettyprint"><code>IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[Regions]') AND name = N'PK_Regions') ALTER TABLE [dbo].[Regions] DROP CONSTRAINT [PK_Regions] </code></pre> And re-run update statement set, result: <pre class="prettyprint"><code>RegionID RegionGroupID IsDefault ----------- ------------- --------- 0 1 1 1 1 0 2 1 0 3 2 1 4 2 0 5 2 0 6 3 1 7 3 0 8 3 0 </code></pre> Isn't that a b? Does anyone have any clue what is going on here? My guess is some kind of sub-query caching and is this a bug? It sure doesn't seem like what SQL should be doing?

Your query in plain language is something like: For each row in <code>Regions</code> check if <code>RegionID</code> exists in some sub query. Meaning that the sub query is executed for each row in <code>Regions</code>. (I know that is not the case but it is the semantics of the query). Since you are using <code>RegionGroupID</code> as order and partition you really have no idea what <code>RegionID</code> will be returned so it might very well be a new ID for each time the sub-query is checked against. Update: Doing the update with a join against the derived table instead instead of using in changes the semantics of the query and it changed the result as well. This works as expected: <pre class="prettyprint"><code>UPDATE R SET IsDefault = 1 FROM Regions as R inner join ( SELECT RXXID FROM ( SELECT RXX.RegionID as RXXID, ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM FROM Regions as RXX ) AS tmp WHERE tmp.RXXNUM = 1 ) as C on R.RegionID = C.RXXID </code></pre>

Why doesn't this sub-query seem to work?

Tags:

tsql

sql-server-2008

Before anything, I am not looking for a re-write. This was presented to me, and I can't seem to figure out if this is a bug in general or some kind of syntactic craziness that occurs due to the peculiarity of the script. Okay with that said on with the setup:

Microsoft SQL Server Standard Edition (64-bit)
Version 10.50.2500.0

On a table located in a generic database, defined as:

CREATE TABLE [dbo].[Regions](
    [RegionID] [int] NOT NULL,
    [RegionGroupID] [int] NOT NULL,
    [IsDefault] [bit] NOT NULL,
 CONSTRAINT [PK_Regions] PRIMARY KEY CLUSTERED
(
    [RegionID] ASC
)WITH (PAD_INDEX  = OFF, STATISTICS_NORECOMPUTE  = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS  = ON, ALLOW_PAGE_LOCKS  = ON) ON [PRIMARY]
) ON [PRIMARY]

insert some values:

INSERT INTO [dbo].[Regions]
([RegionID],[RegionGroupID],[IsDefault])
VALUES
(0,1,0),
(1,1,0),
(2,1,0),
(3,2,0),
(4,2,0),
(5,2,0),
(6,3,0),
(7,3,0),
(8,3,0)

Now run the query (to select a single from each group, remember no rewrite suggestions!):

SELECT RXXID FROM (
   SELECT
       RXX.RegionID as RXXID,
       ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM
   FROM Regions as RXX
) AS tmp
WHERE tmp.RXXNUM = 1

You should get:

RXXID
-----------
0
3
6

Now stick that inside an update statement (with a preset to 0 and a select all after):

UPDATE Regions SET IsDefault = 0

UPDATE Regions
SET IsDefault = 1
WHERE RegionID IN (
    SELECT RXXID FROM (
       SELECT
           RXX.RegionID as RXXID,
           ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM
       FROM Regions as RXX
    ) AS tmp
    WHERE tmp.RXXNUM = 1
)


SELECT * FROM Regions
ORDER BY RegionGroupID

and get this result:

RegionID    RegionGroupID IsDefault
----------- ------------- ---------
0           1             1
1           1             1
2           1             1
3           2             1
4           2             1
5           2             1
6           3             1
7           3             1
8           3             1

zomg wtf lamaz?

While I don't claim to be a SQL guru, this seems neither proper nor correct. And to make things more crazy, if you drop the primary key it seems to work:

Drop primary key:

IF  EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[Regions]') AND name = N'PK_Regions')
ALTER TABLE [dbo].[Regions] DROP CONSTRAINT [PK_Regions]

And re-run update statement set, result:

RegionID    RegionGroupID IsDefault
----------- ------------- ---------
0           1             1
1           1             0
2           1             0
3           2             1
4           2             0
5           2             0
6           3             1
7           3             0
8           3             0

Isn't that a b?

Does anyone have any clue what is going on here? My guess is some kind of sub-query caching and is this a bug? It sure doesn't seem like what SQL should be doing?

245

asked Mar 23 '12 19:03

ckozl

2 Answers

Just update as a CTE directly:

WITH tmp AS (
SELECT 
       RegionID as RXXID,
       RegionGroupID,
       IsDefault,
       ROW_NUMBER() OVER (PARTITION BY RegionGroupID ORDER BY RegionID) AS RXXNUM
   FROM Regions

) 
UPDATE tmp SET IsDefault = 1 WHERE RXXNUM = 1
select * from Regions

Added more columns to illustrate. You can see this on http://sqlfiddle.com/#!3/03913/9

Not 100% sure what is going on in your example, but since you partition and order by the same column, you're not really certain to get the same order back, since they are all tied. Shouldn't you order by RegionID or some other column, as i did on sqlfiddle?

Back to your question:

If you change your UPDATE (with the clustered index) to a SELECT, you'll get all 9 rows back. If you drop the PK, and do the SELECT, you only get 3 rows. Back to your update statement. Inspecting the execution plans show that they differ slightly:

First (PK) Execution plan Second (No PK) Execution plan

What you can see here is that in the first (with PK) query, you'll scan the clustered index for the outer reference, note that it does not have the alias RXX. Then for each row in the top, do a lookup to the RXX. And yes, because of your row number ordering, every RegionID can be row_number() 1 for each RegionGroupID. SQL Server would know this based on your PK, i guess, and can say that For every RegionID, this RegionID can be row number 1. Therefore the statement is rather valid.

In the second query, there is no index, and you get a table scan on Regions, then it builds a probe table using the RXX, and joins differently (single pass, ROW_NUMBER() can only be 1 for one row per regiongroupid now). This way in that scan, every RegionID has only one ROW_NUMBER(), though you cannot be 100% certain it'll be the same every time.

This means: Using your subquery which doesn't have a deterministic order for every execution, you should avoid using a multiple pass (NESTED LOOP) join type, but a single pass (MERGE OR HASH) join.

To fix this without changing the structure of your query, add OPTION (HASH JOIN) or OPTION (MERGE JOIN) to the first UPDATE:

So, you'll need the following update statement (when you have the PK):

UPDATE Regions SET IsDefault = 0

UPDATE Regions 
SET IsDefault = 1 
WHERE RegionID IN (
    SELECT RXXID FROM (
       SELECT 
           RXX.RegionID as RXXID,
           ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM
       FROM Regions as RXX
    ) AS tmp
    WHERE tmp.RXXNUM = 1 
)
OPTION (HASH JOIN)

SELECT * FROM Regions
ORDER BY RegionGroupID

Here are the execution plans using these two join types (note actual number of rows: 3 in the properties):

Using MERGE JOIN Using HASH JOIN

141

answered Sep 29 '22 19:09

cairnz

Your query in plain language is something like:
For each row in Regions check if RegionID exists in some sub query. Meaning that the sub query is executed for each row in Regions. (I know that is not the case but it is the semantics of the query).

Since you are using RegionGroupID as order and partition you really have no idea what RegionID will be returned so it might very well be a new ID for each time the sub-query is checked against.

Update:

Doing the update with a join against the derived table instead instead of using in changes the semantics of the query and it changed the result as well.

This works as expected:

UPDATE R 
SET IsDefault = 1
FROM Regions as R
  inner join 
      (
        SELECT RXXID FROM (
           SELECT 
               RXX.RegionID as RXXID,
               ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM
           FROM Regions as RXX
        ) AS tmp
        WHERE tmp.RXXNUM = 1 
      ) as C
    on R.RegionID = C.RXXID

answered Sep 29 '22 20:09

Mikael Eriksson

Related questions
                            
                                Invalid object name 'dbo.CategoryIdArray'
                            
                                How to Group time segments and check break time
                            
                                How to "iterate" through a SQL result in stored procedure but avoid using a cursor?
                            
                                LEFT JOIN gives different data set depending on the position of WHERE condition
                            
                                How to drop a table in SQL Server 2008 only if exists
                            
                                Best Practices for storing large amounts of XML type data in SQL Server
                            
                                Cannot create a row of size 8074 which is greater than the allowable maximum row size of 8060
                            
                                How is timezone handled in the lifecycle of an ADO.NET + SQL Server DateTime column?
                            
                                Disable Primary Key and Re-Enable After SQL Bulk Insert
                            
                                How to use the merge command
                            
                                SQL Server 2008 Update Query with Join and Where clause in joined table
                            
                                SQL Server 2008: should I be using Windows auth or SQL Server auth?
                            
                                Conversion failed when converting date and/or time from character string
                            
                                update random numbers for top 100 rows in sql?
                            
                                How to compare varbinary in where clause in SQL Server
                            
                                Can't create sequence in SQL Server 2008
                            
                                Sql progressive sum
                            
                                Cannot connect to SQLServer database in Java application
                            
                                Get identity column values upon INSERT
                            
                                sql query to produce xml output

Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!

Donate Us With