Before anything, I am not looking for a re-write. This was presented to me, and I can't seem to figure out if this is a bug in general or some kind of syntactic craziness that occurs due to the peculiarity of the script. Okay with that said on with the setup:
Microsoft SQL Server Standard Edition (64-bit)
Version 10.50.2500.0
On a table located in a generic database, defined as:
CREATE TABLE [dbo].[Regions](
[RegionID] [int] NOT NULL,
[RegionGroupID] [int] NOT NULL,
[IsDefault] [bit] NOT NULL,
CONSTRAINT [PK_Regions] PRIMARY KEY CLUSTERED
(
[RegionID] ASC
)WITH (PAD_INDEX = OFF, STATISTICS_NORECOMPUTE = OFF, IGNORE_DUP_KEY = OFF, ALLOW_ROW_LOCKS = ON, ALLOW_PAGE_LOCKS = ON) ON [PRIMARY]
) ON [PRIMARY]
insert some values:
INSERT INTO [dbo].[Regions]
([RegionID],[RegionGroupID],[IsDefault])
VALUES
(0,1,0),
(1,1,0),
(2,1,0),
(3,2,0),
(4,2,0),
(5,2,0),
(6,3,0),
(7,3,0),
(8,3,0)
Now run the query (to select a single from each group, remember no rewrite suggestions!):
SELECT RXXID FROM (
SELECT
RXX.RegionID as RXXID,
ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM
FROM Regions as RXX
) AS tmp
WHERE tmp.RXXNUM = 1
You should get:
RXXID
-----------
0
3
6
Now stick that inside an update statement (with a preset to 0 and a select all after):
UPDATE Regions SET IsDefault = 0
UPDATE Regions
SET IsDefault = 1
WHERE RegionID IN (
SELECT RXXID FROM (
SELECT
RXX.RegionID as RXXID,
ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM
FROM Regions as RXX
) AS tmp
WHERE tmp.RXXNUM = 1
)
SELECT * FROM Regions
ORDER BY RegionGroupID
and get this result:
RegionID RegionGroupID IsDefault
----------- ------------- ---------
0 1 1
1 1 1
2 1 1
3 2 1
4 2 1
5 2 1
6 3 1
7 3 1
8 3 1
zomg wtf lamaz?
While I don't claim to be a SQL guru, this seems neither proper nor correct. And to make things more crazy, if you drop the primary key it seems to work:
Drop primary key:
IF EXISTS (SELECT * FROM sys.indexes WHERE object_id = OBJECT_ID(N'[dbo].[Regions]') AND name = N'PK_Regions')
ALTER TABLE [dbo].[Regions] DROP CONSTRAINT [PK_Regions]
And re-run update statement set, result:
RegionID RegionGroupID IsDefault
----------- ------------- ---------
0 1 1
1 1 0
2 1 0
3 2 1
4 2 0
5 2 0
6 3 1
7 3 0
8 3 0
Isn't that a b?
Does anyone have any clue what is going on here? My guess is some kind of sub-query caching and is this a bug? It sure doesn't seem like what SQL should be doing?
For multiple-table subqueries, execution of NULL IN (SELECT ...) is particularly slow because the join optimizer does not optimize for the case where the outer expression is NULL .
Subqueries are not allowed in the defining query of a CREATE PROJECTION statement. Subqueries are supported within UPDATE statements with the following exceptions: You cannot use SET column = {expression} to specify a subquery.
It can't include a COMPUTE or FOR BROWSE clause, and may only include an ORDER BY clause when a TOP clause is also specified. A subquery can be nested inside the WHERE or HAVING clause of an outer SELECT , INSERT , UPDATE , or DELETE statement, or inside another subquery.
Just update as a CTE directly:
WITH tmp AS (
SELECT
RegionID as RXXID,
RegionGroupID,
IsDefault,
ROW_NUMBER() OVER (PARTITION BY RegionGroupID ORDER BY RegionID) AS RXXNUM
FROM Regions
)
UPDATE tmp SET IsDefault = 1 WHERE RXXNUM = 1
select * from Regions
Added more columns to illustrate. You can see this on http://sqlfiddle.com/#!3/03913/9
Not 100% sure what is going on in your example, but since you partition and order by the same column, you're not really certain to get the same order back, since they are all tied. Shouldn't you order by RegionID or some other column, as i did on sqlfiddle?
Back to your question:
If you change your UPDATE (with the clustered index) to a SELECT, you'll get all 9 rows back. If you drop the PK, and do the SELECT, you only get 3 rows. Back to your update statement. Inspecting the execution plans show that they differ slightly:
What you can see here is that in the first (with PK) query, you'll scan the clustered index for the outer reference, note that it does not have the alias RXX. Then for each row in the top, do a lookup to the RXX. And yes, because of your row number ordering, every RegionID can be row_number() 1 for each RegionGroupID. SQL Server would know this based on your PK, i guess, and can say that For every RegionID, this RegionID can be row number 1. Therefore the statement is rather valid.
In the second query, there is no index, and you get a table scan on Regions, then it builds a probe table using the RXX, and joins differently (single pass, ROW_NUMBER() can only be 1 for one row per regiongroupid now). This way in that scan, every RegionID has only one ROW_NUMBER(), though you cannot be 100% certain it'll be the same every time.
This means: Using your subquery which doesn't have a deterministic order for every execution, you should avoid using a multiple pass (NESTED LOOP) join type, but a single pass (MERGE OR HASH) join.
To fix this without changing the structure of your query, add OPTION (HASH JOIN) or OPTION (MERGE JOIN) to the first UPDATE:
So, you'll need the following update statement (when you have the PK):
UPDATE Regions SET IsDefault = 0
UPDATE Regions
SET IsDefault = 1
WHERE RegionID IN (
SELECT RXXID FROM (
SELECT
RXX.RegionID as RXXID,
ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM
FROM Regions as RXX
) AS tmp
WHERE tmp.RXXNUM = 1
)
OPTION (HASH JOIN)
SELECT * FROM Regions
ORDER BY RegionGroupID
Here are the execution plans using these two join types (note actual number of rows: 3 in the properties):
Your query in plain language is something like:
For each row in Regions
check if RegionID
exists in some sub query. Meaning that the sub query is executed for each row in Regions
. (I know that is not the case but it is the semantics of the query).
Since you are using RegionGroupID
as order and partition you really have no idea what RegionID
will be returned so it might very well be a new ID for each time the sub-query is checked against.
Update:
Doing the update with a join against the derived table instead instead of using in changes the semantics of the query and it changed the result as well.
This works as expected:
UPDATE R
SET IsDefault = 1
FROM Regions as R
inner join
(
SELECT RXXID FROM (
SELECT
RXX.RegionID as RXXID,
ROW_NUMBER() OVER (PARTITION BY RXX.RegionGroupID ORDER BY RXX.RegionGroupID) AS RXXNUM
FROM Regions as RXX
) AS tmp
WHERE tmp.RXXNUM = 1
) as C
on R.RegionID = C.RXXID
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With