I've been analyzing a recurring "bug report" (perf issue) in one of our systems related to a particularly slow delete operation. Long story short: It seems that the CASCADE DELETE
keys were largely responsible, and I'd like to know (a) if this makes sense, and (b) why it's the case.
We have a schema of, let's say, widgets, those being at the root of a large graph of related tables and related-to-related tables and so on. To be perfectly clear, deleting from this table is actively discouraged; it is the "nuclear option" and users are under no illusions to the contrary. Nevertheless, it sometimes just has to be done.
The schema looks something like this:
Widgets
|
+--- Anvils [1:1]
| |
| +--- AnvilTestData [1:N]
|
+--- WidgetHistory (1:N)
|
+--- WidgetHistoryDetails (1:N)
Column definitions look like the following:
Widgets (WidgetID int PK, WidgetName varchar(50))
Anvils (AnvilID int PK, WidgetID int FK/IX/UNIQUE, ...)
AnvilTestData (AnvilID int FK/IX, TestID int, ...Test Data...)
WidgetHistory (HistoryID int PK, WidgetID int FK/IX, HistoryDate datetime, ...)
WidgetHistoryDetails (HistoryID int FK/IX, DetailType smallint, ...)
Nothing too scary, really. A Widget
can be different types, an Anvil
is a special type, so that relationship is 1:1 (or more accurately 1:0..1). Then there's a large amount of data - perhaps thousands of rows of AnvilTestData
per Anvil
collected over time, dealing with hardness, corrosion, exact weight, hammer compatibility, usability issues, and impact tests with cartoon heads.
Then every Widget
has a long, boring history of various types of transactions - production, inventory moves, sales, defect investigations, RMAs, repairs, customer complaints, etc. There might be 10-20k details for a single widget, or none at all, depending on its age.
So, unsurprisingly, there's a CASCADE DELETE
relationship at every level here. If a Widget
needs to be deleted, it means something's gone terribly wrong and we need to erase any records of that widget ever existing, including its history, test data, etc. Again, nuclear option.
Relations are all indexed, statistics are up to date. Normal queries are fast. The system tends to hum along pretty smoothly for everything except deletes.
Getting to the point here, finally, for various reasons we only allow deleting one widget at a time, so a delete statement would look like this:
DELETE FROM Widgets
WHERE WidgetID = @WidgetID
Pretty simple, innocuous looking delete... that takes over 2 minutes to run, for a widget with no data!
After slogging through execution plans I was finally able to pick out the AnvilTestData
and WidgetHistoryDetails
deletes as the sub-operations with the highest cost. So I experimented with turning off the CASCADE
(but keeping the actual FK, just setting it to NO ACTION
) and rewriting the script as something very much like the following:
DECLARE @AnvilID int
SELECT @AnvilID = AnvilID FROM Anvils WHERE WidgetID = @WidgetID
DELETE FROM AnvilTestData
WHERE AnvilID = @AnvilID
DELETE FROM WidgetHistory
WHERE HistoryID IN (
SELECT HistoryID
FROM WidgetHistory
WHERE WidgetID = @WidgetID)
DELETE FROM Widgets WHERE WidgetID = @WidgetID
Both of these "optimizations" resulted in significant speedups, each one shaving nearly a full minute off the execution time, so that the original 2-minute deletion now takes about 5-10 seconds - at least for new widgets, without much history or test data.
Just to be absolutely clear, there is still a CASCADE
from WidgetHistory
to WidgetHistoryDetails
, where the fanout is highest, I only removed the one originating from Widgets
.
Further "flattening" of the cascade relationships resulted in progressively less dramatic but still noticeable speedups, to the point where deleting a new widget was almost instantaneous once all of the cascade deletes to larger tables were removed and replaced with explicit deletes.
I'm using DBCC DROPCLEANBUFFERS
and DBCC FREEPROCCACHE
before each test. I've disabled all triggers that might be causing further slowdowns (although those would show up in the execution plan anyway). And I'm testing against older widgets, too, and noticing a significant speedup there as well; deletes that used to take 5 minutes now take 20-40 seconds.
Now I'm an ardent supporter of the "SELECT ain't broken" philosophy, but there just doesn't seem to be any logical explanation for this behaviour other than crushing, mind-boggling inefficiency of the CASCADE DELETE
relationships.
So, my questions are:
Is this a known issue with DRI in SQL Server? (I couldn't seem to find any references to this sort of thing on Google or here in SO; I suspect the answer is no.)
If not, is there another explanation for the behaviour I'm seeing?
If it is a known issue, why is it an issue, and are there better workarounds I could be using?
SQL Server
is best at set-based operations, while CASCADE
deletions are, by their nature, record-based.
SQL Server
, unlike the other servers, tries to optimize the immediate set-based operations, however, it works only one level deep. It needs to have the records deleted in the upper-level tables to delete those in the lower-level tables.
In other words, cascading operations work up-down, while your solution works down-up, which is more set-based and efficient.
Here's a sample schema:
CREATE TABLE t_g (id INT NOT NULL PRIMARY KEY)
CREATE TABLE t_p (id INT NOT NULL PRIMARY KEY, g INT NOT NULL, CONSTRAINT fk_p_g FOREIGN KEY (g) REFERENCES t_g ON DELETE CASCADE)
CREATE TABLE t_c (id INT NOT NULL PRIMARY KEY, p INT NOT NULL, CONSTRAINT fk_c_p FOREIGN KEY (p) REFERENCES t_p ON DELETE CASCADE)
CREATE INDEX ix_p_g ON t_p (g)
CREATE INDEX ix_c_p ON t_c (p)
, this query:
DELETE
FROM t_g
WHERE id > 50000
and its plan:
|--Sequence
|--Table Spool
| |--Clustered Index Delete(OBJECT:([test].[dbo].[t_g].[PK__t_g__176E4C6B]), WHERE:([test].[dbo].[t_g].[id] > (50000)))
|--Index Delete(OBJECT:([test].[dbo].[t_p].[ix_p_g]) WITH ORDERED PREFETCH)
| |--Sort(ORDER BY:([test].[dbo].[t_p].[g] ASC, [test].[dbo].[t_p].[id] ASC))
| |--Table Spool
| |--Clustered Index Delete(OBJECT:([test].[dbo].[t_p].[PK__t_p__195694DD]) WITH ORDERED PREFETCH)
| |--Sort(ORDER BY:([test].[dbo].[t_p].[id] ASC))
| |--Merge Join(Inner Join, MERGE:([test].[dbo].[t_g].[id])=([test].[dbo].[t_p].[g]), RESIDUAL:([test].[dbo].[t_p].[g]=[test].[dbo].[t_g].[id]))
| |--Table Spool
| |--Index Scan(OBJECT:([test].[dbo].[t_p].[ix_p_g]), ORDERED FORWARD)
|--Index Delete(OBJECT:([test].[dbo].[t_c].[ix_c_p]) WITH ORDERED PREFETCH)
|--Sort(ORDER BY:([test].[dbo].[t_c].[p] ASC, [test].[dbo].[t_c].[id] ASC))
|--Clustered Index Delete(OBJECT:([test].[dbo].[t_c].[PK__t_c__1C330188]) WITH ORDERED PREFETCH)
|--Table Spool
|--Sort(ORDER BY:([test].[dbo].[t_c].[id] ASC))
|--Hash Match(Inner Join, HASH:([test].[dbo].[t_p].[id])=([test].[dbo].[t_c].[p]))
|--Table Spool
|--Index Scan(OBJECT:([test].[dbo].[t_c].[ix_c_p]), ORDERED FORWARD)
First, SQL Server
deletes records from t_g
, then joins the records deleted with t_p
and deletes from the latter, finally, joins records deleted from t_p
with t_c
and deletes from t_c
.
A single three-table join would be much more efficient in this case, and this is what you do with your workaround.
If it makes you feel better, Oracle
does not optimize cascade operations in any way: they are always NESTED LOOPS
and God help you if your forgot to create an index on the referencing column.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With