
Is there a better way to write a query for deleting records based on sub queries?

I have this query:

DELETE from MailingListTable where Md5Hash in (
   SELECT
      dbo.ListItems.Md5Hash
   FROM dbo.Lists
   INNER JOIN dbo.ListItems ON dbo.Lists.Id = dbo.ListItems.ListId
   where dbo.Lists.IsGlobal = 1
 )

The MailingListTable is built dynamically from multiple lists. I then run the above query to remove any list items that are in a global remove list.

It's not horrible on small sets, but on larger sets it can take roughly 5 to 8 minutes (based on some tests I did). I am curious whether there is a better way to write this. I don't believe I can use joins with a DELETE statement, which is why I opted for the subquery.

I also tried using EXISTS, but that was much slower. Would it be better to use common-table expressions since I am using SQL Server 2008?

DDiVita asked Sep 05 '13 17:09


1 Answer

I presume it takes a long time because (a) you're deleting millions of rows and (b) you are treating your log like a revolving door. This isn't going to magically go from 5-8 minutes to 5 seconds because you use EXISTS instead of IN, change a subquery to a CTE, or use a JOIN. Go ahead and try it; I bet it is no better:

DELETE ml 
  FROM dbo.MailingListTable AS ml
  INNER JOIN dbo.ListItems AS li
  ON ml.Md5Hash = li.Md5Hash
  INNER JOIN dbo.Lists AS l
  ON l.Id = li.ListId 
  WHERE l.IsGlobal = 1;

The problem is almost certainly the I/O involved with performing the DELETE, not the method used to identify the rows to delete. I bet a SELECT against the exact same data, with the same index structure and at any isolation level, does NOT take 5-8 minutes.
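One way to check that claim is to time the row identification on its own. This is a minimal sketch using the table and column names from the question; it only counts the rows the DELETE would target and modifies nothing:

```sql
SET STATISTICS TIME ON;  -- report CPU and elapsed time with the query messages

-- Count the rows the DELETE would remove, without deleting anything.
SELECT COUNT(*)
  FROM dbo.MailingListTable AS ml
  WHERE EXISTS
  (
    SELECT 1
      FROM dbo.ListItems AS li
      INNER JOIN dbo.Lists AS l
      ON l.Id = li.ListId
      WHERE l.IsGlobal = 1
        AND li.Md5Hash = ml.Md5Hash
  );

SET STATISTICS TIME OFF;
```

If this comes back in seconds while the DELETE takes minutes, the time is going into the write and log activity rather than into finding the rows.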

So, how to fix?

First, make sure that your log is tuned to handle transactions of that size.

  • Pre-size the log so that it never has to grow during such an operation, perhaps to double the largest size you've seen it reach. The exact ideal size is not something someone on Stack Overflow is going to be able to tell you.

  • Make sure auto-growth is not set to silly defaults like 10% or 1MB. Autogrow should be a fallback but, when you need it, it should happen exactly once, not multiple times to cover any specific activity. So make sure it is a fixed size (making the size + duration predictable) and that the size is reasonable (so that it only happens once). What is reasonable? No idea - too many "it depends."

  • Disable any jobs that shrink the log - permanently. Deal with an out-of-control log on a case-by-case basis instead of "preventing" log growth by repeatedly shrinking the log file.
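As a rough sketch of the first two points, the log can be inspected and then given a fixed size and a fixed growth increment. The database name, logical file name, and both sizes below are hypothetical placeholders, not recommendations:

```sql
-- Hypothetical names and sizes: replace YourDatabase, YourDatabase_log,
-- 8GB and 1GB with values that suit your system.
USE YourDatabase;

-- Show the log file's logical name, current size, and growth setting.
-- size is stored in 8 KB pages, so size * 8 / 1024 gives megabytes.
SELECT name, size * 8 / 1024 AS size_mb, growth, is_percent_growth
  FROM sys.database_files
  WHERE type_desc = 'LOG';

-- Pre-size the log (the new size must be larger than the current size).
ALTER DATABASE YourDatabase
  MODIFY FILE (NAME = N'YourDatabase_log', SIZE = 8GB);

-- Use a fixed growth increment instead of a percentage.
ALTER DATABASE YourDatabase
  MODIFY FILE (NAME = N'YourDatabase_log', FILEGROWTH = 1GB);
```

The two properties are changed in separate statements here to keep each change easy to review and roll back.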

Next, consider changing your query to batch those deletes into chunks. You can play around with the TOP (?) parameter based on how many rows lead to what kind of duration of transaction (there is no magic formula for this, even if we did have a lot more information).

CREATE TABLE #x
(
  Md5Hash SOME_DATA_TYPE_I_DO_NOT_KNOW PRIMARY KEY
);

INSERT #x SELECT DISTINCT li.Md5Hash
  FROM dbo.ListItems AS li
  INNER JOIN dbo.Lists AS l
  ON l.Id = li.ListId 
  WHERE l.IsGlobal = 1;

-- @p holds the hashes for the current chunk; use the same type as #x.Md5Hash
DECLARE @p TABLE(Md5Hash SOME_DATA_TYPE_I_DO_NOT_KNOW PRIMARY KEY);

DECLARE @rc INT = 1;

WHILE @rc > 0
BEGIN
  DELETE @p;

  DELETE TOP (?)  
    OUTPUT deleted.Md5Hash INTO @p
    FROM #x;

  SET @rc = @@ROWCOUNT;

  BEGIN TRANSACTION;    

    DELETE ml FROM dbo.MailingListTable AS ml
    WHERE EXISTS (SELECT 1 FROM @p WHERE Md5Hash = ml.Md5Hash);

  COMMIT TRANSACTION;
  -- to minimize log impact you may want to CHECKPOINT
  -- or back up the log here, every loop or every N loops
END

This may extend the total amount of time that the operation takes (especially if you back up the log or checkpoint on each loop, or add an artificial delay using WAITFOR, or both), but it should allow other transactions to sneak in between chunks, so they wait on shorter transactions instead of on the whole process. Also, because each transaction has less impact on the log, the whole thing may actually end up finishing a lot faster. But I have to assume that the problem isn't that it takes 5-8 minutes, it's probably that it takes 5-8 minutes and blocks. This should alleviate that considerably (and if it does, why do you care how long it takes?).
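The pause and the every-N-loops checkpoint mentioned above could look something like this. It is only a skeleton under assumed values: the @loop counter, the 20-iteration limit, the every-5-loops interval, and the one-second delay are all illustrative, and the chunked DELETE itself is left as a comment:

```sql
DECLARE @loop INT = 0;

WHILE @loop < 20               -- stand-in for the WHILE @rc > 0 condition
BEGIN
  SET @loop += 1;

  -- ... one chunked DELETE inside its own transaction goes here ...

  IF @loop % 5 = 0
    CHECKPOINT;                -- helps under SIMPLE recovery; under FULL, back up the log instead

  WAITFOR DELAY '00:00:01';    -- one-second pause so waiting sessions can get through
END
```

Running DBCC SQLPERF(LOGSPACE) before and after a test run is one simple way to see how much log each configuration actually uses.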

I wrote a lot more about this technique here.

Aaron Bertrand answered Nov 02 '22 06:11