 

DELETE SQL with correlated subquery for table with 42 million rows?

I have a table cats with 42,795,120 rows.

Apparently this is a lot of rows. So when I do:

/* owner_cats is a many-to-many join table */
DELETE FROM cats
WHERE cats.id_cat IN (
    SELECT owner_cats.id_cat FROM owner_cats
    WHERE owner_cats.id_owner = 1)

the query times out :(

(edit: I need to increase my CommandTimeout value, default is only 30 seconds)

I can't use TRUNCATE TABLE cats because I don't want to blow away cats from other owners.

I'm using SQL Server 2005 with "Recovery model" set to "Simple."

So, I thought about doing something like this (executing this SQL from an application btw):

DELETE TOP (25) PERCENT FROM cats
WHERE cats.id_cat IN (
    SELECT owner_cats.id_cat FROM owner_cats
    WHERE owner_cats.id_owner = 1)

DELETE TOP (50) PERCENT FROM cats
WHERE cats.id_cat IN (
    SELECT owner_cats.id_cat FROM owner_cats
    WHERE owner_cats.id_owner = 1)

DELETE FROM cats
WHERE cats.id_cat IN (
    SELECT owner_cats.id_cat FROM owner_cats
    WHERE owner_cats.id_owner = 1)

My question is: what is the threshold of the number of rows I can DELETE in SQL Server 2005?

Or, if my approach is not optimal, please suggest a better approach. Thanks.
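For what it's worth, one alternative sketch of the same idea is to delete fixed-size batches in a loop until nothing is left, so each transaction (and its hit on the log) stays small. The batch size of 10,000 here is an arbitrary guess, not a recommendation:

```sql
-- Sketch: delete in fixed-size batches until no matching rows remain.
-- 10000 is an assumed batch size; tune it for the actual system.
DECLARE @rows INT;
SET @rows = 1;
WHILE @rows > 0
BEGIN
    DELETE TOP (10000) FROM cats
    WHERE cats.id_cat IN (
        SELECT owner_cats.id_cat FROM owner_cats
        WHERE owner_cats.id_owner = 1);
    SET @rows = @@ROWCOUNT;  -- 0 once everything for this owner is gone
END
```

Under the simple recovery model this also gives the log a chance to reuse space between batches instead of growing to hold one giant delete.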

This post didn't help me enough:

  • SQL Server Efficiently dropping a group of rows with millions and millions of rows

EDIT (8/6/2010):

Okay, I just realized after reading the above link again that I did not have indexes on these tables. Also, some of you have already pointed out that issue in the comments below. Keep in mind this is a fictitious schema, so even id_cat is not a PK, because in my real life schema, it's not a unique field.

I will put indexes on:

  1. cats.id_cat
  2. owner_cats.id_cat
  3. owner_cats.id_owner

I guess I'm still getting the hang of data warehousing, and obviously I need indexes on all the JOIN fields, right?

However, this batch load process takes hours. I'm already doing it with SqlBulkCopy (in chunks, not 42 mil all at once). I have some indexes and PKs. I read the following posts, which confirm my theory that the indexes slow down even a bulk copy:

  • SqlBulkCopy slow as molasses
  • What’s the fastest way to bulk insert a lot of data in SQL Server (C# client)

So I'm going to DROP my indexes before the copy and then re-CREATE them when it's done.

Because of the long load times, it's going to take me awhile to test these suggestions. I'll report back with the results.

UPDATE (8/7/2010):

Tom suggested:

/* Note: T-SQL doesn't accept an alias directly after DELETE FROM;
   the alias goes after DELETE, as below. */
DELETE c
FROM cats c
WHERE EXISTS (SELECT 1
              FROM owner_cats o
              WHERE o.id_cat = c.id_cat
                AND o.id_owner = 1)

And still with no indexes, for 42 million rows it took 13:21 min:sec versus 22:08 the way described above. However, for 13 million rows it took 2:13 versus 2:10 with my old way. It's a neat idea, but I still need to use indexes!

Update (8/8/2010):

Something is terribly wrong! Now with the indexes on, my first delete query above took 1:09 hrs:min (yes, over an hour!) versus 22:08 min:sec for 42 mil rows, and 13:21 min:sec versus 2:10 min:sec for 13 mil rows. I'm going to try Tom's query with the indexes now, but this is heading in the wrong direction. Please help.

Update (8/9/2010):

Tom's delete took 1:06 hrs:min for 42 mil rows and 10:50 min:sec for 13 mil rows with indexes, versus 13:21 min:sec and 2:13 min:sec respectively without them. Deletes on my database take an order of magnitude longer when I use indexes! I think I know why: my database .mdf and .ldf grew from 3.5 GB to 40.6 GB during the first (42 mil) delete! What am I doing wrong?

Update (8/10/2010):

For lack of any other options, I have come up with what I feel is a lackluster solution (hopefully temporary):

  1. Increase timeout for database connection to 1 hour (CommandTimeout=60000; default was 30 sec)
  2. Use Tom's query: DELETE FROM WHERE EXISTS (SELECT 1 ...) because it performed a little faster
  3. DROP all indexes and PKs before running delete statement (???)
  4. Run DELETE statement
  5. CREATE all indexes and PKs
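Steps 3-5 might look something like the following sketch. The index name ix_cats_id_cat is made up for illustration, since the real schema isn't shown:

```sql
-- Step 3: drop indexes before the delete (hypothetical index name)
DROP INDEX ix_cats_id_cat ON cats;

-- Step 4: run the EXISTS-style delete
DELETE c
FROM cats c
WHERE EXISTS (SELECT 1 FROM owner_cats o
              WHERE o.id_cat = c.id_cat
                AND o.id_owner = 1);

-- Step 5: recreate the indexes afterwards
CREATE INDEX ix_cats_id_cat ON cats (id_cat);
```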

Seems crazy, but at least it's faster than using TRUNCATE and restarting my load from the first owner_id: one of my owner_ids takes 2:30 hrs:min to load, versus 17:22 min:sec for the delete process I just described on 42 mil rows. (Note: if my load process throws an exception, I start over for that owner_id, but I don't want to blow away previous owner_ids, so I don't want to TRUNCATE the owner_cats table. That's why I'm trying to use DELETE.)

Any more help would still be appreciated :)

JohnB asked Aug 06 '10



2 Answers

There is no practical threshold. It depends on what your command timeout is set to on your connection.

Keep in mind that the time it takes to delete all of these rows is contingent upon:

  • The time it takes to find the rows of interest
  • The time it takes to log the transaction in the transaction log
  • The time it takes to delete the index entries of interest
  • The time it takes to delete the actual rows of interest
  • The time it takes to wait for other processes to stop using the table so you can acquire what in this case will most likely be an exclusive table lock

The last point is often the most significant. Run sp_who2 in another query window to make sure there isn't lock contention preventing your command from executing.
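If it helps, that check is just:

```sql
-- Run in a second query window while the delete is stuck; the BlkBy
-- column shows which SPID (if any) is blocking your session.
EXEC sp_who2;
```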

Improperly configured SQL Servers will do poorly at this type of query. Transaction logs which are too small and/or share the same disks as the data files will often incur severe performance penalties when working with large numbers of rows.

As for a solution, well, like all things, it depends. Is this something you intend to be doing often? Depending on how many rows you have left, the fastest way might be to rebuild the table under another name, then rename it and recreate its constraints, all inside a transaction. If this is just an ad-hoc thing, make sure your ADO CommandTimeout is set high enough so you can just bear the cost of this big delete.
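A sketch of that rebuild-and-rename approach, with illustrative names (constraint recreation is elided since the real schema isn't shown):

```sql
BEGIN TRANSACTION;

-- Copy only the rows to keep into a new table.
SELECT *
INTO cats_new
FROM cats
WHERE cats.id_cat NOT IN (
    SELECT owner_cats.id_cat FROM owner_cats
    WHERE owner_cats.id_owner = 1);

DROP TABLE cats;
EXEC sp_rename 'cats_new', 'cats';

-- Recreate indexes, PKs, and other constraints on the new table here.

COMMIT TRANSACTION;
```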

Dave Markle answered Oct 04 '22


If the delete will remove "a significant number" of rows from the table, this can be an alternative to a DELETE: put the records to keep somewhere else, truncate the original table, put back the 'keepers'. Something like:

SELECT *
INTO #cats_to_keep
FROM cats
WHERE cats.id_cat NOT IN (    -- note the NOT
    SELECT owner_cats.id_cat FROM owner_cats
    WHERE owner_cats.id_owner = 1)

TRUNCATE TABLE cats

INSERT INTO cats
SELECT * FROM #cats_to_keep
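One caveat, assuming cats has an IDENTITY column (the question's schema may not): the final INSERT ... SELECT * will fail to preserve the IDs unless identity inserts are enabled, and with that setting an explicit column list is required. A sketch with hypothetical column names:

```sql
SET IDENTITY_INSERT cats ON;

-- With IDENTITY_INSERT ON, the column list must be spelled out;
-- id_cat and name are illustrative column names.
INSERT INTO cats (id_cat, name)
SELECT id_cat, name
FROM #cats_to_keep;

SET IDENTITY_INSERT cats OFF;
```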
Patrick Marchand answered Oct 04 '22