Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Faster way to delete matching rows?

I'm a relative novice when it comes to databases. We are using MySQL and I'm currently trying to speed up a SQL statement that seems to take a while to run. I looked around on SO for a similar question but didn't find one.

The goal is to remove all the rows in table A that have a matching id in table B.

I'm currently doing the following:

DELETE FROM a WHERE EXISTS (SELECT b.id FROM b WHERE b.id = a.id); 

There are approximately 100K rows in table a and about 22K rows in table b. The column 'id' is the PK for both tables.

This statement takes about 3 minutes to run on my test box - Pentium D, XP SP3, 2GB ram, MySQL 5.0.67. This seems slow to me. Maybe it isn't, but I was hoping to speed things up. Is there a better/faster way to accomplish this?


EDIT:

Some additional information that might be helpful. Tables A and B have the same structure as I've done the following to create table B:

CREATE TABLE b LIKE a; 

Table a (and thus table b) has a few indexes to help speed up queries that are made against it. Again, I'm a relative novice at DB work and still learning. I don't know how much of an effect, if any, this has on things. I assume that it does have an effect as the indexes have to be cleaned up too, right? I was also wondering if there were any other DB settings that might affect the speed.

Also, I'm using INNO DB.


Here is some additional info that might be helpful to you.

Table A has a structure similar to this (I've sanitized this a bit):

DROP TABLE IF EXISTS `frobozz`.`a`; CREATE TABLE  `frobozz`.`a` (   `id` bigint(20) unsigned NOT NULL auto_increment,   `fk_g` varchar(30) NOT NULL,   `h` int(10) unsigned default NULL,   `i` longtext,   `j` bigint(20) NOT NULL,   `k` bigint(20) default NULL,   `l` varchar(45) NOT NULL,   `m` int(10) unsigned default NULL,   `n` varchar(20) default NULL,   `o` bigint(20) NOT NULL,   `p` tinyint(1) NOT NULL,   PRIMARY KEY  USING BTREE (`id`),   KEY `idx_l` (`l`),   KEY `idx_h` USING BTREE (`h`),   KEY `idx_m` USING BTREE (`m`),   KEY `idx_fk_g` USING BTREE (`fk_g`),   KEY `fk_g_frobozz` (`id`,`fk_g`),   CONSTRAINT `fk_g_frobozz` FOREIGN KEY (`fk_g`) REFERENCES `frotz` (`g`) ) ENGINE=InnoDB AUTO_INCREMENT=179369 DEFAULT CHARSET=utf8 ROW_FORMAT=DYNAMIC; 

I suspect that part of the issue is there are a number of indexes for this table. Table B looks similar to table B, though it only contains the columns id and h.

Also, the profiling results are as follows:

starting 0.000018 checking query cache for query 0.000044 checking permissions 0.000005 Opening tables 0.000009 init 0.000019 optimizing 0.000004 executing 0.000043 end 0.000005 end 0.000002 query end 0.000003 freeing items 0.000007 logging slow query 0.000002 cleaning up 0.000002 

SOLVED

Thanks to all the responses and comments. They certainly got me to think about the problem. Kudos to dotjoe for getting me to step away from the problem by asking the simple question "Do any other tables reference a.id?"

The problem was that there was a DELETE TRIGGER on table A which called a stored procedure to update two other tables, C and D. Table C had a FK back to a.id and after doing some stuff related to that id in the stored procedure, it had the statement,

DELETE FROM c WHERE c.id = theId; 

I looked into the EXPLAIN statement and rewrote this as,

EXPLAIN SELECT * FROM c WHERE c.other_id = 12345; 

So, I could see what this was doing and it gave me the following info:

id            1 select_type   SIMPLE table         c type          ALL possible_keys NULL key           NULL key_len       NULL ref           NULL rows          2633 Extra         using where 

This told me that it was a painful operation to make and since it was going to get called 22500 times (for the given set of data being deleted), that was the problem. Once I created an INDEX on that other_id column and reran the EXPLAIN, I got:

id            1 select_type   SIMPLE table         c type          ref possible_keys Index_1 key           Index_1 key_len       8 ref           const rows          1 Extra          

Much better, in fact really great.

I added that Index_1 and my delete times are in line with the times reported by mattkemp. This was a really subtle error on my part due to shoe-horning some additional functionality at the last minute. It turned out that most of the suggested alternative DELETE/SELECT statements, as Daniel stated, ended up taking essentially the same amount of time and as soulmerge mentioned, the statement was pretty much the best I was going to be able to construct based on what I needed to do. Once I provided an index for this other table C, my DELETEs were fast.

Postmortem:
Two lessons learned came out of this exercise. First, it is clear that I didn't leverage the power of the EXPLAIN statement to get a better idea of the impact of my SQL queries. That's a rookie mistake, so I'm not going to beat myself up about that one. I'll learn from that mistake. Second, the offending code was the result of a 'get it done quick' mentality and inadequate design/testing led to this problem not showing up sooner. Had I generated several sizable test data sets to use as test input for this new functionality, I'd have not wasted my time nor yours. My testing on the DB side was lacking the depth that my application side has in place. Now I've got the opportunity to improve that.

Reference: EXPLAIN Statement

like image 353
itsmatt Avatar asked May 01 '09 18:05

itsmatt


People also ask

How do I get rid of match rows in Excel?

Select the range of cells that has duplicate values you want to remove. Tip: Remove any outlines or subtotals from your data before trying to remove duplicates. Click Data > Remove Duplicates, and then Under Columns, check or uncheck the columns where you want to remove the duplicates.

What is the fastest way to permanently delete a 1 million row table named customers?

If you are in a hurry and need the fastest way possible: take the database out of production. drop all non-clustered indexes and triggers. delete the records (or if the majority of records is bad, copy+drop+rename the table)

How do I automatically delete duplicates in Excel?

➤ Go to Data > Remove Duplicates tool in Excel Toolbar under the section Data Tools. Step 2: ➤ Click on Remove Duplicates. ➤ Put a check on all names of the columns you want to remove duplicates from.


1 Answers

Deleting data from InnoDB is the most expensive operation you can request of it. As you already discovered the query itself is not the problem - most of them will be optimized to the same execution plan anyway.

While it may be hard to understand why DELETEs of all cases are the slowest, there is a rather simple explanation. InnoDB is a transactional storage engine. That means that if your query was aborted halfway-through, all records would still be in place as if nothing happened. Once it is complete, all will be gone in the same instant. During the DELETE other clients connecting to the server will see the records until your DELETE completes.

To achieve this, InnoDB uses a technique called MVCC (Multi Version Concurrency Control). What it basically does is to give each connection a snapshot view of the whole database as it was when the first statement of the transaction started. To achieve this, every record in InnoDB internally can have multiple values - one for each snapshot. This is also why COUNTing on InnoDB takes some time - it depends on the snapshot state you see at that time.

For your DELETE transaction, each and every record that is identified according to your query conditions, gets marked for deletion. As other clients might be accessing the data at the same time, it cannot immediately remove them from the table, because they have to see their respective snapshot to guarantee the atomicity of the deletion.

Once all records have been marked for deletion, the transaction is successfully committed. And even then they cannot be immediately removed from the actual data pages, before all other transactions that worked with a snapshot value before your DELETE transaction, have ended as well.

So in fact your 3 minutes are not really that slow, considering the fact that all records have to be modified in order to prepare them for removal in a transaction safe way. Probably you will "hear" your hard disk working while the statement runs. This is caused by accessing all the rows. To improve performance you can try to increase InnoDB buffer pool size for your server and try to limit other access to the database while you DELETE, thereby also reducing the number of historic versions InnoDB has to maintain per record. With the additional memory InnoDB might be able to read your table (mostly) into memory and avoid some disk seeking time.

like image 181
Daniel Schneller Avatar answered Oct 22 '22 14:10

Daniel Schneller