I'm looking for a way to select one table row explicitly for one thread. I've written a crawler, that works with about 50 parallel processes. Every process has to take one row out of a table and process it.
CREATE TABLE `crawler_queue` (
`id` int(10) unsigned NOT NULL AUTO_INCREMENT,
`url` text NOT NULL,
`class_id` tinyint(3) unsigned NOT NULL,
`server_id` tinyint(3) unsigned NOT NULL,
`proc_id` mediumint(8) unsigned NOT NULL,
`prio` tinyint(3) unsigned NOT NULL,
`inserted` int(10) unsigned NOT NULL,
PRIMARY KEY (`id`),
KEY `proc_id` (`proc_id`),
KEY `app_id` (`app_id`),
KEY `crawler` (`class_id`,`prio`,`proc_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8
Now my processes do the following:
SELECT * FROM crawler_queue WHERE class_id=2 AND prio=20 AND proc_id=0 ORDER BY id LIMIT 1 FOR UPDATE
UPDATE crawler_queue SET server_id=1,proc_id=1376 WHERE id=23892
This should help that no other process can grab a row that is processed yet. Doing an EXPLAIN on the select shows
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE crawler_queue ref proc_id,crawler proc_id 3 const 617609 Using where
But the processes seem to cause too high parallelism, because sometimes I can see two types of errors/warnings in my log (every 5 minutes or so):
mysqli::query(): (HY000/1205): Lock wait timeout exceeded; try restarting transaction (in /var/www/db.php l
ine 81)
mysqli::query(): (40001/1213): Deadlock found when trying to get lock; try restarting transaction (in /var/www/db.php line 81)
My question is: can anybody point me in the right direction to minimize these locking problems? (in production state, the parallelism would be 3-4 times higher than now, so I assume, that there would be much more locking problems)
I modified SELECT
to use index crawler
by hint USE INDEX(crawler)
. My problem now are lockwait timeouts anymore (deadlocks disappeared).
EXPLAIN
with USE INDEX()
shows now (no. of rows is higher, because table contains more data now):
id select_type table type possible_keys key key_len ref rows Extra
1 SIMPLE crawler_queue ref proc_id,crawler crawler 5 const,const,const 5472426 Using where
Your EXPLAIN report shows that you're using only the single-column index proc_id
, and the query has to examine over 600K rows. It would probably be better if the optimizer chose the crawler
index.
InnoDB may be locking all 600K rows, not just the rows that match the full condition in your WHERE clause. InnoDB locks all the examined rows to make sure concurrent changes don't get written to the binlog in the wrong order.
The solution is to use an index to narrow the range of examined rows. This will probably help you not only to find the rows more quickly, but also to avoid locking large ranges of rows. The crawler
index should help here, but it's not immediately clear why it's not using that index.
You might have to ANALYZE TABLE
to make sure to update InnoDB's table statistics to know about the crawler
index before it uses that index in the optimization plan. ANALYZE TABLE is an inexpensive operation.
The other option is to use an index hint:
SELECT * FROM crawler_queue USE INDEX(crawler) ...
This tells the optimizer to use that index, and do not consider other indexes for this query. I prefer to avoid index hints, because the optimizer is usually able to make good decisions on its own, and using the hint in code means I may be forcing the optimizer not to consider an index I create in the future, which it would otherwise choose.
With more explanation, it's now clear you're using your RDBMS as a FIFO. This is not an efficient use of an RDBMS. There are message queue technologies for this purpose.
See also:
From what I can tell the problem that you're facing is that two threads are vyying for the same row in the table and they both can't have it. But there isn't any elegant way for the database to say "no you can't have that one, find another row" and thus you get errors. This is called resource contention.
When you're doing highly parallel work like this one of the easiest ways to reduce contention-based problems is to completely eliminate the contention by inventing a way for all the threads to know which rows they're supposed to work on ahead of time. Then they can lock without having to contend for the resources and your database doesn't have to resolve the contention.
How best to do this? Usually people pick some kind of thread-id scheme and use modulo arithmetic to determine which threads get which rows. If you 10 threads then thread 0 gets row 0, 10, 20, 30, etc. Thread 1 gets 1, 11, 21, 31, etc.
In general if you have NUM_THREADS then each of your threads would pick the ids which are THREAD_ID + i*NUM_THREADS from the database and work on those.
We have introduced a problem in that threads may stall or die, and you could end up with rows in the database which never get touched. There are several solutions to that problem, one of which is to run a "cleanup" once most/all of your threads have finished where all the threads grab piecemeal whatever rows they can and crawl them until there are no un-crawled URLs left. You could get more sophisticated and have a few cleanup threads constantly running, or have each thread occasionally perform cleanup duties, etc.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With