 

Select only one table row on high parallel connections

I'm looking for a way to select one table row exclusively for one thread. I've written a crawler that works with about 50 parallel processes. Every process has to take one row out of a table and process it.

CREATE TABLE `crawler_queue` (
 `id` int(10) unsigned NOT NULL AUTO_INCREMENT,
 `url` text NOT NULL,
 `class_id` tinyint(3) unsigned NOT NULL,
 `server_id` tinyint(3) unsigned NOT NULL,
 `proc_id` mediumint(8) unsigned NOT NULL,
 `prio` tinyint(3) unsigned NOT NULL,
 `inserted` int(10) unsigned NOT NULL,
 PRIMARY KEY (`id`),
 KEY `proc_id` (`proc_id`),
 KEY `crawler` (`class_id`,`prio`,`proc_id`)
) ENGINE=InnoDB AUTO_INCREMENT=1 DEFAULT CHARSET=utf8

Now my processes do the following (see the SQL sketch after the list):

  • start DB transaction
  • do a select like SELECT * FROM crawler_queue WHERE class_id=2 AND prio=20 AND proc_id=0 ORDER BY id LIMIT 1 FOR UPDATE
  • then update this row with UPDATE crawler_queue SET server_id=1,proc_id=1376 WHERE id=23892
  • commit transaction
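
Expressed as SQL, one claim cycle looks roughly like this (a sketch assembled from the statements above; 23892 stands in for whichever id the SELECT returns):

START TRANSACTION;

-- lock one unclaimed row for this worker
SELECT * FROM crawler_queue
 WHERE class_id=2 AND prio=20 AND proc_id=0
 ORDER BY id LIMIT 1
 FOR UPDATE;

-- mark the locked row as claimed by this server/process
UPDATE crawler_queue SET server_id=1, proc_id=1376 WHERE id=23892;

COMMIT;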

This should ensure that no other process can grab a row that is already being processed. An EXPLAIN on the select shows

id  select_type  table          type  possible_keys    key      key_len  ref    rows    Extra
1   SIMPLE       crawler_queue  ref   proc_id,crawler  proc_id  3        const  617609  Using where

But the parallelism seems to be too high for this approach, because every 5 minutes or so I see two types of errors/warnings in my log:

mysqli::query(): (HY000/1205): Lock wait timeout exceeded; try restarting transaction (in /var/www/db.php line 81)

mysqli::query(): (40001/1213): Deadlock found when trying to get lock; try restarting transaction (in /var/www/db.php line 81)

My question is: can anybody point me in the right direction to minimize these locking problems? (In production, the parallelism would be 3-4 times higher than it is now, so I assume there would be many more locking problems.)

Update: I modified the SELECT to use the crawler index via the hint USE INDEX(crawler). My problem now is only lock wait timeouts (the deadlocks disappeared).
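
The modified statement looks like this (the WHERE clause carried over unchanged from the query above):

SELECT * FROM crawler_queue USE INDEX(crawler) WHERE class_id=2 AND prio=20 AND proc_id=0 ORDER BY id LIMIT 1 FOR UPDATE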

EXPLAIN with USE INDEX() now shows (the number of rows is higher because the table contains more data now):

id  select_type  table          type  possible_keys    key      key_len  ref                rows     Extra
1   SIMPLE       crawler_queue  ref   proc_id,crawler  crawler  5        const,const,const  5472426  Using where
asked Oct 05 '22 by rabudde

2 Answers

Your EXPLAIN report shows that you're using only the single-column index proc_id, and the query has to examine over 600K rows. It would probably be better if the optimizer chose the crawler index.

InnoDB may be locking all 600K rows, not just the rows that match the full condition in your WHERE clause. InnoDB locks all the examined rows to make sure concurrent changes don't get written to the binlog in the wrong order.

The solution is to use an index to narrow the range of examined rows. This will probably help you not only to find the rows more quickly, but also to avoid locking large ranges of rows. The crawler index should help here, but it's not immediately clear why the optimizer isn't using it.

You might have to run ANALYZE TABLE to update InnoDB's table statistics, so the optimizer knows enough about the crawler index to choose it in the query plan. ANALYZE TABLE is an inexpensive operation.
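
For this table that's a single, quick statement:

ANALYZE TABLE crawler_queue;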

The other option is to use an index hint:

SELECT * FROM crawler_queue USE INDEX(crawler) ...

This tells the optimizer to use that index and not to consider other indexes for this query. I prefer to avoid index hints, because the optimizer can usually make good decisions on its own, and a hint hard-coded in the application may prevent the optimizer from considering an index I create in the future, one it would otherwise choose.


With more explanation, it's now clear you're using your RDBMS as a FIFO. This is not an efficient use of an RDBMS. There are message queue technologies for this purpose.

See also:

  • https://stackoverflow.com/a/13851231/20860
  • The Database as Queue Anti-Pattern.
answered Oct 10 '22 by Bill Karwin


From what I can tell, the problem you're facing is that two threads are vying for the same row in the table and they both can't have it. But there isn't any elegant way for the database to say "no, you can't have that one, find another row", and thus you get errors. This is called resource contention.

When you're doing highly parallel work like this one of the easiest ways to reduce contention-based problems is to completely eliminate the contention by inventing a way for all the threads to know which rows they're supposed to work on ahead of time. Then they can lock without having to contend for the resources and your database doesn't have to resolve the contention.

How best to do this? Usually people pick some kind of thread-id scheme and use modulo arithmetic to determine which threads get which rows. If you have 10 threads, then thread 0 gets rows 0, 10, 20, 30, etc. Thread 1 gets 1, 11, 21, 31, etc.

In general, if you have NUM_THREADS threads, then each of your threads would pick the ids which are THREAD_ID + i*NUM_THREADS from the database and work on those.
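
As a sketch, assuming 10 workers and a worker with thread id 3 (the predicate id % 10 = 3 picks exactly the ids 3, 13, 23, ..., i.e. THREAD_ID + i*NUM_THREADS), the claim query could look like:

SELECT * FROM crawler_queue
 WHERE class_id=2 AND prio=20 AND proc_id=0
   AND id % 10 = 3   -- id MOD NUM_THREADS = THREAD_ID
 ORDER BY id LIMIT 1
 FOR UPDATE;

Since no two workers share a residue class, they never fight over the same row, and the FOR UPDATE lock is uncontended.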

We have introduced a problem in that threads may stall or die, so you could end up with rows in the database that never get touched. There are several solutions to that problem. One is to run a "cleanup" pass once most/all of your threads have finished, where the threads grab piecemeal whatever rows they can and crawl them until there are no un-crawled URLs left. You could get more sophisticated and have a few cleanup threads constantly running, or have each thread occasionally perform cleanup duties, etc.

answered Oct 10 '22 by Mike Sandford