Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

MySQL : Permanently getting " Waiting for table metadata lock"

My MySQL database serves three webapps as the storage backend. However I recently encounter permanantly the error "Waiting for table metadata lock". It happen nearly all the time and I do not understand why.

mysql> show processlist
    -> ;
+------+-----------+-----------------+------------+---------+------+---------------------------------+------------------------------------------------------------------------------------------------------+
| Id   | User      | Host            | db         | Command | Time | State                           | Info                                                                                                 |
+------+-----------+-----------------+------------+---------+------+---------------------------------+------------------------------------------------------------------------------------------------------+
|   36 | root      | localhost:33444 | bookmaker2 | Sleep   |  139 |                                 | NULL                                                                                                 |
|   37 | root      | localhost:33445 | bookmaker2 | Sleep   |  139 |                                 | NULL                                                                                                 |
|   38 | root      | localhost:33446 | bookmaker2 | Sleep   |  139 |                                 | NULL                                                                                                 |
|   39 | root      | localhost:33447 | bookmaker2 | Sleep   |   49 |                                 | NULL                                                                                                 |
|   40 | root      | localhost:33448 | bookmaker2 | Sleep   |  139 |                                 | NULL                                                                                                 |
| 1315 | bookmaker | localhost:34869 | bookmaker  | Sleep   |   58 |                                 | NULL                                                                                                 |
| 1316 | root      | localhost:34874 | bookmaker3 | Sleep   |   56 |                                 | NULL                                                                                                 |
| 1395 | bookmaker | localhost:34953 | bookmaker  | Sleep   |   58 |                                 | NULL                                                                                                 |
| 1396 | root      | localhost:34954 | bookmaker3 | Sleep   |   46 |                                 | NULL                                                                                                 |
| 1398 | root      | localhost:34956 | bookmaker3 | Query   |   28 | Waiting for table metadata lock | CREATE TABLE IF NOT EXISTS LogEntries  ( 
                    lid         INT NOT NULL AUTO_INCREMEN |
| 1399 | root      | localhost       | NULL       | Query   |    0 | NULL                            | show processlist                                                                                     |
+------+-----------+-----------------+------------+---------+------+---------------------------------+------------------------------------------------------------------------------------------------------+

Of course one can kill the corresponding process. However if I restart my program that tries to create the table structure of my database "bookmaker3" the newly created process ends up again in a metalock.

I even cannot drop the database:

mysql> drop database bookmaker3;

This yields also a metalock.

How can this be repaired?

like image 593
toom Avatar asked Nov 05 '13 23:11

toom


People also ask

What does waiting for table metadata lock mean?

This means that when a transaction (including XA transactions) uses a table, it locks its metadata until the end of transaction.

What is metadata lock?

A metadata lock on a table prevents changes to the table's structure. This locking approach has the implication that a table that is being used by a transaction within one session cannot be used in DDL statements by other sessions until the transaction ends.


2 Answers

The accepted solution is, unfortunately, wrong. It is right as far as it says,

Kill the connection with lock

This is indeed (almost surely; see below) what to do. But then it suggests,

Kill 1398 

...and 1398 is not the connection with the lock. How could it be? 1398 is the connection waiting for the lock. This means it does not yet have the lock, and therefore, killing it avails nothing. The process holding the lock will still hold the lock, and the next thread trying to do something will therefore also stall and enter "Waiting for metadata lock" in due order.

You have no guarantee that the processes "waiting for metadata lock" (WFML) won't also block, but you can be certain that killing only WFML processes will achieve exactly nothing.

The real cause is that another process is holding the lock, and more importantly, SHOW FULL PROCESSLIST will not tell you directly which it is.

It WILL tell you if the process is doing something, yes. Usually it works. Here, the process holding the lock is doing nothing, and hides among other threads also doing nothing.

In this case the culprit is almost certainly process 1396, which started before process 1398 and is now in Sleep state, and has been for 46 seconds. Since 1396 clearly did all that it needed to do (as proved by the fact that it is now sleeping, and has done so for 46 seconds, as far as MySQL is concerned), no thread having gone to sleep before that could have held a lock (or 1396 would also have stalled).

IMPORTANT: if you connected to MySQL as a limited user, SHOW FULL PROCESSLIST will not show all the processes. So the lock might be held by a process that you don't see.

A better SHOW PROCESSLIST

SELECT ID, TIME, USER, HOST, DB, COMMAND, STATE, INFO     FROM INFORMATION_SCHEMA.PROCESSLIST WHERE DB IS NOT NULL     AND (`INFO` NOT LIKE '%INFORMATION_SCHEMA%' OR INFO IS NULL)     ORDER BY `DB`, `TIME` DESC 

The above can be tuned to show only the processes in SLEEP state, and anyway it will sort them by time descending, so it is easier to find the process that is hanging (it usually is the Sleep'ing one immediately before the ones "waiting for metadata lock").

The important thing

Leave any "waiting for metadata lock" process alone.

Quick and dirty solution, not really recommended but quick

Kill all processes in "Sleep" state, on the same database, that are older than the oldest thread in "waiting for metadata lock" state. This is what Arnaud Amaury would have done:

  • for each database that has at least one thread in WaitingForMetadataLock:
    • the oldest connection in WFML on that DB turns out to be Z seconds old
    • ALL the "Sleep" threads on that DB and older than Z must go. Start with the freshest ones, just in case.
    • If one older and non-sleeping connection exists on that DB, then maybe that is the one holding the lock, but it is doing something. You may of course kill it, but especially if it is an UPDATE/INSERT/DELETE, you do so at your own peril.

Ninety-nine times out of one hundred, the thread to be killed is the youngest among those in Sleep state that are older than the older one waiting for metadata lock:

TIME     STATUS 319      Sleep 205      Sleep  19      Sleep                      <--- one of these two "19"  19      Sleep                      <--- and probably this one(*)  15      Waiting for metadata lock  <--- oldest WFML  15      Waiting for metadata lock  14      Waiting for metadata lock 

(*) the TIME order actually has milliseconds, or so I was told, it just doesn't show them. So while both processes have a Time value of 19, the lowest one ought to be younger.

More focused fix

Run SHOW ENGINE INNODB STATUS and look at the "TRANSACTION" section. You will find, among others, something like

TRANSACTION 1701, ACTIVE 58 sec;2 lock struct(s), heap size 376, 1 row lock(s), undo log entries 1 MySQL thread id 1396, OS thread handle 0x7fd06d675700, query id 1138 hostname 1.2.3.4 whatever; 

Now you check with SHOW FULL PROCESSLIST what is thread id 1396 doing with its #1701 transaction. Chances are it is in "Sleep" status. So: an active transaction (#1701) with an active lock, it has even done some changes as it has an undo log entry... but is currently idle. That and no other is the thread you need to kill. Losing those changes.

Remember that doing nothing in MySQL does not mean doing nothing in general. If you get some records from MySQL and build a CSV for FTP upload, during the FTP upload the MySQL connection is idle.

Actually if the process using MySQL and the MySQL server are on the same machine, that machine runs Linux, and you have root privileges, there's a way to find out which process has the connection that requested the lock. This in turn allows to determine (from CPU usage or, at worst, strace -ff -p pid) whether that process is really doing something or not, to help decide if it's safe to kill.

Why does this happen?

I see this happening with webapps that use "persistent" or "pooled" MySQL connections, which nowadays usually save very little time: the webapp instance terminated, but the connection did not, so its lock is still alive... and blocking everyone else.

Another interesting way that I found is, in the hypotheses above, to run a query returning some rows, and only retrieve some of them. If the query is not set to "auto-clean" (however the underlying DBA does it), it will keep the connection open and prevent a full lock on the table from going through. I had this happen to me in a piece of code that verified whether a row existed by selecting that row and verifying whether it got an error (not exists) or not (it must exist), but without actually retrieving the row.

PHP and PDO

PDO has persistent connection capability. This is how I ensure PDO does not pool connections and closes every one of them. It's messy.

When opening, set the options (fourth option to new PDO()):

PDO::ATTR_PERSISTENT => false 

When disconnecting:

// We should have no transactions and no locks. // So we discard them. try {     $pdo->exec('ROLLBACK WORK');     $pdo->exec('UNLOCK TABLES'); } catch (Exception $err) {     // Send a mail } // No cooperative locks. So this will not hurt a bit. try {     $pdo->exec('DO RELEASE_ALL_LOCKS()'); } catch (Exception $err) {     // Send a mail } // Ensure the connection withers on the vine, but not too soon. $pdo->exec('SET wait_timeout = 5');  // $pdo->setAttribute(PDO::ATTR_TIMEOUT, 5); // If nothing else works! // try { //     $pdo->exec('KILL CONNECTION_ID()'); // } catch (Exception $err) { //     // Exception here is expected: "Query execution was interrupted" // } // Invoke the garbage collector $pdo = NULL; 

Ask the DB

Another way to get the culprit if you have a recent MySQL, but not too recent since this is going to be deprecated, is (you need privileges again on the information schema)

SELECT * FROM INFORMATION_SCHEMA.INNODB_LOCKS       WHERE LOCK_TRX_ID IN          (SELECT BLOCKING_TRX_ID FROM INFORMATION_SCHEMA.INNODB_LOCK_WAITS); 

Actual solution, requiring time and work

The problem is usually caused by this architecture:

 webapp (jar, php) --> container or app connection pool    (mysqldb, php_module, fastcgi...) -->     --> MySQL 

When the webapp dies, or the webapp lightweight thread instance dies, the container/connection pool might not. And it is the container that keeps the connection open, so obviously the connection does not close. Quite predictably, MySQL does not consider the operation complete.

If the webapp didn't clean after itself (no ROLLBACK or COMMIT for a transaction, no UNLOCK TABLES, etc.), then whatever that webapp started doing is still extant, and might still be blocking everyone else.

There are then two solutions. The worse one is to lower the idle timeout. But guess what happens if you wait too long between two queries (exactly: "MySQL server has gone away"). You could then use mysql_ping if available (soon to be deprecated. There are workarounds for PDO. Or you might check for that error, and reopen the connection if it happens (this is the Python way). So - for a small performance fee - it's doable.

The better, smarter solution is less straightforward to implement. Endeavour to have the script clean after itself, ensuring to retrieve all rows or free all query resources, catch all exception and deal with them properly, or, if possible, skip persistent connections altogether. Let each instance create its own connection or use a smart pool driver (in PHP PDO, use PDO::ATTR_PERSISTENT explicitly set to false). Alternatively (e.g. in PHP) you can have destruct and exception handlers force clean the connection by committing or rolling back transactions and issuing explicit table unlocks.

I do not know of a way of querying for extant resultset resources in order to free them; the only way would be to save those resources in a private array.

like image 100
LSerni Avatar answered Sep 24 '22 00:09

LSerni


Kill the connection with lock

Kill 1398

Then check if you have autocommit set to 0 by

select @@autocommit;

If yes, you propably forgot to commit transaction. Then another connection want to do something with this table, which causes the lock.

In your case: If you made some query to LogEntries (which exists) and did not commit it, then you try to execute CREATE TABLE IF NOT EXISTS from another connection - metadata lock happens.

edit For me the bug is somewhere at your application. Check there, or set autocommit to 1 if your not using transactions in application.

ps also check this posts:

like image 29
michalczukm Avatar answered Sep 24 '22 00:09

michalczukm