SQL Azure issue.
I've got an issue that manifests as the following exception on our (asp.net) site:
Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding. The statement has been terminated.
It also results in update and insert statements never completing in SMSS. There aren't any X or IX locks present when querying: sys.dm_tran_locks
and there are no transactions when querying sys.dm_tran_active_transactions
or sys.dm_tran_database_transactions
.
The problem is present for every table in the database but other databases on the same instance don't cause the problem. The duration of the issue can be anywhere from 2 minutes to 2 hours and doesn't happen at any specific times of day.
The database is not full.
At one point this issue didn't resolve itself but I was able to resolve the issue by querying sys.dm_exec_connections
finding the longest running session, and then killing it. The odd thing is, that the connection was 15 minutes old, but the lock issue had been present for over 3 hours.
Is there anything else I can check?
EDIT
As per Paul's answer below. I'd actually tracked down the problem before he answered. I will post the steps I used to figure this out below, in case they help anyone else.
The following queries were run when a "timeout period" was present.
select * from sys.dm_exec_requests
As we can see, all the WAIT requests are waiting on session 1021 which is the replication request! The TM Request
indicates a DTC transaction and we don't use distributed transactions. You can also see the wait_type of SE_REPL_COMMIT_ACK
which again implicates replication.
select * from sys.dm_tran_locks
Again waiting on session 1021
SELECT * FROM sys.dm_db_wait_stats ORDER BY wait_time_ms desc
And yes, SE_REPL_CATCHUP_THROTTLE
has a total wait time of 8094034
ms, that is 134.9minutes!!!
Also see the following forum for details on this issue. http://social.technet.microsoft.com/Forums/en-US/ssdsgetstarted/thread/c3003a28-8beb-4860-85b2-03cf6d0312a8
I've been given the following answer in my communication with Microsoft (we've seen this issue with 4 of our 15 databases in the EU data center):
Question: Have there been changes to these soft throttling limits in the last three weeks ie since my problems started?
Answer: No, there has not.
Question: Are there ways we can prevent or be warned we are approaching a limit?
Answer: No. The issue may not be caused by your application but can be caused by other tenants relying on the same physical hardware. In other words, your application can have very little load and still run into the problem. In other words, your own traffic may be a cause of this problem, but it can just as well be caused by other tenants relying on the same physical hardware. There's no way to know beforehand that the issue will soon occur - it can occur at any time without warning. The SQL Azure operations team does not monitor this type of error, so they won't automatically try to solve the problem for you. So if you run into it you have two opitions:
Create a copy of your db and use that and hope the db is placed on another server with less load.
Contact Windows Azure Support and inform the about the problem and let them do Option 1 for you
When performing an insert operation, the database does not lock the entire table, but locks the row. The following is the test I did, you can see that only the row is locked(X lock) when inserting data, and there is only an Intent lock(IX lock) on the table: use test.
32,767 files per database, unless the instance storage size limit has been reached. Maximum size of each data file is 8 TB.
You might be running into the SE_REPL* issues that are currently plaguing a lot of folks using Sql Azure (my company included).
When you experience the timeouts, try checking your wait requests for wait types of:
Run the following to check your wait types on current connections:
SELECT TOP 10 r.session_id, r.plan_handle,
r.sql_handle, r.request_id,
r.start_time, r.status,
r.command, r.database_id,
r.user_id, r.wait_type,
r.wait_time, r.last_wait_type,
r.wait_resource, r.total_elapsed_time,
r.cpu_time, r.transaction_isolation_level,
r.row_count
FROM sys.dm_exec_requests r
You can also check a history of sorts for this by running:
SELECT * FROM sys.dm_db_wait_stats
ORDER BY wait_time_ms desc
If you're seeing a lot of SE_REPL* wait types and these are staying set on your connections for any length of time, then basically you're screwed. Microsoft are aware of the problem, but I've had a support ticket open for a week with them now and they're still working on it apparently.
The SE_REPL* waits happen when the Sql Azure replication slaves fall behind. Basically the whole db suspends queries while replication catches up :/
So essentially the aspect that makes Sql Azure highly available is causing databases to become randomly unavailable. I'd laugh at the irony if it wasn't killing us.
Have a look at this thread for details: http://social.technet.microsoft.com/Forums/en-US/ssdsgetstarted/thread/c3003a28-8beb-4860-85b2-03cf6d0312a8
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With