We have a set of 5 online auction systems running on Windows Azure & SQL Azure. Each system consists of a single web worker and one or more web roles. Each system is using ASP.NET MVC 3 and Entity Framework, Repository Pattern and StructureMap.
The worker role is responsible for housekeeping and runs two groups of processes. One group is run every ten seconds, the other every second. Each process will likely run a database query or stored procedure. These are scheduled with Quartz.net
The web role serves the public interface and back office. Among other basic crud functionality, both of these provide screens which, when open, will repeatedly call controller methods which will result in execution of stored procedure read-only queries. The frequency of repetition is about 2-3 seconds per client. A typical use case would be 5 back office windows open, and 25 end user windows open – all hitting the system repeatedly.
For a long time we have been experiencing intermittent SQL timeout errors. Three of the most common ones are:
System.Data.SqlClient.SqlException: A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - An existing connection was forcibly closed by the remote host.)
System.Data.SqlClient.SqlException: A transport-level error has occurred when receiving results from the server. (provider: TCP Provider, error: 0 - The semaphore timeout period has expired.)
System.Data.SqlClient.SqlException: Timeout expired. The timeout period elapsed prior to completion of the operation or the server is not responding.
The only predictable scenario is during an auction where a specific controller -> sproc starts to timeout during the event (presumably due to load). All other times the errors appear to be completely random and come in singles, two’s, and three’s etc. even during periods of user inactivity. For example the system will go 18 hours without an error and then could be 5 – 10 errors from different housekeeping methods, or perhaps a user logged on and viewed their account.
Other info:
I have tried to run the affected queries/sprocs on SQL Azure using both local SSMS and Azure web-based query tool – all seem to execute quickly, 1 second max. Query plans not showing anything too suspicious although I am by no means a SQL query performance expert, or any other kind of expert for that matter J
We have wrapped all affected areas in Azure SQL Transient Fault Handling Blocks – but as is discussed here http://social.msdn.microsoft.com/Forums/en-US/ssdsgetstarted/thread/7a50985d-92c2-472f-9464-a6591efec4b3, they do not catch timeouts, and according to “Valery M” this is for good reason.
We are not storing any session information in the database, although asp.net membership information is stored in the database.
We use 1 “SQL Azure server instance” which hosts all 5 databases, two for staging and three for production. All 5 systems are generally active at the same time although it is unlikely that more than one will be in a state of live load use at any given time. All web roles, worker roles and the SQL Azure server reside in the same Azure Geographical Region.
Any thoughts on where we should be looking? Would it help giving each system it's own SQL Azure server? ... Failing a solution by ourselves - is it possible to get Microsoft to open a support ticket and take a look under the hood at what’s going on in with our application – how does one go about this?
Thanks in advance.
Ilan
SQL Azure is a multitenant system and you could be suffering from potential over usage from other tenants. Microsoft does an OK job by keeping other tenants throttled, but once in a while SQL Azure queries do time out..
To open support with Microsoft visit this page: https://support.microsoft.com/oas/default.aspx?gprid=14919&st=1&wfxredirect=1&sd=gn
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With