Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Corrupt Azure SQL stored procedure could only be fixed by drop recreate

Pardon me if this is a duplicate. The closest I could find was Random timeout running a stored proc - drop recreate fixes but I'm not certain the answers there about recompiling the stored procedure apply.

I have an Azure SQL database, latest version, that has a lot of traffic from an Azure web app front end. I have a nightly remote job that runs a batch to rebuild indexes on the Azure SQL database as that seems to help greatly with controlling database size and performance.

Normally, the rebuilding of indexes takes about 20 minutes. Last night it timed out after 2 hours. The error handler in that batch did not log any errors.

Soon after rebuilding indexes was started, one particular stored procedure starting timing out for every client calling it. Other stored procedures using the same tables were not having any issues. When I discovered the problem, I could alleviate all the timeouts and suspended processes by altering the stored procedure to immediately return. When I altered the stored procedure again to behave normally, the issues reappeared immediately. My understanding is that altering the stored procedure forced it to recompile but that didn't fix it.

Ultimately, I completely dropped and recreated the procedure with the original code and the issue was resolved.

This procedure and the schema it uses have been completely stable for many months. The procedure itself is quite simple:

CREATE Procedure [dbo].[uspActivityGet] (@databaseid uniqueidentifier) AS
begin
    SET NOCOUNT ON;
    --There may be writing activities to the table asynchronously, do not use nolock on tblActivity - the ActivityBlob might be null in a dirty read.
    select top 100 a.Id, h.HandsetNumber, a.ActivityBlob, a.ActivityReceived
    from dbo.tblDatabases d with(nolock) join dbo.tblHandsets h with(nolock) on d.DatabaseId = h.DatabaseId join dbo.tblActivity a on h.Id = a.HandsetId
    where d.DatabaseId = @databaseid and a.ActivitySent is null
    order by a.ActivityReceived
end

While the procedure would hang and time out with something like this:

exec dbo.uspActivityGet 'AF3EA01B-DB22-4A39-9E1C-D096D2DF1215'

Running the identical select in a query window would return promptly and successfully:

declare @databaseid uniqueidentifier; set @databaseid = 'AF3EA01B-DB22-4A39-9E1C-D096D2DF1215'
select top 100 a.Id, h.HandsetNumber, a.ActivityBlob, a.ActivityReceived
from dbo.tblDatabases d with(nolock) join dbo.tblHandsets h with(nolock) on d.DatabaseId = h.DatabaseId join dbo.tblActivity a on h.Id = a.HandsetId
where d.DatabaseId = @databaseid and a.ActivitySent is null
order by a.ActivityReceived

Any ideas how I can prevent this from happening in the future? Thank you.

Edit - Adding execution plan screenshot Execution Plan of stored procedure above

Edit - Adding query used to view running processes. There were many, guessing aproximately 150, in the suspended state and they were all for the same stored procedure - uspActivityGet. Also, Data IO Percentage was maxed out the whole time when it normally runs 20 - 40% in peak demand times. I don't recall what the wait type was. Here is the query used to view that.

select * from sys.dm_Exec_requests r with(nolock) CROSS APPLY sys.dm_exec_sql_text(r.sql_handle)  order by r.total_elapsed_time desc

Edit - It happened again tonight. Here is the execution plan of the same procedure during the issue. After dropping and creating the procedure again, the execution plan returned to normal and the issue was resolved.

During the issue, sp_executesql with the identical query took about 5 minutes to execute and I believe that is representative of what was happening. There were about 50 instances of uspActivityGet suspended with wait type SLEEP_TASK or IO_QUEUE_LIMIT.

Perhaps the next question is why is index rebuilding or other nightly maintenance doing this to the execution plan?

enter image description here

like image 528
RJBreneman Avatar asked Jan 07 '16 15:01

RJBreneman


1 Answers

The clues are in the query and the troublesome execution plan. See Poor Performance with Parallelism and Top

The normal execution plan seems quite efficient and shouldn't need recompiled as long as the relevant schema doesn't change. I also want to avoid parallelism in this query. I added the following two options to the query for assurance on both points and all is happy again.

OPTION (KEEPFIXED PLAN, MAXDOP 1)
like image 191
RJBreneman Avatar answered Oct 01 '22 01:10

RJBreneman