
Why did SQL Server suddenly decide to use such a terrible execution plan?

Background

We recently had an issue with the query plans SQL Server was using on one of our larger tables (around 175,000,000 rows). The column and index structure of the table hasn't changed in 5+ years.

The table and indexes look like this:

create table responses (
    response_uuid uniqueidentifier not null,
    session_uuid uniqueidentifier not null,
    create_datetime datetime not null,
    create_user_uuid uniqueidentifier not null,
    update_datetime datetime not null,
    update_user_uuid uniqueidentifier not null,
    question_id int not null,
    response_data varchar(4096) null,
    question_type_id varchar(3) not null,
    question_length tinyint null,
    constraint pk_responses primary key clustered (response_uuid),
    constraint idx_responses__session_uuid__question_id unique nonclustered (session_uuid asc, question_id asc) with (fillfactor=80),
    constraint fk_responses_sessions__session_uuid foreign key(session_uuid) references dbo.sessions (session_uuid),
    constraint fk_responses_users__create_user_uuid foreign key(create_user_uuid) references dbo.users (user_uuid),
    constraint fk_responses_users__update_user_uuid foreign key(update_user_uuid) references dbo.users (user_uuid)
)

create nonclustered index idx_responses__session_uuid_fk on responses(session_uuid) with (fillfactor=80)

The query that was performing poorly (~2.5 minutes instead of the normal <1 second performance) looks like this:

SELECT 
[Extent1].[response_uuid] AS [response_uuid], 
[Extent1].[session_uuid] AS [session_uuid], 
[Extent1].[create_datetime] AS [create_datetime], 
[Extent1].[create_user_uuid] AS [create_user_uuid], 
[Extent1].[update_datetime] AS [update_datetime], 
[Extent1].[update_user_uuid] AS [update_user_uuid], 
[Extent1].[question_id] AS [question_id], 
[Extent1].[response_data] AS [response_data], 
[Extent1].[question_type_id] AS [question_type_id], 
[Extent1].[question_length] AS [question_length]
FROM [dbo].[responses] AS [Extent1]
WHERE [Extent1].[session_uuid] = @f6_p__linq__0;

(The query is generated by Entity Framework and executed using sp_executesql.)
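For context, the call EF issues looks roughly like this (a sketch: the column list is abbreviated and the GUID is a placeholder). Because the statement is parameterized, SQL Server compiles and caches one plan for it based on whatever parameter value it sees first ("parameter sniffing"), which becomes relevant below:

-- Sketch of the parameterized call Entity Framework sends via sp_executesql.
EXEC sp_executesql
    N'SELECT [Extent1].[response_uuid], [Extent1].[response_data]
      FROM [dbo].[responses] AS [Extent1]
      WHERE [Extent1].[session_uuid] = @f6_p__linq__0',  -- column list abbreviated
    N'@f6_p__linq__0 uniqueidentifier',
    @f6_p__linq__0 = '00000000-0000-0000-0000-000000000000';  -- placeholder GUID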

The execution plan during the poor performance period looked like this:

[screenshot: execution plan]

Some background on the data: running the query above never returns more than 400 rows. In other words, filtering on session_uuid really pares down the result set.

Some background on scheduled maintenance: a weekly job rebuilds the database's statistics and the table's indexes. The job runs a script that looks like this:

alter index all on responses rebuild with (fillfactor=80)

The resolution for the performance problem was to run the rebuild index script (above) on this table.
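(For reference, a lighter-weight diagnostic before reaching for a full rebuild is to check how stale the statistics on the table are and refresh just those; a sketch, assuming SQL Server 2008 R2 SP2 or later for sys.dm_db_stats_properties:)

-- When were the statistics on this table last updated, and how much
-- has the data changed since?
SELECT s.name AS stats_name,
       sp.last_updated,
       sp.rows,
       sp.rows_sampled,
       sp.modification_counter
FROM sys.stats AS s
CROSS APPLY sys.dm_db_stats_properties(s.object_id, s.stats_id) AS sp
WHERE s.object_id = OBJECT_ID('dbo.responses');

-- If the stats look stale, a full-scan statistics update is much cheaper
-- than rebuilding every index on a 175M-row table:
UPDATE STATISTICS dbo.responses WITH FULLSCAN;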

Other possibly relevant tidbits of information: the data distribution hasn't changed at all since the last index rebuild, and there are no joins in the query. We're a SaaS shop with 50 to 100 live production databases with exactly the same schema, some with more data, some with less, all with the same queries executing against them, spread across a few SQL Servers.

Question:

What could have happened that would make SQL Server start using this terrible execution plan in this particular database?

Keep in mind the problem was solved by simply rebuilding the indexes on the table.

Maybe a better question is "what are the circumstances where sql server would stop using an index?"

Another way of looking at it is "why would the optimizer not use an index that was rebuilt a few days ago and then start using it again after doing an emergency rebuild of the index once we noticed the bad query plan?"

asked Jan 09 '15 by Jeremy Danyow


3 Answers

The reason is simple: the optimizer changes its mind about what the best plan is. This can be due to subtle changes in the distribution of the data (or other reasons, such as a type incompatibility in a join key). I wish there were a tool that not only gave the execution plan for a query but also showed how close you are to the thresholds where the optimizer would switch to a different plan. Or a tool that would let you stash an execution plan and alert you if the same query starts using a different plan.
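In the meantime, a crude version of the "stash the plan" idea can be rolled with the plan-cache DMVs; a sketch (the LIKE filter is just an assumption for finding the query from this question):

-- Snapshot the cached plan(s) for the query so tonight's plan can be
-- diffed against last night's. query_plan_hash changes when the plan
-- shape changes.
SELECT qs.query_plan_hash,
       qs.execution_count,
       qs.total_elapsed_time / qs.execution_count AS avg_elapsed_us,
       qp.query_plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE st.text LIKE '%FROM [[]dbo].[[]responses]%';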

I've asked myself this exact same question on more than one occasion. You have a system that's running nightly, for months on end. It processes lots of data using really complicated queries. Then, one day, you come in in the morning and the job that normally finishes by 11:00 p.m. is still running. Arrrggg!

The solution that we came up with was to use explicit join hints for the failing joins (OPTION (MERGE JOIN, HASH JOIN)). We also started saving the execution plans for all our complex queries so we could compare changes from one night to the next. In the end, this was of more academic than practical interest: by the time the plans changed, we were already suffering from a bad execution plan.
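For illustration, the hint style described above looks like this (hypothetical tables; note that it constrains every join in the statement, so it's a last resort):

-- Restrict the optimizer to merge or hash joins for every join in the
-- statement, preventing it from "downgrading" to a nested-loops plan.
SELECT o.order_id, c.customer_name
FROM dbo.orders AS o
JOIN dbo.customers AS c
  ON c.customer_id = o.customer_id
OPTION (MERGE JOIN, HASH JOIN);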

answered Oct 04 '22 by Gordon Linoff


This is one of my most hated issues with SQL Server; I've had more than one failure due to it. Once, a query that had been working for months went from ~250ms to beyond the timeout threshold, causing a manufacturing system to crash at 3am, of course. It took a while to isolate the query, stick it into SSMS, and start breaking it into pieces, but everything I tried just "worked". In the end I just added the phrase "AND 1=1" to the query, which got things working again for a few weeks. The final patch was to "blind" the optimizer by copying all passed parameters into local variables. If the query works off the bat, it seems like it will continue to work.
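For reference, the "blinding" workaround described above is the classic local-variable defense against parameter sniffing: copying the parameter into a local variable hides its value at compile time, so the optimizer estimates from the average density of the column rather than a sniffed, possibly atypical, value. A minimal sketch using the table from the question (the procedure name is hypothetical):

-- Hypothetical sketch of the "blind the optimizer" workaround.
CREATE PROCEDURE dbo.get_session_responses
    @session_uuid uniqueidentifier
AS
BEGIN
    -- The optimizer cannot sniff @local_session_uuid, so it builds the
    -- plan from average density statistics on session_uuid instead.
    DECLARE @local_session_uuid uniqueidentifier = @session_uuid;

    SELECT response_uuid, session_uuid, question_id, response_data
    FROM dbo.responses
    WHERE session_uuid = @local_session_uuid;
END;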

To me, a reasonably simple fix from MS would be: if this query has been profiled already and ran just fine last time, and the relevant statistics haven't changed significantly (e.g. come up with some factor for changes in table sizes, new indexes, etc.), and the optimizer decides to spice things up with a new execution plan, then if that new and improved plan takes more than some multiple of the old plan's duration, abort it and switch back. I can understand a change if a table goes from 100 to 100,000,000 rows or if a key index is deleted, but in a stable production environment, when a query jumps to between 100x and 1000x slower, it can't be that hard to detect this, flag the plan, and go back to the previous one.
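For what it's worth, SQL Server 2017 later shipped something very close to this idea: automatic plan correction, which uses Query Store (see the answer below) to detect a plan regression and force the last known good plan. A sketch, assuming Query Store is enabled on the database:

-- SQL Server 2017+: when Query Store detects that a new plan performs
-- significantly worse than the previous one, automatically force the
-- last known good plan.
ALTER DATABASE CURRENT
SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON);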

answered Oct 05 '22 by mszil


Newer SQL Server versions (2016 and later) have a great feature called "Query Store" that lets you analyse recent query performance.

If you see a query that sometimes uses a "fast" plan and sometimes a "slow" one, you can force the fast plan. See the screenshot: the "yellow circle" plan is the fast one, while the "blue square" plan is not (it sits higher on the "duration" chart).

[screenshot: Query Store plan comparison]
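Forcing can also be done in T-SQL once the query and plan are identified in the Query Store catalog views; a sketch (the LIKE filter and the IDs passed to sp_query_store_force_plan are placeholders):

-- Find the plans Query Store has recorded for queries against the table...
SELECT q.query_id, p.plan_id, rs.avg_duration
FROM sys.query_store_query AS q
JOIN sys.query_store_plan AS p ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs ON rs.plan_id = p.plan_id
JOIN sys.query_store_query_text AS qt ON qt.query_text_id = q.query_text_id
WHERE qt.query_sql_text LIKE '%responses%';

-- ...then pin the fast plan (IDs are placeholders):
EXEC sp_query_store_force_plan @query_id = 42, @plan_id = 7;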

answered Oct 04 '22 by Alex from Jitbit