I realize this question is kind of vague, but I'm hoping someone can point me in a direction of things to look at.
My problem is that I have a query within a program that is driving the CPUs on my SQL Server to 100%. I think that the problem is related to parallelism, but I don't know how to confirm my suspicion, or the best way to fix the problem.
The query in question spent about 20 minutes running the server at 100% CPU, at which point I killed the query. I used sys.dm_exec_query_stats to look at the query execution plan. There are no index scans. The index operations are all seeks. I'm confident that my indexes are well chosen for this query. I do see a lot of parallelism going on within the query.
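For reference, a query along these lines will pull the cached plan and CPU numbers out of sys.dm_exec_query_stats (the text filter is a placeholder; substitute a fragment of your own query):

```sql
-- Find the cached plan for the suspect query; the LIKE filter is illustrative.
SELECT TOP (10)
    qs.execution_count,
    qs.total_worker_time / qs.execution_count AS avg_cpu_microseconds,
    st.text       AS query_text,
    qp.query_plan AS cached_plan   -- click in SSMS to open the graphical plan
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle)    AS st
CROSS APPLY sys.dm_exec_query_plan(qs.plan_handle) AS qp
WHERE st.text LIKE '%PS_TRANSACTION_INV%'
ORDER BY qs.total_worker_time DESC;
```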
For comparison, I tried running the same query in the query window in Management Studio. The query runs in no more than a few seconds, and the Display Estimated Execution Plan button gives me a different execution plan. One of the differences in this execution plan is that there is no parallelism.
Here is a link to the query execution plan for the bad query. One thing that looks like a problem: the clustered index seek on PS_TRANSACTION_INV with a predicate on PS_TRANSACTION_INV.BUSINESS_UNIT (all the way to the right in the execution plan) shows an estimate of 1 row. All but a few hundred rows have the value TRUS1, and the rest have the value TRCN1. The table has about 15 million rows and uses about 22 GB (so the rows are pretty large). I ran DBCC SHOW_STATISTICS for the table's clustered index, and the statistics look correct. The execution plan also shows an estimated row size of 54 B, which is wrong; the true row size is larger.
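The statistics check can be run along these lines (the index name is a placeholder; the real name comes from sys.indexes):

```sql
-- Look up the clustered index name first (index_id = 1 is the clustered index).
SELECT name
FROM sys.indexes
WHERE object_id = OBJECT_ID('dbo.PS_TRANSACTION_INV')
  AND index_id = 1;

-- Placeholder index name; substitute the name returned above.
DBCC SHOW_STATISTICS ('dbo.PS_TRANSACTION_INV', 'PS_TRANSACTION_INV_IDX')
    WITH HISTOGRAM;
```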

However, even with a bad execution plan, I don't understand how this query could run for 24+ hours without finishing.
Looking at sys.sysprocesses, I can see that the last wait type is SOS_SCHEDULER_YIELD.
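The wait type is visible with a check along these lines (the spid is a placeholder; use the session id of the runaway query):

```sql
-- 123 is a placeholder session id.
SELECT spid, status, lastwaittype, waittime, cpu, physical_io
FROM sys.sysprocesses
WHERE spid = 123;
```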
Here is a link to the query execution plan for the good query.
My server has 8 CPUs and 24 GB of memory. The query itself returns about 27,000 rows, so it should be well within the capability of the server to execute this quickly. This program has been running for years with no problems, so I think something must have changed about the execution plan. My server's DOP is 64.
Am I right in thinking that the problem is likely related to parallelism? If so, how could that be causing a query that should run in a second or two to use 100% of the CPU for over 20 minutes and still not finish?
EDIT: In the execution plan for the bad query, I found that one of the steps is doing an index seek on a field where it should ideally do a clustered index seek. The index seek is on a table that has about 20,000,000 rows, but the indexed field has high selectivity. That index seek feeds into a nested loop (with parallelism) in the query plan. I still think that the problem is in some way related to the parallelism.
For reference, I can run a SELECT on that table using the nonclustered index, and the query takes a few seconds.
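As a diagnostic sketch (table and column names here are placeholders, since that table isn't named in the post), the two access paths can be compared side by side with table hints:

```sql
-- Placeholder names throughout; compare elapsed time and reads for each path.
SET STATISTICS IO, TIME ON;

-- Force the nonclustered index seek on the selective column.
SELECT COUNT(*)
FROM dbo.SomeLargeTable WITH (FORCESEEK)
WHERE SelectiveColumn = 'some-value';

-- Baseline: let the optimizer choose.
SELECT COUNT(*)
FROM dbo.SomeLargeTable
WHERE SelectiveColumn = 'some-value';
```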
My first question is what wait types did you see?
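If you don't have them handy, a quick sketch against sys.dm_os_wait_stats shows the top cumulative waits since the last restart (the exclusion list here is minimal and illustrative; a real one filters many more benign waits):

```sql
-- Top waits by total wait time; a high signal_wait_time_ms share suggests CPU pressure.
SELECT TOP (10)
    wait_type,
    wait_time_ms,
    signal_wait_time_ms,
    waiting_tasks_count
FROM sys.dm_os_wait_stats
WHERE wait_type NOT IN ('SLEEP_TASK', 'LAZYWRITER_SLEEP',
                        'BROKER_TASK_STOP', 'XE_TIMER_EVENT')
ORDER BY wait_time_ms DESC;
```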
Second, how is your server set up? (Maybe run sp_Blitz and share the output: EXEC sp_Blitz @OutputType = 'markdown', @CheckServerInfo = 1.)
Third, is there anything else running that's causing the bottleneck? Maybe start with sp_BlitzFirst or sp_whoisactive to see what's going on.
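For example (both are free community stored procedures; the parameter shown is optional):

```sql
-- Sample server activity over a 30-second window.
EXEC sp_BlitzFirst @Seconds = 30;

-- Snapshot of currently running requests, with waits and blocking info.
EXEC sp_whoisactive;
```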
Can parallelism cause 100% CPU usage?
It sure can. Brent Ozar wrote a query to do exactly this thing; or at least, he uses MAXDOP 0 on it to make sure it happens faster.
For comparison, I tried running the same query in the query window in Management Studio. The query runs in no more than a few seconds, and the Display Estimated Execution Plan button gives me a different execution plan
I don't work for Brent, but he talks about this too. You need to use the actual execution plan, not the estimated one. Adding it here, and including the link in your post, would be helpful. Also, you need to determine whether the application is using a cached procedure plan or whether both are compiling new plans. You can use OPTION (RECOMPILE) to force a new plan and compare apples to apples (well, Red Delicious to Granny Smith, which is still better than apples to oranges).
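A sketch of the hint (the SELECT shape here is hypothetical; only the PS_TRANSACTION_INV table and BUSINESS_UNIT column come from the post):

```sql
SELECT t.*
FROM dbo.PS_TRANSACTION_INV AS t
WHERE t.BUSINESS_UNIT = 'TRCN1'
OPTION (RECOMPILE);  -- forces a fresh plan for this execution only
```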
The query itself returns about 27,000 rows, so it should be well within the capability of the server to execute this quickly.
The final number of rows being this small isn't a measure of why it's performing poorly. Somewhere in the N joins there could be millions of rows being pulled back before the final predicate is evaluated. This is where the actual execution plan would show you the bottlenecks.
My server's DOP is 64
Seems high, but you didn't tell us how many cores you are running. Luckily, Microsoft has good guidance on this. Before you go changing it at the server level, add a query hint to the bottom of your query: OPTION (MAXDOP 8). Also, I'd expect to see CXPACKET waits if parallelism were the problem. Here's a video on that.
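The hint goes at the statement level (again, the SELECT shape is hypothetical):

```sql
SELECT t.*
FROM dbo.PS_TRANSACTION_INV AS t
WHERE t.BUSINESS_UNIT = 'TRCN1'
OPTION (MAXDOP 8);  -- caps this one query at 8 parallel workers
```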
Lastly, remember that query optimization is hard, even for the server. Granted, that doesn't quite fit your case, since the application takes forever while SSMS returns the data fast; in that situation I'd normally expect to see ASYNC_NETWORK_IO waits or something similar.