I have an advanced SQL question for your SQL perf gurus out there :-)
I'm currently trying to understand some behavior in a larger application, but it boils down to a query against these two tables:
Users
table - roughly 750 entries, UserId
(varchar(50)
) as clustered PKActionLog
table - millions of entries, includes UserId
- but no FK relationshipFor a grid in my ASP.NET application, I'm trying to get all users plus their last log entry's date.
The SQL statement that is currently being used looks something like this:
SELECT
UserId, (other columns),
LastLogDate = (SELECT TOP (1) [Timestamp] FROM dbo.ActionLog a WHERE a.UserId = u.UserId ORDER BY [Timestamp] DESC)
FROM
dbo.Users u;
and it returns the rows to display - but it's fairly slow (approx. 20 seconds).
My first thought was to add an index on the ActionLog
table on UserId
and to include the Timestamp
column in it:
CREATE NONCLUSTERED INDEX [IDX_UserId]
ON [dbo].[ActionLog]([UserId] ASC)
INCLUDE ([Timestamp])
The rows are now returned very quickly - under 2 seconds, with 350'000 entries in the ActionLog
table, and my index is being used just fine, as the execution plan shows me. All seems fine.
Now, for approximating the production scenario, we loaded roughly 2 million rows into the ActionLog
table, 95% or more of which refer to a non-existing user (i.e. these rows have a UserId
that doesn't exist in the Users
table).
Now suddenly, the query becomes extremely slow (24 minutes!), and the index isn't being used anymore.
I assumed that since the vast majority of the entries in the ActionLog
table don't line up with an existing user, I would see performance gains if I use a filtered index - to "weed out" all those messy entries without a corresponding user - so I created this index (replacing the other one that existed before):
CREATE NONCLUSTERED INDEX [IDX_UserId]
ON [dbo].[Log]([UserId] ASC)
INCLUDE ([Timestamp])
WHERE UserId <> 'user' -- that's the fixed, non-existing "UserId" I wanted to avoid
But to my dismay - the query is still about the same - takes over 20 minutes to complete. I updated the statistics - no change - still extremely slow.
Funny thing (to me) is: when I dropped the index and re-created it -> now the query was really fast again (again less than 3 seconds). WOW!
But as soon as I start adding more entries again, the query "tilts" and becomes really really slow.......
I don't fully understand why this is happening - I was thinking that with a filtered index that eliminates all those "rogue" entries, I would be seeing good performance on trying to find the newest ActionLog
entry for existing users - but that doesn't seem to be the case.
WHY NOT?
Any ideas? Thoughts? Things to try??
WAITING: Queries can be slow because they're waiting on a bottleneck for a long time. See a detailed list of bottlenecks in types of Waits. RUNNING: Queries can be slow because they're running (executing) for a long time. In other words, these queries are actively using CPU resources.
First, INCLUDE
here is not the best choice. You sort by entry date, but included columns aren't sorted. Better solution would be:
CREATE NONCLUSTERED INDEX [IX_ActionLog_UserIdTimestamp] ON [dbo].[ActionLog]
([UserId], [Timestamp]);
Second, looks like you might need to update statistics on your index more often than an automatic update would. I have seen cases when, in a situation akin to yours, I had to update stats every 10 minutes, due to excessive inserts. That was back on 2005, though.
Try this query and see how it performs with your original index or with the modified suggested by @Roger Wolf:
SELECT u.UserId, a.LastLogDate
FROM dbo.Users u
INNER JOIN (
SELECT UserId, Max([TimeStamp]) AS LastLogDate
FROM dbo.ActionLog
WHERE userid <> 'user' -- the user to filter out
GROUP BY UserId
) a ON a.UserId = u.UserId
If it sucks I'll delete the answer :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With