
Does a covering index pay off when the data is in order of the clustered index?

In my scenario, I have posts, which are grouped in categories. For an overview list of categories, I want to display a summary of the top 10 posts alongside each category (as opposed to the detail view of a category, which displays the full data). The top 10 posts are determined by a score, which comes from another table (actually an indexed view, but that doesn't matter here).

The table structure is the following:

CREATE TABLE [dbo].[Categories]
(
    [Id] INT NOT NULL IDENTITY CONSTRAINT [PK_Categories] PRIMARY KEY,
    [Key] CHAR(10) CONSTRAINT [UK_Categories_Key] UNIQUE,
    [Caption] NVARCHAR(500) NOT NULL,
    [Description] NVARCHAR(4000) NULL
)
GO

CREATE TABLE [dbo].[Posts]
(
    [Id] INT NOT NULL IDENTITY CONSTRAINT [PK_Posts] PRIMARY KEY,
    [CategoryId] INT NOT NULL CONSTRAINT [FK_Posts_Category] FOREIGN KEY REFERENCES [dbo].[Categories] ([Id]),
    [Key] CHAR(10) CONSTRAINT [UK_Post_Key] UNIQUE,
    [Text] NVARCHAR(4000) NULL,
    [SummaryText] AS
        CASE WHEN LEN([Text]) <= 400
            THEN CAST([Text] AS NVARCHAR(400))
            ELSE CAST(SUBSTRING([Text], 1, 399) + NCHAR(8230) AS NVARCHAR(400)) --First 399 characters and ellipsis (SUBSTRING is 1-based)
        END
        PERSISTED
)
GO

CREATE TABLE [dbo].[Scores] (
    [Id] INT NOT NULL IDENTITY CONSTRAINT [PK_Scores] PRIMARY KEY,
    [CategoryId] INT NOT NULL CONSTRAINT [FK_Scores_Category] FOREIGN KEY REFERENCES [dbo].[Categories] ([Id]),
    [PostId] INT NOT NULL CONSTRAINT [FK_Scores_Post] FOREIGN KEY REFERENCES [dbo].[Posts] ([Id]),
    [Value] INT NOT NULL
)
GO

CREATE INDEX [IX_Scores_CategoryId_Value_PostId]
    ON [dbo].[Scores] ([CategoryId], [Value] DESC, [PostId])
GO

I can now use a view to get the top ten posts of each category:

CREATE VIEW [dbo].[TopPosts]
AS
SELECT c.Id AS [CategoryId], cp.PostId, p.[Key], p.SummaryText, cp.Value AS [Score]
FROM [dbo].[Categories] c
CROSS APPLY (
    SELECT TOP 10 s.PostId, s.Value
    FROM [dbo].[Scores] s
    WHERE s.CategoryId = c.Id
    ORDER BY s.Value DESC
) AS cp
INNER JOIN [dbo].[Posts] p ON cp.PostId = p.Id

I understand that the CROSS APPLY will use the covering index IX_Scores_CategoryId_Value_PostId, because it contains the category ID (for the WHERE), the value (for the ORDER BY and the SELECT) and the post ID (for the SELECT), and thus will be reasonably fast.

The question is now: what about the INNER JOIN? The join predicate uses the post ID, which is the key of the Posts table's clustered index (the primary key). If I create a covering index that includes all the fields of the SELECT (see below), can I significantly increase query performance (with a better execution plan, reduced I/O, index caching etc.), even though accessing the clustered index is already a pretty fast operation?

The covering index would look like this:

CREATE INDEX [IX_Posts_Covering]
    ON [dbo].[Posts] ([Id], [Key], [SummaryText])
GO

UPDATE:

Since the direction of my question doesn't seem entirely clear, let me put down my thoughts in more detail. I am wondering whether the covering index (or an index with included columns) could be faster for the following reasons (and whether the performance gain would be worth it):

  1. Hard drive access. The second index would be considerably smaller than the clustered index, SQL Server would have to go through fewer pages on the HD, which would yield better read performance. Is that correct and would you see the difference?
  2. Memory consumption. To load the data into memory, I assume SQL Server would have to load the entire row into memory and then pick the columns it needs. Wouldn't that increase memory consumption?
  3. CPU. My assumption is that you wouldn't see a measurable difference in CPU usage, since extracting the columns from the row is not per se a CPU operation. Correct?
  4. Caching. My understanding is that you won't see much difference in caching, because SQL Server would only cache the data it returns, not the entire row. Or am I wrong?

These are basically (more or less educated) assumptions. I would appreciate it a lot if someone could enlighten me about this admittedly very specific issue.

asked Nov 14 '16 at 13:11 by Sefe



2 Answers

This is a fun question because all four sub-questions you raise can be answered with "it depends", which is usually a good sign that the subject matter is interesting.

First of all, if you have an unhealthy fascination with how SQL Server works under the covers (like I do) the go-to source is "Microsoft SQL Server Internals", by Delaney et al. You don't need to read all ~1000 pages, the chapters on the storage engine are interesting enough on their own.

I won't touch the question of whether this particular covering index is useful in this particular case, because I think the other answers have covered that nicely (no pun intended), including the recommendation to use INCLUDE for columns that don't need to be indexed themselves.
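For reference, the INCLUDE variant mentioned above could look roughly like the sketch below. The index name IX_Posts_Id_Includes is made up for illustration; the point is only that [Id] stays the sole key column while [Key] and [SummaryText] are carried at the leaf level without being part of the sort order.

```sql
-- Hypothetical name; a sketch of the INCLUDE form, not a recommendation to create it.
-- [Id] is the only key column; [Key] and [SummaryText] are stored in the leaf pages
-- so the join can be answered from this index alone.
CREATE NONCLUSTERED INDEX [IX_Posts_Id_Includes]
    ON [dbo].[Posts] ([Id])
    INCLUDE ([Key], [SummaryText])
GO
```

Whether this is any better than the clustered index for this query is exactly what the rest of this answer is about.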

The second index would be considerably smaller than the clustered index, SQL Server would have to go through fewer pages on the HD, which would yield better read performance. Is that correct and would you see the difference?

If you assume the choice is either between reading pages of the clustered index or pages of the covering index, the covering index is smaller [1], which means less I/O, better performance, all that niceness. But queries don't execute in a vacuum -- if this is not the only query on the table, the buffer pool may already contain most or all of the clustered index, in which case disk read performance could be negatively affected by having to read the less-frequently used covering index as well. Overall performance may also be decreased by the total increase in data pages. The optimizer considers only individual queries; it will not carefully tune buffer pool usage based on all queries combined (dropping pages happens through a simple LRU policy). So if you create indexes excessively, especially indexes that are used infrequently, overall performance will suffer. And that's not even considering the intrinsic overhead of indexes when data is inserted or updated.

Even if we assume the covering index is a net benefit, the question "would you see the difference" (as in, does performance measurably increase) can only be effectively answered empirically. SET STATISTICS IO ON is your friend here (as well as DBCC DROPCLEANBUFFERS, in a test environment). You can try and guess based on assumptions, but since the outcome depends on the execution plan, the size of your indexes, the amount of memory SQL Server has in total, I/O characteristics, the load on all databases and the query patterns of applications, I wouldn't do this beyond a ballpark guess of whether the index could possibly be useful. In general, sure, if you have a very wide table and a small covering index, it's not hard to see how this pays off. And in general, you will sooner see bad performance from not enough indexes than from too many indexes. But real databases don't run on generalizations.
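A minimal sketch of such a measurement, assuming a test server where you can afford to flush the cache and assuming the IX_Posts_Covering index from the question has been created. The second query is the view's definition inlined so that a table hint can force the covering index; compare the logical and physical reads reported for Posts in the Messages output of both runs.

```sql
-- Test environment only: DBCC DROPCLEANBUFFERS empties the buffer pool for the whole instance.
CHECKPOINT;               -- flush dirty pages so they become clean and can be dropped
DBCC DROPCLEANBUFFERS;    -- start from a cold cache
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Run 1: the view as defined, with whatever index the optimizer picks for Posts.
SELECT CategoryId, PostId, [Key], SummaryText, Score
FROM [dbo].[TopPosts];

CHECKPOINT;
DBCC DROPCLEANBUFFERS;

-- Run 2: the same query inlined, with Posts forced onto the covering index.
SELECT c.Id AS [CategoryId], cp.PostId, p.[Key], p.SummaryText, cp.Value AS [Score]
FROM [dbo].[Categories] c
CROSS APPLY (
    SELECT TOP 10 s.PostId, s.Value
    FROM [dbo].[Scores] s
    WHERE s.CategoryId = c.Id
    ORDER BY s.Value DESC
) AS cp
INNER JOIN [dbo].[Posts] p WITH (INDEX(IX_Posts_Covering)) ON cp.PostId = p.Id;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

A cold-cache comparison only tells you about the worst case; repeating both runs without dropping the buffers shows the warm-cache behavior, which is usually closer to what production sees.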

To load the data into memory, I assume SQL Server would have to load the entire row into memory and then pick the columns it needs. Wouldn't that increase memory consumption?

See above. The clustered index takes up more pages than the covering index, but whether memory usage is affected positively or negatively depends on how each index is used. In the very worst case, the clustered index is used intensively by other queries that don't profit from your covering index, while the covering index is only of help to a rare query, so all the covering index does is cause buffer pool churn that slows down the majority of your workload. This would be unusual and a sign your server could do with a memory upgrade, but it's certainly possible.

My assumption is that you wouldn't see a measurable difference in CPU usage, since extracting the columns from the row is not per se a CPU operation. Correct?

CPU usage is typically not measurably affected by row size. Execution time is (and that, in turn, does affect usage depending on how many queries you want to run in parallel). Once you've covered the I/O bottleneck by giving your server plenty of memory, there's still the matter of scanning the data in memory.

My understanding is that you won't see much difference in caching, because SQL Server would only cache the data it returns, not the entire row. Or am I wrong?

Rows are stored on pages, and SQL Server caches the pages it reads in the buffer pool. It does not cache result sets, or any intermediate data generated as part of the query execution, or individual rows. If you execute a query twice on an initially empty buffer pool, the second one is typically faster because the pages it needs are already in memory, but that's the only source of speedup.

With that in mind, see the answer to your first question -- yes, caching is affected because the pages of your covering index, if used, are cached separately from the pages of the clustered index, if used.
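If you want to see this for yourself, the buffer pool can be inspected per index. The following is a sketch using the documented DMV sys.dm_os_buffer_descriptors; it assumes you have VIEW SERVER STATE permission and run it in the database that holds dbo.Posts.

```sql
-- How many pages of each index on dbo.Posts are currently in the buffer pool?
SELECT i.name AS IndexName,
       COUNT(*) AS CachedPages,
       COUNT(*) * 8 / 1024 AS CachedMB          -- pages are 8 KB each
FROM sys.dm_os_buffer_descriptors AS bd
INNER JOIN sys.allocation_units AS au
    ON au.allocation_unit_id = bd.allocation_unit_id
INNER JOIN sys.partitions AS p
    ON p.hobt_id = au.container_id
   AND au.type IN (1, 3)                        -- in-row and row-overflow data
INNER JOIN sys.indexes AS i
    ON i.object_id = p.object_id
   AND i.index_id = p.index_id
WHERE bd.database_id = DB_ID()
  AND p.object_id = OBJECT_ID(N'dbo.Posts')
GROUP BY i.name
ORDER BY CachedPages DESC;
```

Running it before and after a representative workload shows whether the covering index and the clustered index end up competing for the same memory.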


[1] A covering index may not actually be smaller if it's heavily fragmented due to page splits. But this is an academic point, because it's not really about which index is physically larger but how many pages of each are actually accessed.
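To check the physical size and fragmentation mentioned in the footnote, something along these lines should do. It is a sketch based on sys.dm_db_index_physical_stats in LIMITED mode, which reports leaf-level page counts and fragmentation without scanning every page.

```sql
-- Page count and fragmentation for every index on dbo.Posts.
SELECT i.name AS IndexName,
       ps.page_count,
       ps.avg_fragmentation_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Posts'), NULL, NULL, 'LIMITED') AS ps
INNER JOIN sys.indexes AS i
    ON i.object_id = ps.object_id
   AND i.index_id = ps.index_id
ORDER BY ps.page_count DESC;
```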

answered Sep 17 '22 at 23:09 by Jeroen Mostert


No, you do not need this covering index.

Limit the number of indexes for each table: A table can have any number of indexes. However, the more indexes there are, the more overhead is incurred as the table is modified. Thus, there is a trade-off between the speed of retrieving data from a table and the speed of updating the table.

Your scenario looks more like an OLTP system than a data warehouse, so it will have a large number of online transactions (insert, update, delete). Creating this covering index will therefore slow down your modification operations.

Update:

Yes, there will be 10 posts per category. So if you have N categories, the result set contains at most 10*N post records.

Another guideline about indexes: create an index if you frequently want to retrieve less than 15 percent of the rows in a large table (my SQL tuning instructor suggests 5 percent). Above 15 percent, an execution plan that uses the index will not be optimal.

Let's consider two extreme cases for your Posts table:

  1. The Posts table has just 10*N records, and every category has exactly 10 posts. In that case the final execution plan will do a full scan of the Posts table instead of using any index.
  2. The number of records in the Posts table is greater than (10 * N / 15%), so the query retrieves less than 15% of the rows. The optimizer will use the post ID field for the join, and it should be a hash join.

So even if you create the covering index, the optimizer will never use it unless you use a hint.
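One way to check this on your own system rather than take it on faith is to look at the index usage counters after the workload has run for a while. This is only a sketch; it relies on sys.dm_db_index_usage_stats, whose counters are reset when the instance restarts.

```sql
-- Reads (seeks/scans/lookups) versus maintenance cost (updates) per index on dbo.Posts.
-- NULL counters mean the index has not been touched since the last restart.
SELECT i.name AS IndexName,
       us.user_seeks,
       us.user_scans,
       us.user_lookups,
       us.user_updates
FROM sys.indexes AS i
LEFT JOIN sys.dm_db_index_usage_stats AS us
    ON us.object_id = i.object_id
   AND us.index_id = i.index_id
   AND us.database_id = DB_ID()
WHERE i.object_id = OBJECT_ID(N'dbo.Posts');
```

An index with many user_updates but no seeks or scans is pure overhead and a candidate for dropping.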

Updated:

Clustered and Nonclustered Indexes Described

answered Sep 21 '22 at 23:09 by shawn