Does a covering index pay off when the data is in order of the clustered index?

Tags: sql, sql-server, indexing

In my scenario, I have posts, which are grouped in categories. For an overview list of categories, I want to display a summary of the top 10 posts with each category (as opposed to the detail view of a category, which displays the full data). The top 10 posts are determined by a score, which comes from another table (actually an indexed view, but that doesn't matter here).

The table structure is the following:

CREATE TABLE [dbo].[Categories]
(
    [Id] INT NOT NULL IDENTITY CONSTRAINT [PK_Categories] PRIMARY KEY,
    [Key] CHAR(10) CONSTRAINT [UK_Categories_Key] UNIQUE,
    [Caption] NVARCHAR(500) NOT NULL,
    [Description] NVARCHAR(4000) NULL
)
GO

CREATE TABLE [dbo].[Posts]
(
    [Id] INT NOT NULL IDENTITY CONSTRAINT [PK_Posts] PRIMARY KEY,
    [CategoryId] INT NOT NULL CONSTRAINT [FK_Posts_Category] FOREIGN KEY REFERENCES [dbo].[Categories] ([Id]),
    [Key] CHAR(10) CONSTRAINT [UK_Post_Key] UNIQUE,
    [Text] NVARCHAR(4000) NULL,
    [SummaryText] AS
        CASE WHEN LEN([Text]) <= 400
            THEN CAST([Text] AS NVARCHAR(400))
            ELSE CAST(SUBSTRING([Text], 1, 399) + NCHAR(8230) AS NVARCHAR(400)) --First 399 characters and ellipsis (SUBSTRING is 1-based)
        END
        PERSISTED
)
GO

CREATE TABLE [dbo].[Scores] (
    [Id] INT NOT NULL IDENTITY CONSTRAINT [PK_Scores] PRIMARY KEY,
    [CategoryId] INT NOT NULL CONSTRAINT [FK_Scores_Category] FOREIGN KEY REFERENCES [dbo].[Categories] ([Id]),
    [PostId] INT NOT NULL CONSTRAINT [FK_Scores_Post] FOREIGN KEY REFERENCES [dbo].[Posts] ([Id]),
    [Value] INT NOT NULL
)
GO

CREATE INDEX [IX_Scores_CategoryId_Value_PostId]
    ON [dbo].[Scores] ([CategoryId], [Value] DESC, [PostId])
GO

I can now use a view to get the top ten posts of each category:

CREATE VIEW [dbo].[TopPosts]
AS
SELECT c.Id AS [CategoryId], cp.PostId, p.[Key], p.SummaryText, cp.Value AS [Score]
FROM [dbo].[Categories] c
CROSS APPLY (
    SELECT TOP 10 s.PostId, s.Value
    FROM [dbo].[Scores] s
    WHERE s.CategoryId = c.Id
    ORDER BY s.Value DESC
) AS cp
INNER JOIN [dbo].[Posts] p ON cp.PostId = p.Id

I understand that the CROSS APPLY will use the covering index IX_Scores_CategoryId_Value_PostId, because it contains the category ID (for the WHERE), the value (for the ORDER BY and the SELECT) and the post ID (for the SELECT), and thus will be reasonably fast.

The question is now: what about the INNER JOIN? The join predicate uses the post ID, which is the key of the Posts table's clustered index (the primary key). If I create a covering index that includes all the fields of the SELECT (see below), can I significantly increase query performance (with a better execution plan, reduced I/O, index caching etc.), even though accessing the clustered index is already a pretty fast operation?

The covering index would look like this:

CREATE INDEX [IX_Posts_Covering]
    ON [dbo].[Posts] ([Id], [Key], [SummaryText])
GO
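
For comparison, the same idea could be written with included columns. This is only a sketch: the index name IX_Posts_Covering_Include is made up for illustration, and it assumes that [Id] is the only column the join needs to seek on, so [Key] and [SummaryText] are stored at the leaf level without being part of the index key.

```sql
-- Sketch only: IX_Posts_Covering_Include is a hypothetical name.
-- [Id] is the only key column; [Key] and [SummaryText] are carried
-- at the leaf level so the join can be answered from this index alone.
CREATE INDEX [IX_Posts_Covering_Include]
    ON [dbo].[Posts] ([Id])
    INCLUDE ([Key], [SummaryText]);
GO
```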

UPDATE:

Since the direction of my question doesn't seem entirely clear, let me put down my thoughts in more detail. I am wondering if the covering index (or index with included columns) could be faster for the following reasons (and whether the performance gain would be worth it):

Hard drive access. The second index would be considerably smaller than the clustered index, so SQL Server would have to go through fewer pages on the HD, which would yield better read performance. Is that correct and would you see the difference?
Memory consumption. To load the data into memory, I assume SQL Server would have to load the entire row into memory and then pick the columns it needs. Wouldn't that increase memory consumption?
CPU. My assumption is that you wouldn't see a measurable difference in CPU usage, since extracting the columns from the row is not per se a CPU operation. Correct?
Caching. My understanding is that you won't see much difference in caching, because SQL Server would only cache the data it returns, not the entire row. Or am I wrong?

These are basically (more or less educated) assumptions. I would appreciate it a lot if someone could enlighten me about this admittedly very specific issue.


asked Nov 14 '16 13:11

Sefe

2 Answers

This is a fun question because all four sub-questions you raise can be answered with "it depends", which is usually a good sign that the subject matter is interesting.

First of all, if you have an unhealthy fascination with how SQL Server works under the covers (like I do), the go-to source is "Microsoft SQL Server Internals", by Delaney et al. You don't need to read all ~1000 pages; the chapters on the storage engine are interesting enough on their own.

I won't touch the question of whether this particular covering index is useful in this particular case, because I think the other answers have covered that nicely (no pun intended), including the recommendation to use INCLUDE for columns that don't need to be indexed themselves.

The second index would be considerably smaller than the clustered index, so SQL Server would have to go through fewer pages on the HD, which would yield better read performance. Is that correct and would you see the difference?

If you assume the choice is between reading either pages of the clustered index or pages of the covering index, the covering index is smaller¹, which means less I/O, better performance, all that niceness. But queries don't execute in a vacuum -- if this is not the only query on the table, the buffer pool may already contain most or all of the clustered index, in which case disk read performance could be negatively affected by having to read the less frequently used covering index as well. Overall performance may also be decreased by the total increase in data pages. The optimizer considers only individual queries; it will not carefully tune buffer pool usage based on all queries combined (dropping pages happens through a simple LRU policy). So if you create indexes excessively, especially indexes that are used infrequently, overall performance will suffer. And that's not even considering the intrinsic overhead of indexes when data is inserted or updated.

Even if we assume the covering index is a net benefit, the question "would you see the difference" (as in, does performance measurably increase) can only be effectively answered empirically. SET STATISTICS IO ON is your friend here (as well as DBCC DROPCLEANBUFFERS, in a test environment). You can try and guess based on assumptions, but since the outcome depends on the execution plan, the size of your indexes, the amount of memory SQL Server has in total, I/O characteristics, the load on all databases and the query patterns of applications, I wouldn't do this beyond a ballpark guess of whether the index could possibly be useful. In general, sure, if you have a very wide table and a small covering index, it's not hard to see how this pays off. And in general, you will sooner see bad performance from not enough indexes than from too many indexes. But real databases don't run on generalizations.
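
A minimal sketch of such a measurement, assuming the TopPosts view from the question and a test server where flushing the cache is acceptable. Run it once without and once with the covering index and compare the logical and physical reads reported for Posts:

```sql
-- Test environment only: DBCC DROPCLEANBUFFERS empties the buffer pool
-- for the whole instance, not just for this database.
CHECKPOINT;              -- write dirty pages so they become clean and can be dropped
DBCC DROPCLEANBUFFERS;   -- start from a cold cache
GO

SET STATISTICS IO ON;    -- per-table logical/physical/read-ahead reads
SET STATISTICS TIME ON;  -- CPU time and elapsed time
GO

SELECT [CategoryId], [PostId], [Key], [SummaryText], [Score]
FROM [dbo].[TopPosts];
GO

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO
```

In Management Studio the numbers show up on the Messages tab. Running the SELECT a second time without dropping the buffers gives the warm-cache figures, which are often the more realistic ones.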

To load the data into the memory, I assume SQL Server would have to load the entire row into memory and then pick the columns it needs. Wouldn't that increase memory consumption?

See above. The clustered index takes up more pages than the covering index, but whether memory usage is affected positively or negatively depends on how each index is used. In the very worst case, the clustered index is used intensively by other queries that don't profit from your covering index, while the covering index is only of help to a rare query, so all the covering index does is cause buffer pool churn that slows down the majority of your workload. This would be unusual and a sign your server could do with a memory upgrade, but it's certainly possible.

My assumption is that you wouldn't see a measurable difference in CPU usage, since extracting the columns from the row is not per se a CPU operation. Correct?

CPU usage is typically not measurably affected by row size. Execution time is (and that, in turn, does affect usage depending on how many queries you want to run in parallel). Once you've covered the I/O bottleneck by giving your server plenty of memory, there's still the matter of scanning the data in memory.

My understanding is that you won't see much difference in caching, because SQL Server would only cache the data it returns, not the entire row. Or am I wrong?

Rows are stored on pages, and SQL Server caches the pages it reads in the buffer pool. It does not cache result sets, or any intermediate data generated as part of the query execution, or individual rows. If you execute a query twice on an initially empty buffer pool, the second one is typically faster because the pages it needs are already in memory, but that's the only source of speedup.

With that in mind, see the answer to your first question -- yes, caching is affected because the pages of your covering index, if used, are cached separately from the pages of the clustered index, if used.
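
To see this on a live system, a query along the following lines shows how many pages of each index on Posts currently sit in the buffer pool. It is a sketch that assumes the dbo.Posts table from the question and permission to read the server-level DMVs (VIEW SERVER STATE); only in-row and row-overflow pages are counted.

```sql
-- Cached pages per index of dbo.Posts in the current database.
SELECT
    i.[name]              AS [IndexName],
    i.[type_desc]         AS [IndexType],
    COUNT(*)              AS [CachedPages],
    COUNT(*) * 8 / 1024.0 AS [CachedMB]      -- a page is 8 KB
FROM sys.dm_os_buffer_descriptors AS bd
INNER JOIN sys.allocation_units AS au
    ON bd.allocation_unit_id = au.allocation_unit_id
INNER JOIN sys.partitions AS p
    ON au.container_id = p.hobt_id
   AND au.[type] IN (1, 3)                   -- IN_ROW_DATA and ROW_OVERFLOW_DATA
INNER JOIN sys.indexes AS i
    ON p.[object_id] = i.[object_id]
   AND p.index_id = i.index_id
WHERE bd.database_id = DB_ID()
  AND p.[object_id] = OBJECT_ID(N'dbo.Posts')
GROUP BY i.[name], i.[type_desc]
ORDER BY [CachedPages] DESC;
```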

¹ A covering index may not actually be smaller if it's heavily fragmented due to page splits. But this is an academic point, because it's not really about which index is physically larger but how many pages of each are actually accessed.
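
How many pages each index occupies, and how fragmented it is, can be checked rather than guessed. A sketch, again assuming dbo.Posts from the question; SAMPLED mode is an arbitrary choice here to keep the scan cheap:

```sql
-- Leaf-level size and fragmentation of every index on dbo.Posts.
SELECT
    i.[name]                          AS [IndexName],
    ps.alloc_unit_type_desc,
    ps.page_count,
    ps.avg_fragmentation_in_percent,
    ps.avg_page_space_used_in_percent
FROM sys.dm_db_index_physical_stats(
         DB_ID(), OBJECT_ID(N'dbo.Posts'), NULL, NULL, 'SAMPLED') AS ps
INNER JOIN sys.indexes AS i
    ON ps.[object_id] = i.[object_id]
   AND ps.index_id = i.index_id
WHERE ps.index_level = 0              -- leaf level only
ORDER BY ps.page_count DESC;
```

This only reports physical size. The pages a particular query actually touches are what SET STATISTICS IO reports.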

answered Sep 17 '22 23:09

Jeroen Mostert

No, you do not need this covering index.

Limit the number of indexes for each table: A table can have any number of indexes. However, the more indexes there are, the more overhead is incurred as the table is modified. Thus, there is a trade-off between the speed of retrieving data from a table and the speed of updating the table.

Your scenario looks more like an OLTP system than a data warehouse, so it will have large numbers of online transactions (insert, update, delete). Creating this covering index will therefore slow down your modification operations.

Update:

Yes, there will be 10 posts for each category. So if you have N categories, the result set contains at most 10*N post records.

Another guideline about indexes: create an index if you frequently want to retrieve less than 15 percent of the rows in a large table. (My SQL tuning instructor suggests 5 percent.) If you retrieve more than 15 percent, an execution plan that uses the index will not be optimal.

Let's consider two extreme cases for your Posts table:

The Posts table has just 10*N records, and every category has exactly 10 posts. The final execution plan will then do a full scan of the Posts table instead of using any index.
The number of rows in the Posts table is greater than (10 * N / 15%), so the query retrieves less than 15% of the rows in the Posts table. The optimizer will use the post ID field for the join operation, and it should be a hash join.

So even if you create a covering index, the optimizer will never use it unless you use a hint.
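
Whether the optimizer picks the index on its own is easy to check in the actual execution plan. To compare both variants, the view's query can be run with a table hint. This is a sketch that assumes the IX_Posts_Covering index from the question has been created; the hint is meant for testing, not as a recommendation for production code:

```sql
-- Same query as the TopPosts view, but forcing the covering index on Posts.
-- Assumes IX_Posts_Covering exists; the statement fails if it does not.
SELECT c.Id AS [CategoryId], cp.PostId, p.[Key], p.SummaryText, cp.Value AS [Score]
FROM [dbo].[Categories] c
CROSS APPLY (
    SELECT TOP 10 s.PostId, s.Value
    FROM [dbo].[Scores] s
    WHERE s.CategoryId = c.Id
    ORDER BY s.Value DESC
) AS cp
INNER JOIN [dbo].[Posts] p WITH (INDEX(IX_Posts_Covering))
    ON cp.PostId = p.Id;
```

Comparing the reads and the plan of this statement with those of the unhinted query shows which access path is cheaper on your data.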

Updated:

Clustered and Nonclustered Indexes Described

answered Sep 21 '22 23:09

shawn
