 

Effects of Clustered Index on DB Performance

I recently became involved with a new software project which uses SQL Server 2000 for its data storage.

In reviewing the project, I discovered that one of the main tables uses a clustered index on its primary key which consists of four columns:

Sequence  numeric(18, 0)
Date      datetime
Client    varchar(9)
Hash      tinyint

This table experiences a lot of inserts in the course of normal operation.
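
For reference, the table is defined along these lines (I've made up the table and constraint names; only the four key columns come from the real schema):

    CREATE TABLE dbo.MainTable (
        Sequence  numeric(18, 0) NOT NULL,
        [Date]    datetime       NOT NULL,
        Client    varchar(9)     NOT NULL,
        Hash      tinyint        NOT NULL,
        -- ...plus a number of other data columns...
        CONSTRAINT PK_MainTable
            PRIMARY KEY CLUSTERED (Sequence, [Date], Client, Hash)
    );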

Now, I'm a C++ developer, not a DB Admin, but my first impression of this table design was that having these fields as a clustered index would be very detrimental to insert performance, since the data would have to be physically reordered on each insert.

In addition, I can't really see any benefit to this since one would have to be querying all of these fields frequently to justify the clustered index, right?

So basically I need some ammunition for when I go to the powers that be to convince them that the table design should be changed.

Avalanchis asked Jul 20 '10


People also ask

How does clustered index improve performance?

Clustered indexes are stored as trees. With a clustered index, the actual data is stored in the leaf nodes, which can speed up retrieving the data when a lookup is performed on the index. As a consequence, fewer IO operations are required.

How the indexing affects the performance in database?

Indexing makes columns faster to query by creating pointers to where data is stored within a database. Imagine you want to find a piece of information within a large database. Without an index, the computer will look through every row until it finds it; with an index, it can jump straight to the rows it needs.

What is the disadvantage of clustered index?

A disadvantage of clustered indexes is that, because a clustered index determines the physical order of the table, there can be only one per table, and it is typically built on a unique key such as the primary key.

Is a clustered index faster?

A clustered index may be the fastest for one SELECT statement, but it may not necessarily be the correct choice. SQL Server indexes are b-trees. A non-clustered index just contains the indexed columns, with the leaf nodes of the b-tree being pointers to the appropriate data page.


1 Answer

The clustered index should contain the column(s) queried most often, to give the greatest chance of seeks, or of making a nonclustered index cover all the columns in the query.

The primary key and the clustered index do not have to be the same. They are both candidate keys, and tables often have more than one such key.
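
For illustration, something along these lines keeps the four-column primary key but clusters on whatever is actually queried most (all object names here are placeholders, and Sequence/Date as the clustered key is only a guess):

    ALTER TABLE dbo.MainTable DROP CONSTRAINT PK_MainTable;

    ALTER TABLE dbo.MainTable
        ADD CONSTRAINT PK_MainTable
        PRIMARY KEY NONCLUSTERED (Sequence, [Date], Client, Hash);

    CREATE CLUSTERED INDEX CIX_MainTable
        ON dbo.MainTable (Sequence, [Date]);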

You said

In addition, I can't really see any benefit to this since one would have to be querying all of these fields frequently to justify the clustered index, right?

That's not true. A seek can be had just by using the first column or two of the clustered index. It may be a range seek, but it's still a seek. You don't have to specify all the columns of it in order to get that benefit. But the order of the columns does matter a lot. If you're predominantly querying on Client, then the Sequence column is a bad choice as the first in the clustered index. The choice of the second column should be the item that is most queried in conjunction with the first (not by itself). If you find that a second column is queried by itself almost as often as the first column, then a nonclustered index will help.
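
As a sketch (the column choice is purely hypothetical): if, say, Date were queried by itself almost as often as the leading clustered column, a nonclustered index on it would give you seeks without touching the clustered key:

    CREATE NONCLUSTERED INDEX IX_MainTable_Date
        ON dbo.MainTable ([Date]);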

As others have said, reducing the number of columns/bytes in the clustered index as much as possible is important.

It's too bad that the Sequence is a random value instead of an incrementing one, but that may not be something you can change. The answer isn't to throw in an identity column unless your application can start using it as the primary query condition on this table (unlikely). Now, since you're stuck with this random Sequence column (presuming it IS the most often queried), let's look at another of your statements:

having these fields as a clustered index would be very detrimental to insert performance, since the data would have to be physically reordered on each insert.

That's not entirely true.

The physical location on the disk is not really what we're talking about here, but it does come into play in terms of fragmentation, which is a performance implication.

The rows inside each 8k page are not ordered. It's just that all the rows in each page are less than the next page and more than the previous one. The problem occurs when you insert a row and the page is full: you get a page split. The engine has to copy all the rows after the inserted row to a new page, and this can be expensive. With a random key you're going to get a lot of page splits. You can ameliorate the problem by using a lower fillfactor when rebuilding the index. You'd have to play with it to get the right number, but 70% or 60% might serve you better than 90%.
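
On SQL Server 2000, rebuilding with a lower fillfactor would look something like this (the table and index names are placeholders, and 70 is just a starting point to experiment with):

    -- Rebuild the clustered index, leaving 30% free space per page
    DBCC DBREINDEX ('dbo.MainTable', 'PK_MainTable', 70);

    -- On SQL Server 2005 and later the equivalent would be:
    -- ALTER INDEX PK_MainTable ON dbo.MainTable
    --     REBUILD WITH (FILLFACTOR = 70);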

I believe that having datetime as the second CI column could be beneficial. You'd still be dealing with pages needing to be split between two different Sequence values, but it's not nearly as bad as if the second column in the CI were also random: in that case you'd be guaranteed a page split on every insert, whereas with an ascending second column you can get lucky and fit the row into an existing page, because the next Sequence number starts on the next page.

Shortening the data types and reducing the number of columns in a table (as well as in its nonclustered indexes) can boost performance too, since more rows per page means fewer page reads per request, especially if the engine is forced to do a table scan. Moving a bunch of rarely-queried columns to a separate 1-1 table could do wonders for some of your queries, as sketched below.
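
A rough sketch of that kind of 1-1 split, with made-up names and columns:

    CREATE TABLE dbo.MainTableDetail (
        Sequence  numeric(18, 0) NOT NULL,
        [Date]    datetime       NOT NULL,
        Client    varchar(9)     NOT NULL,
        Hash      tinyint        NOT NULL,
        Notes     varchar(2000)  NULL,   -- rarely-queried columns move here
        CONSTRAINT PK_MainTableDetail
            PRIMARY KEY (Sequence, [Date], Client, Hash),
        CONSTRAINT FK_MainTableDetail_MainTable
            FOREIGN KEY (Sequence, [Date], Client, Hash)
            REFERENCES dbo.MainTable (Sequence, [Date], Client, Hash)
    );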

Last, there are some design tweaks that could help as well (in my opinion):

  • Change the Sequence column to a bigint to save a byte for every row (8 bytes instead of 9 for the numeric).
  • Use a lookup table for Client with a 4-byte int identity column instead of a varchar(9). This saves 5 bytes per row. If possible, use a smallint (-32768 to 32767) which is 2 bytes, an even greater savings of 7 bytes per row.
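
A quick sketch of the Client lookup idea (names are placeholders):

    CREATE TABLE dbo.ClientLookup (
        ClientId   smallint IDENTITY(1, 1) NOT NULL PRIMARY KEY,  -- 2 bytes per row
        ClientCode varchar(9) NOT NULL UNIQUE                     -- original value
    );
    -- The main table then stores ClientId instead of the varchar(9),
    -- saving 7 bytes in every row and in every index that includes it.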

Summary: The CI should start with the column most queried on. Remove any columns from the CI that you can. Shorten columns (bytes) as much as you can. Use a lower fillfactor to mitigate the page splits caused by the random Sequence column (if it has to stay first because of being queried the most).

Oh, and get your online defragging going. If the table can't be changed, at least it can be reorganized frequently to keep it in best possible shape. Don't neglect statistics, either, so the engine can pick appropriate execution plans.
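
On SQL Server 2000 the routine maintenance would be along these lines (object names are placeholders):

    -- Online defragmentation of the clustered index
    -- (2005+ equivalent: ALTER INDEX ... REORGANIZE)
    DBCC INDEXDEFRAG (0, 'dbo.MainTable', 'PK_MainTable');

    -- Keep the optimizer's statistics current
    UPDATE STATISTICS dbo.MainTable WITH FULLSCAN;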

UPDATE

Another strategy to consider is if the composite key used in the table can be converted to an int, and a lookup table of the values is created. Let's say some combination of less than all 4 columns is repeated in over 100 rows, for example, Sequence + Client + Hash but only with varying Date values. Then an insert to a separate SequenceClientHash table with an identity column could make sense, because then you could look up the artificial key once and use it over and over again. This would also get your CI to add new rows only on the last page (yay) and substantially reduce the size of the CI as repeated in all nonclustered indexes (yippee). But this would only make sense in certain narrow usage patterns.
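
A very rough sketch of that idea (all names invented):

    CREATE TABLE dbo.SequenceClientHash (
        SchId    int IDENTITY(1, 1) NOT NULL PRIMARY KEY CLUSTERED,
        Sequence numeric(18, 0) NOT NULL,
        Client   varchar(9)     NOT NULL,
        Hash     tinyint        NOT NULL,
        CONSTRAINT UQ_SequenceClientHash UNIQUE (Sequence, Client, Hash)
    );
    -- The main table would then carry the 4-byte SchId plus Date instead of
    -- repeating Sequence, Client, and Hash in every row and every index.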

Now, marc_s suggested just adding an additional int identity column as the clustered index. It is possible that this could help by making all the nonclustered indexes get more rows per page, but it all depends on exactly where you want the performance to be, because this would guarantee that every single query on the table would have to use a bookmark lookup, and you could never get a seek directly against the table's clustered index.

About "tons of page splits and bad index fragmentation": as I already said this can be ameliorated somewhat with a lower fill factor. Also, frequent online index reorganization (not the same as rebuilding) can help reduce the effect of this.

Ultimately, it all comes down to the exact system and its unique pattern of data access combined with decisions about which parts you want optimized. For some systems, having a slower insert isn't bad as long as selects are always fast. For others, having consistent but slightly slower select times is more important than having slightly faster but inconsistent select times. For others, the data isn't really read until it's pushed to a data warehouse anyway so the inserts need to be as fast as possible. And adding into the mix is the fact that performance isn't just about user wait time or even query response time but also about server resources especially in the case of massive parallelism, so that total throughput (say, in client responses per time unit) matters more than any other factor.

ErikE answered Oct 21 '22