
What column should the clustered index be put on?

Lately I have been reading about indexes of all types, and the main advice is to put the clustered index on the primary key of the table. But what if the primary key is never actually used in a query (via a select or join) and exists purely for relational purposes, so it is not queried against? For example, say I have a car_parts table with 3 columns: car_part_id, car_part_no, and car_part_title. car_part_id is the unique primary key identity column. In this case car_part_no is unique as well, and most likely so is car_part_title. car_part_no is what is queried against most, so doesn't it make sense to put the clustered index on that column instead of car_part_id? The question boils down to this: which column should actually have the clustered index, since you are only allowed one?

Xaisoft asked Sep 17 '09 16:09



2 Answers

Kimberly Tripp is always one of the best sources of insight on indexing.

See her blog post "Ever-increasing clustering key - the Clustered Index Debate - again!" in which she quite clearly lists and explains the main requirements for a good clustering key - it needs to be:

  • Unique
  • Narrow
  • Static

and best of all, if you can manage:

  • ever-increasing

Taking all this into account, an INT IDENTITY (or BIGINT IDENTITY if you really need more than 2 billion rows) works out to be the best choice in the vast majority of cases.

One thing a lot of people don't realize (and thus don't take into account when making their choice) is the fact that the clustering key (all the columns that make up the clustered index) will be added to each and every index entry for each and every non-clustered index on your table - thus the "narrow" requirement becomes extra important!

Also, since the clustering key is used for bookmark lookups (looking up the actual data row when a row is found in a non-clustered index), the "unique" requirement also becomes very important. So important, in fact, that if you choose a column (or set of columns) that is not guaranteed to be unique, SQL Server will add a 4-byte uniqueifier to disambiguate duplicate rows, thus making your clustered index keys extra wide; definitely NOT a good thing.
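To put rough numbers on the "narrow" and "unique" points, here is a back-of-the-envelope sketch in Python. The row count, the non-clustered key width, and the index count are made-up values for illustration, and the model deliberately ignores row headers, page overhead, fill factor, and compression, so treat it as a way to compare choices rather than a size prediction.

```python
# Illustrative arithmetic only; real index sizes also depend on row headers,
# page overhead, fill factor and compression, all ignored here.
KEY_WIDTH_BYTES = {
    "INT": 4,
    "BIGINT": 8,
    # Upper bound: assumes every row carries the extra 4-byte uniqueifier.
    "non-unique INT + uniqueifier": 4 + 4,
    "two UNIQUEIDENTIFIER columns": 16 + 16,
}

ROWS = 10_000_000      # hypothetical table size
INDEX_KEY_BYTES = 8    # hypothetical width of each non-clustered index key
INDEX_COUNT = 5        # hypothetical number of non-clustered indexes


def nonclustered_leaf_bytes(rows, index_key_bytes, clustering_key_bytes):
    """Each non-clustered leaf entry stores its own key plus the clustering key."""
    return rows * (index_key_bytes + clustering_key_bytes)


def total_mb(clustering_key_bytes):
    """Approximate leaf-level size of all non-clustered indexes, in MB."""
    per_index = nonclustered_leaf_bytes(ROWS, INDEX_KEY_BYTES, clustering_key_bytes)
    return INDEX_COUNT * per_index / (1024 * 1024)


for name, width in KEY_WIDTH_BYTES.items():
    print(f"{name:32s} ~{total_mb(width):8.0f} MB across {INDEX_COUNT} indexes")
```

The point of the comparison is that the clustering key width is paid once per row in every non-clustered index, so a few extra bytes are multiplied by both the row count and the number of indexes.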

Marc

marc_s answered Nov 11 '22 12:11



An index, clustered or non-clustered, can be used by the query optimizer if and only if the leftmost key in the index is filtered on. So if you define an index on columns (A, B, C), a WHERE condition on B=@b, on C=@c, or on B=@b AND C=@c will not fully leverage the index (see note). This also applies to join conditions. Any WHERE filter that includes A will consider the index: A=@a, or A=@a AND B=@b, or A=@a AND C=@c, or A=@a AND B=@b AND C=@c.
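If you want to see the leftmost-key rule for yourself without a SQL Server instance, a small sketch with Python's built-in sqlite3 module shows the same general B-tree behavior. Note the swap: this is SQLite's planner, not SQL Server's, and the table t, the columns a, b, c, d, and the index name ix_abc are made up for the illustration.

```python
import sqlite3

# SQLite stands in for SQL Server here; only the general B-tree
# leftmost-key behavior is being illustrated.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d TEXT)")
conn.execute("CREATE INDEX ix_abc ON t (a, b, c)")


def plan(sql):
    """Return the planner's description of how it would run `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql).fetchall()
    return " | ".join(row[3] for row in rows)  # 4th column holds the detail text


# Filter includes the leftmost key a: the index can be used for a seek.
seek_plan = plan("SELECT d FROM t WHERE a = 1 AND b = 2")

# Filter skips a: the index cannot be used to seek, so the table is scanned.
scan_plan = plan("SELECT d FROM t WHERE b = 2 AND c = 3")

print(seek_plan)
print(scan_plan)
```

The first plan should mention a search using ix_abc, while the second should fall back to scanning the table. In SQL Server you would check the same thing by looking for an Index Seek versus a scan operator in the execution plan.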

So in your example, if you make the clustered index on part_no as the leftmost key, then a query looking for a specific part_id will not use the index, and a separate non-clustered index must exist on part_id.

Now, about the question of which of the many indexes should be the clustered one. If you have several query patterns that are of about the same importance and frequency and contradict each other in terms of the keys needed (e.g. frequent queries by either part_no or part_id), then you take other factors into consideration:

  • width: the clustered index key is used as the lookup key by all other non-clustered indexes. So if you choose a wide key (say two uniqueidentifier columns) then you are making all the other indexes wider, thus consuming more space, generating more IO and slowing down everything. So between equally good keys from a read point of view, choose the narrowest one as clustered and make the wider ones non-clustered.
  • contention: if you have specific patterns of insert and delete, try to separate them physically so they occur on different portions of the clustered index. E.g. if the table acts as a queue with all inserts at one logical end and all deletes at the other logical end, try to lay out the clustered index so that the physical order matches this logical order (e.g. enqueue order).
  • partitioning: if the table is very large and you plan to deploy partitioning, then the partitioning key must be part of the clustered index. A typical example is historical data that is archived using a sliding window partitioning scheme. Even though the entities have a logical primary key like 'entity_id', the clustered index is built on a datetime column that is also used for the partitioning function.
  • stability: a key that changes often is a poor candidate for a clustered key, as each update changes the clustered key value and forces all non-clustered indexes to update the lookup key they store. Since an update of a clustered key will also likely relocate the record to a different page, it can cause fragmentation of the clustered index.

Note: I say "not fully leverage" because sometimes the engine will choose a non-clustered index to scan instead of the clustered index simply because it is narrower and thus has fewer pages to scan. In my example, if you have an index on (A, B, C), a WHERE filter on B=@b, and the query projects C, the index will likely be used, but as a scan rather than a seek, because that is still faster than a full clustered scan (fewer pages).
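The scan-instead-of-seek case in the note can be sketched the same way with Python's built-in sqlite3 module. Again, this is SQLite's planner standing in for SQL Server's, and the table t, its columns, and the index name ix_abc are made up for the illustration. The query filters on b only but needs nothing outside the index, so the planner can read the narrower index instead of the whole table.

```python
import sqlite3

# SQLite stands in for SQL Server; the names below are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER, c INTEGER, d TEXT)")
conn.execute("CREATE INDEX ix_abc ON t (a, b, c)")

# The filter skips the leftmost key a, so no seek is possible, but every
# column the query touches (b and c) is stored in the index.
rows = conn.execute("EXPLAIN QUERY PLAN SELECT c FROM t WHERE b = 2").fetchall()
covering_plan = " | ".join(row[3] for row in rows)  # 4th column is the detail text

print(covering_plan)
```

The plan should report a scan that reads ix_abc rather than a search, which is the "used, but not as a seek" situation described in the note.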

Remus Rusanu answered Nov 11 '22 13:11
