I got this hint from mssqlcity.com. However, I cannot understand its explanation.
For example, if a table has surrogate integer primary key declared as IDENTITY and the clustered index was created on this column, then every time data is inserted into this table, the rows will be added to the end of the table. When many rows will be added a "hot spot" can occur. A "hot spot" occurs when many queries try to read or write data in the same area at the same time. A "hot spot" results in I/O bottleneck. Note. By default, SQL Server creates clustered index for the primary key constraint. So, in this case, you should explicitly specify NONCLUSTERED keyword to indicate that a nonclustered index is created for the primary key constraint.
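For reference, here is a minimal T-SQL sketch of what that hint describes (the table and column names are made up for illustration): the IDENTITY column stays the primary key, but the NONCLUSTERED keyword keeps the clustered index free for another column.

    CREATE TABLE dbo.Orders
    (
        OrderID   int IDENTITY(1,1) NOT NULL,
        OrderDate datetime NOT NULL,
        -- Explicit NONCLUSTERED keyword, as the hint suggests
        CONSTRAINT PK_Orders PRIMARY KEY NONCLUSTERED (OrderID)
    );

    -- The clustered index can then be created on a different column
    CREATE CLUSTERED INDEX IX_Orders_OrderDate ON dbo.Orders (OrderDate);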
Before I read that, I thought picking a column whose values are random in nature was the wrong choice, because it would cause unnecessary page splits and row relocation when new rows are inserted. So I figured a sequentially increasing column was preferable.
After reading this hint, I think it's saying we don't really want a strictly increasing column as our clustered index either, because it creates an I/O bottleneck for write-intensive applications.
I don't really understand the cause of the I/O bottleneck they are talking about. Are they saying that too many operations hitting the same page will slow down the disk? How does this happen? Can somebody explain this to me?
The Hot Spot they are referring to is not an issue in SQL Server 2005 and newer.
What USED to happen is that all of your inserts were written to the same area of the clustered index, and therefore to the same sector(s) on disk. This created a lot of dirty pages at once (dirty pages being data pages that have been modified in memory but not yet written to disk), and when a flush or checkpoint ran, it could cause I/O issues.
Newer versions do not experience this behavior due to changes in the I/O architecture (from what I understand).
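If you want to watch this yourself, a rough sketch on SQL Server 2005+ is to count dirty pages through the buffer descriptors DMV and then force a checkpoint (the grouping and aliases here are just one way to slice it):

    -- Pages modified in memory but not yet written to disk (dirty pages)
    SELECT DB_NAME(database_id) AS database_name,
           COUNT(*)             AS dirty_pages
    FROM sys.dm_os_buffer_descriptors
    WHERE is_modified = 1
    GROUP BY database_id;

    -- A manual CHECKPOINT flushes the current database's dirty pages to disk
    CHECKPOINT;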
All modern transactional databases (modern means developed in the last decade) use transaction logging.
This means that all changes to the database are written into a special file (called the transaction log) in a sequential manner, and then a special dedicated process parses this file and applies the changes to the actual data. This is called a CHECKPOINT.

If ten threads insert ten records into a table with an IDENTITY column, the engine will create ten transaction log records (written one after another by a single process called the Log Writer), and then, when it's time for a CHECKPOINT, these records will be written to the appropriate data pages (also by a single process, called Checkpoint). Since the new rows are contiguous, they will most probably be written to a single data page in a single I/O operation, and no page splits can occur, since there is no data after them.
Therefore, a clustered index on an ever-increasing key is more efficient than one on a random key.
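As a concrete sketch of that difference (table names are hypothetical), compare an ever-increasing clustering key with a random one:

    -- Ever-increasing key: every insert lands on the last page of the clustered
    -- index, so pages fill in order and no page splits occur.
    CREATE TABLE dbo.OrdersSequential
    (
        OrderID int IDENTITY(1,1) NOT NULL PRIMARY KEY CLUSTERED,
        Payload varchar(200) NOT NULL
    );

    -- Random key (NEWID()): inserts land on arbitrary pages, which fill up and
    -- split, causing extra random I/O and fragmentation.
    CREATE TABLE dbo.OrdersRandom
    (
        OrderID uniqueidentifier NOT NULL DEFAULT NEWID() PRIMARY KEY CLUSTERED,
        Payload varchar(200) NOT NULL
    );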
Well, I've heard the same story before. Apparently it's a myth. The general advice is to use an ever-increasing clustered primary key. All the major DB vendors know this and mitigate the hot-spot situation your quote gives as a reason to avoid increasing keys.
See also https://dba.stackexchange.com/questions/1584/is-avoid-creating-a-clustered-index-based-on-an-incrementing-key-a-myth-from-sq
The quote also goes against the advice (from the same page):
Consider creating a surrogate integer primary key (identity, for example). Every table must have a primary key (a unique identifier for a row within a database table). A surrogate primary key is a field that has a unique value but has no actual meaning to the record itself, so users should never see or change a surrogate primary key. Some developers use surrogate primary keys, others use data fields themselves as the primary key. If a primary key consists of many data fields and has a big size, consider creating a surrogate integer primary key. This can improve performance of your queries.
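That advice amounts to something like the following sketch (hypothetical schema): keep the wide natural key as a UNIQUE constraint and make a small IDENTITY column the primary key.

    -- Wide natural composite key: large, and repeated in every nonclustered index
    CREATE TABLE dbo.OrderLinesNatural
    (
        CustomerCode varchar(20) NOT NULL,
        OrderNumber  varchar(30) NOT NULL,
        LineNumber   int         NOT NULL,
        Quantity     int         NOT NULL,
        CONSTRAINT PK_OrderLinesNatural
            PRIMARY KEY (CustomerCode, OrderNumber, LineNumber)
    );

    -- Surrogate key: a small IDENTITY column becomes the primary key,
    -- while the natural key is still enforced with a UNIQUE constraint
    CREATE TABLE dbo.OrderLinesSurrogate
    (
        OrderLineID  int IDENTITY(1,1) NOT NULL PRIMARY KEY,
        CustomerCode varchar(20) NOT NULL,
        OrderNumber  varchar(30) NOT NULL,
        LineNumber   int         NOT NULL,
        Quantity     int         NOT NULL,
        CONSTRAINT UQ_OrderLinesSurrogate
            UNIQUE (CustomerCode, OrderNumber, LineNumber)
    );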