In SQL Server (2005+) I need to index a column (exact matches only) that is nvarchar(2000+). What is the most scalable, performant way to approach this?
In SQL Server (2005+), what would be the practical difference in indexing on a column with the following types:
nvarchar(2000)
char(40)
binary(16)
E.g. would a lookup against an indexed binary(16) column be measurably faster than a lookup against an indexed nvarchar(2000) column? If so, by how much?
Obviously smaller is always better in some regard, but I am not familiar enough with how SQL Server optimizes its indexes to know how it deals with length.
Indexes should not be used on tables containing few records, nor on tables that have frequent, large batch update or insert operations. Indexes should not be used on columns that contain a high number of NULL values, or on columns that are frequently manipulated.
Apply SQL Server index key column best practices: columns of type text, image, ntext, varchar(max), nvarchar(max) and varbinary(max) cannot be used as index key columns. It is recommended to use an integer data type for the index key column; it has a low space requirement and works efficiently.
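As a hedged illustration of that rule (the Document table and index names here are made up), a max type cannot be an index key column, but it can still be carried as an INCLUDE column of a nonclustered index:

-- hypothetical table: the large column cannot be an index key, but it can ride along as an INCLUDE column
create table Document
(
    document_id int identity(1,1) primary key,
    title       nvarchar(200) not null,
    body        nvarchar(max) not null
)

-- this would fail: nvarchar(max) is not allowed as an index key column
-- create index IX_Document_Body on Document (body)

-- this works: the small column is the key, the large column is stored at the leaf level
create index IX_Document_Title on Document (title) include (body)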
No, there is overhead in maintaining the indexes, so indexing all columns would slow down all of your insert, update and delete operations.
The number of indexes on a table is the most dominant factor for insert performance. The more indexes a table has, the slower the execution becomes.
You're thinking about this from the wrong direction: whether a column is a binary(16) or an nvarchar(2000) makes little difference there, because you don't just go add indexes willy-nilly.
Don't let index choice dictate your column types. If you need to index an nvarchar(2000), consider a full-text index, or add a hash value for the column and index that.
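A rough sketch of the full-text option, assuming a recipe table like the one in the answer further down; the catalog name and the PK_recipe key index are placeholders, and note that full-text gives word/phrase matching rather than exact equality:

-- a full-text catalog plus a full-text index on the long column
create fulltext catalog RecipeCatalog as default

create fulltext index on recipe (recipe_text)
    key index PK_recipe   -- requires an existing single-column unique index on the table

-- full-text queries match words/phrases, not exact equality
select * from recipe where contains(recipe_text, N'"chocolate"')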
Based on your update, I would probably create either a checksum column or a computed column using the HashBytes() function and index that. Note that a checksum isn't the same as a cryptographic hash, so you are somewhat more likely to have collisions, but you can also match the entire contents of the text and it will filter with the index first. HashBytes() is less likely to have collisions, but a collision is still possible, so you still need to compare the actual column. HashBytes() is also more expensive, since the hash has to be computed for each query and each change.
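A minimal sketch of the HashBytes() variant, reusing the recipe/recipe_text names from the answer further down; the hash column, index and variable names are illustrative, and SHA1 is chosen only because it is available on SQL Server 2005:

-- persisted computed column holding the 20-byte SHA1 hash of the long text
alter table recipe add recipe_text_hash as cast(hashbytes('SHA1', recipe_text) as binary(20)) persisted

-- index the small hash instead of the nvarchar(2000) column itself
create index IX_recipe_text_hash on recipe (recipe_text_hash)

-- lookup: the hash narrows the rows through the index, the second predicate confirms the exact match
declare @search nvarchar(2000)
set @search = N'...'   -- placeholder for the exact text you are searching for

select *
from recipe
where recipe_text_hash = hashbytes('SHA1', @search)
  and recipe_text = @search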
OF COURSE a binary(16) will be MUCH faster - just do the quickest of calculations: with roughly 8 KB per index page, a 16-byte key fits about 500 entries on a page, while an nvarchar(2000) key (up to 4,000 bytes) fits only about 2.
If you have a table with 100'000 entries, you'll need about 200 pages for the index with a binary(16) key, while you'll need about 50'000 pages for the same index with nvarchar(2000).
Even just the added I/O to read and scan all those pages is going to kill any performance you might have had.
Marc
UPDATE:
For my usual indexes, I try to avoid compound indexes as much as I can - referencing them from other tables just gets rather messy (WHERE clauses with several equality comparisons).
Also, regularly check and maintain your indices - if you have more than 30% fragmentation, rebuild - if you have 5-30% fragmentation, reorganize. Check out an automatic, well tested DB Index maintenance script at http://sqlfool.com/2009/06/index-defrag-script-v30/
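If you want to check fragmentation yourself rather than use the script, a sketch along these lines should work (thresholds as above; the index and table names in the comments are placeholders):

-- fragmentation per index in the current database
select object_name(ips.object_id) as table_name,
       i.name                     as index_name,
       ips.avg_fragmentation_in_percent
from sys.dm_db_index_physical_stats(db_id(), null, null, null, 'LIMITED') as ips
join sys.indexes as i
  on i.object_id = ips.object_id and i.index_id = ips.index_id
where ips.avg_fragmentation_in_percent > 5
order by ips.avg_fragmentation_in_percent desc

-- then, per index (names are placeholders):
-- alter index IX_recipe_text_hash on recipe reorganize   -- 5-30% fragmentation
-- alter index IX_recipe_text_hash on recipe rebuild      -- more than 30%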
For the clustered key on a SQL Server table, try to avoid GUIDs, since they're random in nature and thus cause potentially massive index fragmentation, which hurts performance. Also, while not a hard requirement, try to make sure your clustered key is unique - if it's not, SQL Server will add a four-byte uniqueifier to it. And since the clustered key gets added to each and every entry in each and every non-clustered index, it's extremely important to have a small, unique, stable (non-changing) clustered key - optimally one that is ever-increasing, which gives you the best characteristics and performance --> an INT IDENTITY is perfect.
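As a sketch of what that looks like (table, column and constraint names are illustrative, mirroring the recipe example below):

-- narrow, unique, stable, ever-increasing clustered key
create table recipe
(
    recipe_id   int identity(1,1) not null,
    recipe_text nvarchar(2000) not null,
    constraint PK_recipe primary key clustered (recipe_id)
)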
You can have at most 900 bytes per index entry, so your nvarchar(2000) won't fly. The biggest difference will be index depth - the number of pages to traverse from the root to the leaf page. So, if you need to search, you can index on CHECKSUM, like this:
-- computed column holding the 4-byte CHECKSUM of the long text
alter table recipe add text_checksum as checksum(recipe_text)
-- index the small checksum instead of the nvarchar(2000) column itself
create index text_checksum_ind on recipe(text_checksum)
(example from "Indexes on Computed Columns: Speed Up Queries, Add Business Rules"). This will not give you an exact match on its own - it only narrows your search down very well, so you still have to compare the full column.
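The lookup pattern then looks roughly like this (@search is a placeholder for the text you're matching):

declare @search nvarchar(2000)
set @search = N'...'   -- placeholder for the exact text

select *
from recipe
where text_checksum = checksum(@search)   -- narrows the rows via the index
  and recipe_text = @search               -- confirms the exact match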
Of course, if you need to enforce uniqueness, you'll have to use triggers.
Another idea is to compress your nvarchar to a smaller binary value and index on that, but can you guarantee that every value always compresses to 900 bytes or less?