In SQL Server (2005+) I need to index a column (exact matches only) that is nvarchar(2000+). What is the most scalable, performant way to approach this?
In SQL Server (2005+), what would be the practical difference in indexing on a column with the following types:
nvarchar(2000)
char(40)
binary(16)
E.g. would a lookup against an indexed binary(16) column be measurably faster than a lookup against an indexed nvarchar(2000) column? If so, by how much?
Obviously smaller is always better in some regard, but I am not familiar enough with how SQL Server optimizes its indexes to know how it deals with length.
Indexes should not be used on tables containing few records, nor on tables that have frequent, large batch update or insert operations. Indexes should not be used on columns that contain a high number of NULL values, or on columns that are frequently manipulated.
Apply SQL Server index key column best practices: columns of type text, image, ntext, varchar(max), nvarchar(max) and varbinary(max) cannot be used as index key columns. It is recommended to use an integer data type for the index key column; it has a low space requirement and works efficiently.
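As a hedged illustration of that rule (the Document table and index names here are made up), a max type cannot be an index key column, but it can still be carried as an INCLUDE column of a nonclustered index:

-- hypothetical table: the large column cannot be an index key, but it can ride along as an INCLUDE column
create table Document
(
    document_id int identity(1,1) primary key,
    title       nvarchar(200) not null,
    body        nvarchar(max) not null
)

-- this would fail: nvarchar(max) is not allowed as an index key column
-- create index IX_Document_Body on Document (body)

-- this works: the small column is the key, the large column is stored at the leaf level
create index IX_Document_Title on Document (title) include (body)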
No, there is overhead in maintaining the indexes, so indexing all columns would slow down all of your insert, update and delete operations.
The number of indexes on a table is the most dominant factor for insert performance. The more indexes a table has, the slower the execution becomes.
You're thinking about this from the wrong direction: whether a column is a binary(16) or an nvarchar(2000) makes little difference there, because you don't just go add indexes willy-nilly.
Don't let index choice dictate your column types. If you need to index an nvarchar(2000), consider a full-text index, or add a hash value for the column and index that.
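A rough sketch of the full-text option, assuming a recipe table like the one in the answer further down; the catalog name and the PK_recipe key index are placeholders, and note that full-text gives word/phrase matching rather than exact equality:

-- a full-text catalog plus a full-text index on the long column
create fulltext catalog RecipeCatalog as default

create fulltext index on recipe (recipe_text)
    key index PK_recipe   -- requires an existing single-column unique index on the table

-- full-text queries match words/phrases, not exact equality
select * from recipe where contains(recipe_text, N'"chocolate"')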
Based on your update, I would probably create either a checksum column or a computed column using the HashBytes() function and index that. Note that a checksum isn't the same as a cryptographic hash, so you are somewhat more likely to have collisions, but you can also match the entire contents of the text and it will filter with the index first. HashBytes() is less likely to have collisions, but a collision is still possible, so you still need to compare the actual column. HashBytes() is also more expensive, since the hash has to be computed for each query and each change.
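A minimal sketch of the HashBytes() variant, reusing the recipe/recipe_text names from the answer further down; the hash column, index and variable names are illustrative, and SHA1 is chosen only because it is available on SQL Server 2005:

-- persisted computed column holding the 20-byte SHA1 hash of the long text
alter table recipe add recipe_text_hash as cast(hashbytes('SHA1', recipe_text) as binary(20)) persisted

-- index the small hash instead of the nvarchar(2000) column itself
create index IX_recipe_text_hash on recipe (recipe_text_hash)

-- lookup: the hash narrows the rows through the index, the second predicate confirms the exact match
declare @search nvarchar(2000)
set @search = N'...'   -- placeholder for the exact text you are searching for

select *
from recipe
where recipe_text_hash = hashbytes('SHA1', @search)
  and recipe_text = @search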
OF COURSE a binary(16) will be MUCH faster - just do the quickest of calculations: with roughly 8 KB per index page, a 16-byte key fits about 500 entries on a page, while an nvarchar(2000) key (up to 4,000 bytes) fits only about 2.
If you have a table with 100'000 entries, you'll need about 200 pages for the index with a binary(16) key, while you'll need about 50'000 pages for the same index with nvarchar(2000).
Even just the added I/O to read and scan all those pages is going to kill any performance you might have had.
Marc
UPDATE:
For my usual indexes, I try to avoid compound indexes as much as I can - referencing them from other tables just gets rather messy (WHERE clauses with several equality comparisons).
Also, regularly check and maintain your indices - if you have more than 30% fragmentation, rebuild - if you have 5-30% fragmentation, reorganize. Check out an automatic, well tested DB Index maintenance script at http://sqlfool.com/2009/06/index-defrag-script-v30/
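If you want to check fragmentation yourself rather than use the script, a sketch along these lines should work (thresholds as above; the index and table names in the comments are placeholders):

-- fragmentation per index in the current database
select object_name(ips.object_id) as table_name,
       i.name                     as index_name,
       ips.avg_fragmentation_in_percent
from sys.dm_db_index_physical_stats(db_id(), null, null, null, 'LIMITED') as ips
join sys.indexes as i
  on i.object_id = ips.object_id and i.index_id = ips.index_id
where ips.avg_fragmentation_in_percent > 5
order by ips.avg_fragmentation_in_percent desc

-- then, per index (names are placeholders):
-- alter index IX_recipe_text_hash on recipe reorganize   -- 5-30% fragmentation
-- alter index IX_recipe_text_hash on recipe rebuild      -- more than 30%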
For the clustered key on a SQL Server table, try to avoid GUIDs, since they're random in nature and thus cause potentially massive index fragmentation, which hurts performance. Also, while not a hard requirement, try to make sure your clustered key is unique - if it's not, SQL Server will add a four-byte uniqueifier to it. And since the clustered key gets added to each and every entry in each and every non-clustered index, it's extremely important to have a small, unique, stable (non-changing) clustered key - optimally one that is ever-increasing, which gives you the best characteristics and performance --> an INT IDENTITY is perfect.
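As a sketch of what that looks like (table, column and constraint names are illustrative, mirroring the recipe example below):

-- narrow, unique, stable, ever-increasing clustered key
create table recipe
(
    recipe_id   int identity(1,1) not null,
    recipe_text nvarchar(2000) not null,
    constraint PK_recipe primary key clustered (recipe_id)
)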
You can have at most 900 bytes per index entry, so your nvarchar(2000) won't fly. The biggest difference will be index depth - the number of pages to traverse from the root to the leaf page. So, if you need to search, you can index on CHECKSUM, like this:
-- computed column holding the 4-byte CHECKSUM of the long text
alter table recipe add text_checksum as checksum(recipe_text)
-- index the small checksum instead of the nvarchar(2000) column itself
create index text_checksum_ind on recipe(text_checksum)
(example from "Indexes on Computed Columns: Speed Up Queries, Add Business Rules"). This will not give you an exact match on its own - it only narrows your search down very well, so you still have to compare the full column.
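The lookup pattern then looks roughly like this (@search is a placeholder for the text you're matching):

declare @search nvarchar(2000)
set @search = N'...'   -- placeholder for the exact text

select *
from recipe
where text_checksum = checksum(@search)   -- narrows the rows via the index
  and recipe_text = @search               -- confirms the exact match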
Of course, if you need to enforce uniqueness, you'll have to use triggers.
Another idea is to compress your nvarchar to a smaller binary value and index on that, but can you guarantee that every value always compresses to 900 bytes or less?