SQL server - worth indexing large string keys?

Tags:

sql-server

I have a table with a large string key (varchar(1024)) that I am thinking of indexing in SQL Server. I want to be able to search on it quickly, but insert performance also matters. SQL Server 2008 gives me no warning for this, but SQL Server 2005 warns that the key exceeds the 900-byte maximum and that inserts or updates that make the column larger than that will fail (or something to that effect).

What are my alternatives if I want to index this large column? I don't know whether it would even be worth it if I could.


asked Nov 03 '11 21:11

Ghita

2 Answers

An index with all the keys near 900 bytes would be very large and very deep (very few keys per page result in very tall B-Trees).
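To make the "few keys per page" point concrete, here is a back-of-the-envelope sketch. The row count, the usable page size, and the per-entry overhead are assumptions chosen for illustration, not measured values, and real indexes will differ.

```sql
-- Rough B-tree depth estimate; every constant here is an assumption.
declare @rows bigint, @page_bytes int, @row_overhead int;
set @rows = 10000000;      -- assumed number of rows
set @page_bytes = 8096;    -- approximate usable bytes on an 8 KB page
set @row_overhead = 11;    -- assumed per-entry overhead, illustration only

select
   @page_bytes / (900 + @row_overhead) as wide_keys_per_page,
   @page_bytes / (4 + @row_overhead)   as narrow_keys_per_page,
   -- levels needed is roughly log base fan-out of the row count
   ceiling(log(@rows) / log(@page_bytes / (900 + @row_overhead))) as wide_levels,
   ceiling(log(@rows) / log(@page_bytes / (4 + @row_overhead)))   as narrow_levels;
```

Under these assumptions a 900-byte key fits about 8 entries per page and needs roughly 8 levels, while a 4-byte key fits several hundred entries per page and needs about 3.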

It depends on how you plan to query the values. An index is useful in several cases:

When a value is probed. This is the most typical use: an exact value is searched for in the table. Typical examples are WHERE column='ABC' or a join condition ON a.column = b.someothercolumn.
When a range is scanned. This is also fairly typical: a range of values is searched for in the table. Besides the obvious example of WHERE column BETWEEN 'ABC' AND 'DEF', there are less obvious ones, like a prefix match: WHERE column LIKE 'ABC%'.
When there is an ordering requirement. This use is less well known, but an index can help a query with an explicit ORDER BY column requirement avoid a stop-and-go sort, and can also help with certain hidden sort requirements, like ROW_NUMBER() OVER (ORDER BY column).
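The three cases above correspond to query shapes like the following. The table documents and its columns id and big_key are hypothetical names used only for illustration.

```sql
-- Hypothetical table and column names, for illustration only.

-- 1. Probe: exact match on the indexed column.
select id from documents where big_key = 'ABC';

-- 2. Range scan: an explicit range, or a prefix match.
select id from documents where big_key between 'ABC' and 'DEF';
select id from documents where big_key like 'ABC%';

-- 3. Ordering: an explicit sort, or a "hidden" one inside a window function.
select id, big_key from documents order by big_key;
select id, row_number() over (order by big_key) as rn from documents;
```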

So, what do you need the index for? What kind of queries would use it?

For range scans and for ordering requirements there is no other solution but to have the index, and you will have to weigh the cost of the index vs. the benefits.

For probes you can, potentially, use a hash to avoid indexing a very large column. Create a persisted computed column as column_checksum = CHECKSUM(column) and then index that column. Queries have to be rewritten to use WHERE column_checksum = CHECKSUM('ABC') AND column='ABC'. You would have to weigh the advantage of a narrow index (a 32-bit checksum) against the disadvantages: the collision double-check and the loss of range scan and ordering capabilities.
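Applied to an existing table, the checksum approach might look like the sketch below. The names documents, big_key, big_key_checksum, and the index name are hypothetical. The search value is declared as varchar to match the column, on the assumption that CHECKSUM results depend on the data type of the input.

```sql
-- Hypothetical names; a sketch of the hash-probe pattern, not a tuned design.
alter table documents
   add big_key_checksum as checksum(big_key) persisted;
go

create index ix_documents_big_key_checksum
   on documents (big_key_checksum);
go

-- Probe by the narrow checksum, then compare the full value
-- to filter out rows that merely collide on the checksum.
declare @search varchar(1024);
set @search = 'ABC';

select id
   from documents
   where big_key_checksum = checksum(@search)
   and big_key = @search;
```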

after the comment

I once had a similar problem and I used a hash column. The value was too large to index (>1K) and I also needed to convert the value into an ID to store (basically, a dictionary). Something along the lines:

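create table values_dictionary (
  id int not null identity(1,1),
  value varchar(8000) not null,
  -- computed column: the syntax is AS, not '='
  value_hash as checksum(value) persisted,
  constraint pk_values_dictionary_id
     primary key nonclustered (id));
-- cluster on the hash so colliding values sit together; id makes the key unique
create unique clustered index cdx_values_dictionary_checksum
   on values_dictionary (value_hash, id);
go

create procedure usp_get_or_create_value_id (
   @value varchar(8000),
   @id int output)
as
begin
   -- declare and assign separately so this also runs on SQL Server 2005
   declare @hash int;
   set @hash = checksum(@value);
   set @id = NULL;
   -- probe by the narrow hash, then compare the full value to rule out collisions
   select @id = id
      from values_dictionary
      where value_hash = @hash
      and value = @value;
   if @id is null
   begin
      insert into values_dictionary (value)
        values (@value);
      set @id = scope_identity();
   end
end
go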

In this case the dictionary table is organized as a clustered index on the value_hash column, which groups all the colliding hash values together. The id column is added to make the clustered index unique, avoiding the need for a hidden uniqueifier column. This structure makes the lookup for @value as efficient as possible, without a hugely inefficient index on value and bypassing the 900-byte limitation. The primary key on id is nonclustered, which means that looking up the value from an id incurs the overhead of one extra probe in the clustered index.
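For reference, calling the procedure and doing the reverse lookup could look like this. The literal value is a placeholder.

```sql
-- Get (or create) the id for a value; the literal is a placeholder.
declare @id int;
exec usp_get_or_create_value_id
   @value = 'some very long value',
   @id = @id output;
select @id as value_id;

-- Reverse lookup: seeks the nonclustered primary key on id,
-- then probes the clustered index to fetch the value.
select value
   from values_dictionary
   where id = @id;
```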

Not sure if this answers your problem; you obviously know more about your actual scenario than I do. Also, the code does not handle error conditions and can actually insert duplicate @value entries, which may or may not be acceptable.
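If duplicates are not acceptable, one common pattern is to do the lookup and the insert in a single transaction with UPDLOCK and HOLDLOCK hints on the lookup. The sketch below is an assumption about how that could be applied here, not part of the original solution. The procedure name is hypothetical, there is still no error handling, and the extra locking can add blocking under heavy concurrency.

```sql
-- Hypothetical variant: serialize concurrent callers probing the same hash.
create procedure usp_get_or_create_value_id_locked (
   @value varchar(8000),
   @id int output)
as
begin
   set nocount on;
   declare @hash int;
   set @hash = checksum(@value);
   set @id = null;

   begin transaction;
   -- updlock + holdlock keep the probed key range locked until commit,
   -- so a second caller waits instead of inserting the same value
   select @id = id
      from values_dictionary with (updlock, holdlock)
      where value_hash = @hash
      and value = @value;
   if @id is null
   begin
      insert into values_dictionary (value)
        values (@value);
      set @id = scope_identity();
   end
   commit transaction;
end
go
```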


answered Oct 13 '22 05:10

Remus Rusanu

General Index Design Guidelines

When you design an index consider the following column guidelines:

Keep the length of the index key short for clustered indexes. Additionally, clustered indexes benefit from being created on unique or nonnull columns. For more information, see Clustered Index Design Guidelines.

Columns that are of the ntext, text, image, varchar(max), nvarchar(max), and varbinary(max) data types cannot be specified as index key columns. However, varchar(max), nvarchar(max), varbinary(max), and xml data types can participate in a nonclustered index as nonkey index columns. For more information, see Index with Included Columns.
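As a small illustration of that guideline, a varchar(max) column cannot be an index key but can be carried as an included column. The table articles and its columns are hypothetical.

```sql
-- Hypothetical table, for illustration only.
create table articles (
   id int not null primary key,
   title varchar(200) not null,
   body varchar(max) null);

-- body cannot be a key column, but it can be an included (nonkey) column,
-- so queries that seek on title can read body from the index itself.
create nonclustered index ix_articles_title
   on articles (title)
   include (body);
```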

Examine data distribution in the column. Frequently, a long-running query is caused by indexing a column with few unique values, or by performing a join on such a column. This is a fundamental problem with the data and query, and generally cannot be resolved without identifying this situation. For example, a physical telephone directory sorted alphabetically on last name will not expedite locating a person if all people in the city are named Smith or Jones.
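A quick way to examine the distribution before indexing is to compare total rows with distinct values and to list the most frequent values. The table people and the column last_name are hypothetical, echoing the telephone directory example.

```sql
-- Hypothetical table and column, for illustration only.
-- Few distinct values relative to total rows suggests a poorly selective index.
select count(*) as total_rows,
       count(distinct last_name) as distinct_values
   from people;

-- The most common values show how skewed the column is.
select top (10) last_name, count(*) as occurrences
   from people
   group by last_name
   order by count(*) desc;
```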

answered Oct 13 '22 05:10

sll
