Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL Server Search Proper Names Full Text Index vs LIKE + SOUNDEX

I have a database of names of people that has (currently) 35 million rows. I need to know what is the best method for quickly searching these names. The current system (not designed by me), simply has the first and last name columns indexed and uses "LIKE" queries with the additional option of using SOUNDEX (though I'm not sure this is actually used much). Performance has always been a problem with this system, and so currently the searches are limited to 200 results (which still takes too long to run). So, I have a few questions:

  1. Does full text index work well for proper names?
  2. If so, what is the best way to query proper names? (CONTAINS, FREETEXT, etc)
  3. Is there some other system (like Lucene.net) that would be better?

Just for reference, I'm using Fluent NHibernate for data access, so methods that work will with that will be preferred. I'm using SQL Server 2008 currently.

EDIT I want to add that I'm very interested in solutions that will deal with things like commonly misspelled names, eg 'smythe', 'smith', as well as first names, eg 'tomas', 'thomas'.

Query Plan

  |--Parallelism(Gather Streams)
       |--Nested Loops(Inner Join, OUTER REFERENCES:([testdb].[dbo].[Test].[Id], [Expr1004]) OPTIMIZED WITH UNORDERED PREFETCH)
            |--Hash Match(Inner Join, HASH:([testdb].[dbo].[Test].[Id])=([testdb].[dbo].[Test].[Id]))
            |    |--Bitmap(HASH:([testdb].[dbo].[Test].[Id]), DEFINE:([Bitmap1003]))
            |    |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([testdb].[dbo].[Test].[Id]))
            |    |         |--Index Seek(OBJECT:([testdb].[dbo].[Test].[IX_Test_LastName]), SEEK:([testdb].[dbo].[Test].[LastName] >= 'WHITDþ' AND [testdb].[dbo].[Test].[LastName] < 'WHITF'),  WHERE:([testdb].[dbo].[Test].[LastName] like 'WHITE%') ORDERED FORWARD)
            |    |--Parallelism(Repartition Streams, Hash Partitioning, PARTITION COLUMNS:([testdb].[dbo].[Test].[Id]))
            |         |--Index Seek(OBJECT:([testdb].[dbo].[Test].[IX_Test_FirstName]), SEEK:([testdb].[dbo].[Test].[FirstName] >= 'THOMARþ' AND [testdb].[dbo].[Test].[FirstName] < 'THOMAT'),  WHERE:([testdb].[dbo].[Test].[FirstName] like 'THOMAS%' AND PROBE([Bitmap1003],[testdb].[dbo].[Test].[Id],N'[IN ROW]')) ORDERED FORWARD)
            |--Clustered Index Seek(OBJECT:([testdb].[dbo].[Test].[PK__TEST__3214EC073B95D2F1]), SEEK:([testdb].[dbo].[Test].[Id]=[testdb].[dbo].[Test].[Id]) LOOKUP ORDERED FORWARD)

SQL for above:

SELECT * FROM testdb.dbo.Test WHERE LastName LIKE 'WHITE%' AND FirstName LIKE 'THOMAS%'

Based on advice from Mitch, I created an index like this:

CREATE INDEX IX_Test_Name_DOB
ON Test (LastName ASC, FirstName ASC, BirthDate ASC)
INCLUDE (and here I list the other columns)

My searches are now incredibly fast for my typical search (last, first, and birth date).

like image 267
Matthew Talbert Avatar asked Jun 01 '10 01:06

Matthew Talbert


2 Answers

Depends what your LIKE queries look like.

If you are searching for LIKE '%abc%' then no index can be utilised, whereas when searching for LIKE 'abc%' an index can be used. Also, if the index(es) on First and Last name are not 'covering' the emitted query then key lookups (Bookmark Lookups) will be performed and significantly impact performance.

Are your indexes rebuilt regularly?

Do you have an example query plan?

Update: A covering index for a query is one which can be used to perform the WHERE criteria and also has all of the columns required to satisfy the rest of the query such as the SELECT column list.

Using Covering Indexes to Improve Query Performance

Update: Even if you create a composite index on (Lastname, Firstname) (since lastname should be more selective), a lookup for all the other columns (the '*' column list) will still be required into the tables clustered index.

like image 65
Mitch Wheat Avatar answered Sep 30 '22 14:09

Mitch Wheat


I don't like soundex much. I think newer iterations of the algorithm are better, but you are hashing every word in the English language down to a fairly small hash. This tends to generate a ton of false matches over time. I've read that metaphone and it's successor double metaphone are better, but I don't have direct experience with them.

Mitch's coverage of like is pretty thorough, so I'm not going to repeat it.

like image 36
Donnie Avatar answered Sep 30 '22 13:09

Donnie