
Tell me SQL Server Full-Text searcher is crazy, not me

i have some customers with a particular address that the user is searching for:

123 generic way

There are 5 rows in the database that match:

ResidentialAddress1
=============================
123 GENERIC WAY
123 GENERIC WAY
123 GENERIC WAY
123 GENERIC WAY
123 GENERIC WAY

i run an FT query to look for these rows. i'll show you each step as i add more criteria to the search:

SELECT ResidentialAddress1 FROM Patrons
WHERE CONTAINS(Patrons.ResidentialAddress1, '"123*"')

ResidentialAddress1
=========================
123 MAPLE STREET
12345 TEST
123 MINE STREET
123 GENERIC WAY
123 FAKE STREET
...

(30 row(s) affected)

Okay, so far so good, now adding the word "generic":

SELECT ResidentialAddress1 FROM Patrons
WHERE  CONTAINS(Patrons.ResidentialAddress1, '"123*"')
AND CONTAINS(Patrons.ResidentialAddress1, '"generic*"')

ResidentialAddress1
=============================
123 GENERIC WAY
123 GENERIC WAY
123 GENERIC WAY
123 GENERIC WAY
123 GENERIC WAY

(5 row(s) affected)

Excellent. And now i'll add the final keyword that the user wants to make sure exists:

SELECT ResidentialAddress1 FROM Patrons
WHERE  CONTAINS(Patrons.ResidentialAddress1, '"123*"')
AND CONTAINS(Patrons.ResidentialAddress1, '"generic*"')
AND CONTAINS(Patrons.ResidentialAddress1, '"way*"')


ResidentialAddress1            
------------------------------ 

(0 row(s) affected)

Huh? No rows? What if i query for just "way*":

SELECT ResidentialAddress1 FROM Patrons
WHERE CONTAINS(Patrons.ResidentialAddress1, '"way*"')

ResidentialAddress1            
------------------------------ 

(0 row(s) affected)

At first i thought that perhaps it's because of the *, and it's requiring that the root way have more characters after it. But that's not true:

  • Searching for "123*" matches "123"
  • Searching for "generic*" matches "generic"
  • Books Online says: "The asterisk matches zero, one, or more characters"

What if i remove the * just for s&g:

SELECT ResidentialAddress1 FROM Patrons
WHERE CONTAINS(Patrons.ResidentialAddress1, '"way"')

Server: Msg 7619, Level 16, State 1, Line 1
A clause of the query contained only ignored words. 

So one might think that you are just not allowed to even search for way, either alone, or as a root. But this isn't true either:

SELECT * FROM Patrons
WHERE CONTAINS(Patrons.*, '"way*"')

AccountNumber FirstName Lastname
------------- --------- --------
33589         JOHN      WAYNE                    

So to sum up, the user is searching for rows that contain all the words:

123 generic way

Which i, correctly, translate into the WHERE clauses:

SELECT * FROM Patrons
WHERE CONTAINS(Patrons.*, '"123*"')
AND CONTAINS(Patrons.*, '"generic*"')
AND CONTAINS(Patrons.*, '"way*"')

which returns no rows. Tell me this just isn't going to work, that it's not my fault, and SQL Server is crazy.

Note: i've emptied the FT index and rebuilt it.

Update One

SELECT Lastname, ResidentialAddress1 FROM Patrons
WHERE CONTAINS(Patrons.*, '"gen*"')

Lastname                  ResidentialAddress1            
------------------------- ------------------------------ 
SAVE                      123 GENERIC WAY
Genders                   
SAVE                      123 GENERIC WAY
Patron                    123 GENERIC WAY
SAVE                      123 GENERIC WAY
SAVE                      234 GENERIC WAY
SAVE                      123 GENERIC WAY

(7 row(s) affected)

Update Two

Pretending the user typed in:

123 generic wa

SELECT ResidentialAddress1 FROM Patrons
WHERE  CONTAINS(Patrons.ResidentialAddress1, '"123*"')
AND CONTAINS(Patrons.ResidentialAddress1, '"generic*"')
AND CONTAINS(Patrons.ResidentialAddress1, '"wa*"')

ResidentialAddress1            
------------------------------ 

(0 row(s) affected)

The real problem is that the user is typing in something perfectly valid, and they would expect to see what any human being would expect to see.


Update Three

Someone asked for all this, it's not my fault!:

CREATE TABLE [dbo].[Patrons] (
    [PatronGUID]  uniqueidentifier ROWGUIDCOL  NOT NULL ,
    [AccountNumber] [bigint] NULL ,
    [FirstName] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [MiddleInitial] [varchar] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [Lastname] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [EyeColor] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [HairColor] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [Gender] [varchar] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [Birthday] [datetime] NULL ,
    [Height] [int] NULL ,
    [Weight] [int] NULL ,
    [FacialHair] [tinyint] NULL ,
    [Nationality] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [IdentifyingMarks] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [DriversLicenseNumber] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [DriversLicenseRegion] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [DriversLicenseCountry] [varchar] (2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [DriversLicenseExpires] [datetime] NULL ,
    [DriversLicenseDateVerified] [datetime] NULL ,
    [PassportNumber] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [PassportRegion] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [PassportCountry] [varchar] (2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [PassportExpires] [datetime] NULL ,
    [PassportDateVerified] [datetime] NULL ,
    [OtherIdentificationNumber] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [OtherIdentificationRegion] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [OtherIdentificationCountry] [varchar] (2) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [OtherIdentificationType] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [OtherIdentificationExpires] [datetime] NULL ,
    [OtherIdentificationDateVerified] [datetime] NULL ,
    [ResidentialAddress1] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ResidentialAddress2] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ResidentialAddress3] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ResidentialCity] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ResidentialZipCode] [varchar] (15) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ResidentialRegion] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ResidentialCountry] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ResidentialPhoneNumber] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [CountryOfResidence] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessAddress1] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessAddress2] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessAddress3] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessCity] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessRegion] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessZipCode] [varchar] (15) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessCountry] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessName] [varchar] (25) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [BusinessPhone] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [PositionWithFirm] [varchar] (30) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [EmployerTelephone] [varchar] (20) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [MemberCardType] [varchar] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [PlayerStatusCode] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [AccountType] [varchar] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [AccountStatus1] [varchar] (1) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [AccountStatus2] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [IsVIPExchangeRate] [tinyint] NULL ,
    [ChangedUserGUID_Depricated] [uniqueidentifier] NULL ,
    [ChangedUser] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [ChangedDate] [datetime] NULL ,
    [ChangedWorkstation] [varchar] (50) COLLATE SQL_Latin1_General_CP1_CI_AS NULL ,
    [PendingUpdates_Depricated] [varchar] (255) COLLATE SQL_Latin1_General_CP1_CI_AS NULL 
) ON [PRIMARY]
GO

ALTER TABLE [dbo].[Patrons] ADD 
    CONSTRAINT [DF_Patrons_PatronGUID] DEFAULT (newid()) FOR [PatronGUID],
    CONSTRAINT [PK_Patrons] PRIMARY KEY  NONCLUSTERED 
    (
        [PatronGUID]
    ) WITH  FILLFACTOR = 90  ON [PRIMARY] 
GO

if (select DATABASEPROPERTY(DB_NAME(), N'IsFullTextEnabled')) <> 1 
exec sp_fulltext_database N'enable' 

GO

if not exists (select * from dbo.sysfulltextcatalogs where name = N'TheFullTextCatalog')
exec sp_fulltext_catalog N'TheFullTextCatalog', N'create' 

GO

exec sp_fulltext_table N'[dbo].[Patrons]', N'create', N'TheFullTextCatalog', N'PK_Patrons'
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'FirstName', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'MiddleInitial', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'Lastname', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'EyeColor', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'IdentifyingMarks', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialAddress1', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialAddress2', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialAddress3', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialCity', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialZipCode', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialRegion', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialCountry', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'ResidentialPhoneNumber', N'add', 1033  
GO

exec sp_fulltext_column N'[dbo].[Patrons]', N'CountryOfResidence', N'add', 1033  
GO

exec sp_fulltext_table N'[dbo].[Patrons]', N'activate'  
GO

Here are the screenshots for the guy who didn't believe me:

The query that should work but doesn't:

[screenshot not preserved]

The query that works, but isn't useful:

[screenshot not preserved]

The query that works, but isn't useful, with the proof content:

[screenshot not preserved]


Update Four

The query cannot be written as

CONTAINS(Patrons.*, 'words...')

since there are items not logically or physically covered by the FT index. For example, the user queries for:

6/4/2010 ian boyd 619

This presents four keywords:

  • 6/4/2010
  • ian
  • boyd
  • 619

This means they want all the conditions to hold true, with pseudo-code being:

WHERE 6/4/2010 is in the row
AND ian is in the row
AND boyd is in the row
AND 619 is in the row

Which is translated into a partial query of:

WHERE --Keyword 1: 6/4/2010
(
   ((ChangedDate >= '20100604') AND (ChangedDate < '20100605'))
   OR 
   ((LastTransactionDate >= '20100604') AND (LastTransactionDate < '20100605'))
   OR 
   (CONTAINS(Patrons.*, '"6/4/2010*"'))
)
AND --Keyword 2: ian
(
    CONTAINS(Patrons.*, '"ian*"')
)
AND --Keyword 3: boyd
(
    CONTAINS(Patrons.*, '"boyd*"')
)
AND --Keyword 4: 619
(
    (AccountNumber IN (SELECT CAST(619 AS bigint)))
    OR
    (CONTAINS(Patrons.*, '"619*"'))
)

One of the answerers was looking at the simplified example presented in the original question, not the real world. To say that it is incorrect to have multiple AND clauses is naive.

Ian Boyd asked Jun 03 '10 13:06


4 Answers

The message is telling you that "way" is a stopword, which means it's ignored and not indexed. That's why you can find "wayne" but not "way".

So, no, it's not crazy and neither are you. There's just a simple misunderstanding.
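
To see how the word breaker and the stoplist treat each term, newer versions expose sys.dm_fts_parser. This is only a sketch: it assumes SQL Server 2008 or later (the function does not exist on SQL Server 2000), English (LCID 1033, as in the question's index), and the system stoplist (stoplist id 0).

```sql
-- Assumes SQL Server 2008 or later; not available on SQL Server 2000.
-- Arguments: query string, LCID (1033 = English),
--            stoplist id (0 = system stoplist), accent sensitivity (0 = insensitive).
SELECT keyword, display_term, special_term
FROM sys.dm_fts_parser(N'"123 generic way"', 1033, 0, 0);
```

Terms reported with special_term = 'Noise Word' are the ones the stoplist drops; ordinary terms show 'Exact Match'. If way comes back as a noise word, that confirms the diagnosis.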

Steven Sudit answered Sep 22 '22 15:09


You probably used the system stoplist when you created the FT index. The word way happens to be in there. You can see it with this query:

SELECT *
FROM sys.fulltext_system_stopwords
WHERE stopword = 'way'
AND language_id = 1033

You can turn off the stoplist or create a custom one, but a better solution would be to write the query properly; don't use multiple WHERE CONTAINS clauses, combine them into one. Otherwise SQL Server might not be able to use the FT index as effectively.

Your query should look like this instead:

SELECT ResidentialAddress1 FROM Patrons
WHERE  CONTAINS(Patrons.ResidentialAddress1, '"123*" AND "generic*" AND "way*"')

If you do it this way, the stop word simply gets ignored; it'll still return all of the same results it would have returned if you hadn't included the term way*.
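
If the addresses really do need way to be searchable, the stoplist route mentioned above could look roughly like this. It is a sketch under assumptions, not a drop-in script: it needs SQL Server 2008 or later, the stoplist name AddressStoplist is made up for illustration, and changing the stoplist by default starts a repopulation of the index.

```sql
-- Assumes SQL Server 2008 or later. AddressStoplist is a hypothetical name.

-- Start from a copy of the system stoplist...
CREATE FULLTEXT STOPLIST AddressStoplist FROM SYSTEM STOPLIST;

-- ...remove the English (LCID 1033) stopword that is getting in the way...
ALTER FULLTEXT STOPLIST AddressStoplist DROP 'way' LANGUAGE 1033;

-- ...and have the Patrons full-text index use the custom stoplist.
ALTER FULLTEXT INDEX ON dbo.Patrons SET STOPLIST = AddressStoplist;

-- Alternative: index every word by disabling stopwords for this index.
-- ALTER FULLTEXT INDEX ON dbo.Patrons SET STOPLIST = OFF;
```

Dropping a single word keeps the index smaller than turning the stoplist off entirely, at the cost of having to maintain the custom list.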


Edit: Just noticed that you tagged this sql-server-2000, so the first query won't work there; sys.fulltext_system_stopwords only exists in SQL Server 2008 and later. In SQL 2000 they are called "noise words", and I believe the configuration is global: you don't have individual stoplists. Nevertheless, you'll still get results if you write a single WHERE CONTAINS clause instead of several.

To edit the noise words in SQL Server 2000, you have to edit the language-specific file in the SQL Server FTDATA configuration folder. More details are here: SQL Server Full Text Search Noise Words and Thesaurus Configurations.

Aaronaught answered Sep 23 '22 15:09


Solution 1:

You want to try the transform noise words option (SQL Server 2008).

Turning this on makes a full-text query ignore noise words, instead of returning no rows because of them.

example:

sp_configure 'show advanced options', 1
RECONFIGURE
GO
sp_configure 'transform noise words', 1
RECONFIGURE
GO
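
To check what the option is currently set to before changing it, something like the following should work on SQL Server 2008 or later. It is a sketch that reads the standard sys.configurations catalog view; nothing here applies to SQL Server 2000.

```sql
-- value is the configured setting; value_in_use is what the server is running with.
-- 0 = a noise word in the query makes the Boolean condition return no rows (default),
-- 1 = noise words in the query are ignored and the rest of the condition is evaluated.
SELECT name, value, value_in_use
FROM sys.configurations
WHERE name = 'transform noise words';
```

Keep in mind that this option only changes how queries treat noise words; it does not put them into the index.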

Edit 1:

Hopefully there may be something similar for older versions of MS SQL?

Darknight answered Sep 21 '22 15:09


http://www.codinghorror.com/blog/2008/11/stop-me-if-you-think-youve-seen-this-word-before.html

In that post, Jeff mentions you can turn them off.

Josh Smeaton answered Sep 22 '22 15:09