We have an application that using a SQL Server 2008 database, and full-text search. I'm trying to understand why the following searches behave differently:
First, a phrase containing a hyphenated word, like this:
contains(column_name, '"one two-three-four five"')
And second, an identical phrase, where the hyphens are replaced by spaces:
contains(column_name, '"one two three four five"')
The full-text index uses the ENGLISH (1033) locale, and the default system stoplist.
From my observations of other full-text searches containing hyphenated words, the first one should allow for matches on either one two three four five
or one twothreefour five
. Instead, it only matches one twothreefour five
(and not one two-three-four five
).
Test Case
Setup:
create table ftTest
(
Id int identity(1,1) not null,
Value nvarchar(100) not null,
constraint PK_ftTest primary key (Id)
);
insert ftTest (Value) values ('one two-three-four five');
insert ftTest (Value) values ('one twothreefour five');
create fulltext catalog ftTest_catalog;
create fulltext index on ftTest (Value language 1033)
key index PK_ftTest on ftTest_catalog;
GO
Queries:
--returns one match
select * from ftTest where contains(Value, '"one two-three-four five"')
--returns two matches
select * from ftTest where contains(Value, '"one two three four five"')
select * from ftTest where contains(Value, 'one and "two-three-four five"')
select * from ftTest where contains(Value, '"one two-three-four" and five')
GO
Cleanup:
drop fulltext index on ftTest
drop fulltext catalog ftTest_catalog;
drop table ftTest;
http://support.microsoft.com/default.aspx?scid=kb;en-us;200043
"Where non-alphanumeric character must be used in the search critera (primarily the dash '-' character), use the Transact-SQL LIKE clause instead of the FULLTEXT or CONTAINS predicates."
In cases like these where you can't anticipate the behavior of the word-breaker it's always a good idea to run sys.dm_fts_parser on your strings to get an idea of how the words are going to be split and stored in the internal index.
For instance, running sys.dm_fts_parser on '"one two-three-four five"' results in the following -
select * from sys.dm_fts_parser('"one two-three-four five"', 1033, NULL, 0)
--edited--
1 0 1 Exact Match one
1 0 2 Exact Match two-three-four
1 0 2 Exact Match two
1 0 3 Exact Match three
1 0 4 Exact Match four
1 0 5 Exact Match five
As you can see from the returned results, the word-breaker parses the string and outputs six forms which may explain the results you see when running your CONTAINS query.
A full-text search considers a word to be a string of characters without spaces or punctuation. The occurrence of a non-alphanumeric character can "break" a word during a search. Because the SQL Server full-text search is a word-based engine, punctuation generally is not considered and is ignored when searching the index. Therefore, a CONTAINS clause like 'CONTAINS(testing, "computer-failure")' would match a row with the value, "The failure to find my computer would be expensive.".
Please, follow the link for WHY:https://support.microsoft.com/en-us/kb/200043
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With