Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

SQL Server Full Text Search around numbers and underscores

Using SQL Server 2012 (in general using SQL Server 2008 R2 up to SQL Server 2016)

This question is a more specific re wording of SQL-Server Full Text Index Unexpected results. Please see here for how we have got to this point and what has been tried already.

I am re-posting now we have found the specific fault. Many thanks to @HoneyBadger.

His help has been invaluable getting to this point.

Table structure:

CREATE TABLE TestFullTextSearch (Id INT NOT NULL, AllText NVARCHAR(400))

CREATE UNIQUE INDEX test_tfts ON TestFullTextSearch(Id)

CREATE FULLTEXT CATALOG ftcat_tfts

CREATE FULLTEXT INDEX ON TestFullTextSearch(AllText)
    KEY INDEX test_tfts ON ftcat_tfts
    WITH CHANGE_TRACKING AUTO, STOPLIST OFF

Data:

INSERT INTO TestFullTextSearch 
VALUES (1, ' 123_456 789 '), (2, ' 789 123_456 '),
       (3, ' 123_456 ABC '), (4, ' ABC 123_456 ')

Please note this data is purely to demonstrate the issue, and is not indicative of a live data set. Our live data sets can be over 500,000 rows, with paragraphs of data in a single field being searched - hence using full text searches.

Select 1: Results as expected

SELECT *
FROM TestFullTextSearch
WHERE CONTAINS (AllText, '"123*"')

Id          AllText
----------- ------------
1            123_456 789
2            789 123_456
3            123_456 ABC
4            ABC 123_456

SELECT 2: Misses row 2 in the result set

SELECT *
FROM TestFullTextSearch
WHERE CONTAINS (AllText, '"123_*"')

Id          AllText
----------- ------------
1            123_456 789
3            123_456 ABC
4            ABC 123_456

SELECT 3: Only returns row 2

SELECT *
FROM TestFullTextSearch
WHERE CONTAINS (AllText, '"123\_*"')

Id          AllText
----------- ------------
2            789 123_456

Conclusion: Searching for a number string suffixed with an underscore fails if preceding word is a number string.

Problem: Our clients use full text search and expect a result to be forth coming around part numbers and catalogue references, which may or may not be in a section of text including other number strings. Full text search does not seem to support this in a consistent manner.

Any help gratefully received.

NB: This issue does not happen on SQL SERVER 2008, but does on 2012+

I have also tried switching to the older version of FTS parser. Testing with

SELECT * FROM sys.dm_fts_parser (' "789 123_456" ',1033,0,0)
SELECT * FROM sys.dm_fts_parser (' "789 123_456" ',2057,0,0)

I had with current parser: enter image description here

And after reverting to the legacy parser: enter image description here

So it has had an effect, however I'm still getting the same results.

Are there any other differences in the full text search between 2008 & 2012 that could be having this effect?

like image 850
Matthew Baker Avatar asked Feb 15 '18 07:02

Matthew Baker


People also ask

How do you do a full text search in SQL Server?

The SQL Server process uses the following components for full-text search: User tables. These tables contain the data to be full-text indexed. Full-text gatherer. The full-text gatherer works with the full-text crawl threads. Thesaurus files. These files contain synonyms of search terms. Stoplist objects.

What is full text index in SQL Server?

A full-text index stores information about significant words and their location within one or more columns of a database table. A full-text index is a special type of token-based functional index that is built and maintained by the Full-Text Engine for SQL Server.

How is the full text portion of a query performed?

The full text portion of a query is performed by the full text engine. When a full text crawl is initiated, the full text engine (one of the SQL Server Process Components) pushes large batches of data into memory and invokes the filter daemon host for further processing. The filter daemon host is a process that is started by the Full-Text Engine.

What is the difference between like and full text search in SQL?

In contrast to full-text search, the LIKE Transact-SQL predicate works on character patterns only. Also, you cannot use the LIKE predicate to query formatted binary data. Furthermore, a LIKE query against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data.


2 Answers

They changed the full text parsers/stemmers between SQL 2008 and SQL 2012.

With a registry change, you can use the legacy parser, which should work better in your situation.

See https://technet.microsoft.com/en-us/library/gg509108(v=sql.110).aspx for details.

If you need to support both old and new style, then you can revert US English to the old and keep UK English the new (or vice versa)

Using SQL 2016, I reverted UK English and kept US English the same:

exec sp_help_fulltext_system_components 'wordbreaker', 1033

exec sp_help_fulltext_system_components 'wordbreaker', 2057

Returns: Screenshot showing fts components

I created another table using UK English and populated it.

CREATE TABLE TestFullTextSearch2 (Id INT NOT NULL, AllText NVARCHAR(400))

CREATE UNIQUE INDEX test_tfts2 ON TestFullTextSearch2(Id)


CREATE FULLTEXT INDEX ON TestFullTextSearch2(AllText language 2057)
    KEY INDEX test_tfts2 ON ftcat_tfts 
    WITH CHANGE_TRACKING AUTO, STOPLIST OFF

    INSERT INTO TestFullTextSearch2
VALUES (1, ' 123_456 789 '), (2, ' 789 123_456 '),
       (3, ' 123_456 ABC '), (4, ' ABC 123_456 ')

I'm getting the expected 4 results for all 3 queries.

Results of FTS queries

Verify that your changes have taken effect.

exec sp_help_fulltext_system_components 'wordbreaker', 1033

exec sp_help_fulltext_system_components 'wordbreaker', 2057

select t.name, c.* from sys.tables t inner join sys.fulltext_index_columns c on t.object_id = c.object_id
like image 51
StrayCatDBA Avatar answered Sep 23 '22 10:09

StrayCatDBA


The problem here is basically a difference in how MSSQL 2012 stores the index and how the query itself is handling the underscore _.

This becomes clear when inspecting the index keywords & fts parser. For row 2 the keyword 123_456 is not stored as such, because of the numeric value coming in front of it. However the fts parser will search for an exact match on "123_" and does not remove the underscore.

select * from sys.dm_fts_index_keywords_by_document
( 
    DB_ID('TestDatabase'),     
    OBJECT_ID('TestFullTextSearch') 
) order by document_id

select * from sys.dm_fts_parser('"123_*"', 0, 0, 0)

One solution would be to change the word breaker for a specific language. You could easily replace it with the word breaker dll from MSSQL 2008 or 2016 where this problem does not occur. (e.g. take the one for Neutral Language: NaturalLanguage6.dll). Make sure to create the Full Text index for the same language.

To find the registered word breakers and where dll's are located use this query:

EXEC sp_help_fulltext_system_components 'wordbreaker';
like image 40
redrobot Avatar answered Sep 20 '22 10:09

redrobot