I have a database with fields that can contain long phrases of words. I wanted the ability to quickly search for a keyword or phrase in these columns, but when searching a phrase, I want to be able to search the phrase like Google would, returning all rows that contain all of the specified words, but in no particular order or "nearness" to each other. Ranking the results by relevance is unnecessary at this point.
After reading about SQL Server's Full-Text Search, I thought it would be just what I needed: a searchable index based on each word in a text-based column. My end goal is to safely accept user input and turn it into a query that leverages the speed of Full-Text Search, while maintaining ease-of-use for the users.
I see the FREETEXT
function can take an entire phrase, break it up into "useful" words (ignoring words like 'and', 'or', 'the', etc), and then return a list of matching rows very quickly, even with a complex search term. But when you try to use it, you may notice that instead of an AND
search for each of the terms, it seems to only do an OR
search. Maybe there's a way to change its behavior, but I haven't found anything useful.
Then there's CONTAINS
, which can accept a boolean query phrase, but sometimes with odd results.
Take a look at the following queries on this table:
PKID Name
----- -----
1 James Kirk
2 James Cameron
3 Kirk Cameron
4 Kirk For Cameron
Q1: SELECT Name FROM tblName WHERE FREETEXT(Name, 'james')
Q2: SELECT Name FROM tblName WHERE FREETEXT(Name, 'james kirk')
Q3: SELECT Name FROM tblName WHERE FREETEXT(Name, 'kirk for cameron')
Q4: SELECT Name FROM tblName WHERE CONTAINS(Name, 'james')
Q5: SELECT Name FROM tblName WHERE CONTAINS(Name, '"james kirk"')
Q6: SELECT Name FROM tblName WHERE CONTAINS(Name, '"kirk james"')
Q7: SELECT Name FROM tblName WHERE CONTAINS(Name, 'james AND kirk')
Q8: SELECT Name FROM tblName WHERE CONTAINS(Name, 'kirk AND for AND cameron')
SELECT Name FROM tblName WHERE FREETEXT(Name, 'james')
Returns "James Kirk" and "James Cameron". Alright, lets narrow it down...
SELECT Name FROM tblName WHERE FREETEXT(Name, 'james kirk')
Guess what. Now you'll get "James Kirk", "James Cameron", and "Kirk For Cameron". Same thing happens for Query 3, so let's just skip that.
SELECT Name FROM tblName WHERE CONTAINS(Name, 'james')
Same results as Query 1. Okay. Narrow the results maybe...?
SELECT Name FROM tblName WHERE CONTAINS(Name, '"james kirk"')
After discovering that you need to enclose the string in double-quotes if there are spaces, I find that this query works great on this particular dataset for the results I desire! Only "James Kirk" is returned. Wonderful! Or is it...
SELECT Name FROM tblName WHERE CONTAINS(Name, '"kirk james"')
Crap. No. It is matching that exact phrase. Hmmm... After checking the syntax for T-SQL's CONTAINS function, I see that you can throw boolean keywords in there, and it looks like that might be the answer. Let's see...
SELECT Name FROM tblName WHERE CONTAINS(Name, 'james AND kirk')
Neat. I get all three results, as expected. Now I just write a function to cram the word AND
between all the words. Done, right? What now...
SELECT Name FROM tblName WHERE CONTAINS(Name, 'kirk AND for AND cameron')
This query knows exactly what it's looking for, except for some reason, there are no results. Why? Well after reading about Stopwords and Stoplists, I will make an educated guess and say that because I'm asking for the intersection of the index results for "kirk", "for", and "cameron", and the word "for" will not have any results (what with it being a stopword and all), then the result of any intersection with that result is also empty. Whether or not it actually functions like that is irrelevant to me, since that is the observable behavior of the CONTAINS
function every time I do a boolean search with a stopword in there.
So I need a new solution.
NEAR
Looks promising. If I can take a user query and put commas between it, this will... wait this is the same thing as using boolean AND
in CONTAINS
queries. But does it ignore stopwords correctly?
SELECT Name FROM tblName WHERE CONTAINS(Name, 'NEAR(kirk, for, cameron)')
Nope. No results. Remove the word "for", and you get all three results again. :(
To implement a full-text search in a SQL database, you must create a full-text index on each column you want to be indexed. In MySQL, this would be done with the FULLTEXT keyword. Then you will be able to query the database using MATCH and AGAINST.
How can I tell if Full-Text Search is enabled on my SQL Server instance? A: You can determine if Full-Text Search is installed by querying the FULLTEXTSERVICEPROPERTY like you can see in the following query. If the query returns 1 then Full-Text Search is enabled.
Have you looked at using the Semantic Index functions in SQL Server 2012?
They are built on full text indexes but extend them to include details about word frequency. I used them just recently to build a word cloud and it was really good.
There are some good articles to be found on the internet and you can also search for words that are 'near' each other in docs. I set up the full text index across 2 nvarchar columns and then enable sematic indexing.
These links will get you started but I think it will give you what you need.
Setting up Sematic indexes
Some good info
I am using ISAbout with combination of THESAURUS, INFLECTIONAL and Wildcards
Advantages are
1 - Order of words in the search string doesn't matter
2 - Search similar words (THESAURUS)
3 - Treat runs,running,ran,run as same (INFLECTIONAL)
4 - Near doesn't return result if search string has even one element which isn't in the resulting string, but ISAbout will always return the most desirable result
5 - You can set weight of different words, which will further help you to optimize correctness of your result
SELECT K.RANK, name, Description
FROM Diagnosis AS C
INNER JOIN
CONTAINSTABLE(diagnosis,name,<br> 'isAbout(FORMSOF (THESAURUS, "CHRONIC") weight(1.0),FORMSOF (INFLECTIONAL, "CHRONIC") weight(1.0),CHRONIC* weight(1.0)
FORMSOF (THESAURUS, "FAILURE") weight(1.0),FORMSOF (INFLECTIONAL, "FAILURE") weight(1.0),FAILURE* weight(1.0),
FORMSOF (THESAURUS, "DIASTOLIC") weight(1.0),FORMSOF (INFLECTIONAL, "DIASTOLIC") weight(1.0),DIASTOLIC* weight(1.0))')
AS K
ON C.ID = K.[KEY];
I am still looking for ways to optimize it.
Note : I remove stop words from my search string programmatically.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With