Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Ranking method used by SQL Server for Fulltext indexing

I'm having some problems with the ranking used by fulltext search in SQL Server.

Suppose a user searches for two words, "foo bar". We assume the user intends to do an OR search, so we pass "foo OR bar" to our CONTAINSTABLE call. What we're finding is that an row that contains "foo" 10 times but does not contain "bar" will have a much higher rank then an row that that has both "foo" and "bar". We would want the row that has both terms to be preferred over a row that only has one term multiple times. Any advice on how to achieve this?

I have found documentation related to a RANKMETHOD modifier, but that seems to be for SQL Enterprise search only, and not available on regular SQL Server 2005 installs. We could also switch to Lucene.Net, but I would hope to verify it has the better ranking available. We might also do multiple searches and merge the results, but that seems undesireable as the number of words searched for increases.

like image 734
Frank Schwieterman Avatar asked Oct 28 '08 20:10

Frank Schwieterman


3 Answers

RANKMETHOD is not available to any edition of SQL Server. If you are building the keywords to pass into the CONTAINSTABLE search and you know something about which words were more "important" you can rank the words with ISABOUT and WEIGHT.

For instance if foo has a higher rank than bar, you could do something like this:

SELECT * FROM CONTAINSTABLE (YourFullTextTable, *, 'ISABOUT("foo" WEIGHT (0.5), "bar" WEIGHT(0.3))')

like image 62
Coolcoder Avatar answered Sep 22 '22 03:09

Coolcoder


Try rewritting the query SELECT * FROM CONTAINSTABLE (YourFullTextTable, *, '"foo bar" or "foo" ~ "bar" or "foo or bar"') . This query in essence must rank any records with exact "foo bar" > "foo" near "bar" > foo or bar

like image 29
TestCheck Avatar answered Sep 20 '22 03:09

TestCheck


I did a short discussion on Inverted Indexes and Cosine Ranking (the core algorithms and data structures of full-text retrieval systems) Here.

On a pure cosine ranking, foo (1), bar (1) should be closer to the vector occupied by foo, bar than foo (10), bar (0). One can fiddle the cosine value post-hoc, which is essentially what Pagerank does. If the full-text search is fiddling the cosine ranking after calculating the dot product rather than weighting the axes, this would explain the disproportionate significance of foo.

like image 40
ConcernedOfTunbridgeWells Avatar answered Sep 19 '22 03:09

ConcernedOfTunbridgeWells