
Is it possible to use a full-text index to find the closest matching strings? What does Statistical Semantics do in full-text indexing?

I am looking at SQL Server 2016 full-text indexes, and they are great for finding strings that contain multiple words.

When I try to create the full-text index, it shows Statistical Semantics as a checkbox. What does Statistical Semantics do?

Moreover, I want to support "did you mean" queries.

For example, let's say I have a record house, and the user types hause.

Can I use a full-text index to return house as the closest match for hause, and efficiently show the user "did you mean house"? Thank you.

I have tried SOUNDEX, but the results it generates are terrible: it returns far too many unrelated words.

Since there are so many records in my database and I need very fast results, I need something that SQL Server supports natively.

Any ideas? Is there any way to achieve this using indexes?

I know there are several algorithms, but they are not efficient enough for me to use online; for example, calculating the edit distance between the search term and every record. They could be used for offline projects, but I need this to be efficient in an online dictionary that will receive thousands of requests constantly.

I already have a plan in mind: store the not-found search terms in the database, calculate their closest matches offline, and use those results as a cache. However, I wonder whether an online/live solution exists. Consider that there will be over 100 million nvarchar records.

MonsterMMORPG asked Mar 22 '17 09:03




1 Answer

The short answer is no: Full Text Search cannot search for words that are similar but different.

Full Text Search uses stemmers and thesaurus files:

The stemmer generates inflectional forms of a particular word based on the rules of that language (for example, "running", "ran", and "runner" are various forms of the word "run").

A Full-Text Search thesaurus defines a set of synonyms for a specific language.

Both the stemmers and the thesaurus are configurable, and you can easily have Full Text Search match house for a search on hause, but only if you add hause as a synonym for house. This is obviously a non-solution, as it requires you to add every possible typo as a synonym.

Semantic search is a different topic: it allows you to search for documents that are semantically close to a given example.

What you want is to find records that have a short Levenshtein distance from a given word (also known as 'fuzzy' search). I don't know of any technique for creating an index that can answer a Levenshtein search. If you're willing to scan the entire table for each term, T-SQL and CLR implementations of Levenshtein exist.
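
To make the full-scan approach concrete, here is a minimal sketch in Python rather than T-SQL or CLR. It is only an illustration, not one of the existing implementations mentioned above: the function names `levenshtein` and `closest_matches` are hypothetical, and a plain in-memory list stands in for the table.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance where insert, delete, and substitute each cost 1."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ch_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep if equal)
            ))
        previous = current
    return previous[-1]


def closest_matches(term, words, max_distance=2):
    """Scan every word; return (word, distance) pairs within max_distance, nearest first."""
    scored = ((word, levenshtein(term, word)) for word in words)
    within = [pair for pair in scored if pair[1] <= max_distance]
    return sorted(within, key=lambda pair: (pair[1], pair[0]))


# Hypothetical stand-in for the dictionary table.
words = ["house", "horse", "table"]
print(closest_matches("hause", words, max_distance=1))
```

Every lookup computes a distance against every word, so the cost grows linearly with the number of records. That is why this kind of scan suits offline precomputation better than a live query over a very large table.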

Remus Rusanu answered Oct 24 '22 14:10