
mysql - fulltext index - what is natural language mode

I have a question regarding this article: http://dev.mysql.com/doc/refman/5.6/en/fulltext-natural-language.html.

Here I found queries like

SELECT * FROM articles
WHERE MATCH (title,body)
AGAINST ('database' IN NATURAL LANGUAGE MODE);

What I don't understand is what exactly natural language mode is. I can't find an exact definition anywhere.

Can anyone provide a definition? How does it work?

zozo asked May 16 '13 14:05


1 Answer

MySQL's natural language full-text search matches a search string against a collection of documents (the rows of your table) and ranks them by relevance. Suppose the search string is "I love pie" and the table holds three documents, d1, d2 and d3. Documents d1 and d2 are about sports and religion respectively, and d3 is about food. A query of the same form as yours, with that search string,

SELECT * FROM articles WHERE MATCH (title,body) AGAINST ('I love pie' IN NATURAL LANGUAGE MODE);

will return d3 first, because it matches the search string best. d2 and d1 follow only if they share at least one indexed word with the search string, ordered by whichever is more similar to it; rows with zero relevance are not returned.

The underlying algorithm MySQL uses is probably tf-idf, where tf stands for term frequency and idf for inverse document frequency. tf is what the name says: the number of times a word from the search string occurs in a given document. idf is based on how many documents contain the word: the more documents a word occurs in, the lower its idf, so words that occur in many documents contribute little to deciding which document is most representative. The product tf*idf is a score; the higher it is, the better the word represents the document. 'pie' occurs only in d3, so it has a high tf there and a high idf (few documents contain it). 'the', by contrast, has a high tf but a low idf, which cancels out the tf and gives a low score.
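
To make the tf*idf idea concrete, here is a minimal sketch in Python. It is not MySQL's actual implementation: the toy corpus, the whitespace tokenizer, and the names tokenize and tf_idf_scores are all assumptions made up for illustration, and the idf here is simply log10(number of documents / number of documents containing the word).

```python
import math
from collections import Counter

# Hypothetical toy corpus standing in for d1 (sports), d2 (religion), d3 (food).
docs = {
    "d1": "the team won the match and the fans love the team",
    "d2": "the sermon was about faith and the community",
    "d3": "i love pie and the pie shop sells apple pie",
}

def tokenize(text):
    # Naive tokenizer (assumption): lowercase and split on whitespace.
    return text.lower().split()

def tf_idf_scores(query, docs):
    """Score each document as the sum of tf * idf over the query words."""
    tokenized = {name: tokenize(text) for name, text in docs.items()}
    n_docs = len(tokenized)
    scores = {}
    for name, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in set(tokenize(query)):
            tf = counts[term]  # occurrences of the term in this document
            df = sum(1 for w in tokenized.values() if term in w)  # documents containing it
            if tf == 0:
                continue
            idf = math.log10(n_docs / df)  # rarer across documents -> larger idf
            score += tf * idf
        scores[name] = score
    return scores

scores = tf_idf_scores("i love pie", docs)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
print(tf_idf_scores("the", docs))
```

In this toy corpus d3 ranks first, because 'pie' is both frequent in d3 and absent elsewhere. A search for 'the' scores zero for every document, since a word that occurs in all documents has an idf of log10(1) = 0.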

MySQL's natural language mode also applies a stopword list (the, some, etc.) and ignores words below a minimum length: three characters by default for InnoDB and four for MyISAM. The page you linked describes both:

Some words are ignored in full-text searches:

Any word that is too short is ignored. The default minimum length of words that are found by full-text searches is three characters for InnoDB search indexes, or four characters for MyISAM. You can control the cutoff by setting a configuration option before creating the index: innodb_ft_min_token_size configuration option for InnoDB search indexes, or ft_min_word_len for MyISAM.

Words in the stopword list are ignored. A stopword is a word such as “the” or “some” that is so common that it is considered to have zero semantic value. There is a built-in stopword list, but it can be overridden by a user-defined list. The stopword lists and related configuration options are different for InnoDB search indexes and MyISAM ones. Stopword processing is controlled by the configuration options innodb_ft_enable_stopword, innodb_ft_server_stopword_table, and innodb_ft_user_stopword_table for InnoDB search indexes, and ft_stopword_file for MyISAM ones.
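
The two rules quoted above can be sketched as a simple token filter. This is only an illustration in Python, not MySQL's parser: the function name indexable_tokens is hypothetical, the stopword set contains just the two words named in the quote rather than a real built-in list, and the min_len argument stands in for innodb_ft_min_token_size or ft_min_word_len.

```python
# Hypothetical, abbreviated stopword list: only the two words named in the quote.
# The real built-in lists are longer and differ between InnoDB and MyISAM.
STOPWORDS = {"the", "some"}

def indexable_tokens(text, min_len, stopwords=STOPWORDS):
    """Keep the words that survive the minimum-length and stopword rules."""
    tokens = text.lower().split()  # naive whitespace tokenizer (assumption)
    return [t for t in tokens if len(t) >= min_len and t not in stopwords]

text = "I love the pie"
innodb_like = indexable_tokens(text, min_len=3)  # InnoDB default minimum length
myisam_like = indexable_tokens(text, min_len=4)  # MyISAM default minimum length
print(innodb_like)  # ['love', 'pie']
print(myisam_like)  # ['love']
```

Note how 'pie' survives with a minimum length of three but is dropped with a minimum length of four, so the same search can behave differently depending on the storage engine's defaults.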

Samir Alajmovic answered Sep 27 '22 18:09