I would like to know whether full text search in PostgreSQL 9.3 with a GIN/GiST index uses tf-idf (term frequency-inverse document frequency).
In particular, my columns of phrases contain some words that are very common and others that are quite unique (e.g., names). I want to index these columns so that matches on the unique words are weighted higher than matches on common words.
No, Postgres does not use TF-IDF as a similarity measure among documents.
ts_rank is higher if a document contains query terms more frequently. It does not take into account the global frequency of the term.
ts_rank_cd is higher if a document contains query terms closer together and more frequently. It does not take into account the global frequency of the term.
There is an extension from the text search creators, called smlar, that lets you calculate the similarity between arrays using TF-IDF. It also lets you turn tsvectors into arrays, and supports fast indexing.
No. Within the ts_rank function, there is no native method to rank results using their global (corpus) frequency. The rank algorithm does, however, rank based on frequency within the document:
http://www.postgresql.org/docs/9.3/static/textsearch-controls.html
So if I search for "dog|chihuahua", the following two documents would have the same rank, even though "chihuahua" is the rarer word in the corpus:
"I want a dog"
"I want a chihuahua"
However, the following line would be ranked higher than the two lines above, because it contains the stemmed token "dog" twice in the document:
"dog lovers have an average of 1.5 dogs"
In short: higher term frequency within the document results in a higher rank, but a lower term frequency in the corpus has no impact.
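To make the difference concrete, here is a minimal Python sketch that scores the three example documents two ways: by a plain count of query terms in the document, and by that count weighted with an inverse document frequency. This is a toy illustration only, not how Postgres computes ts_rank; the tokenize helper, its crude plural stripping, and the log(N / df) form of IDF are all assumptions made for the example.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase, split on whitespace, and crudely strip a plural 's'.

    A stand-in for a real stemmer (assumption), NOT the Postgres parser.
    """
    tokens = []
    for word in text.lower().split():
        if len(word) > 3 and word.endswith("s"):
            word = word[:-1]
        tokens.append(word)
    return tokens


docs = [
    "I want a dog",
    "I want a chihuahua",
    "dog lovers have an average of 1.5 dogs",
]
query = ["dog", "chihuahua"]  # mirrors the OR query dog|chihuahua

tokenized = [tokenize(d) for d in docs]

# Document frequency: in how many documents does each term occur?
df = Counter(term for tokens in tokenized for term in set(tokens))


def idf(term):
    # Rare terms get a larger weight; terms absent from the corpus get 0.
    return math.log(len(docs) / df[term]) if df[term] else 0.0


def tf_score(tokens):
    # Within-document frequency only: corpus-wide rarity is ignored.
    counts = Counter(tokens)
    return sum(counts[t] for t in query)


def tfidf_score(tokens):
    # Same counts, but each term is weighted by its corpus-wide rarity.
    counts = Counter(tokens)
    return sum(counts[t] * idf(t) for t in query)


tf_scores = [tf_score(t) for t in tokenized]
tfidf_scores = [tfidf_score(t) for t in tokenized]

for doc, tf, tfidf in zip(docs, tf_scores, tfidf_scores):
    print(f"tf={tf}  tf-idf={tfidf:.3f}  {doc}")
```

With the plain count, the first two documents tie and the third comes out on top, which matches the behaviour described above. With the IDF weighting, the "chihuahua" document scores highest, because that term occurs in only one of the three documents.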
One caveat: the text search does ignore stop words, so you will not match on ultra-high-frequency words like "the", "a", "of", "for", etc. (assuming you have set your language correctly).