
Does PostgreSQL use tf-idf?

I would like to know whether full text search in PostgreSQL 9.3 with a GIN/GiST index uses tf-idf (term frequency-inverse document frequency).

In particular, my columns of phrases contain some words that are very common and others that are quite unique (e.g., names). I want to index these columns so that matches on the unique words are weighted higher than matches on common words.

AdamNYC asked Aug 18 '13 06:08




2 Answers

No, Postgres does not use TF-IDF as a similarity measure among documents.

ts_rank is higher if a document contains query terms more frequently. It does not take into account the global frequency of the term.

ts_rank_cd is higher if a document contains query terms closer together and more frequently. It does not take into account the global frequency of the term.

There is an extension from the text search creators called smlar that lets you calculate the similarity between arrays using TF-IDF. It also lets you turn tsvectors into arrays, and supports fast indexing.
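To make the idea of TF-IDF similarity between arrays concrete, here is a minimal Python sketch. It is not smlar's implementation or API; the function names (idf_table, tfidf_vector, cosine) and the sample corpus are made up for illustration, and the weighting is the plain textbook count * log(N / df) form.

```python
import math
from collections import Counter

def idf_table(corpus):
    """Map each term to log(N / df); df = number of documents containing the term."""
    n_docs = len(corpus)
    doc_freq = Counter(term for doc in corpus for term in set(doc))
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}

def tfidf_vector(doc, idf):
    """Weight each term by (count in this array) * (its IDF in the corpus)."""
    counts = Counter(doc)
    return {term: count * idf.get(term, 0.0) for term, count in counts.items()}

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

# Hypothetical corpus of already-tokenized phrases.
corpus = [
    ["john", "smith", "new", "york"],
    ["jane", "smith", "new", "york"],
    ["john", "doe", "new", "jersey"],
    ["alice", "jones", "new", "york"],
]
idf = idf_table(corpus)
vectors = [tfidf_vector(doc, idf) for doc in corpus]

# "new" occurs in every phrase, so its IDF is zero and only "john" matters.
query = tfidf_vector(["john", "new"], idf)
scores = [cosine(query, vector) for vector in vectors]
print(scores)
```

In this sketch a word that appears in every row contributes nothing to the similarity, which is the behaviour the question asks for: matches on rare words dominate matches on common ones.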

Neil McGuigan answered Oct 01 '22 05:10


No. The ts_rank function has no native way to rank results by their global (corpus) frequency. The ranking algorithm does, however, take term frequency within the document into account:

http://www.postgresql.org/docs/9.3/static/textsearch-controls.html

So if I search for "dog|chihuahua", the following two documents would have the same rank, even though "chihuahua" is the rarer word:

"I want a dog"
"I want a chihuahua"

However, the following document would be ranked higher than the two above, because it contains the stemmed token "dog" twice:

"dog lovers have an average of 1.5 dogs"

In short: higher term frequency within the document results in a higher rank, but a lower term frequency in the corpus has no impact.
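The contrast can be sketched in Python. This is a toy model, not the formula ts_rank uses: local_score simply counts query-term occurrences in one document, tfidf_score adds a textbook log(N / df) corpus weight, and tokenize uses a crude trailing-"s" stemmer as a stand-in for a real text search configuration. All three names are hypothetical.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase, keep alphabetic words, and strip a trailing 's' (toy stemmer)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w[:-1] if len(w) > 3 and w.endswith("s") else w for w in words]

def local_score(doc_tokens, query_terms):
    """Score by term frequency inside the document only."""
    counts = Counter(doc_tokens)
    return sum(counts[term] for term in query_terms)

def tfidf_score(doc_tokens, query_terms, corpus_tokens):
    """Score by term frequency weighted by inverse document frequency."""
    counts = Counter(doc_tokens)
    n_docs = len(corpus_tokens)
    score = 0.0
    for term in query_terms:
        df = sum(1 for doc in corpus_tokens if term in doc)
        if df:
            score += counts[term] * math.log(n_docs / df)
    return score

docs = [
    "I want a dog",
    "I want a chihuahua",
    "dog lovers have an average of 1.5 dogs",
]
corpus_tokens = [tokenize(d) for d in docs]
query = ["dog", "chihuahua"]  # stands in for the OR query dog|chihuahua

local = [local_score(tokens, query) for tokens in corpus_tokens]
weighted = [tfidf_score(tokens, query, corpus_tokens) for tokens in corpus_tokens]
print(local)     # document-local counts: first two documents tie
print(weighted)  # corpus-weighted: the "chihuahua" document scores highest
```

With only local counts the first two documents tie and the third wins. With the corpus weight added, the "chihuahua" document moves to the top because the word occurs in only one of the three documents.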

One caveat: text search does ignore stop words, so you will not match on ultra-high-frequency words like "the", "a", "of", "for", etc. (assuming you have set your language correctly).
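The effect of stop words can be illustrated with a short Python sketch. The STOP_WORDS set below contains only the four words named above and is an assumption for illustration; the real list depends on the configured language, and index_terms and matches are hypothetical helpers, not PostgreSQL functions.

```python
# Hypothetical stop-word list: just the examples from the text.
STOP_WORDS = {"the", "a", "of", "for"}

def index_terms(text):
    """Return the words that would be kept for matching after dropping stop words."""
    return [word for word in text.lower().split() if word not in STOP_WORDS]

def matches(text, query_word):
    """A query word matches only if it survived stop-word removal."""
    return query_word.lower() in index_terms(text)

phrase = "The name of a dog"
print(index_terms(phrase))
print(matches(phrase, "the"))
print(matches(phrase, "dog"))
```

Because the stop words never reach the index in this model, searching for one of them cannot match, no matter how often it appears in the text.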

mgoldwasser answered Oct 01 '22 06:10