TF-IDF (term frequency - inverse document frequency) is a staple of information retrieval. It's not a proper model, though, and it seems to break down when new terms are introduced into the corpus. How do people handle queries or new documents that contain new terms, especially if those terms are high frequency? Under traditional cosine matching, those terms would have no impact on the match score.
TF-IDF has several limitations:
- It computes document similarity directly in the word-count space, which can be slow for large vocabularies.
- It assumes that the counts of different words provide independent evidence of similarity.
- It makes no use of semantic similarities between words.
Bag of Words simply creates vectors containing the count of each word's occurrences in a document, while the TF-IDF model also carries information about which words are more important and which are less so.
As its name implies, TF-IDF scores a word by multiplying the word's Term Frequency (TF) by its Inverse Document Frequency (IDF). Term Frequency: the TF of a term is the number of times the term appears in a document divided by the total number of words in that document. Inverse Document Frequency: the IDF of a term is the logarithm of the total number of documents in the corpus divided by the number of documents containing that term.
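To make the definitions concrete, here is a minimal sketch of the textbook computation in Python (no smoothing; real libraries such as scikit-learn use slightly different weighting and normalization):

```python
import math

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term divided by document length.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # Inverse document frequency: log(total docs / docs containing the term).
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing) if n_containing else 0.0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [["weasel", "goat"], ["cheese", "gopher"]]
print(tf_idf("weasel", corpus[0], corpus))  # 0.5 * ln(2/1) ≈ 0.347
```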
TF-IDF is also used by search engines to weigh terms when ranking content. For example, when you search for "Coke" on Google, Google may use TF-IDF-style weighting to figure out whether a page titled "COKE" is about Coca-Cola or about some other sense of the word.
Er, nope, doesn't break down.
Say I have two documents, A "weasel goat" and B "cheese gopher". If we actually represented these as vectors over the vocabulary (weasel, goat, cheese, gopher), they might look something like:
A [1,1,0,0]
B [0,0,1,1]
and if we've allocated these vectors in an index file, yeah, we've got a problem when it comes time to add a new term. But the trick is, that vector never exists. The key is the inverted index, which maps each term to the list of documents containing it.
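Here's a minimal sketch of that idea in Python (a hypothetical, simplified index structure): adding a document with brand-new terms just creates new posting lists and never touches the existing ones.

```python
from collections import defaultdict

# term -> {doc_id: term_count}: the classic inverted-index layout
index = defaultdict(dict)

def add_document(doc_id, text):
    for term in text.split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

add_document("A", "weasel goat")
add_document("B", "cheese gopher")

# A document with unseen terms only adds new posting lists;
# nothing already stored has to be resized or re-vectorized.
add_document("C", "marmoset kungfu")
print(index["marmoset"])  # {'C': 1}
```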
As far as new terms not affecting a cosine match, that might be true depending on what you mean. If I search my corpus of (A, B) with the query "marmoset kungfu", neither marmoset nor kungfu exists in the corpus. So the vector representing my query will be orthogonal to every document in the collection and get a cosine similarity of zero. But considering none of the terms match, that seems pretty reasonable.
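A quick sketch of that behavior, building the vectors on the fly purely for illustration:

```python
import math

def vectorize(tokens, vocab):
    return [tokens.count(term) for term in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["weasel", "goat", "cheese", "gopher"]
doc_a = vectorize("weasel goat".split(), vocab)
query = vectorize("marmoset kungfu".split(), vocab)  # all zeros: both terms are OOV
print(cosine(doc_a, query))  # 0.0 -- no shared terms, no similarity
```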
When you talk about "break down" I think you mean that the new terms have no impact on the similarity measure, because they do not have any representation in the vector space defined by the original vocabulary.
One approach to handling this smoothing problem would be to fix the vocabulary to a smaller set and treat all words rarer than a certain threshold as instances of a special _UNKNOWN_ word.
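A minimal sketch of that idea (the threshold and token name here are arbitrary choices):

```python
from collections import Counter

def build_vocab(corpus_tokens, min_count=2):
    # Keep only words at or above the frequency threshold.
    counts = Counter(t for doc in corpus_tokens for t in doc)
    return {t for t, c in counts.items() if c >= min_count}

def map_unknown(doc_tokens, vocab):
    # Replace out-of-vocabulary words with a shared _UNKNOWN_ token,
    # so new or rare terms still get some representation.
    return [t if t in vocab else "_UNKNOWN_" for t in doc_tokens]

vocab = build_vocab([["weasel", "goat", "goat"], ["cheese", "gopher", "goat"]])
print(map_unknown("marmoset goat".split(), vocab))  # ['_UNKNOWN_', 'goat']
```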
However, I don't think your definition of "break down" is very clear; if you can spell out exactly what you mean, perhaps we can discuss ways to work around those problems.