TF-IDF (term frequency - inverse document frequency) is a staple of information retrieval. It's not a proper model, though, and it seems to break down when new terms are introduced into the corpus. How do people handle queries or new documents that contain new terms, especially if those terms are high frequency? Under traditional cosine matching, those terms would have no impact on the match score.
TF-IDF has several limitations:
- It computes document similarity directly in the word-count space, which can be slow for large vocabularies.
- It assumes that the counts of different words provide independent evidence of similarity.
- It makes no use of semantic similarities between words.
Bag of Words simply creates vectors containing the count of each word's occurrences in a document, while the TF-IDF model also carries information about which words are more important and which are less so.
As its name implies, TF-IDF scores a word by multiplying the word's Term Frequency (TF) by its Inverse Document Frequency (IDF). Term Frequency: the TF of a term is the number of times the term appears in a document divided by the total number of words in that document. Inverse Document Frequency: the IDF of a term is the logarithm of the total number of documents in the corpus divided by the number of documents containing that term.
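To make the definitions concrete, here is a minimal sketch of the textbook computation in Python (no smoothing; real libraries such as scikit-learn use slightly different weighting and normalization):

```python
import math

def tf(term, doc_tokens):
    # Term frequency: occurrences of the term divided by document length.
    return doc_tokens.count(term) / len(doc_tokens)

def idf(term, corpus):
    # Inverse document frequency: log(total docs / docs containing the term).
    n_containing = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / n_containing) if n_containing else 0.0

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

corpus = [["weasel", "goat"], ["cheese", "gopher"]]
print(tf_idf("weasel", corpus[0], corpus))  # 0.5 * ln(2/1) ≈ 0.347
```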
TF-IDF is also used by search engines to weigh terms when ranking content. For example, when you search for "Coke" on Google, Google may use TF-IDF-style weighting to figure out whether a page titled "COKE" is about Coca-Cola or about some other sense of the word.
Er, nope, doesn't break down.
Say I have two documents, A "weasel goat" and B "cheese gopher". If we actually represented these as vectors over the vocabulary (weasel, goat, cheese, gopher), they might look something like:
A [1,1,0,0]
B [0,0,1,1]
and if we've allocated these vectors in an index file, yeah, we've got a problem when it comes time to add a new term. But the trick is, that vector never exists. The key is the inverted index, which maps each term to the list of documents containing it.
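Here's a minimal sketch of that idea in Python (a hypothetical, simplified index structure): adding a document with brand-new terms just creates new posting lists and never touches the existing ones.

```python
from collections import defaultdict

# term -> {doc_id: term_count}: the classic inverted-index layout
index = defaultdict(dict)

def add_document(doc_id, text):
    for term in text.split():
        index[term][doc_id] = index[term].get(doc_id, 0) + 1

add_document("A", "weasel goat")
add_document("B", "cheese gopher")

# A document with unseen terms only adds new posting lists;
# nothing already stored has to be resized or re-vectorized.
add_document("C", "marmoset kungfu")
print(index["marmoset"])  # {'C': 1}
```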
As far as new terms not affecting a cosine match, that might be true depending on what you mean. If I search my corpus of (A, B) with the query "marmoset kungfu", neither marmoset nor kungfu exists in the corpus. So the vector representing my query will be orthogonal to every document in the collection and get a cosine similarity of zero. But considering none of the terms match, that seems pretty reasonable.
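A quick sketch of that behavior, building the vectors on the fly purely for illustration:

```python
import math

def vectorize(tokens, vocab):
    return [tokens.count(term) for term in vocab]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

vocab = ["weasel", "goat", "cheese", "gopher"]
doc_a = vectorize("weasel goat".split(), vocab)
query = vectorize("marmoset kungfu".split(), vocab)  # all zeros: both terms are OOV
print(cosine(doc_a, query))  # 0.0 -- no shared terms, no similarity
```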
When you talk about "break down" I think you mean that the new terms have no impact on the similarity measure, because they do not have any representation in the vector space defined by the original vocabulary.
One approach to handling this smoothing problem would be to fix the vocabulary to a smaller set and treat all words rarer than a certain threshold as instances of a special _UNKNOWN_ word.
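A minimal sketch of that idea (the threshold and token name here are arbitrary choices):

```python
from collections import Counter

def build_vocab(corpus_tokens, min_count=2):
    # Keep only words at or above the frequency threshold.
    counts = Counter(t for doc in corpus_tokens for t in doc)
    return {t for t, c in counts.items() if c >= min_count}

def map_unknown(doc_tokens, vocab):
    # Replace out-of-vocabulary words with a shared _UNKNOWN_ token,
    # so new or rare terms still get some representation.
    return [t if t in vocab else "_UNKNOWN_" for t in doc_tokens]

vocab = build_vocab([["weasel", "goat", "goat"], ["cheese", "gopher", "goat"]])
print(map_unknown("marmoset goat".split(), vocab))  # ['_UNKNOWN_', 'goat']
```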
However, I don't think your definition of "break down" is very clear; if you can spell out exactly what you mean, perhaps we can discuss ways to work around those problems.