I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document.
TF-IDF (term frequency-inverse document frequency) is a popular approach that can be leveraged to eliminate stop words. The technique is language-independent. The intuition is that words that occur in almost all documents are stop words.
Stop words are words like a, an, the, is, has, of, are, etc. Most of the time they add noise to the features, so removing them helps build a cleaner dataset with better features for a machine learning model. For text-based problems, the bag-of-words approach is a common technique.
Most frequent terms as stop words: Sort the terms in descending order of raw term frequency. You can take the top N terms to be your stop words. You can also eliminate common English words (using a published stop list) prior to sorting, so that you are sure you target the domain-specific stop words.
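The frequency-based approach above could be sketched roughly as follows. This is only an illustration: the toy corpus, the tiny `PUBLISHED_STOP_WORDS` set (a stand-in for a real published stop list) and the helper name `top_n_frequent_terms` are all assumptions, and whitespace splitting stands in for a proper tokenizer.

```python
from collections import Counter

# Hypothetical toy corpus; a real corpus would be much larger.
docs = [
    "the patient was given the drug",
    "the drug reduced the fever in the patient",
    "a new drug for the patient with fever",
]

# Tiny stand-in for a published English stop list (assumption).
PUBLISHED_STOP_WORDS = {"the", "a", "was", "in", "for", "with"}


def top_n_frequent_terms(documents, n, exclude=frozenset()):
    """Return the n terms with the highest raw term frequency in the corpus."""
    counts = Counter(
        token
        for doc in documents
        for token in doc.lower().split()  # naive whitespace tokenization
        if token not in exclude
    )
    return [term for term, _ in counts.most_common(n)]


# Without a published list, generic English words dominate the top.
generic = top_n_frequent_terms(docs, n=1)

# Removing the published list first exposes domain-specific candidates.
domain_stop_words = top_n_frequent_terms(docs, n=2, exclude=PUBLISHED_STOP_WORDS)
print(generic, domain_stop_words)
```

The value of N is a tuning choice; on a corpus this small the result is only meant to show the mechanics.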
Stop words are a set of commonly used words in any language. For example, in English, “the”, “is” and “and” would easily qualify as stop words. In NLP and text mining applications, stop word lists are used to eliminate unimportant words, allowing applications to focus on the important words instead.
From the book "Introduction to Information Retrieval":
tf-idf assigns to term t a weight in document d that is
1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
3. lowest when the term occurs in virtually all documents.
So the words with the lowest tf-idf can be considered stop words.
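To make the three cases concrete, here is a small sketch using the textbook weighting tf × log(N / df). The toy documents and the function name `tf_idf` are made up for illustration, and the natural logarithm is an arbitrary choice (the base only rescales the weights).

```python
import math

# Hypothetical, pre-tokenized toy corpus.
docs = [
    ["the", "jaguar", "jaguar", "jaguar", "runs"],
    ["the", "car", "runs"],
    ["the", "car", "stops"],
    ["the", "train", "stops"],
]


def tf_idf(term, doc, documents):
    """Weight of `term` in `doc`: raw term frequency times log(N / df)."""
    tf = doc.count(term)
    df = sum(1 for d in documents if term in d)
    if df == 0:
        return 0.0  # term never occurs in the corpus
    return tf * math.log(len(documents) / df)


first = docs[0]
w_jaguar = tf_idf("jaguar", first, docs)  # frequent here, rare elsewhere
w_runs = tf_idf("runs", first, docs)      # occurs once, in half the documents
w_the = tf_idf("the", first, docs)        # occurs in every document
print(w_jaguar, w_runs, w_the)
```

With this unsmoothed formula a term that occurs in every document gets a weight of exactly zero, which is why such terms end up at the bottom of the ranking.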
Stop-words are those words that appear very commonly across the documents, therefore losing their representativeness. The best way to observe this is to measure the number of documents a term appears in and filter out those that appear in more than 50% of them, or the top 500, or apply some other type of threshold that you will have to tune.
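A minimal sketch of the document-frequency filter, assuming the documents are already tokenized. The function name `high_df_terms` and the `max_df_ratio` parameter are illustrative names, and 0.5 is just the 50% example from the text, not a recommended value.

```python
from collections import Counter

# Hypothetical, pre-tokenized toy corpus.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
    ["a", "cat", "flew"],
]


def high_df_terms(documents, max_df_ratio=0.5):
    """Return terms that appear in more than max_df_ratio of the documents."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # count each term at most once per document
    return {term for term, count in df.items() if count / n_docs > max_df_ratio}


stop_words = high_df_terms(docs)
print(stop_words)  # {'the'}
```

Here "the" appears in 3 of 4 documents and is filtered, while "cat" appears in exactly half and is kept because the comparison is strict.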
The best (as in more representative) terms in a document are those with higher tf-idf because those terms are common in the document, while being rare in the collection.
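Selecting the most representative terms per document might look like the sketch below. It assumes tokenized input and the unsmoothed tf × log(N / df) weighting; `top_terms_per_document` and `k` are hypothetical names, and ties are broken alphabetically only to keep the output deterministic.

```python
import math
from collections import Counter

# Hypothetical, pre-tokenized toy corpus.
docs = [
    ["the", "jaguar", "jaguar", "hunts", "the", "deer"],
    ["the", "engine", "engine", "engine", "powers", "the", "car"],
    ["the", "deer", "eats", "the", "grass"],
]


def top_terms_per_document(documents, k=2):
    """For each document, return its k terms with the highest tf-idf."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))  # document frequency of each term

    result = []
    for doc in documents:
        tf = Counter(doc)
        scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Highest score first; alphabetical order breaks ties.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result


best = top_terms_per_document(docs, k=2)
print(best)
```

Even though "the" is the most frequent token in two of the documents, it never makes the list because its weight is zero.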
As a quick note, as @Kevin pointed out, very common terms in the collection (i.e., stop-words) produce very low tf-idf anyway. However, they still influence some computations, which is a problem if you assume they are pure noise (an assumption that might not hold, depending on the task). In addition, if they are included, your algorithm will be slightly slower.
edit: As @FelipeHammel says, you can directly use the IDF (remember to invert the order) as a measure, since it is inversely related to df. This is completely equivalent for ranking purposes, and therefore for selecting the top "k" terms. However, it cannot be used directly to select based on ratios (e.g., words that appear in more than 50% of the documents), although simple thresholding will fix that (i.e., selecting terms with an idf lower than a specific value). In general, a fixed number of terms is used.
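A rough sketch of the IDF variant, assuming idf = log(N / df) without smoothing. Under that assumption "df / N greater than r" is the same condition as "idf less than log(1 / r)", so the 50% rule becomes idf < log 2. All function names here are illustrative.

```python
import math
from collections import Counter

# Hypothetical, pre-tokenized toy corpus.
docs = [
    ["the", "cat", "sat"],
    ["the", "dog", "sat"],
    ["the", "bird", "flew"],
    ["a", "cat", "flew"],
]


def idf_table(documents):
    """Map each term to log(N / df), with no smoothing."""
    n_docs = len(documents)
    df = Counter()
    for doc in documents:
        df.update(set(doc))
    return {term: math.log(n_docs / count) for term, count in df.items()}


def stop_words_by_idf(documents, max_df_ratio=0.5):
    """Select terms whose idf is below the threshold implied by the ratio."""
    threshold = math.log(1 / max_df_ratio)
    return {t for t, idf in idf_table(documents).items() if idf < threshold}


def lowest_idf_terms(documents, k):
    """Top-k stop-word candidates: ascending idf (the inverted order)."""
    idf = idf_table(documents)
    return sorted(idf, key=lambda t: (idf[t], t))[:k]


by_threshold = stop_words_by_idf(docs)
by_rank = lowest_idf_terms(docs, k=1)
print(by_threshold, by_rank)
```

Terms sitting exactly on the boundary can fall on either side because of floating-point rounding, so comparing df counts directly is safer when the exact ratio matters.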
I hope this helps.