How to select stop words using tf-idf? (non english corpus)

I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document.

Daniel Walther Berns asked Jun 04 '13 21:06



2 Answers

From "Introduction to Information Retrieval" book:

tf-idf assigns to term t a weight in document d that is

  1. highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
  2. lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
  3. lowest when the term occurs in virtually all documents.

So the words with the lowest tf-idf across the collection can be considered stop words.
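For reference, the three cases above follow from the weighting that the same book defines. Here N is the number of documents in the collection, tf is the number of times term t occurs in document d, and df is the number of documents that contain t. The base of the logarithm is a convention and does not change the ranking.

$$
\text{tf-idf}_{t,d} = \text{tf}_{t,d} \times \text{idf}_t, \qquad \text{idf}_t = \log \frac{N}{\text{df}_t}
$$

A term that occurs in every document has df = N, so its idf is log 1 = 0 and its tf-idf is 0 in every document, no matter how often it occurs.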

Payam Soudachi answered Oct 19 '22 15:10


Stop-words are words that appear so commonly across the documents that they lose their representativeness. The most direct way to find them is to count the number of documents each term appears in (its document frequency, df) and filter out the terms that appear in more than, say, 50% of the documents, or the 500 terms with the highest df. Either way, the threshold is something you will have to tune for your corpus.
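A minimal sketch of the document-frequency filter, assuming whitespace tokenisation and a tiny made-up Spanish corpus. The function name `stop_words_by_df` and the parameter `max_df_ratio` are hypothetical, and a real non-English corpus will likely need a proper tokeniser.

```python
from collections import Counter

def stop_words_by_df(docs, max_df_ratio=0.5):
    """Return the terms that occur in more than max_df_ratio of the documents."""
    n_docs = len(docs)
    # set() ensures each term is counted at most once per document
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    return {term for term, count in df.items() if count / n_docs > max_df_ratio}

# Toy corpus, invented for illustration only
docs = [
    "el gato come pescado",
    "el perro come carne",
    "el gato duerme en la casa",
    "el perro corre en el jardin",
]

stop_words = stop_words_by_df(docs)
print(stop_words)
```

On this corpus only "el" appears in more than half of the documents. Lowering `max_df_ratio` makes the filter more aggressive.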

The best (as in most representative) terms in a document are those with the highest tf-idf, because those terms are common in the document while being rare in the collection.
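As a sketch of how the best terms per document could be picked, the following computes raw tf times log(N / df) and keeps the k highest-scoring terms of each document. The name `top_terms`, the whitespace tokenisation, and the toy corpus are assumptions for illustration. Ties are broken alphabetically so the result is deterministic.

```python
import math
from collections import Counter

def top_terms(docs, k=2):
    """Return, for each document, its k terms with the highest tf-idf."""
    n_docs = len(docs)
    tokenized = [doc.lower().split() for doc in docs]
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    best = []
    for tokens in tokenized:
        tf = Counter(tokens)
        scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Highest score first; alphabetical order breaks ties
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        best.append(ranked[:k])
    return best

# Toy corpus, invented for illustration only
docs = [
    "el gato come pescado",
    "el perro come carne",
    "el gato duerme en la casa",
    "el perro corre en el jardin",
]

best = top_terms(docs)
for doc, terms in zip(docs, best):
    print(doc, "->", terms)
```

Note that "el" never makes the list: it occurs in every document, so its score is 0 even in the last document, where it occurs twice.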

As a quick note, as @Kevin pointed out, very common terms in the collection (i.e., stop-words) produce very low tf-idf anyway. However, leaving them in still changes some computations, which is undesirable if you treat them as pure noise (an assumption that might not hold, depending on the task). In addition, your algorithm will be slightly slower if they are included.

edit: As @FelipeHammel says, you can use the IDF directly (remember to invert the order), since it decreases monotonically as df grows. For ranking purposes this is completely equivalent to using df, and therefore it works just as well for selecting the top "k" terms. The IDF value is not itself a ratio, so you cannot read "appears in more than 50% of the documents" off it directly, but a simple threshold fixes that: select the terms whose idf is lower than a specific value. In general, a fixed number of terms is used.
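A rough sketch of both IDF-based selections, assuming idf = log(N / df) with the natural logarithm. Under that definition, "appears in more than a fraction r of the documents" is the same condition as idf < log(1 / r). The function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter

def idf_table(docs):
    """Map each term to log(N / df), using whitespace tokenisation."""
    n_docs = len(docs)
    df = Counter(term for doc in docs for term in set(doc.lower().split()))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def lowest_idf_terms(idf, k):
    """The k terms with the lowest idf (ascending order, ties alphabetical)."""
    ranked = sorted(idf.items(), key=lambda item: (item[1], item[0]))
    return [term for term, _ in ranked[:k]]

def terms_below_idf(idf, max_df_ratio):
    """Terms with df / N > max_df_ratio, expressed as an idf threshold."""
    threshold = math.log(1 / max_df_ratio)
    return {term for term, value in idf.items() if value < threshold}

# Toy corpus, invented for illustration only
docs = [
    "el gato come pescado",
    "el perro come carne",
    "el gato duerme en la casa",
    "el perro corre en el jardin",
]

idf = idf_table(docs)
print(lowest_idf_terms(idf, 3))
print(terms_below_idf(idf, 0.5))
```

Sorting by ascending idf is what "invert the order" amounts to: the most widespread terms come first.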

I hope this helps.

miguelmalvarez answered Oct 19 '22 16:10