How to select stop words using tf-idf? (non english corpus)

I have managed to evaluate the tf-idf function for a given corpus. How can I find the stopwords and the best words for each document? I understand that a low tf-idf for a given word and document means that it is not a good word for selecting that document.

asked Jun 04 '13 21:06

Daniel Walther Berns

2 Answers

From the book "Introduction to Information Retrieval":

tf-idf assigns to term t a weight in document d that is

highest when t occurs many times within a small number of documents (thus lending high discriminating power to those documents);
lower when the term occurs fewer times in a document, or occurs in many documents (thus offering a less pronounced relevance signal);
lowest when the term occurs in virtually all documents.

So the words with the lowest tf-idf can be considered stop words.
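For reference, the weighting that the quoted passage describes is defined in the same book as follows, where N is the number of documents in the collection, df_t is the number of documents containing term t, and tf_{t,d} is the number of occurrences of t in document d (the base of the logarithm only rescales the values):

```latex
\mathrm{idf}_t = \log\frac{N}{\mathrm{df}_t},
\qquad
\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t
```

A term that occurs in every document has df_t = N, so idf_t = log 1 = 0 and its tf-idf is 0 in every document, however often it occurs there.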

answered Oct 19 '22 15:10

Payam Soudachi

Stop words are words that appear so commonly across the documents that they lose their representativeness. The best way to find them is to count the number of documents each term appears in and filter out the terms that appear in more than 50% of them, or the 500 most frequent terms, or whatever threshold works after tuning on your corpus.
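The document-frequency filter described above could be sketched as follows. This is a minimal illustration, assuming the corpus is already tokenized into lists of terms; the function names, the `max_ratio` parameter, and the toy Spanish corpus are made up for the example.

```python
from collections import Counter

def document_frequency(docs):
    """Count, for each term, how many documents contain it at least once."""
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # set() so repeats inside one document count once
    return df

def stop_words_by_df(docs, max_ratio=0.5):
    """Return the terms that appear in more than max_ratio of the documents."""
    df = document_frequency(docs)
    n_docs = len(docs)
    return {term for term, count in df.items() if count / n_docs > max_ratio}

# Toy, already-tokenized corpus (hypothetical data for illustration only).
docs = [
    "el gato come pescado".split(),
    "el perro come carne".split(),
    "el gato duerme".split(),
    "la casa es grande".split(),
]

print(sorted(stop_words_by_df(docs)))                 # terms in more than 50% of documents
print(sorted(stop_words_by_df(docs, max_ratio=0.4)))  # a stricter, tuned threshold
```

Because nothing here depends on the language, the same approach applies to a non-English corpus as long as tokenization is sensible for that language.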

The best (as in most representative) terms in a document are those with the highest tf-idf, because those terms are common in the document while being rare in the collection.
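As a rough sketch of picking the most representative terms, the snippet below ranks each document's terms by tf-idf, using raw counts for tf and log(N/df) for idf. The function name `top_terms`, the parameter `k`, and the toy corpus are assumptions for illustration; other tf-idf variants would give different scores.

```python
import math
from collections import Counter

def top_terms(docs, k=2):
    """For each tokenized document, return its k terms with the highest tf-idf."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))  # document frequency: one count per document
    result = []
    for tokens in docs:
        tf = Counter(tokens)  # raw term frequency within this document
        scores = {t: tf[t] * math.log(n_docs / df[t]) for t in tf}
        # Highest score first; ties are broken alphabetically for determinism.
        ranked = sorted(scores, key=lambda t: (-scores[t], t))
        result.append(ranked[:k])
    return result

# Toy, already-tokenized corpus (hypothetical data for illustration only).
docs = [
    "el gato come pescado y el gato duerme".split(),
    "el perro come carne".split(),
    "el perro ladra".split(),
]

for terms in top_terms(docs):
    print(terms)
```

In this toy corpus "el" occurs in every document, so its score is 0 and it never reaches the top of any ranking.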

As a quick note, as @Kevin pointed out, very common terms in the collection (i.e., stop words) produce a very low tf-idf anyway. However, if you keep them they still influence some computations, which is undesirable if you assume they are pure noise (which might not be true depending on the task). In addition, including them makes your algorithm slightly slower.

edit: As @FelipeHammel says, you can use the IDF directly (remember to invert the order), since it decreases as df increases. This is completely equivalent for ranking purposes, and therefore for selecting the top "k" terms. The IDF value is not itself a ratio, so you cannot read a rule such as "words that appear in more than 50% of the documents" straight off it, but a simple threshold fixes that (i.e., select the terms whose idf is lower than a specific value). In general, a fixed number of terms is used.
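To make the equivalence concrete: with idf = log(N/df), the condition df/N > r is the same as idf < log(1/r), so a ratio rule translates into an idf threshold. The sketch below assumes that idf definition and a tokenized corpus; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def idf_table(docs):
    """Map each term to idf = log(N / df) over the tokenized documents."""
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    return {term: math.log(n_docs / count) for term, count in df.items()}

def stop_words_by_idf(docs, max_ratio=0.5):
    """Select terms whose idf is below the threshold matching max_ratio."""
    # df/N > max_ratio  <=>  log(N/df) < log(1/max_ratio)
    threshold = math.log(1 / max_ratio)
    return {t for t, value in idf_table(docs).items() if value < threshold}

def lowest_idf_terms(docs, k):
    """Select a fixed number of terms: the k with the lowest idf."""
    idf = idf_table(docs)
    return sorted(idf, key=lambda t: (idf[t], t))[:k]

# Toy, already-tokenized corpus (hypothetical data for illustration only).
docs = [
    "el gato come pescado".split(),
    "el perro come carne".split(),
    "el gato duerme".split(),
    "la casa es grande".split(),
]

print(stop_words_by_idf(docs))    # same terms as filtering on df/N > 0.5
print(lowest_idf_terms(docs, 3))  # fixed-size list, lowest idf first
```

Sorting by ascending idf is the "inverted order" mentioned above: the lowest idf corresponds to the highest document frequency.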

I hope this helps.

answered Oct 19 '22 16:10

miguelmalvarez

Tags: text-mining, information-retrieval, stop-words, tf-idf
