Understanding text feature extraction with TfidfVectorizer in Python scikit-learn

Tags:

scikit-learn

Reading the documentation for text feature extraction in scikit-learn, I am not sure how the different arguments available for TfidfVectorizer (and maybe other vectorizers) affect the outcome.

Here are the arguments I am not sure about:

TfidfVectorizer(stop_words='english',  ngram_range=(1, 2), max_df=0.5, min_df=20, use_idf=True)

The documentation is clear on the use of stop_words and max_df (both have a similar effect, so maybe one can be used instead of the other). However, I am not sure whether these options should be used together with ngrams. Which one is handled first, ngrams or stop_words, and why? Based on my experiment, stop words are removed first, but the purpose of ngrams is to extract phrases, so I am not sure about the effect of this sequence (stop words removed, then ngrams generated).

Second, does it make sense to use the max_df/min_df arguments together with the use_idf argument? Aren't their purposes similar?


asked Nov 29 '17 16:11

valearner

1 Answer

I see several questions in this post.

How do the different arguments in TfidfVectorizer interact with one another?

You really have to use it quite a bit to develop an intuition for it (that has been my experience, anyway).

TfidfVectorizer is a bag-of-words approach. In NLP, the order of words and the window around each word are important, and a bag-of-words model throws away much of that context.

How do I control what tokens get outputted?

Set ngram_range to (1,1) for outputting only one-word tokens, (1,2) for one-word and two-word tokens, (2, 3) for two-word and three-word tokens, etc.
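To see what each ngram_range setting produces, here is a minimal sketch that uses build_analyzer(), which returns the function a vectorizer applies to a single document. The sample sentence and the helper name ngrams_for are only illustrative.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = "what does tfidf stand for"


def ngrams_for(ngram_range):
    # build_analyzer() returns the callable the vectorizer uses to turn
    # one raw document into its list of n-gram features.
    analyzer = TfidfVectorizer(ngram_range=ngram_range).build_analyzer()
    return analyzer(text)


unigrams = ngrams_for((1, 1))          # one-word tokens only
uni_and_bigrams = ngrams_for((1, 2))   # one-word and two-word tokens
bi_and_trigrams = ngrams_for((2, 3))   # two-word and three-word tokens

for name, grams in [("(1, 1)", unigrams),
                    ("(1, 2)", uni_and_bigrams),
                    ("(2, 3)", bi_and_trigrams)]:
    print(name, grams)
```

No stop words are removed here, so every word in the sentence shows up in the n-grams.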

ngram_range works hand-in-hand with analyzer. Set analyzer to "word" for outputting words and phrases, or set it to "char" to output character ngrams.

If you want your output to have both "word" and "char" features, use sklearn's FeatureUnion to combine one vectorizer of each kind.
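A rough sketch of the FeatureUnion idea, assuming a tiny made-up corpus and arbitrary ngram ranges: one word-level and one character-level TfidfVectorizer are fitted side by side and their columns are stacked.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

docs = ["what is tfidf", "what does tfidf stand for", "tfidf is what"]

union = FeatureUnion([
    # word unigrams and bigrams
    ("word", TfidfVectorizer(analyzer="word", ngram_range=(1, 2))),
    # character bigrams and trigrams
    ("char", TfidfVectorizer(analyzer="char", ngram_range=(2, 3))),
])
combined = union.fit_transform(docs)

# After fitting, transformer_list holds the fitted vectorizers.
fitted = dict(union.transformer_list)
n_word = len(fitted["word"].vocabulary_)
n_char = len(fitted["char"].vocabulary_)

print(combined.shape, n_word, n_char)
```

The combined matrix has one row per document and n_word + n_char columns, with the word features first.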

How do I remove unwanted stuff?

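Use stop_words to remove less meaningful English words.

The list of stop words that sklearn uses can be imported with (in recent versions it lives in the text module; the old sklearn.feature_extraction.stop_words module has been removed):

from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS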

The logic of removing stop words has to do with the fact that these words don't carry a lot of meaning, and they appear a lot in most text:

[('the', 79808),
 ('of', 40024),
 ('and', 38311),
 ('to', 28765),
 ('in', 22020),
 ('a', 21124),
 ('that', 12512),
 ('he', 12401),
 ('was', 11410),
 ('it', 10681),
 ('his', 10034),
 ('is', 9773),
 ('with', 9739),
 ('as', 8064),
 ('i', 7679),
 ('had', 7383),
 ('for', 6938),
 ('at', 6789),
 ('by', 6735),
 ('on', 6639)]

Since stop words generally have a high frequency, it might make sense to set max_df to a float such as 0.95. Note what that actually does: it ignores terms that appear in more than 95% of the documents; it does not remove the top 5% of terms. Either way, you're assuming that the most widespread terms are all stop words, which might not be the case. It really depends on your text data. In my line of work, it's very common that the top words or phrases are NOT stop words because I work with dense text (search query data) in very specific topics.

Use min_df as an integer to remove rarely occurring words: min_df=5 ignores terms that appear in fewer than 5 documents. If words only occur once or twice, they won't add much value and are usually really obscure. Furthermore, there are generally a lot of them, so ignoring them with, say, min_df=5 can greatly reduce your memory consumption and data size.
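Here is a small sketch of how max_df and min_df prune the vocabulary. The thresholds 0.9 and 2 are chosen only to fit this six-document toy corpus, and CountVectorizer is used because the pruning works the same way in both vectorizers.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "what is tfidf",
    "what does tfidf stand for",
    "what is tfidf and what does it stand for",
    "tfidf is what",
    "why don't I use tfidf",
    "1 in 10 people use tfidf",
]

# Baseline: no document-frequency filtering at all.
full_vocab = set(CountVectorizer().fit(docs).vocabulary_)

# max_df=0.9 drops terms found in more than 90% of the documents.
# min_df=2 drops terms found in fewer than 2 documents.
pruned = CountVectorizer(max_df=0.9, min_df=2).fit(docs)

kept = sorted(pruned.vocabulary_)
dropped = sorted(full_vocab - set(kept))

print("kept:   ", kept)
print("dropped:", dropped)
```

"tfidf" is dropped by max_df because it appears in every document, while words such as "people" and "why" are dropped by min_df because they appear in only one.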

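How do I include stuff that's being stripped out?

By default, token_pattern is the regex (?u)\b\w\w+\b, which means tokens have to be at least 2 characters long. Single-character words like "I" and "a" are dropped, and so are single-digit numbers (0 through 9). Apostrophes are not word characters, so the pattern also splits on them: "don't" becomes "don", and the leftover "t" is too short to keep. To keep such tokens, pass your own token_pattern.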
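As an illustration, the sketch below compares the default tokenization with a custom token_pattern. The custom regex is just one possible choice (an assumption, not a recommended setting): it keeps single-character tokens and an optional apostrophe suffix.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

text = "why don't I use tfidf 1 in 10 times"

# Default pattern: tokens of 2+ word characters, apostrophes split words.
default_tokens = TfidfVectorizer().build_analyzer()(text)

# Hypothetical custom pattern: 1+ word characters, optionally followed by
# an apostrophe and more word characters (non-capturing group).
custom_pattern = r"(?u)\b\w+(?:'\w+)?\b"
custom_tokens = TfidfVectorizer(token_pattern=custom_pattern).build_analyzer()(text)

print(default_tokens)
print(custom_tokens)
```

Text is lowercased before tokenization in both cases, so "I" comes back as "i" when it is kept.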

What happens first, ngram generation or stop word removal?

Let's do a little test.

import numpy as np
import pandas as pd

# ENGLISH_STOP_WORDS lives in sklearn.feature_extraction.text
# (the old sklearn.feature_extraction.stop_words module was removed).
from sklearn.feature_extraction.text import TfidfVectorizer, ENGLISH_STOP_WORDS

docs = np.array(['what is tfidf',
        'what does tfidf stand for',
        'what is tfidf and what does it stand for',
        'tfidf is what',
        "why don't I use tfidf",
        '1 in 10 people use tfidf'])

# use_idf=False and norm=None make the output plain term counts
tfidf = TfidfVectorizer(use_idf=False, norm=None, ngram_range=(1, 1))
matrix = tfidf.fit_transform(docs).toarray()

# get_feature_names_out() replaces the removed get_feature_names()
df = pd.DataFrame(matrix, index=docs, columns=tfidf.get_feature_names_out())

# Manually strip stop words from each raw document for comparison
for doc in docs:
    print(' '.join(word for word in doc.split() if word not in ENGLISH_STOP_WORDS))

This prints out:

tfidf
does tfidf stand
tfidf does stand
tfidf
don't I use tfidf
1 10 people use tfidf

Now let's print df:

                                           10  and  does  don  for   in   is  \
what is tfidf                             0.0  0.0   0.0  0.0  0.0  0.0  1.0   
what does tfidf stand for                 0.0  0.0   1.0  0.0  1.0  0.0  0.0   
what is tfidf and what does it stand for  0.0  1.0   1.0  0.0  1.0  0.0  1.0   
tfidf is what                             0.0  0.0   0.0  0.0  0.0  0.0  1.0   
why don't I use tfidf                     0.0  0.0   0.0  1.0  0.0  0.0  0.0   
1 in 10 people use tfidf                  1.0  0.0   0.0  0.0  0.0  1.0  0.0   

                                           it  people  stand  tfidf  use  \
what is tfidf                             0.0     0.0    0.0    1.0  0.0   
what does tfidf stand for                 0.0     0.0    1.0    1.0  0.0   
what is tfidf and what does it stand for  1.0     0.0    1.0    1.0  0.0   
tfidf is what                             0.0     0.0    0.0    1.0  0.0   
why don't I use tfidf                     0.0     0.0    0.0    1.0  1.0   
1 in 10 people use tfidf                  0.0     1.0    0.0    1.0  1.0   

                                          what  why  
what is tfidf                              1.0  0.0  
what does tfidf stand for                  1.0  0.0  
what is tfidf and what does it stand for   2.0  0.0  
tfidf is what                              1.0  0.0  
why don't I use tfidf                      0.0  1.0  
1 in 10 people use tfidf                   0.0  0.0

Notes:

Setting use_idf=False and norm=None makes TfidfVectorizer equivalent to sklearn's CountVectorizer. It will just return counts.
Notice the word "don't" was converted to "don". This is where you'd change token_pattern to something like token_pattern=r"\b\w[\w']+\b" to include apostrophes.
We see a lot of stop words among the features, because stop_words was not set on the vectorizer.
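The first note can be checked directly. A minimal sketch, assuming a shortened version of the corpus above: with use_idf=False and norm=None (and the default sublinear_tf=False), the TfidfVectorizer output should hold the same values as CountVectorizer.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = [
    "what is tfidf",
    "what is tfidf and what does it stand for",
    "1 in 10 people use tfidf",
]

counts = CountVectorizer().fit_transform(docs).toarray()
raw_tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs).toarray()

# Same vocabulary, same values; only the dtype differs (ints vs floats).
same_values = np.array_equal(counts, raw_tf)
print(same_values)
```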

Let's remove stop words, switch to ngram_range=(1, 2), and look at df again:

tfidf = TfidfVectorizer(use_idf=False, norm=None, stop_words='english', ngram_range=(1, 2))

Outputs:

                                           10  10 people  does  does stand  \
what is tfidf                             0.0        0.0   0.0         0.0   
what does tfidf stand for                 0.0        0.0   1.0         0.0   
what is tfidf and what does it stand for  0.0        0.0   1.0         1.0   
tfidf is what                             0.0        0.0   0.0         0.0   
why don't I use tfidf                     0.0        0.0   0.0         0.0   
1 in 10 people use tfidf                  1.0        1.0   0.0         0.0   

                                          does tfidf  don  don use  people  \
what is tfidf                                    0.0  0.0      0.0     0.0   
what does tfidf stand for                        1.0  0.0      0.0     0.0   
what is tfidf and what does it stand for         0.0  0.0      0.0     0.0   
tfidf is what                                    0.0  0.0      0.0     0.0   
why don't I use tfidf                            0.0  1.0      1.0     0.0   
1 in 10 people use tfidf                         0.0  0.0      0.0     1.0   

                                          people use  stand  tfidf  \
what is tfidf                                    0.0    0.0    1.0   
what does tfidf stand for                        0.0    1.0    1.0   
what is tfidf and what does it stand for         0.0    1.0    1.0   
tfidf is what                                    0.0    0.0    1.0   
why don't I use tfidf                            0.0    0.0    1.0   
1 in 10 people use tfidf                         1.0    0.0    1.0   

                                          tfidf does  tfidf stand  use  \
what is tfidf                                    0.0          0.0  0.0   
what does tfidf stand for                        0.0          1.0  0.0   
what is tfidf and what does it stand for         1.0          0.0  0.0   
tfidf is what                                    0.0          0.0  0.0   
why don't I use tfidf                            0.0          0.0  1.0   
1 in 10 people use tfidf                         0.0          0.0  1.0   

                                          use tfidf  
what is tfidf                                   0.0  
what does tfidf stand for                       0.0  
what is tfidf and what does it stand for        0.0  
tfidf is what                                   0.0  
why don't I use tfidf                           1.0  
1 in 10 people use tfidf                        1.0

Take-aways:

The token "don use" appeared because "don't I use" lost its "'t" during tokenization, and "I" was dropped for being shorter than two characters, so "don" and "use" ended up next to each other. That bigram was never in the original text, so the generated features can misrepresent the sentence structure.
Answer: the text is tokenized first (which drops short tokens), then stop words are removed, and only then are ngrams generated, which can return unexpected results.
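The ordering can be reproduced on a single document with build_analyzer(). This is a small sketch that reuses one sentence from the test above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(stop_words="english", ngram_range=(1, 2))
analyzer = vectorizer.build_analyzer()

# Tokenization turns "don't" into "don" and drops "t" and "I";
# stop word removal then drops "why"; bigrams are built from what is left.
features = analyzer("why don't I use tfidf")
print(features)
```

The bigram "don use" is built from tokens that were not adjacent in the raw text, which is the side effect described in the take-aways.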

Does it make sense to use the max_df/min_df arguments together with the use_idf argument?

In my opinion, the whole point of term frequency-inverse document frequency is to allow re-weighting of the highly frequent words (words that would appear at the top of a sorted frequency list). This re-weighting will take the highest-frequency ngrams and move them down the list to a lower position. Therefore, it's supposed to handle max_df scenarios.

Maybe it's more of a personal choice whether you want to move them down the list ("re-weight" / de-prioritize them) or remove them completely.

I use min_df a lot, and it makes sense to use min_df if you're working with a huge dataset because rare words won't add value and will just cause a lot of processing issues. I don't use max_df much, but I'm sure there are scenarios, such as working with all of Wikipedia, where it makes sense to drop terms that appear in more than x% of the documents.
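To make the "re-weight versus remove" choice concrete, here is a sketch on the same toy corpus used in the test above. With the default smooth_idf=True, a term that appears in every document gets the minimum IDF of 1.0, so it is down-weighted but kept, whereas max_df removes it from the vocabulary entirely. The max_df value of 0.9 is an arbitrary choice for this corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "what is tfidf",
    "what does tfidf stand for",
    "what is tfidf and what does it stand for",
    "tfidf is what",
    "why don't I use tfidf",
    "1 in 10 people use tfidf",
]

# Option 1: keep every term and let IDF down-weight the common ones.
weighted = TfidfVectorizer(use_idf=True).fit(docs)
idf = {term: weighted.idf_[col] for term, col in weighted.vocabulary_.items()}
lowest_idf_term = min(idf, key=idf.get)

# Option 2: drop terms that appear in more than 90% of the documents.
filtered = TfidfVectorizer(use_idf=True, max_df=0.9).fit(docs)

for term in ("tfidf", "what", "people"):
    print(term, round(idf[term], 3), term in filtered.vocabulary_)
```

Rarer terms get a higher IDF, so the two options can also be combined: max_df and min_df trim the extremes, and IDF re-weights whatever remains.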


answered Sep 29 '22 06:09

Jarad
