 

TfidfVectorizer in sklearn how to specifically INCLUDE words

I have some questions about the TfidfVectorizer.

It is unclear to me how the words are selected. We can give a minimum support, but after that, what decides which features are selected (e.g. does higher support mean a higher chance of selection)? If we say max_features = 10000, do we always get the same features? If we say max_features = 12000, will we get the same 10000 features, plus an extra 2000?

Also, is there a way to extend the, say, max_features=20000 features? I fit it on some text, but I know some words that should be included for sure, as well as some emoticons such as ":-)". How do I add these to the TfidfVectorizer object, so that the object can still be used to fit and predict?

from sklearn.feature_extraction.text import TfidfVectorizer

to_include = [":-)", ":-P"]
method = TfidfVectorizer(max_features=20000, ngram_range=(1, 3),
                         # I know stop_words, but how about include words?
                         stop_words=test.stoplist[:100],
                         # include words??
                         analyzer='word',
                         min_df=5)
method.fit(traindata)

Sought result:

X = method.transform(traindata)
X
<Nx20002 sparse matrix of type '<class 'numpy.float64'>'
 with 1135520 stored elements in Compressed Sparse Row format>
where N is the sample size
asked Nov 03 '13 by PascalVKooten

People also ask

Does TfidfVectorizer remove stop words?

From the way the TF-IDF score is set up, removing the stop words shouldn't make a significant difference. The whole point of the IDF is to down-weight words that carry no semantic value across the corpus. If you leave the stop words in, the IDF should largely get rid of them.
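A minimal sketch of that effect, on a made-up corpus: a word that occurs in every document gets the minimum possible IDF, while rarer words are weighted more heavily.

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["the cat sat on the mat", "the dog ate the bone"]
vect = TfidfVectorizer().fit(corpus)

# 'the' occurs in every document, so with the default smoothed IDF
# it gets the minimum value of 1.0; rarer words score higher
print(vect.idf_[vect.vocabulary_["the"]])  # 1.0
print(vect.idf_[vect.vocabulary_["cat"]])  # ~1.405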

What is the difference between TfidfVectorizer and Tfidftransformer?

TfidfTransformer and TfidfVectorizer aim to do the same thing: convert a collection of raw documents to a matrix of TF-IDF features. The difference is that with TfidfTransformer you compute the word counts yourself (typically with CountVectorizer), then generate the IDF values, and then compute the TF-IDF scores, whereas TfidfVectorizer performs all three steps at once.
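A minimal sketch of the two equivalent routes, on a made-up corpus; with default parameters both produce the same matrix.

import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

corpus = ["one fish two fish", "red fish blue fish"]

# route 1: raw counts first, then TF-IDF weighting
counts = CountVectorizer().fit_transform(corpus)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# route 2: both steps in a single estimator
tfidf_one_step = TfidfVectorizer().fit_transform(corpus)

assert np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray())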

How do I get IDF values from TfidfVectorizer?

You can just use TfidfVectorizer with use_idf=True (the default) and then read the idf_ attribute. How would you get the IDF value for, say, the term "not"? The vocabulary_ attribute gives you the mapping between each word and its feature index, and idf_ is indexed the same way.
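A minimal sketch on a made-up corpus:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["this is not fun", "this is fun"]
vect = TfidfVectorizer(use_idf=True).fit(corpus)

# vocabulary_ maps each term to its column index; idf_ uses the same index
idx = vect.vocabulary_["not"]
print(vect.idf_[idx])  # IDF("not")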

What is the difference between Countvectorizer and TfidfVectorizer?

TF-IDF is better than CountVectorizer because it not only reflects how frequently words occur in the corpus but also weights how important they are. We can then remove the words that are less important for analysis, making the model building less complex by reducing the input dimensions.
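A minimal sketch contrasting the two on the same made-up corpus: one produces integer counts, the other normalized floating-point weights.

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

corpus = ["good movie", "good good plot", "bad acting"]

print(CountVectorizer().fit_transform(corpus).toarray())  # raw integer counts
print(TfidfVectorizer().fit_transform(corpus).toarray())  # TF-IDF float weights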


1 Answer

You are asking several separate questions. Let me answer them separately:

"It is unclear to me how the words are selected."

From the documentation:

max_features : optional, None by default
    If not None, build a vocabulary that only consider the top
    max_features ordered by term frequency across the corpus.

All the features (in your case unigrams, bigrams and trigrams) are ordered by frequency across the entire corpus, and then the top max_features (10000 in your example) are selected. The uncommon words are thrown out.
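A minimal sketch of that trimming, with a made-up corpus and a cap of 2:

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["apple apple banana", "apple banana cherry", "banana date"]

vect = TfidfVectorizer(max_features=2).fit(corpus)
print(vect.vocabulary_)  # only the 2 most frequent terms, 'apple' and 'banana'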

"If we say max_features = 10000, do we always get the same? If we say max_features = 12000, will we get the same 10000 features, but an extra added 2000?"

Yes, on both counts. The process is deterministic: for a given corpus and a given max_features, you will always get the same features. And because features are ranked by corpus frequency, raising max_features from 10000 to 12000 keeps the same top 10000 and appends the next 2000 most frequent ones (up to tie-breaking among equally frequent terms at the cutoff).
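A minimal sketch checking the subset claim on a made-up corpus (the term frequencies here are distinct enough that tie-breaking does not matter):

from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["spam spam eggs", "spam eggs ham", "eggs ham toast bacon"]

small = set(TfidfVectorizer(max_features=2).fit(corpus).vocabulary_)
large = set(TfidfVectorizer(max_features=4).fit(corpus).vocabulary_)
assert small <= large  # the smaller vocabulary is contained in the larger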

I fit it on some text, but I know of some words that should be included for sure, [...] How to add these to the TfidfVectorizer object?

You use the vocabulary parameter to specify what features should be used. For example, if you want only emoticons to be extracted, you can do the following:

emoticons = {":)":0, ":P":1, ":(":2}
vect = TfidfVectorizer(vocabulary=emoticons)
matrix = vect.fit_transform(traindata)

This will return a <Nx3 sparse matrix of type '<class 'numpy.float64'>' with M stored elements in Compressed Sparse Row format>. Notice there are only 3 columns, one for each feature.
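One caveat: the default token_pattern only matches runs of word characters, so a token like ":)" is never produced and those columns would stay all zero. A minimal sketch with a whitespace-based token_pattern (an assumption for illustration; the documents are made up):

from sklearn.feature_extraction.text import TfidfVectorizer

emoticons = {":)": 0, ":P": 1, ":(": 2}

# \S+ splits on whitespace so ':)' survives tokenization;
# the default pattern (\b\w\w+\b) would discard it
vect = TfidfVectorizer(vocabulary=emoticons, token_pattern=r"\S+")
matrix = vect.fit_transform(["great :) :P", "awful :("])
print(matrix.shape)  # (2, 3) -- one column per emoticon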

If you want the vocabulary to include the emoticons as well as the N most common features, you could calculate the most frequent features first, then merge them with the emoticons and re-vectorize like so:

# calculate the most frequent features first
vect = TfidfVectorizer(max_features=10)
matrix = vect.fit_transform(traindata)
top_features = vect.vocabulary_
n = len(top_features)

# insert the emoticons into the vocabulary of common features
emoticons = {":)": 0, ":P": 1, ":(": 2}
for feature, index in emoticons.items():
    top_features[feature] = n + index

# re-vectorize using both sets of features
# at this point len(top_features) == 13
vect = TfidfVectorizer(vocabulary=top_features)
matrix = vect.fit_transform(traindata)
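Continuing the snippet above, a quick sanity check (assuming traindata is your corpus) that the merged vocabulary has the expected size and the matrix the expected width:

# continuing from the snippet above
print(len(vect.vocabulary_))  # 13: the 10 frequent features + 3 emoticons
print(matrix.shape)           # (N, 13), where N = len(traindata)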
answered Oct 19 '22 by mbatchkarov