Right now I'm using <code>CountVectorizer</code> to extract features. However, I need to count words not seen during fitting. During transforming, the default behavior of <code>CountVectorizer</code> is to ignore words that were not observed during fitting. But I need to keep a count of how many times this happens! How can I do this? Thanks!

scikit-learn has no built-in way to do this, so you need to write some additional code. However, you can use the <code>vocabulary_</code> attribute of <code>CountVectorizer</code> to achieve it: <ol> <li>Cache the current vocabulary.</li> <li>Call <code>fit_transform</code> on the new documents.</li> <li>Compute the difference between the new vocabulary and the cached vocabulary.</li> </ol>
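The three steps above can be sketched roughly as follows. This is only one possible reading of the answer, not a definitive implementation: the sample documents and the names `train_docs`, `new_docs`, `known_vocab`, `oov_counts`, and `per_doc_oov` are made up for illustration, and the sketch assumes the default `CountVectorizer` settings (lowercasing, tokens of two or more characters, no `min_df`/`max_df`/`max_features` pruning). A second, unfitted copy of the vectorizer is fitted on the new documents so that the original fitted vocabulary is left untouched.

```python
import numpy as np
from sklearn.base import clone
from sklearn.feature_extraction.text import CountVectorizer

# Illustrative data (hypothetical, not from the original question).
train_docs = ["the cat sat on the mat", "the dog barked"]
new_docs = ["the cat chased the bird", "a bird sang and sang"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)

# Step 1: cache the vocabulary learned during fitting.
known_vocab = set(vectorizer.vocabulary_)

# Step 2: fit an unfitted copy (same parameters) on the new documents.
new_vectorizer = clone(vectorizer)
new_counts = new_vectorizer.fit_transform(new_docs)

# Step 3: terms present in the new vocabulary but not in the cached one.
oov_terms = set(new_vectorizer.vocabulary_) - known_vocab

# Total occurrences of each out-of-vocabulary term across the new documents.
totals = np.asarray(new_counts.sum(axis=0)).ravel()
oov_counts = {t: int(totals[new_vectorizer.vocabulary_[t]]) for t in oov_terms}

# Number of out-of-vocabulary tokens in each new document.
oov_cols = [new_vectorizer.vocabulary_[t] for t in sorted(oov_terms)]
per_doc_oov = np.asarray(new_counts[:, oov_cols].sum(axis=1)).ravel()

print(sorted(oov_counts.items()))
# [('and', 1), ('bird', 2), ('chased', 1), ('sang', 2)]
print(per_doc_oov.tolist())
# [2, 4]
```

The original `vectorizer.transform(new_docs)` can still be used for the feature matrix itself; `per_doc_oov` could then be appended as an extra column if the count of unseen words is meant to be a feature.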

CountVectorizer and Out-Of-Vocabulary (OOV) tokens?

1 Answer

answered Sep 30 '22 12:09

vumaasha

Tags:

python

scikit-learn

Jose G
