
CountVectorizer and Out-Of-Vocabulary (OOV) tokens?

Right now I'm using CountVectorizer to extract features. However, I need to count words not seen during fitting.

When transforming, CountVectorizer by default ignores words that were not observed during fitting. But I need to keep a count of how many times this happens!

How can I do this?

Thanks!

Jose G asked Oct 25 '16 03:10


1 Answer

scikit-learn has no built-in way to do this, so you will need to write some additional code. You can use the vocabulary_ attribute of a fitted CountVectorizer to find the unseen words:

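  1. Cache the current vocabulary (the vocabulary_ learned during the original fit)
  2. Call fit_transform on the new documents
  3. Compute the difference between the new vocabulary and the cached vocabulary; the words that appear only in the new one were not seen during the original fit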
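
The three steps above could be sketched roughly as follows. This is only an illustration, assuming default CountVectorizer settings; train_docs, new_docs and oov_probe are made-up names, and fitting a second vectorizer (instead of refitting the original one) is a choice made here so that the original vocabulary stays usable for transform.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical example data, not from the question.
train_docs = ["the cat sat", "the dog barked"]
new_docs = ["the cat meowed", "a bird sang and sang"]

vectorizer = CountVectorizer()
vectorizer.fit(train_docs)

# 1. Cache the vocabulary learned during the original fit.
known_vocab = set(vectorizer.vocabulary_)

# 2. Call fit_transform on the new documents. A second vectorizer is used
#    (assumption: same settings as the first) so the original is not refit.
oov_probe = CountVectorizer()
new_counts = oov_probe.fit_transform(new_docs)

# 3. Diff the new vocabulary against the cached one.
oov_terms = set(oov_probe.vocabulary_) - known_vocab

# Sum the counts of the OOV columns to get the number of OOV tokens per document.
oov_columns = [oov_probe.vocabulary_[term] for term in sorted(oov_terms)]
oov_per_doc = np.asarray(new_counts[:, oov_columns].sum(axis=1)).ravel()

print(sorted(oov_terms))
print(oov_per_doc)
```

If the original vectorizer uses non-default options (a custom tokenizer, ngram_range, stop words and so on), the second vectorizer would need the same options for the comparison to be meaningful.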
vumaasha answered Sep 30 '22 12:09