
Adding new words to text vectorizer in scikit-learn

Scikit-learn's CountVectorizer, used for the bag-of-words approach, currently offers two options: (a) use a custom vocabulary, or (b) if no custom vocabulary is supplied, build a vocabulary from all the words present in the corpus.
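For reference, the two options can be sketched as follows. This is a minimal illustration, assuming a made-up two-document corpus and a made-up custom vocabulary; neither comes from the question itself.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus, invented for illustration only.
corpus = ["the cat sat", "the dog barked"]

# (a) Custom vocabulary: only the listed words become columns;
# every other word in the corpus is silently ignored.
fixed = CountVectorizer(vocabulary=["cat", "dog"])
X_fixed = fixed.fit_transform(corpus)
print(X_fixed.shape)  # (2, 2)

# (b) No vocabulary given: it is learned from all words in the corpus.
learned = CountVectorizer()
X_learned = learned.fit_transform(corpus)
print(sorted(learned.vocabulary_))  # ['barked', 'cat', 'dog', 'sat', 'the']
print(X_learned.shape)  # (2, 5)
```

In both cases the number of columns is fixed once the vectorizer is set up, which is the behavior the question asks to relax.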

My question: can we specify a custom vocabulary to begin with, but have it updated when new words are seen while processing the corpus? I assume this is doable, since the matrix is stored in a sparse representation.

Usefulness: this would help when additional documents have to be added to the training data, so that one does not have to start over from the beginning.

user2986075 asked Nov 02 '22 10:11


1 Answer

No, this is not possible at present. It's also not "doable", and here's why.

CountVectorizer and TfidfVectorizer are designed to turn text documents into vectors. These vectors all need to have the same number of elements, which in turn equals the size of the vocabulary, because that convention is ingrained in all scikit-learn code. If the vocabulary is allowed to grow, then the vectors produced at different times have different lengths. This affects, for example, the number of parameters in a linear (or other parametric) classifier trained on such vectors, which then also needs to be able to grow. It affects k-means and the dimensionality reduction classes. It even affects something as simple as matrix multiplication, which can no longer be handled with a simple call to NumPy's dot routine and requires custom code instead. In other words, allowing this flexibility in the vectorizers makes little sense unless you adapt all of scikit-learn to handle the result.

While this would be possible, I (as a core scikit-learn developer) would strongly oppose the change because it would make the code very complicated and probably slower. Even if it worked, it would make it impossible to distinguish between a "growing vocabulary" and the much more common situation of a user passing data in the wrong way, so that the number of dimensions comes out wrong.
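To make the length mismatch concrete, here is a small sketch. It assumes toy batches and a LogisticRegression classifier chosen purely for illustration. A model trained on vectors from a smaller vocabulary rejects vectors built from a larger one, and from the model's side this looks the same as a user passing in wrongly shaped data.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

# Toy batches, invented for illustration only.
batch1 = ["good movie", "bad movie"]
batch2 = ["good plot", "bad acting"]

# Vocabulary learned from the first batch only: bad, good, movie.
vec1 = CountVectorizer().fit(batch1)
X1 = vec1.transform(batch1)
clf = LogisticRegression().fit(X1, [1, 0])
print(X1.shape, clf.coef_.shape)  # (2, 3) (1, 3)

# A "grown" vocabulary, simulated here by refitting on both batches.
vec2 = CountVectorizer().fit(batch1 + batch2)
X2 = vec2.transform(batch2)
print(X2.shape)  # (2, 5)

# The classifier has 3 coefficients, so 5-dimensional input is rejected.
mismatch = None
try:
    clf.predict(X2)
except ValueError as err:
    mismatch = str(err)
print(mismatch)
```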

If you want to feed data in batches, then either use a HashingVectorizer (no vocabulary) or do two passes over the data to collect the vocabulary up front.
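Both alternatives might look roughly like the sketch below. The batches, the labels, the choice of SGDClassifier, and n_features=2**10 are assumptions made for illustration, not recommendations from the answer.

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Toy labelled batches, invented for illustration only.
batches = [
    (["good movie", "bad movie"], [1, 0]),
    (["good plot", "bad acting"], [1, 0]),
]

# Option 1: HashingVectorizer is stateless, so unseen words need no
# vocabulary update; the output width is fixed by n_features.
hasher = HashingVectorizer(n_features=2**10)
clf = SGDClassifier(random_state=0)
for texts, labels in batches:
    X = hasher.transform(texts)
    # partial_fit trains incrementally, one batch at a time.
    clf.partial_fit(X, labels, classes=[0, 1])
print(clf.coef_.shape)  # (1, 1024)

# Option 2: two passes. The first pass collects the vocabulary,
# the second vectorizes every batch against that fixed vocabulary.
analyze = CountVectorizer().build_analyzer()
vocab = set()
for texts, _ in batches:
    for doc in texts:
        vocab.update(analyze(doc))

counter = CountVectorizer(vocabulary=sorted(vocab))
shapes = [counter.transform(texts).shape for texts, _ in batches]
print(shapes)  # [(2, 5), (2, 5)]
```

The hashing route trades an inspectable vocabulary (and possible hash collisions) for the ability to process data in a single pass, while the two-pass route keeps readable feature names at the cost of reading the data twice.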

Fred Foo answered Nov 09 '22 03:11