Scikit-learn's CountVectorizer for the bag-of-words approach currently offers two options: (a) use a custom vocabulary, or (b) if no custom vocabulary is given, build a vocabulary from all the words present in the corpus.
My question: can we specify a custom vocabulary to begin with, but have it updated when new words are seen while processing the corpus? I am assuming this is doable, since the matrix is stored in a sparse representation.
Usefulness: it would help in cases where one has to add documents to the training data, without having to start over from the beginning.
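For concreteness, here is a minimal sketch of the two options as I understand them (the toy corpus is made up; the CountVectorizer calls are the standard API):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat", "the dog barked"]

# Option (a): fixed custom vocabulary; words outside it are ignored.
vec_fixed = CountVectorizer(vocabulary=["cat", "dog", "the"])
X_fixed = vec_fixed.transform(corpus)   # no fit needed when a vocabulary is given
print(X_fixed.shape)                    # (2, 3)

# Option (b): vocabulary learned from the corpus during fit.
vec_learned = CountVectorizer()
X_learned = vec_learned.fit_transform(corpus)
print(X_learned.shape)                  # (2, 5): barked, cat, dog, sat, the
```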
No, this is not possible at present. It's also not "doable", and here's why.
CountVectorizer and TfidfVectorizer are designed to turn text documents into vectors. These vectors all need to have an equal number of elements, which in turn is equal to the size of the vocabulary, because that convention is ingrained in all scikit-learn code. If the vocabulary is allowed to grow, then the vectors produced at various times have different lengths. This affects, e.g., the number of parameters in a linear (or other parametric) classifier trained on such vectors, which would then also need to be able to grow. It affects k-means and the dimensionality reduction classes. It even affects something as simple as matrix multiplication, which can no longer be handled with a simple call to NumPy's dot routine and requires custom code instead. In other words, allowing this flexibility in the vectorizers makes little sense unless you adapt all of scikit-learn to handle the result.
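To make the fixed-length constraint concrete, here is a minimal sketch (the toy documents are my own) showing that transform silently drops unseen words rather than growing the vocabulary:

```python
from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer()
vec.fit(["the cat sat"])              # learned vocabulary: cat, sat, the

# A new document containing the unseen word "dog" still maps to a
# 3-dimensional vector; "dog" is silently dropped, so downstream
# models keep a fixed input size.
X = vec.transform(["the dog sat"])
print(X.shape)                        # (1, 3)
print(X.toarray())                    # [[0 1 1]] -- counts for cat, sat, the
```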
While this would be possible, I (as a core scikit-learn developer) would strongly oppose the change because it would make the code very complicated and probably slower, and even if it worked, it would make it impossible to distinguish between a "growing vocabulary" and the much more common situation of a user passing data in the wrong way, so that the number of dimensions comes out wrong.
If you want to feed data in batches, then either use a HashingVectorizer (which needs no vocabulary) or do two passes over the data to collect the vocabulary up front.
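A minimal sketch of both workarounds on made-up batches; n_features=2**10 is an arbitrary illustrative choice (HashingVectorizer defaults to 2**20):

```python
from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

batch1 = ["the cat sat"]
batch2 = ["a dog barked"]

# Workaround 1: HashingVectorizer is stateless -- every batch is hashed
# into the same fixed number of columns, no vocabulary required.
hasher = HashingVectorizer(n_features=2**10)
X1 = hasher.transform(batch1)
X2 = hasher.transform(batch2)
print(X1.shape, X2.shape)    # (1, 1024) (1, 1024)

# Workaround 2: two passes -- collect the full vocabulary up front,
# then vectorize each batch against that fixed vocabulary.
full_vocab = CountVectorizer().fit(batch1 + batch2).vocabulary_
vec = CountVectorizer(vocabulary=full_vocab)
Y1 = vec.transform(batch1)
Y2 = vec.transform(batch2)
print(Y1.shape, Y2.shape)    # (1, 5) (1, 5)
```

The trade-off: hashing needs only one pass and no state, but features can collide and column indices are not invertible back to words; the two-pass approach keeps an interpretable vocabulary at the cost of an extra pass over the data.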