I am using scikit-learn to build a classifier that works on (somewhat large) text files. I only need simple bag-of-words features at the moment, so I tried using TfidfVectorizer/HashingVectorizer/CountVectorizer to obtain the feature vectors.
However, processing the entire training data at once to obtain the feature vectors results in a memory error in numpy/scipy (depending on which vectorizer I use). So my question is:
When extracting text features from the raw text: if I fit the data to the vectorizer in chunks, will that be the same as fitting the entire data at once?
To illustrate this with code, is the following:
vectoriser = CountVectorizer() # or TfidfVectorizer/HashingVectorizer
train_vectors = vectoriser.fit_transform(train_data)
different from the following:
vectoriser = CountVectorizer() # or TfidfVectorizer/HashingVectorizer
start = 0
while start < len(train_data):
    vectoriser.fit(train_data[start:(start + 500)])
    start += 500
train_vectors = vectoriser.transform(train_data)
Thanks in advance, and sorry if this is a naive question.
It depends on the vectorizer you are using.
CountVectorizer counts the occurrences of words in the documents. For each document it outputs a vector of length n_words with the number of times each word appears in that document, where n_words is the total number of distinct words across the documents (aka the size of the vocabulary).
It also fits a vocabulary so that you can introspect the model (see which words are important, etc.). You can have a look at it using vectorizer.get_feature_names().
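For instance, a toy sketch (the two documents here are made up for illustration; in recent scikit-learn versions the introspection method is called get_feature_names_out()):
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat", "the cat ate the fish"]
vect = CountVectorizer()
counts = vect.fit_transform(docs)   # 2 x n_words sparse matrix
print(vect.get_feature_names())     # ['ate', 'cat', 'fish', 'sat', 'the']
print(counts.toarray())             # [[0 1 0 1 1]
                                    #  [1 1 1 0 2]]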
When you fit it on your first 500 documents, the vocabulary will only be made of the words from those 500 documents. Say there are 30k of them: fit_transform outputs a 500x30k sparse matrix. Now you fit_transform again with the next 500 documents, but they contain only 29k words, so you get a 500x29k matrix...
Now, how do you align your matrices to make sure all documents have a consistent representation?
I can't think of an easy way to do this at the moment.
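A toy demonstration of the problem (made-up chunks), showing that re-fitting simply discards the previous vocabulary:
from sklearn.feature_extraction.text import CountVectorizer

vect = CountVectorizer()
chunk1 = ["apples and oranges", "bananas"]
chunk2 = ["grapes only"]
print(vect.fit_transform(chunk1).shape)  # (2, 4): vocabulary is {and, apples, bananas, oranges}
print(vect.fit_transform(chunk2).shape)  # (1, 2): vocabulary is {grapes, only}; the old one is gone
Column j therefore refers to a different word in each chunk's matrix, so the matrices cannot simply be stacked.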
With TfidfVectorizer you have another issue, namely the inverse document frequency: to compute document frequencies you need to see all the documents at once.
However, a TfidfVectorizer is just a CountVectorizer followed by a TfidfTransformer, so if you manage to get the output of the CountVectorizer right, you can then apply a TfidfTransformer to the data.
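A minimal sketch of that equivalence (docs is a placeholder list of documents):
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer, TfidfVectorizer

docs = ["some text", "more text here"]  # placeholder documents

# CountVectorizer followed by TfidfTransformer...
counts = CountVectorizer().fit_transform(docs)
tfidf_a = TfidfTransformer().fit_transform(counts)

# ...matches TfidfVectorizer with default settings.
tfidf_b = TfidfVectorizer().fit_transform(docs)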
With HashingVectorizer things are different: there is no vocabulary here.
In [51]: hvect = HashingVectorizer()
In [52]: hvect.fit_transform(X[:1000])
<1000x1048576 sparse matrix of type '<class 'numpy.float64'>'
with 156733 stored elements in Compressed Sparse Row format>
There are certainly not 1M+ distinct words in the first 1000 documents, yet the matrix we get has 1,048,576 (2**20) columns.
The HashingVectorizer does not store a vocabulary in memory: it hashes each token directly to a column index. This makes it more memory efficient and guarantees that the matrices it returns always have the same number of columns, so you don't have the same alignment problem as with the CountVectorizer here.
This is probably the best solution for the batch processing you described. There are a couple of cons, namely that you cannot get the idf weighting and that you do not know the mapping between words and your features.
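As a sketch (reusing train_data from the question, with the same arbitrary chunk size of 500), batch extraction could look like this:
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer

hvect = HashingVectorizer()  # stateless: no fit needed, transform() is enough
chunks = [hvect.transform(train_data[i:i + 500])
          for i in range(0, len(train_data), 500)]
train_vectors = vstack(chunks)  # every chunk has the same 2**20 columns, so stacking is safe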
The HashingVectorizer documentation references an example that does out-of-core classification on text data. It may be a bit messy but it does what you'd like to do.
Hope this helps.
EDIT:
If you have too much data, HashingVectorizer is the way to go. If you still want to use CountVectorizer, a possible workaround is to fit the vocabulary yourself and to pass it to your vectorizer so that you only need to call transform.
Here's an example you can adapt:
import re
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
news = fetch_20newsgroups()
X, y = news.data, news.target
Now the approach that does not work:
# Fitting directly:
vect = CountVectorizer()
vect.fit_transform(X[:1000])
<1000x27953 sparse matrix of type '<class 'numpy.int64'>'
with 156751 stored elements in Compressed Sparse Row format>
Note the size of the matrix we get.
Fitting the vocabulary 'manually':
def tokenizer(doc):
    # Uses the default token pattern from CountVectorizer
    token_pattern = re.compile(r'(?u)\b\w\w+\b')
    return token_pattern.findall(doc)
stop_words = set() # Whatever you want to have as stop words.
vocabulary = set([word for doc in X for word in tokenizer(doc) if word not in stop_words])
vectorizer = CountVectorizer(vocabulary=vocabulary)
X_counts = vectorizer.transform(X[:1000])
# Now X_counts is:
# <1000x155448 sparse matrix of type '<class 'numpy.int64'>'
#  with 149624 stored elements in Compressed Sparse Row format>
tfidf = TfidfTransformer()
X_tfidf = tfidf.fit_transform(X_counts)
In your case, you'll first need to build the entire count matrix X_counts (for all documents) before fitting and applying the tf-idf transform.
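A minimal sketch of that, reusing vectorizer and X from above (the chunk size of 500 is arbitrary):
from scipy.sparse import vstack

# The vocabulary is fixed up front, so every chunk gets the same columns and can be stacked.
count_chunks = [vectorizer.transform(X[i:i + 500]) for i in range(0, len(X), 500)]
X_counts_all = vstack(count_chunks)

# Fit the idf weights on the full count matrix, then transform in one go.
X_tfidf_all = TfidfTransformer().fit_transform(X_counts_all)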
I'm not an expert in text feature extraction, but based on the documentation and my experience with other classifiers:
If I do several fits on chunks of the training data, will that be the same as fitting the entire data at once?
You can't directly merge the extracted features, because the same token/word gets a different importance (weight) from each chunk, since its frequency relative to the other words differs from chunk to chunk, and it is represented by a different key in each chunk's vocabulary.
You can use any feature extraction method; the usefulness of the result depends on the task, I think.
However, you can use the different chunks' different features for classification on the same data. Once you have several outputs obtained with the same feature extraction method (or even with different extraction methods), you can use them as input to a "merging" mechanism like bagging, boosting, etc., as sketched below. In most cases, this whole process will actually give a better final output than feeding the full file to a single "full-featured" but otherwise simple classifier.
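A rough sketch of that idea (all names here are hypothetical; it assumes train_data, train_labels and test_data exist, and that every chunk contains examples of every class):
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# One vectorizer + classifier per chunk; each pipeline has its own vocabulary.
pipelines = []
for i in range(0, len(train_data), 500):
    pipe = make_pipeline(CountVectorizer(), LogisticRegression())
    pipe.fit(train_data[i:i + 500], train_labels[i:i + 500])
    pipelines.append(pipe)

# Crude bagging-style merge: average the per-chunk probability estimates.
proba = np.mean([p.predict_proba(test_data) for p in pipelines], axis=0)
predictions = pipelines[0].classes_[proba.argmax(axis=1)]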