What is the difference between HashingVectorizer and CountVectorizer, and when should each be used?

I am trying out various SVM variants in scikit-learn together with CountVectorizer and HashingVectorizer. Different examples use fit or fit_transform, and I am confused about which one to use when.

Any clarification would be much appreciated.

asked May 04 '15 by user123

1 Answer

They serve a similar purpose. The documentation lists some pros and cons for the HashingVectorizer (a small usage sketch follows the lists below):

This strategy has several advantages:

  • it is very low memory scalable to large datasets as there is no need to store a vocabulary dictionary in memory
  • it is fast to pickle and un-pickle as it holds no state besides the constructor parameters
  • it can be used in a streaming (partial fit) or parallel pipeline as there is no state computed during fit.

There are also a couple of cons (vs using a CountVectorizer with an in-memory vocabulary):

  • there is no way to compute the inverse transform (from feature indices to string feature names) which can be a problem when trying to introspect which features are most important to a model.
  • there can be collisions: distinct tokens can be mapped to the same feature index. However in practice this is rarely an issue if n_features is large enough (e.g. 2 ** 18 for text classification problems).
  • no IDF weighting as this would render the transformer stateful.
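
For concreteness, here is a minimal sketch (not part of the original answer) of how that statefulness difference shows up in code: CountVectorizer must fit (or fit_transform) to learn a vocabulary before it can transform text, while HashingVectorizer computes nothing during fit, so you can call transform directly.

    from sklearn.feature_extraction.text import CountVectorizer, HashingVectorizer

    docs = ["the cat sat on the mat", "the dog ate my homework"]

    # CountVectorizer is stateful: fit_transform learns the vocabulary and
    # transforms in one step; the fitted vocabulary is then reused for new text.
    count_vec = CountVectorizer()
    X_counts = count_vec.fit_transform(docs)
    print(sorted(count_vec.vocabulary_))        # feature names are recoverable

    # HashingVectorizer is stateless: fit is a no-op, so transform can be called
    # directly, which is what makes it usable in streaming/parallel pipelines.
    hash_vec = HashingVectorizer(n_features=2**18)  # large n_features limits collisions
    X_hashed = hash_vec.transform(docs)             # no vocabulary, no inverse transform

Both calls produce sparse matrices that can be fed to an SVM; the practical trade-off is memory and statelessness versus introspectability (feature names, inverse transform) as described above.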
answered Sep 22 '22 by cfh