As far as I know, in the Bag of Words (BOW) method, the features are a set of words together with their frequency counts in a document. On the other hand, n-grams, for example unigrams, do exactly the same thing, but do not take into account the frequency of occurrence of a word.
I want to use sklearn's CountVectorizer to implement both the BOW and the n-gram methods.
For BOW my code looks like this:
CountVectorizer(ngram_range=(1, 1), max_features=3000)
Is it enough to set the 'binary' parameter to True to get an n-gram representation?
CountVectorizer(ngram_range=(1, 1), max_features=3000, binary=True)
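To clarify what `binary` actually does: it does not change the n-gram range, it only replaces frequency counts with 0/1 presence indicators. A minimal sketch with a toy corpus (the example sentence is illustrative, not from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat"]

# Plain BOW: entries are word counts ("the" appears twice -> 2).
counts = CountVectorizer(ngram_range=(1, 1)).fit_transform(corpus)

# binary=True: every present word becomes 1, regardless of how often it occurs.
presence = CountVectorizer(ngram_range=(1, 1), binary=True).fit_transform(corpus)

print(counts.toarray())    # contains a 2 for "the"
print(presence.toarray())  # all nonzero entries are 1
```

So both lines in the question are still unigram models; only the counting scheme differs.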
What are the advantages of the n-gram method over the BOW method?
As answered by @daniel-kurniadi, you need to adapt the value of the ngram_range parameter to use n-grams. For instance, with (1, 2), the vectorizer will take into account both unigrams and bigrams.
The main advantage of n-grams over BOW is that they take the order of words into account. For instance, consider two sentences such as "the dog bites the man" and "the man bites the dog": the meaning is clearly different, but a basic BOW representation is the same in both cases. With n-grams (n >= 2), the order of the terms is captured, so the two representations will differ.
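This can be checked directly with CountVectorizer. A short sketch using two illustrative sentences that contain exactly the same words in a different order:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the dog bites the man", "the man bites the dog"]

# Unigram (BOW) vectors: same words, same counts -> identical rows.
bow = CountVectorizer(ngram_range=(1, 1)).fit_transform(sentences).toarray()
print((bow[0] == bow[1]).all())  # True

# Bigram vectors: "dog bites" vs "man bites" etc. -> different rows.
bigrams = CountVectorizer(ngram_range=(2, 2)).fit_transform(sentences).toarray()
print((bigrams[0] == bigrams[1]).all())  # False
```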
If you set the ngram_range parameter to (m, n), it becomes an n-gram implementation, producing all n-grams of length m through n.