
Bag of Words (BOW) vs N-gram (sklearn CountVectorizer) - text documents classification

As far as I know, in the Bag of Words (BOW) method, the features are a set of words together with their frequency counts in a document. On the other hand, n-grams (for example, unigrams) do exactly the same, except that they do not take the frequency of occurrence of a word into account.

I want to use sklearn and CountVectorizer to implement both BOW and n-gram methods.

For BOW my code looks like this:

CountVectorizer(ngram_range=(1, 1), max_features=3000)

Is it enough to set the 'binary' parameter to True to perform n-gram feature extraction?

CountVectorizer(ngram_range=(1, 1), max_features=3000, binary=True)
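(A small sketch of what these two settings actually do in sklearn: `binary=True` only clips each count to 0 or 1; it does not change which features are extracted, so it is `ngram_range`, not `binary`, that controls n-grams. The toy document below is my own illustration.)

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat"]

# Plain BOW: raw term counts ("the" appears twice)
counts = CountVectorizer(ngram_range=(1, 1)).fit_transform(docs)

# binary=True clips every count to 0/1; the vocabulary is identical
binary = CountVectorizer(ngram_range=(1, 1), binary=True).fit_transform(docs)

print(counts.toarray())  # [[1 1 1 1 2]]  (cat, mat, on, sat, the)
print(binary.toarray())  # [[1 1 1 1 1]]
```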

What are the advantages of n-gram over the BOW method?

Taldakus asked Jul 31 '18 20:07

2 Answers

As answered by @daniel-kurniadi, you need to adapt the value of the ngram_range parameter to use n-grams. For instance, with (1, 2) the vectorizer will take both unigrams and bigrams into account.

The main advantage of n-grams over BOW is that they take the sequence of words into account. For instance, consider the sentences:

  1. "I love vanilla but I hate chocolate"
  2. "I love chocolate but I hate vanilla"

The meaning is clearly different, but a basic BOW representation will be the same in both cases. With n-grams (n >= 2), the order of the terms is captured, so the representations will differ.
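This is easy to verify with CountVectorizer itself (note that sklearn's default tokenizer drops single-character tokens, so the word "I" is ignored here; that does not affect the comparison):

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["I love vanilla but I hate chocolate",
        "I love chocolate but I hate vanilla"]

# Unigram (plain BOW) vectors: identical for both sentences
uni = CountVectorizer(ngram_range=(1, 1)).fit_transform(docs).toarray()
print((uni[0] == uni[1]).all())  # True: BOW cannot tell them apart

# Adding bigrams captures word order: the vectors now differ
bi = CountVectorizer(ngram_range=(1, 2)).fit_transform(docs).toarray()
print((bi[0] == bi[1]).all())  # False
```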

gdupont answered Sep 28 '22 13:09


If you set the ngram_range parameter to (m, n), the vectorizer will extract all n-grams of length m through n, turning it into an n-gram implementation.

Daniel Kurniadi answered Sep 28 '22 14:09