As far as I know, in the Bag of Words (BOW) method, the features are a set of words together with their frequency counts in a document. On the other hand, n-grams, for example unigrams, do exactly the same thing, but do not take into account the frequency of occurrence of a word.
I want to use sklearn's CountVectorizer to implement both the BOW and the n-gram methods.
For BOW my code looks like this:
CountVectorizer(ngram_range=(1, 1), max_features=3000)
Is it enough to set the 'binary' parameter to True to get an n-gram representation?
CountVectorizer(ngram_range=(1, 1), max_features=3000, binary=True)
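To clarify what `binary` actually does: it does not change the n-gram range, it only replaces frequency counts with 0/1 presence indicators. A minimal sketch with a toy corpus (the example sentence is illustrative, not from the question):

```python
from sklearn.feature_extraction.text import CountVectorizer

corpus = ["the cat sat on the mat"]

# Plain BOW: entries are word counts ("the" appears twice -> 2).
counts = CountVectorizer(ngram_range=(1, 1)).fit_transform(corpus)

# binary=True: every present word becomes 1, regardless of how often it occurs.
presence = CountVectorizer(ngram_range=(1, 1), binary=True).fit_transform(corpus)

print(counts.toarray())    # contains a 2 for "the"
print(presence.toarray())  # all nonzero entries are 1
```

So both lines in the question are still unigram models; only the counting scheme differs.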
What are the advantages of the n-gram method over the BOW method?
As answered by @daniel-kurniadi, you need to adapt the value of the ngram_range parameter to use n-grams. For instance, with (1, 2), the vectorizer will take into account both unigrams and bigrams.
The main advantage of n-grams over BOW is that they take the order of words into account. For instance, consider two sentences such as "the dog bites the man" and "the man bites the dog": the meaning is clearly different, but a basic BOW representation is the same in both cases. With n-grams (n >= 2), the order of the terms is captured, so the two representations will differ.
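This can be checked directly with CountVectorizer. A short sketch using two illustrative sentences that contain exactly the same words in a different order:

```python
from sklearn.feature_extraction.text import CountVectorizer

sentences = ["the dog bites the man", "the man bites the dog"]

# Unigram (BOW) vectors: same words, same counts -> identical rows.
bow = CountVectorizer(ngram_range=(1, 1)).fit_transform(sentences).toarray()
print((bow[0] == bow[1]).all())  # True

# Bigram vectors: "dog bites" vs "man bites" etc. -> different rows.
bigrams = CountVectorizer(ngram_range=(2, 2)).fit_transform(sentences).toarray()
print((bigrams[0] == bigrams[1]).all())  # False
```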
If you set the ngram_range parameter to (m, n), it becomes an n-gram implementation, producing all n-grams of length m through n.