Understanding the `ngram_range` argument in a CountVectorizer in sklearn

I'm a little confused about how to use ngrams in the scikit-learn library in Python; specifically, how the ngram_range argument works in a CountVectorizer.

Running this code:

from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['hi ', 'bye', 'run away']
cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
print(cv.vocabulary_)

gives me:

{'hi ': 0, 'bye': 1, 'run away': 2} 

Whereas I was under the (obviously mistaken) impression that I would get unigrams and bigrams, like this:

{'hi ': 0, 'bye': 1, 'run away': 2, 'run': 3, 'away': 4} 

I am working with the documentation here: http://scikit-learn.org/stable/modules/feature_extraction.html

Clearly there is something terribly wrong with my understanding of how to use ngrams. Perhaps the argument is having no effect or I have some conceptual issue with what an actual bigram is! I'm stumped. If anyone has a word of advice to throw my way, I'd be grateful.

UPDATE:
I have realized the folly of my ways. I was under the impression that ngram_range would expand the explicit vocabulary, when in fact it controls what is extracted from the corpus.
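Here is a minimal sketch of what I had mixed up (a toy document of my own, not from the code above): without an explicit vocabulary, ngram_range determines which n-grams are learned from the corpus, while an explicit vocabulary is taken as-is:

from sklearn.feature_extraction.text import CountVectorizer

# No explicit vocabulary: unigrams and bigrams are learned from the corpus.
cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(['run away quickly'])
print(sorted(cv.vocabulary_.items()))
# [('away', 0), ('away quickly', 1), ('quickly', 2), ('run', 3), ('run away', 4)]

# Explicit vocabulary: used verbatim; ngram_range does not expand it.
cv = CountVectorizer(vocabulary=['hi', 'bye', 'run away'], ngram_range=(1, 2))
print(sorted(cv.vocabulary_.items()))
# [('bye', 1), ('hi', 0), ('run away', 2)]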

asked Jun 03 '14 by tumultous_rooster

People also ask

What is Ngram_range in CountVectorizer?

CountVectorizer tokenizes the text and groups the tokens into chunks called n-grams, whose length you control by passing a tuple to the ngram_range argument. For example, (1, 1) gives unigrams (1-grams) such as “whey” and “protein”, while (2, 2) gives bigrams (2-grams) such as “whey protein”.
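A quick sketch of that (the "whey protein isolate" document is just an illustrative toy example):

from sklearn.feature_extraction.text import CountVectorizer

doc = ["whey protein isolate"]
print(sorted(CountVectorizer(ngram_range=(1, 1)).fit(doc).vocabulary_))
# ['isolate', 'protein', 'whey']                                       unigrams only
print(sorted(CountVectorizer(ngram_range=(2, 2)).fit(doc).vocabulary_))
# ['protein isolate', 'whey protein']                                  bigrams only
print(sorted(CountVectorizer(ngram_range=(1, 2)).fit(doc).vocabulary_))
# ['isolate', 'protein', 'protein isolate', 'whey', 'whey protein']    both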

What is Max features in CountVectorizer?

max_features keeps only the words/features/terms that occur most frequently: if you set max_features=3, CountVectorizer builds its vocabulary from the 3 most common words in the data. Setting binary=True makes CountVectorizer record only whether a term occurs (presence/absence) rather than how often it occurs.
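A short sketch of both options (toy documents chosen here for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["whey whey whey protein shake", "protein shake recipe"]

# max_features=3 keeps only the 3 most frequent terms ('recipe' is dropped).
cv = CountVectorizer(max_features=3)
print(cv.fit_transform(docs).toarray())
# [[1 1 3]
#  [1 1 0]]   columns: protein, shake, whey

# binary=True records presence/absence instead of counts.
cv_bin = CountVectorizer(max_features=3, binary=True)
print(cv_bin.fit_transform(docs).toarray())
# [[1 1 1]
#  [1 1 0]]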

How does CountVectorizer work in Python?

CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.

How do you use a Sklearn CountVectorizer?

You can use it as follows: create an instance of the CountVectorizer class, call fit() to learn a vocabulary from one or more documents, then call transform() on one or more documents as needed to encode each as a count vector.
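For example (a minimal sketch with made-up documents):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the doctor keeps the apples", "an apple a day"]

cv = CountVectorizer()              # 1. create the instance
cv.fit(docs)                        # 2. learn the vocabulary from the documents
X = cv.transform(["apple a day"])   # 3. encode new text against that vocabulary

print(sorted(cv.vocabulary_))
# ['an', 'apple', 'apples', 'day', 'doctor', 'keeps', 'the']
print(X.toarray())
# [[0 1 0 1 0 0 0]]   counts for 'apple' and 'day'; 'a' is too short to be a token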


1 Answer

Setting the vocabulary explicitly means no vocabulary is learned from data. If you don't set it, you get:

>>> v = CountVectorizer(ngram_range=(1, 2))
>>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)
{u'an': 0,
 u'an apple': 1,
 u'apple': 2,
 u'apple day': 3,
 u'away': 4,
 u'day': 5,
 u'day keeps': 6,
 u'doctor': 7,
 u'doctor away': 8,
 u'keeps': 9,
 u'keeps the': 10,
 u'the': 11,
 u'the doctor': 12}

An explicit vocabulary restricts the terms that will be extracted from text; the vocabulary is not changed:

>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]])  # unigram and bigram found

(Note that token filtering happens before n-gram extraction: the default token_pattern discards single-character tokens such as "a", and stop words, if you set stop_words, are removed at the same stage; hence the "apple day" bigram.)
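One way to see that ordering is to call the analyzer directly; build_analyzer() returns the preprocessing/tokenization/n-gram pipeline as a plain function (a sketch, using a fragment of the sentence above):

from sklearn.feature_extraction.text import CountVectorizer

analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
print(analyze("an apple a day"))
# ['an', 'apple', 'day', 'an apple', 'apple day']   "a" is filtered out before bigrams are built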

answered Oct 02 '22 by Fred Foo