Understanding the `ngram_range` argument in a CountVectorizer in sklearn

I'm a little confused about how to use ngrams in the scikit-learn library in Python; specifically, how the ngram_range argument works in a CountVectorizer.

Running this code:

from sklearn.feature_extraction.text import CountVectorizer

vocabulary = ['hi ', 'bye', 'run away']
cv = CountVectorizer(vocabulary=vocabulary, ngram_range=(1, 2))
print(cv.vocabulary_)

gives me:

{'hi ': 0, 'bye': 1, 'run away': 2} 

Whereas I was under the (obviously mistaken) impression that I would get unigrams and bigrams, like this:

{'hi ': 0, 'bye': 1, 'run away': 2, 'run': 3, 'away': 4} 

I am working with the documentation here: http://scikit-learn.org/stable/modules/feature_extraction.html

Clearly there is something terribly wrong with my understanding of how to use ngrams. Perhaps the argument is having no effect or I have some conceptual issue with what an actual bigram is! I'm stumped. If anyone has a word of advice to throw my way, I'd be grateful.

UPDATE:
I have realized the folly of my ways. I was under the impression that ngram_range would expand the explicit vocabulary, when in fact it controls what is extracted from the corpus.
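Here is a minimal sketch of what I had mixed up (a toy document of my own, not from the code above): without an explicit vocabulary, ngram_range determines which n-grams are learned from the corpus, while an explicit vocabulary is taken as-is:

from sklearn.feature_extraction.text import CountVectorizer

# No explicit vocabulary: unigrams and bigrams are learned from the corpus.
cv = CountVectorizer(ngram_range=(1, 2))
cv.fit(['run away quickly'])
print(sorted(cv.vocabulary_.items()))
# [('away', 0), ('away quickly', 1), ('quickly', 2), ('run', 3), ('run away', 4)]

# Explicit vocabulary: used verbatim; ngram_range does not expand it.
cv = CountVectorizer(vocabulary=['hi', 'bye', 'run away'], ngram_range=(1, 2))
print(sorted(cv.vocabulary_.items()))
# [('bye', 1), ('hi', 0), ('run away', 2)]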

asked Jun 03 '14 by tumultous_rooster

People also ask

What is Ngram_range in CountVectorizer?

CountVectorizer tokenizes the text and groups the tokens into chunks called n-grams, whose length you control by passing a tuple to the ngram_range argument. For example, (1, 1) gives unigrams (1-grams) such as “whey” and “protein”, while (2, 2) gives bigrams (2-grams) such as “whey protein”.
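A quick sketch of that (the "whey protein isolate" document is just an illustrative toy example):

from sklearn.feature_extraction.text import CountVectorizer

doc = ["whey protein isolate"]
print(sorted(CountVectorizer(ngram_range=(1, 1)).fit(doc).vocabulary_))
# ['isolate', 'protein', 'whey']                                       unigrams only
print(sorted(CountVectorizer(ngram_range=(2, 2)).fit(doc).vocabulary_))
# ['protein isolate', 'whey protein']                                  bigrams only
print(sorted(CountVectorizer(ngram_range=(1, 2)).fit(doc).vocabulary_))
# ['isolate', 'protein', 'protein isolate', 'whey', 'whey protein']    both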

What is Max features in CountVectorizer?

max_features keeps only the words/features/terms that occur most frequently: if you set max_features=3, CountVectorizer builds its vocabulary from the 3 most common words in the data. Setting binary=True makes CountVectorizer record only whether a term occurs (presence/absence) rather than how often it occurs.
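A short sketch of both options (toy documents chosen here for illustration):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["whey whey whey protein shake", "protein shake recipe"]

# max_features=3 keeps only the 3 most frequent terms ('recipe' is dropped).
cv = CountVectorizer(max_features=3)
print(cv.fit_transform(docs).toarray())
# [[1 1 3]
#  [1 1 0]]   columns: protein, shake, whey

# binary=True records presence/absence instead of counts.
cv_bin = CountVectorizer(max_features=3, binary=True)
print(cv_bin.fit_transform(docs).toarray())
# [[1 1 1]
#  [1 1 0]]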

How does CountVectorizer work in Python?

CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.

How do you use a Sklearn CountVectorizer?

You can use it as follows: create an instance of the CountVectorizer class, call fit() to learn a vocabulary from one or more documents, then call transform() on one or more documents as needed to encode each as a count vector.
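For example (a minimal sketch with made-up documents):

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the doctor keeps the apples", "an apple a day"]

cv = CountVectorizer()              # 1. create the instance
cv.fit(docs)                        # 2. learn the vocabulary from the documents
X = cv.transform(["apple a day"])   # 3. encode new text against that vocabulary

print(sorted(cv.vocabulary_))
# ['an', 'apple', 'apples', 'day', 'doctor', 'keeps', 'the']
print(X.toarray())
# [[0 1 0 1 0 0 0]]   counts for 'apple' and 'day'; 'a' is too short to be a token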


1 Answer

Setting the vocabulary explicitly means no vocabulary is learned from data. If you don't set it, you get:

>>> v = CountVectorizer(ngram_range=(1, 2))
>>> pprint(v.fit(["an apple a day keeps the doctor away"]).vocabulary_)
{u'an': 0,
 u'an apple': 1,
 u'apple': 2,
 u'apple day': 3,
 u'away': 4,
 u'day': 5,
 u'day keeps': 6,
 u'doctor': 7,
 u'doctor away': 8,
 u'keeps': 9,
 u'keeps the': 10,
 u'the': 11,
 u'the doctor': 12}

An explicit vocabulary restricts the terms that will be extracted from text; the vocabulary is not changed:

>>> v = CountVectorizer(ngram_range=(1, 2), vocabulary={"keeps", "keeps the"})
>>> v.fit_transform(["an apple a day keeps the doctor away"]).toarray()
array([[1, 1]])  # unigram and bigram found

(Note that token filtering happens before n-gram extraction: the default token_pattern discards single-character tokens such as "a", and stop words, if you set stop_words, are removed at the same stage; hence the "apple day" bigram.)
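One way to see that ordering is to call the analyzer directly; build_analyzer() returns the preprocessing/tokenization/n-gram pipeline as a plain function (a sketch, using a fragment of the sentence above):

from sklearn.feature_extraction.text import CountVectorizer

analyze = CountVectorizer(ngram_range=(1, 2)).build_analyzer()
print(analyze("an apple a day"))
# ['an', 'apple', 'day', 'an apple', 'apple day']   "a" is filtered out before bigrams are built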

answered Oct 02 '22 by Fred Foo