CountVectorizer ignoring 'I'

Tags: python, scikit-learn

Why is CountVectorizer in sklearn ignoring the pronoun "I"?

ngram_vectorizer = CountVectorizer(analyzer = "word", ngram_range = (2,2), min_df = 1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
<1x3 sparse matrix of type '<class 'numpy.int64'>'
ngram_vectorizer.get_feature_names()
['gave it', 'he gave', 'it to']


asked Oct 21 '15 13:10

Alex

1 Answer

The default tokenizer only keeps tokens of two or more word characters, so the one-letter word "I" is dropped before the bigrams are built.

You can change this behaviour by passing an appropriate token_pattern to your CountVectorizer.

The default pattern is (see the signature in the docs):

'token_pattern': u'(?u)\\b\\w\\w+\\b'
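To see what that pattern does on its own, here is a minimal sketch that applies it with the standard `re` module, outside scikit-learn. The constant names are made up for illustration, and the text is lowercased by hand to mimic the vectorizer's default `lowercase=True`:

```python
import re

# Default token_pattern: a word needs at least two word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Relaxed pattern: one or more word characters.
RELAXED_PATTERN = r"(?u)\b\w+\b"

# Lowercase by hand, as CountVectorizer does by default.
text = "HE GAVE IT TO I".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
relaxed_tokens = re.findall(RELAXED_PATTERN, text)

print(default_tokens)  # ['he', 'gave', 'it', 'to']
print(relaxed_tokens)  # ['he', 'gave', 'it', 'to', 'i']
```

Only the relaxed pattern matches the single-character token "i".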

You can get a CountVectorizer that does not drop one-letter words by changing the default, for instance:

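from sklearn.feature_extraction.text import CountVectorizer

# \w+ (instead of the default \w\w+) also keeps one-character tokens such as "i"
ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2),
                                   token_pattern=r"(?u)\b\w+\b", min_df=1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
# scikit-learn >= 1.0; older releases used get_feature_names() instead
print(ngram_vectorizer.get_feature_names_out().tolist())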

Which gives:

['gave it', 'he gave', 'it to', 'to i']
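The extra bigram appears because n-grams are built from whatever tokens survive tokenization. The following is a rough pure-Python sketch of that pipeline for the (2,2) case, not scikit-learn's actual implementation; `sketch_bigrams` is a hypothetical helper written for this illustration:

```python
import re

def sketch_bigrams(text, token_pattern):
    """Approximate a word-level (2, 2) vectorizer vocabulary (illustrative only)."""
    # Lowercase, then keep only the tokens matched by the pattern.
    tokens = re.findall(token_pattern, text.lower())
    # Join each pair of adjacent tokens with a space.
    bigrams = {" ".join(tokens[i:i + 2]) for i in range(len(tokens) - 1)}
    # Return the distinct bigrams in sorted order.
    return sorted(bigrams)

default_vocab = sketch_bigrams("HE GAVE IT TO I", r"(?u)\b\w\w+\b")
relaxed_vocab = sketch_bigrams("HE GAVE IT TO I", r"(?u)\b\w+\b")

print(default_vocab)  # ['gave it', 'he gave', 'it to']
print(relaxed_vocab)  # ['gave it', 'he gave', 'it to', 'to i']
```

With the default pattern, "i" never becomes a token, so no bigram can end in it.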

answered Oct 14 '22 23:10

ldirer
