
CountVectorizer ignoring 'I'

Why is CountVectorizer in sklearn ignoring the pronoun "I"?

ngram_vectorizer = CountVectorizer(analyzer = "word", ngram_range = (2,2), min_df = 1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
<1x3 sparse matrix of type '<class 'numpy.int64'>'
ngram_vectorizer.get_feature_names()
['gave it', 'he gave', 'it to']
Alex asked Oct 21 '15 13:10




1 Answer

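The default tokenizer only keeps tokens of two or more word characters, so the one-letter word "i" is dropped. (The text is lowercased first, which is why the features appear in lower case.)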

You can change this behaviour by passing an appropriate token_pattern to your CountVectorizer.

The default pattern is (see the signature in the docs):

'token_pattern': u'(?u)\\b\\w\\w+\\b'
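To see why that pattern skips "I", here is a minimal sketch that applies it directly with Python's `re` module, outside of scikit-learn. The names `DEFAULT_PATTERN` and `RELAXED_PATTERN` are only illustrative, and the manual `.lower()` call is an assumption standing in for CountVectorizer's default `lowercase=True` preprocessing.

```python
import re

# The default token_pattern quoted above, and a relaxed variant that
# also accepts one-character tokens (illustrative names, not sklearn API).
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
RELAXED_PATTERN = r"(?u)\b\w+\b"

# Assumption: mimic CountVectorizer's default lowercasing by hand.
text = "HE GAVE IT TO I".lower()

default_tokens = re.findall(DEFAULT_PATTERN, text)
relaxed_tokens = re.findall(RELAXED_PATTERN, text)

print(default_tokens)  # ['he', 'gave', 'it', 'to']
print(relaxed_tokens)  # ['he', 'gave', 'it', 'to', 'i']
```

Because `\w\w+` needs at least two word characters, "i" never becomes a token, so no bigram containing it can be built.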

You can get a CountVectorizer that does not drop one-letter words by changing the default, for instance:

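from sklearn.feature_extraction.text import CountVectorizer

# \w+ instead of the default \w\w+ keeps one-letter tokens such as "i".
ngram_vectorizer = CountVectorizer(analyzer="word", ngram_range=(2, 2),
                                   token_pattern=r"(?u)\b\w+\b", min_df=1)
ngram_vectorizer.fit_transform(['HE GAVE IT TO I'])
# scikit-learn >= 1.0; on older versions use get_feature_names() instead.
print(ngram_vectorizer.get_feature_names_out().tolist())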

Which gives:

['gave it', 'he gave', 'it to', 'to i']
ldirer answered Oct 14 '22 23:10
