Special characters in CountVectorizer (scikit-learn)

Consider this runnable example:

#coding: utf-8
from sklearn.feature_extraction.text import CountVectorizer

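vectorizer = CountVectorizer()
# Three documents; note the comma after the first string
corpus = ['öåa hej ho', 'åter aba na', 'äs äp äl']
x = vectorizer.fit_transform(corpus)
l = vectorizer.get_feature_names()

# Print every term in the learned vocabulary
for u in l:
    print u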

The output will be

aba
hej
ho
na
ter

Why are the characters å, ä and ö removed? Note that strip_accents=None is the vectorizer's default. I would be really grateful for any help with this.

user1506145 asked Apr 18 '13 10:04


1 Answer

This is an intentional way to reduce dimensionality while making the vectorizer tolerant of input whose authors do not use accented characters consistently.

To disable that feature, pass strip_accents=None to CountVectorizer, as explained in the documentation of this class.

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> CountVectorizer(strip_accents='ascii').build_analyzer()(u'\xe9t\xe9')
[u'ete']
>>> CountVectorizer(strip_accents=False).build_analyzer()(u'\xe9t\xe9')
[u'\xe9t\xe9']
>>> CountVectorizer(strip_accents=None).build_analyzer()(u'\xe9t\xe9')
[u'\xe9t\xe9']
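
The same comparison can be tried on the corpus from the question. The following is a minimal sketch, assuming Python 3 (where string literals are already Unicode) and a recent scikit-learn release; the variable names keep, fold, kept_terms and folded_terms are only illustrative. It reads the fitted vocabulary_ attribute rather than get_feature_names(), since the latter has been renamed in newer releases.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Corpus from the question, as three separate documents.
corpus = ["öåa hej ho", "åter aba na", "äs äp äl"]

# strip_accents=None (the default): accented characters are left untouched.
keep = CountVectorizer(strip_accents=None)
keep.fit(corpus)
kept_terms = sorted(keep.vocabulary_)

# strip_accents="ascii": accented characters are folded to ASCII letters
# before tokenization, so "åter" becomes "ater".
fold = CountVectorizer(strip_accents="ascii")
fold.fit(corpus)
folded_terms = sorted(fold.vocabulary_)

print(kept_terms)
print(folded_terms)
```

Under these assumptions the first list should still contain terms such as åter, while the second should contain only ASCII terms such as ater. On Python 2, as in the question, the input strings would need to be unicode objects (u'...') for a comparable result.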

ogrisel answered Sep 19 '22 22:09