Consider this runnable example:
# coding: utf-8
from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
corpus = ['öåa hej ho', 'åter aba na', 'äs äp äl']
x = vectorizer.fit_transform(corpus)
l = vectorizer.get_feature_names()
for u in l:
    print(u)
The output will be
aba
hej
ho
na
ter
Why are å, ä and ö removed? Note that strip_accents=None is the vectorizer's default. I would be really grateful if you could help me with this.
The default tokenization in CountVectorizer removes all special characters, punctuation, and single-character tokens. If this is not the behavior you want, and you need to keep punctuation and special characters, you can provide a custom tokenizer to CountVectorizer.
The CountVectorizer selects the words/features/terms that occur most frequently across the corpus. max_features is an absolute count: setting max_features=3 keeps the 3 most common words in the data. Setting binary=True makes the CountVectorizer stop counting term frequency; each term is only marked as present (1) or absent (0).
CountVectorizer is a great tool provided by the scikit-learn library in Python. It is used to transform a given text into a vector on the basis of the frequency (count) of each word that occurs in the entire text.
This is an intentional way to reduce the dimensionality while making the vectorizer tolerant of inputs whose authors are not always consistent in their use of accented characters.
If you want to disable that feature, just pass strip_accents=None to CountVectorizer, as explained in the documentation of this class.
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> CountVectorizer(strip_accents='ascii').build_analyzer()(u'\xe9t\xe9')
[u'ete']
>>> CountVectorizer(strip_accents=False).build_analyzer()(u'\xe9t\xe9')
[u'\xe9t\xe9']
>>> CountVectorizer(strip_accents=None).build_analyzer()(u'\xe9t\xe9')
[u'\xe9t\xe9']