I'm trying to use scikit-learn's CountVectorizer to count character 2-grams, ignoring spaces. The docs for the analyzer parameter state:
Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries.
However, "char_wb" doesn't appear to work as I expected. For example:
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
"The blue dog Blue",
"Green the green cat",
"The green mouse",
]
# CountVectorizer character 2-grams with word boundaries
vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names()
[' b',
' c',
' d',
' g',
' m',
' t',
'at',
'bl',
'ca', ....
Notice features like ' b', which include a space. What gives?
The full documentation for that option actually explains the space:
Option 'char_wb' creates character n-grams only from text inside word boundaries; n-grams at the edges of words are padded with space.
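To see why that produces features like ' b', here is a minimal sketch of the padding idea (a simplified illustration, not scikit-learn's actual implementation): each word is surrounded by one space on each side before the 2-grams are extracted.

def char_wb_bigrams(text):
    # Simplified sketch of the 'char_wb' idea: lowercase the text,
    # pad each word with a single space on both sides, then slide
    # a 2-character window over each padded word.
    ngrams = set()
    for word in text.lower().split():
        padded = f" {word} "
        ngrams.update(padded[i:i + 2] for i in range(len(padded) - 1))
    return sorted(ngrams)

print(char_wb_bigrams("The blue dog Blue"))
# [' b', ' d', ' t', 'bl', 'do', 'e ', 'g ', 'he', 'lu', 'og', 'th', 'ue']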
CountVectorizer is a tool provided by the scikit-learn library in Python. It transforms a given text into a vector based on the frequency (count) of each word that occurs in the entire text.
CountVectorizer selects the words/features/terms that occur most frequently. It uses absolute counts, so if you set max_features=3 it selects the 3 most common words in the data. Setting binary=True makes CountVectorizer record only the presence or absence of each term rather than its frequency.
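For instance, a quick sketch of those two parameters on the question's corpus (using get_feature_names_out, the replacement for get_feature_names in recent scikit-learn releases):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The blue dog Blue",
    "Green the green cat",
    "The green mouse",
]

# max_features=3 keeps only the 3 most frequent terms in the corpus.
top3 = CountVectorizer(max_features=3)
top3.fit(corpus)
print(top3.get_feature_names_out())  # ['blue' 'green' 'the']

# binary=True records presence/absence instead of raw counts.
binary = CountVectorizer(binary=True)
X = binary.fit_transform(corpus)
print(X.toarray().max())  # 1 -- every nonzero count is clipped to 1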
I think this is a longstanding inaccuracy in the documentation, which you are welcome to help fix. It would be more correct to say that:
Option ‘char_wb’ creates character n-grams, but does not generate n-grams that cross word boundaries.
The change appears to have been made in this commit to ensure that; see the contributor's comment. The result looks particularly awkward when you compare the bigram output to that of analyzer='char', but when you increase to trigrams you will see that whitespace can begin or end an n-gram but cannot appear in the middle. This helps signify the word-initial or word-final nature of a feature without capturing noisy cross-word character n-grams. It also ensures that, unlike prior to that commit, all extracted n-grams have the same length!
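A minimal sketch of that comparison, reusing the corpus from the question (again with get_feature_names_out, the current spelling of get_feature_names; the exact output may vary slightly across versions):

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The blue dog Blue",
    "Green the green cat",
    "The green mouse",
]

def trigrams(analyzer):
    vec = CountVectorizer(analyzer=analyzer, ngram_range=(3, 3))
    vec.fit(corpus)
    return set(vec.get_feature_names_out())

# Trigrams produced by 'char' but not by 'char_wb' are exactly the
# cross-word ones, which have a space in the middle position.
cross_word = trigrams("char") - trigrams("char_wb")
print(sorted(cross_word))
# e.g. ['e b', 'e d', 'e g', 'e m', 'g b', 'n c', 'n t']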