
CountVectorizer(analyzer='char_wb') not working as expected

I'm trying to use scikit-learn's CountVectorizer to count character 2-grams, ignoring spaces. The docs describe the analyzer parameter as follows:

Whether the feature should be made of word or character n-grams. Option ‘char_wb’ creates character n-grams only from text inside word boundaries.

However, "char_wb" doesn't appear to work as I expected. For example:

from sklearn.feature_extraction.text import CountVectorizer

corpus = [
    "The blue dog Blue",
    "Green the green cat",
    "The green mouse",
]

# CountVectorizer character 2-grams with word boundaries
vectorizer = CountVectorizer(analyzer='char_wb', ngram_range=(2, 2), min_df=1)
X = vectorizer.fit_transform(corpus)
vectorizer.get_feature_names()  # get_feature_names_out() in scikit-learn >= 1.0
[' b',
 ' c',
 ' d',
 ' g',
 ' m',
 ' t',
 'at',
 'bl',
 'ca', ....

Notice the examples like ' b' which include a space. What gives?

Ben asked Mar 23 '16 21:03



1 Answer

I think this is a longstanding inaccuracy in the documentation, which you are welcome to help fix. It would be more correct to say that:

Option ‘char_wb’ creates character n-grams, but does not generate n-grams that cross word boundaries.

The change appears to have been made in this commit to ensure exactly that; see the contributor's comment. The result looks particularly awkward when you compare the bigram output with that of analyzer='char', but once you move up to trigrams you can see that whitespace may begin or end an n-gram but never appears in the middle. This marks a feature as word-initial or word-final without capturing noisy character n-grams that span two words. It also ensures that, unlike before that commit, all extracted n-grams have the same length.
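To make the padding behaviour concrete, here is a minimal pure-Python sketch of what 'char_wb' appears to do: pad each word with one space on either side, then slide a fixed-width window over it. The function names char_wb_ngrams and inner_ngrams are made up for illustration, and this is an approximation of the behaviour described above, not scikit-learn's actual implementation. The second function filters out the padded n-grams, which is closer to what the question was after.

```python
def char_wb_ngrams(text, n):
    """Approximate 'char_wb': character n-grams taken per word,
    with each word padded by a single space on both sides.
    (Hypothetical helper, not scikit-learn's own code.)"""
    grams = []
    for word in text.lower().split():
        padded = " " + word + " "
        # Slide a window of width n; a word shorter than the window
        # is emitted once, as the whole padded word.
        last_start = max(1, len(padded) - n + 1)
        grams.extend(padded[i:i + n] for i in range(last_start))
    return grams


def inner_ngrams(text, n):
    """Keep only the n-grams that contain no padding space,
    i.e. n-grams lying strictly inside a word."""
    return [g for g in char_wb_ngrams(text, n) if " " not in g]


if __name__ == "__main__":
    print(char_wb_ngrams("The blue", 2))
    print(char_wb_ngrams("The blue", 3))
    print(inner_ngrams("The blue", 2))
```

With trigrams, a space shows up only as the first or last character of an n-gram, and nothing like 'e b' (spanning two words) is produced. Since CountVectorizer's analyzer parameter also accepts a callable, passing something like analyzer=lambda doc: inner_ngrams(doc, 2) should count space-free bigrams, though that has not been checked here against a particular scikit-learn version.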

joeln answered Sep 21 '22 21:09