Does TfidfVectorizer identify n-grams using python regular expressions?
This question arises from reading the scikit-learn documentation for TfidfVectorizer: the pattern used to recognize n-grams at the word level is token_pattern=u'(?u)\b\w\w+\b'. I am having trouble seeing how this works. Consider the bigram case. If I do:
In [1]: import re
In [2]: re.findall(u'(?u)\b\w\w+\b',u'this is a sentence! this is another one.')
Out[2]: []
I do not find any bigrams. Whereas:
In [2]: re.findall(u'(?u)\w+ \w*',u'this is a sentence! this is another one.')
Out[2]: [u'this is', u'a sentence', u'this is', u'another one']
finds some (but not all: u'is a' and every other overlapping bigram is missing). What am I misunderstanding about how the \b character works?
Note:
According to the documentation for the re module, \b is supposed to:
\b Matches the empty string, but only at the beginning or end of a word. A word is defined as a sequence of alphanumeric or underscore characters, so the end of a word is indicated by whitespace or a non-alphanumeric, non-underscore character.
I see questions addressing the issue of identifying n-grams in python (see 1,2), so a secondary question is: should I do this and add joined n-grams before feeding my text to TfidfVectorizer?
You should prefix the pattern with r to make it a raw string literal. The following works:
>>> re.findall(r'(?u)\b\w\w+\b',u'this is a sentence! this is another one.')
[u'this', u'is', u'sentence', u'this', u'is', u'another', u'one']
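Without the r prefix, Python interprets \b as a backspace character (\x08) before the regex engine ever sees it, so the pattern searches for a literal backspace and matches nothing. A quick check:

```python
import re

# In a non-raw string, '\b' is the backspace control character,
# not the regex word-boundary assertion.
print(u'\b' == '\x08')  # True

# The regex engine is handed a pattern containing literal backspaces,
# so nothing in ordinary text matches.
print(re.findall(u'(?u)\b\w\w+\b', u'this is a sentence! this is another one.'))
# []

# With a raw string, '\b' reaches the regex engine intact as a word boundary.
print(re.findall(r'(?u)\b\w\w+\b', u'this is a sentence! this is another one.'))
# ['this', 'is', 'sentence', 'this', 'is', 'another', 'one']
```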
This is a known bug in the documentation, but if you look at the source code they do use raw literals.
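As for the secondary question: there is no need to join n-grams yourself before feeding text to TfidfVectorizer. token_pattern only extracts single tokens; the vectorizer then builds n-grams by joining runs of adjacent tokens according to its ngram_range parameter. A rough sketch of that two-step process (a simplified stand-in for what the vectorizer does internally, not its actual code):

```python
import re

def word_ngrams(text, ngram_range=(1, 2), token_pattern=r'(?u)\b\w\w+\b'):
    """Tokenize first, then join adjacent tokens into n-grams,
    mimicking TfidfVectorizer's word-level analyzer."""
    tokens = re.findall(token_pattern, text)
    ngrams = []
    for n in range(ngram_range[0], ngram_range[1] + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(' '.join(tokens[i:i + n]))
    return ngrams

print(word_ngrams('this is a sentence!'))
# ['this', 'is', 'sentence', 'this is', 'is sentence']
```

Note that, just as with the real vectorizer, 'a' is dropped by the token pattern (it requires two or more word characters), and every overlapping bigram of the surviving tokens is produced.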