I am using CountVectorizer and don't want to separate hyphenated words into different tokens. I have tried passing different regex patterns into the token_pattern argument, but haven't been able to get the desired result.
Here's what I have tried:
pattern = r''' (?x) # set flag to allow verbose regexps
([A-Z]\.)+ # abbreviations (e.g. U.S.A.)
| \w+(-\w+)* # words with optional internal hyphens
| \$?\d+(\.\d+)?%? # currency & percentages
| \.\.\. # ellipses '''
text = 'I hate traffic-ridden streets.'
vectorizer = CountVectorizer(stop_words='english',token_pattern=pattern)
analyze = vectorizer.build_analyzer()
analyze(text)
I have also tried to use nltk's regexp_tokenize as suggested in an earlier question, but its behaviour seems to have changed as well.
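For reference, the nltk attempt was along these lines (a sketch only, reusing the pattern and text defined above; the exact call in the earlier question may have differed):
from nltk.tokenize import regexp_tokenize
regexp_tokenize(text, pattern)
(Note that with findall-based tokenizers, capturing groups such as ([A-Z]\.)+ can make the tokenizer return group contents rather than whole matches, which may account for part of the difference.)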
There are a couple of things to note. The first is that adding all of those spaces, line breaks, and comments into your pattern string makes all of those characters part of your regular expression. See here:
>>> import re
>>> re.match("[0-9]","3")
<_sre.SRE_Match object at 0x104caa920>
>>> re.match("[0-9] #a","3")
>>> re.match("[0-9] #a","3 #a")
<_sre.SRE_Match object at 0x104caa718>
The second is that you need to escape special sequences when constructing your regex pattern within an ordinary string. For example, pattern = "\w" really needs to be pattern = "\\w" (or you can use a raw string, r"\w"). Once you account for those things, you should be able to write the regex for your desired tokenizer. For example, if you just wanted to add in hyphens, something like this will work:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> pattern = "(?u)\\b[\\w-]+\\b"
>>>
>>> text = 'I hate traffic-ridden streets.'
>>> vectorizer = CountVectorizer(stop_words='english',token_pattern=pattern)
>>> analyze = vectorizer.build_analyzer()
>>> analyze(text)
[u'hate', u'traffic-ridden', u'streets']
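As a quick usage check (a sketch, not part of the original answer), the same token_pattern carries through when the vectorizer is actually fitted; the learned vocabulary keeps the hyphenated token intact:
>>> X = vectorizer.fit_transform([text])
>>> sorted(vectorizer.vocabulary_)
[u'hate', u'streets', u'traffic-ridden']
>>> X.toarray()
array([[1, 1, 1]])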