scikit-learn: don't separate hyphenated words during tokenization

I am using the CountVectorizer and don't want to separate hyphenated words into different tokens. I have tried passing different regex patterns into the token_pattern argument, but haven't been able to get the desired result.

Here's what I have tried:

pattern = r''' (?x)         # set flag to allow verbose regexps 
([A-Z]\.)+          # abbreviations (e.g. U.S.A.)
| \w+(-\w+)*        # words with optional internal hyphens
| \$?\d+(\.\d+)?%?  # currency & percentages
| \.\.\.            # ellipses '''

text = 'I hate traffic-ridden streets.'
vectorizer = CountVectorizer(stop_words='english',token_pattern=pattern)
analyze = vectorizer.build_analyzer()
analyze(text)

I have also tried to use nltk's regexp_tokenize as suggested in an earlier question, but its behaviour seems to have changed as well.
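One way to wire regexp_tokenize in is through CountVectorizer's tokenizer argument, which bypasses token_pattern entirely. A minimal sketch of that route (the simplified pattern here is an illustrative assumption, not the one from the earlier question):

from sklearn.feature_extraction.text import CountVectorizer
from nltk.tokenize import regexp_tokenize

# assumed simplified pattern: words with optional internal hyphens;
# the group is non-capturing, since regexp_tokenize also relies on re.findall
hyphen_pattern = r'\w+(?:-\w+)*'

vectorizer = CountVectorizer(
    stop_words='english',
    tokenizer=lambda doc: regexp_tokenize(doc, hyphen_pattern),
)
analyze = vectorizer.build_analyzer()
analyze('I hate traffic-ridden streets.')
# goal: ['hate', 'traffic-ridden', 'streets']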

Asked Jun 30 '16 by Ankesh Anand



1 Answer

There are a couple of things to note. The first is that the spaces, line breaks, and comments you put into your pattern string all become part of the regular expression. See here:

>>> import re
>>> re.match("[0-9]", "3")
<_sre.SRE_Match object at 0x104caa920>
>>> re.match("[0-9] #a", "3")
>>> re.match("[0-9] #a", "3 #a")
<_sre.SRE_Match object at 0x104caa718>
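For completeness, the (?x) flag at the start of the original pattern is meant to switch on verbose mode, in which the engine ignores unescaped whitespace and # comments; a quick check (match object address elided):

>>> re.match("(?x)[0-9] #a", "3")
<_sre.SRE_Match object at 0x...>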

The second is that you need to escape special sequences when constructing your regex pattern in an ordinary string literal. For example, pattern = "\w" really needs to be pattern = "\\w" (or a raw string, r"\w"). Once you account for those things, you should be able to write the regex for your desired tokenizer. For example, if you just want to add in hyphens, something like this will work:

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> pattern = "(?u)\\b[\\w-]+\\b"
>>> 
>>> text = 'I hate traffic-ridden streets.'
>>> vectorizer = CountVectorizer(stop_words='english',token_pattern=pattern)
>>> analyze = vectorizer.build_analyzer()
>>> analyze(text)
[u'hate', u'traffic-ridden', u'streets']
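To confirm the token survives the full vectorization step, a quick end-to-end check on the same single document (feature names are listed sorted, since CountVectorizer orders its vocabulary alphabetically):

>>> X = vectorizer.fit_transform([text])
>>> vectorizer.get_feature_names()
[u'hate', u'streets', u'traffic-ridden']
>>> X.toarray()
array([[1, 1, 1]])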
Answered Oct 24 '22 by David