I am using CountVectorizer and don't want to separate hyphenated words into different tokens. I have tried passing different regex patterns into the token_pattern argument, but haven't been able to get the desired result.
Here's what I have tried:
pattern = r''' (?x) # set flag to allow verbose regexps
([A-Z]\.)+ # abbreviations (e.g. U.S.A.)
| \w+(-\w+)* # words with optional internal hyphens
| \$?\d+(\.\d+)?%? # currency & percentages
| \.\.\. # ellipses '''
text = 'I hate traffic-ridden streets.'
vectorizer = CountVectorizer(stop_words='english',token_pattern=pattern)
analyze = vectorizer.build_analyzer()
analyze(text)
I have also tried to use nltk's regexp_tokenize as suggested in an earlier question, but its behaviour seems to have changed as well.
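For reference, the nltk attempt was along these lines (a sketch only, reusing the pattern and text defined above; the exact call in the earlier question may have differed):
from nltk.tokenize import regexp_tokenize
regexp_tokenize(text, pattern)
(Note that with findall-based tokenizers, capturing groups such as ([A-Z]\.)+ can make the tokenizer return group contents rather than whole matches, which may account for part of the difference.)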
There are a couple of things to note. The first is that adding all of those spaces, line breaks, and comments into your pattern string makes all of those characters part of your regular expression. See here:
>>> import re
>>> re.match("[0-9]","3")
<_sre.SRE_Match object at 0x104caa920>
>>> re.match("[0-9] #a","3")
>>> re.match("[0-9] #a","3 #a")
<_sre.SRE_Match object at 0x104caa718>
The second is that you need to escape special sequences when constructing your regex pattern within an ordinary string. For example, pattern = "\w" really needs to be pattern = "\\w" (or you can use a raw string, r"\w"). Once you account for those things, you should be able to write the regex for your desired tokenizer. For example, if you just wanted to add in hyphens, something like this will work:
>>> from sklearn.feature_extraction.text import CountVectorizer
>>> pattern = "(?u)\\b[\\w-]+\\b"
>>>
>>> text = 'I hate traffic-ridden streets.'
>>> vectorizer = CountVectorizer(stop_words='english',token_pattern=pattern)
>>> analyze = vectorizer.build_analyzer()
>>> analyze(text)
[u'hate', u'traffic-ridden', u'streets']
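As a quick usage check (a sketch, not part of the original answer), the same token_pattern carries through when the vectorizer is actually fitted; the learned vocabulary keeps the hyphenated token intact:
>>> X = vectorizer.fit_transform([text])
>>> sorted(vectorizer.vocabulary_)
[u'hate', u'streets', u'traffic-ridden']
>>> X.toarray()
array([[1, 1, 1]])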