
How can I prevent TfidfVectorizer from picking up numbers as vocabulary?

I use TfidfVectorizer like this:

from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

stop_words = stopwords.words("english")
vectorizer = TfidfVectorizer(stop_words=stop_words, min_df=200)
xs['train'] = vectorizer.fit_transform(docs['train'])
xs['test'] = vectorizer.transform(docs['test']).toarray()

But when inspecting vectorizer.vocabulary_ I've noticed that it learns pure number features:

[(u'00', 0), (u'000', 1), (u'0000', 2), (u'00000', 3), (u'000000', 4), ...]

I don't want this. How can I prevent it?

asked Aug 07 '17 by Martin Thoma

People also ask

Does TfidfVectorizer remove stop words?

Given how the TF-IDF score is computed, removing the stopwords shouldn't make a significant difference. The whole point of the IDF is precisely to down-weight words that carry no semantic value across the corpus. Even if you leave the stopwords in, the IDF weighting should largely suppress them.
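
A minimal sketch of that effect, assuming a toy three-document corpus: the word that appears in every document gets the lowest possible IDF weight.

from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["the cat sat", "the dog ran", "the bird flew"]
vec = TfidfVectorizer()  # no stop word removal
vec.fit(docs)

# "the" occurs in all three documents, so its smoothed IDF is
# ln((1+3)/(1+3)) + 1 = 1.0, the minimum; every other word gets
# ln((1+3)/(1+1)) + 1 ~= 1.69.
for word, idx in sorted(vec.vocabulary_.items()):
    print(word, round(vec.idf_[idx], 3))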

What is the difference between CountVectorizer and TfidfVectorizer?

With TfidfTransformer you first compute the word counts using CountVectorizer, then compute the Inverse Document Frequency (IDF) values, and only then compute the TF-IDF scores. With TfidfVectorizer, by contrast, you do all three steps at once.

What is the difference between TfidfVectorizer and TfidfTransformer?

The main difference between the two implementations is that TfidfVectorizer computes both the term frequencies and the inverse document frequencies for you, while TfidfTransformer requires you to first compute the term frequencies with scikit-learn's CountVectorizer class.
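
A minimal sketch, assuming a toy two-document corpus, showing that the two routes produce the same matrix:

from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)
import numpy as np

docs = ["the cat sat on the mat", "the dog sat on the log"]

# Two steps: raw term counts first, then IDF weighting.
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

# One step: tokenization, counting and IDF weighting at once.
one_step = TfidfVectorizer().fit_transform(docs)

print(np.allclose(two_step.toarray(), one_step.toarray()))  # True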

Does TfidfVectorizer do Stemming?

No, not by itself. Instead, you pass TfidfVectorizer your own function that performs the custom tokenization and stemming, while using scikit-learn's built-in stop word removal rather than NLTK's.
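
A minimal sketch of that setup, assuming NLTK is installed (the stemming_tokenizer name is just illustrative):

import re
from nltk.stem import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

stemmer = PorterStemmer()

def stemming_tokenizer(text):
    # Reproduce the default token shape, then stem each token.
    tokens = re.findall(r'(?u)\b\w\w+\b', text.lower())
    return [stemmer.stem(t) for t in tokens]

# TfidfVectorizer itself never stems; the custom tokenizer does.
# Recent scikit-learn versions may warn that the stop list can be
# inconsistent with stemmed tokens, but it still runs.
vectorizer = TfidfVectorizer(tokenizer=stemming_tokenizer,
                             stop_words='english')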


1 Answer

You can set token_pattern when initializing the vectorizer. The default is r'(?u)\b\w\w+\b' (the (?u) part just turns on the re.UNICODE flag; note the raw-string prefix, since in a plain string '\b' is a backspace character, not a word boundary). You can fiddle with that until you get what you need.

Something like:

vectorizer = TfidfVectorizer(stop_words=stop_words,
                             min_df=200,
                             token_pattern=r'(?u)\b\w*[a-zA-Z]\w*\b')
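
The \w*[a-zA-Z]\w* core requires at least one alphabetic character in every token, so purely numeric tokens such as 000 no longer match.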

Another option (if the fact that numbers appear in your samples matters) is to mask all the numbers before vectorizing.

import re

re.sub(r'\b[0-9][0-9.,-]*\b', 'NUMBER-SPECIAL-TOKEN', sample)

This way numbers will hit the same spot in your vectorizer's vocabulary and you won't completely ignore them either.
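
A minimal sketch of that masking step (the sample string is made up); one tweak worth noting: the default token pattern treats '-' as a separator, so joining the mask with underscores keeps it a single vocabulary entry.

import re

def mask_numbers(text):
    # Collapse every number-like token into one shared mask token.
    return re.sub(r'\b[0-9][0-9.,-]*\b', 'NUMBER_SPECIAL_TOKEN', text)

print(mask_numbers("error 404 occurred 12,000 times"))
# error NUMBER_SPECIAL_TOKEN occurred NUMBER_SPECIAL_TOKEN times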

answered Oct 03 '22 by Iulius Curt