
SKLearn TF-IDF to drop numbers?

I'm doing text analysis, and I want to disregard 'words' that are just numbers. For example, from the text "This is 000 Sparta!" only the words 'this', 'is' and 'Sparta' should be used. Is there a way to do this, and if so, how?

lte__ asked Dec 18 '22 04:12


1 Answer

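The default token pattern for TfidfVectorizer is u'(?u)\\b\\w\\w+\\b', which matches any run of two or more word characters, i.e., letters, digits, or the underscore (with the (?u) flag this covers Unicode letters and digits, not only [a-zA-Z0-9_]). That is why a token such as 000 is kept. You can change token_pattern to suit your needs. For instance, the regex (?ui)\b\w*[a-z]+\w*\b still matches a whole word, but only if the word contains at least one letter: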

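from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only tokens that contain at least one letter, so pure numbers are dropped.
tf = TfidfVectorizer(token_pattern=r'(?ui)\b\w*[a-z]+\w*\b')

text = ["This is 000 Sparta!"]
tfidf_matrix = tf.fit_transform(text)
# get_feature_names() was removed in scikit-learn 1.2; on versions
# older than 1.0, call tf.get_feature_names() instead.
feature_names = list(tf.get_feature_names_out())

print(feature_names)
['is', 'sparta', 'this']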
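
To check what each pattern keeps without installing scikit-learn, the two regexes can be compared directly with Python's re module. This is only a rough sketch: the tokenize helper and the sample sentence are made up for illustration, and lowercasing followed by findall only approximates the vectorizer's default preprocessing.

```python
import re

# Default TfidfVectorizer token pattern: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern from the answer: a word containing at least one letter a-z.
LETTER_PATTERN = r"(?ui)\b\w*[a-z]+\w*\b"


def tokenize(text, pattern):
    """Hypothetical helper: lowercase the text, then return all regex matches."""
    return re.findall(pattern, text.lower())


# Made-up sample sentence with pure numbers, a mixed token, and a one-letter word.
text = "This is 000 Sparta! Take a 42b bus in 2019"

default_tokens = tokenize(text, DEFAULT_PATTERN)
letter_tokens = tokenize(text, LETTER_PATTERN)

print(default_tokens)
# ['this', 'is', '000', 'sparta', 'take', '42b', 'bus', 'in', '2019']
print(letter_tokens)
# ['this', 'is', 'sparta', 'take', 'a', '42b', 'bus', 'in']
```

Two side effects are visible here: mixed tokens such as 42b survive because they contain a letter, and one-letter words such as a are now kept, whereas the default pattern requires at least two characters. If single-character tokens are unwanted, the pattern would need a further length constraint.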

Psidom answered Dec 26 '22 12:12