SKLearn TF-IDF to drop numbers?

Question

I'm doing text analysis, and I want to disregard 'words' that are just numbers. E.g., from the text "This is 000 Sparta!" only the words 'this', 'is' and 'Sparta' should be used. Is there a way to do this, and if so, how?

Psidom · Accepted Answer

The default token pattern for TfidfVectorizer is (?u)\b\w\w+\b, which matches any token of at least two word characters (letters, digits, or underscore; in ASCII terms, [a-zA-Z0-9_]), so an all-digit token such as 000 is kept. You can change token_pattern to suit your needs. For instance, the regex (?ui)\b\w*[a-z]+\w*\b matches a word only if it contains at least one letter:

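from sklearn.feature_extraction.text import TfidfVectorizer

# Raw string, so \b reaches the regex engine as a word boundary rather than a backspace.
tf = TfidfVectorizer(token_pattern=r'(?ui)\b\w*[a-z]+\w*\b')

text = ["This is 000 Sparta!"]
tfidf_matrix = tf.fit_transform(text)
# get_feature_names() was removed in scikit-learn 1.2; older versions need that name instead.
feature_names = tf.get_feature_names_out()

print(feature_names.tolist())
['is', 'sparta', 'this']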
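
To see what the two patterns do without involving scikit-learn, here is a minimal sketch using only the standard re module. The tokenize helper and the constant names are hypothetical, introduced just for this illustration; the lowercasing step only approximates the vectorizer's default lowercase=True behaviour.

```python
import re

# Default TfidfVectorizer pattern: any run of two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Modified pattern from the answer: the token must contain at least one letter.
LETTER_PATTERN = r"(?ui)\b\w*[a-z]+\w*\b"


def tokenize(pattern, text):
    """Hypothetical helper: lowercase the text, then return every regex match."""
    return re.findall(pattern, text.lower())


sample = "This is 000 Sparta!"
default_tokens = tokenize(DEFAULT_PATTERN, sample)
letter_tokens = tokenize(LETTER_PATTERN, sample)

print(default_tokens)  # ['this', 'is', '000', 'sparta']
print(letter_tokens)   # ['this', 'is', 'sparta']
```

Note that the modified pattern no longer requires two characters, so single-letter words such as "a" become tokens, and mixed tokens such as "2nd" are kept because they contain a letter.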
