I'm doing text analysis, and I want to disregard 'words' that are just numbers. For example, from the text "This is 000 Sparta!" only the words 'this', 'is' and 'Sparta' should be used. Is there a way to do this, and if so, how?
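The default token pattern for TfidfVectorizer is u'(?u)\\b\\w\\w+\\b', which matches any token of at least two word characters. Word characters (\w) include digits and the underscore as well as letters (with the (?u) flag this covers Unicode letters and digits, not only [a-zA-Z0-9_]), so a purely numeric string such as 000 counts as a token. You can change token_pattern to suit your needs. For instance, u'(?ui)\\b\\w*[a-z]+\\w*\\b' still matches a whole word but requires it to contain at least one letter. Note that, unlike the default, it also accepts single-letter words: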
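from sklearn.feature_extraction.text import TfidfVectorizer

# Keep only tokens that contain at least one ASCII letter (drops pure numbers like "000").
tf = TfidfVectorizer(token_pattern=u'(?ui)\\b\\w*[a-z]+\\w*\\b')
text = ["This is 000 Sparta!"]
tfidf_matrix = tf.fit_transform(text)
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead.
feature_names = tf.get_feature_names_out().tolist()
print(feature_names)
['is', 'sparta', 'this']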
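To see what each pattern keeps without fitting a vectorizer, you can try both regexes directly with Python's re module. This is only a rough sketch: the tokenize helper and the sample sentence are made up for illustration, and it only approximates what the vectorizer does (lowercase the text, then collect every match of token_pattern).

```python
import re

# scikit-learn's default token pattern: two or more word characters.
DEFAULT_PATTERN = r"(?u)\b\w\w+\b"
# Pattern from the answer: a word containing at least one ASCII letter.
LETTER_PATTERN = r"(?ui)\b\w*[a-z]+\w*\b"


def tokenize(text, pattern):
    """Hypothetical helper: lowercase the text and return every match of the pattern."""
    return re.findall(pattern, text.lower())


sample = "This is 000 Sparta! A 4x4 grid, 2021."
default_tokens = tokenize(sample, DEFAULT_PATTERN)
letter_tokens = tokenize(sample, LETTER_PATTERN)

print(default_tokens)  # ['this', 'is', '000', 'sparta', '4x4', 'grid', '2021']
print(letter_tokens)   # ['this', 'is', 'sparta', 'a', '4x4', 'grid']
```

Mixed tokens such as 4x4 survive because they contain a letter, and the single-letter word a is now kept as well. If you want to keep the default two-character minimum, you would need to tighten the pattern further.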