I have a list of tokenized sentences and would like to fit a tfidf Vectorizer. I tried the following:
tokenized_list_of_sentences = [['this', 'is', 'one'], ['this', 'is', 'another']]
def identity_tokenizer(text):
return text
tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english')
tfidf.fit_transform(tokenized_list_of_sentences)
which errors out as
AttributeError: 'list' object has no attribute 'lower'
is there a way to do this? I have a billion sentences and do not want to tokenize them again. They are tokenized before for another stage before this.
Try initializing the TfidfVectorizer
object with the parameter lowercase=False
(assuming this is actually desired as you've lowercased your tokens in previous stages).
tokenized_list_of_sentences = [['this', 'is', 'one', 'basketball'], ['this', 'is', 'a', 'football']]
def identity_tokenizer(text):
return text
tfidf = TfidfVectorizer(tokenizer=identity_tokenizer, stop_words='english', lowercase=False)
tfidf.fit_transform(tokenized_list_of_sentences)
Note that I changed the sentences as they apparently only contained stop words which caused another error due to an empty vocabulary.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With