Using counts and tfidf as features with scikit learn

I'm trying to use both counts and tfidf as features for a multinomial NB model. Here's my code:

text = ["this is spam", "this isn't spam"]
labels = [0,1]
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)

tf_transformer = TfidfTransformer(use_idf=True)
combined_features = FeatureUnion([("counts", self.count_vectorizer), ("tfidf", tf_transformer)]).fit(self.text)

classifier = MultinomialNB()
classifier.fit(combined_features, labels)

But I'm getting an error with FeatureUnion and tfidf:

TypeError: no supported conversion for types: (dtype('S18413'),)

Any idea why this could be happening? Is it not possible to have both counts and tfidf as features?

Aloke Desai asked Dec 02 '14 23:12



1 Answer

The error didn't come from the FeatureUnion; it came from the TfidfTransformer.

Use TfidfVectorizer instead of TfidfTransformer. The transformer expects a matrix of term counts (such as the output of CountVectorizer) as input, not raw text, hence the TypeError.
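To make the difference concrete, here is a minimal sketch of the two classes side by side. The four-document corpus is made up for illustration, and both classes are left at their default settings, which is an assumption and not what the question uses.

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Made-up corpus, for illustration only.
docs = [
    "cheap pills buy now",
    "meeting notes for monday",
    "buy cheap watches now",
    "monday lunch with the team",
]

# TfidfTransformer works on a term-count matrix, so the counts come first.
counts = CountVectorizer().fit_transform(docs)
tfidf_from_counts = TfidfTransformer().fit_transform(counts)

# TfidfVectorizer accepts the raw strings and does both steps itself.
tfidf_from_text = TfidfVectorizer().fit_transform(docs)

# With default settings the two routes should produce the same matrix.
same = np.allclose(tfidf_from_counts.toarray(), tfidf_from_text.toarray())
print(same)
```

So TfidfTransformer is still usable, as long as it is fed counts and not strings.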

Your test corpus is also too small to exercise this: with only two documents, min_df=3 can never be satisfied, so CountVectorizer has no terms to keep. Try a larger corpus; here's an example:

from nltk.corpus import brown  # needs the Brown corpus: nltk.download("brown")

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.naive_bayes import MultinomialNB

# Get more text from NLTK: the first 100 sentences of the Brown corpus.
text = [" ".join(sent) for sent in brown.sents()[:100]]
# The labels are arbitrary; they are only here so the classifier can be fit.
labels = ['yes']*50 + ['no']*50
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
# TfidfVectorizer takes raw text, unlike TfidfTransformer.
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
# fit_transform (not fit) returns the combined feature matrix.
combined_features = FeatureUnion([("counts", count_vectorizer), ("tfidf", tfidf_vectorizer)]).fit_transform(text)
classifier = MultinomialNB()
classifier.fit(combined_features, labels)
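One thing the snippet above doesn't cover is classifying new text: the fitted FeatureUnion isn't kept in a variable, so there is nothing to transform unseen documents with. A possible way around that, sketched here with a made-up toy corpus and labels (assumptions, not from the question), is to wrap the union and the classifier in a Pipeline:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Made-up training data, for illustration only.
train_text = [
    "cheap pills buy now",
    "buy cheap watches now",
    "win money now",
    "cheap money offer",
    "meeting notes for monday",
    "monday lunch with the team",
    "project notes for the team",
    "lunch meeting on monday",
]
train_labels = ["spam"] * 4 + ["ham"] * 4

# Counts and tf-idf side by side, as in the answer.
features = FeatureUnion([
    ("counts", CountVectorizer()),
    ("tfidf", TfidfVectorizer()),
])

# The pipeline keeps the fitted union, so new text goes through
# the same feature extraction before it reaches the classifier.
model = Pipeline([("features", features), ("clf", MultinomialNB())])
model.fit(train_text, train_labels)

predictions = model.predict(["buy cheap pills", "team meeting on monday"])
n_features = model.named_steps["features"].transform(["buy cheap pills"]).shape[1]
print(predictions, n_features)
```

The combined matrix has one block of columns per vectorizer, so with identical tokenization it ends up twice as wide as the vocabulary.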
alvas answered Sep 22 '22 21:09