Using counts and tfidf as features with scikit learn

I'm trying to use both counts and tfidf as features for a multinomial NB model. Here's my code:

text = ["this is spam", "this isn't spam"]
labels = [0,1]
count_vectorizer = CountVectorizer(stop_words="english", min_df=3)

tf_transformer = TfidfTransformer(use_idf=True)
combined_features = FeatureUnion([("counts", self.count_vectorizer), ("tfidf", tf_transformer)]).fit(self.text)

classifier = MultinomialNB()
classifier.fit(combined_features, labels)

But I'm getting an error with FeatureUnion and tfidf:

TypeError: no supported conversion for types: (dtype('S18413'),)

Any idea why this could be happening? Is it not possible to have both counts and tfidf as features?

asked Dec 02 '14 23:12

Aloke Desai

1 Answer

The error didn't come from the FeatureUnion; it came from the TfidfTransformer.

You should use TfidfVectorizer instead of TfidfTransformer. The transformer expects a matrix of term counts (such as the output of CountVectorizer) as input, not raw text, hence the TypeError.
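To make the difference concrete, here is a rough pure-Python sketch of the two steps. It assumes scikit-learn's default settings for the tf-idf step (smoothed idf, L2 normalisation) and uses a plain whitespace split in place of CountVectorizer's real tokenizer; the three documents are made up for illustration.

```python
import math
from collections import Counter

# Made-up toy documents, for illustration only.
docs = ["spam spam offer", "meeting offer today", "spam today"]

# Step 1 (roughly what CountVectorizer does): raw text -> term counts.
# A whitespace split stands in for the real tokenizer here.
vocab = sorted({tok for doc in docs for tok in doc.split()})
counts = [[Counter(doc.split())[term] for term in vocab] for doc in docs]

# Step 2 (roughly what TfidfTransformer does): term counts -> tf-idf weights.
# Assumes the smoothed idf, ln((1 + n) / (1 + df)) + 1, followed by L2 normalisation.
n_docs = len(docs)
df = [sum(1 for row in counts if row[j] > 0) for j in range(len(vocab))]
idf = [math.log((1 + n_docs) / (1 + d)) + 1 for d in df]


def l2_normalize(row):
    """Scale a row to unit Euclidean length (all-zero rows are returned as-is)."""
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row] if norm else row


tfidf = [l2_normalize([c * w for c, w in zip(row, idf)]) for row in counts]

for doc, row in zip(docs, tfidf):
    print(doc, [round(v, 3) for v in row])
```

Step 2 only ever sees numbers, which is why handing it strings fails. TfidfVectorizer simply performs both steps in one object.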

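Also, your test data is too small: with min_df=3, no term can appear in three documents when there are only two, so try a bigger corpus. Note as well that FeatureUnion(...).fit(text) returns the fitted union rather than a feature matrix; use fit_transform to get the features to pass to the classifier. Here's an example: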

import nltk
from nltk.corpus import brown

from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion
from sklearn.naive_bayes import MultinomialNB

# The Brown corpus must be downloaded once before brown.sents() works.
nltk.download("brown")

# Let's get more text from NLTK: the first 100 sentences, joined back into strings.
text = [" ".join(sent) for sent in brown.sents()[:100]]
# The labels are arbitrary placeholders, just so the classifier has something to fit.
labels = ['yes']*50 + ['no']*50

count_vectorizer = CountVectorizer(stop_words="english", min_df=3)
# TfidfVectorizer takes raw text, unlike TfidfTransformer.
tfidf_vectorizer = TfidfVectorizer(use_idf=True)
combined_features = FeatureUnion([("counts", count_vectorizer), ("tfidf", tfidf_vectorizer)]).fit_transform(text)

classifier = MultinomialNB()
classifier.fit(combined_features, labels)
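If you also want to classify new text later, one option is to wrap the union and the classifier in a single Pipeline, so unseen documents go through the same fitted vectorizers. This is only a sketch: the corpus, the labels, and the step names ("features", "classifier") are invented for illustration, and the vectorizers use default settings rather than the ones from the question.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline

# Toy corpus, made up for illustration only.
train_text = [
    "win a free prize now",
    "free offer claim your prize",
    "cheap pills free shipping",
    "meeting moved to monday",
    "please review the attached report",
    "lunch on monday with the team",
]
train_labels = ["spam", "spam", "spam", "ham", "ham", "ham"]

# Counts and tf-idf are computed side by side and stacked column-wise,
# then fed straight into the classifier.
model = Pipeline([
    ("features", FeatureUnion([
        ("counts", CountVectorizer()),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("classifier", MultinomialNB()),
])
model.fit(train_text, train_labels)

# The fitted union can be reused on its own to inspect the feature matrix.
X = model.named_steps["features"].transform(train_text)
print(X.shape)

# New text is vectorized with the vocabulary learned during fit.
predictions = model.predict(["claim your free prize", "report for the monday meeting"])
print(list(predictions))
```

Both feature blocks are non-negative, which MultinomialNB requires, so stacking them is fine mechanically. Whether the duplicated information actually helps the model is something you'd have to check on your own data.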

answered Sep 22 '22 21:09

alvas

Tags: python, numpy, nlp, scikit-learn, ml
