Using Categorical Features along with Text for classification

Question

I'm trying to classify movies into two arbitrary classes. I am given a plot synopsis of each movie along with its genre. I use TfidfVectorizer to convert the synopsis into features, but I also need to use the genre of the movie as a separate feature.

I am currently just appending the genre to the text of the synopsis and feeding it to the classifier.

The problem is that these two features are of different kinds. While the words are converted to a tf-idf matrix, I feel the genre should be treated differently and not just as any other word. Is there any way I could accomplish this?

Ibraim Ganiev · Accepted Answer

You should use DictVectorizer. For each possible value of a categorical feature (here, each genre) it creates a new binary feature, and it sets that feature to 1 only when the movie belongs to that genre.

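from sklearn.feature_extraction import DictVectorizer

# sparse=False returns a dense NumPy array rather than a SciPy sparse matrix
v = DictVectorizer(sparse=False)
D = [{'genre': 'action'}, {'genre': 'drama'}, {'genre': 'comedy'}, {'genre': 'drama'}]
v.fit_transform(D)   # one binary column per distinct genre value
v.feature_names_     # column names, in the same order as the columns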

results in:

array([[ 1.,  0.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

['genre=action', 'genre=comedy', 'genre=drama']

You can also use FeatureUnion to concatenate the features from TfidfVectorizer and DictVectorizer into a single feature matrix.
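As a rough sketch of how that combination could look: FeatureUnion passes the same input to every branch, so each branch below first picks out the field it needs. The toy movies, the helper functions get_synopses and get_genres, and the choice of LogisticRegression are assumptions made for illustration, not part of the original answer.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Hypothetical toy data: one dict per movie.
movies = [
    {"synopsis": "A retired cop chases a bomber across the city", "genre": "action"},
    {"synopsis": "A family falls apart after a long illness", "genre": "drama"},
    {"synopsis": "Two roommates fake a wedding for a free cake", "genre": "comedy"},
    {"synopsis": "A widow rebuilds her life in a small town", "genre": "drama"},
]
labels = [1, 0, 1, 0]  # the two arbitrary classes


def get_synopses(rows):
    """Hypothetical helper: extract the raw text for TfidfVectorizer."""
    return [row["synopsis"] for row in rows]


def get_genres(rows):
    """Hypothetical helper: extract genre dicts for DictVectorizer."""
    return [{"genre": row["genre"]} for row in rows]


# Each branch selects its own field, then vectorizes it.
features = FeatureUnion([
    ("text", Pipeline([
        ("select", FunctionTransformer(get_synopses)),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("genre", Pipeline([
        ("select", FunctionTransformer(get_genres)),
        ("onehot", DictVectorizer()),
    ])),
])

# The classifier sees tf-idf columns followed by the one-hot genre columns.
model = Pipeline([("features", features), ("clf", LogisticRegression())])
model.fit(movies, labels)

X = model.named_steps["features"].transform(movies)  # combined sparse matrix
predictions = model.predict(movies)
```

With this layout the genre occupies its own columns instead of competing with the synopsis vocabulary, and the whole thing can still be cross-validated as a single estimator.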

ldirer · Answer

It's hard to find a clean way to include the categorical feature.

Appending the genre to the synopsis is indeed one way to proceed. You could append it multiple times if you want to give it more weight (e.g. if you're using bag of words).
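To make the "append it multiple times" idea concrete, here is a small sketch using CountVectorizer as the bag-of-words model. The append_genre helper, the genre_ prefix, and the weight of 3 are all assumptions for illustration; the prefix is only there so the genre "drama" does not get merged with the ordinary word "drama" in a synopsis.

```python
from sklearn.feature_extraction.text import CountVectorizer


def append_genre(synopsis, genre, weight=3):
    """Hypothetical helper: append a prefixed genre token `weight` times."""
    genre_token = f"genre_{genre}"
    return synopsis + " " + " ".join([genre_token] * weight)


docs = [
    append_genre("A retired cop chases a bomber", "action"),
    append_genre("A family drama about a long illness", "drama"),
]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)
vocab = vectorizer.vocabulary_

# The genre token is counted once per repetition, the plain word only once.
action_count = counts[0, vocab["genre_action"]]
drama_genre_count = counts[1, vocab["genre_drama"]]
drama_word_count = counts[1, vocab["drama"]]
```

The repetition count acts as a crude weight on the genre, so it is something to tune rather than a principled setting.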

Another technique is to train two different classifiers, one on your text data and one on your regular features. You can then ensemble the results (for instance by averaging the predicted probabilities). If you have only one categorical feature, you could also just use it to infer a prior over the final classes.
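A minimal sketch of the two-classifier idea, assuming toy data and LogisticRegression for both models (neither is specified in the answer), with an unweighted average of the predicted probabilities:

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data.
synopses = [
    "A retired cop chases a bomber across the city",
    "A family falls apart after a long illness",
    "Two roommates fake a wedding for a free cake",
    "A widow rebuilds her life in a small town",
]
genre_dicts = [{"genre": g} for g in ["action", "drama", "comedy", "drama"]]
labels = [1, 0, 1, 0]

# One classifier sees only the text, the other only the genre.
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
text_clf.fit(synopses, labels)

genre_clf = make_pipeline(DictVectorizer(), LogisticRegression())
genre_clf.fit(genre_dicts, labels)

# Both models were fit on the same labels, so their class columns line up.
avg_proba = (text_clf.predict_proba(synopses)
             + genre_clf.predict_proba(genre_dicts)) / 2
predictions = text_clf.classes_[np.argmax(avg_proba, axis=1)]
```

A weighted average, or a second-stage model trained on the two probability outputs, would be natural variations if one source turns out to be more reliable than the other.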

Hope this helps.

Tags: classification, scikit-learn

Airmine
