
Using Categorical Features along with Text for classification

I'm trying to classify movies into two arbitrary classes. I am given a plot synopsis of each movie along with its genre. I use TfidfVectorizer to convert the synopsis into features, but I also need to use the genre of the movie as a separate feature.

I am currently just appending the genre to the text of the synopsis and feeding it to the classifier.

The problem is that these two features are of different kinds. While the words are converted to a tf-idf matrix, I feel the genre should be treated differently and not just as any other word. Is there any way I could accomplish this?

Airmine asked Oct 20 '22 02:10



2 Answers

You should use DictVectorizer. For each possible value of a categorical feature (here, genre) it creates a new binary feature, and sets that feature to 1 only when the movie belongs to that genre.

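from sklearn.feature_extraction import DictVectorizer

# One dict per movie; each distinct genre value becomes its own binary column
D = [{'genre':'action'}, {'genre':'drama'}, {'genre':'comedy'}, {'genre':'drama'}]
v = DictVectorizer(sparse=False)  # sparse=False returns a dense NumPy array
v.fit_transform(D)                # one-hot matrix, one row per movie
v.feature_names_                  # column names, in column order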

results in:

array([[ 1.,  0.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

['genre=action', 'genre=comedy', 'genre=drama']

You can also use FeatureUnion to concatenate the features from TfidfVectorizer and DictVectorizer into a single feature matrix.
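One thing to keep in mind is that FeatureUnion feeds the same input to every transformer, so each branch needs a step that picks out its own field. Here is a minimal sketch of that idea. The toy movies, the labels, the `FieldSelector` helper and the choice of LogisticRegression are all assumptions made up for illustration, not something from the question.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline


class FieldSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: extract one field from each movie dict."""

    def __init__(self, field, as_dict=False):
        self.field = field
        self.as_dict = as_dict

    def fit(self, X, y=None):
        return self  # stateless, nothing to learn

    def transform(self, X):
        if self.as_dict:
            # DictVectorizer expects a list of dicts
            return [{self.field: row[self.field]} for row in X]
        # TfidfVectorizer expects a list of strings
        return [row[self.field] for row in X]


# Made-up toy data: one dict per movie, with arbitrary class labels
movies = [
    {"synopsis": "a cop chases a bomber across the city", "genre": "action"},
    {"synopsis": "a family falls apart after a loss", "genre": "drama"},
    {"synopsis": "two roommates swap jobs for a week", "genre": "comedy"},
    {"synopsis": "a soldier returns home to fight again", "genre": "action"},
]
labels = [1, 0, 0, 1]

features = FeatureUnion([
    ("text", Pipeline([
        ("select", FieldSelector("synopsis")),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("genre", Pipeline([
        ("select", FieldSelector("genre", as_dict=True)),
        ("onehot", DictVectorizer()),
    ])),
])

model = Pipeline([("features", features), ("clf", LogisticRegression())])
model.fit(movies, labels)

# tf-idf columns come first, followed by one binary column per genre
X = model.named_steps["features"].transform(movies)
print(X.shape)
print(model.predict(movies))
```

The genre columns stay separate from the vocabulary this way, so the classifier can weight "genre=action" independently of the word "action" appearing in a synopsis.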

Ibraim Ganiev answered Oct 22 '22 23:10



It's hard to find a clean way to include the categorical feature.

Appending the genre to the synopsis is indeed one way to proceed. You could append it multiple times if you want to give it more weight (e.g. if you're using bag of words).
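A rough sketch of the repeat-the-genre idea, in case it helps. The `add_genre_tokens` helper, the `genre_` prefix and the sample data are my own assumptions; the prefix is only there so the genre token cannot collide with an ordinary word in the synopsis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def add_genre_tokens(synopsis, genre, repeat=3):
    """Hypothetical helper: append a genre token `repeat` times to the text."""
    # Underscores keep the token in one piece under the default tokenizer
    token = "genre_" + genre.lower().replace(" ", "_")
    return synopsis + " " + " ".join([token] * repeat)


# Made-up (synopsis, genre) pairs
movies = [
    ("a cop chases a bomber across the city", "action"),
    ("a family falls apart after a loss", "drama"),
]

docs = [add_genre_tokens(synopsis, genre) for synopsis, genre in movies]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

print(docs[0])
print("genre_action" in vectorizer.vocabulary_)
```

Raising `repeat` increases the term frequency of the genre token, which is the only lever this approach gives you over how much the genre counts.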

Another technique is to train two different classifiers, one with your text data and one with your regular features. You can then ensemble the results (taking the average of predicted probabilities for instance).
If you have only one categorical feature you could just use it to infer some prior on the final classes.
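The two-classifier approach could look roughly like the sketch below: one model on the synopsis text, one on the genre, with their predicted probabilities averaged. The data, the labels and the use of LogisticRegression for both models are assumptions for illustration only.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Made-up toy data: parallel lists of synopses, genres and class labels
synopses = [
    "a cop chases a bomber across the city",
    "a family falls apart after a loss",
    "two roommates swap jobs for a week",
    "a soldier returns home to fight again",
]
genres = [{"genre": "action"}, {"genre": "drama"},
          {"genre": "comedy"}, {"genre": "action"}]
labels = np.array([1, 0, 0, 1])

# One classifier per kind of feature
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
genre_clf = make_pipeline(DictVectorizer(), LogisticRegression())
text_clf.fit(synopses, labels)
genre_clf.fit(genres, labels)

# Ensemble: unweighted average of the two predicted probability matrices
proba = (text_clf.predict_proba(synopses) + genre_clf.predict_proba(genres)) / 2

# Both models saw the same labels, so their class order is identical
predictions = text_clf.classes_[proba.argmax(axis=1)]
print(proba)
print(predictions)
```

A weighted average would be a natural variation if one of the two models turns out to be more reliable on held-out data.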

Hope this helps.

ldirer answered Oct 22 '22 23:10
