I'm trying to classify movies into two arbitrary classes. I am given a plot synopsis of each movie along with its genre. I use TfidfVectorizer to convert the synopsis into features, but I also need to use the genre of the movie as a separate feature.
I am currently just appending the genre to the text of the synopsis and feeding it to the classifier.
The problem is that these two features are of different kinds. While the words are converted to a tf-idf matrix, I feel the genre should be treated differently and not just as any other word. Is there any way I could accomplish this?
You should use DictVectorizer. For each possible value of a categorical feature (here, each genre), it creates a new binary feature and sets it to 1 only when the movie belongs to that genre.
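from sklearn.feature_extraction import DictVectorizer
# sparse=False returns a dense NumPy array instead of a SciPy sparse matrix
v = DictVectorizer(sparse=False)
D = [{'genre': 'action'}, {'genre': 'drama'}, {'genre': 'comedy'}, {'genre': 'drama'}]
v.fit_transform(D)  # one row per movie, one column per genre
v.feature_names_    # column names, in column order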
results in:
array([[ 1.,  0.,  0.],
       [ 0.,  0.,  1.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])
['genre=action', 'genre=comedy', 'genre=drama']
You can also use FeatureUnion to concatenate the features from TfidfVectorizer and DictVectorizer.
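A minimal sketch of that FeatureUnion idea, not a definitive implementation. The toy movies, the labels, the helper functions get_synopsis and get_genre, and the choice of LogisticRegression are all assumptions made up for illustration. Each branch of the union first picks its own field out of the input rows, then vectorizes it; the two resulting matrices are stacked side by side.

```python
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import FunctionTransformer

# Toy data, invented for illustration only.
movies = [
    {"synopsis": "a cop chases a bomber across the city", "genre": "action"},
    {"synopsis": "a family falls apart after a funeral", "genre": "drama"},
    {"synopsis": "two friends open a failing restaurant", "genre": "comedy"},
    {"synopsis": "a soldier escapes from a burning city", "genre": "action"},
]
labels = [1, 0, 0, 1]


def get_synopsis(rows):
    # Hypothetical helper: list of raw synopsis strings for TfidfVectorizer.
    return [row["synopsis"] for row in rows]


def get_genre(rows):
    # Hypothetical helper: list of one-key dicts for DictVectorizer.
    return [{"genre": row["genre"]} for row in rows]


features = FeatureUnion([
    ("text", Pipeline([
        ("select", FunctionTransformer(get_synopsis)),
        ("tfidf", TfidfVectorizer()),
    ])),
    ("genre", Pipeline([
        ("select", FunctionTransformer(get_genre)),
        ("onehot", DictVectorizer()),
    ])),
])

# The classifier is an arbitrary choice; any estimator accepting sparse input works.
model = Pipeline([("features", features), ("clf", LogisticRegression())])
model.fit(movies, labels)

X = model.named_steps["features"].transform(movies)  # tf-idf columns + genre columns
predictions = model.predict(movies)
```

FeatureUnion also takes a transformer_weights argument, which can be used to scale the genre block relative to the tf-idf block if one of them dominates.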
It's hard to find a clean way to include the categorical feature.
Appending the genre to the synopsis is indeed a way to proceed. You could append it multiple times if you want to give it more importance (e.g. if you're using bag of words).
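Here is a rough sketch of the appending trick, under my own assumptions: the helper append_genre, the "genre_" prefix, and the default weight of 3 are made up for illustration. The prefix is only there so the genre token cannot collide with an ordinary word in the synopsis.

```python
from sklearn.feature_extraction.text import TfidfVectorizer


def append_genre(synopsis, genre, weight=3):
    # Hypothetical helper: build a token such as "genre_action" and
    # repeat it `weight` times so its term frequency goes up.
    token = "genre_" + genre.lower().replace(" ", "_")
    return synopsis + " " + " ".join([token] * weight)


# Toy data, invented for illustration only.
docs = [
    append_genre("a cop chases a bomber across the city", "action"),
    append_genre("a family falls apart after a funeral", "drama"),
]

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)  # genre tokens are now ordinary vocabulary entries
```

Keep in mind that with tf-idf the repeated token is still subject to idf weighting and row normalization, so the effect of the weight is not linear.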
Another technique is to train two different classifiers, one with your text data and one with your regular features. You can then ensemble the results (taking the average of predicted probabilities for instance).
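A minimal sketch of the two-classifier approach, assuming a plain unweighted average of the predicted probabilities. The toy data and the use of LogisticRegression for both models are my own assumptions, not a recommendation.

```python
import numpy as np
from sklearn.feature_extraction import DictVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy data, invented for illustration only.
synopses = [
    "a cop chases a bomber across the city",
    "a family falls apart after a funeral",
    "two friends open a failing restaurant",
    "a soldier escapes from a burning city",
]
genres = [{"genre": "action"}, {"genre": "drama"},
          {"genre": "comedy"}, {"genre": "action"}]
labels = [1, 0, 0, 1]

# One model per kind of feature.
text_clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
genre_clf = make_pipeline(DictVectorizer(), LogisticRegression())
text_clf.fit(synopses, labels)
genre_clf.fit(genres, labels)

# Both models were fit on the same labels, so their classes_ are in the same order
# and the probability columns line up.
avg_proba = (text_clf.predict_proba(synopses) + genre_clf.predict_proba(genres)) / 2
predictions = text_clf.classes_[np.argmax(avg_proba, axis=1)]
```

In practice you would predict on held-out data rather than on the training set, and you could weight the two models unequally if one is clearly more reliable.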
If you have only one categorical feature, you could just use it to infer a prior over the final classes.
Hope this helps.