
Use topic modeling information from LDA as features to perform text classification through SVM

I want to perform text classification using topic modeling information as features that are fed to an SVM classifier. How can I generate these topic features with LDA for both the training and test partitions of the dataset, given that the corpus differs between the two partitions?

Am I making a wrong assumption?

Could you provide an example of how to do this with scikit-learn?

asterix asked Dec 06 '16 22:12



1 Answer

Your assumption is right: you should not fit a separate LDA model on each partition. Instead, train LDA on the training data only, then use that trained model to transform both the training and the testing data.

So you'll have something like this:

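from sklearn.decomposition import LatentDirichletAllocation as LDA

# training_data and testing_data are document-term count matrices
# built with the same vocabulary (fitted on the training documents).
# n_components was called n_topics before scikit-learn 0.19.
lda = LDA(n_components=10)  # set any other hyperparameters here
lda.fit(training_data)  # fit on the training partition only
training_features = lda.transform(training_data)
testing_features = lda.transform(testing_data)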
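For a fuller picture, here is a minimal end-to-end sketch that starts from raw text and ends with SVM predictions. The toy corpus and the names train_docs, train_labels and test_docs are made up for illustration, and CountVectorizer and LinearSVC are just one reasonable choice, not something the question prescribes. It assumes scikit-learn 0.19 or later (for n_components).

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Hypothetical toy corpus, for illustration only.
train_docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "my dog chased the cat",
    "stocks fell on the market today",
    "the market rallied as stocks rose",
    "investors sold stocks and bonds",
]
train_labels = [0, 0, 0, 1, 1, 1]
test_docs = ["the dog and the cat", "bonds and stocks on the market"]

# Learn the vocabulary on the training documents only,
# then reuse it unchanged for the test documents.
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(train_docs)
X_test_counts = vectorizer.transform(test_docs)

# Fit LDA on the training counts only; transform both partitions.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
train_topics = lda.fit_transform(X_train_counts)
test_topics = lda.transform(X_test_counts)

# Train the SVM on the topic features and predict on the test set.
clf = LinearSVC()
clf.fit(train_topics, train_labels)
predictions = clf.predict(test_topics)
```

Because both the vectorizer and the LDA model are fitted on the training partition only, the test documents are projected into the same topic space and nothing from the test set leaks into training.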

If I were you, I would concatenate the LDA features with bag-of-words features, using numpy.hstack, or scipy.sparse.hstack if your bag-of-words features are sparse.
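A rough sketch of that concatenation, assuming the bag-of-words features come from CountVectorizer (which returns a sparse matrix) and the topic features from LDA (which returns a dense array). The documents and variable names are hypothetical; the dense topic matrix is wrapped in csr_matrix so that scipy.sparse.hstack receives only sparse blocks.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical toy documents, for illustration only.
train_docs = [
    "the cat sat on the mat",
    "my dog chased the cat",
    "stocks fell on the market today",
    "investors sold stocks and bonds",
]
test_docs = ["the dog and the cat", "bonds and stocks"]

# Sparse bag-of-words counts; vocabulary learned on training data only.
vectorizer = CountVectorizer()
bow_train = vectorizer.fit_transform(train_docs)
bow_test = vectorizer.transform(test_docs)

# Dense document-topic features from an LDA fitted on training data only.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
topics_train = lda.fit_transform(bow_train)
topics_test = lda.transform(bow_test)

# Stack the two feature blocks column-wise, keeping the result sparse.
combined_train = hstack([bow_train, csr_matrix(topics_train)]).tocsr()
combined_test = hstack([bow_test, csr_matrix(topics_test)]).tocsr()
```

The combined matrices have one column per vocabulary term plus one per topic, and can be passed to an SVM such as LinearSVC, which accepts sparse input.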

Ash answered Sep 22 '22 16:09
