I want to perform text classification using topic-model output as features that are fed to an SVM classifier. How can I generate these features with LDA for both the training and test partitions of the dataset, given that the corpus is different for the two partitions?
Am I making a wrong assumption?
Could you provide an example of how to do it using scikit-learn?
Your assumption is right. Fit the LDA model on the training data only, then use that fitted model to transform both the training and the test data.
So you'll have something like this:
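from sklearn.decomposition import LatentDirichletAllocation as LDA
# n_components was named n_topics before scikit-learn 0.19; set any other parameters as needed
lda = LDA(n_components=10)
lda.fit(training_data)  # learn the topics from the training data only
training_features = lda.transform(training_data)
testing_features = lda.transform(testing_data)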
If I were you, I would concatenate the LDA features with the bag-of-words features, using numpy.hstack, or scipy.sparse.hstack if your bag-of-words features are sparse.
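To make the whole flow concrete, here is a minimal end-to-end sketch. The toy documents, the labels, the choice of CountVectorizer to build the count matrix, n_components=2 and LinearSVC as the SVM are all my own assumptions for illustration, not something from the question. The point it shows is that both the vectorizer and the LDA model are fitted on the training documents only and then reused unchanged on the test documents.

```python
from scipy.sparse import csr_matrix, hstack
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.svm import LinearSVC

# Toy corpus, invented purely for illustration.
train_docs = [
    "the striker scored a late goal in the match",
    "the team won the league after a tense match",
    "the goalkeeper saved a penalty in the final",
    "the new phone has a faster processor and camera",
    "the laptop ships with a faster processor and more memory",
    "the software update improves camera and battery",
]
train_labels = ["sport", "sport", "sport", "tech", "tech", "tech"]
test_docs = ["the team scored a goal", "the phone has a new camera"]

# Build the vocabulary from the training documents only,
# then map the test documents onto that same vocabulary.
vectorizer = CountVectorizer()
X_train_bow = vectorizer.fit_transform(train_docs)
X_test_bow = vectorizer.transform(test_docs)

# Fit LDA on the training counts only; transform both partitions.
lda = LatentDirichletAllocation(n_components=2, random_state=0)
train_topics = lda.fit_transform(X_train_bow)
test_topics = lda.transform(X_test_bow)

# Concatenate the sparse bag-of-words columns with the topic columns.
X_train = hstack([X_train_bow, csr_matrix(train_topics)]).tocsr()
X_test = hstack([X_test_bow, csr_matrix(test_topics)]).tocsr()

# Train the SVM on the combined features and predict on the test set.
clf = LinearSVC()
clf.fit(X_train, train_labels)
predictions = clf.predict(X_test)
```

Words in the test documents that never appeared in the training documents are simply ignored by the fitted vectorizer, which is why the feature columns line up across the two partitions.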