I am using sklearn's NMF and LDA sub-modules to analyze unlabeled text. I read the documentation, but I am not sure whether the transform functions in these modules (NMF and LatentDirichletAllocation) are the same as the posterior function in R's topicmodels package (please see Predicting LDA topics for new data). Basically, I am looking for a function that will let me predict the topics in a test set using a model trained on the training set. As a check, I first predicted topics on the entire dataset; then I split the data into train and test sets, trained a model on the train set, and transformed the test set with that model. Although I did not expect identical results, comparing the topics from the two runs does not assure me that the transform function serves the same purpose as R's posterior. I would appreciate your response.
Thank you.
As can be read in the paper Topic Models by Blei and Lafferty (e.g. p. 6, Visualizing Topics, and p. 12), the tf-idf score can be very useful for LDA. It can be used to visualize topics or to choose the vocabulary, since, as the authors note, "it is often computationally expensive to use the entire vocabulary."
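For illustration, here is a minimal sketch of that kind of vocabulary pruning (the toy docs and cutoff k are made up, and get_feature_names_out assumes scikit-learn 1.0+): rank terms by their best tf-idf score in the corpus and keep only the top k as the vocabulary for a count-based LDA input.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = ["the cat sat on the mat",
        "dogs and cats living together",
        "the dog chased the cat"]
# score every term by its highest tf-idf value anywhere in the corpus
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
scores = np.asarray(X.max(axis=0).todense()).ravel()
terms = tfidf.get_feature_names_out()
# keep only the top-k terms as the pruned vocabulary
k = 5
top_terms = [terms[i] for i in scores.argsort()[::-1][:k]]
# LDA is usually fit on raw counts, restricted to the pruned vocabulary
counts = CountVectorizer(vocabulary=top_terms).fit_transform(docs)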
Note that "LDA" can also refer to Linear Discriminant Analysis, a supervised classification algorithm. It works by calculating summary statistics for the input features by class label, such as the mean and standard deviation, and these statistics represent the model learned from the training data. That algorithm is unrelated to Latent Dirichlet Allocation, the topic model discussed here.
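For contrast, a minimal sketch of that supervised LDA (the iris data is used purely as a stand-in example):
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris
# the classifier LDA: learns per-class statistics and predicts labels
X, y = load_iris(return_X_y=True)
clf = LinearDiscriminantAnalysis().fit(X, y)
print(clf.predict(X[:3]))  # class labels, not topic proportions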
The call to transform on a LatentDirichletAllocation model returns an unnormalized document-topic distribution (note that recent versions of scikit-learn normalize the output of transform for you, in which case the normalization step below is a harmless no-op). To get proper probabilities, you can simply normalize the result. Here is an example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
# grab a sample data set
dataset = fetch_20newsgroups(shuffle=True, remove=('headers', 'footers', 'quotes'))
train, test = dataset.data[:100], dataset.data[100:200]
# vectorize the features (LDA is more commonly fit on raw counts, but tf-idf works here too)
tf_vectorizer = TfidfVectorizer(max_features=25)
X_train = tf_vectorizer.fit_transform(train)
# train the model (n_topics was renamed to n_components in newer scikit-learn)
lda = LatentDirichletAllocation(n_components=5)
lda.fit(X_train)
# predict topics for the test data: transform gives the doc-topic distribution
X_test = tf_vectorizer.transform(test)
doc_topic_dist_unnormalized = lda.transform(X_test)
# normalize the distribution (only needed if you want to work with the probabilities)
doc_topic_dist = doc_topic_dist_unnormalized / doc_topic_dist_unnormalized.sum(axis=1, keepdims=True)
To find the top-ranking topic for each document, you can do something like:
doc_topic_dist.argmax(axis=1)
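If you also want to inspect what those topics contain, one common sketch (assuming the lda and tf_vectorizer objects from above, and scikit-learn 1.0+ for get_feature_names_out) is to rank each row of lda.components_, the topic-word weight matrix:
feature_names = tf_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    # take the 5 highest-weighted words in this topic
    top_words = [feature_names[i] for i in topic.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {' '.join(top_words)}")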