
python - sklearn Latent Dirichlet Allocation transform vs. fit_transform

I am using sklearn's NMF and LDA sub-modules to analyze unlabeled text. I read the documentation, but I am not sure whether the transform functions in these modules (NMF and LDA) are the same as the posterior function in R's topicmodels package (please see Predicting LDA topics for new data). Basically, I am looking for a function that will allow me to predict the topics in a test set using a model trained on the training set. First, I predicted topics on the entire dataset. Then I split the data into train and test sets, trained a model on the train set, and transformed the test set using that model. Although I did not expect identical results, comparing the topics from the two runs does not assure me that the transform function serves the same purpose as R's package. I would appreciate your response.

Thank you.

valearner asked Nov 14 '16

People also ask

Can you use TF IDF with LDA?

As can be read in the paper Topic Models by Blei and Lafferty (e.g. p. 6, Visualizing Topics, and p. 12), the tf-idf score can be very useful for LDA. It can be used to visualize topics or to choose the vocabulary: "It is often computationally expensive to use the entire vocabulary."

How does LDA work in Python?

Linear Discriminant Analysis, or LDA for short, is a classification machine learning algorithm. It works by calculating summary statistics for the input features by class label, such as the mean and standard deviation. These statistics represent the model learned from the training data. (Note that this is Linear Discriminant Analysis, a supervised classifier, and not the Latent Dirichlet Allocation topic model discussed in the question above.)
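As a quick illustration of that classifier, here is a minimal sketch on made-up two-class data (the data and class means are invented for illustration):

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# two toy classes drawn around different means
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(3, 1, (20, 2))])
y = np.array([0] * 20 + [1] * 20)

clf = LinearDiscriminantAnalysis()
clf.fit(X, y)

# per-class means estimated from the training data
print(clf.means_)
# points near each class mean get the corresponding label
print(clf.predict([[0, 0], [3, 3]]))
```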


1 Answer

Calling transform on a fitted LatentDirichletAllocation model returns the document-topic distribution for new documents, which is the analogue of topicmodels::posterior in R. In older versions of scikit-learn this distribution is unnormalized; to get proper probabilities, simply normalize each row (recent releases already return a normalized distribution, in which case the normalization is a no-op). Here is an example:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
import numpy as np

# grab a sample data set
dataset = fetch_20newsgroups(shuffle=True, remove=('headers', 'footers', 'quotes'))
train,test = dataset.data[:100], dataset.data[100:200]

# vectorize the features
tf_vectorizer = TfidfVectorizer(max_features=25)
X_train = tf_vectorizer.fit_transform(train)

# train the model
lda = LatentDirichletAllocation(n_components=5)  # n_topics was renamed to n_components in scikit-learn 0.19
lda.fit(X_train)

# predict topics for test data
# unnormalized doc-topic distribution
X_test = tf_vectorizer.transform(test)
doc_topic_dist_unnormalized = lda.transform(X_test)

# normalize the distribution (only needed if you want to work with probabilities;
# np.matrix is deprecated, so a plain array with keepdims is used instead)
doc_topic_dist = doc_topic_dist_unnormalized / doc_topic_dist_unnormalized.sum(axis=1, keepdims=True)

To find the top-ranking topic for each document you can do something like:

doc_topic_dist.argmax(axis=1)
Ryan Walker answered Dec 14 '22