I am using sklearn's NMF and LDA sub-modules to analyze unlabeled text. I read the documentation, but I am not sure whether the transform functions in these modules (NMF and LatentDirichletAllocation) are the same as the posterior function in R's topicmodels package (please see Predicting LDA topics for new data). Basically, I am looking for a function that will let me predict the topics in a test set using a model trained on the training set. As a check, I first predicted topics on the entire dataset; then I split the data into train and test sets, trained a model on the train set, and transformed the test set with that model. Although I did not expect identical results, comparing the topics from the two runs does not assure me that the transform function serves the same purpose as R's posterior. I would appreciate your response.
Thank you.
As can be read in the paper Topic Models by Blei and Lafferty (e.g. p. 6, Visualizing Topics, and p. 12), the tf-idf score can be very useful for LDA. It can be used to visualize topics or to choose the vocabulary, since, as the authors note, "it is often computationally expensive to use the entire vocabulary."
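For illustration, here is a minimal sketch of that kind of vocabulary pruning (the toy docs and cutoff k are made up, and get_feature_names_out assumes scikit-learn 1.0+): rank terms by their best tf-idf score in the corpus and keep only the top k as the vocabulary for a count-based LDA input.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
docs = ["the cat sat on the mat",
        "dogs and cats living together",
        "the dog chased the cat"]
# score every term by its highest tf-idf value anywhere in the corpus
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)
scores = np.asarray(X.max(axis=0).todense()).ravel()
terms = tfidf.get_feature_names_out()
# keep only the top-k terms as the pruned vocabulary
k = 5
top_terms = [terms[i] for i in scores.argsort()[::-1][:k]]
# LDA is usually fit on raw counts, restricted to the pruned vocabulary
counts = CountVectorizer(vocabulary=top_terms).fit_transform(docs)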
Note that "LDA" can also refer to Linear Discriminant Analysis, a supervised classification algorithm. It works by calculating summary statistics for the input features by class label, such as the mean and standard deviation, and these statistics represent the model learned from the training data. That algorithm is unrelated to Latent Dirichlet Allocation, the topic model discussed here.
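For contrast, a minimal sketch of that supervised LDA (the iris data is used purely as a stand-in example):
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.datasets import load_iris
# the classifier LDA: learns per-class statistics and predicts labels
X, y = load_iris(return_X_y=True)
clf = LinearDiscriminantAnalysis().fit(X, y)
print(clf.predict(X[:3]))  # class labels, not topic proportions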
The call to transform on a LatentDirichletAllocation model returns an unnormalized document-topic distribution (note that recent versions of scikit-learn normalize the output of transform for you, in which case the normalization step below is a harmless no-op). To get proper probabilities, you can simply normalize the result. Here is an example:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.datasets import fetch_20newsgroups
# grab a sample data set
dataset = fetch_20newsgroups(shuffle=True, remove=('headers', 'footers', 'quotes'))
train, test = dataset.data[:100], dataset.data[100:200]
# vectorize the features (LDA is more commonly fit on raw counts, but tf-idf works here too)
tf_vectorizer = TfidfVectorizer(max_features=25)
X_train = tf_vectorizer.fit_transform(train)
# train the model (n_topics was renamed to n_components in newer scikit-learn)
lda = LatentDirichletAllocation(n_components=5)
lda.fit(X_train)
# predict topics for the test data: transform gives the doc-topic distribution
X_test = tf_vectorizer.transform(test)
doc_topic_dist_unnormalized = lda.transform(X_test)
# normalize the distribution (only needed if you want to work with the probabilities)
doc_topic_dist = doc_topic_dist_unnormalized / doc_topic_dist_unnormalized.sum(axis=1, keepdims=True)
To find the top-ranking topic for each document, you can do something like:
doc_topic_dist.argmax(axis=1)
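If you also want to inspect what those topics contain, one common sketch (assuming the lda and tf_vectorizer objects from above, and scikit-learn 1.0+ for get_feature_names_out) is to rank each row of lda.components_, the topic-word weight matrix:
feature_names = tf_vectorizer.get_feature_names_out()
for topic_idx, topic in enumerate(lda.components_):
    # take the 5 highest-weighted words in this topic
    top_words = [feature_names[i] for i in topic.argsort()[::-1][:5]]
    print(f"Topic {topic_idx}: {' '.join(top_words)}")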