sklearn LatentDirichletAllocation topic inference on new corpus

I have been using the sklearn.decomposition.LatentDirichletAllocation module to explore a corpus of documents. After a number of iterations of training and adjusting the model (i.e. adding stopwords and synonyms, varying the number of topics), I am fairly happy and familiar with the distilled topics. As a next step I would like to apply the trained model to a new corpus.

Is it possible to apply the fitted model to a new set of documents to determine their topic distributions?

I know this is possible within the gensim library, where you can train a model:

from gensim.test.utils import common_texts
from gensim.corpora.dictionary import Dictionary
from gensim.models import LdaModel

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

lda = LdaModel(common_corpus, num_topics=10)

And subsequently apply the trained model to a new corpus:

topic_distributions = lda[unseen_doc]

from: https://radimrehurek.com/gensim/models/ldamodel.html

How does one do this using the scikit-learn application of LDA?

asked Oct 16 '25 by J. Veenkamp

1 Answer

Doesn't transform do that?

>>> from sklearn.feature_extraction.text import CountVectorizer
>>> from sklearn.decomposition import LatentDirichletAllocation
>>> from sklearn.datasets import fetch_20newsgroups
>>> 
>>> n_samples = 2000
>>> n_features = 1000
>>> n_components = 10
>>> 
>>> dataset = fetch_20newsgroups(shuffle=True, random_state=1,
...                              remove=('headers', 'footers', 'quotes'))
>>> data_samples = dataset.data[:n_samples]
>>> 
>>> tf_vectorizer = CountVectorizer(max_df=0.95, min_df=2,
...                                 max_features=n_features,
...                                 stop_words='english')
>>> tf = tf_vectorizer.fit_transform(data_samples)
>>> 
>>> lda = LatentDirichletAllocation(n_components=n_components, max_iter=5,
...                                 learning_method='online',
...                                 learning_offset=50.,
...                                 random_state=0)
>>> 
>>> lda.fit(tf)
LatentDirichletAllocation(batch_size=128, doc_topic_prior=None,
             evaluate_every=-1, learning_decay=0.7,
             learning_method='online', learning_offset=50.0,
             max_doc_update_iter=100, max_iter=5, mean_change_tol=0.001,
             n_components=10, n_jobs=1, n_topics=None, perp_tol=0.1,
             random_state=0, topic_word_prior=None,
             total_samples=1000000.0, verbose=0)
>>> 
>>> print(lda.transform(tf_vectorizer.transform(dataset.data[-3:])))
[[0.0142868  0.63695359 0.01428674 0.01428686 0.01428606 0.01429304
  0.014286   0.24874298 0.01429136 0.01428656]
 [0.01111385 0.45234109 0.01111409 0.45875254 0.01111215 0.01111384
  0.01111214 0.01111282 0.01111441 0.01111307]
 [0.001786   0.68840635 0.00178639 0.00178615 0.00178625 0.00178627
  0.00178587 0.00178627 0.29730378 0.00178667]]
answered Oct 18 '25 by adrin