I have a set of documents that humans have already classified into groups. Is there a modified version of LDA that I can train on them and then use to classify unknown documents?


Supervised Latent Dirichlet Allocation for Document Classification?

2 Answers

You can implement supervised LDA (sLDA) with PyMC, which uses a Metropolis sampler to learn the latent variables in the following graphical model:

[Figure: sLDA graphical model]

The training corpus consists of 10 movie reviews (5 positive and 5 negative) along with the star rating for each document. The star rating is the response variable: a quantity of interest associated with each document. The documents and response variables are modeled jointly in order to find latent topics that best predict the response variables of future unlabeled documents. For more information, check out the original paper. Consider the following code:


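# Note: this example uses the PyMC 2 API (pm.MCMC, pm.Container, pm.CompletedDirichlet)
import pymc as pm
import numpy as np
import matplotlib.pyplot as plt  # needed for the topic plots at the end
from sklearn.feature_extraction.text import TfidfVectorizer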

train_corpus = ["exploitative and largely devoid of the depth or sophistication ",
                "simplistic silly and tedious",
                "it's so laddish and juvenile only teenage boys could possibly find it funny",
                "it shows that some studios firmly believe that people have lost the ability to think",
                "our culture is headed down the toilet with the ferocity of a frozen burrito",
                "offers that rare combination of entertainment and education",
                "the film provides some great insight",
                "this is a film well worth seeing",
                "a masterpiece four years in the making",
                "offers a breath of the fresh air of true sophistication"]
test_corpus =  ["this is a really positive review, great film"]
train_response = np.array([3, 1, 3, 2, 1, 5, 4, 4, 5, 5]) - 3

#LDA parameters
num_features = 1000  #vocabulary size
num_topics = 4       #fixed for LDA

tfidf = TfidfVectorizer(max_features = num_features, max_df=0.95, min_df=0, stop_words = 'english')

#generate tf-idf term-document matrix
A_tfidf_sp = tfidf.fit_transform(train_corpus)  #size D x V

print("number of docs: %d" % A_tfidf_sp.shape[0])
print("dictionary size: %d" % A_tfidf_sp.shape[1])

#tf-idf dictionary    
tfidf_dict = tfidf.get_feature_names()

K = num_topics # number of topics
V = A_tfidf_sp.shape[1] # number of words
D = A_tfidf_sp.shape[0] # number of documents

data = A_tfidf_sp.toarray()

#Supervised LDA Graphical Model
Wd = [len(doc) for doc in data]        
alpha = np.ones(K)
beta = np.ones(V)

theta = pm.Container([pm.CompletedDirichlet("theta_%s" % i, pm.Dirichlet("ptheta_%s" % i, theta=alpha)) for i in range(D)])
phi = pm.Container([pm.CompletedDirichlet("phi_%s" % k, pm.Dirichlet("pphi_%s" % k, theta=beta)) for k in range(K)])    

z = pm.Container([pm.Categorical('z_%s' % d, p = theta[d], size=Wd[d], value=np.random.randint(K, size=Wd[d])) for d in range(D)])

@pm.deterministic
def zbar(z=z):    
    zbar_list = []
    for i in range(len(z)):
        # fix the bin range to [0, K] so that each bin corresponds to exactly one topic
        hist, bin_edges = np.histogram(z[i], bins=K, range=(0, K))
        zbar_list.append(hist / float(np.sum(hist)))                
    return pm.Container(zbar_list)

eta = pm.Container([pm.Normal("eta_%s" % k, mu=0, tau=1.0/10**2) for k in range(K)])
y_tau = pm.Gamma("tau", alpha=0.1, beta=0.1)

@pm.deterministic
def y_mu(eta=eta, zbar=zbar):
    y_mu_list = []
    for i in range(len(zbar)):
        y_mu_list.append(np.dot(eta, zbar[i]))
    return pm.Container(y_mu_list)

#response likelihood
y = pm.Container([pm.Normal("y_%s" % d, mu=y_mu[d], tau=y_tau, value=train_response[d], observed=True) for d in range(D)])

# cannot use p=phi[z[d][i]] here since phi is an ordinary list while z[d][i] is stochastic
w = pm.Container([pm.Categorical("w_%i_%i" % (d,i), p = pm.Lambda('phi_z_%i_%i' % (d,i), lambda z=z[d][i], phi=phi: phi[z]),
                  value=data[d][i], observed=True) for d in range(D) for i in range(Wd[d])])

model = pm.Model([theta, phi, z, eta, y, w])
mcmc = pm.MCMC(model)
mcmc.sample(iter=1000, burn=100, thin=2)

#visualize topics    
phi0_samples = np.squeeze(mcmc.trace('phi_0')[:])
phi1_samples = np.squeeze(mcmc.trace('phi_1')[:])
phi2_samples = np.squeeze(mcmc.trace('phi_2')[:])
phi3_samples = np.squeeze(mcmc.trace('phi_3')[:])
ax = plt.subplot(221)
plt.bar(np.arange(V), phi0_samples[-1,:])
ax = plt.subplot(222)
plt.bar(np.arange(V), phi1_samples[-1,:])
ax = plt.subplot(223)
plt.bar(np.arange(V), phi2_samples[-1,:])
ax = plt.subplot(224)
plt.bar(np.arange(V), phi3_samples[-1,:])
plt.show()
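
Written out, the model that the code above specifies is roughly the following. This is a transcription of the priors and likelihoods in the code rather than a quotation from the original paper; the symbol N_d stands for Wd[d], and the subscript notation is an assumption chosen to match the code's variable names.

```latex
\begin{align*}
\theta_d &\sim \mathrm{Dirichlet}(\alpha), \quad d = 1, \dots, D
  && \text{(topic proportions per document)} \\
\phi_k &\sim \mathrm{Dirichlet}(\beta), \quad k = 1, \dots, K
  && \text{(word distribution per topic)} \\
z_{d,n} \mid \theta_d &\sim \mathrm{Categorical}(\theta_d), \quad n = 1, \dots, N_d
  && \text{(topic assignment per word)} \\
w_{d,n} \mid z_{d,n}, \phi &\sim \mathrm{Categorical}(\phi_{z_{d,n}})
  && \text{(observed word)} \\
\bar{z}_{d,k} &= \frac{1}{N_d} \sum_{n=1}^{N_d} \mathbb{1}[z_{d,n} = k]
  && \text{(empirical topic frequencies)} \\
\eta_k &\sim \mathcal{N}(0, 10^2), \qquad \tau \sim \mathrm{Gamma}(0.1, 0.1)
  && \text{(regression weights and precision)} \\
y_d \mid \bar{z}_d, \eta, \tau &\sim \mathcal{N}(\eta^\top \bar{z}_d, \tau^{-1})
  && \text{(observed response)}
\end{align*}
```

The last line is what makes the model supervised: the response depends on the topic assignments, so the sampler is pushed toward topics that help explain the ratings.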

Given the training data (observed words and response variables), we can learn the global topics (called beta in the sLDA paper; these are the phi variables in the code above, where beta is the Dirichlet prior) and the regression coefficients (eta) for predicting the response variable (Y), in addition to the topic proportions for each document (theta). To predict Y from the learned topics and eta, we can define a new model in which Y is not observed and the previously learned topics and eta are reused, which gives the following result:

[Figure: sLDA prediction]

Here the model predicts a positive review (approximately 2, given the shifted review rating range of -2 to 2) for the test corpus, which consists of one sentence, "this is a really positive review, great film", as shown by the mode of the posterior histogram on the right. See the IPython notebook for a complete implementation.
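
As a rough illustration of the prediction step, the sketch below computes the expected response eta . zbar for a test document from posterior samples. It is not the notebook's code: `predict_response` is a hypothetical helper, and the `eta_samples` and `z_samples` values are made up for illustration where a real run would take them from the MCMC traces. It also returns only the mean of the response per sample, without the Gaussian noise term.

```python
import numpy as np

def predict_response(z_samples, eta_samples, num_topics):
    """Return eta . zbar for each posterior sample (hypothetical helper).

    z_samples:   list of 1-D integer arrays, the topic assignment of every
                 word in the test document, one array per retained sample
    eta_samples: array of shape (num_samples, num_topics)
    """
    preds = []
    for z, eta in zip(z_samples, eta_samples):
        # empirical topic frequencies of the test document
        zbar = np.bincount(z, minlength=num_topics) / float(len(z))
        preds.append(np.dot(eta, zbar))
    return np.array(preds)

# Made-up posterior samples, for illustration only
eta_samples = np.array([[-2.0, -1.0, 1.0, 2.0],
                        [-1.8, -0.9, 1.1, 2.1]])
z_samples = [np.array([3, 3, 2, 3]), np.array([3, 2, 3, 3])]

y_pred = predict_response(z_samples, eta_samples, num_topics=4)
print(y_pred)         # one expected rating per posterior sample
print(y_pred.mean())  # a point estimate of the rating
```

A histogram of `y_pred` over many samples would play the role of the posterior histogram described above.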


answered Nov 10 '22 06:11

Vadim Smolyakov

For what it's worth, LDA as a classifier is going to be fairly weak because it's a generative model, and classification is a discriminative problem. There is a variant of LDA called supervised LDA, which uses a more discriminative criterion to form the topics (you can get source for this in various places), and there's also a paper with a max-margin formulation, though I don't know whether source code for it is available. I would avoid the Labelled LDA formulation unless you're sure that's what you want, because it makes a strong assumption about the correspondence between topics and categories in the classification problem.

However, it's worth pointing out that none of these methods uses the topic model directly to do the classification. Instead of word-based features, they take each document's posterior over the topics (the vector that results from inference for the document) as its feature representation and feed that to a classifier, usually a linear SVM. This gets you a topic-model-based dimensionality reduction followed by a strong discriminative classifier, which is probably what you're after. This pipeline is available in most languages using popular toolkits.
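
A minimal sketch of that pipeline with scikit-learn might look like the following. The toolkit, the toy data, the number of topics, and the labels are all assumptions made for illustration, and scikit-learn's LatentDirichletAllocation is plain unsupervised LDA fitted with variational inference, not one of the supervised variants mentioned above.

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Toy data for illustration only (sentences borrowed from the other answer's corpus)
train_docs = [
    "simplistic silly and tedious",
    "exploitative and largely devoid of the depth or sophistication",
    "our culture is headed down the toilet with the ferocity of a frozen burrito",
    "the film provides some great insight",
    "this is a film well worth seeing",
    "offers that rare combination of entertainment and education",
]
train_labels = ["neg", "neg", "neg", "pos", "pos", "pos"]

# Word counts -> per-document topic proportions -> linear SVM
clf = make_pipeline(
    CountVectorizer(stop_words="english"),
    LatentDirichletAllocation(n_components=4, random_state=0),  # 4 topics is arbitrary
    LinearSVC(),
)
clf.fit(train_docs, train_labels)

# The first two steps alone produce the topic-based feature vectors
topic_features = clf[:-1].transform(train_docs)
print(topic_features.shape)

pred = clf.predict(["a great film with real insight"])
print(pred)
```

With a corpus this small the prediction itself means little; the point is only the shape of the pipeline, in which the classifier sees a low-dimensional topic vector instead of the raw word counts.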

answered Nov 10 '22 06:11

Ben Allison


Tags: machine-learning, classification, nlp, lda, document-classification

snøreven
