 

Text Clustering and topic extraction

I'm doing some text mining using the excellent scikit-learn module. I'm trying to cluster and classify scientific abstracts.

I'm looking for a way to cluster my set of tf-idf representations without having to specify the number of clusters in advance. I haven't been able to find a good algorithm that can do that and still handle large sparse matrices decently. I have looked into simply using scikit-learn's k-means, but it has no way to determine the optimal number of clusters (for example using BIC). I have also tried Gaussian mixture models (selecting the model with the best BIC score), but they are awfully slow.
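For reference, a minimal sketch of the BIC-based selection with the current scikit-learn API; `X_tfidf` is a placeholder for the tf-idf matrix, and the TruncatedSVD step is an added assumption because GaussianMixture does not accept sparse input:

    # Sketch only: reduce the sparse tf-idf matrix to a dense low-dimensional
    # space first, because GaussianMixture does not accept sparse input.
    from sklearn.decomposition import TruncatedSVD
    from sklearn.mixture import GaussianMixture

    # X_tfidf is a placeholder for the output of a fitted TfidfVectorizer.
    X_dense = TruncatedSVD(n_components=100).fit_transform(X_tfidf)

    best_model, best_bic = None, float("inf")
    for k in range(2, 30):
        gmm = GaussianMixture(n_components=k, covariance_type="diag",
                              random_state=0).fit(X_dense)
        bic = gmm.bic(X_dense)
        if bic < best_bic:
            best_model, best_bic = gmm, bic

    labels = best_model.predict(X_dense)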

After I have clustered the documents, I would like to be able to look into the topics of each cluster, meaning the words they tend to use. Is there a way to extract this information, given the data matrix and the cluster labels? Maybe by taking the mean of each cluster and inverse-transforming it with the tf-idf vectorizer? I've previously tried chi-square and random forests to rank feature importance, but that doesn't tell me which label class uses which words.
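Something along those lines can be done with k-means centroids, since for k-means the centroids are exactly the per-cluster means in tf-idf space. A hedged sketch, assuming `km` is a fitted KMeans and `vectorizer` the fitted TfidfVectorizer (both hypothetical names):

    import numpy as np

    # Rank the tf-idf features by their weight in each cluster centroid.
    # On older scikit-learn versions the call below is get_feature_names().
    terms = vectorizer.get_feature_names_out()
    order = np.argsort(km.cluster_centers_, axis=1)[:, ::-1]

    for i, idx in enumerate(order):
        print(f"Cluster {i}:", ", ".join(terms[j] for j in idx[:10]))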

I've tried using the NMF decomposition method (simply using the example code from scikit-learn's website) to do topic detection. It worked great and produced very meaningful topics very quickly. However, I did not find a way to use it to assign each data point to a cluster, nor to automatically determine the 'optimal' number of clusters. But it's the sort of thing I'm looking for.
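For context, a minimal sketch in the spirit of the scikit-learn topic-extraction example; `documents`, `n_topics`, and the vectorizer parameters are placeholders:

    from sklearn.decomposition import NMF
    from sklearn.feature_extraction.text import TfidfVectorizer

    # Placeholder parameters; NMF will not choose the number of topics for you.
    vectorizer = TfidfVectorizer(max_df=0.95, min_df=2, stop_words="english")
    X = vectorizer.fit_transform(documents)

    n_topics = 10
    nmf = NMF(n_components=n_topics, random_state=1).fit(X)

    # Print the highest-weighted terms of each NMF component ("topic").
    terms = vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(nmf.components_):
        top = [terms[i] for i in topic.argsort()[::-1][:10]]
        print(f"Topic {topic_idx}:", " ".join(top))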

I also read somewhere that it's possible to extract topic information directly from a fitted LDA model, but I don't understand how it's done. Since I have already implemented LDA as a baseline classifier and visualisation tool, this might be an easy solution.
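If the LDA meant here is Latent Dirichlet Allocation (rather than Linear Discriminant Analysis), the topic-word weights are exposed through the fitted model's components_ attribute. A hedged sketch using scikit-learn's LatentDirichletAllocation, with `X_counts` and `count_vectorizer` as hypothetical names for a CountVectorizer output:

    from sklearn.decomposition import LatentDirichletAllocation

    # Sketch only, and only if "LDA" means Latent Dirichlet Allocation here;
    # LDA expects raw term counts, not tf-idf.
    lda = LatentDirichletAllocation(n_components=10, random_state=0).fit(X_counts)

    terms = count_vectorizer.get_feature_names_out()
    for topic_idx, topic in enumerate(lda.components_):
        top = [terms[i] for i in topic.argsort()[::-1][:10]]
        print(f"Topic {topic_idx}:", " ".join(top))

    # Per-document topic proportions, usable as soft cluster assignments:
    doc_topics = lda.transform(X_counts)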

If I manage to produce meaningful clusters/topics, I am going to compare them to some human-made labels (not topic-based) to see how they correspond. But that's a topic for another thread :-)

asked May 30 '13 by Misconstruction

2 Answers

You can try TF-IDF with a low max_df, e.g. max_df=0.5, and then k-means (or MiniBatchKMeans). To find a good value for K you can try one of these heuristics:

  • the gap statistic
  • the prediction strength

High-level descriptions of both are provided in this blog post: http://blog.echen.me/2011/03/19/counting-clusters/

Neither of those methods is implemented in sklearn. I would be very interested if you find either of them useful for your problem. If so, it would probably be worth discussing how best to contribute a default implementation to scikit-learn.
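A minimal sketch of the pipeline described above; since neither heuristic ships with scikit-learn, the silhouette score is used below purely as a stand-in for picking K, and `documents` is a placeholder:

    from sklearn.cluster import MiniBatchKMeans
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import silhouette_score

    # TF-IDF with a low max_df, then MiniBatchKMeans over a range of K.
    vectorizer = TfidfVectorizer(max_df=0.5, stop_words="english")
    X = vectorizer.fit_transform(documents)

    best_k, best_score = None, -1.0
    for k in range(2, 20):
        km = MiniBatchKMeans(n_clusters=k, random_state=0).fit(X)
        # Subsample the silhouette computation to keep it fast on large corpora.
        score = silhouette_score(X, km.labels_,
                                 sample_size=min(2000, X.shape[0]),
                                 random_state=0)
        if score > best_score:
            best_k, best_score = k, score

    print("Best K by silhouette:", best_k)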

answered by ogrisel


There are two ways to go about this:

  • Clustering approach: Use the transformed feature set produced by NMF as input for a clustering algorithm. For example, if you use the k-means algorithm, you can set k to the number of topics (i.e. new features/components) that you have. I think this paper talks about something like that. (See the sketch after this list.)

  • Tagging approach: This is the approach I have used recently. It allows you to tag documents with one or more topics. Use the transform() function of the NMF model object to get an n_samples * n_topics matrix. Then set a threshold for each topic; in my case, 0.02 worked well. Assign a topic to a document if the corresponding value is greater than that threshold. Note that this means some documents will have more than one topic assigned to them, while others will have none. But I found that this approach gave very meaningful and interesting results.
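A hedged sketch of both approaches, assuming `X` is the tf-idf matrix; the number of topics is a placeholder and the 0.02 threshold is the value quoted above:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.decomposition import NMF

    # Factorize the tf-idf matrix into document-topic weights W.
    n_topics = 10
    nmf = NMF(n_components=n_topics, random_state=1)
    W = nmf.fit_transform(X)  # shape: (n_samples, n_topics)

    # 1) Clustering approach: run k-means on the documents in topic space.
    labels = KMeans(n_clusters=n_topics, random_state=1).fit_predict(W)

    # 2) Tagging approach: assign every topic whose weight exceeds the threshold.
    threshold = 0.02
    tags = [np.where(row > threshold)[0].tolist() for row in W]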

answered by Phani