Visualizing topics with Spark LDA

I'm using the PySpark ML LDA library to fit a topic model on the 20 newsgroups dataset from sklearn. I apply the standard tokenization, stop-word removal and tf-idf transformations to the training corpus. In the end, I can get the topics and print out the term indices and their weights:

topics = model.describeTopics()
topics.show()
+-----+--------------------+--------------------+
|topic|         termIndices|         termWeights|
+-----+--------------------+--------------------+
|    0|[5456, 6894, 7878...|[0.03716766297248...|
|    1|[5179, 3810, 1545...|[0.12236370744240...|
|    2|[5653, 4248, 3655...|[1.90742686393836...|
...

However, how do I map from term indices to actual words to visualize the topics? I'm using a HashingTF applied to a tokenized list of strings to derive the term indices. How do I generate a dictionary (map from indices to words) for visualizing topics?

Vadim Smolyakov asked May 29 '17 02:05


2 Answers

An alternative to a HashingTF is a CountVectorizer that generates a vocabulary:

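from pyspark.ml.feature import CountVectorizer

# num_features (maximum vocabulary size) and newsgroups (the tokenized DataFrame) are defined earlier
count_vec = CountVectorizer(inputCol="tokens_filtered", outputCol="tf_features", vocabSize=num_features, minDF=2.0)
count_vec_model = count_vec.fit(newsgroups)
newsgroups = count_vec_model.transform(newsgroups)
vocab = count_vec_model.vocabulary  # list of words; a word's position is its term index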

Given the vocabulary as a list of words, we can index into it to visualize the topics:

topics = model.describeTopics()
topics_rdd = topics.rdd

# Look up each term index in the vocabulary to recover the word
topics_words = topics_rdd\
       .map(lambda row: row['termIndices'])\
       .map(lambda idx_list: [vocab[idx] for idx in idx_list])\
       .collect()

for idx, topic in enumerate(topics_words):
    print("topic:", idx)
    print("----------")
    for word in topic:
        print(word)
    print("----------")
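
To show the weights next to the words as well, the same lookup can be extended to pair each word with its entry in termWeights. The following is only a sketch in plain Python: the helper name topic_terms, the tiny vocab and the topic_rows values are made up for illustration, and the dicts stand in for rows that would presumably come from something like model.describeTopics().collect().

```python
def topic_terms(topic_rows, vocab):
    """Pair each term index with its word and weight, grouped by topic id."""
    result = {}
    for row in topic_rows:
        words = [vocab[i] for i in row["termIndices"]]
        result[row["topic"]] = list(zip(words, row["termWeights"]))
    return result


# Hypothetical stand-in data; real values would come from the fitted models.
vocab = ["space", "nasa", "game", "team", "orbit"]
topic_rows = [
    {"topic": 0, "termIndices": [0, 1, 4], "termWeights": [0.5, 0.3, 0.2]},
    {"topic": 1, "termIndices": [3, 2], "termWeights": [0.6, 0.4]},
]

for topic_id, pairs in topic_terms(topic_rows, vocab).items():
    print("topic:", topic_id)
    for word, weight in pairs:
        print(f"  {word:<10s} {weight:.3f}")
```

Because the lookup happens on the driver after collecting, this only makes sense when the number of topics and terms per topic is small, which is usually the case for describeTopics output.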

Vadim Smolyakov answered Sep 22 '22 17:09


HashingTF is irreversible: from the output index you cannot recover the input word, because multiple words might map to the same index. Use CountVectorizer instead. It performs a similar transformation but keeps a vocabulary, so indices can be mapped back to words.
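
The difference can be illustrated without Spark. The sketch below is a toy model, not Spark's implementation: hashed_index is a hypothetical helper that uses zlib.crc32 as a stand-in hash (not the hash function Spark uses), and the bucket count is deliberately tiny so that collisions are guaranteed.

```python
import zlib


def hashed_index(term, num_features):
    # Stand-in for the hashing trick: hash the term, then take it modulo the bucket count.
    return zlib.crc32(term.encode("utf-8")) % num_features


terms = ["space", "nasa", "orbit", "hockey", "team", "game", "windows", "driver"]
num_features = 4  # 8 terms into 4 buckets: at least two terms must share an index

# Hashing approach: group terms by the index they land in.
buckets = {}
for term in terms:
    buckets.setdefault(hashed_index(term, num_features), []).append(term)
collisions = {i: ws for i, ws in buckets.items() if len(ws) > 1}
print("indices shared by several terms:", collisions)

# Vocabulary approach: every term gets its own index, so the mapping is invertible.
vocab = sorted(set(terms))
term_to_index = {t: i for i, t in enumerate(vocab)}
recovered = [vocab[term_to_index[t]] for t in terms]
print("round trip recovers the terms:", recovered == terms)
```

With a realistic number of features collisions are rarer, but the index still carries no record of which word produced it, so a vocabulary has to be kept separately if the words are needed later.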

Hernan C. Vazquez answered Sep 22 '22 17:09
