 

Understanding LDA in Spark

I am running Latent Dirichlet Allocation (LDA) in Spark and am trying to understand its output.

Here is my sample dataset after applying the text-feature transforms Tokenizer, StopWordsRemover, and CountVectorizer:

[Row(Id=u'39', tf_features=SparseVector(1184, {89: 1.0, 98: 2.0, 108: 1.0, 168: 3.0, 210: 1.0, 231: 1.0, 255: 1.0, 290: 1.0, 339: 1.0, 430: 1.0, 552: 1.0, 817: 1.0, 832: 1.0, 836: 1.0, 937: 1.0, 999: 1.0, 1157: 1.0})),
 Row(Id=u'7666', tf_features=SparseVector(1184, {15: 2.0, 186: 2.0, 387: 2.0, 429: 2.0, 498: 2.0}))]

As per Spark's sparse vector representation, tf_features stands for (vocab_size, {term_id: term_freq, ...}).

I then ran the following initial code:
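To make the sparse format concrete, here is a minimal pure-Python sketch (not Spark's own implementation) that expands the second row above into a dense list. The helper name sparse_to_dense is made up for illustration.

```python
def sparse_to_dense(size, entries):
    """Expand a (size, {index: value}) pair into a dense list of floats."""
    dense = [0.0] * size
    for index, value in entries.items():
        dense[index] = value
    return dense

# Second row of the sample data: vocabulary of 1184 terms, 5 distinct terms present.
doc = sparse_to_dense(1184, {15: 2.0, 186: 2.0, 387: 2.0, 429: 2.0, 498: 2.0})

print(len(doc))   # length equals the vocabulary size
print(doc[15])    # frequency recorded for term id 15
print(sum(doc))   # total term count recorded for this document
```

Every position not listed in the dictionary is an implicit zero, which is why the sparse form is so much shorter than the dense one.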

from pyspark.ml.clustering import LDA
lda = LDA(featuresCol="tf_features",k=10, seed=1, optimizer="online")
ldaModel=lda.fit(tf_df)

lda_df=ldaModel.transform(tf_df)

First I inspect the resulting transformed data frame.

lda_df.take(3)
Out[73]:
[Row(Id=u'39', tf_features=SparseVector(1184, {89: 1.0, 98: 2.0, 108: 1.0, 168: 3.0, 210: 1.0, 231: 1.0, 255: 1.0, 290: 1.0, 339: 1.0, 430: 1.0, 552: 1.0, 817: 1.0, 832: 1.0, 836: 1.0, 937: 1.0, 999: 1.0, 1157: 1.0}), topicDistribution=DenseVector([0.0049, 0.0045, 0.0041, 0.0048, 0.9612, 0.004, 0.004, 0.0041, 0.0041, 0.0042])),
 Row(Id=u'7666', tf_features=SparseVector(1184, {15: 2.0, 186: 2.0, 387: 2.0, 429: 2.0, 498: 2.0}), topicDistribution=DenseVector([0.0094, 0.1973, 0.0079, 0.0092, 0.0082, 0.0077, 0.7365, 0.0078, 0.0079, 0.008])),
 Row(Id=u'44', tf_features=SparseVector(1184, {2: 1.0, 9: 1.0, 122: 1.0, 444: 1.0, 520: 1.0, 748: 1.0}), topicDistribution=DenseVector([0.0149, 0.8831, 0.0124, 0.0146, 0.013, 0.0122, 0.0122, 0.0124, 0.0125, 0.0127]))]

My understanding is that the topicDistribution column holds the weight of each topic in that row's document, i.e. the distribution of topics over a document. Makes sense.

Now I inspect the two methods of the LDA model.
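As a quick sanity check of that reading, the sketch below takes the topicDistribution printed for the document with Id 39, confirms that the weights sum to roughly 1, and picks the dominant topic. It is plain Python and does not touch Spark.

```python
# topicDistribution of the document with Id 39, copied from the output above.
topic_distribution = [0.0049, 0.0045, 0.0041, 0.0048, 0.9612,
                      0.004, 0.004, 0.0041, 0.0041, 0.0042]

# The weights form a probability distribution, so they sum to roughly 1
# (the printed values are rounded to four decimals).
total = sum(topic_distribution)

# Index of the topic with the largest weight for this document.
dominant_topic = max(range(len(topic_distribution)),
                     key=lambda k: topic_distribution[k])

print(round(total, 4), dominant_topic)
```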

ldaModel.describeTopics().show(2,truncate=False)
+-----+---------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|topic|termIndices                            |termWeights                                                                                                                                                                                                               |
+-----+---------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0    |[0, 39, 68, 43, 50, 59, 49, 84, 2, 116]|[0.06362107696025378, 0.012284342954240298, 0.012104887652365797, 0.01066583226047289, 0.01022196994114675, 0.008836060842769776, 0.007638318779273158, 0.006478523079841644, 0.006421040016045976, 0.0057849412030562125]|
|1    |[3, 1, 8, 6, 4, 11, 14, 7, 9, 2]       |[0.03164821806301453, 0.031039573066565747, 0.018856890552836778, 0.017520190459705844, 0.017243870770548828, 0.01717645631844006, 0.017147930104624565, 0.01706912474813669, 0.016946362395557312, 0.016722361546119266] |
+-----+---------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 2 rows

This seems to show the distribution of terms in each topic, by term id. It shows ten terms (the number can be changed through a method parameter). Again, makes sense.

The second method is below:

In [82]:

ldaModel.topicsMatrix()
Out[82]:
DenseMatrix(1184, 10, [132.7645, 3.0036, 13.3994, 3.6061, 9.3199, 2.4725, 9.3927, 3.4243, ..., 0.5774, 0.8335, 0.49, 0.6366, 0.546, 0.8509, 0.5081, 0.6627], 0)

According to the docs, topicsMatrix is a matrix of topics and their terms, where topics are columns and the terms in those topics are rows. Its size would be vocab_size x k (the number of topics).

I don't see that here and am not sure what this output means.

Secondly, how do I map these term ids back to the actual words? In the end I want a list of topics (as columns or rows, whichever) with the top 10-15 terms in each, so that I can interpret the topics from the kind of words they contain. Here I just have ids and no words.

Any ideas on these two?

Edit II:

When I simply do topics[0][1], I get an error, as mentioned in the comment below.

So I converted it to a numpy array like this:

topics.toArray()

It looks like this:

array([[ 132.76450545,    2.26966742,    0.73646762,    7.35362275,
           0.57789645,    0.58248036,    0.65876465,    0.6695292 ,
           0.70034004,    0.63875301],
       [   3.00362754,   68.80842798,    0.48662529,  100.31770907,
           0.57867623,    0.5357196 ,    0.58895636,    0.83408602,
           0.53400242,    0.56291545],
       [  13.39943055,   37.070078

This is a 1184 x 10 array, so I assume it is a matrix of topics with their distribution over words.

If that is the case, the entries should be probabilities, but here we see numbers greater than 1, such as 132.76. What is this, then?

Baktaawar asked Feb 15 '17 20:02



1 Answer

The method topicsMatrix() returns a DenseMatrix object.

What you're seeing as output is a representation of that object. Its "attributes" are:

numRows, numCols, values, isTransposed=False

So, from the output you got, you can identify these attributes as:

  • numRows: the vocabulary size (1184 in your case).
  • numCols: the number of topics (10 in your case).
  • values: the matrix elements, stored as a flat array inside the DenseMatrix.
  • isTransposed: whether the matrix is transposed (0, i.e. False in your case).
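The flat values array is laid out column by column when isTransposed is false, which matches the question's output: the first values (132.7645, 3.0036, 13.3994) run down the first column of the array returned by toArray(). The sketch below is a plain-Python illustration of that indexing, not Spark's code. The helper name dense_matrix_entry is hypothetical, and the toy 3 x 2 matrix is the rounded top-left corner of the array shown in the question.

```python
def dense_matrix_entry(num_rows, num_cols, values, row, col, is_transposed=False):
    """Look up entry (row, col) in a flat values list.

    Column-major when is_transposed is False, row-major otherwise.
    """
    if is_transposed:
        return values[row * num_cols + col]
    return values[col * num_rows + row]

# Tiny 3 x 2 example (3 terms, 2 topics), stored column by column.
values = [132.76, 3.00, 13.40,   # column 0: topic 0
          2.27, 68.81, 37.07]    # column 1: topic 1

print(dense_matrix_entry(3, 2, values, row=1, col=0))  # second term, first topic
print(dense_matrix_entry(3, 2, values, row=0, col=1))  # first term, second topic
```

Reading the matrix this way, each column is one topic and each row is one term id.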

So, the important thing here is how to get a proper representation of that DenseMatrix.

In the PySpark guide I found an example that will be useful to you (adapted here, since a DenseMatrix cannot be indexed as topics[word][topic]):

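# topicsMatrix() returns a DenseMatrix here; convert it to a numpy array so it can be indexed.
topics = ldaModel.topicsMatrix().toArray()
num_terms, num_topics = topics.shape
for topic in range(num_topics):
    print("Topic " + str(topic) + ":")
    for word in range(num_terms):
        # Row = term id, column = topic.
        print(" " + str(topics[word, topic]))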
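On the values greater than 1 seen in the question's array: as far as I understand it, topicsMatrix() holds unnormalized topic-term weights, while describeTopics() reports weights normalized per topic. Treat that as an assumption to check against your Spark version. Under it, dividing each column by its sum gives a probability distribution over terms. A sketch on a plain nested list (the shape toArray().tolist() would give), using hypothetical toy numbers:

```python
def normalize_columns(matrix):
    """Scale every column of a row-major nested list so that it sums to 1."""
    num_rows, num_cols = len(matrix), len(matrix[0])
    column_sums = [sum(matrix[r][c] for r in range(num_rows))
                   for c in range(num_cols)]
    return [[matrix[r][c] / column_sums[c] for c in range(num_cols)]
            for r in range(num_rows)]

# Toy 3-term x 2-topic matrix of raw weights (hypothetical values).
raw = [[6.0, 1.0],
       [3.0, 1.0],
       [1.0, 2.0]]
probs = normalize_columns(raw)

for topic in range(2):
    column = [row[topic] for row in probs]
    print(topic, column, sum(column))
```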

According to the documentation of DenseMatrix, you could try these methods if you want a more useful representation:

  • asML() (defined on the older pyspark.mllib.linalg DenseMatrix, to convert it to the pyspark.ml.linalg type)
  • toArray() to return a numpy.ndarray
  • toSparse()
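To turn term ids back into words, the vocabulary of the fitted CountVectorizerModel can be used: its vocabulary attribute is a list in which position i holds the word for term id i. The sketch below assumes you kept that fitted model (called cv_model here, a hypothetical name) and uses a made-up vocabulary and made-up topic row so that it runs without Spark.

```python
def terms_for_topic(term_indices, term_weights, vocabulary):
    """Pair each term id of a topic with its word and weight."""
    return [(vocabulary[i], w) for i, w in zip(term_indices, term_weights)]

# In PySpark this would be: vocabulary = cv_model.vocabulary
# (cv_model being the fitted CountVectorizerModel). Made-up words here.
vocabulary = ["spark", "data", "model", "topic", "word"]

# One row shaped like describeTopics() output: term ids and their weights.
term_indices = [0, 3, 2]
term_weights = [0.0636, 0.0123, 0.0121]

top_terms = terms_for_topic(term_indices, term_weights, vocabulary)
for word, weight in top_terms:
    print(word, weight)
```

Applying the same helper to each row collected from describeTopics() would give the list of top words per topic that the question asks for.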
Nicolás Ozimica answered Sep 29 '22 07:09
