I am running Latent Dirichlet Allocation (LDA) in Spark and am trying to understand the output it gives.
Here is my sample dataset after I carried out the text-feature transformation using Tokenizer, StopWordsRemover, and CountVectorizer:
[Row(Id=u'39', tf_features=SparseVector(1184, {89: 1.0, 98: 2.0, 108: 1.0, 168: 3.0, 210: 1.0, 231: 1.0, 255: 1.0, 290: 1.0, 339: 1.0, 430: 1.0, 552: 1.0, 817: 1.0, 832: 1.0, 836: 1.0, 937: 1.0, 999: 1.0, 1157: 1.0})),
Row(Id=u'7666', tf_features=SparseVector(1184, {15: 2.0, 186: 2.0, 387: 2.0, 429: 2.0, 498: 2.0}))]
As per Spark's SparseVector representation, tf_features stands for: (vocab_size, {term_id: term_freq, ...}).
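For reference, here is a minimal sketch of the kind of pipeline that could produce such a tf_df (the raw_df input and the text/words/filtered column names are my own assumptions for illustration):
from pyspark.ml.feature import Tokenizer, StopWordsRemover, CountVectorizer

# raw_df is a hypothetical DataFrame with columns Id and text
tokenizer = Tokenizer(inputCol="text", outputCol="words")
remover = StopWordsRemover(inputCol="words", outputCol="filtered")
cv = CountVectorizer(inputCol="filtered", outputCol="tf_features")

words_df = remover.transform(tokenizer.transform(raw_df))
cv_model = cv.fit(words_df)  # keep this model around: cv_model.vocabulary maps term ids back to words
tf_df = cv_model.transform(words_df).select("Id", "tf_features")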
Now I ran the below initial code:
from pyspark.ml.clustering import LDA
lda = LDA(featuresCol="tf_features",k=10, seed=1, optimizer="online")
ldaModel=lda.fit(tf_df)
lda_df=ldaModel.transform(tf_df)
First I inspect the resulting transformed data frame.
lda_df.take(3)
Out[73]:
[Row(Id=u'39', tf_features=SparseVector(1184, {89: 1.0, 98: 2.0, 108: 1.0, 168: 3.0, 210: 1.0, 231: 1.0, 255: 1.0, 290: 1.0, 339: 1.0, 430: 1.0, 552: 1.0, 817: 1.0, 832: 1.0, 836: 1.0, 937: 1.0, 999: 1.0, 1157: 1.0}), topicDistribution=DenseVector([0.0049, 0.0045, 0.0041, 0.0048, 0.9612, 0.004, 0.004, 0.0041, 0.0041, 0.0042])),
Row(Id=u'7666', tf_features=SparseVector(1184, {15: 2.0, 186: 2.0, 387: 2.0, 429: 2.0, 498: 2.0}), topicDistribution=DenseVector([0.0094, 0.1973, 0.0079, 0.0092, 0.0082, 0.0077, 0.7365, 0.0078, 0.0079, 0.008])),
Row(Id=u'44', tf_features=SparseVector(1184, {2: 1.0, 9: 1.0, 122: 1.0, 444: 1.0, 520: 1.0, 748: 1.0}), topicDistribution=DenseVector([0.0149, 0.8831, 0.0124, 0.0146, 0.013, 0.0122, 0.0122, 0.0124, 0.0125, 0.0127]))]
My understanding is that the topicDistribution column represents the weights of each topic in that row's document, i.e. it is the distribution of topics over a document. Makes sense.
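As a quick sanity check, here is a small sketch (the udf and the dominant_topic column name are my own additions, not part of the original code) that pulls out the most probable topic for each document from topicDistribution:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# index of the largest entry in topicDistribution = most probable topic for the document
argmax_topic = udf(lambda v: int(max(range(len(v)), key=lambda i: v[i])), IntegerType())

lda_df.withColumn("dominant_topic", argmax_topic("topicDistribution")).select("Id", "dominant_topic").show()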
Now I inspect the two methods of the fitted LDA model.
ldaModel.describeTopics().show(2,truncate=False)
+-----+---------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|topic|termIndices |termWeights |
+-----+---------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0 |[0, 39, 68, 43, 50, 59, 49, 84, 2, 116]|[0.06362107696025378, 0.012284342954240298, 0.012104887652365797, 0.01066583226047289, 0.01022196994114675, 0.008836060842769776, 0.007638318779273158, 0.006478523079841644, 0.006421040016045976, 0.0057849412030562125]|
|1 |[3, 1, 8, 6, 4, 11, 14, 7, 9, 2] |[0.03164821806301453, 0.031039573066565747, 0.018856890552836778, 0.017520190459705844, 0.017243870770548828, 0.01717645631844006, 0.017147930104624565, 0.01706912474813669, 0.016946362395557312, 0.016722361546119266] |
+-----+---------------------------------------+--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
only showing top 2 rows
This seems to show the distribution of words/terms in each topic, by their term id. It shows ten terms per topic (this can be changed via the method's parameter). Again makes sense.
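For instance, to show the top 15 terms per topic instead of the default 10:
ldaModel.describeTopics(15).show(2, truncate=False)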
The second method is below:
In [82]:
ldaModel.topicsMatrix()
Out[82]:
DenseMatrix(1184, 10, [132.7645, 3.0036, 13.3994, 3.6061, 9.3199, 2.4725, 9.3927, 3.4243, ..., 0.5774, 0.8335, 0.49, 0.6366, 0.546, 0.8509, 0.5081, 0.6627], 0)
Now, as per the docs, topicsMatrix is a matrix of topics and their terms, where topics are columns and the terms of those topics are rows; its size should be vocab_size x k (number of topics).
I don't see that here and am not sure what this output means.
Secondly, how do I associate these term ids back to the actual words? In the end I want a list of topics (as columns or rows, whichever) with the top 10-15 words/terms in each, so that I can interpret the topics by looking at the kinds of words present. Here I just have some ids and no word names.
Any idea on these two?
Edit II:
When I just do topics[0][1] I get an error, as mentioned in the comment below.
So I convert it to a numpy array like below:
topics.toArray()
It looks like this:
array([[ 132.76450545, 2.26966742, 0.73646762, 7.35362275,
0.57789645, 0.58248036, 0.65876465, 0.6695292 ,
0.70034004, 0.63875301],
[ 3.00362754, 68.80842798, 0.48662529, 100.31770907,
0.57867623, 0.5357196 , 0.58895636, 0.83408602,
0.53400242, 0.56291545],
[ 13.39943055, 37.070078
This is a 1184 x 10 array, so I am assuming it is a matrix of topics with the distribution of words.
If that is the case, then the distribution should be probabilities, but here we see numbers greater than 1, like 132.76 etc. What is this then?
The method topicsMatrix() returns a DenseMatrix object. What you're seeing as output is the representation of such an object. The "attributes" of these objects are:
numRows, numCols, values, isTransposed=False
So, from the output you got, you can identify these attributes as:
numRows: the vocabulary size (1184 in your case).
numCols: the number of topics (10 in your case).
values: the matrix elements, stored as a flat array inside the DenseMatrix.
isTransposed: whether the matrix is transposed (0, i.e. False, in your case).
So, the important thing here is how to get a proper representation of that DenseMatrix.
From the pyspark guide I found an example that will be useful to you (note that a pyspark.ml DenseMatrix is indexed with a (row, column) tuple, not with chained [i][j], which is why topics[0][1] raised an error):
topics = ldaModel.topicsMatrix()
for topic in range(10):
    print("Topic " + str(topic) + ":")
    for word in range(0, ldaModel.vocabSize()):
        print(" " + str(topics[word, topic]))  # (row, column) indexing on the DenseMatrix
According to the documentation of DenseMatrix, you could try these two methods if you want to get a more useful representation:
toArray(): returns a numpy.ndarray
toSparse(): returns the SparseMatrix equivalent
(asML() exists on the pyspark.mllib DenseMatrix, for converting it to the pyspark.ml type, but the matrix returned by pyspark.ml's LDAModel is already of that type.)
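To get from term ids back to actual words, you can use the vocabulary of the fitted CountVectorizerModel that produced tf_features (here I assume you kept a reference to it as cv_model; that name is my own). A sketch that prints the top 10 words per topic from the topics matrix:
import numpy as np

vocab = cv_model.vocabulary                 # index in this list == term id in tf_features
topics = ldaModel.topicsMatrix().toArray()  # numpy array of shape (vocab_size, k)

for topic in range(10):
    # indices of the 10 largest weights in this topic's column
    top_term_ids = np.argsort(topics[:, topic])[::-1][:10]
    print("Topic " + str(topic) + ": " + ", ".join(vocab[i] for i in top_term_ids))
Alternatively, the termIndices column from describeTopics() can be mapped through the same vocabulary list. If you want per-topic probabilities rather than raw weights, you could normalize each column by its sum.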