The current Mahout 0.8-SNAPSHOT includes a Collapsed Variational Bayes (cvb) version for Topic Modeling and removed the Latent Dirichlet Analysis (lda) approach, because cvb can be parallelized way better. Unfortunately there is only documentation for lda on how to run an example and generate meaningful output.
Thus, I want to:
So here are the subsequent Mahout commands I had to call in a linux shell to do it. $MAHOUT_HOME points to my mahout/bin folder.
$MAHOUT_HOME/mahout seqdirectory \
-i path/to/directory/with/texts \
-o out/sequenced
$MAHOUT_HOME/mahout seq2sparse -i out/sequenced \
-o out/sparseVectors \
--namedVector \
-wt tf
$MAHOUT_HOME/mahout rowid \
-i out/sparseVectors/tf-vectors/ \
-o out/matrix
$MAHOUT_HOME/mahout cvb0_local \
-i out/matrix/matrix \
-d out/sparseVectors/dictionary.file-0 \
-a 0.5 \
-top 4 -do out/cvb/do_out \
-to out/cvb/to_out
Inspect the output by showing the top 10 words of each topic:
$MAHOUT_HOME/mahout vectordump \
-i out/cvb/to_out \
--dictionary out/sparseVectors/dictionary.file-0 \
--dictionaryType sequencefile \
--vectorSize 10 \
-sort out/cvb/to_out
Thanks to JoKnopp for the detail commands.
If you get: Exception in thread "main" java.lang.ClassCastException: java.lang.Integer cannot be cast to java.lang.String
you need to add the command line option "maxIterations": --maxIterations (-m) maxIterations
I use -m 20 and it works
refer to: https://issues.apache.org/jira/browse/MAHOUT-1141
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With