Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

mahout lucene document clustering howto?

I'm reading that i can create mahout vectors from a lucene index that can be used to apply the mahout clustering algorithms. http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text

I would like to apply K-means clustering algorithm in the documents in my Lucene index, but it is not clear how can i apply this algorithm (or hierarchical clustering) to extract meaningful clusters with these documents.

In this page http://cwiki.apache.org/confluence/display/MAHOUT/k-Means says that the algorithm accepts two input directories: one for the data points and one for the initial clusters. My data points are the documents? How can i "declare" that these are my documents (or their vectors) , simply take them and do the clustering?

sorry in advance for my poor grammar

Thank you

like image 462
maiky Avatar asked Dec 04 '09 10:12

maiky


2 Answers

If you have vectors, you can run KMeansDriver. Here is the help for the same.

Usage:
 [--input <input> --clusters <clusters> --output <output> --distance <distance>
--convergence <convergence> --max <max> --numReduce <numReduce> --k <k>
--vectorClass <vectorClass> --overwrite --help]
Options
  --input (-i) input                The Path for input Vectors. Must be a
                                    SequenceFile of Writable, Vector
  --clusters (-c) clusters          The input centroids, as Vectors.  Must be a
                                    SequenceFile of Writable, Cluster/Canopy.
                                    If k is also specified, then a random set
                                    of vectors will be selected and written out
                                    to this path first
  --output (-o) output              The Path to put the output in
  --distance (-m) distance          The Distance Measure to use.  Default is
                                    SquaredEuclidean
  --convergence (-d) convergence    The threshold below which the clusters are
                                    considered to be converged.  Default is 0.5
  --max (-x) max                    The maximum number of iterations to
                                    perform.  Default is 20
  --numReduce (-r) numReduce        The number of reduce tasks
  --k (-k) k                        The k in k-Means.  If specified, then a
                                    random selection of k Vectors will be
                                    chosen as the Centroid and written to the
                                    clusters output path.
  --vectorClass (-v) vectorClass    The Vector implementation class name.
                                    Default is SparseVector.class
  --overwrite (-w)                  If set, overwrite the output directory
  --help (-h)                       Print out help

Update: Get the result directory from HDFS to local fs. Then use ClusterDumper utility to get the cluster and list of documents in that cluster.

like image 91
Shashikant Kore Avatar answered Nov 07 '22 03:11

Shashikant Kore


A pretty good howto is here: integrating apache mahout with apache lucene

like image 40
Chris J Avatar answered Nov 07 '22 03:11

Chris J