
How to perform k-means clustering in Mahout with vector data stored as CSV?

Tags: k-means, mahout

I have a file containing vectors of data, where each row is a comma-separated list of values. How can I perform k-means clustering on this data using Mahout? The example in the wiki mentions creating SequenceFiles, but I am not sure whether I need to convert my CSV data in some way to obtain them.

Dan Q asked Jan 09 '12 08:01

1 Answer

I would recommend reading the entries from the CSV file yourself, creating a NamedVector from each row, and then using a SequenceFile.Writer to write the vectors to a sequence file. From there on, the KMeansDriver run method should know how to handle these files.

Sequence files store key-value pairs. Here the key is an ID for the sample (it should be a string, written as a Text), and the value is a VectorWritable wrapper around the vector.

Here is a simple code sample showing how to do this:

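    import java.util.LinkedList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.VectorWritable;

    // One NamedVector per CSV row; the name doubles as the sample ID
    List<NamedVector> vectors = new LinkedList<NamedVector>();
    NamedVector v1 = new NamedVector(new DenseVector(new double[] {0.1, 0.2, 0.5}), "Item number one");
    vectors.add(v1);

    Configuration config = new Configuration();
    FileSystem fs = FileSystem.get(config);

    Path path = new Path("datasamples/data");

    // Write the vectors to a SequenceFile of (Text, VectorWritable) pairs
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, path, Text.class, VectorWritable.class);
    VectorWritable vec = new VectorWritable();
    for (NamedVector v : vectors) {
        vec.set(v);
        // Append the VectorWritable wrapper, not the raw NamedVector
        writer.append(new Text(v.getName()), vec);
    }
    writer.close();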
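
The sample hard-codes a single vector. For the "reading the entries from the CSV file" step, a minimal sketch using only the Java standard library might look like the following. The class name CsvRows and its two methods are hypothetical helpers, not part of Mahout or Hadoop, and the sketch assumes every non-blank line holds only plain numbers separated by commas (no header row, no quoted fields).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Mahout class): turns CSV text into double[] rows.
final class CsvRows {

    // Parses one comma-separated line of numbers, e.g. "0.1, 0.2, 0.5".
    // Assumes no quoted fields; a non-numeric field throws NumberFormatException.
    static double[] parseRow(String line) {
        String[] fields = line.split(",");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i].trim());
        }
        return values;
    }

    // Reads every non-blank line from the reader and parses it as one row.
    static List<double[]> readRows(Reader source) throws IOException {
        List<double[]> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    rows.add(parseRow(line));
                }
            }
        }
        return rows;
    }
}
```

Each returned row could then be wrapped as new NamedVector(new DenseVector(row), id), with a string ID of your choosing such as the row number, and appended to the sequence file as in the sample.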

Also, I would recommend reading chapter 8 of Mahout in Action. It gives more details on data representation in Mahout.

Bojana Popovska answered Nov 15 '22 15:11