How to perform k-means clustering in Mahout with vector data stored as CSV?

Tags: k-means, mahout

I have a file of vectors in which each row is a comma-separated list of values. How can I run k-means clustering on this data using Mahout? The example in the wiki mentions creating SequenceFiles, but I am not sure whether I need to convert my data in some way to obtain them.

466

asked Jan 09 '12 08:01

Dan Q

1 Answer

I would recommend reading the entries from the CSV file yourself, creating a NamedVector from each row, and then using a SequenceFile writer to write the vectors to a sequence file. From there, the KMeansDriver run method should be able to handle these files.

Sequence files encode key-value pairs: the key is the ID of the sample (a string, stored as Text), and the value is a VectorWritable wrapping the vector.

Here is a simple code sample showing how to do this:

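    import java.util.LinkedList;
    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.SequenceFile;
    import org.apache.hadoop.io.Text;
    import org.apache.mahout.math.DenseVector;
    import org.apache.mahout.math.NamedVector;
    import org.apache.mahout.math.VectorWritable;

    // Collect the vectors; in practice, one NamedVector per CSV row
    List<NamedVector> vectors = new LinkedList<NamedVector>();
    NamedVector v1 = new NamedVector(new DenseVector(new double[] {0.1, 0.2, 0.5}), "Item number one");
    vectors.add(v1);

    Configuration config = new Configuration();
    FileSystem fs = FileSystem.get(config);
    Path path = new Path("datasamples/data");

    // Write a SequenceFile from the vectors: key = vector name, value = VectorWritable
    SequenceFile.Writer writer = new SequenceFile.Writer(fs, config, path, Text.class, VectorWritable.class);
    VectorWritable vec = new VectorWritable();
    for (NamedVector v : vectors) {
        vec.set(v);
        // Append the VectorWritable wrapper, not the raw NamedVector
        writer.append(new Text(v.getName()), vec);
    }
    writer.close();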
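
The CSV-reading step described in the answer could be sketched in plain Java roughly as follows. This is only a minimal sketch: the class name `CsvRows`, the methods `parseLine` and `readAll`, and the `row-<n>` IDs are hypothetical, and it assumes a simple numeric CSV with no header, quoting, or missing values. Each resulting `double[]` could then be wrapped in a `DenseVector` and a `NamedVector`, with the map key used as the vector name.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper: turns CSV lines into (id, values) pairs.
final class CsvRows {

    // Parses one comma-separated line into a double[]; whitespace around fields is ignored.
    static double[] parseLine(String line) {
        String[] fields = line.split(",");
        double[] values = new double[fields.length];
        for (int i = 0; i < fields.length; i++) {
            values[i] = Double.parseDouble(fields[i].trim());
        }
        return values;
    }

    // Parses all non-blank lines, keyed by an assumed ID of the form "row-<n>".
    static Map<String, double[]> readAll(List<String> lines) {
        Map<String, double[]> rows = new LinkedHashMap<>();
        int n = 0;
        for (String line : lines) {
            if (line.trim().isEmpty()) {
                continue; // skip blank lines
            }
            rows.put("row-" + n, parseLine(line));
            n++;
        }
        return rows;
    }

    public static void main(String[] args) {
        List<String> sample = Arrays.asList("0.1,0.2,0.5", "1.0, 2.0, 3.0");
        for (Map.Entry<String, double[]> e : readAll(sample).entrySet()) {
            System.out.println(e.getKey() + " -> " + Arrays.toString(e.getValue()));
        }
    }
}
```

The input lines could come from something like `Files.readAllLines(Paths.get("data.csv"))`, where the file name is a placeholder.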

Also, I would recommend reading chapter 8 of Mahout in Action. It gives more details on data representation in Mahout.

100

answered Nov 15 '22 15:11

Bojana Popovska
