
How to calculate the nearest neighbors using weka from the command line?

I have a csv file, where each row is a vector of numbers representing a data point. I want to use weka from the command line to calculate the nearest neighbor of each data point in the csv file. I know how to do k nearest neighbor classification from the command line, but that's not what I want. I want the actual neighbors. How do I do this?

I want to do this using weka and not some other tool.

Mike Izbicki asked Jul 10 '15 21:07


1 Answer

Weka doesn't have a one-liner for what I think you are asking (ingest a file, convert it to instances, and then find the N nearest neighbors of each instance), but you can set up a command-line-style one-liner by combining Weka with a few lines of Java, as follows.

Compile the following code. I used Eclipse, but you can just as easily use javac at the command line; just make sure that weka.jar is on your classpath. After the code, I show an example of how to call it as a one-liner from the command line.

import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.neighboursearch.LinearNNSearch;

public class WekaCLFindNN {
    public static void main(String[] args) throws Exception {
        // expect two arguments: the number of neighbors and the data file
        if (args.length < 2) {
            System.err.println("Usage: java WekaCLFindNN numNN csvfile");
            System.exit(1);
        }
        int numNN = Integer.parseInt(args[0]);

        // report that the code is running
        System.out.println("Weka Command Line Find Nearest " + numNN + " Neighbors for each Instance in " + args[1]);

        // set up the data source, load the instances, and build the search
        DataSource source = new DataSource(args[1]);
        Instances instances = source.getDataSet();
        LinearNNSearch knn = new LinearNNSearch(instances);

        // cycle through the dataset and look up the nearest neighbors of each instance
        for (int j = 0; j < instances.numInstances(); j++) {
            Instances nearestInstances = knn.kNearestNeighbours(instances.instance(j), numNN);

            // print the instance, then its neighbors indented below it;
            // loop over what was actually returned, which may not be exactly numNN
            System.out.println("\n\n" + instances.instance(j));
            for (int i = 0; i < nearestInstances.numInstances(); i++) {
                System.out.println("\n\t" + nearestInstances.instance(i));
            }
        }

        // report completion
        System.out.println("\nNearest Neighbors found");
    }
}

Now just run it from the command line using the following command.

java -cp weka.jar;. WekaCLFindNN numNN csvfile

Here is a screenshot of it working on my machine. Note that weka.jar and the compiled WekaCLFindNN class are both in the directory I run java from. Also note that I am running this under Windows, where the classpath separator is a semicolon (;); under Linux you would have to use a colon (:).

[Screenshot: weka working from command line]

You can ignore the part about the database driver; that's just Weka writing something to stderr. As you can see, each vector is left-aligned with its nearest neighbors listed beneath it, just as you asked.
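On the classpath separator mentioned above: Java exposes the platform's separator as File.pathSeparator, so a small helper can print the right command for whichever system you are on. This is only a sketch; the class and method names (ClasspathSeparatorSketch, joinClasspath) are made up for illustration and are not part of Weka.

```java
import java.io.File;

// Illustrative only: builds the -cp value from the answer using the
// separator of whatever platform the JVM is running on.
final class ClasspathSeparatorSketch {

    // Joins entries with File.pathSeparator (";" on Windows, ":" on Linux/macOS).
    static String joinClasspath(String... entries) {
        return String.join(File.pathSeparator, entries);
    }

    public static void main(String[] args) {
        String cp = joinClasspath("weka.jar", ".");
        System.out.println("java -cp " + cp + " WekaCLFindNN numNN csvfile");
    }
}
```

Running it prints the java command with the separator that matches the machine it runs on.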

If you want the data in a log file, just execute it this way:

java -cp weka.jar;. WekaCLFindNN numNN csvfile > outputlog

The log file will look like this; notice that it doesn't contain the message about the database, because > redirects only stdout:

[Screenshot: outputlog]
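The reason the database message is missing from the log is that the program's results go to stdout while Weka's message goes to stderr, and only stdout is being captured. A minimal, self-contained sketch of that split, using a made-up class (StreamSplitSketch) and an in-memory buffer standing in for the log file:

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;

// Illustrative only: shows that text written to System.err does not end up
// in whatever is capturing System.out (here a buffer, on the shell a file).
final class StreamSplitSketch {

    // Writes one line to each stream, mimicking a result and a warning.
    static void report() {
        System.out.println("result line");
        System.err.println("warning line");
    }

    // Runs report() with System.out temporarily pointed at a buffer and
    // returns what the buffer received.
    static String captureStdout() {
        PrintStream original = System.out;
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        System.setOut(new PrintStream(buffer, true));
        try {
            report();
        } finally {
            System.setOut(original);
        }
        return buffer.toString();
    }

    public static void main(String[] args) {
        System.out.print("captured: " + captureStdout());
    }
}
```

If you do want the stderr messages in the same file, appending 2>&1 after the redirection should capture both streams in the Windows command prompt as well as in common Linux shells.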

While it would be nice to have both the nearest neighbors and their indices in the original dataset, I checked the kNearestNeighbours method and found that the index data is thrown away right before the result is returned. If you want it, then you are going to have to extend the LinearNNSearch class and write a new method that outputs both the instances and the indices.
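To show what such an index-preserving method would need to compute, here is a plain-Java sketch of a brute-force search that returns row indices instead of the rows themselves. It is not Weka code and does not subclass LinearNNSearch; the class NeighborIndexSketch, its methods, and the sample rows are all hypothetical. It uses raw Euclidean distance, whereas, as far as I know, Weka's default distance function normalizes attribute ranges, so its ordering may differ from Weka's on unscaled data.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical helper, not part of Weka: brute-force k-nearest-neighbor
// search over plain double[] rows that keeps each neighbor's row index.
final class NeighborIndexSketch {

    // Squared Euclidean distance; it orders neighbors the same way as the
    // true distance, so the square root can be skipped.
    static double squaredDistance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return sum;
    }

    // Returns the row indices of the k rows closest to rows[target],
    // nearest first, excluding the target row itself.
    static int[] nearestIndices(double[][] rows, int target, int k) {
        Integer[] candidates = new Integer[rows.length - 1];
        int next = 0;
        for (int i = 0; i < rows.length; i++) {
            if (i != target) {
                candidates[next++] = i;
            }
        }
        // Stable sort by distance to the target; ties keep index order.
        Arrays.sort(candidates, Comparator.comparingDouble(
                (Integer i) -> squaredDistance(rows[target], rows[i])));
        int count = Math.min(k, candidates.length);
        int[] result = new int[count];
        for (int i = 0; i < count; i++) {
            result[i] = candidates[i];
        }
        return result;
    }

    public static void main(String[] args) {
        // Made-up sample data, one row per data point.
        double[][] rows = { {0.0, 0.0}, {1.0, 0.0}, {5.0, 5.0}, {0.0, 2.0} };
        for (int j = 0; j < rows.length; j++) {
            System.out.println(j + " -> " + Arrays.toString(nearestIndices(rows, j, 2)));
        }
    }
}
```

Sorting all candidates for every row is fine for small files; for large datasets a bounded heap per query, which is roughly what a linear search needs, would avoid the full sort.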

So I hope this helps. It's unfortunate that Weka doesn't offer this out of the box, but you can do it in just a few lines of code.

Semicolons and Duct Tape answered Oct 05 '22 18:10