I have a CSV file where each row is a vector of numbers representing a data point. I want to use Weka from the command line to calculate the nearest neighbor of each data point in the CSV file. I know how to do k-nearest-neighbor classification from the command line, but that's not what I want. I want the actual neighbors. How do I do this?
I want to do this using Weka and not some other tool.
Weka doesn't have a one-liner for what I think you are asking (ingest a file, convert it to instances, and then find the N nearest neighbors of each instance), but you can set up a command-line one-liner by combining Weka with a few lines of Java, as follows.
Compile the following code. I used Eclipse, but you can just as easily use javac at the command line (for example, javac -cp weka.jar WekaCLFindNN.java); just make sure that weka.jar is on your classpath. After the code, I show an example of how to call it as a one-liner from the command line.
import weka.core.Instances;
import weka.core.converters.ConverterUtils.DataSource;
import weka.core.neighboursearch.LinearNNSearch;

public class WekaCLFindNN {
    public static void main(String[] args) throws Exception {
        // args[0] = number of neighbors, args[1] = path to the CSV file
        int k = Integer.parseInt(args[0]);
        System.out.println("Weka Command Line Find Nearest " + k + " Neighbors for each Instance in " + args[1]);

        // Load the file into Instances and build the linear search over them
        DataSource source = new DataSource(args[1]);
        Instances instances = source.getDataSet();
        LinearNNSearch knn = new LinearNNSearch(instances);

        // For each instance, look up and print its nearest neighbors
        for (int j = 0; j < instances.numInstances(); j++) {
            Instances nearestInstances = knn.kNearestNeighbours(instances.instance(j), k);
            System.out.println("\n\n" + instances.instance(j));
            // Loop over the returned count, which may differ from k (small datasets, distance ties)
            for (int i = 0; i < nearestInstances.numInstances(); i++) {
                System.out.println("\n\t" + nearestInstances.instance(i));
            }
        }

        System.out.println("\n" + "Nearest Neighbors found");
    }
}
Now just run it from the command line using the following command.
java -cp weka.jar;. WekaCLFindNN numNN csvfile
Here is a screenshot of it working on my machine. Note that weka.jar and the compiled WekaCLFindNN class file are both in the directory I run java from. Also note that I am running this under Windows, where the classpath separator is a semicolon (;); under Linux you would use a colon (:) instead.
You can ignore the part about the database driver; that's just Weka writing something to stderr. As you can see, each vector is left-aligned and its nearest neighbors are listed indented beneath it, just like you asked for.
If you want the data in a log file, just redirect the output like this:
java -cp weka.jar;. WekaCLFindNN numNN csvfile > outputlog
The log file will look like this. Notice that it doesn't contain the database driver message, because > redirects only stdout and that message goes to stderr:
While it would be nice to have both the nearest neighbors and their indices in the original dataset, I checked the kNearestNeighbours method and found that the index data is thrown away right before the result is returned. If you want it, then you are going to have to subclass LinearNNSearch and write a new method that outputs both the instances and the indices.
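To give an idea of what such an index-returning method has to do, here is a minimal sketch of a linear nearest-neighbor search that keeps the row indices. It is a standalone illustration on plain double arrays and does not use any Weka classes: the class name IndexedNeighbours and the method nearestIndices are made up for this example, and it uses a plain unnormalized Euclidean distance, so the distances may not match what Weka's own distance function computes.

```java
import java.util.Arrays;
import java.util.Comparator;

// Hypothetical standalone illustration: plain arrays, no Weka classes involved.
class IndexedNeighbours {

    // Plain (unnormalized) Euclidean distance between two equal-length vectors.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int d = 0; d < a.length; d++) {
            double diff = a[d] - b[d];
            sum += diff * diff;
        }
        return Math.sqrt(sum);
    }

    // Returns the row indices of the (up to) k rows closest to data[target],
    // nearest first, skipping the target row itself.
    // Assumes data has at least one row and all rows have the same length.
    static int[] nearestIndices(double[][] data, int target, int k) {
        Integer[] candidates = new Integer[data.length - 1];
        int n = 0;
        for (int i = 0; i < data.length; i++) {
            if (i != target) {
                candidates[n++] = i;
            }
        }
        // Order the candidate indices by their distance to the target row.
        Arrays.sort(candidates,
                Comparator.comparingDouble(i -> distance(data[target], data[i])));

        int count = Math.min(k, candidates.length);
        int[] result = new int[count];
        for (int i = 0; i < count; i++) {
            result[i] = candidates[i];
        }
        return result;
    }

    public static void main(String[] args) {
        double[][] data = { {0, 0}, {1, 0}, {5, 5}, {0, 2} };
        for (int row = 0; row < data.length; row++) {
            System.out.println(row + " -> " + Arrays.toString(nearestIndices(data, row, 2)));
        }
    }
}
```

A subclass of LinearNNSearch would do the same bookkeeping, but with Weka's distance function and Instances in place of the arrays used here.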
So I hope this helps. It's unfortunate that Weka doesn't offer this out of the box, but you can do it in just a few lines of code.