I am using Weka in Scala (although the syntax is virtually identical to Java). I am trying to evaluate my data with a SimpleKMeans clusterer, but the clusterer won't accept string data. I don't want to cluster on the string data; I just want to use it to label the points.
Here is the data I am using:
@relation Locations
@attribute ID string
@attribute Latitude numeric
@attribute Longitude numeric
@data
'Carnegie Mellon University', 40.443064, -79.944163
'Stanford University', 37.427539, -122.170169
'Massachusetts Institute of Technology', 42.358866, -71.093823
'University of California Berkeley', 37.872166, -122.259444
'University of Washington', 47.65601, -122.30934
'University of Illinois Urbana Champaign', 40.091022, -88.229992
'University of Southern California', 34.019372, -118.28611
'University of California San Diego', 32.881494, -117.243079
As you can see, it's essentially a collection of points on an x and y coordinate plane. The value of any patterns is negligible; this is simply an exercise in working with Weka.
Here is the code that is giving me trouble:
val instance = new Instances(new StringReader(wekaHeader + wekaData))
val simpleKMeans = new SimpleKMeans()
simpleKMeans.buildClusterer(instance)
val eval = new ClusterEvaluation()
eval.setClusterer(simpleKMeans)
eval.evaluateClusterer(new Instances(instance))
Logger.info(eval.clusterResultsToString)
I get the following error on simpleKMeans.buildClusterer(instance):
[UnsupportedAttributeTypeException: weka.clusterers.SimpleKMeans: Cannot handle string attributes!]
How do I get Weka to retain IDs while doing clustering?
Here are a couple of other steps I have taken to troubleshoot this:
I used the Weka Explorer and loaded this data as a CSV:
ID, Latitude, Longitude
'Carnegie Mellon University', 40.443064, -79.944163
'Stanford University', 37.427539, -122.170169
'Massachusetts Institute of Technology', 42.358866, -71.093823
'University of California Berkeley', 37.872166, -122.259444
'University of Washington', 47.65601, -122.30934
'University of Illinois Urbana Champaign', 40.091022, -88.229992
'University of Southern California', 34.019372, -118.28611
'University of California San Diego', 32.881494, -117.243079
This does what I want it to do in the Weka Explorer. Weka clusters the points and retains the ID column to identify each point. I would do this in my code, but I'm trying to do this without generating additional files. As you can see from the Weka Java API, Instances interprets a java.io.Reader only as an ARFF.
I have also tried the following code:
val instance = new Instances(new StringReader(wekaHeader + wekaData))
instance.deleteAttributeAt(0)
val simpleKMeans = new SimpleKMeans()
simpleKMeans.buildClusterer(instance)
val eval = new ClusterEvaluation()
eval.setClusterer(simpleKMeans)
eval.evaluateClusterer(new Instances(instance))
Logger.info(eval.clusterResultsToString)
This works in my code and displays results. That proves that Weka is working in general, but since I am deleting the ID attribute, I can't really map the clustered points back to the original values.
I am answering my own question, and in doing so, there are two issues that I would like to address:
As Sentry points out in the comments, the ID does in fact get converted to a nominal attribute when loaded from a CSV.
If the data must be in ARFF format (as in my example, where the Instances object is created from a StringReader), then the StringToNominal filter can be applied:
val instances = new Instances(new StringReader(wekaHeader + wekaData))
val filter = new StringToNominal()
filter.setAttributeRange("first")
filter.setInputFormat(instances)
val filteredInstance = Filter.useFilter(instances, filter)
val simpleKMeans = new SimpleKMeans()
// Cluster on the filtered copy, where the ID is nominal rather than string
simpleKMeans.buildClusterer(filteredInstance)
...
This allows for "string" values to be used in clustering, although it's really just treated as a nominal value. It doesn't impact the clustering (if the ID is unique), but it doesn't contribute to the evaluation as I had hoped, which brings me to the next issue.
I was hoping to be able to get a nice map of cluster and data, like cluster: Int -> Array[(ID, latitude, longitude)] or ID -> cluster: Int. However, the cluster results are not that convenient. In my experience these past few days, there are two approaches that can be used to find the cluster of each point of data.
To get the cluster assignments, simpleKMeans.getAssignments returns an array of integers holding the cluster assignment for each data element. Note that getAssignments only works if simpleKMeans.setPreserveInstancesOrder(true) was called before buildClusterer; otherwise it throws an exception. The array is in the same order as the original data items and has to be manually related back to them. This is easy in Scala: use the zip method on the original list of data items, then methods like groupBy or map to get the collection in your favorite format. Keep in mind that this approach does not use the ID attribute at all, so the ID attribute could be omitted from the clustered data entirely.
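For illustration, here is a minimal sketch of that zip/groupBy step in plain Scala. The Location case class and the ClusterMapping object are hypothetical names of my own, not Weka types, and the assignments are passed in as a plain Seq[Int] so the snippet runs without Weka on the classpath. It assumes the assignments are in the same order as the rows.

```scala
// Hypothetical stand-in for one row of the data set (not a Weka type).
final case class Location(id: String, latitude: Double, longitude: Double)

// Hypothetical helper object; the names are illustrative only.
object ClusterMapping {

  // ID -> cluster: pair each row with the assignment at the same position.
  def idToCluster(rows: Seq[Location], assignments: Seq[Int]): Map[String, Int] = {
    require(rows.length == assignments.length, "expected one assignment per row")
    rows.zip(assignments).map { case (row, cluster) => row.id -> cluster }.toMap
  }

  // cluster -> rows: zip first, then group the pairs by cluster index.
  def clusterToRows(rows: Seq[Location], assignments: Seq[Int]): Map[Int, Seq[Location]] = {
    require(rows.length == assignments.length, "expected one assignment per row")
    rows.zip(assignments)
      .groupBy { case (_, cluster) => cluster }
      .map { case (cluster, pairs) => cluster -> pairs.map(_._1) }
  }
}
```

With Weka, the assignments argument would presumably be simpleKMeans.getAssignments.toSeq, and the rows would be whatever list you built the ARFF data from, in the same order.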
However, you can also get the cluster centers with simpleKMeans.getClusterCentroids or eval.clusterResultsToString(). I have not used this very much, but it does seem to me that the ID attribute can be recovered from the cluster centers here. As far as I can tell, this is the only situation in which the ID data can be utilized or recovered from the cluster evaluation.