I'm trying to use Weka to do K-Means clustering on a set of data, examining how weighting the attributes differently affects the resulting clusters.
However, when I adjust the weights of the attributes, I see no difference at all in the clustering.
//Initialize file readers
...
Instances dataSet = readDataFile(dataReader);
double[][] modifiers = readNormalizationFile(normReader, dataSet.numAttributes());
normalize(dataSet, modifiers);
SimpleKMeans kMeans = new SimpleKMeans();
kMeans.setPreserveInstancesOrder(true);
int[] clusters = null;
try
{
    System.out.println(kMeans.getSeed());
    if (distMet != 0) // distMet and k are read in the elided setup code above
        kMeans.setDistanceFunction(new ManhattanDistance(dataSet));
    kMeans.setNumClusters(k);
    kMeans.buildClusterer(dataSet);
    clusters = kMeans.getAssignments();
}
catch (Exception e)
{
    e.printStackTrace();
}
//Print clusters
The first dimension of the "modifiers" array indexes the attributes; each entry holds two values. The first is subtracted from the attribute value, and the result is then divided by the second.
The normalization goes like this:
public static void normalize(Instances dataSet, double[][] modifiers)
{
    for (int i = 0; i < dataSet.numInstances(); i++)
    {
        Instance currInst = dataSet.instance(i);
        double[] values = currInst.toDoubleArray();
        for (int j = 0; j < values.length; j++)
        {
            currInst.setValue(j, (values[j] - modifiers[j][0]) / modifiers[j][1]);
        }
    }
}
My expectation is that increasing the second modifier (the divisor) should shrink an attribute's range, reduce its influence on the distance calculation, and therefore change how clusters are assigned, but that isn't what I'm observing. My debugger shows that the correctly normalized values are being passed into the clusterer, but I find it hard to believe that Weka is at fault rather than me.
Have I used Weka's K-Means correctly, or have I left out something important?
NormalizableDistance distance measures (such as EuclideanDistance and ManhattanDistance) have an option called dontNormalize. It defaults to false, which means the distance function min-max normalizes every attribute internally before computing distances, and that undoes all of the work done in your normalize function call.
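To keep your own scaling, you can turn the internal normalization off before handing the distance function to the clusterer. A sketch, assuming Weka is on the classpath and using your existing `kMeans` and `dataSet`:

```java
import weka.core.EuclideanDistance;

// Disable the distance function's built-in min-max normalization
// so your manually scaled values are used as-is.
EuclideanDistance dist = new EuclideanDistance(dataSet);
dist.setDontNormalize(true);
kMeans.setDistanceFunction(dist);
```

The same setter exists on ManhattanDistance, since both inherit it from NormalizableDistance.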
I ran a test on a random dataset, then rescaled one attribute's values for a second trial, and the two clusterings came out identical. After setting dontNormalize to true, the clusterings differed, and so did the cluster assignments of the instances.
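You can see why this happens without Weka at all. The sketch below (hypothetical class name, plain Java) applies a modifier like yours to one attribute, then min-max normalizes both the raw and the scaled versions the way NormalizableDistance does internally; any linear rescaling is cancelled out:

```java
// Self-contained sketch (no Weka): shows that a linear rescaling of an
// attribute is cancelled when the attribute is min-max normalized
// afterwards, which NormalizableDistance does by default.
public class NormalizeDemo {

    // Min-max normalize one attribute's values to [0, 1], mimicking the
    // internal normalization of Weka's NormalizableDistance.
    static double[] minMax(double[] v) {
        double min = v[0], max = v[0];
        for (double x : v) {
            min = Math.min(min, x);
            max = Math.max(max, x);
        }
        double[] out = new double[v.length];
        for (int i = 0; i < v.length; i++) {
            out[i] = (v[i] - min) / (max - min);
        }
        return out;
    }

    public static void main(String[] args) {
        double[] raw = {10, 20, 40, 80};

        // Apply a modifier like the question's: subtract 5, divide by 100.
        double[] scaled = new double[raw.length];
        for (int i = 0; i < raw.length; i++) {
            scaled[i] = (raw[i] - 5) / 100.0;
        }

        double[] a = minMax(raw);
        double[] b = minMax(scaled);

        // Both attributes become the same after min-max normalization,
        // so a normalizing distance function cannot tell them apart.
        boolean identical = true;
        for (int i = 0; i < a.length; i++) {
            if (Math.abs(a[i] - b[i]) > 1e-9) {
                identical = false;
            }
        }
        System.out.println("identical after normalization: " + identical);
    }
}
```

Running it prints `identical after normalization: true`, which is exactly the "no difference" you observed.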
Hope this helps!