Implementing a custom agglomerative algorithm from scratch

I know about agglomerative clustering algorithms: they start with each data point as an individual cluster and then successively merge the closest clusters.

Now, I have an n-dimensional space and several data points that have values across each of these dimensions. I want to merge two points/clusters based on business rules like:

  • Merge two clusters c1 and c2 if the distance between them across dimension 1 is < T1, the distance across dimension 2 is < T2, ..., and the distance across dimension n is < Tn.
  • If the rule across dimension 1 is met and the rule across dimension 2 is met, then merge them without considering the other dimensions...

.... and similar custom rules.
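The merge rules above could be sketched as predicates that a clusterer consults before linking two clusters. A minimal illustration in plain Java (the class and method names are hypothetical, not part of any library):

```java
/** Hypothetical merge rules over per-dimension distances. */
public class RuleBasedMerge {

    /** Rule 1: merge only if dist[i] < threshold[i] for every dimension i. */
    public static boolean withinAllThresholds(double[] dist, double[] threshold) {
        for (int i = 0; i < dist.length; i++) {
            if (dist[i] >= threshold[i]) {
                return false;
            }
        }
        return true;
    }

    /** Rule 2: dimensions 1 and 2 (indices 0 and 1) alone decide the merge. */
    public static boolean firstTwoSuffice(double[] dist, double[] threshold) {
        return dist[0] < threshold[0] && dist[1] < threshold[1];
    }
}
```

An agglomerative loop would then merge the closest pair of clusters whose pair of distance vectors satisfies at least one of these predicates, and stop when no pair does.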

In addition, I have my own way of defining and measuring the distance between any two clusters along any particular dimension. One dimension may hold just strings, and I want to define my own string distance metric for it. Another dimension may hold names of locations, and the distance between two points along that dimension is the geographical distance between the locations named, and so on for the other dimensions as well.
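For concreteness, here is a self-contained sketch of two such per-dimension metrics: a classic Levenshtein edit distance for a string dimension, and a haversine great-circle distance for a location dimension (assuming locations have already been geocoded to latitude/longitude). These are just illustrative stand-ins for whatever metrics you define:

```java
/** Illustrative per-dimension metrics for a string and a location dimension. */
public class DimensionMetrics {

    /** Levenshtein edit distance between two strings (dynamic programming). */
    public static int levenshtein(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    /** Haversine distance in kilometres between two lat/lon points. */
    public static double haversineKm(double lat1, double lon1,
                                     double lat2, double lon2) {
        double r = 6371.0; // mean Earth radius in km
        double dLat = Math.toRadians(lat2 - lat1);
        double dLon = Math.toRadians(lon2 - lon1);
        double h = Math.pow(Math.sin(dLat / 2), 2)
                 + Math.cos(Math.toRadians(lat1)) * Math.cos(Math.toRadians(lat2))
                 * Math.pow(Math.sin(dLon / 2), 2);
        return 2 * r * Math.asin(Math.sqrt(h));
    }
}
```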

Is there a framework/software that lets me define custom distance metrics in this way and then run agglomerative clustering with them? Of course, the agglomerative clustering stops whenever the business rules are no longer met, and at the end we have clusters formed in the n-dimensional space.

Thanks Abhishek S

London guy asked May 27 '12

2 Answers

You could do it with Weka.

You would have to implement a DistanceFunction and pass it to the HierarchicalClusterer via its setDistanceFunction(DistanceFunction distanceFunction) method.

The other available clusterers in Weka are: Cobweb, EM, FarthestFirst, FilteredClusterer, MakeDensityBasedClusterer, RandomizableClusterer, RandomizableDensityBasedClusterer, RandomizableSingleClustererEnhancer, SimpleKMeans, SingleClustererEnhancer.

An example distance function, from the NormalizableDistance class:

  /** Index in ranges for MIN. */
  public static final int R_MIN = 0;

  /** Index in ranges for MAX. */
  public static final int R_MAX = 1;

  /** Index in ranges for WIDTH. */
  public static final int R_WIDTH = 2;

  /** the instances used internally. */
  protected Instances m_Data = null;

  /** True if normalization is turned off (default false).*/
  protected boolean m_DontNormalize = false;

  /** The range of the attributes. */
  protected double[][] m_Ranges;

  /** The range of attributes to use for calculating the distance. */
  protected Range m_AttributeIndices = new Range("first-last");

  /** The boolean flags, whether an attribute will be used or not. */
  protected boolean[] m_ActiveIndices;

  /** Whether all the necessary preparations have been done. */
  protected boolean m_Validated;


  public double distance(Instance first, Instance second, double cutOffValue, PerformanceStats stats) {
    double distance = 0;
    int firstI, secondI;
    int firstNumValues = first.numValues();
    int secondNumValues = second.numValues();
    int numAttributes = m_Data.numAttributes();
    int classIndex = m_Data.classIndex();

    validate();

    for (int p1 = 0, p2 = 0; p1 < firstNumValues || p2 < secondNumValues; ) {
      if (p1 >= firstNumValues)
        firstI = numAttributes;
      else
        firstI = first.index(p1); 

      if (p2 >= secondNumValues)
        secondI = numAttributes;
      else
        secondI = second.index(p2);

      if (firstI == classIndex) {
        p1++; 
        continue;
      }
      if ((firstI < numAttributes) && !m_ActiveIndices[firstI]) {
        p1++; 
        continue;
      }

      if (secondI == classIndex) {
        p2++; 
        continue;
      }
      if ((secondI < numAttributes) && !m_ActiveIndices[secondI]) {
        p2++;
        continue;
      }

      double diff;

      if (firstI == secondI) {
        diff = difference(firstI,
                  first.valueSparse(p1),
                  second.valueSparse(p2));
        p1++;
        p2++;
      }
      else if (firstI > secondI) {
        diff = difference(secondI, 
                  0, second.valueSparse(p2));
        p2++;
      }
      else {
        diff = difference(firstI, 
                  first.valueSparse(p1), 0);
        p1++;
      }
      if (stats != null)
        stats.incrCoordCount();

      distance = updateDistance(distance, diff);
      if (distance > cutOffValue)
        return Double.POSITIVE_INFINITY;
    }

    return distance;
  }

This shows that you can treat the various dimensions (called attributes in Weka) separately, so you can define a different distance for each dimension/attribute.

As for the business rules that should prevent clustering certain instances together: I think you can create a distance function that returns Double.POSITIVE_INFINITY when the business rules are not satisfied.
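The idea can be sketched in plain Java as follows (names are hypothetical; in Weka this logic would live inside your DistanceFunction implementation): aggregate the per-dimension distances as usual, but return positive infinity as soon as any rule is violated, so the clusterer never merges such a pair.

```java
/** Sketch: a combined distance that is infinite whenever a rule fails. */
public class RuleGatedDistance {

    /** Per-dimension thresholds, assumed given by the business rules. */
    private final double[] thresholds;

    public RuleGatedDistance(double[] thresholds) {
        this.thresholds = thresholds;
    }

    /** Sum of per-dimension distances, or +Infinity if any rule fails. */
    public double distance(double[] perDimensionDistances) {
        double total = 0;
        for (int i = 0; i < perDimensionDistances.length; i++) {
            if (perDimensionDistances[i] >= thresholds[i]) {
                return Double.POSITIVE_INFINITY; // rule violated: never merge
            }
            total += perDimensionDistances[i];
        }
        return total;
    }
}
```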

Vitaly Olegovitch answered Sep 27 '22

ELKI is another option. It has many more clustering algorithms than Weka (which is mostly geared towards classification). They even have a wiki tutorial explaining how to implement custom distance functions (which you should then be able to use in hierarchical clustering): distance function tutorial.

Note that "business rules" is not a very common way to specify a distance function...

Has QUIT--Anony-Mousse answered Sep 27 '22