
Choose the best cluster partition based on a cost function

I've a string that I'd like to cluster:

s = 'AAABBCCCCC'

I don't know in advance how many clusters I'll get. All I have is a cost function that takes a clustering and gives it a score.

There is also a constraint on the cluster sizes: they must lie in a range [a, b].

In my example, for a=3 and b=4, all possible clusterings are:

[
    ['AAA', 'BBC', 'CCCC'],
    ['AAA', 'BBCC', 'CCC'],
    ['AAAB', 'BCC', 'CCC'],
]

The concatenation of the clusters in each clustering must give back the string s.

The cost function is something like this:

cost(clustering) = alpha*l + beta*e + gamma*d

where:

  • l = variance(cluster_lengths)
  • e = mean(clusters_entropies)
  • d = 1 - (nb_characters_in_b_that_are_not_in_a / size_of_b), where b is the cluster immediately following a
  • alpha, beta, gamma are weights

This cost function gives a low cost (0) for the best case:

  1. Where all clusters have the same size.
  2. Content inside each cluster is the same.
  3. Consecutive clusters don't have the same content.
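
For concreteness, here is one possible reading of this cost function in Python. The character-level Shannon entropy and the averaging of d over consecutive cluster pairs are assumptions; the question does not pin down those details.

import math
from statistics import pvariance  # population variance: Sum[(x - mean)^2] / n

def cluster_entropy(cluster):
    # Shannon entropy of the character distribution inside one cluster
    n = len(cluster)
    return -sum((cluster.count(c) / n) * math.log2(cluster.count(c) / n)
                for c in set(cluster))

def cost(clustering, alpha=1.0, beta=1.0, gamma=1.0):
    # l: variance of the cluster lengths
    l = pvariance([len(c) for c in clustering])
    # e: mean entropy of the clusters
    e = sum(cluster_entropy(c) for c in clustering) / len(clustering)
    # d: for each consecutive pair (x, y), 1 - |chars of y not in x| / |y|, averaged over pairs
    pairs = list(zip(clustering, clustering[1:]))
    ds = [1 - sum(ch not in set(x) for ch in y) / len(y) for x, y in pairs]
    d = sum(ds) / len(ds) if ds else 0.0
    return alpha * l + beta * e + gamma * d

With this, cost(['AAA', 'BBC', 'CCCC']) can be compared against the other two candidate clusterings from the example.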

Theoretically, the solution is to calculate the cost of all possible compositions of this string and choose the one with the lowest cost, but that would take too much time.

Is there any clustering algorithm that can find the best clustering according to this cost function in a reasonable time?

Ghilas BELHADJ asked Aug 19 '16

1 Answer

A dynamic programming approach should work here.

Imagine, first, that cost(clustering) equals the sum of cost(cluster) over all clusters that constitute the clustering.

Then, a simple DP function is defined as follows:

F[i] = minimal cost of clustering the substring s[0..i]

and calculated in the following way:

for i = 0..length(s)-1:
    for j = a..b:
        last_cluster = s[i-j+1..i]
        F[i] = min(F[i], F[i-j] + cost(last_cluster))

Of course, you first have to initialize the values of F to some infinite value (or null), with the cost of an empty prefix taken as zero, so that the min function applies correctly.

To actually restore the answer, you can store additional values P[i], where P[i] contains the length of the last cluster in the optimal clustering of the string s[0..i]. When you update F[i], you also update P[i].

Then, restoring the answer is little trouble:

current_pos = length(s) - 1
while (current_pos >= 0):
    current_cluster_length = P[current_pos]
    current_cluster = s[(current_pos - current_cluster_length + 1)..current_pos]
    // grab current_cluster to the answer
    current_pos -= current_cluster_length

Note that with this approach you will get the clusters in reverse order, from the last cluster all the way to the first one.
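
Here is a minimal runnable sketch of this simpler scheme in Python. It assumes the per-cluster cost is passed in as a function, and it uses Python's half-open slicing rather than the inclusive indices of the pseudocode above:

def best_clustering(s, cost, a, b):
    # F[i] = minimal cost of clustering the first i characters of s
    # P[i] = length of the last cluster in that optimal clustering
    n = len(s)
    INF = float('inf')
    F = [0.0] + [INF] * n        # empty prefix costs nothing
    P = [0] * (n + 1)
    for i in range(1, n + 1):
        for j in range(a, min(b, i) + 1):   # try every allowed size for the last cluster
            candidate = F[i - j] + cost(s[i - j:i])
            if candidate < F[i]:
                F[i] = candidate
                P[i] = j
    if F[n] == INF:
        return None                         # no partition with all cluster sizes in [a, b]
    # walk back through P; the clusters come out last-to-first, so reverse at the end
    clusters, pos = [], n
    while pos > 0:
        clusters.append(s[pos - P[pos]:pos])
        pos -= P[pos]
    return list(reversed(clusters)), F[n]

For example, best_clustering('AAABBCCCCC', lambda c: len(set(c)) - 1, 3, 4) returns a minimum-cost partition of the example string together with its cost, for that toy per-cluster cost.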

Let's now apply this idea to the initial problem. What we would like is to make cost(clustering) more or less linear, so that we can compute it cluster by cluster instead of computing it for the whole clustering.

The first parameter of our DP function F will be, as before, i, the end position of the prefix s[0..i] for which we have found an optimal answer. The meaning of the F function is, as usual, the minimal cost we can achieve with the given parameters.

The parameter e = mean(clusters_entropies) of the cost function is essentially linear and can be computed cluster by cluster (each cluster contributes its entropy divided by the number of clusters), so this is not a problem.

The parameter l = variance(cluster_lengths) is a little more complex. The variance of n values is defined as Sum[(x[i] - mean)^2] / n, where the mean is the expected value, namely mean = Sum[x[i]] / n. Note also that Sum[x[i]] is the sum of the lengths of all clusters, which in our case is always fixed and equals length(s). Therefore mean = length(s) / n.

This makes the l part of the cost function more or less linear, except for the n parameter. We will therefore add this parameter, namely the number of clusters in the desired clustering, as a parameter of our F function. We will also have a parameter cur, which will mean the number of clusters currently assembled in the given state.
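
Spelling this decomposition out (with |s| = length(s) and x_i the cluster lengths), once n is fixed each cluster contributes its own independent term to l:

l = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \frac{|s|}{n}\right)^2

so a cluster of length sz adds \frac{1}{n}\left(sz - \frac{|s|}{n}\right)^2 to the cost the moment it is appended, which is what the transition below relies on.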

The parameter d of the cost function also requires adding an additional parameter to our DP function F, namely sz, the size of the last cluster in our partition.

Overall, we have come up with a DP function F[i][n][cur][sz] that gives us the minimal cost of partitioning the string s[0..i] into n clusters, of which cur are currently constructed, with the size of the last cluster equal to sz. Of course, our responsibility is to make sure that a <= sz <= b. The answer, in terms of the minimal cost, will be the minimum of F[length(s)-1][n][n][sz] over all possible n and all a <= sz <= b.

Notice that this time we do not even require a companion P function to store the length of the last cluster, as that information is already included as the sz parameter of F. We will, however, store in P[i][n][cur][sz] the length of the next-to-last cluster in the optimal clustering with the specified parameters, and use that value to restore the solution. Thus, assuming the minimum of F is achieved at the parameters n=n0 and sz=sz0, we can restore the answer in the following way:

current_pos = length(s) - 1
current_n = n0
current_cluster_size = sz0
while (current_n > 0):
    current_cluster = s[(current_pos - current_cluster_size + 1)..current_pos]
    next_cluster_size = P[current_pos][n0][current_n][current_cluster_size]
    current_n--;
    current_pos -= current_cluster_size;
    current_cluster_size = next_cluster_size

Let's now get to the computation of F. I will omit the corner cases and range checks, but it will be enough to just initialize F with some infinite values.

// initialize for the case of one cluster
// there is no previous cluster, so the d term is zero; the entropy and length terms
// of this first cluster are added here, consistently with the general case below
for i = 0..length(s)-1:
    for n = 1..length(s):
        sz = i + 1
        F[i][n][1][sz] = beta*cluster_entropy(s[0..i])/n + alpha*(sz - length(s)/n)^2/n
        P[i][n][1][sz] = -1 // fake value: there is no previous cluster

// general case computation
for i = 0..length(s)-1:
    for n = 1..length(s):
        for cur = 2..n:
            for sz = a..b:
                for prev_sz = a..b:
                    cur_cluster = s[i-sz+1..i]
                    prev_cluster = s[i-sz-prev_sz+1..i-sz]
                    candidate = F[i-sz][n][cur-1][prev_sz]
                                + gamma*calc_d(prev_cluster, cur_cluster)
                                + beta*cluster_entropy(cur_cluster)/n
                                + alpha*(sz - length(s)/n)^2/n
                    if candidate < F[i][n][cur][sz]:
                        F[i][n][cur][sz] = candidate
                        P[i][n][cur][sz] = prev_sz // remember the previous cluster size for restoration
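
To make this concrete, here is a runnable Python sketch of the whole scheme. As a simplification it loops over the total number of clusters n externally instead of carrying it as a fourth array dimension (the two are equivalent), and it reuses the same assumed character-level entropy and per-pair d term as in the cost-function sketch above:

import math

def cluster_entropy(cluster):
    # character-level Shannon entropy of one cluster (assumption, as above)
    n = len(cluster)
    return -sum((cluster.count(c) / n) * math.log2(cluster.count(c) / n)
                for c in set(cluster))

def calc_d(prev_cluster, cur_cluster):
    # d term for one consecutive pair of clusters, as described in the question
    outside = sum(ch not in set(prev_cluster) for ch in cur_cluster)
    return 1 - outside / len(cur_cluster)

def best_partition(s, a, b, alpha=1.0, beta=1.0, gamma=1.0):
    L = len(s)
    INF = float('inf')
    best_cost, best_clusters = INF, None
    # cluster sizes in [a, b] bound the feasible number of clusters n
    for n in range(max(1, -(-L // b)), L // a + 1):
        # F[i][cur][sz]: minimal cost of covering s[:i] with cur clusters, the last of size sz
        F = [[[INF] * (b + 1) for _ in range(n + 1)] for _ in range(L + 1)]
        P = [[[None] * (b + 1) for _ in range(n + 1)] for _ in range(L + 1)]
        # one-cluster initialization: no previous cluster, so no d term
        for sz in range(a, min(b, L) + 1):
            F[sz][1][sz] = beta * cluster_entropy(s[:sz]) / n + alpha * (sz - L / n) ** 2 / n
        # append clusters one by one
        for i in range(1, L + 1):
            for cur in range(2, n + 1):
                for sz in range(a, min(b, i) + 1):
                    cur_cluster = s[i - sz:i]
                    own = beta * cluster_entropy(cur_cluster) / n + alpha * (sz - L / n) ** 2 / n
                    for prev_sz in range(a, b + 1):
                        if F[i - sz][cur - 1][prev_sz] == INF:
                            continue
                        prev_cluster = s[i - sz - prev_sz:i - sz]
                        cand = (F[i - sz][cur - 1][prev_sz] + own
                                + gamma * calc_d(prev_cluster, cur_cluster))
                        if cand < F[i][cur][sz]:
                            F[i][cur][sz] = cand
                            P[i][cur][sz] = prev_sz
        # best answer for this n, then restore by walking P backwards
        for sz in range(a, b + 1):
            if F[L][n][sz] < best_cost:
                best_cost = F[L][n][sz]
                clusters, pos, cur, size = [], L, n, sz
                while cur > 0:
                    clusters.append(s[pos - size:pos])
                    pos, cur, size = pos - size, cur - 1, P[pos][cur][size]
                best_clusters = list(reversed(clusters))
    return best_clusters, best_cost

Calling best_partition('AAABBCCCCC', a=3, b=4) then returns one of the three feasible partitions listed in the question, namely the one with the lowest cost under these assumed definitions.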
Alexey Subach answered Nov 01 '22