So I have a 2D scatter filled with points (x,y)
. I want to draw k vertical lines (x_1 = a, x_2 = b, ..., x_k = k)
, so as to partition the points into k groups.
The optimal solution would minimize the average variance of each group's y_value
.
What is the appropriate algorithm? It sounded like k-means but I have the constraint that the lines must be vertical.
Here is an idea based on dynamic programming.
With the following notations:
(x_1, y_1), ..., (x_n, y_n)
the points, with x_1 <= x_2 <= ... <= x_n
to cut in K
groups.Var(i, j)
the variance of the y
's: y_i, ..., y_j
.F_K((x_1,y_1), ..., (x_n, y_n)) = F_k(1,n)
the value of the best solution for the problem.
Then we have the following:F_k(i,j) = min for l in i...j-k+1 of (Var(i,l) + F_(k-1)(l+1, j)
andF_1(i,j) = Var(i,j)
.
The property above simply means that the best way to split your points in 'k' groups is to select the leftmost cut (the choice of l
), and the best choice of k-1
cuts for the remaining points.
From there you can go for a dynamic program. You'll need a 3D array A
of dimensions n*n*K
to store the value of F_k(i,j) for all i,j,k
.
The program would look like:
function get_value(P: points, A: 3D array, i, j, k){
if A[i][j][k] is defined{
result = A[i][j][k]
} else if k == 1 {
A[i][j][k] = get_var(P, i, j)
result = A[i][j][k]
} else {
result = +INF
for l in i ... j-k+1 {
tmp = get_value(P, A, i, l, 1) + get_value(P, A, l+1, j, k-1)
if tmp < result {
result = tmp
}
}
}
return result
}
NB: I was a bit quick about the range to iterate on for l
, that might be something to look into.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With