scikits.learn clusterization methods for curve fitting parameters

I would like some suggestions on the best clustering technique to use with Python and scikits.learn. Our data come from a Phenotype Microarray, which measures the metabolic activity of a cell on various substrates over time. The output is a series of sigmoid curves, from which we extract a set of curve parameters by fitting a sigmoid function.

We would like to "rank" these activity curves through clustering, using a fixed number of clusters. For now we are using the k-means algorithm provided by the package, with (init='random', k=10, n_init=100, max_iter=1000). The input is a matrix with n_samples rows and 5 parameters for each sample. The number of samples can vary, but it is usually several thousand (e.g. 5,000). The clustering seems efficient and effective, but I would appreciate any suggestions on different methods or on the best way to assess the clustering quality.
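
A minimal sketch of the setup described above, assuming a current scikit-learn release, where the package is imported as sklearn and the `k` argument is spelled `n_clusters`. The matrix `X` is random placeholder data, not the real curve parameters.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder input: one row per sample, 5 fitted curve parameters per row.
# (Random numbers standing in for the real Phenotype Microarray data.)
rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 5))

# Same settings as in the question; `k` is called `n_clusters` in current releases.
km = KMeans(n_clusters=10, init="random", n_init=100, max_iter=1000, random_state=0)
labels = km.fit_predict(X)

print(labels.shape)               # one cluster label per sample
print(km.cluster_centers_.shape)  # one 5-parameter centroid per cluster
```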

Here are a couple of diagrams that may help:

  • the scatterplot of the input parameters (some of them are quite correlated); each sample is colored according to its assigned cluster. (image: Scatterplot of input parameters)

  • the sigmoid curves from which the input parameters were extracted, each colored according to its assigned cluster. (image)

EDIT

Below are some elbow plots and the silhouette score for each number of clusters. (image: clustering stats)
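
For reference, statistics of this kind can be computed roughly as follows. This is only a sketch on synthetic blobs (the four centers are made up for illustration), using the k-means inertia for the elbow curve and `silhouette_score` from sklearn.metrics.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical, well-separated centers in a 5-parameter space.
centers = np.array([
    [0, 0, 0, 0, 0],
    [10, 10, 0, 0, 0],
    [0, 10, 10, 0, 0],
    [10, 0, 0, 10, 10],
], dtype=float)
X, _ = make_blobs(n_samples=600, centers=centers, cluster_std=1.0, random_state=0)

inertias = {}     # within-cluster sum of squares, the quantity shown in an elbow plot
silhouettes = {}  # mean silhouette coefficient, between -1 and 1
for k in range(2, 9):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = km.inertia_
    silhouettes[k] = silhouette_score(X, km.labels_)

best_k = max(silhouettes, key=silhouettes.get)
for k in sorted(inertias):
    print(k, round(inertias[k], 1), round(silhouettes[k], 3))
print("highest silhouette at k =", best_k)
```

Both numbers are computed on whatever scale the input columns have, so they inherit any scaling problem in the data.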

mgalardini asked Dec 20 '22 04:12


1 Answer

Have you noticed the striped pattern in your plots?

This indicates that you didn't normalize your data well enough.

"Area" and "Height" are highly correlated and are probably the parameters on the largest scale, so all the clustering happened along this axis.

You absolutely must:

  • perform careful preprocessing
  • check that your distance functions produce a meaningful (to you, not just the computer) notion of similarity
  • reality-check your results, and check that they aren't too simple, determined e.g. by a single attribute
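
One way to run the first and third checks is sketched below. The assumptions are mine, not part of the advice above: standardizing with `StandardScaler` is only one possible preprocessing step, the helper `variance_explained_per_feature` is a hypothetical name, and the data are synthetic, with one column deliberately put on a much larger scale. The helper reports, per column, the share of variance accounted for by the cluster labels; a value near 1 for a single column and near 0 for the rest means that column alone decided the clustering.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler


def variance_explained_per_feature(X, labels):
    """Between-cluster sum of squares divided by total sum of squares, per column."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    total_ss = ((X - grand_mean) ** 2).sum(axis=0)
    between_ss = np.zeros(X.shape[1])
    for c in np.unique(labels):
        members = X[labels == c]
        between_ss += len(members) * (members.mean(axis=0) - grand_mean) ** 2
    return between_ss / total_ss


# Synthetic stand-in: five independent parameters, the first on a much larger scale.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5)) * np.array([100.0, 1.0, 1.0, 1.0, 1.0])

# Cluster the raw data, then the standardized data (zero mean, unit variance per column).
raw_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
X_scaled = StandardScaler().fit_transform(X)
scaled_labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)

raw_ratios = variance_explained_per_feature(X, raw_labels)
scaled_ratios = variance_explained_per_feature(X_scaled, scaled_labels)
print("raw:   ", np.round(raw_ratios, 2))
print("scaled:", np.round(scaled_ratios, 2))
```

Standardizing does not remove correlation between columns, so strongly correlated parameters such as "Area" and "Height" still count twice in a Euclidean distance.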

Don't blindly follow the numbers. K-means will happily produce k clusters no matter what data you give it. It just optimizes a number (the within-cluster sum of squares). It's up to you to check that the results are useful and to analyze what they mean semantically; the result may well be nothing more than a mathematical local optimum that is meaningless for your task.
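
To see this for yourself, here is a small sketch on data that has no cluster structure at all (uniform random numbers, chosen purely for illustration). K-means still returns the requested number of clusters.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Uniform noise in a 5-dimensional unit cube: there are no real groups to find.
rng = np.random.default_rng(0)
X = rng.uniform(size=(1000, 5))

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X)
n_found = len(np.unique(km.labels_))
score = silhouette_score(X, km.labels_)

print("clusters returned:", n_found)
print("mean silhouette:", round(score, 3))
```

The algorithm does not tell you that the partition is arbitrary; a quality score or a look at the clusters themselves has to do that.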

Has QUIT--Anony-Mousse answered Dec 26 '22 12:12