scikits.learn clusterization methods for curve fitting parameters

Question

I would like some suggestions on the best clustering technique to use with Python and scikits.learn. Our data come from a Phenotype Microarray, which measures the metabolic activity of a cell on various substrates over time. The output is a series of sigmoid curves, from which we extract a set of curve parameters by fitting a sigmoid function.

We would like to "rank" these activity curves through clustering, using a fixed number of clusters. For now we are using the k-means algorithm provided by the package, with (init='random', k=10, n_init=100, max_iter=1000). The input is a matrix with n_samples rows and 5 parameters for each sample. The number of samples can vary, but it is usually several thousand (e.g. 5,000). The clustering seems efficient and effective, but I would appreciate any suggestions on different methods or on the best way to assess the clustering quality.
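For reference, the setup described above corresponds roughly to the sketch below. It assumes a recent scikit-learn release, where the package is imported as `sklearn` and the number of clusters is passed as `n_clusters` rather than `k`. The matrix `X` is random placeholder data, not the real curve parameters, and `random_state` is added only to make the run repeatable.

```python
import numpy as np
from sklearn.cluster import KMeans

# Placeholder for the real (n_samples, 5) matrix of fitted curve parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(600, 5))

# Same settings as in the question; `n_clusters` replaces the older `k`.
kmeans = KMeans(
    n_clusters=10,
    init="random",
    n_init=100,
    max_iter=1000,
    random_state=0,
)
labels = kmeans.fit_predict(X)

print(labels.shape)     # one cluster index per sample
print(kmeans.inertia_)  # within-cluster sum of squared distances
```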

Here are a couple of diagrams that may help:

the scatterplot of the input parameters (some of them are quite correlated); the color of each sample indicates its assigned cluster.
the sigmoid curves from which the input parameters were extracted, colored by their assigned cluster.

EDIT

Below are some elbow plots and the silhouette score for each number of clusters (figure: clustering stats).
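The numbers behind such plots can be computed along these lines. This is only a sketch: it assumes a recent scikit-learn with `silhouette_score` in `sklearn.metrics`, uses random placeholder data in place of the real matrix, and the range of cluster counts (2 to 12) is an arbitrary choice for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# Placeholder for the real (n_samples, 5) parameter matrix.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 5))

inertias = {}     # elbow plot: inertia versus number of clusters
silhouettes = {}  # mean silhouette coefficient versus number of clusters
for n_clusters in range(2, 13):
    km = KMeans(n_clusters=n_clusters, init="random", n_init=10, random_state=0).fit(X)
    inertias[n_clusters] = km.inertia_
    silhouettes[n_clusters] = silhouette_score(X, km.labels_)

for n_clusters in inertias:
    print(n_clusters, round(inertias[n_clusters], 1), round(silhouettes[n_clusters], 3))
```

The inertia is expected to keep dropping as clusters are added, so the elbow plot is read for a bend rather than a minimum; the silhouette score lies between -1 and 1, with higher values meaning better separated clusters.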

Has QUIT--Anony-Mousse · Accepted Answer

Have you noticed the striped pattern in your plots?

This indicates that you didn't normalize your data well enough.

"Area" and "Height" are highly correlated and probably on the largest scale. All the clustering happened on this axis.
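To illustrate the scale problem, here is a minimal sketch assuming scikit-learn's `StandardScaler` as the normalization step. The data are synthetic: four columns on a unit scale plus one hypothetical "area"-like column in the thousands. Standardization is only one possible preprocessing choice, and it does not remove the correlation between attributes.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data: four parameters on a unit scale, one in the thousands.
small = rng.normal(size=(500, 4))
area = rng.normal(loc=5000.0, scale=1500.0, size=(500, 1))
X = np.hstack([small, area])

# Share of the total variance carried by each column. Squared Euclidean
# distances, which k-means minimizes, are dominated by the same column.
raw_share = X.var(axis=0) / X.var(axis=0).sum()

# After standardization every column has zero mean and unit variance.
X_scaled = StandardScaler().fit_transform(X)
scaled_share = X_scaled.var(axis=0) / X_scaled.var(axis=0).sum()

print(np.round(raw_share, 4))
print(np.round(scaled_share, 4))
```

On the raw data the last column carries almost all of the variance, so k-means effectively clusters along that one axis; after scaling each column contributes equally.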

You absolutely must:

perform careful preprocessing
check that your distance functions produce a meaningful (to you, not just the computer) notion of similarity
reality-check your results, and check that they aren't too simple, determined e.g. by a single attribute
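One possible way to run the last check is sketched below: cluster on all attributes, cluster again on each attribute alone, and compare the two labelings with the adjusted Rand index. The helper `cluster` and the synthetic data (one column on a much larger scale) are made up for illustration, and this is just one of many possible sanity checks.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import adjusted_rand_score

rng = np.random.default_rng(0)
# Synthetic data: four unit-scale columns and one column on a much larger scale.
X = np.hstack([rng.normal(size=(500, 4)), rng.normal(scale=1000.0, size=(500, 1))])

def cluster(data, n_clusters=5):
    """Hypothetical helper: k-means labels for the given columns."""
    return KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit_predict(data)

full_labels = cluster(X)

# An index near 1 for a single column means that column alone
# reproduces the clustering obtained from all attributes.
ari_per_feature = [
    adjusted_rand_score(full_labels, cluster(X[:, [j]]))
    for j in range(X.shape[1])
]
print(np.round(ari_per_feature, 3))
```

If one attribute scores close to 1 while the others stay near 0, the clusters are essentially slices along that attribute, which is what a striped scatterplot suggests.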

Don't blindly follow the numbers. K-means will happily produce k clusters no matter what data you give it. It just optimizes some number. It's up to you to check that the results are useful and to analyze what their semantic meaning is. It may well be that the result is mathematically a local optimum but meaningless for your task.
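As a small demonstration of this point, the sketch below runs k-means on uniform random noise, which has no cluster structure at all. It assumes a recent scikit-learn; the data and the variable names are made up for illustration.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Uniform noise in 5 dimensions: there are no real clusters to find.
X_noise = rng.uniform(size=(1000, 5))

km = KMeans(n_clusters=10, n_init=10, random_state=0).fit(X_noise)
n_found = len(np.unique(km.labels_))
score = silhouette_score(X_noise, km.labels_)

print(n_found)          # number of distinct clusters returned
print(round(score, 3))  # mean silhouette coefficient of the partition
```

The algorithm still returns the requested number of clusters; only a separate check, such as the silhouette score or a look at the data, reveals that the partition carries little meaning.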

Tags:

python

cluster-analysis

scikit-learn

data-mining

mgalardini
