
DBSCAN with Python and scikit-learn: What exactly are the integer labels returned by make_blobs?

I'm trying to understand the example for the DBSCAN algorithm implemented in scikit-learn (http://scikit-learn.org/0.13/auto_examples/cluster/plot_dbscan.html).

I replaced the line

X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4)

with X = my_own_data, so that I can run DBSCAN on my own data.

Now, the variable labels_true, which is the second value returned by make_blobs, is used to compute several scores for the result, like this:

print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f" %
      metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f" %
      metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f" %
      metrics.silhouette_score(D, labels, metric='precomputed'))

How can I calculate labels_true from my data X? What exactly does scikit-learn mean by "label" in this case?

Thanks for your help!

otmezger asked Apr 04 '13 18:04


1 Answer

labels_true is the "true" assignment of points to labels: which cluster each point should actually belong to. It is available here because make_blobs knows which "blob" it generated each point from, and each label is simply the integer index of that blob.
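As a rough illustration of what those labels look like, here is a minimal sketch. The three centers and the random_state value are assumptions chosen for the demo, not taken from the question:

```python
import numpy as np
from sklearn.datasets import make_blobs

# Assumed demo centers; any list of center coordinates would do.
centers = [[1, 1], [-1, -1], [1, -1]]

# The second return value holds, for each sample, the index of the
# center (blob) that sample was drawn from.
X, labels_true = make_blobs(n_samples=750, centers=centers,
                            cluster_std=0.4, random_state=0)

print(X.shape)                   # one row per sample, one column per feature
print(np.unique(labels_true))    # the distinct blob indices
print(np.bincount(labels_true))  # number of samples drawn from each blob
```

The labels carry no meaning beyond "this point came from blob k", which is why they only exist for synthetic data like this.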

You can't get that for your own arbitrary data X, unless you have some kind of true labels for the points (in which case you wouldn't be doing clustering anyway). The example just shows some measures of how well the clustering performed in an artificial case where the true answer is known.
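Of the scores printed in the question, only the silhouette coefficient does not take labels_true as an argument, so it is the one that can still be computed on your own data. A hedged sketch of that, where the synthetic X stands in for my_own_data and the eps and min_samples values are assumptions you would need to tune:

```python
import numpy as np
from sklearn import metrics
from sklearn.cluster import DBSCAN

# Stand-in for my_own_data: two tight, well-separated groups of points.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0.0, 0.1, size=(50, 2)),
               rng.normal(5.0, 0.1, size=(50, 2))])

# eps and min_samples are assumed values, not recommendations.
labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# DBSCAN marks noise points with the label -1, which is not a cluster.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("Estimated number of clusters: %d" % n_clusters)

# The silhouette score needs only the data and the predicted labels,
# but it is defined only when there are at least two distinct labels.
if n_clusters >= 2:
    score = metrics.silhouette_score(X, labels)
    print("Silhouette Coefficient: %0.3f" % score)
```

The other metrics (homogeneity, completeness, V-measure, adjusted Rand index, adjusted mutual information) all compare against ground truth, so they have to be dropped when no true labels exist.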

Danica answered Sep 29 '22 11:09