 

How to find the success rate of a clustering algorithm?

I have implemented several clustering algorithms on an image dataset, and I'm interested in deriving the success rate of the clustering. The goal is to detect the tumor area. In the original image I know where the tumor is located, so I would like to compare the two images and obtain a percentage of success. The images follow:

Original image: I know the position of cancer

Image after clustering algorithm

I'm using python 2.7.

asked Jul 25 '18 17:07 by GuroTozzi




1 Answer

Segmentation Accuracy

This is a pretty common problem in the image segmentation literature; e.g., here is a StackOverflow post.

One common approach is pixel-wise: consider the ratio of "correct pixels" to "incorrect pixels." This is common in image segmentation for safety-critical domains, e.g., Mask RCNN, PixelNet.
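As a minimal sketch of the pixel-wise idea, assuming both images have already been reduced to binary masks of the same shape (tumor = True), the fraction of agreeing pixels can be computed with numpy. The function name `pixel_accuracy` and the toy masks are made up for illustration. Note that this reports the fraction of correct pixels rather than a correct-to-incorrect ratio, and that a small tumor on a large background will score high almost regardless of the prediction.

```python
import numpy as np

def pixel_accuracy(pred_mask, true_mask):
    """Fraction of pixels where the predicted and reference masks agree."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    if pred.shape != true.shape:
        raise ValueError("masks must have the same shape")
    return float(np.mean(pred == true))

# Toy 4x4 example: the reference tumor is a 2x2 block,
# and the prediction covers the same block plus one extra pixel.
true_mask = np.zeros((4, 4), dtype=bool)
true_mask[1:3, 1:3] = True
pred_mask = true_mask.copy()
pred_mask[0, 0] = True

accuracy = pixel_accuracy(pred_mask, true_mask)
print(accuracy)  # 15 of the 16 pixels agree
```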

Treating it as more of an object detection task, you could take the overlap of the hull of the objects and just measure accuracy (commonly broken down into precision, recall, f-score, and other measures with various bias/skews). This allows you to produce an ROC curve that can be calibrated for false positives/false negatives.
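To make the overlap measures concrete, here is a rough sketch that counts true positives, false positives and false negatives between two binary masks and derives precision, recall, F-score and intersection-over-union from them. It assumes binary masks of equal shape; `overlap_scores` and the toy masks are hypothetical names, not part of any library.

```python
import numpy as np

def overlap_scores(pred_mask, true_mask):
    """Precision, recall, F1 (same as Dice for binary masks) and IoU."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    tp = np.count_nonzero(pred & true)    # tumor pixels found
    fp = np.count_nonzero(pred & ~true)   # background marked as tumor
    fn = np.count_nonzero(~pred & true)   # tumor pixels missed
    precision = tp / float(tp + fp) if tp + fp else 0.0
    recall = tp / float(tp + fn) if tp + fn else 0.0
    f1 = 2.0 * tp / (2 * tp + fp + fn) if tp + fp + fn else 0.0
    iou = tp / float(tp + fp + fn) if tp + fp + fn else 0.0
    return {"precision": precision, "recall": recall, "f1": f1, "iou": iou}

# Toy 5x5 example: the prediction is the reference block shifted one column right.
true_mask = np.zeros((5, 5), dtype=bool)
true_mask[1:3, 1:3] = True
pred_mask = np.zeros((5, 5), dtype=bool)
pred_mask[1:3, 2:4] = True

scores = overlap_scores(pred_mask, true_mask)
print(scores)
```

Unlike plain pixel accuracy, none of these scores count the true-negative background, so they are not inflated when the tumor is small relative to the image.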

There is no domain-agnostic consensus on what's correct. KITTI provides both.

Mask RCNN is open source and state-of-the-art, and provides implementations in Python of

  • Computing image matching between segmented and original
  • Displaying the differences

In your domain (medicine), standard statistical rules apply. Use a holdout set. Cross validate. Etc. (*)

Note: although the literature space is dauntingly large, I'd encourage you to take a look at some domain-relevant papers, as they may take fewer "statistical short cuts" than other vision projects (digit recognition, e.g.) accept.

  • "Metrics for evaluating 3D medical image segmentation: analysis, selection, and tool" provides some summary methods in your domain.
  • "Current methods in image segmentation" has about 2500 citations but is a little older.
  • "Review of MR image segmentation techniques using pattern recognition" is a little older still and will get you safely into "traditional" vision models.
  • "Automated Segmentation of MR Images of Brain Tumors" is largely about its segmentation validation process.

Python

Besides the Mask RCNN links above, scikit-learn provides some extremely user-friendly tools and is considered part of the standard scientific "stack" for Python.

Implementing the difference between images in python is trivial (using numpy). Here's an overkill SO link.
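For instance, a difference map between two binary masks might look like the sketch below. It assumes the clustered image and the reference have both been converted to boolean arrays of the same shape; the function name `difference_map` and the integer codes are arbitrary choices for illustration.

```python
import numpy as np

def difference_map(pred_mask, true_mask):
    """Code each pixel: 0 = both background, 1 = both tumor,
    2 = false positive (predicted only), 3 = false negative (reference only)."""
    pred = np.asarray(pred_mask, dtype=bool)
    true = np.asarray(true_mask, dtype=bool)
    diff = np.zeros(pred.shape, dtype=np.uint8)
    diff[pred & true] = 1
    diff[pred & ~true] = 2
    diff[~pred & true] = 3
    return diff

# Toy 5x5 example: the prediction is the reference block shifted one column right.
true_mask = np.zeros((5, 5), dtype=bool)
true_mask[1:3, 1:3] = True
pred_mask = np.zeros((5, 5), dtype=bool)
pred_mask[1:3, 2:4] = True

diff = difference_map(pred_mask, true_mask)
counts = np.bincount(diff.ravel(), minlength=4)  # pixels per category
print(diff)
print(counts)
```

The resulting array can be passed to any image viewer with a 4-color palette to see where the two masks disagree.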

Bounding box intersection in python is easy to implement on one's own; I'd use a library like shapely if you want to measure general polygon intersection.
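A hand-rolled version for axis-aligned boxes could look like this sketch. It assumes boxes are given as `(x_min, y_min, x_max, y_max)` tuples; that convention and the name `box_iou` are choices made here, not something a particular library prescribes.

```python
def box_iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes
    given as (x_min, y_min, x_max, y_max)."""
    ax0, ay0, ax1, ay1 = box_a
    bx0, by0, bx1, by1 = box_b
    # Width and height of the overlap; clamp to zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h
    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter
    return inter / float(union) if union > 0 else 0.0

iou = box_iou((0, 0, 4, 4), (2, 2, 6, 6))
print(iou)  # overlap area 4, union area 28
```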

Scikit-learn has some nice machine-learning evaluation tools, for example,

  • ROC curves
  • Cross validation
  • Model selection
  • A million others
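As a small illustration of the scikit-learn route, the sketch below flattens two binary masks into label vectors and hands them to a few functions from `sklearn.metrics`. The masks are toy data invented for the example. A ROC curve would additionally need a per-pixel score rather than a hard label, so it is left out here.

```python
import numpy as np
from sklearn.metrics import (confusion_matrix, f1_score,
                             precision_score, recall_score)

# Toy 5x5 masks: the prediction is the reference block shifted one column right.
true_mask = np.zeros((5, 5), dtype=bool)
true_mask[1:3, 1:3] = True
pred_mask = np.zeros((5, 5), dtype=bool)
pred_mask[1:3, 2:4] = True

# scikit-learn expects 1-D label vectors, so treat every pixel as one sample.
y_true = true_mask.ravel().astype(int)
y_pred = pred_mask.ravel().astype(int)

# With labels=[0, 1] the matrix is [[tn, fp], [fn, tp]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
precision = precision_score(y_true, y_pred)
recall = recall_score(y_true, y_pred)
f1 = f1_score(y_true, y_pred)

print(tn, fp, fn, tp)
print(precision, recall, f1)
```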

Literature Searching

One reason you may have trouble searching for the answer is that you're trying to measure the performance of an unsupervised method, clustering, in a supervised learning arena. "Clusters" are fundamentally under-defined in mathematics (**). You want to be looking at the supervised learning literature for accuracy measures.
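One practical consequence is that cluster IDs are arbitrary, so before any supervised measure can be applied you have to decide which cluster stands for "tumor." A rough sketch of one way to do that, assuming the clustering output is an integer label image and the reference is a binary mask: pick the cluster with the highest Dice overlap. The name `best_matching_cluster` is made up, and choosing the cluster by peeking at the reference mask gives an optimistic score, so treat it as an upper bound.

```python
import numpy as np

def best_matching_cluster(label_image, true_mask):
    """Return (label, dice) for the cluster that best overlaps the reference mask."""
    labels = np.asarray(label_image)
    true = np.asarray(true_mask, dtype=bool)
    best_label, best_dice = None, -1.0
    for label in np.unique(labels):
        pred = labels == label
        tp = np.count_nonzero(pred & true)
        denom = np.count_nonzero(pred) + np.count_nonzero(true)
        dice = 2.0 * tp / denom if denom else 0.0
        if dice > best_dice:
            best_label, best_dice = label, dice
    return best_label, best_dice

# Toy 4x4 clustering with three clusters; cluster 2 covers the tumor plus two extra pixels.
label_image = np.zeros((4, 4), dtype=int)
label_image[1:3, 1:4] = 2
label_image[3, :] = 1
true_mask = np.zeros((4, 4), dtype=bool)
true_mask[1:3, 1:3] = True

best_label, best_dice = best_matching_cluster(label_image, true_mask)
print(best_label, best_dice)
```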

There is literature on unsupervised learning/clustering, too, which looks for topological structure, generally. Here's a very introductory summary. I don't think that is what you want.

A common problem, especially at scale, is that supervised methods require labels, which can be time consuming to produce accurately for dense segmentation. Object detection makes it a little easier.

There are some existing datasets for medicine (e.g., [1], [2]) and some ongoing research in label-less metrics. If none of these are options for you, then you may have to revert to considering it an unsupervised problem, but evaluation becomes very different in scope and utility.


Footnotes

[*] Vision people sometimes skip cross validation even though they shouldn't, mainly because the models are slow to fit and they're a lazy bunch. Please don't skip a train/test/validation split, or your results may be dangerously useless.

[**] You can find all sorts of "formal" definitions, but never two people who agree on which one is correct or most useful. Here's some denser reading.

answered Oct 02 '22 22:10 by en_Knight