Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Should we used k-means++ instead of k-means?

The k-means++ algorithm helps in two following points of the original k-means algorithm:

  1. The original k-means algorithm has the worst case running time of super-polynomial in input size, while k-means++ has claimed to be O(log k).
  2. The approximation found can yield a not so satisfactory result with respect to objective function compared to the optimal clustering.

But are there any drawbacks of k-means++? Should we always used it instead of k-means from now on?

like image 453
Karl Avatar asked Jan 16 '11 16:01

Karl


1 Answers

Nobody claims k-means++ runs in O(lg k) time; it's solution quality is O(lg k)-competitive with the optimal solution. Both k-means++ and the common method, called Lloyd's algorithm, are approximations to an NP-hard optimization problem.

I'm not sure what the worst case running time of k-means++ is; note that in Arthur & Vassilvitskii's original description, steps 2-4 of the algorithm refer to Lloyd's algorithm. They do claim that it works both better and faster in practice because it starts from a better position.

The drawbacks of k-means++ are thus:

  1. It too can find a suboptimal solution (it's still an approximation).
  2. It's not consistently faster than Lloyd's algorithm (see Arthur & Vassilvitskii's tables).
  3. It's more complicated than Lloyd's algo.
  4. It's relatively new, while Lloyd's has proven it's worth for over 50 years.
  5. Better algorithms may exist for specific metric spaces.

That said, if your k-means library supports k-means++, then by all means try it out.

like image 74
Fred Foo Avatar answered Sep 22 '22 10:09

Fred Foo