I'm new to Spark Streaming and I want to use K-means, but while studying it I can't work out how many times K-means on Spark Streaming uses the same data.
That is to say, the K-means algorithm is iterative, so how can I control the number of times it runs on the same data?
K-means is essentially about maintaining k cluster centroids. In every iteration you reassign each data point to its nearest centroid and then recompute the k centroids. So, right off the bat, the best way to stop a k-means run is not by counting how many times you have run the algorithm but by checking whether the centroids computed in this run are the same as those from the previous one.
Once the points stabilize in their clusters, the centroids stabilize too. Any further iteration would not change the clusters, and that is where you should stop.
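To make the stopping rule concrete, here is a minimal sketch in plain Python (no Spark involved). The function names `assign`, `recompute` and `kmeans_until_stable` are made up for illustration, and leaving an empty cluster's centroid where it was is just one assumed way to handle that case.

```python
import math

def assign(points, centroids):
    """Return, for each point, the index of its nearest centroid."""
    return [
        min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        for p in points
    ]

def recompute(points, labels, centroids):
    """Average the members of each cluster; an empty cluster keeps its old centroid."""
    new_centroids = []
    for j, old in enumerate(centroids):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            new_centroids.append(tuple(sum(c) / len(members) for c in zip(*members)))
        else:
            new_centroids.append(old)
    return new_centroids

def kmeans_until_stable(points, centroids):
    """Iterate until a run produces exactly the same centroids as the previous run."""
    centroids = [tuple(c) for c in centroids]
    iterations = 0
    while True:
        labels = assign(points, centroids)
        new_centroids = recompute(points, labels, centroids)
        iterations += 1
        if new_centroids == centroids:  # nothing moved, so further runs change nothing
            return centroids, labels, iterations
        centroids = new_centroids

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids, labels, iterations = kmeans_until_stable(points, [(0, 0), (10, 10)])
print(centroids, labels, iterations)
```

On this toy data the first run moves both centroids and the second run leaves them untouched, so the loop stops after two iterations.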
But you can stop earlier too, if you wish. You can program the algorithm to run for a maximum number of iterations. Since k-means runs over and over, it has some kind of loop (while, for, foreach, ...). You can put a loop counter in there and stop once you have done the desired number of runs. Alternatively, you can stop when the change between the previous cluster centroids and the new ones falls below a certain threshold.
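Both early-stopping options can live in the same loop. The sketch below is plain Python again, not Spark's implementation; `step`, `kmeans`, `max_iter` and `tol` are hypothetical names, and measuring the change as the largest distance any single centroid moved is an assumption (other measures would work too).

```python
import math

def step(points, centroids):
    """One k-means iteration: assign each point, then average each cluster."""
    clusters = [[] for _ in centroids]
    for p in points:
        nearest = min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
        clusters[nearest].append(p)
    # An empty cluster keeps its previous centroid (an assumed policy).
    return [
        tuple(sum(c) / len(members) for c in zip(*members)) if members else old
        for members, old in zip(clusters, centroids)
    ]

def kmeans(points, centroids, max_iter=20, tol=1e-4):
    """Stop after max_iter runs, or sooner if no centroid moved by tol or more."""
    centroids = [tuple(c) for c in centroids]
    iterations_run = 0
    for _ in range(max_iter):  # the loop counter caps the number of runs
        new_centroids = step(points, centroids)
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        iterations_run += 1
        if shift < tol:  # change is below the threshold, so stop early
            break
    return centroids, iterations_run

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
capped = kmeans(points, [(0, 0), (10, 10)], max_iter=1)
converged = kmeans(points, [(0, 0), (10, 10)], max_iter=20, tol=1e-4)
print(capped, converged)
```

With `max_iter=1` the loop is cut off after a single run; with the larger cap the threshold check ends it as soon as the centroids stop moving.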