Am trying to follow a paper called An Optimized Version of K-Means Algorithm. I have the idea on how K-Means algorithm works. That is, grouping the tuples/points into clusters and updating the centroids.
Am trying to implement the method mentioned in the above paper. Their proposed algorithm is this:

So my doubt is in the second step. I didn't understood what it is being done there! In the paper it says that, we group our data to wider intervals based on the value of e, so that later we could avoid iterating through the entire dataset. So, actually how we store it in that I (intervals) ? Are we supposed to define a multidimensional array? It's not making sense to me(probably am dumb enough not to get the idea).
The next thing I have doubt is about the step 5. In there, Cw is said to be the closest centroid of that point. But how do we figure that out? At first, we would be randomly assigning a point as the centroid. So, are we supposed to loop through the points and find out the Cw (the closest centroid) before calculating the e ?
The next doubt is on Step 6, which I guess will be able to understand after I get the idea about my previous question regarding Step2.
And the final doubt is regarding Step7. What does it mean by CjCj' ? Distance between the previous centroid position to the updated centroid position ?
I have been brain storming about this since the whole day. Any clues would be highly appreciated. Thank you
This algorithm revolves around the idea that points which are close to one cluster center and further away from all other cluster centers will stick to that cluster center. Thus, in subsequent iterations these points don't have to be compared to all cluster centers.
Imagine a point P which has distance 3 to its assigned cluster (i.e. that is the closest distance) and all other clusters are at least 8 or more away. Now, compute the new cluster centers. Let's say that the most any cluster center moved was 2 (that is the value of D in the algorithm, line 7).
You can now be sure that your point still belongs to the same cluster. This is easy to understand if you think about the worst case scenario: The assigned cluster could have moved away from the point so it could be at worst have a new distance of 3+2. The closest other cluster could have moved towards the point so its distance could be now 8-2. Therefore, you don't need to update that point.
Here's a picture:

Step 2: Create Intervals into which you later put the points. In step 5 you'll create e values. If e is 5, you'd put that point into the interval [4, 6) (if you'd have that interval).
Step 5: Compute the distance of the point to all clusters. The closest cluster is Cw. If the next closest cluster is C3 than e = C3 - Cw. I'd call e the safety margin. It gives you the maximal distance a cluster can move before assignments might switch.
Step 6: explained with step 2.
Step 7: CjCj' is the distance that Cj moved when it was updated.
Step 7 & 8: These are done as part of the for loop. (In Step 9 they say "go back to 4").
Step 9: Continue the loop with all points that might have changed clusters.
This algorithm assumes that you know k-means. I believe this is a very fair assumption. It is not very explicit when to terminate and how to initialise the algorithm. Both of these facts I'd consider common knowledge.
Second step: The intervals are on the distance metric, not on coordinates of data.
Fifth step: This is essentially distance to the second closest cluster.
Let me know if you still don't understand the rest.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With