I need to find naturally occurring classes of nouns based on their distribution with different prepositions (agentive, instrumental, time, place, etc.). I tried k-means clustering, but it didn't work well: there was a lot of overlap between the classes I was looking for (probably because the classes are not globular and because of the random initialisation in k-means).
I am now trying DBSCAN, but I have trouble understanding the epsilon and minPts parameters of this algorithm. Can I use random values, or do I need to compute them? Can anybody help? Particularly with epsilon: how do I compute it, if I need to?
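Use your domain knowledge to choose the parameters. Epsilon is a radius: two points count as neighbors if their distance is at most epsilon. You can think of it as a minimum cluster size in terms of extent, while minPts is the minimum number of points that must fall within that radius.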
Obviously random values won't work very well. As a heuristic, you can try to look at a k-distance plot; but it's not automatic either.
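To make the k-distance heuristic concrete, here is a minimal sketch using only NumPy. For every point it computes the distance to its k-th nearest neighbor and sorts those distances in descending order; plotting the result and looking for a "knee" gives candidate epsilon values. The helper name `k_distances` and the toy data are made up for illustration, and choosing k (often tied to minPts) is still up to you.

```python
import numpy as np

def k_distances(X, k):
    """Distance from each point to its k-th nearest neighbor, sorted descending."""
    X = np.asarray(X, dtype=float)
    # Brute-force pairwise Euclidean distances (fine for a few thousand points).
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    # After sorting each row, column 0 is the point itself (distance 0),
    # so column k holds the distance to the k-th nearest other point.
    kth = np.sort(dist, axis=1)[:, k]
    return np.sort(kth)[::-1]

# Toy data (made-up values): two tight groups plus one far-away outlier.
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],
              [5.0, 5.0], [5.1, 5.0], [5.0, 5.1], [5.1, 5.1],
              [10.0, 0.0]])
kd = k_distances(X, k=2)
print(kd)  # one large value for the outlier, then small values for the groups
```

On real data the curve is rarely this clean, so the knee is a starting point to try, not an automatic answer.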
The first thing to do either way is to choose a good distance function for your data and to perform appropriate normalization.
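As a small illustration of why normalization matters, the sketch below uses hypothetical noun-by-preposition counts (the numbers are invented) and compares Euclidean distances on raw counts with Euclidean distances on row-normalized relative frequencies. Whether Euclidean distance is the right choice for your data at all is a separate question; this only shows the effect of normalizing.

```python
import numpy as np

# Hypothetical counts: rows are nouns, columns are prepositions.
counts = np.array([
    [40.0,  2.0, 1.0],  # frequent noun, mostly with the first preposition
    [ 4.0,  0.0, 0.0],  # rare noun with the same preference
    [ 1.0, 30.0, 3.0],  # frequent noun, mostly with the second preposition
])

def euclidean_matrix(A):
    """Pairwise Euclidean distances between the rows of A."""
    diff = A[:, None, :] - A[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

# Row-normalize to relative frequencies so frequent and rare nouns are comparable.
profiles = counts / counts.sum(axis=1, keepdims=True)

D_raw = euclidean_matrix(counts)
D_norm = euclidean_matrix(profiles)

# On raw counts the rare noun (row 1) looks closer to row 2 than to row 0,
# purely because of frequency; after normalization it is closest to row 0.
print(D_raw[1, 0], D_raw[1, 2])
print(D_norm[1, 0], D_norm[1, 2])
```

Any epsilon you pick only makes sense relative to the distance and normalization you settle on here, which is why this step comes first.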
As for minPts, it again depends on your data and needs. One user may want a very different value than another. And of course minPts and epsilon are coupled: if you double epsilon, you will roughly need to increase minPts by a factor of 2^d (for Euclidean distance in d dimensions, because that is how the volume of a hypersphere grows).
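The 2^d factor follows from the volume formula for a d-dimensional ball. A short derivation, assuming the points are spread with roughly uniform local density \rho (an idealization that real data only approximates):

```latex
V_d(\varepsilon) = C_d\,\varepsilon^{d},
\qquad
C_d = \frac{\pi^{d/2}}{\Gamma\!\left(\tfrac{d}{2} + 1\right)}

\frac{V_d(2\varepsilon)}{V_d(\varepsilon)}
  = \frac{C_d\,(2\varepsilon)^{d}}{C_d\,\varepsilon^{d}}
  = 2^{d}

\mathbb{E}\big[\#\text{neighbors within } \varepsilon\big] \approx \rho\,V_d(\varepsilon)
\;\Longrightarrow\;
\mathbb{E}\big[\#\text{neighbors within } 2\varepsilon\big] \approx 2^{d}\,\rho\,V_d(\varepsilon)
```

So in 2 dimensions doubling epsilon means about 4 times as many neighbors, while in 10 dimensions it means about 1024 times as many.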
If you want lots of small, finely detailed clusters, choose a low minPts. If you want larger and fewer clusters (and more noise), use a larger minPts. If you don't want any clusters at all, choose minPts larger than your data set size...
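The effect of minPts can be checked by counting core points, that is, points with at least minPts points (counting themselves) within epsilon. The sketch below is not a full DBSCAN implementation; it uses synthetic data and a hypothetical helper `count_core_points` to show that raising minPts leaves fewer core points, and that minPts larger than the data set leaves none, so everything becomes noise.

```python
import numpy as np

rng = np.random.default_rng(0)
# Synthetic data: two tight blobs plus sparse background points.
X = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(50, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.3, size=(50, 2)),
    rng.uniform(low=-2.0, high=7.0, size=(20, 2)),
])

def count_core_points(X, eps, min_pts):
    """Number of points with at least min_pts points (itself included) within eps."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))
    neighbors = (dist <= eps).sum(axis=1)  # each point counts itself
    return int((neighbors >= min_pts).sum())

n = len(X)
core_counts = {m: count_core_points(X, eps=0.5, min_pts=m)
               for m in (3, 10, 30, n + 1)}
print(core_counts)  # counts shrink as min_pts grows; zero once min_pts > n
```

Without core points no cluster can form, which is why an oversized minPts yields no clusters at all.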