I've had encouraging results clustering a set of entity names with scikit-learn's affinity propagation implementation, using a modified Jaro-Winkler distance as the similarity metric, but my clusters are still too numerous (i.e., too many false positives).
I see in the scikit-learn documentation that there exists a 'preference' parameter that affects the number of clusters, with the following description:
preference : array-like, shape (n_samples,) or float, optional
Preferences for each point - points with larger values of preferences are more likely to be chosen as exemplars. The number of exemplars, ie of clusters, is influenced by the input preferences value. If the preferences are not passed as arguments, they will be set to the median of the input similarities.[0]
However, when I began tinkering with this value, I found that a very narrow range of values was giving me either too many clusters (preference=-11.13) or too few clusters (preference=-11.11).
Is there some way to determine what a 'reasonable' value of the preference parameter should be? And why would it be that I'm unable to obtain a non-extreme number of clusters?
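For reference, the cliff between "too many" and "too few" clusters can be reproduced with a small sweep over a precomputed similarity matrix. This sketch uses a plain negative Levenshtein distance as a hypothetical stand-in for the modified Jaro-Winkler similarity (the names and the sweep range are made up; any square similarity matrix works the same way):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

def lev(a, b):
    """Levenshtein edit distance via the standard DP recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,          # deletion
                           cur[j - 1] + 1,       # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

# Toy entity names (hypothetical); similarity = negative edit distance.
names = ["jonathan smith", "jonathon smith", "j. smith",
         "maria garcia", "maria garzia", "bob jones"]
S = np.array([[-lev(a, b) for b in names] for a in names], dtype=float)

# Sweep the preference from the minimum similarity up to the median
# (the median is scikit-learn's default) and report the cluster count.
for pref in np.linspace(S.min(), np.median(S), 6):
    ap = AffinityPropagation(affinity="precomputed",
                             preference=pref, random_state=0).fit(S)
    n = len(ap.cluster_centers_indices_)
    print(f"preference={pref:7.2f} -> {n} clusters")
```

Plotting cluster count against preference this way makes it easy to see whether the transition is genuinely a cliff for your data or whether intermediate counts exist at finer granularity.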
Similar questions:
Affinity Propagation - Cluster Imbalance
Affinity Propagation preferences initialization
You could try using sklearn.model_selection.GridSearchCV or sklearn.model_selection.RandomizedSearchCV.
You could define a custom error measure that steers the hyper-parameter search toward fewer clusters, then search over several preference values and pick the one that performs best on a validation set.
More info: http://scikit-learn.org/stable/modules/grid_search.html
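GridSearchCV's scoring hooks assume a supervised signal, so for a purely unsupervised criterion it is often simpler to drive the same idea with sklearn.model_selection.ParameterGrid and a hand-written error measure. A minimal sketch on toy Euclidean data (the target cluster count and the grid ranges are placeholders you would tune for your own similarity matrix):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.model_selection import ParameterGrid

rng = np.random.default_rng(0)
X = rng.normal(size=(30, 2))  # toy data; substitute your own features

target = 3  # hypothetical "reasonable" number of clusters
grid = ParameterGrid({"preference": np.linspace(-50.0, -1.0, 8),
                      "damping": [0.5, 0.7, 0.9]})

best = None
for params in grid:
    ap = AffinityPropagation(random_state=0, **params).fit(X)
    n = len(ap.cluster_centers_indices_)
    # Custom error: distance from the cluster count we consider reasonable.
    err = abs(n - target)
    if best is None or err < best[0]:
        best = (err, params, n)

print("best error:", best[0], "params:", best[1], "clusters:", best[2])
```

The same loop works with affinity="precomputed" if you pass your similarity matrix instead of X; swapping the error measure (e.g., for a silhouette-style score) only changes the one line computing err.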