
Affinity propagation preference parameter

I've had encouraging results clustering a set of entity names using scikit-learn's affinity propagation implementation, with a modified Jaro-Winkler distance as the similarity metric, but my clusters are still too numerous (i.e., too many false positives).

I see in the scikit-learn documentation that there exists a 'preference' parameter that affects the number of clusters, with the following description:

preference : array-like, shape (n_samples,) or float, optional

Preferences for each point - points with larger values of preferences are more likely to be chosen as exemplars. The number of exemplars, ie of clusters, is influenced by the input preferences value. If the preferences are not passed as arguments, they will be set to the median of the input similarities.[0]

However, when I began tinkering with this value, I found that a very narrow range of values was giving me either too many clusters (preference=-11.13) or too few clusters (preference=-11.11).

Is there some way to determine what a 'reasonable' value of the preference parameter should be? And why would it be that I'm unable to obtain a non-extreme number of clusters?

Similar questions:

Affinity Propagation - Cluster Imbalance

Affinity Propagation preferences initialization

asked Apr 24 '17 14:04 by nitrl


1 Answer

You could try using sklearn.model_selection.GridSearchCV or sklearn.model_selection.RandomizedSearchCV.

You could define a custom error measure that encourages the hyper-parameter search to generate smaller clusters. Then you can search several values to find one that is good for your dataset based on a validation set.
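As a rough illustration of such a search, here is a minimal sketch. It makes several assumptions that are not part of the original answer: it uses sklearn.model_selection.ParameterGrid with a plain loop instead of GridSearchCV (cross-validation splits can be awkward with a precomputed similarity matrix), it uses the silhouette score as a stand-in for the custom error measure, and it draws candidate preferences from quantiles of the off-diagonal similarities. The toy similarity matrix (negative squared Euclidean distance between random points) and the helper name sweep_preference are hypothetical; you would substitute your own Jaro-Winkler similarity matrix.

```python
import warnings

import numpy as np
from sklearn.cluster import AffinityPropagation
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import silhouette_score
from sklearn.model_selection import ParameterGrid


def sweep_preference(similarity, preferences, random_state=0):
    """Fit AffinityPropagation once per candidate preference.

    similarity: precomputed square matrix, larger means more similar.
    Returns a list of (preference, n_clusters, silhouette) tuples;
    silhouette is NaN when it is undefined for the labelling.
    """
    n_samples = similarity.shape[0]
    # Assumed conversion: silhouette_score wants non-negative distances
    # with a zero diagonal, so flip the similarities around their maximum.
    distance = similarity.max() - similarity
    np.fill_diagonal(distance, 0.0)

    results = []
    for params in ParameterGrid({"preference": list(preferences)}):
        model = AffinityPropagation(
            affinity="precomputed",
            damping=0.9,
            max_iter=1000,
            random_state=random_state,
            **params,
        )
        with warnings.catch_warnings():
            # Non-converged fits are still recorded, with whatever labels they produced.
            warnings.simplefilter("ignore", ConvergenceWarning)
            labels = model.fit(similarity).labels_
        n_clusters = int(np.unique(labels[labels >= 0]).size)
        # Silhouette is only defined for 2..n_samples-1 clusters.
        if (labels >= 0).all() and 2 <= n_clusters <= n_samples - 1:
            score = float(silhouette_score(distance, labels, metric="precomputed"))
        else:
            score = float("nan")
        results.append((float(params["preference"]), n_clusters, score))
    return results


# Toy data standing in for a real similarity matrix (an assumption).
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])
points = np.vstack([c + rng.normal(scale=0.5, size=(20, 2)) for c in centers])
diff = points[:, None, :] - points[None, :, :]
similarity = -np.sum(diff ** 2, axis=-1)

# Candidate preferences: minimum, 10%, 25% and median of the off-diagonal similarities.
off_diagonal = similarity[~np.eye(len(points), dtype=bool)]
candidates = np.quantile(off_diagonal, [0.0, 0.1, 0.25, 0.5])

results = sweep_preference(similarity, candidates)
for preference, n_clusters, score in results:
    print(f"preference={preference:10.3f}  clusters={n_clusters:3d}  silhouette={score:.3f}")
```

Printing the cluster count next to the score for each candidate makes it easier to see where the count jumps, which may help when only a very narrow range of preferences seems to matter. Raising damping and max_iter, as done here, is one thing to try if fits fail to converge, though whether it helps on your data is not guaranteed.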

More info: http://scikit-learn.org/stable/modules/grid_search.html

answered Oct 19 '22 01:10 by Erotemic