
Clustering huge vector space

I'm running some tests clustering a large number of very large sparse vectors representing the term frequency–inverse document frequency (tf-idf) of various hypertext documents. What algorithm would you suggest for clustering this data, taking the proportions of the data set into account? The dimension of the vectors would be > 3·10^5 and the number of vectors could be around 10^9. I've taken a look at the DBSCAN and OPTICS algorithms. The number of clusters is not known a priori, and a spatial index with such high dimensionality seems complicated.

asked Oct 08 '09 by piotr

1 Answer

I've had almost as good results with simple K-means clustering as with anything else, and it's definitely faster than most of the alternatives. I've also gotten good results with pairwise agglomeration, but it's quite a bit slower. For K-means you do have to start with some estimated number of clusters, but you can adjust it algorithmically as you go along: if you find two clusters whose means are too close together, reduce the number of clusters; if you find a cluster with too large a range of variation, try more clusters. I've found sqrt(N) to be a reasonable starting point -- but I'm usually starting with more like 10^7 documents rather than 10^9. For 10^9, it might make sense to reduce that somewhat.
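The merge/split adjustment described above can be sketched roughly like this. This is a toy numpy version on small dense 2-D points, not tf-idf scale; the `merge_tol` and `split_var` thresholds, the farthest-point initialization, and the function names are all illustrative assumptions, and a naive threshold choice can make the merge and split steps oscillate:

```python
import numpy as np

def farthest_point_init(X, k):
    # deterministic farthest-point initialization (a simple stand-in
    # for k-means++-style seeding)
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers, dtype=float)

def kmeans(X, k, iters=25):
    # plain Lloyd iterations: assign points to nearest center, recompute means
    centers = farthest_point_init(X, k)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, labels

def adaptive_kmeans(X, k, merge_tol=3.0, split_var=4.0):
    centers, labels = kmeans(X, k)
    # merge: two means closer than merge_tol -> rerun with fewer clusters
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if k > 1 and np.linalg.norm(centers[i] - centers[j]) < merge_tol:
                return adaptive_kmeans(X, k - 1, merge_tol, split_var)
    # split: a cluster with too much variation -> rerun with more clusters
    for j in range(len(centers)):
        pts = X[labels == j]
        if len(pts) and pts.var(axis=0).sum() > split_var:
            return adaptive_kmeans(X, k + 1, merge_tol, split_var)
    return centers, labels
```

Starting it with a deliberately wrong k on two well-separated blobs, the merge step collapses the redundant center and it settles on two clusters.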

If it were up to me, however, I'd think very hard about starting by reducing the dimensionality with something like Landmark MDS, then doing the clustering.
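The Landmark MDS idea is roughly: run classical MDS on a small random subset of "landmark" points, then place every other point by triangulating from its distances to those landmarks. A minimal dense numpy sketch of that scheme follows; it is an illustration under simplifying assumptions (dense Euclidean data, random landmark choice, no sparsity handling), not something that would run as-is on 10^9 sparse tf-idf vectors:

```python
import numpy as np

def landmark_mds(X, n_landmarks, out_dim, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), n_landmarks, replace=False)
    L = X[idx]
    # squared Euclidean distances among the landmarks
    D = ((L[:, None] - L[None]) ** 2).sum(-1)
    # classical MDS on the landmarks: double-center, then eigendecompose
    n = n_landmarks
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:out_dim]
    vals, vecs = vals[order], vecs[:, order]
    # pseudo-inverse map used to triangulate non-landmark points
    pinv = vecs / np.sqrt(np.maximum(vals, 1e-12))
    mu = D.mean(axis=0)
    # embed every point from its squared distances to the landmarks
    D_all = ((X[:, None] - L[None]) ** 2).sum(-1)
    return -0.5 * (D_all - mu) @ pinv
```

On toy data with two well-separated clusters in 20 dimensions, the 2-D embedding keeps the clusters far apart relative to their internal spread, which is the property that matters if clustering happens afterwards.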

answered Oct 14 '22 by Jerry Coffin