
Clustering huge vector space

I'm running some tests clustering a large number of very large sparse vectors representing the term frequency–inverse document frequency (tf-idf) of various hypertext documents. What algorithm would you suggest for clustering this data, taking the proportions of the data set into account? The dimension of the vectors would be > 3·10^5 and the number of vectors could be around 10^9. I've taken a look at the DBSCAN and OPTICS algorithms. The number of clusters is not known a priori, and a spatial index with such high dimensionality seems complicated.

asked Oct 08 '09 by piotr

1 Answer

I've had almost as good results with simple K-means clustering as with anything else, and it's definitely faster than most of the alternatives. I've also gotten good results with pairwise agglomeration, but it's quite a bit slower. For K-means you do have to start with some estimated number of clusters, but you can adjust it algorithmically as you go along: if you find two clusters whose means are too close together, reduce the number of clusters; if you find a cluster with too large a range of variation, try more clusters. I've found sqrt(N) to be a reasonable starting point -- but I'm usually starting with more like 10^7 documents rather than 10^9. For 10^9, it might make sense to reduce that somewhat.
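The merge/split adjustment described above can be sketched roughly like this. This is a toy numpy version on small dense 2-D points, not tf-idf scale; the `merge_tol` and `split_var` thresholds, the farthest-point initialization, and the function names are all illustrative assumptions, and a naive threshold choice can make the merge and split steps oscillate:

```python
import numpy as np

def farthest_point_init(X, k):
    # deterministic farthest-point initialization (a simple stand-in
    # for k-means++-style seeding)
    centers = [X[0]]
    for _ in range(k - 1):
        d = np.min([np.linalg.norm(X - c, axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    return np.array(centers, dtype=float)

def kmeans(X, k, iters=25):
    # plain Lloyd iterations: assign points to nearest center, recompute means
    centers = farthest_point_init(X, k)
    for _ in range(iters):
        d = np.linalg.norm(X[:, None] - centers[None], axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            pts = X[labels == j]
            if len(pts):
                centers[j] = pts.mean(axis=0)
    return centers, labels

def adaptive_kmeans(X, k, merge_tol=3.0, split_var=4.0):
    centers, labels = kmeans(X, k)
    # merge: two means closer than merge_tol -> rerun with fewer clusters
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            if k > 1 and np.linalg.norm(centers[i] - centers[j]) < merge_tol:
                return adaptive_kmeans(X, k - 1, merge_tol, split_var)
    # split: a cluster with too much variation -> rerun with more clusters
    for j in range(len(centers)):
        pts = X[labels == j]
        if len(pts) and pts.var(axis=0).sum() > split_var:
            return adaptive_kmeans(X, k + 1, merge_tol, split_var)
    return centers, labels
```

Starting it with a deliberately wrong k on two well-separated blobs, the merge step collapses the redundant center and it settles on two clusters.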

If it were up to me, however, I'd think very hard about starting by reducing the dimensionality with something like Landmark MDS, then doing the clustering.
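The Landmark MDS idea is roughly: run classical MDS on a small random subset of "landmark" points, then place every other point by triangulating from its distances to those landmarks. A minimal dense numpy sketch of that scheme follows; it is an illustration under simplifying assumptions (dense Euclidean data, random landmark choice, no sparsity handling), not something that would run as-is on 10^9 sparse tf-idf vectors:

```python
import numpy as np

def landmark_mds(X, n_landmarks, out_dim, seed=0):
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), n_landmarks, replace=False)
    L = X[idx]
    # squared Euclidean distances among the landmarks
    D = ((L[:, None] - L[None]) ** 2).sum(-1)
    # classical MDS on the landmarks: double-center, then eigendecompose
    n = n_landmarks
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ D @ J
    vals, vecs = np.linalg.eigh(B)
    order = np.argsort(vals)[::-1][:out_dim]
    vals, vecs = vals[order], vecs[:, order]
    # pseudo-inverse map used to triangulate non-landmark points
    pinv = vecs / np.sqrt(np.maximum(vals, 1e-12))
    mu = D.mean(axis=0)
    # embed every point from its squared distances to the landmarks
    D_all = ((X[:, None] - L[None]) ** 2).sum(-1)
    return -0.5 * (D_all - mu) @ pinv
```

On toy data with two well-separated clusters in 20 dimensions, the 2-D embedding keeps the clusters far apart relative to their internal spread, which is the property that matters if clustering happens afterwards.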

answered Oct 14 '22 by Jerry Coffin