I'm trying to cluster a large (gigabyte) dataset. In order to cluster, you need the distance from every point to every other point, so you end up with an N^2-sized distance matrix, which in the case of my dataset would be on the order of exabytes. pdist in Matlab blows up instantly, of course ;)
Is there a way to cluster subsets of the large data first, and then maybe do some merging of similar clusters?
I don't know if this helps any, but the data are fixed-length binary strings, so I'm calculating their distances with the Hamming distance (the number of set bits in string1 XOR string2).
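For concreteness, a minimal sketch of that distance, assuming each binary string is packed into an integer (Python ints standing in for e.g. 128-bit strings):

    # Hamming distance of two fixed-length bit strings packed into integers:
    # XOR marks the differing bit positions, popcount counts them.
    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    print(hamming(0b10110010, 0b10010011))   # two bits differ -> 2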
A simplified version of the nice method from Tabei et al., Single versus Multiple Sorting in All Pairs Similarity Search, say for pairs with Hammingdist 1:
- sort all the bit strings on their first 32 bits
- look at the blocks of strings whose first 32 bits are identical; these blocks are relatively small
- within each block, pairwise-check the remaining bits for Hammingdist 0 or 1.
This misses roughly the fraction 32/128 of the nearby pairs, those with Hammingdist( first 32 ) 1 + Hammingdist( the rest ) 0, since they land in different blocks. If you really want these, repeat the above with "first 32" -> "last 32".
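A minimal Python sketch of those two passes, assuming the strings are 128-bit values packed into integers; the name near_pairs and its parameters are illustrative, not from the paper:

    from collections import defaultdict
    from itertools import combinations

    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    def near_pairs(strings, total_bits=128, key_bits=32):
        # Pass 1 blocks on the first 32 bits, pass 2 on the last 32 bits,
        # so every pair at Hamming distance <= 1 shares a block in some pass.
        keyfuncs = (
            lambda s: s >> (total_bits - key_bits),   # first 32 bits
            lambda s: s & ((1 << key_bits) - 1),      # last 32 bits
        )
        found = set()
        for key in keyfuncs:
            blocks = defaultdict(list)
            for i, s in enumerate(strings):
                blocks[key(s)].append(i)
            for block in blocks.values():             # blocks are small,
                for i, j in combinations(block, 2):   # so this stays cheap
                    if hamming(strings[i], strings[j]) <= 1:
                        found.add((min(i, j), max(i, j)))
        return found

The win is that the quadratic check runs only inside each small block instead of over all N^2 pairs.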
The method can be extended. Take for example Hammingdist <= 2 on 4 32-bit words: the mismatches must split across the words like one of 2000 0200 0020 0002 1100 1010 1001 0110 0101 0011, so at least 2 of the 4 words must match exactly; sort on each such pair of words, the same way as above.
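A sketch of that extension under the same assumptions (128-bit strings as integers; pairs_within_2 and words_of are made-up names): block on every pair of the 4 word positions, six sortings in all, and verify candidates inside each block:

    from collections import defaultdict
    from itertools import combinations

    def hamming(a: int, b: int) -> int:
        return bin(a ^ b).count("1")

    def words_of(s: int, n_words=4, word_bits=32):
        # Split a 128-bit integer into 4 32-bit words, most significant first.
        return tuple((s >> (word_bits * (n_words - 1 - k))) & ((1 << word_bits) - 1)
                     for k in range(n_words))

    def pairs_within_2(strings):
        # Hammingdist <= 2 means at least 2 of the 4 words match exactly,
        # so block on every pair of word positions (6 sortings) and verify.
        split = [words_of(s) for s in strings]
        found = set()
        for keep in combinations(range(4), 2):        # the 2 positions required to match
            blocks = defaultdict(list)
            for i, w in enumerate(split):
                blocks[(w[keep[0]], w[keep[1]])].append(i)
            for block in blocks.values():
                for i, j in combinations(block, 2):
                    if hamming(strings[i], strings[j]) <= 2:
                        found.add((min(i, j), max(i, j)))
        return found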
(Btw, sketchsort-0.0.7.tar is 99% src/boost/, build/ and .svn/.)