
Clustering on a large dataset

I'm trying to cluster a large (gigabyte-scale) dataset. In order to cluster, you need the distance of every point to every other point, so you end up with an N^2-sized distance matrix, which in the case of my dataset would be on the order of exabytes. pdist in Matlab blows up instantly, of course ;)

Is there a way to cluster subsets of the large data first, and then maybe do some merging of similar clusters?

I don't know if this helps any, but the data are fixed-length binary strings, so I'm calculating their distances using the Hamming distance (distance = number of set bits in string1 XOR string2).
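For concreteness, a minimal sketch of that distance computation (illustrative Python rather than my Matlab code; it assumes each string is packed into an integer):

    def hamming(a: int, b: int) -> int:
        # XOR marks the differing bit positions; counting the set bits gives the distance.
        return (a ^ b).bit_count()   # Python 3.10+; use bin(a ^ b).count("1") on older versions

    # 0b1011 and 0b0010 differ in two bit positions.
    assert hamming(0b1011, 0b0010) == 2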

asked Nov 15 '22 by Marcin


1 Answer

A simplified version of the nice method from Tabei et al., Single versus Multiple Sorting in All Pairs Similarity Search, say for pairs at Hamming distance 1:

  • sort all the bit strings on their first 32 bits
  • look at blocks of strings where the first 32 bits are all the same; these blocks will be relatively small
  • pdist each of these blocks for Hammingdist(left 32) = 0 + Hammingdist(the rest) <= 1.

This misses the fraction (e.g. 32/128) of the nearby pairs which have Hammingdist(left 32) = 1 + Hammingdist(the rest) = 0. If you really want those too, repeat the above with "first 32" -> "last 32".
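A rough sketch of the distance-1 pass in Python (illustrative only, not code from the paper; it groups on the left 32 bits with a dict, which yields the same blocks as sorting, and assumes each string is packed into a `width`-bit integer):

    from collections import defaultdict

    def pairs_within_hamming1(strings, width=128):
        """Illustrative helper: report pairs at Hamming distance <= 1
        whose left 32 bits agree, by blocking on those bits."""
        shift = width - 32
        blocks = defaultdict(list)
        for s in strings:
            blocks[s >> shift].append(s)        # block = strings sharing the left 32 bits
        pairs = []
        for block in blocks.values():           # blocks are usually small, so O(block^2) is fine
            for i in range(len(block)):
                for j in range(i + 1, len(block)):
                    if bin(block[i] ^ block[j]).count("1") <= 1:
                        pairs.append((block[i], block[j]))
        return pairs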

The method can be extended. Take for example Hammingdist <= 2 on 4 32-bit words: the mismatches must split across the words like one of 2000 0200 0020 0002 1100 1010 1001 0110 0101 0011, so at least 2 of the 4 words must match exactly; sort on each such pair of words in turn.
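To see how many passes that is: one pass per choice of 2 exactly-matching words suffices, i.e. C(4, 2) = 6 passes. A tiny illustrative enumeration in Python:

    from itertools import combinations

    # Hammingdist <= 2 over 4 words: every split above leaves at least 2 words with
    # 0 mismatches, so one sort/block pass per pair of exactly-matching words
    # covers all such near pairs.
    print(list(combinations(range(4), 2)))   # [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]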

(Btw, sketchsort-0.0.7.tar is 99 % src/boost/, build/, .svn/.)

answered Dec 19 '22 by denis