
Clustering a sparse dataset of binary vectors

I have a sparse dataset where each data point is described by a vector of 1000 elements, and each element is either 0 or 1 (mostly 0s, with a few 1s). Do you know of a distance function that could help me cluster such data? Is something like Euclidean distance suitable in this case? I am looking for a simple, convenient distance metric that I can try on my data.

Thanks

shn asked Dec 20 '11 08:12



2 Answers

There is no single answer to your question; best practices depend on the domain.

Once you decide on the similarity metric, the clustering is usually done by averaging or by finding a medoid. See these papers on clustering binary data for algorithm examples:

  • Carlos Ordonez. Clustering Binary Data Streams with K-means.
  • Tao Li. A General Model for Clustering Binary Data.
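
As a rough illustration of the medoid route mentioned above, here is a minimal Python sketch. It is not taken from either paper. It assumes NumPy is available, uses Jaccard as the distance purely as an example choice, and the function names `jaccard_distance_matrix` and `k_medoids`, the `init` parameter, and the toy data are all made up for this illustration.

```python
import numpy as np

def jaccard_distance_matrix(X):
    """Pairwise Jaccard distances between the rows of a 0/1 matrix X."""
    X = np.asarray(X, dtype=np.int64)
    inter = X @ X.T                          # shared 1s for every pair of rows
    sizes = X.sum(axis=1)                    # number of 1s in every row
    union = sizes[:, None] + sizes[None, :] - inter
    # Two all-zero rows have an empty union; treat them as identical (assumption).
    sim = np.where(union > 0, inter / np.maximum(union, 1), 1.0)
    return 1.0 - sim

def k_medoids(D, k, init=None, n_iter=100, seed=0):
    """Simple alternating k-medoids on a precomputed distance matrix D."""
    n = D.shape[0]
    if init is None:
        rng = np.random.default_rng(seed)
        medoids = rng.choice(n, size=k, replace=False)
    else:
        medoids = np.asarray(init, dtype=int).copy()
    for _ in range(n_iter):
        labels = np.argmin(D[:, medoids], axis=1)    # assign to nearest medoid
        new_medoids = medoids.copy()
        for c in range(k):
            members = np.flatnonzero(labels == c)
            if members.size == 0:
                continue                             # keep old medoid if cluster is empty
            # New medoid: the member with the smallest total distance to its cluster.
            within = D[np.ix_(members, members)].sum(axis=1)
            new_medoids[c] = members[np.argmin(within)]
        if np.array_equal(new_medoids, medoids):
            break
        medoids = new_medoids
    labels = np.argmin(D[:, medoids], axis=1)
    return medoids, labels

# Toy data: rows 0-2 share columns 0-2, rows 3-5 share columns 4-6.
X = np.array([
    [1, 1, 1, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 1, 0],
    [0, 0, 0, 0, 0, 1, 1, 0],
    [0, 0, 0, 0, 1, 1, 0, 0],
])
D = jaccard_distance_matrix(X)
medoids, labels = k_medoids(D, k=2, init=[0, 1])
print(medoids, labels)
```

On this toy data the first three rows end up in one cluster and the last three in the other. Because the loop only needs the matrix `D`, any other binary distance can be swapped in without touching `k_medoids`. Like k-means, this kind of alternating scheme can stop in a local optimum, so several different starts are usually worth trying.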

For ideas on similarity measures see this online "tool for measuring similarity between binary strings". They mention: Sokal-Michener, Jaccard, Russell-Rao, Hamann, Sorensen, antiDice, Sneath-Sokal, Rogers-Tanimoto, Ochiai, Yule, Anderberg, Kulczynski, Pearson's Phi, Gower2, Dot Product, Cosine Coefficient, and Hamming Distance. They also cite these papers:

  • Luke, B. T., Clustering Binary Objects
  • Lin, D., An Information-Theoretic Definition of Similarity.
  • du Toit, S.H.C.; Steyn, A.G.W.; Stumpf, R.H., Graphical Exploratory Data Analysis, Chapter 3, p. 77, Springer-Verlag, 1986.

(I personally like the cosine. There is also the KL divergence and its symmetric counterpart, the Jensen-Shannon distance.)
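
To make a few of the measures named above concrete, here is a small Python sketch that computes them from the four match counts of two binary vectors. It assumes NumPy; the function name `binary_similarities`, the dictionary keys, the zero-denominator convention, and the example vectors are my own choices for illustration, and the formulas are the usual textbook definitions as I understand them.

```python
import numpy as np

def binary_similarities(x, y):
    """A handful of standard coefficients for two equal-length 0/1 vectors."""
    x = np.asarray(x, dtype=bool)
    y = np.asarray(y, dtype=bool)
    a = int(np.sum(x & y))        # positions where both are 1
    b = int(np.sum(x & ~y))       # 1 in x only
    c = int(np.sum(~x & y))       # 1 in y only
    d = int(np.sum(~x & ~y))      # positions where both are 0
    n = a + b + c + d

    def ratio(num, den):
        # Convention for this sketch (an assumption): 0.0 when the denominator is 0.
        return float(num / den) if den else 0.0

    return {
        "sokal_michener": ratio(a + d, n),                     # simple matching
        "jaccard": ratio(a, a + b + c),
        "russell_rao": ratio(a, n),
        "sorensen_dice": ratio(2 * a, 2 * a + b + c),
        "rogers_tanimoto": ratio(a + d, a + d + 2 * (b + c)),
        "cosine_ochiai": ratio(a, np.sqrt((a + b) * (a + c))),  # cosine on 0/1 data
        "hamming_distance": b + c,
    }

# Two sparse vectors of length 1000 with three and four 1s, two of them shared.
x = np.zeros(1000, dtype=int)
x[[1, 5, 9]] = 1
y = np.zeros(1000, dtype=int)
y[[1, 5, 20, 30]] = 1

scores = binary_similarities(x, y)
for name, value in scores.items():
    print(name, value)
```

For these two vectors the Jaccard coefficient is 0.4, while Sokal-Michener comes out at 0.997 because it also counts the 995 positions where both vectors are 0. On very sparse data, coefficients that reward shared zeros tend to sit close to 1 for almost every pair, which may or may not be what you want in your domain.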

cyborg answered Sep 27 '22 18:09



Have a look at the distance functions used for sparse text vectors, such as cosine distance, and at those used for comparing sets, such as Jaccard distance.
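
Since the vectors are sparse, one way to apply both ideas is to store each vector as the set of positions that hold a 1, so the cost depends on the number of 1s rather than on the 1000 dimensions. The following is a minimal sketch using only the Python standard library; the function names and the handling of empty sets are assumptions made for this example.

```python
import math

def jaccard_distance(a, b):
    """Jaccard distance between two sets of indices where the vector is 1."""
    if not a and not b:
        return 0.0                  # assumption: two empty sets count as identical
    return 1.0 - len(a & b) / len(a | b)

def cosine_distance(a, b):
    """Cosine distance between two 0/1 vectors given as sets of their 1-indices."""
    if not a and not b:
        return 0.0                  # assumption: same convention as above
    if not a or not b:
        return 1.0                  # cosine is undefined for a zero vector; arbitrary choice
    # For 0/1 vectors the dot product is |a & b| and each norm is sqrt(set size).
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))

# Each set lists the positions (out of 1000) that are 1 in the original vector.
u = {1, 5, 9}
v = {1, 5, 20, 30}

print(jaccard_distance(u, v))
print(cosine_distance(u, v))
```

Here the two vectors share two of their five distinct positions, so the Jaccard distance is 0.6. Neither function looks at the positions where both vectors are 0, which is usually the reason these two measures are suggested for sparse data.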

Has QUIT--Anony-Mousse answered Sep 27 '22 17:09
