Clustering a sparse dataset of binary vectors

Question

I have a sparse dataset in which each data point is described by a vector of 1000 elements, each either 0 or 1 (mostly 0s, with a few 1s). Is there a distance function that would help me cluster these vectors? Is something like Euclidean distance suitable in this case? I am looking for a simple, convenient distance metric to try on my data.

Thanks

cyborg · Accepted Answer

Your question doesn't have a single answer. Best practices depend on the domain.

Once you decide on the similarity metric, the clustering is usually done by averaging (computing a centroid, as in k-means) or by finding a medoid. See these papers on clustering binary data for algorithm examples:
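To make the two options concrete, here is a minimal sketch of how a cluster representative could be computed either way. The function names (`hamming`, `medoid`, `mean_vector`) and the choice of Hamming distance are illustrative assumptions, not taken from the papers listed here; any other distance function could be passed in instead.

```python
def hamming(u, v):
    """Number of positions where two equal-length binary vectors differ."""
    return sum(a != b for a, b in zip(u, v))


def medoid(cluster, dist=hamming):
    """Return the member with the smallest total distance to all members."""
    return min(cluster, key=lambda x: sum(dist(x, y) for y in cluster))


def mean_vector(cluster):
    """Per-dimension fraction of ones (a k-means style centroid)."""
    n = len(cluster)
    return [sum(column) / n for column in zip(*cluster)]


# Toy cluster of three short binary vectors (hypothetical data).
cluster = [
    [1, 0, 0, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 0, 1, 0],
]
center = medoid(cluster)         # an actual member of the cluster
centroid = mean_vector(cluster)  # generally not a binary vector
```

Note that the averaged centroid is usually no longer binary, whereas the medoid is always one of the original vectors and works with any distance function.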

Carlos Ordonez. Clustering Binary Data Streams with K-means.

Tao Li. A General Model for Clustering Binary Data.

For ideas on similarity measures, see this online "tool for measuring similarity between binary strings". It lists: Sokal-Michener, Jaccard, Russell-Rao, Hamann, Sorensen, antiDice, Sneath-Sokal, Rogers-Tanimoto, Ochiai, Yule, Anderberg, Kulczynski, Pearson's Phi, Gower2, Dot Product, Cosine Coefficient, and Hamming Distance. The tool also cites these references:
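Several of these measures are simple ratios of the four match counts between two binary vectors: a (both 1), b (1 in the first only), c (1 in the second only), and d (both 0). The sketch below computes a handful of them using their standard definitions; the function names and the convention of returning 0.0 when a denominator is zero are assumptions made for illustration.

```python
from math import sqrt


def match_counts(u, v):
    """Return (a, b, c, d): counts of 1/1, 1/0, 0/1 and 0/0 positions."""
    a = sum(1 for x, y in zip(u, v) if x == 1 and y == 1)
    b = sum(1 for x, y in zip(u, v) if x == 1 and y == 0)
    c = sum(1 for x, y in zip(u, v) if x == 0 and y == 1)
    d = sum(1 for x, y in zip(u, v) if x == 0 and y == 0)
    return a, b, c, d


def ratio(num, den):
    """Safe division; returning 0.0 for an empty denominator is an arbitrary choice."""
    return num / den if den else 0.0


def similarities(u, v):
    a, b, c, d = match_counts(u, v)
    n = a + b + c + d
    return {
        "sokal_michener": ratio(a + d, n),       # simple matching
        "russell_rao": ratio(a, n),
        "jaccard": ratio(a, a + b + c),          # ignores 0/0 matches
        "sorensen_dice": ratio(2 * a, 2 * a + b + c),
        "ochiai": ratio(a, sqrt((a + b) * (a + c))),
        "hamming_distance": b + c,               # a distance, not a similarity
    }


u = [1, 1, 0, 0, 1, 0, 0, 0]
v = [1, 0, 0, 0, 1, 1, 0, 0]
scores = similarities(u, v)
```

For sparse data the 0/0 count d dominates, so measures that include it (such as Sokal-Michener) tend to rate almost every pair as similar, while Jaccard, Sorensen and Ochiai ignore d. For 0/1 vectors the Hamming distance also equals the squared Euclidean distance, so Euclidean distance behaves like a count of mismatches.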

Luke, B. T., Clustering Binary Objects

Lin, D., An Information-Theoretic Definition of Similarity.

du Toit, S.H.C.; Steyn, A.G.W.; Stumpf, R.H., Graphical Exploratory Data Analysis, Chapter 3, p. 77, Springer-Verlag, 1986.

(I personally like cosine similarity. There is also KL divergence and its symmetric counterpart, the Jensen-Shannon distance.)

Has QUIT--Anony-Mousse · Answer

Have a look at distance functions used for sparse text vectors, such as cosine distance, and at those used for comparing sets, such as the Jaccard distance.
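As a rough illustration, a sparse binary vector can be stored as the set of indices where it is 1, which makes both distances cheap to compute. The function names and the handling of empty sets below are assumptions for this sketch, not part of any particular library.

```python
from math import sqrt


def jaccard_distance(s, t):
    """1 - |s & t| / |s | t|; two empty sets are treated as identical."""
    union = len(s | t)
    if union == 0:
        return 0.0
    return 1.0 - len(s & t) / union


def cosine_distance(s, t):
    """1 - cosine similarity of the 0/1 vectors whose nonzero indices are s and t."""
    if not s or not t:
        return 1.0  # assumed convention when one vector is all zeros
    return 1.0 - len(s & t) / sqrt(len(s) * len(t))


# Two 1000-dimensional binary vectors, stored as the indices of their ones.
x = {3, 17, 256, 901}
y = {3, 17, 400}

d_jaccard = jaccard_distance(x, y)
d_cosine = cosine_distance(x, y)
```

Both functions only touch the nonzero positions, so the cost depends on the number of ones rather than on the 1000 dimensions.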

Tags: distance, cluster-analysis, sparse-matrix

shn
