I have a sparse dataset where each data point is described by a vector of 1000 elements, and each element is either 0 or 1 (mostly 0s, with a few 1s). Do you know of a distance function that would help me cluster these vectors? Is something like Euclidean distance suitable in this case? I would like to know if there is a simple, convenient distance metric for this situation that I can try on my data.
Thanks
Your question doesn't have a single answer. Best practices depend on the domain.
Once you decide on the similarity metric, the clustering is usually done by averaging or by finding a medoid. See these papers on clustering binary data for algorithm examples:
- Carlos Ordonez. Clustering Binary Data Streams with K-means. PDF
- Tao Li. A General Model for Clustering Binary Data. PDF
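As a rough illustration of the "finding a medoid" step mentioned above, here is a minimal sketch. It is not taken from either paper. It assumes each binary vector is stored as a Python set holding the indices of its 1s, and the function name `medoid` and the toy cluster are made up for the example.

```python
def medoid(points, dist):
    """Return the index of the point with the smallest total distance to all points."""
    best_index, best_cost = None, float("inf")
    for i, p in enumerate(points):
        cost = sum(dist(p, q) for q in points)
        if cost < best_cost:
            best_index, best_cost = i, cost
    return best_index


def hamming(a, b):
    """Number of positions where two binary vectors differ (sets of 1-indices)."""
    return len(a ^ b)


# Hypothetical toy cluster: each set lists the positions that are 1.
cluster = [{1, 2, 3}, {1, 2}, {1, 2, 4}, {7, 8}]
center = medoid(cluster, hamming)
print(center, cluster[center])
```

Any other distance function with the same two-argument shape can be passed in place of `hamming`. The loop is quadratic in the cluster size, which is fine for a first experiment but may need a smarter approach on large clusters.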
For ideas on similarity measures, see this online "tool for measuring similarity between binary strings". They mention: Sokal-Michener, Jaccard, Russell-Rao, Hamann, Sorensen, antiDice, Sneath-Sokal, Rogers-Tanimoto, Ochiai, Yule, Anderberg, Kulczynski, Pearson's Phi, Gower2, Dot Product, Cosine Coefficient, and Hamming Distance. They also cite these papers:
- Luke, B. T., Clustering Binary Objects
- Lin, D., An Information-Theoretic Definition of Similarity.
- du Toit, S.H.C.; Steyn, A.G.W.; Stumpf, R.H., Graphical Exploratory Data Analysis, Chapter 3, p. 77. Springer-Verlag, 1986.
(I personally like the cosine. There is also KL-divergence, and its symmetric counterpart, the Jensen-Shannon divergence.)
Have a look at distance functions used for sparse text vectors, such as Cosine Distance and for comparing sets, such as the Jaccard distance.
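To make these suggestions concrete, here is a small sketch of Jaccard, cosine, and Hamming distance for sparse 0/1 vectors. It assumes each vector is stored as a set of the indices that are 1, which is a convenient representation when most entries are 0. The function names and the handling of all-zero vectors are choices made for this example, not a standard.

```python
import math


def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets of 1-indices."""
    union = len(a | b)
    if union == 0:
        return 0.0  # assumption: two all-zero vectors count as identical
    return 1.0 - len(a & b) / union


def cosine_distance(a, b):
    """1 - cosine similarity; for 0/1 vectors the dot product is |A ∩ B|."""
    if not a or not b:
        return 1.0  # assumption: an all-zero vector is maximally distant
    return 1.0 - len(a & b) / math.sqrt(len(a) * len(b))


def hamming_distance(a, b):
    """Number of differing positions; equals squared Euclidean distance on 0/1 data."""
    return len(a ^ b)


# Two hypothetical 1000-element vectors with three 1s each.
x = {3, 17, 256}
y = {3, 17, 900}
print(jaccard_distance(x, y), cosine_distance(x, y), hamming_distance(x, y))
```

Jaccard and cosine ignore positions where both vectors are 0, so the many shared zeros in sparse data do not make every pair look similar. Hamming and Euclidean distance depend only on the positions that differ, which tends to make very sparse vectors look close to one another regardless of whether they share any 1s.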