
How to calculate different well-known similarity or distance measures between two vectors in R?

I want to compute the similarity (distance) between two vectors:

v1 <- c(1, 0.5, 0, 0.1)
v2 <- c(0.7, 1, 0.2, 0.1)

Is there an R package for calculating different well-known similarity (distance) measures? For example, "Resnik", "Lin", "Rel", "Jiang", ...

Implementing these methods is not hard, but I expect they are already defined in some R package.

After some googling I found the package "GOSemSim", which contains most of these measures, but it is specific to biomedical applications and I can't use it to compute the similarity between two vectors.

Amir H. Jadidinejad asked Jan 05 '14 02:01




2 Answers

The "proxy" package is a general-purpose library for distance and similarity measures. It supports the following methods:

"Jaccard" "Kulczynski1" "Kulczynski2" "Mountford" "Fager" "Russel" "simple matching" "Hamman" "Faith"
"Tanimoto" "Dice" "Phi" "Stiles" "Michael" "Mozley" "Yule" "Yule2" "Ochiai"
"Simpson" "Braun-Blanquet" "cosine" "eJaccard" "fJaccard" "correlation" "Chi-squared" "Phi-squared" "Tschuprow"
"Cramer" "Pearson" "Gower" "Euclidean" "Mahalanobis" "Bhjattacharyya" "Manhattan" "supremum" "Minkowski"
"Canberra" "Wave" "divergence" "Kullback" "Bray" "Soergel" "Levenshtein" "Podani" "Chord"
"Geodesic" "Whittaker" "Hellinger"

Check the following example:

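library(proxy)  # provides simil() and dist()

x <- c(1, 2, 3, 4, 5)
y <- c(4, 5, 6, 7, 8)
l <- list(x, y)  # each list element is treated as one vector
simil(l, method = "cosine")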

The output is a similarity matrix between the elements of "l":

      1
2     0.978232
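
As a cross-check that does not depend on any package, the cosine similarity can also be computed by hand in base R. This is a minimal sketch assuming the standard definition (dot product divided by the product of the Euclidean norms); `cosine_sim` is a hypothetical helper name, not a function from proxy.

```r
# Hypothetical helper: cosine similarity of two equal-length numeric vectors.
cosine_sim <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(a * b) / (sqrt(sum(a^2)) * sqrt(sum(b^2)))
}

x <- c(1, 2, 3, 4, 5)
y <- c(4, 5, 6, 7, 8)
cosine_sim(x, y)   # approximately 0.978232, matching the value above
```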

The only problem I have is that some methods (such as "Jaccard") fail with the following error:

simil(l, method="Jaccard")
Error in n - d : 'n' is missing
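
This does not diagnose the error above, but for real-valued vectors a possible workaround is to compute the extended Jaccard (Tanimoto) coefficient directly. The sketch below is a hand-rolled base R version assuming the usual definition, dot product divided by (sum of squares of both vectors minus the dot product). `ext_jaccard` is a hypothetical helper name, and whether it agrees numerically with proxy's "eJaccard" entry is an assumption that has not been verified here.

```r
# Hypothetical helper: extended Jaccard (Tanimoto) coefficient
# for two equal-length numeric vectors.
ext_jaccard <- function(a, b) {
  stopifnot(length(a) == length(b))
  dot <- sum(a * b)
  dot / (sum(a^2) + sum(b^2) - dot)
}

x <- c(1, 2, 3, 4, 5)
y <- c(4, 5, 6, 7, 8)
ext_jaccard(x, y)   # 100 / (55 + 190 - 100) = 100 / 145
```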
Amir H. Jadidinejad answered Oct 16 '22 17:10



The dist function in base R (package stats) supports the following distance measures via its method argument: "euclidean", "maximum", "manhattan", "canberra", "binary" or "minkowski". See ?dist for details.
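
Applied to the two vectors from the question, a minimal sketch might look like the following. It relies on the fact that `dist` computes distances between the rows of a matrix, so the vectors are stacked with `rbind` first; the variable names `m`, `d_euclidean`, `d_manhattan` and `d_maximum` are chosen here for illustration only.

```r
v1 <- c(1, 0.5, 0, 0.1)
v2 <- c(0.7, 1, 0.2, 0.1)
m <- rbind(v1, v2)   # one vector per row

d_euclidean <- dist(m, method = "euclidean")  # sqrt(0.3^2 + 0.5^2 + 0.2^2)
d_manhattan <- dist(m, method = "manhattan")  # 0.3 + 0.5 + 0.2
d_maximum   <- dist(m, method = "maximum")    # largest absolute difference

# dist() returns a "dist" object; as.numeric() extracts the single value.
as.numeric(d_euclidean)
```

Note that `dist` returns distances, not similarities, so smaller values mean the vectors are closer.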

G. Grothendieck answered Oct 16 '22 17:10
