Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Calculating Hamming distance for two vectors in R?

Tags:

r

I'm trying just to calculate the Hamming distance between two vectors in R. I'm currently attempting to use the "e1071" package, and the hamming.distance function, as follows:

library(e1071)
H <- hamming.distance(X) 

Where X is a data.frame with 2 rows and (in my particular data) 667 columns, and every observation is 0 or 1.

Initially I got the error:

Error: evaluation nested too deeply: infinite recursion / options(expressions=)?

After some research, it appeared that one fix might be increasing the basic option in R. This I did via options(expressions=5000), and then tried varying values in place of the 5000. But this only produced the error:

Error: C stack usage is too close to the limit

I'm not much of a programmer, and the fixes for this most recent error appear to have to do with something inside the package e1071 possibly not being called correctly (or at the right time).

Any ideas on what I'm doing wrong? I eventually want the Hamming distances between a large number of vectors, and this was just a starting point. If this has to do with memory allocation, any suggestions for how to deal with it?

like image 231
Savage Henry Avatar asked Jan 25 '12 02:01

Savage Henry


2 Answers

I don't know how hamming.distance works internally, but a simple way to calculate the distance for 2 vectors is just

sum(x1 != x2)

or, in this case,

sum(X[1,] != X[2,])

If the total number of vectors is not too large (up to, say, a few thousand), you could implement this in a nested loop:

n <- nrow(X)
m <- matrix(nrow=n, ncol=n)
for(i in seq_len(n - 1))
    for(j in seq(i, n))
        m[j, i] <- m[i, j] <- sum(X[i,] != X[j,])

Caveat: untested.

like image 132
Hong Ooi Avatar answered Oct 04 '22 21:10

Hong Ooi


WARNING ABOUT USING HAMMING.DISTANCE FROM PACKAGE e1071!

This package's implementation forces the objects being compared to booleans with as.logical. This means that values of 0 will be FALSE and any non-zero values will be TRUE. This means that for the sequence: 0 1 2 compared to 0 1 1 the hamming distance will be reported as 0 instead of the correct value of 1 -- this package will treat 1 and 2 as equal since as.logical(1) == as.logical(2).

Here is the faulty (in my view) implementation:

    > library("e1071", lib.loc="C:/Program Files/R/R-2.15.3/library")
    Loading required package: class
    > hamming.distance
    function (x, y) 
    {
        z <- NULL
        if (is.vector(x) && is.vector(y)) {
            z <- sum(as.logical(x) != as.logical(y))
        }
        else {
            z <- matrix(0, nrow = nrow(x), ncol = nrow(x))
            for (k in 1:(nrow(x) - 1)) {
                for (l in (k + 1):nrow(x)) {
                    z[k, l] <- hamming.distance(x[k, ], x[l, ])
                    z[l, k] <- z[k, l]
                }
            }
            dimnames(z) <- list(dimnames(x)[[1]], dimnames(x)[[1]])
        }
        z
    }
    <environment: namespace:e1071>

My recommendation: DO NOT USE. Hamming distance is trivial to implement as noted several times above.

like image 39
PeterVermont Avatar answered Oct 04 '22 21:10

PeterVermont