Computing pairwise Hamming distance between all rows of two integer matrices/data frames

Question

I have two data frames, df1 with reference data and df2 with new data. For each row in df2, I need to find the best (and the second best) matching row to df1 in terms of hamming distance.

I used e1071 package to compute hamming distance. Hamming distance between two vectors x and y can be computed as for example:

x <- c(356739, 324074, 904133, 1025460, 433677, 110525, 576942, 526518, 299386,
       92497, 977385, 27563, 429551, 307757, 267970, 181157, 3796, 679012, 711274,
       24197, 610187, 402471, 157122, 866381, 582868, 878)

y <- c(356739, 324042, 904133, 959893, 433677, 110269, 576942, 2230, 267130,
       92496, 960747, 28587, 429551, 438825, 267970, 181157, 36564, 677220,
       711274, 24485, 610187, 404519, 157122, 866413, 718036, 876)

xm <- sapply(x, intToBits)
ym <- sapply(y, intToBits)

distance <- sum(sapply(1:ncol(xm), function(i) hamming.distance(xm[,i], ym[,i])))

and the resulting distance is 25. Yet I need to do this for all rows of df1 and df2. A trivial method takes a double loop nest and looks terribly slow.

Any ideas how to do this more efficiently? In the end I need to append to df2:

a column with the row id from df1 that gives the lowest distance;
a column with the lowest distance;
a column with the row id from df1 that gives the 2nd lowest distance;
a column with the second lowest distance.

Thanks.

Zheyuan Li · Accepted Answer

Please don't be surprised why I take another section. This part gives something relevant. It is not what OP asks for, but may help any readers.

General hamming distance computation

In the previous answer, I start from a function hmd0 that computes hamming distance between two integer vectors of the same length. This means if we have 2 integer vectors:

set.seed(0)
x <- sample(1:100, 6)
y <- sample(1:100, 6)

we will end up with a scalar:

hmd0(x,y)
# 13

What if we want to compute pairwise hamming distance of two vectors?

In fact, a simple modification to our function hmd will do:

hamming.distance <- function(x, y, pairwise = TRUE) {
  nx <- length(x)
  ny <- length(y)
  rawx <- intToBits(x)
  rawy <- intToBits(y)
  if (nx == 1 && ny == 1) return(sum(as.logical(xor(intToBits(x),intToBits(y)))))
  if (nx < ny) {
    ## pivoting
    tmp <- rawx; rawx <- rawy; rawy <- tmp
    tmp <- nx; nx <- ny; ny <- tmp
    }
  if (nx %% ny) stop("unconformable length!") else {
    bits <- length(intToBits(0)) ## 32-bit or 64 bit?
    result <- unname(tapply(as.logical(xor(rawx,rawy)), rep(1:ny, each = bits), sum))
    }
  if (pairwise) result else sum(result)
  }

Now

hamming.distance(x, y, pairwise = TRUE)
# [1] 0 3 3 2 5 0
hamming.distance(x, y, pairwise = FALSE)
# [1] 13

Hamming distance matrix

If we want to compute the hamming distance matrix, for example,

set.seed(1)
x <- sample(1:100, 5)
y <- sample(1:100, 7)

The distance matrix between x and y is:

outer(x, y, hamming.distance)  ## pairwise argument has no effect here

#      [,1] [,2] [,3] [,4] [,5] [,6] [,7]
# [1,]    2    3    4    3    4    4    2
# [2,]    7    6    3    4    3    3    3
# [3,]    4    5    4    3    6    4    2
# [4,]    2    3    2    5    6    4    2
# [5,]    4    3    4    3    2    0    2

We can also do:

outer(x, x, hamming.distance)

#     [,1] [,2] [,3] [,4] [,5]
# [1,]    0    5    2    2    4
# [2,]    5    0    3    5    3
# [3,]    2    3    0    2    4
# [4,]    2    5    2    0    4
# [5,]    4    3    4    4    0

In the latter situation, we end up with a symmetric matrix with 0 on the diagonal. Using outer is inefficient here, but it is still more efficient than writing R loops. Since our hamming.distance is written in R code, I would stay with using outer. In my answer to this question, I demonstrate the idea of using compiled code. This of course requires writing a C version of hamming.distance, but I will not show it here.

Computing pairwise Hamming distance between all rows of two integer matrices/data frames

Tags:

r

sapply

tapply

apply

hamming-distance

alaj

1 Answers

Zheyuan Li

Recent Activity

Donate For Us

Computing pairwise Hamming distance between all rows of two integer matrices/data frames

Tags:

r

sapply

tapply

apply

hamming-distance

alaj

1 Answers

Zheyuan Li

Related questions

Recent Activity

Donate For Us