Efficiently compute histogram of pairwise differences in a large vector in R?

I am working with a large vector of integers in R (around 10 million of them), and I need to find every distinct pair of integers from this vector that differ by at most 500, then make a histogram of their differences (i.e. for each pair, the larger value minus the smaller, so every difference is positive).

Here's the totally unvectorized code for doing what I want, extremely slowly:

# Generate some random example data
V <- round(rnorm(100) * 1000)

# Prepare the histogram
my.hist <- rep(0, 500)
names(my.hist) <- as.character(seq(1, 500))
for (x1 in V) {
    for (x2 in V) {
        difference <- x2 - x1
        if (difference > 0 && difference <= 500) {
            my.hist[difference] <- my.hist[difference] + 1
        }
    }
}

(Assume that every integer is unique, so the difference > 0 test is safe. That's fine here because I actually don't care about any cases where the difference is zero.)
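
If the input might contain duplicates, they're easy to drop up front so the assumption holds (a trivial sketch):

V <- unique(V)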

Here's some code that vectorizes the inner loop:

my.hist2 <- rep(0, 500)
names(my.hist2) <- as.character(seq(1,500))
for (x1 in V) {
    differences <- V[V > x1 & V <= x1+500] - x1
    difftable <- table(differences)
    my.hist2[names(difftable)] <- my.hist2[names(difftable)] + difftable
}

This is certainly faster than the first version. However, even this variant is too slow once V contains just 500,000 elements (half a million).
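
To put numbers on that, one way is to wrap the loop in system.time (a rough sketch with a hypothetical test vector V2; timings will of course vary by machine):

V2 <- round(rnorm(5e4) * 5e5)   # hypothetical test vector, smaller than the real data
system.time({
    hist.tmp <- rep(0, 500)
    names(hist.tmp) <- as.character(seq(1, 500))
    for (x1 in V2) {
        differences <- V2[V2 > x1 & V2 <= x1 + 500] - x1
        difftable <- table(differences)
        hist.tmp[names(difftable)] <- hist.tmp[names(difftable)] + difftable
    }
})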

I can do this without any explicit loops as follows:

X <- combn(V, 2)
# X is a matrix with two rows where each column represents a pair
diffs <- abs(X[2,] - X[1,])  # abs() because pairs follow vector order, not sorted order
my.hist3 <- table(diffs[diffs <= 500])

However the matrix X will contain 10e6 * (10e6 - 1) / 2, or about 50,000,000,000,000 columns, which might be a problem.
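
Just to quantify "might be a problem" (a back-of-the-envelope sketch; combn returns a double matrix here because V is numeric):

n.pairs <- 10e6 * (10e6 - 1) / 2   # about 5e13 pairs
n.pairs * 2 * 8 / 2^40             # two 8-byte values per pair: roughly 728 TiB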

So is there a way to do this without explicit looping (too slow) or expanding a matrix of all the pairs (too big)?

If you're wondering why I need to do this, I'm implementing this: http://biowhat.ucsd.edu/homer/chipseq/qc.html#Sequencing_Fragment_Length_Estimation

asked Mar 02 '12 by Ryan C. Thompson

1 Answer

One possible improvement is to sort the data: the pairs (i,j) whose difference is at most 500 will then lie close to the diagonal, so you do not have to examine all the pairs.

The code would look like this (it is still very slow).

n <- 1e5
k <- 500
V <- round(rnorm(n) * n * 10)
V <- as.integer(V)
V <- sort(V)   # sorting brings close values to adjacent positions
h <- rep(0,k)

for(i in 1:(n-1)) {
  for(j in (i+1):n) {
    d <- V[j] - V[i]
    if( d > k ) break      # sorted input: larger j can only increase d
    if( d > 0 ) h[d] <- h[d]+1
  }
}
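
The same sorted-data idea can also be vectorized over lags instead of elements (a sketch, not a drop-in replacement for the C version below): at lag L, the vector V[(1+L):n] - V[1:(n-L)] holds every difference between values L positions apart, and once the smallest of them exceeds k, every larger lag is out of range too.

h2 <- rep(0L, k)
for (L in 1:(n-1)) {
  d <- V[(1+L):n] - V[1:(n-L)]       # all differences at lag L in the sorted vector
  if (min(d) > k) break              # larger lags only produce larger differences
  h2 <- h2 + tabulate(d, nbins = k)  # counts values 1..k, silently ignores the rest
}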

Edit: If you want something much faster, you can write the loop in C. It takes about 1 second for your 10 million elements.

n <- 10e6
k <- 500
V <- round(rnorm(n) * n * 10)
V <- as.integer(V)
V <- sort(V)
h <- rep(0,k)

library(inline)
sig <- signature(n="integer", v="integer", k="integer", h="integer")
code <- "
  for( int i = 0; i < (*n) - 1; i++ ) {
    for( int j = i + 1; j < *n; j++ ) {
      int d = v[j] - v[i];
      if( d > *k ) break;    /* sorted input: larger j can only increase d */
      if( d > 0 ) h[d-1]++;  /* h is 0-based in C, 1-based in R */
    }
  }
"
f <- cfunction( sig, code, convention=".C" )
h <- f(n,V,k,h)$h
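
To give the result the same named-vector shape as my.hist in the question (a small finishing touch):

names(h) <- as.character(seq(1, k))
head(h)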

answered by Vincent Zoonekynd