Efficiently compute histogram of pairwise differences in a large vector in R?

Question

I am working with a large vector of integers in R (around 10 million integers), and I need to find every distinct pair of integers from this vector that differ by 500 or less and make a histogram of their differences (i.e. for each pair, the second one minus the first one).

Here's the totally-unvectorized code for doing what I want extremely slowly:

# Generate some random example data
V <- round(rnorm(100) * 1000)

# Prepare the histogram
my.hist <- rep(0, 500)
names(my.hist) <- as.character(seq(1,500))
for (x1 in V) {
    for (x2 in V) {
        difference = x2 - x1
        if (difference > 0 && difference <= 500) {
            my.hist[difference] = my.hist[difference] + 1
        }
    }
}

(Assume that every integer is unique, so that difference > 0 bit is OK. This is allowed because I actually don't care about any cases where the difference is zero.)

Here's some code that vectorizes the inner loop:

my.hist2 <- rep(0, 500)
names(my.hist2) <- as.character(seq(1,500))
for (x1 in V) {
    differences <- V[V > x1 & V <= x1+500] - x1
    difftable <- table(differences)
    my.hist2[names(difftable)] = my.hist2[names(difftable)] + difftable
}

This is certainly faster than the first version. However even this variant is already too slow when V contains only 500000 elements (half a million).

I can do this without any explicit loops as follows:

X <- combn(V, 2)
# X is a matrix with two rows where each column represents a pair
diffs <- abs(X[2,] - X[1,])
my.hist3 <- table(diffs[diffs <= 500])

However the matrix X will contain 10e6 * (10e6 - 1) / 2, or about 50,000,000,000,000 columns, which might be a problem.

So is there a way to do this without explicit looping (too slow) or expanding a matrix of all the pairs (too big)?

If you're wondering why I need to do this, I'm implementing this: http://biowhat.ucsd.edu/homer/chipseq/qc.html#Sequencing_Fragment_Length_Estimation

Vincent Zoonekynd · Accepted Answer

One possible improvement is to sort the data: the pairs (i,j) for which the distance is under 500 will then be close to the diagonal, and you do not have to explore all the values.

The code would look like this (it is still very slow).

n <- 1e5
k <- 500
V <- round(rnorm(n) * n * 10)
V <- as.integer(V)
V <- sort(V)
h <- rep(0,k)

for(i in 1:(n-1)) {
  for(j in (i+1):n) {
    d <- V[j] - V[i]
    if( d > k ) break
    if( d > 0 ) h[d] <- h[d]+1
  }
}

Edit: If you want something much faster, you can write the loop in C. It takes 1s for your 10 million elements.

n <- 10e6
k <- 500
V <- round(rnorm(n) * n * 10)
V <- as.integer(V)
V <- sort(V)
h <- rep(0,k)

library(inline)
sig <- signature(n="integer", v="integer", k="integer", h="integer")
code <- "
  for( int i = 0; i < (*n) - 1; i++ ) {
    for( int j = i + 1; j < *n; j++ ) {
      int d = v[j] - v[i];
      if( d > *k ) break;
      if( d > 0 ) h[d-1]++;
    }
  }
"
f <- cfunction( sig, code, convention=".C" )
h <- f(n,V,k,h)$h

Efficiently compute histogram of pairwise differences in a large vector in R?

Tags:

performance

iteration

r

nested-loops

Ryan C. Thompson

1 Answers

Vincent Zoonekynd

Recent Activity

Donate For Us

Efficiently compute histogram of pairwise differences in a large vector in R?

Tags:

performance

iteration

r

nested-loops

Ryan C. Thompson

1 Answers

Vincent Zoonekynd

Related questions

Recent Activity

Donate For Us