I am working with a large vector of integers in R (around 10 million integers), and I need to find every distinct pair of integers from this vector that differ by 500 or less and make a histogram of their differences (i.e. for each pair, the second one minus the first one).
Here's the totally-unvectorized code for doing what I want extremely slowly:
# Generate some random example data
V <- round(rnorm(100) * 1000)
# Prepare the histogram
my.hist <- rep(0, 500)
names(my.hist) <- as.character(seq(1,500))
for (x1 in V) {
for (x2 in V) {
difference = x2 - x1
if (difference > 0 && difference <= 500) {
my.hist[difference] = my.hist[difference] + 1
}
}
}
(Assume that every integer is unique, so that difference > 0
bit is OK. This is allowed because I actually don't care about any cases where the difference is zero.)
Here's some code that vectorizes the inner loop:
my.hist2 <- rep(0, 500)
names(my.hist2) <- as.character(seq(1,500))
for (x1 in V) {
differences <- V[V > x1 & V <= x1+500] - x1
difftable <- table(differences)
my.hist2[names(difftable)] = my.hist2[names(difftable)] + difftable
}
This is certainly faster than the first version. However even this variant is already too slow when V
contains only 500000 elements (half a million).
I can do this without any explicit loops as follows:
X <- combn(V, 2)
# X is a matrix with two rows where each column represents a pair
diffs <- abs(X[2,] - X[1,])
my.hist3 <- table(diffs[diffs <= 500])
However the matrix X will contain 10e6 * (10e6 - 1) / 2
, or about 50,000,000,000,000 columns, which might be a problem.
So is there a way to do this without explicit looping (too slow) or expanding a matrix of all the pairs (too big)?
If you're wondering why I need to do this, I'm implementing this: http://biowhat.ucsd.edu/homer/chipseq/qc.html#Sequencing_Fragment_Length_Estimation
One possible improvement is to sort the data: the pairs (i,j) for which the distance is under 500 will then be close to the diagonal, and you do not have to explore all the values.
The code would look like this (it is still very slow).
n <- 1e5
k <- 500
V <- round(rnorm(n) * n * 10)
V <- as.integer(V)
V <- sort(V)
h <- rep(0,k)
for(i in 1:(n-1)) {
for(j in (i+1):n) {
d <- V[j] - V[i]
if( d > k ) break
if( d > 0 ) h[d] <- h[d]+1
}
}
Edit: If you want something much faster, you can write the loop in C. It takes 1s for your 10 million elements.
n <- 10e6
k <- 500
V <- round(rnorm(n) * n * 10)
V <- as.integer(V)
V <- sort(V)
h <- rep(0,k)
library(inline)
sig <- signature(n="integer", v="integer", k="integer", h="integer")
code <- "
for( int i = 0; i < (*n) - 1; i++ ) {
for( int j = i + 1; j < *n; j++ ) {
int d = v[j] - v[i];
if( d > *k ) break;
if( d > 0 ) h[d-1]++;
}
}
"
f <- cfunction( sig, code, convention=".C" )
h <- f(n,V,k,h)$h
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With