
R grouping by condition in data.table

In R, I have a large data.table. For every row, I want to count the rows whose x1 value is within some tolerance (tol) of that row's x1. I can get this to work using adply, but it's too slow. It seems like the sort of thing data.table would be good for - in fact, I'm already using data.table for part of the computation.

Is there a way to do this entirely with data.table? Here is an example:

library(data.table)
library(plyr)
my.df = data.table(x1 = 1:1000,
                   x2 = 4:1003)
tol = 3
adply(my.df, 1, function(df) my.df[x1 > (df$x1 - tol) & x1 < (df$x1 + tol), .N])

Results:

        x1   x2 V1
   1:    1    4  3
   2:    2    5  4
   3:    3    6  5
   4:    4    7  5
   5:    5    8  5
  ---             
 996:  996  999  5
 997:  997 1000  5
 998:  998 1001  5
 999:  999 1002  4
1000: 1000 1003  3

Update:

Here's a sample dataset that is a little closer to my real data:

set.seed(10)
x = seq(1,100000000,100000)
x = x + sample(1:50000, length(x), replace=T)
x2 = x + sample(1:50000, length(x), replace=T)
my.df = data.table(x1 = x,
                   x2 = x2)
setkey(my.df,x1)
tol = 100000

og = function(my.df) {
  adply(my.df, 1, function(df) my.df[x1 > (df$x1 - tol) & x1 < (df$x1 + tol), .N])
}

library(microbenchmark)
# ed and ar are the functions from @eddi's and @Arun's answers below
microbenchmark(r_ed <- ed(copy(my.df)),
               r_ar <- ar(copy(my.df)),
               r_og <- og(copy(my.df)),
               times = 1)

Unit: milliseconds
                    expr         min          lq      median          uq         max neval
 r_ed <- ed(copy(my.df))    8.553137    8.553137    8.553137    8.553137    8.553137     1
 r_ar <- ar(copy(my.df))   10.229438   10.229438   10.229438   10.229438   10.229438     1
 r_og <- og(copy(my.df)) 1424.472844 1424.472844 1424.472844 1424.472844 1424.472844     1

Clearly, the solutions from both @eddi and @Arun are much faster than mine. Now I just have to try to understand rolling joins.

asked Aug 08 '13 by benjamin



3 Answers

See @eddi's answer for a faster solution (to this particular problem). It also works when x1 is not an integer.

The data structure you're looking for is an interval tree, and there's a Bioconductor package called IRanges that handles exactly this task. It's hard to beat.
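As a side note, IRanges is distributed through Bioconductor rather than CRAN; a sketch of one common install route (assuming a reasonably current R setup with the BiocManager package, which is my addition here, not part of the original answer):

# IRanges comes from Bioconductor, not CRAN
install.packages("BiocManager")
BiocManager::install("IRanges")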

require(IRanges)
require(data.table)
my.df[, res := countOverlaps(IRanges(my.df$x1, width=1), 
           IRanges(my.df$x1-tol+1, my.df$x1+tol-1))]

Some explanation:

If you break down the code, you can write it in three lines:

ir1 <- IRanges(my.df$x1, width=1)
ir2 <- IRanges(my.df$x1-tol+1, my.df$x1+tol-1)
cnt <- countOverlaps(ir1, ir2)

What we essentially do is create two sets of "ranges" (just type ir1 and ir2 to see what they look like). Then, for each entry in ir1, we ask how many entries in ir2 it overlaps (this is the "interval tree" part), and this is very efficient. The type argument to countOverlaps defaults to "any"; you can explore the other types if you want. It's extremely useful. Also of relevance is the findOverlaps function.
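As a toy sketch of that breakdown (the tiny vector here is illustrative, not from the question):

library(IRanges)
x   <- c(1, 3, 10)                        # three example points
tol <- 3
ir1 <- IRanges(x, width = 1)              # the query points
ir2 <- IRanges(x - tol + 1, x + tol - 1)  # the tolerance windows
countOverlaps(ir1, ir2)                   # counts per point: 2 2 1
findOverlaps(ir1, ir2)                    # the matching (query, subject) pairs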

Note: there can be faster solutions for this particular case, where the width of ir1 is 1 (in fact there is one; see @eddi's answer). But for problems where the widths are variable and/or greater than 1, this should be the fastest.


Benchmarking:

ag <- function(my.df) my.df[, res := sum(abs(my.df$x1-x1) < tol), by=x1]
ro <- function(my.df) {
          my.df[, res := { y = my.df$x1
                           sum(y > (x1 - tol) & y < (x1 + tol)) },
                by = x1]
      }
ar <- function(my.df) {
           my.df[, res := countOverlaps(IRanges(my.df$x1, width=1), 
            IRanges(my.df$x1-tol+1, my.df$x1+tol-1))]
      }


require(microbenchmark)
microbenchmark(r1 <- ag(copy(my.df)), r2 <- ro(copy(my.df)), 
               r3 <- ar(copy(my.df)), times=100)

Unit: milliseconds
                  expr      min       lq   median       uq       max neval
 r1 <- ag(copy(my.df)) 33.15940 39.63531 41.61555 44.56616 208.99067   100
 r2 <- ro(copy(my.df)) 69.35311 76.66642 80.23917 84.67419 344.82031   100
 r3 <- ar(copy(my.df)) 11.22027 12.14113 13.21196 14.72830  48.61417   100 <~~~

identical(r1, r2) # TRUE
identical(r1, r3) # TRUE
answered by Arun


Here's a faster data.table solution. The idea is to use data.table's rolling-merge functionality, but before we do that we need to modify the data slightly and make the column x1 numeric instead of integer. This is because the OP uses strict inequality, and to express that with rolling joins we have to shrink the tolerance by a tiny amount, turning it into a floating-point number.

my.df[, x1 := as.numeric(x1)]

# set the key to x1 for the merges and to sort
# (note, if data already sorted can make this step instantaneous using setattr)
setkey(my.df, x1)

# and now we're going to do two rolling merges, one with the upper bound
# and one with lower, then get the index of the match and subtract the ends
# (+1, to get the count)
my.df[, res := my.df[J(x1 + tol - 1e-6), list(ind = .I), roll = Inf]$ind -
               my.df[J(x1 - tol + 1e-6), list(ind = .I), roll = -Inf]$ind + 1]
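
To see concretely what each rolling join returns, here's a toy illustration (this small table is mine, not from the answer):

dt <- data.table(x1 = c(1, 5, 9), key = "x1")
dt[J(6), list(ind = .I), roll =  Inf]   # ind = 2: last row with x1 <= 6
dt[J(6), list(ind = .I), roll = -Inf]   # ind = 3: first row with x1 >= 6

Subtracting those two indices (and adding 1) counts the rows whose x1 lies between the two bounds.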


# and here's the bench vs @Arun's solution
ed = function(my.df) {
  my.df[, x1 := as.numeric(x1)]
  setkey(my.df, x1)
  my.df[, res := my.df[J(x1 + tol - 1e-6), list(ind = .I), roll = Inf]$ind -
                 my.df[J(x1 - tol + 1e-6), list(ind = .I), roll = -Inf]$ind + 1]
}

microbenchmark(ed(copy(my.df)), ar(copy(my.df)))
#Unit: milliseconds
#            expr       min       lq   median       uq      max neval
# ed(copy(my.df))  7.297928 10.09947 10.87561 11.80083 23.05907   100
# ar(copy(my.df)) 10.825521 15.38151 16.36115 18.15350 21.98761   100

Note: as both Arun and Matthew pointed out, if x1 is integer, one doesn't have to convert to numeric and subtract a small amount from tol, and can simply use tol - 1L instead of tol - 1e-6 above.
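
For reference, here's a sketch of that integer variant (my restatement of the note above; it assumes x1 and tol are both stored as integers):

ed_int = function(my.df) {
  setkey(my.df, x1)   # x1 stays integer; no conversion needed
  my.df[, res := my.df[J(x1 + tol - 1L), list(ind = .I), roll =  Inf]$ind -
                 my.df[J(x1 - tol + 1L), list(ind = .I), roll = -Inf]$ind + 1L]
}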

answered by eddi


Here is a pure data.table solution:

my.df[, res := sum(my.df$x1 > (x1 - tol) & my.df$x1 < (x1 + tol)), by = x1]

# re-run the OP's adply approach to check that both give the same counts
my.df <- adply(my.df, 1, 
           function(df) my.df[x1 > (df$x1 - tol) & x1 < (df$x1 + tol), .N])

identical(my.df[, res], my.df[, V1])
#[1] TRUE

However, this will still be relatively slow if you have many unique x1 values. After all, you need to do a huge number of comparisons, and I can't think of a way to avoid that right now.
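As an aside (a sketch of mine, not part of this answer): once x1 is sorted, the per-row count can also be computed with two binary searches in base R via findInterval, which is essentially the trick the rolling-join answer above exploits:

x   <- sort(my.df$x1)
eps <- 0.5   # any value in (0, 1) keeps the upper bound strict for integer data
res <- findInterval(x + tol - eps, x) -  # count of values <  x + tol
       findInterval(x - tol, x)          # minus count of values <= x - tol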

answered by Roland