Count number of rows within certain range

I have a data.table with some values ('value'), lower limits ('min_val') and upper limits ('max_val'):

   | value  | min_val | max_val |
1: | 94.001 | 94.00   | 94.02   |
2: | 94.002 | 94.00   | 94.03   |
3: | 94.003 | 94.01   | 94.04   |
4: | 95     | 94.98   | 95.02   |

For each row, I want to count how many values from the whole dt satisfy value > min_val & value < max_val for that row's limits:

   | value  | min_val | max_val | count |
1: | 94.001 | 94.00   | 94.02   | 1     |   # (num of value(s) > 94.00 & < 94.02)
2: | 94.002 | 94.00   | 94.03   | 2     |
3: | 94.003 | 94.01   | 94.04   | 2     |
4: | 95     | 94.98   | 95.02   | 1     |

I've tried dt[, count := nrow(dt[value > dt[min_val] & value < dt[max_val]])] but I'm on the wrong path.

asked Aug 27 '21 by iuliux



2 Answers

The OP has disclosed that his production dataset consists of 81 million rows. Unfortunately, r2evans' benchmark used only the 4-row sample dataset provided with the question and neglected henrik's suggestion. To find the best approach for a large dataset like the OP's, I find it worthwhile to run benchmarks with varying, realistic problem sizes in order to measure run time as well as memory consumption.

Run time and memory consumption may depend on

  1. the number of values,
  2. the number of intervals,
  3. and the number of values which fall into each interval.

Items 1 and 2 are linked, as the number of values and the number of intervals are both given by the number of rows in the dataset. So there are two size parameters we can vary: the number of rows n and the number of values within each interval m. In order to have reproducible and predictable benchmark data, the benchmark datasets are constructed by

d <- data.table(value = as.double(seq(n)))[, min_val := value - m][, max_val := value + m][, count := -1L]

For n <- 4 and m <- 1, d becomes

   value min_val max_val count
   <num>   <num>   <num> <int>
1:     1       0       2    -1
2:     2       1       3    -1
3:     3       2       4    -1
4:     4       3       5    -1

In order to create equal conditions for each benchmark run, the count column is pre-allocated with some dummy data.

The benchmark includes

  • henrik's suggestion to aggregate in a non-equi self join,
  • the 3 approaches in r2evans' answer,
  • the answer by ThomasIsCoding,
  • the answer by Alexandre Léonard.

Edit: A 3rd set of benchmark runs compares

  • henrik's aggregate in a non-equi self join, and
  • chinsoon12's Rcpp solution.

Unfortunately, TarJae's answer did not work for me.
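
For reference, here is henrik's non-equi self join on its own, applied to the OP's 4-row sample; a minimal sketch:

library(data.table)
dt <- data.table(value   = c(94.001, 94.002, 94.003, 95),
                 min_val = c(94.00, 94.00, 94.01, 94.98),
                 max_val = c(94.02, 94.03, 94.04, 95.02))
# non-equi self join: each row of the inner dt contributes one interval;
# by = .EACHI evaluates .N per interval, i.e. the number of rows of the
# outer dt whose value lies strictly inside that interval
dt[, count := dt[dt, on = .(value > min_val, value < max_val), .N, by = .EACHI]$N][]

The join conditions mirror the strict inequalities from the question.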

library(data.table)
library(bench)
library(ggplot2)
options(datatable.print.class = TRUE)

bm1 <- press(
  n = 2^c(2, 10, 13),
  m = 2^c(0, 9, 12),
  {
    d <- data.table(value = as.double(seq(n)))[, min_val := value - m][, max_val := value + m][, count := -1L]
    mark(
      henrik = {
        d[ , count := d[d, on = .(value > min_val, value < max_val), .N, by = .EACHI]$N]
      },
      r2evans0 = {
        d[, count := rowSums(outer(seq_len(.N), value, function(i, v) {min_val[i] < v & v < max_val[i];}))]
      },
      r2evans1 = {
        d[, count := mapply(function(mi,ma) sum(mi < value & value < ma), min_val, max_val)]
      },
      r2evans2 = {
        d[, count := rowSums(outer(min_val, d$value, `<`) &
                               outer(max_val, d$value, `>`)),
          by = .(g = seq_len(nrow(d)) %/% 100)]
      },
      Thomas = {
        d[, count := colSums(outer(value, min_val, ">") & outer(value, max_val, "<"))]
      },
      Alexandre = {
        d[, count := lapply(
          # seq.int(1, nrow(d)),
          seq_len(nrow(d)),
          function(i) sum(d[, value] > d[i, min_val] & d[, value] < d[i, max_val])
        )]
      },
      min_iterations = 3
    )
  }
)

autoplot(bm1)

[Plot: autoplot(bm1), run times of each approach for all combinations of n and m]

Please note the logarithmic time scale.

The chart shows that

  • Alexandre's approach is up to an order of magnitude slower than any of the other solutions (and may be omitted from further runs),
  • with an increasing number of rows n, henrik's approach becomes neck and neck with r2evans1 (worth investigating further),
  • the number of values in each interval m seems to have little or no effect on run time.

The latter can be verified by changing the facets and plotting the median times for different m in one facet:

ggplot(bm1) +
  aes(x = median, y = expression, color = as.factor(m)) +
  geom_point() + 
  facet_wrap(vars(n))

[Plot: median run time by approach, coloured by m, one facet per n]

In the next chart below, plotting mem_alloc instead of median times shows that

  • m has no effect on memory allocation (with one exception),
  • for large n, henrik's approach needs orders of magnitude less memory than any other approach.
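
That memory chart can be drawn in the same way as the time chart above by putting mem_alloc on the x axis; a sketch, reusing the bm1 object from above:

ggplot(bm1) +
  aes(x = mem_alloc, y = expression, color = as.factor(m)) +
  geom_point() + 
  facet_wrap(vars(n))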

[Plot: memory allocation (mem_alloc) by approach, coloured by m, one facet per n]

Please note the log scale.

2nd set of benchmark runs

Based on the previous results, the next set of benchmark runs varies only the size parameter n while m is kept constant. Alexandre's approach is omitted as it is too slow.

n is varied from 2^10 (1024) to 2^14 (16384) with m = 1.0. Unfortunately, the run with n = 2^15 was aborted due to lack of memory on my PC.
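
A sketch of how bm2 could be set up under these settings, reusing the mark() expressions from bm1 with Alexandre dropped and m fixed (the original run may have differed in details):

bm2 <- press(
  n = 2^(10:14),
  {
    m <- 1.0
    d <- data.table(value = as.double(seq(n)))[, min_val := value - m][, max_val := value + m][, count := -1L]
    mark(
      henrik = {
        d[ , count := d[d, on = .(value > min_val, value < max_val), .N, by = .EACHI]$N]
      },
      r2evans0 = {
        d[, count := rowSums(outer(seq_len(.N), value, function(i, v) {min_val[i] < v & v < max_val[i];}))]
      },
      r2evans1 = {
        d[, count := mapply(function(mi, ma) sum(mi < value & value < ma), min_val, max_val)]
      },
      r2evans2 = {
        d[, count := rowSums(outer(min_val, d$value, `<`) &
                               outer(max_val, d$value, `>`)),
          by = .(g = seq_len(nrow(d)) %/% 100)]
      },
      Thomas = {
        d[, count := colSums(outer(value, min_val, ">") & outer(value, max_val, "<"))]
      },
      min_iterations = 3
    )
  }
)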

autoplot(bm2)

[Plot: autoplot(bm2), run times of each approach for n from 2^10 to 2^14]

henrik's approach has taken the lead in terms of speed for the 2^14 (16384) rows case.
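
This can be checked directly on the benchmark object, following the same extraction pattern used for bm4 further below; a sketch:

setDT(bm2)[n == 2^14,
           .(expression = unique(attr(expression, "description")), median, mem_alloc)
          ][order(median)]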

To see whether this indicates a trend, run time vs problem size n is plotted by

ggplot(bm2) + 
  aes(x = n, y = median, color = expression, group = attr(expression, "description"), 
      label = ifelse(n == max(n), attr(expression, "description"), "")) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(trans = scales::log2_trans(), expand = expansion(mult = c(0.05, 0.1))) + 
  ggrepel::geom_label_repel(direction = "y", nudge_x = 1000) 

[Plot: median run time vs problem size n (log2 scale), labelled per approach]

henrik's approach seems to have a high overhead but gains a speed advantage with increasing problem size.

Also with respect to memory allocation, henrik's approach of aggregating in a non-equi self join seems to need substantially less memory than the other approaches. More importantly, memory allocation increases less steeply with problem size, which indicates that this approach can handle much larger problem sizes when available computer memory is a limiting factor.
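
The chart below uses the same ggplot call as the run time chart, with mem_alloc mapped to y instead of median; a sketch:

ggplot(bm2) + 
  aes(x = n, y = mem_alloc, color = expression, group = attr(expression, "description"), 
      label = ifelse(n == max(n), attr(expression, "description"), "")) +
  geom_point() +
  geom_smooth(method = "lm", se = FALSE) +
  scale_x_continuous(trans = scales::log2_trans(), expand = expansion(mult = c(0.05, 0.1))) + 
  ggrepel::geom_label_repel(direction = "y", nudge_x = 1000)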

[Plot: memory allocation vs problem size n (log2 scale), labelled per approach]

EDIT: 3rd set of benchmark runs

This set of benchmark runs compares henrik's aggregate in a non-equi self join with chinsoon12's new Rcpp solution.

Due to the much smaller memory footprint of both approaches, the problem size can be increased up to 2^18 (262144) rows before hitting the 16 GB memory limit on my Windows PC.

library(Rcpp)
bm4 <- press(
  n = 2^(10:18),
  {
    m <- 1.
    d <- data.table(value = as.double(seq(n)))[, min_val := value - m][, max_val := value + m][, count := -1L]
    mark(
      henrik = {
        d[ , count := d[d, on = .(value > min_val, value < max_val), .N, by = .EACHI]$N]
      },
      chinsoon = {
        cppFunction("IntegerVector inrange(NumericVector v, NumericVector minv, NumericVector maxv) {
    int n = v.size();
    IntegerVector res(n);
    
    for (int r=0; r<n; r++) {
        for (int i=0; i<n; i++) {
            if (v[i] > minv[r] && v[i] < maxv[r]) {
                res[r]++;
            }
        }
    }
    
    return res;
}")
        d[, count := inrange(value, min_val, max_val)]
      },
      min_iterations = 3
    )
  }
)

The next two charts show the median run time and the memory allocation vs problem size, respectively (please note the log scales):

[Plot: median run time vs problem size (log scales)]

[Plot: memory allocation vs problem size (log scales)]

Results for n = 2^18 (262144):

setDT(bm4)[n == 2^18, .(expression = unique(attr(expression, "description")), 
                        n, median, mem_alloc)]
   expression      n       median     mem_alloc
       <char>  <num> <bench_time> <bench_bytes>
1:     henrik 262144       17.47s       12.14MB
2:   chinsoon 262144        1.03m        2.06MB

Apparently, chinsoon's approach is faster for problem sizes up to 2^16 (65536) while henrik's approach is faster for larger problem sizes (and seems to have a more linear time behaviour). For problem size n = 2^18, henrik's approach is almost 4 times faster than chinsoon's.

On the other hand, henrik's approach allocates much more memory than chinsoon's. For problem size n = 2^18, henrik's approach allocates about 6 times more memory than chinsoon's. Apparently, this ratio is constant for increasing problem size.
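
This ratio can be checked directly on bm4; a sketch, assuming bm4 contains exactly the two expressions henrik and chinsoon, so that the larger-to-smaller allocation per n is henrik relative to chinsoon:

# per problem size n, ratio of the larger to the smaller memory allocation
setDT(bm4)[, .(mem_ratio = max(as.numeric(mem_alloc)) / min(as.numeric(mem_alloc))), by = n]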

So, there is a tradeoff between speed (henrik's approach) and memory requirement (chinsoon's approach) depending on problem size. Your mileage may vary.

answered Oct 08 '22 by Uwe


We can try outer like this

> setDT(df)[, count := colSums(outer(value, min_val, ">") & outer(value, max_val, "<"))][]
    value min_val max_val count
1: 94.001   94.00   94.02     3
2: 94.002   94.00   94.03     3
3: 94.003   94.01   94.04     0
4: 95.000   94.98   95.02     1
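
To see what the two outer() calls produce before colSums() collapses them, here is the intermediate logical matrix for this sample (rows index the values, columns index the row-wise intervals); a small illustration:

> outer(df$value, df$min_val, ">") & outer(df$value, df$max_val, "<")
      [,1]  [,2]  [,3]  [,4]
[1,]  TRUE  TRUE FALSE FALSE
[2,]  TRUE  TRUE FALSE FALSE
[3,]  TRUE  TRUE FALSE FALSE
[4,] FALSE FALSE FALSE  TRUE

colSums() of this matrix gives the count column above: 3, 3, 0, 1.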

data

> dput(df)
structure(list(value = c(94.001, 94.002, 94.003, 95), min_val = c(94,
94, 94.01, 94.98), max_val = c(94.02, 94.03, 94.04, 95.02)), class = "data.frame", row.names = c(NA,
-4L))

answered Oct 08 '22 by ThomasIsCoding