What's a tidyverse approach to iterating over rows in a data frame when vectorisation is not feasible?

I want to know the best way to iterate over rows of a data frame when the value of a variable at row n depends on the value of variable(s) at row n-1 and/or n-2. Ideally I would like to do this in a "tidyverse" way, perhaps with purrr::pmap().

For example, say I have this data frame:

library(dplyr)

x <- tibble(t = c(1:10),
            a = c(seq(100, 140, 10), rep(NA_real_, 5)),
            b = c(runif(5), rep(NA_real_, 5)),
            c = c(runif(5), rep(NA_real_, 5)))

x
#> # A tibble: 10 x 4
#>        t     a      b         c
#>    <int> <dbl>  <dbl>     <dbl>
#>  1     1   100  0.750  0.900   
#>  2     2   110  0.898  0.657   
#>  3     3   120  0.731  0.000137
#>  4     4   130  0.208  0.696   
#>  5     5   140  0.670  0.882   
#>  6     6    NA NA     NA       
#>  7     7    NA NA     NA       
#>  8     8    NA NA     NA       
#>  9     9    NA NA     NA       
#> 10    10    NA NA     NA

I have known values up to time (t) = 5. Beyond that, I wish to project values, using the following formulae:

a = lag(a) * 1.1
b = a * lag(b)
c = b * lag(a, 2)
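
For the first projected row (t = 6), the recursion works out as: a[6] = a[5] * 1.1 = 140 * 1.1 = 154, then b[6] = a[6] * b[5], and c[6] = b[6] * a[4] = b[6] * 130.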

This code achieves the desired output, but it's a clunky, horrible for loop that scales poorly to larger datasets:

for(i in 1:nrow(x)) {
  x <- x %>%
    mutate(a = if_else(!is.na(a), a, lag(a, 1) * 1.1),
           b = if_else(!is.na(b), b, a * lag(b, 1)),
           c = if_else(!is.na(c), c, b * lag(a, 2)))
}

x
#> # A tibble: 10 x 4
#>        t     a        b        c
#>    <int> <dbl>    <dbl>    <dbl>
#>  1     1  100  7.50e- 1 9.00e- 1
#>  2     2  110  8.98e- 1 6.57e- 1
#>  3     3  120  7.31e- 1 1.37e- 4
#>  4     4  130  2.08e- 1 6.96e- 1
#>  5     5  140  6.70e- 1 8.82e- 1
#>  6     6  154  1.03e+ 2 1.34e+ 4
#>  7     7  169. 1.75e+ 4 2.45e+ 6
#>  8     8  186. 3.26e+ 6 5.02e+ 8
#>  9     9  205. 6.68e+ 8 1.13e+11
#> 10    10  225. 1.51e+11 2.80e+13
asked Sep 01 '19 by Matt Cowgill

3 Answers

I think that for this sort of intrinsically iterative process it is genuinely hard to beat a for loop. The method proposed by @Shree depends on the NAs being continuous and starting in a known spot.
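
For instance (illustration only, not data from the question), a closed-form shortcut would not apply to something like the following, where observed values are interspersed among the gaps and there is no single "last known row" to project forward from:

y <- tibble(t = 1:6,
            a = c(100, 110, NA, 130, NA, NA))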

Here is my mild improvement on your loop. I think it is more readable, runs at about 2.5 times the speed, and should scale better than your approach, which mixes vectorised operations with the loop. By moving out of the tidyverse altogether and embracing a row-wise loop that genuinely works on one row at a time, we gain on both counts:

method_peter <- function(x){
  # fill the rows in order, so row i can use the values just written to rows i - 1 and i - 2
  for(i in 2:nrow(x)){
    x[i, "a"] <- ifelse(is.na(x[i, "a"]), x[i - 1, "a"] * 1.1,       x[i, "a"])
    x[i, "b"] <- ifelse(is.na(x[i, "b"]), x[i, "a"] * x[i - 1, "b"], x[i, "b"])
    x[i, "c"] <- ifelse(is.na(x[i, "c"]), x[i, "b"] * x[i - 2, "a"], x[i, "c"])
  }
  return(x)
}

There are doubtless further efficiencies possible, and of course this is an ideal candidate for rewriting in C++ :).
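
As a rough sketch of one such efficiency (an illustrative addition, not part of the original answer): looping over plain numeric vectors and writing the columns back once at the end avoids the repeated single-cell tibble subsetting inside the loop.

method_vectors <- function(x) {
  # pull the columns out as plain numeric vectors
  a <- x$a; b <- x$b; c <- x$c
  for (i in 2:length(a)) {
    if (is.na(a[i])) a[i] <- a[i - 1] * 1.1
    if (is.na(b[i])) b[i] <- a[i] * b[i - 1]
    if (i > 2 && is.na(c[i])) c[i] <- b[i] * a[i - 2]  # a[i - 2] only exists from row 3
  }
  # write the filled columns back in one step
  x$a <- a; x$b <- b; x$c <- c
  x
}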

method_peter() is about twice as fast as your method, as this benchmark shows:

method_matt <- function(x){
  for(i in 1:nrow(x)) {
    x <- x %>%
      mutate(a = if_else(!is.na(a), a, lag(a, 1) * 1.1),
             b = if_else(!is.na(b), b, a * lag(b, 1)),
             c = if_else(!is.na(c), c, b * lag(a, 2)))
  }
  return(x)
}

set.seed(123)
x <- tibble(t = c(1:10),
            a = c(seq(100, 140, 10), rep(NA_real_, 5)),
            b = c(runif(5), rep(NA_real_, 5)),
            c = c(runif(5), rep(NA_real_, 5)))

stopifnot(identical(method_matt(x), method_peter(x)))

library(microbenchmark)
microbenchmark(
  method_matt(x),
  method_peter(x)
)

which returns:

Unit: milliseconds
            expr     min       lq     mean   median      uq     max neval
  method_matt(x) 24.1975 25.50925 30.64438 26.33310 31.8681 74.5093   100
 method_peter(x) 10.0005 10.56050 13.33751 11.06495 13.5913 42.0568   100

@Shree's method is much faster again and is ideal for the example data, but I'm not sure it is flexible enough to work in all your use cases.

I would like to see a tidyverse solution if there is one.

answered by Peter Ellis

Edit: Added tidyverse approach

Here's a readable and flexible tidyverse approach; the downside is that it is very slow. accumutate() below splits the data frame into a list of one-row tibbles and then folds them together with purrr::accumulate(), re-evaluating the supplied mutate() expression each time a row is appended, so every row can see the rows accumulated before it.

library(tidyverse)  # accumutate() needs dplyr, tidyr::nest() and purrr::accumulate()

accumutate <- function(df, ...){
  df %>%
    group_by(row_number()) %>%
    nest() %>%
    pull(data) %>%
    accumulate(function(x, y){ bind_rows(x, y) %>% mutate(!!!enquos(...)) }) %>%
    .[[length(.)]]
}

x %>%
  accumutate(a = ifelse(is.na(a), 1.1 * lag(a,1), a)) %>%
  accumutate(b = ifelse(is.na(b), a * lag(b), b)) %>%
  accumutate(c = ifelse(is.na(c),b * lag(a, 2), c))

#> # A tibble: 10 x 4
#>        t     a        b        c
#>    <int> <dbl>    <dbl>    <dbl>
#>  1     1  100  2.88e- 1 4.56e- 2
#>  2     2  110  7.88e- 1 5.28e- 1
#>  3     3  120  4.09e- 1 8.92e- 1
#>  4     4  130  8.83e- 1 5.51e- 1
#>  5     5  140  9.40e- 1 4.57e- 1
#>  6     6  154  1.45e+ 2 1.88e+ 4
#>  7     7  169. 2.45e+ 4 3.43e+ 6
#>  8     8  186. 4.57e+ 6 7.04e+ 8
#>  9     9  205. 9.37e+ 8 1.59e+11
#> 10    10  225. 2.11e+11 3.94e+13

Created on 2020-10-07 by the reprex package (v0.3.0)


Here's another approach that you might find interesting. It's not concise or especially readable, but it's tidyverse (or at least functionally) inspired. And it performs fairly well.

It uses a semigroup-style pattern: each mutate expression is converted into a binary function, the relevant columns are packaged into corresponding lists, and purrr::accumulate() folds those lists together row by row.
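
For intuition (an illustrative addition, not from the original answer), the simplest case, column a, reduces to accumulate() applied with a binary "keep the value, or project from the previous one" rule, which is exactly what a_mappend() below does:

library(purrr)
accumulate(c(100, 110, NA, NA), function(prev, cur) ifelse(is.na(cur), prev * 1.1, cur))
# 100, 110, 121, 133.1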

library(tidyverse)
library(dplyr)
library(microbenchmark)
options(width = 100)

set.seed(123)

# Create the data frame
x <- tibble(t = c(1:100),
            a = c(seq(100, 140, 10), rep(NA_real_,100- 5)),
            b = c(runif(5), rep(NA_real_, 100-5)),
            c = c(runif(5), rep(NA_real_, 100-5)))

a_mappend <- function(a1, a2) {
  ifelse(is.na(a2), a1 * 1.1, a2)
}

b_mappend <- function(ab1, ab2) {
    list(a = ab2$a, b =  ifelse(is.na(ab2$b), ab2$a * ab1$b,ab2$b))
}

c_mappend <- function(abc12, abc23) {
  list(abc1 = list(a = abc12$abc2$a, b = abc12$abc2$b, c = abc12$abc2$c),
       abc2 = list(a = abc23$abc2$a, b = abc23$abc2$b, c = ifelse(is.na(abc23$abc2$c),abc12$abc1$a * abc23$abc2$b,abc23$abc2$c)))
}

method_ian <- function(x) {
  x %>%
    mutate(a = accumulate(a, a_mappend)) %>%
    mutate(b = list(a, b) %>% 
                  pmap(~ list(a = .x, b = .y)) %>% 
                  accumulate(b_mappend) %>% map_dbl(~ .x$b)) %>%
    mutate(c = list(a, b, c, c(a[-1], NA), c(b[-1], NA), c(c[-1], NA)) %>%
                  pmap(~ list(abc1 = list(a = ..1, b = ..2, c = ..3),
                              abc2 = list(a = ..4, b = ..5, c = ..6))) %>% 
                  accumulate(c_mappend) %>% map_dbl(~ .x$abc1$c))
}


method_matt <- function(x){
  for(i in 1:nrow(x)) {
    x <- x %>%
      mutate(a = if_else(!is.na(a), a, lag(a, 1) * 1.1),
             b = if_else(!is.na(b), b, a * lag(b, 1)),
             c = if_else(!is.na(c), c, b * lag(a, 2)))
  }
  return(x)
}

method_peter <- function(x){
  for(i in 2:nrow(x)){
    x[i, "a"] <- ifelse(is.na(x[i, "a"]), x[i - 1, "a"] * 1.1,       x[i, "a"])
    x[i, "b"] <- ifelse(is.na(x[i, "b"]), x[i, "a"] * x[i - 1, "b"], x[i, "b"])
    x[i, "c"] <- ifelse(is.na(x[i, "c"]), x[i, "b"] * x[i - 2, "a"], x[i, "c"])
  }
  return(x)
}

stopifnot(identical(method_matt(x), method_ian(x)))

microbenchmark( method_matt(x), method_peter(x), method_ian(x))
#> Unit: milliseconds
#>             expr       min        lq      mean    median        uq       max neval
#>   method_matt(x) 324.90086 330.93192 337.46518 334.55447 338.38461 426.30457   100
#>  method_peter(x) 208.27498 211.60526 213.59438 212.66088 214.36421 242.59854   100
#>    method_ian(x)  13.06774  13.43105  14.30003  13.86428  14.32263  19.54843   100

Created on 2020-10-06 by the reprex package (v0.3.0)

answered by Ian

I don't think there's any simple way in the tidyverse to do calculations with row dependencies. Something with Reduce or gather + spread might be possible, but I don't expect it to score points on readability.
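
A rough sketch of the Reduce idea for column a alone (an illustrative addition, not from the original answer); base::Reduce() with accumulate = TRUE plays the same role as purrr::accumulate():

Reduce(function(prev, cur) if (is.na(cur)) prev * 1.1 else cur,
       x$a, accumulate = TRUE)
# 100, 110, 120, 130, 140, then 154, 169.4, 186.34, ... for the projected rows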

Anyway, on the bright side, your calculations can be vectorised using the dplyr and zoo packages. Because a grows geometrically from its last observed value, the projected a is simply na.locf(a) * 1.1^(t - 5); the recursion b = a * lag(b) then unrolls to the last observed b times a cumulative product of the projected a values, and c only needs lag(), so no loop is required:

library(zoo)  # for na.locf()

x %>% 
  mutate(
    a = ifelse(is.na(a), na.locf(a) * 1.1^(t - 5), a),
    b = ifelse(is.na(b), na.locf(b) * c(rep(1, 5), cumprod(a[6:n()])), b),
    c = ifelse(is.na(c), b * lag(a, 2), c)
  )

 # A tibble: 10 x 4
       t     a        b        c
   <int> <dbl>    <dbl>    <dbl>
 1     1  100  1.85e- 1 9.43e- 1
 2     2  110  7.02e- 1 1.29e- 1
 3     3  120  5.73e- 1 8.33e- 1
 4     4  130  1.68e- 1 4.68e- 1
 5     5  140  9.44e- 1 5.50e- 1
 6     6  154  1.45e+ 2 1.89e+ 4
 7     7  169. 2.46e+ 4 3.45e+ 6
 8     8  186. 4.59e+ 6 7.07e+ 8
 9     9  205. 9.40e+ 8 1.59e+11
10    10  225. 2.12e+11 3.95e+13

Data -

set.seed(2)
x <- tibble(t = c(1:10),
            a = c(seq(100, 140, 10), rep(NA_real_, 5)),
            b = c(runif(5), rep(NA_real_, 5)),
            c = c(runif(5), rep(NA_real_, 5)))
answered by Shree