Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

How to add calculated columns to nested data frames (list columns) using purrr

Tags:

r

dplyr

purrr

I would like to perform calculations on a nested data frame (stored as a list-column), and add the calculated variable back to each dataframe using purrr functions. I'll use this result to join to other data, and keeping it compact helps me to organize and examine it better. I can do this in a couple of steps, but it seems like there may be a solution I haven't come across. If there is a solution out there, I haven't been able to find it easily.

Load libraries. example requires the following packages (available on CRAN):

library(dplyr)
library(purrr)
library(RcppRoll) # to calculate rolling mean

Example data with 3 subjects, and repeated measurements over time:

test <- data_frame(
  id= rep(1:3, each=20),
  time = rep(1:20, 3),
  var1 = rnorm(60, mean=10, sd=3),
  var2 = rnorm(60, mean=95, sd=5)
  )

Store the data as nested dataframe:

t_nest <- test %>% nest(-id)

     id              data
  <int>            <list>
1     1 <tibble [20 x 3]>
2     2 <tibble [20 x 3]>
3     3 <tibble [20 x 3]>

Perform calculations. I will calculate multiple new variables based on the data, although a solution for just one could be expanded later. The result of each calculation will be a numeric vector, same length as the input (n=20):

t1 <- t_nest %>% 
  mutate(var1_rollmean4 = map(data, ~RcppRoll::roll_mean(.$var1, n=4, align="right", fill=NA)),
         var2_delta4 = map(data, ~(.$var2 - lag(.$var2, 3))*0.095),
         var3 = map2(var1_rollmean4, var2_delta4, ~.x -.y))

     id              data var1_rollmean4 var2_delta4       var3
  <int>            <list>         <list>      <list>     <list>
1     1 <tibble [20 x 3]>     <dbl [20]>  <dbl [20]> <dbl [20]>
2     2 <tibble [20 x 3]>     <dbl [20]>  <dbl [20]> <dbl [20]>
3     3 <tibble [20 x 3]>     <dbl [20]>  <dbl [20]> <dbl [20]>

my solution is to unnest this data, and then nest again. There doesn't seem to be anything wrong with this, but seems like a better solution may exist.

t1 %>% unnest %>% 
  nest(-id)

     id              data
  <int>            <list>
1     1 <tibble [20 x 6]>
2     2 <tibble [20 x 6]>
3     3 <tibble [20 x 6]>

This other solution (from SO 42028710) is close, but not quite because it is a list rather than nested dataframes:

map_df(t_nest$data, ~ mutate(.x, var1calc = .$var1*100))   

I've found quite a bit of helpful information using the purrr Cheatsheet but can't quite find the answer.

like image 274
Matt L. Avatar asked Sep 26 '17 21:09

Matt L.


2 Answers

You can wrap another mutate when mapping through the data column and add the columns in each nested tibble:

t11 <- t_nest %>% 
    mutate(data = map(data, 
        ~ mutate(.x, 
            var1_rollmean4 = RcppRoll::roll_mean(var1, n=4, align="right", fill=NA),
            var2_delta4 = (var2 - lag(var2, 3))*0.095,
            var3 = var1_rollmean4 - var2_delta4
        )
   ))

t11
# A tibble: 3 x 2
#     id              data
#  <int>            <list>
#1     1 <tibble [20 x 6]>
#2     2 <tibble [20 x 6]>
#3     3 <tibble [20 x 6]>

unnest-nest method, and then reorder the columns inside:

nest_unnest <- t1 %>% 
    unnest %>% nest(-id) %>% 
    mutate(data = map(data, ~ select(.x, time, var1, var2, var1_rollmean4, var2_delta4, var3)))

identical(nest_unnest, t11)
# [1] TRUE
like image 80
Psidom Avatar answered Nov 15 '22 13:11

Psidom


It seems like for what you're trying to do, nesting is not necessary

library(tidyverse)
library(zoo)
test %>%
  group_by(id) %>%
  mutate(var1_rollmean4 = rollapplyr(var1, 4, mean, fill=NA),
       var2_delta4 = (var2 - lag(var2, 3))*0.095,
         var3 = (var1_rollmean4 - var2_delta4))

# A tibble: 60 x 7
# Groups:   id [3]
      # id  time      var1      var2 var1_rollmean4 var2_delta4      var3
   # <int> <int>     <dbl>     <dbl>          <dbl>       <dbl>     <dbl>
 # 1     1     1  9.865199  96.45723             NA          NA        NA
 # 2     1     2  9.951429  92.78354             NA          NA        NA
 # 3     1     3 12.831509  95.00553             NA          NA        NA
 # 4     1     4 12.463664  95.37171      11.277950 -0.10312483 11.381075
 # 5     1     5 11.781704  92.05240      11.757076 -0.06945881 11.826535
 # 6     1     6 12.756932  92.15666      12.458452 -0.27064269 12.729095
 # 7     1     7 12.346409  94.32411      12.337177 -0.09952197 12.436699
 # 8     1     8 10.223695 100.89043      11.777185  0.83961377 10.937571
 # 9     1     9  4.031945  87.38217       9.839745 -0.45357658 10.293322
# 10     1    10 11.859477  97.96973       9.615382  0.34633428  9.269047
# ... with 50 more rows

Edit You could nest the result with %>% nest(-id) still

If you still prefer to nest or are nesting for other reasons, it would go like

t1 <- t_nest %>% 
        mutate(data = map(data, ~.x %>% mutate(...)))

That is, you mutate on .x within the map statement. This will treat data as a data.frame and mutate will column-bind results to it.

like image 27
CPak Avatar answered Nov 15 '22 15:11

CPak