Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Easier way to calculate change in variable using dplyr?

Tags:

r

dplyr

I am trying to find an easier way to calculate change in a variable (represented by a column) in a data frame using dplyr. My toy data set is something like this

structure(list(CAR = structure(c(1L, 2L, 3L, 4L, 5L, 6L, 1L, 
2L, 3L, 4L, 5L, 6L, 1L, 2L, 3L, 4L, 5L, 6L), .Label = c("a", 
"b", "c", "d", "e", "f"), class = "factor"), TIME = c(0L, 0L, 
0L, 0L, 0L, 0L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L
), VAR = c(20L, 30L, 40L, 50L, 60L, 70L, 30L, 40L, 50L, 60L, 
70L, 80L, 40L, 50L, 60L, 70L, 80L, 90L)), .Names = c("CAR", "TIME", 
"VAR"), class = "data.frame", row.names = c(NA, -18L))

which looks like

   CAR TIME VAR
1    a    0  20
2    b    0  30
3    c    0  40
4    d    0  50
5    e    0  60
6    f    0  70
7    a    1  30
8    b    1  40
9    c    1  50
10   d    1  60
11   e    1  70
12   f    1  80
13   a    2  40
14   b    2  50
15   c    2  60
16   d    2  70
17   e    2  80
18   f    2  90

I am trying to calculate change in VAR between TIME equal to 0 and other times, e.g., 1,2 for each CAR.

This is what I do, which seems a very convoluted way, first I get values of VAR at TIME equals to 0

library(dplyr)
X <- local_test %>% filter(TIME == 0)  %>% group_by(CAR)  %>% mutate(baseline_VAR = VAR)

X looks like

Source: local data frame [6 x 4]
Groups: CAR

  CAR TIME VAR baseline_VAR
1   a    0  20           20
2   b    0  30           30
3   c    0  40           40
4   d    0  50           50
5   e    0  60           60
6   f    0  70           70

then, I do a left_join with the original data frame local_test

Y  <- left_join(local_test, X, by = c("CAR"))

Y looks like

   CAR TIME.x VAR.x TIME.y VAR.y baseline_VAR
1    a      0    20      0    20           20
2    b      0    30      0    30           30
3    c      0    40      0    40           40
4    d      0    50      0    50           50
5    e      0    60      0    60           60
6    f      0    70      0    70           70
7    a      1    30      0    20           20
8    b      1    40      0    30           30
9    c      1    50      0    40           40
10   d      1    60      0    50           50
11   e      1    70      0    60           60
12   f      1    80      0    70           70
13   a      2    40      0    20           20
14   b      2    50      0    30           30
15   c      2    60      0    40           40
16   d      2    70      0    50           50
17   e      2    80      0    60           60
18   f      2    90      0    70           70

finally, I add a column in Y, which calculates the change in VAR between two different TIME for CAR

Y %>% group_by(CAR) %>% mutate(change_VAR = VAR.x - baseline_VAR)

Final Y looks like

Source: local data frame [18 x 7]
Groups: CAR

   CAR TIME.x VAR.x TIME.y VAR.y baseline_VAR change_VAR
1    a      0    20      0    20           20          0
2    b      0    30      0    30           30          0
3    c      0    40      0    40           40          0
4    d      0    50      0    50           50          0
5    e      0    60      0    60           60          0
6    f      0    70      0    70           70          0
7    a      1    30      0    20           20         10
8    b      1    40      0    30           30         10
9    c      1    50      0    40           40         10
10   d      1    60      0    50           50         10
11   e      1    70      0    60           60         10
12   f      1    80      0    70           70         10
13   a      2    40      0    20           20         20
14   b      2    50      0    30           30         20
15   c      2    60      0    40           40         20
16   d      2    70      0    50           50         20
17   e      2    80      0    60           60         20
18   f      2    90      0    70           70         20

This seems like a lot of extra work, with extra columns being added to the original data frame. I need to do this operation for a large data frame repeatedly. Is there an easier (one-step) way to compute change_VAR?

Thanks!

like image 262
Satya Avatar asked Oct 22 '25 15:10

Satya


2 Answers

This could be done by taking the difference of 'VAR' with the min of 'VAR' grouped by 'CAR'.

local_test %>%
     group_by(CAR) %>%
     mutate(change_VAR= VAR- min(VAR))

Or if the base value of 'VAR' is when 'TIME' is 0 (assuming there are no duplicates of 'TIME' per each group), we subset the 'VAR' for TIME 0 and get the difference.

local_test %>% 
      group_by(CAR) %>%
      mutate(change_VAR= VAR- VAR[TIME==0])
like image 110
akrun Avatar answered Oct 24 '25 05:10

akrun


group_by(DF, CAR) %>%
  arrange(CAR, TIME) %>%
  mutate(change_VAR = VAR-VAR[1]) %>%
  ungroup() %>%
  arrange(TIME, CAR)

I did a little more arranging that akrun, but I'm not assuming the data are properly ordered up front. Arranging also ensures that the earliest observation is at position 1 in each group.

EDIT: If you're interested in this kind of thing. Here's the speed comparison between mine and Akrun's answers. (short answer, go with Akrun)

microbenchmark(
  akrun1 =  DF %>%
              group_by(CAR) %>%
              mutate(change_VAR= VAR- min(VAR)),
  akrun2 = DF %>% 
             group_by(CAR) %>%
             mutate(change_VAR= VAR- VAR[TIME==0]),
  Benjamin = group_by(DF, CAR) %>%
               arrange(CAR, TIME) %>%
               mutate(change_VAR = VAR-VAR[1]) %>%
               ungroup() %>%
               arrange(TIME, CAR))
unit: microseconds
     expr      min        lq      mean   median        uq      max neval cld
   akrun1  887.360  903.1950  975.6354  930.467  954.5125 3246.223   100  a 
   akrun2  888.240  901.1425  947.5220  913.899  952.7530 3002.536   100  a 
 Benjamin 1584.697 1604.7835 1720.8034 1662.993 1692.1705 3765.561   100   b
like image 37
Benjamin Avatar answered Oct 24 '25 04:10

Benjamin