How can I calculate the percentage change within a group for multiple columns in R?

I have a data frame with an ID column, a date column (12 months for each ID), and 23 numeric variables. I would like to obtain the month-over-month percentage change within each ID. I am using the quantmod package to compute the percent change.

Here is an example with only three columns (for simplicity):

ID Date V1 V2 V3
1  Jan   2  3  5
1  Feb   3  4  6
1  Mar   7  8  9
2  Jan   1  1  1
2  Feb   2  3  4
2  Mar   7  8  8

I tried to use dplyr and the summarise_each function, but that was unsuccessful. More specifically, I tried the following (train is the name of the data set):

library(dplyr)
library(quantmod)

group1<-group_by(train,EXAMID)

foo<-function(x){
  return(Delt(x))
}

summarise_each(group1,funs(foo))

I also tried to use the do function in dplyr, but I was not successful with that either (having a bad night I guess!).

I think that the issue is the Delt function. When I replace Delt with the sum function:

foo<-function(x){
  return(sum(x))
}
summarise_each(group1,funs(foo))

The result is that every variable is summed across the dates for each ID. So how can I compute the month-over-month percentage change for each ID?

mmmmmmmmmm asked Jul 11 '15 01:07


2 Answers

The issue you are running into is that your data is not in a "tidy" format: the measurements V1:V3 sit in separate columns, which makes the data frame "wide". The tidyverse works best with long format. The good news is that the gather() function reshapes the data into exactly that. Here's a solution using the tidyverse. Note that pct.change below is a fraction; multiply by 100 if you want a percentage.


library(tidyverse)

# Recreate data set
df <- tribble(
    ~ID, ~Date, ~V1, ~V2, ~V3,
    1,  "Jan",   2,  3,  5,
    1,  "Feb",   3,  4,  6,
    1,  "Mar",   7,  8,  9,
    2,  "Jan",   1,  1,  1,
    2,  "Feb",   2,  3,  4,
    2,  "Mar",   7,  8,  8
)
df
#> # A tibble: 6 × 5
#>      ID  Date    V1    V2    V3
#>   <dbl> <chr> <dbl> <dbl> <dbl>
#> 1     1   Jan     2     3     5
#> 2     1   Feb     3     4     6
#> 3     1   Mar     7     8     9
#> 4     2   Jan     1     1     1
#> 5     2   Feb     2     3     4
#> 6     2   Mar     7     8     8

# Gather and calculate percent change
df %>%
    gather(key = key, value = value, V1:V3) %>%
    group_by(ID, key) %>%
    mutate(lag = lag(value)) %>%
    mutate(pct.change = (value - lag) / lag)
#> Source: local data frame [18 x 6]
#> Groups: ID, key [6]
#> 
#>       ID  Date   key value   lag pct.change
#>    <dbl> <chr> <chr> <dbl> <dbl>      <dbl>
#> 1      1   Jan    V1     2    NA         NA
#> 2      1   Feb    V1     3     2  0.5000000
#> 3      1   Mar    V1     7     3  1.3333333
#> 4      2   Jan    V1     1    NA         NA
#> 5      2   Feb    V1     2     1  1.0000000
#> 6      2   Mar    V1     7     2  2.5000000
#> 7      1   Jan    V2     3    NA         NA
#> 8      1   Feb    V2     4     3  0.3333333
#> 9      1   Mar    V2     8     4  1.0000000
#> 10     2   Jan    V2     1    NA         NA
#> 11     2   Feb    V2     3     1  2.0000000
#> 12     2   Mar    V2     8     3  1.6666667
#> 13     1   Jan    V3     5    NA         NA
#> 14     1   Feb    V3     6     5  0.2000000
#> 15     1   Mar    V3     9     6  0.5000000
#> 16     2   Jan    V3     1    NA         NA
#> 17     2   Feb    V3     4     1  3.0000000
#> 18     2   Mar    V3     8     4  1.0000000
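
In recent versions of tidyr, gather() is superseded by pivot_longer(). Here is a minimal sketch of the same idea, assuming tidyr >= 1.0.0 and assuming the rows are already in chronological order within each ID (Date is a character column here, so sorting on it would not give month order). The names df and long_pct are just for illustration.

```r
library(dplyr)
library(tidyr)

# Same toy data as in the question
df <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2),
  Date = c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar"),
  V1   = c(2, 3, 7, 1, 2, 7),
  V2   = c(3, 4, 8, 1, 3, 8),
  V3   = c(5, 6, 9, 1, 4, 8)
)

long_pct <- df %>%
  # Reshape wide -> long: one row per ID / Date / variable
  pivot_longer(V1:V3, names_to = "key", values_to = "value") %>%
  # Lag within each ID and variable, relying on the existing row order
  group_by(ID, key) %>%
  mutate(pct.change = (value - dplyr::lag(value)) / dplyr::lag(value)) %>%
  ungroup()
```

The rows come out in a different order than with gather() (grouped by original row rather than by variable), but the values within each ID/key group should match the output above.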
Matt Dancho answered Sep 19 '22 19:09


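How about using pct <- function(x) x/lag(x)? This needs dplyr loaded so that lag is dplyr::lag rather than stats::lag. It returns the ratio to the previous value; use (x/lag(x)-1)*100 for the percent change proper, or however you wish to specify pct change exactly. E.g.,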

pct(1:3)
[1]  NA 2.0 1.5
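
One pitfall worth illustrating: without dplyr attached, lag() resolves to stats::lag(), which only shifts the time-series attributes and leaves the values where they are, so the ratio is 1 everywhere. A small sketch of the difference (the variable names are made up for illustration):

```r
x <- c(2, 3, 7)

# stats::lag() changes the tsp attribute, not the positions of the values,
# so dividing element-wise gives 1 for every element
ratio_stats <- as.numeric(x / stats::lag(x))

# dplyr::lag() actually shifts the values by one position, padding with NA
ratio_dplyr <- x / dplyr::lag(x)

print(ratio_stats)  # 1 1 1
print(ratio_dplyr)  # NA 1.500000 2.333333
```

Writing dplyr::lag explicitly inside pct() would be one way to avoid depending on which packages are attached.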

Edit: Adding Frank's suggestion

pct <- function(x) {x/lag(x)}

dt %>% group_by(ID) %>% mutate_each(funs(pct), c(V1, V2, V3))

ID Date       V1       V2  V3
1  Jan       NA       NA  NA
1  Feb 1.500000 1.333333 1.2
1  Mar 2.333333 2.000000 1.5
2  Jan       NA       NA  NA
2  Feb 2.000000 3.000000 4.0
2  Mar 3.500000 2.666667 2.0
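
mutate_each() and funs() are deprecated in current dplyr. Here is a rough equivalent with across(), assuming dplyr >= 1.0.0; dt is rebuilt from the question's example, and pct_change and wide_pct are names I made up. This version returns the percent change itself rather than the ratio.

```r
library(dplyr)

# Toy data from the question, assumed sorted by month within each ID
dt <- data.frame(
  ID   = c(1, 1, 1, 2, 2, 2),
  Date = c("Jan", "Feb", "Mar", "Jan", "Feb", "Mar"),
  V1   = c(2, 3, 7, 1, 2, 7),
  V2   = c(3, 4, 8, 1, 3, 8),
  V3   = c(5, 6, 9, 1, 4, 8)
)

# Percent change relative to the previous element; first element is NA
pct_change <- function(x) (x / dplyr::lag(x) - 1) * 100

wide_pct <- dt %>%
  group_by(ID) %>%
  # Apply pct_change to each selected column, separately within each ID
  mutate(across(V1:V3, pct_change)) %>%
  ungroup()
```

With 23 numeric columns, something like across(where(is.numeric), pct_change) should cover them all in one go, since grouping columns such as ID are not selected by across().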
dzeltzer answered Sep 22 '22 19:09
