
Mutate across multiple columns to create new variable sets

I have a panel dataset at the country and year level, and I'd like to create two new sets of variables based on existing ones.

year  country  var1  var2  var3  var4  mean_var1  relmean_var1
1910  GER      1     4     10    6     3          0.333
1911  GER      2     3     11    7     1.5        1.3333
1910  FRA      5     6     8     9     3          1.66667
1911  FRA      1     4     10    9     1.5        0.66667

What I'd like to do is create two new sets of variables: (1) the average for each year (across countries) and (2) the country value relative to that year average. For example, for var1, (1) would yield mean_var1 and (2) would yield relmean_var1, and I'd want the same for all the other variables. In total there are over 1000 variables in the dataset, but I would only apply this to about 6 of them.

I have code that works for the first part, but I'd like to combine it as efficiently as possible with the second.

library(dplyr)
library(purrr)
df <- df %>%
  group_by(year) %>%
  mutate_at(.funs = list(mean = ~mean(.)), .vars = c("var1", "var2", "var3", "var4"))

This code yields new variables named var1_mean and so on. I would prefer mean_var1; how do I change the name?

For the second step, I've tried:

df <- df %>%
map2_dfr(.x = d.test %>%
            select(var1, var2),
          .y = d.test %>%
            select(var1_mean, var2_mean), 
          ~ .x / .y) %>%
   setNames(c("relmean_var1", "relmean_var2"))

and I get the error

"Error in select(., var1, var2) : object 'd.test' not found."

(I got this setup from this question.)

I also tried:

 map2(var1, var1_mean, ~ df[[.x]] / df[[.y]]) %>% 
   set_names(cols) %>% 
   bind_cols(df, .)

And got

"Error in map2(var1, var1_mean, ~df[[.x]]/df[[.y]]) : object 'var1' not found"

What's the best way to combine these two goals? Ideally with the naming scheme mean_var1 for (1) and relmean_var1 for (2).

Edit: input dataframe should look like this:

data <- tibble::tribble(
  ~year, ~country, ~var1, ~var2, ~var3, ~var.4,
  1910L,    "GER",    1L,    4L,   10L,     6L,
  1911L,    "GER",    2L,    3L,   11L,     7L,
  1910L,    "FRA",    5L,    6L,    8L,     9L,
  1911L,    "FRA",    1L,    4L,   10L,     9L
)

output dataframe should look like this (for all variables, just showing var1 as an example, but should be the same format for var2 through var4):

datanew <- tibble::tribble(
  ~year, ~country, ~var1, ~var2, ~var3, ~var.4, ~mean_var1, ~relmean_var1,
  1910L,    "GER",    1L,    4L,   10L,     6L,        3.0,        0.3333,
  1911L,    "GER",    2L,    3L,   11L,     7L,        1.5,        1.3333,
  1910L,    "FRA",    5L,    6L,    8L,     9L,        3.0,        1.6667,
  1911L,    "FRA",    1L,    4L,   10L,     9L,        1.5,        0.6667
)
PierreRoubaix asked Nov 24 '21 13:11


2 Answers

This might be easier in long format, but here's an option you can pursue as wide data.

With dplyr 1.0.0 or later you can use mutate() with across() and its .names argument to define how you want your new columns to be named.

library(tidyverse)

my_col <- c("var1", "var2", "var3", "var4")

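df %>%
  group_by(year) %>%
  # Year-level mean of each selected column, named mean_<col>
  mutate(across(all_of(my_col), mean, .names = "mean_{col}")) %>%
  # Each value divided by its year mean, named relmean_<col>
  mutate(across(all_of(my_col), .names = "relmean_{col}") / across(all_of(paste0("mean_", my_col))))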

Output

   year country  var1  var2  var3  var4 mean_var1 mean_var2 mean_var3 mean_var4 relmean_var1 relmean_var2 relmean_var3 relmean_var4
  <int> <chr>   <int> <int> <int> <int>     <dbl>     <dbl>     <dbl>     <dbl>        <dbl>        <dbl>        <dbl>        <dbl>
1  1910 GER         1     4    10     6       3         5         9         7.5        0.333        0.8          1.11         0.8  
2  1911 GER         2     3    11     7       1.5       3.5      10.5       8          1.33         0.857        1.05         0.875
3  1910 FRA         5     6     8     9       3         5         9         7.5        1.67         1.2          0.889        1.2  
4  1911 FRA         1     4    10     9       1.5       3.5      10.5       8          0.667        1.14         0.952        1.12
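
Both steps can also be done in a single across() call by passing a named list of functions. The sketch below is not part of the original answer. It assumes dplyr 1.0.0 or later and rebuilds the question's sample data with the fourth column named var4 (the question's tribble calls it var.4); the names data, my_col and result are only illustrative.

```r
library(dplyr)

# Sample data from the question; naming the fourth column var4 is an assumption
data <- tibble::tribble(
  ~year, ~country, ~var1, ~var2, ~var3, ~var4,
  1910L,    "GER",    1L,    4L,   10L,    6L,
  1911L,    "GER",    2L,    3L,   11L,    7L,
  1910L,    "FRA",    5L,    6L,    8L,    9L,
  1911L,    "FRA",    1L,    4L,   10L,    9L
)

my_col <- c("var1", "var2", "var3", "var4")

result <- data %>%
  group_by(year) %>%
  mutate(across(
    all_of(my_col),
    # Two functions per column: the year mean, and the value relative to it
    list(mean = ~ mean(.x), relmean = ~ .x / mean(.x)),
    # {.fn} is the list name, {.col} the column name: mean_var1, relmean_var1, ...
    .names = "{.fn}_{.col}"
  )) %>%
  ungroup()
```

With this variant the new columns come out interleaved per variable (mean_var1, relmean_var1, mean_var2, ...) instead of all means first and all relative means afterwards.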
Ben answered Nov 04 '22 11:11


library(tidyverse)

data <- tibble::tribble(
  ~year, ~country, ~var1, ~var2, ~var3, ~var.4,
  1910L,    "GER",    1L,    2L,   10L,     6L,
  1911L,    "GER",    2L,    3L,   11L,     7L,
  1910L,    "FRA",    5L,    6L,    8L,     9L,
  1911L,    "FRA",    1L,    3L,   10L,     9L
)

data_long <-
  data %>%
  pivot_longer(-c(year, country))

data_long
#> # A tibble: 16 x 4
#>     year country name  value
#>    <int> <chr>   <chr> <int>
#>  1  1910 GER     var1      1
#>  2  1910 GER     var2      2
#>  3  1910 GER     var3     10
#>  4  1910 GER     var.4     6
#>  5  1911 GER     var1      2
#>  6  1911 GER     var2      3
#>  7  1911 GER     var3     11
#>  8  1911 GER     var.4     7
#>  9  1910 FRA     var1      5
#> 10  1910 FRA     var2      6
#> 11  1910 FRA     var3      8
#> 12  1910 FRA     var.4     9
#> 13  1911 FRA     var1      1
#> 14  1911 FRA     var2      3
#> 15  1911 FRA     var3     10
#> 16  1911 FRA     var.4     9

means_country <-
  data_long %>%
  group_by(country) %>%
  summarise(mean_country_value = mean(value))

means_years <-
  data_long %>%
  group_by(year) %>%
  summarise(mean_year_value = mean(value))

data %>%
  left_join(means_country) %>%
  left_join(means_years)
#> Joining, by = "country"
#> Joining, by = "year"
#> # A tibble: 4 x 8
#>    year country  var1  var2  var3 var.4 mean_country_value mean_year_value
#>   <int> <chr>   <int> <int> <int> <int>              <dbl>           <dbl>
#> 1  1910 GER         1     2    10     6               5.25            5.88
#> 2  1911 GER         2     3    11     7               5.25            5.75
#> 3  1910 FRA         5     6     8     9               6.38            5.88
#> 4  1911 FRA         1     3    10     9               6.38            5.75

Created on 2021-11-24 by the reprex package (v2.0.1)
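
The means above pool all variables together. To get one year mean per variable, as the question asks, the long table can be grouped by both year and name and then reshaped back to wide. This is a sketch and not part of the original answer; it assumes a tidyr version whose pivot_wider() supports the names_glue argument, and the names per_var and wide are only illustrative.

```r
library(dplyr)
library(tidyr)

# Same sample data as in the answer above
data <- tibble::tribble(
  ~year, ~country, ~var1, ~var2, ~var3, ~var.4,
  1910L,    "GER",    1L,    2L,   10L,     6L,
  1911L,    "GER",    2L,    3L,   11L,     7L,
  1910L,    "FRA",    5L,    6L,    8L,     9L,
  1911L,    "FRA",    1L,    3L,   10L,     9L
)

# One row per year/country/variable, with the year mean of that variable
# and the value relative to that mean
per_var <- data %>%
  pivot_longer(-c(year, country)) %>%
  group_by(year, name) %>%
  mutate(mean = mean(value), relmean = value / mean) %>%
  ungroup()

# Back to wide: columns become value_<var>, mean_<var>, relmean_<var>;
# the value_ prefix is then dropped to restore the original names
wide <- per_var %>%
  pivot_wider(
    names_from = name,
    values_from = c(value, mean, relmean),
    names_glue = "{.value}_{name}"
  ) %>%
  rename_with(~ sub("^value_", "", .x), starts_with("value_"))
```

The long route avoids listing the columns twice, at the cost of a reshape in each direction.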

danlooo answered Nov 04 '22 12:11